Strategies for Repository Deployment

Due largely to the simplicity of the overall design of the Subversion repository and the technologies on which it relies, creating and configuring a repository are fairly straightforward tasks. There are a few preliminary decisions you'll want to make, but the actual work involved in any given setup of a Subversion repository is simple, tending towards mindless repetition if you find yourself setting up several of them.

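Some questions you'll want to consider up front, though, are: how will the data in your repository (or repositories) be organized, where will the repository live and how will it be accessed, and which of the available types of data store do you want to use?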

In this section, we'll try to help you answer those questions.

Planning Your Repository Organization

While Subversion allows you to move around versioned files and directories without any loss of information, and even provides ways of moving whole sets of versioned history from one repository to another, doing so can greatly disrupt the workflow of those who access the repository often and come to expect things to be at certain locations. So before creating a new repository, try to peer into the future a bit; plan ahead before placing your data under version control. By conscientiously "laying out" your repository or repositories and their versioned contents ahead of time, you can prevent many future headaches.

Let's assume that as repository administrator, you will be responsible for supporting the version control system for several projects. Your first decision is whether to use a single repository for multiple projects, or to give each project its own repository, or some compromise of these two.

There are benefits to using a single repository for multiple projects, most obviously the lack of duplicated maintenance. A single repository means that there is one set of hook programs, one thing to routinely back up, one thing to dump and load if Subversion releases an incompatible new version, and so on. Also, you can move data between projects easily, and without losing any historical versioning information.

The downside of using a single repository is that different projects may have different requirements in terms of the repository event triggers, such as needing to send commit notification emails to different mailing lists, or having different definitions about what does and does not constitute a legitimate commit. These aren't insurmountable problems, of course—it just means that all of your hook scripts have to be sensitive to the layout of your repository rather than assuming that the whole repository is associated with a single group of people. Also, remember that Subversion uses repository-global revision numbers. While those numbers don't have any particular magical powers, some folks still don't like the fact that even though no changes have been made to their project lately, the youngest revision number for the repository keeps climbing because other projects are actively adding new revisions. [27]
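As a rough illustration of what "sensitive to the layout" can mean, here is a minimal sketch of the path-routing logic a post-commit hook might use in a shared repository. The project names follow the example layouts in this section; the mailing-list addresses, the `PROJECT_LISTS` table, and the `lists_for_commit` helper are hypothetical, and a real hook would obtain the changed paths from a tool such as `svnlook changed` rather than from a hard-coded list.

```python
# Hypothetical mapping from top-level project directory to mailing list.
PROJECT_LISTS = {
    "calc": "calc-commits@example.com",
    "calendar": "calendar-commits@example.com",
    "spreadsheet": "spreadsheet-commits@example.com",
}


def lists_for_commit(changed_paths):
    """Return the sorted mailing lists for the projects a commit touched."""
    recipients = set()
    for path in changed_paths:
        # In a /project/{trunk,tags,branches} layout, the first path
        # component names the project.
        project = path.strip("/").split("/", 1)[0]
        if project in PROJECT_LISTS:
            recipients.add(PROJECT_LISTS[project])
    return sorted(recipients)


if __name__ == "__main__":
    print(lists_for_commit(["calc/trunk/main.c", "calendar/tags/1.0/README"]))
```

A hook written this way keeps working as projects are added, as long as the table is updated; a hook that assumed one mailing list for the whole repository would not.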

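A middle-ground approach can be taken, too. For example, projects can be grouped by how closely they relate to each other. You might have a few repositories with a handful of projects in each one. That way, projects that are likely to want to share data can do so easily, and as new revisions are added to a repository, the developers know that those revisions come from their own project or at least from a project related to it.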

After deciding how to organize your projects with respect to repositories, you'll probably want to think about directory hierarchies within the repositories themselves. Because Subversion uses regular directory copies for branching and tagging (see Chapter 4), the Subversion community recommends that you choose a repository location for each project root (the "top-most" directory that contains data related to that project) and then create three subdirectories beneath that root: trunk, the directory under which the main project development occurs; branches, a directory in which to create various named branches of the main development line; and tags, a collection of tree snapshots that are created, and perhaps destroyed, but never changed. [28]

For example, your repository might look like:

/
   calc/
      trunk/
      tags/
      branches/
   calendar/
      trunk/
      tags/
      branches/
   spreadsheet/
      trunk/
      tags/
      branches/
   …

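Note that it doesn't matter where in your repository each project root is. If you have only one project per repository, the logical place to put each project root is at the root of that project's respective repository. If you have multiple projects, you might want to arrange them in groups inside the repository, perhaps putting projects with similar goals or shared code in the same subdirectory, or maybe just grouping them alphabetically. Such an arrangement might look like: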

/
   utils/
      calc/
         trunk/
         tags/
         branches/
      calendar/
         trunk/
         tags/
         branches/
      …
   office/
      spreadsheet/
         trunk/
         tags/
         branches/
      …
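One way to prepare a layout like this is to build the skeleton as ordinary directories first and bring it under version control afterwards. The sketch below, assuming the grouped layout shown above, creates the trunk, tags, and branches directories for each project under a scratch directory. The `GROUPS` table and the `make_skeleton` helper are hypothetical names used only for illustration.

```python
import os
import tempfile

# Assumed grouping, mirroring the example layout above.
GROUPS = {
    "utils": ["calc", "calendar"],
    "office": ["spreadsheet"],
}
TTB = ("trunk", "tags", "branches")


def make_skeleton(root, groups=GROUPS):
    """Create group/project/{trunk,tags,branches} under root.

    Returns the sorted list of relative paths that were created.
    """
    created = []
    for group, projects in groups.items():
        for project in projects:
            for sub in TTB:
                rel = os.path.join(group, project, sub)
                os.makedirs(os.path.join(root, rel), exist_ok=True)
                created.append(rel)
    return sorted(created)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for rel in make_skeleton(scratch):
            print(rel)
```

A tree built this way could then be added to the repository in a single commit, for example with `svn import`.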

Lay out your repository in whatever way you see fit. Subversion does not expect or enforce a particular layout—in its eyes, a directory is a directory is a directory. Ultimately, you should choose the repository arrangement that meets the needs of the people who work on the projects that live there.

In the name of full disclosure, though, we'll mention another very common layout. In this layout, the trunk, tags, and branches directories live in the root directory of your repository, and your projects are in subdirectories beneath those, like:

/
   trunk/
      calc/
      calendar/
      spreadsheet/
      …
   tags/
      calc/
      calendar/
      spreadsheet/
      …
   branches/
      calc/
      calendar/
      spreadsheet/
      …

There's nothing particularly incorrect about such a layout, but it may or may not seem as intuitive for your users. Especially in large, multi-project situations with many users, those users may tend to be familiar with only one or two of the projects in the repository. But the projects-as-branch-siblings approach tends to de-emphasize project individuality and focus on the entire set of projects as a single entity. That's a social issue, though. We like our originally suggested arrangement for purely practical reasons: it's easier to ask about (or modify, or migrate elsewhere) the entire history of a single project when there's a single repository path that holds the entire history (past, present, tagged, and branched) for that project and that project alone.

Deciding Where and How to Host Your Repository

Before creating your Subversion repository, an obvious question you'll need to answer is where the thing is going to live. This is strongly connected to a myriad of other questions involving how the repository will be accessed (via a Subversion server or directly), by whom (users behind your corporate firewall or the whole world out on the open Internet), what other services you'll be providing around Subversion (repository browsing interfaces, e-mail based commit notification, etc.), your data backup strategy, and so on.

We cover server choice and configuration in Chapter 6, but the point we'd like to briefly make here is simply that the answers to some of these other questions might have implications that force your hand when deciding where your repository will live. For example, certain deployment scenarios might require accessing the repository via a remote filesystem from multiple computers, in which case (as you'll read in the next section) your choice of a repository back-end data store turns out not to be a choice at all because only one of the available back-ends will work in this scenario.

Addressing each possible way to deploy Subversion is both impossible and outside the scope of this book. We simply encourage you to evaluate your options using these pages and other sources as your reference material, and plan ahead.

Choosing a Data Store

As of version 1.1, Subversion provides two options for the type of underlying data store (often referred to as "the back-end" or, somewhat confusingly, "the (versioned) filesystem") that each repository uses. One type of data store keeps everything in a Berkeley DB (or BDB) database environment; repositories that use this type are often referred to as being "BDB-backed". The other type stores data in ordinary flat files, using a custom format. Subversion developers have adopted the habit of referring to this latter data storage mechanism as FSFS [29], a versioned filesystem implementation that uses the native OS filesystem directly, rather than a database library or some other abstraction layer, to store data.

Table 5.1, "Repository Data Store Comparison" gives a comparative overview of Berkeley DB and FSFS repositories.

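Table 5.1. Repository Data Store Comparison

| Category | Feature | Berkeley DB | FSFS |
|---|---|---|---|
| Reliability | Data integrity | Extremely reliable when properly deployed; Berkeley DB 4.4 brings auto-recovery | Older versions had some very rare but data-destroying bugs |
| Reliability | Sensitivity to interruptions | Very sensitive; crashes and permission problems can leave the database "wedged", requiring journaled recovery procedures | Quite insensitive |
| Accessibility | Usable from a read-only mount | No | Yes |
| Accessibility | Platform-independent storage | No | Yes |
| Accessibility | Usable over network filesystems | Generally, no | Yes |
| Accessibility | Group permissions handling | Sensitive to user umask problems; best if accessed by only one user | Works around umask problems |
| Scalability | Repository disk usage | Larger (especially if log files aren't purged) | Smaller |
| Scalability | Number of revision trees | Database; no problems | Some older native filesystems don't scale well with thousands of entries in a single directory |
| Scalability | Directories with many files | Slower | Faster |
| Performance | Checking out latest revision | No meaningful difference | No meaningful difference |
| Performance | Large commits | Slower overall, but the cost is spread evenly across the lifetime of the commit | Faster overall, but finalization delay may cause client timeouts |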

There are advantages and disadvantages to each of these two back-end types. Neither of them is more "official" than the other, though the newer FSFS is the default data store as of Subversion 1.2. Both are reliable enough to trust with your versioned data. But as you can see in Table 5.1, "Repository Data Store Comparison", the FSFS back-end provides quite a bit more flexibility in terms of its supported deployment scenarios. More flexibility means you have to work a little harder to find ways to deploy it incorrectly. Those reasons, plus the fact that not using Berkeley DB means there's one fewer component in the system, largely explain why today almost everyone uses the FSFS back-end when creating new repositories.

Fortunately, most programs which access Subversion repositories are blissfully ignorant of which back-end data store is in use. And you aren't even necessarily stuck with your first choice of a data store—in the event that you change your mind later, Subversion provides ways of migrating your repository's data into another repository that uses a different back-end data store. We talk more about that later in this chapter.
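The data store is selected when a repository is created, through the `--fs-type` option of `svnadmin create`. The following is a minimal sketch of wrapping that choice in a script; the `create_command` helper and the example repository path are hypothetical, and the sketch only builds the command line rather than running it.

```python
VALID_FS_TYPES = ("fsfs", "bdb")


def create_command(repo_path, fs_type="fsfs"):
    """Build the svnadmin argument list for creating a repository."""
    if fs_type not in VALID_FS_TYPES:
        raise ValueError(f"unknown data store type: {fs_type!r}")
    return ["svnadmin", "create", "--fs-type", fs_type, repo_path]


if __name__ == "__main__":
    # Hypothetical path, shown only to illustrate the resulting command.
    print(" ".join(create_command("/var/svn/repos", "fsfs")))
```

The returned list can be handed to `subprocess.run` on a machine where `svnadmin` is installed.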

The following subsections provide a more detailed look at the available data store types.

Berkeley DB

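When the initial design phase of Subversion was in progress, the developers decided to use Berkeley DB for a variety of reasons, including its open-source license, transaction support, reliability, performance, API simplicity, thread-safety, support for cursors, and so on.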

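Berkeley DB provides real transaction support, perhaps the most powerful of the features listed above. Multiple processes accessing your Subversion repositories don't have to worry about accidentally clobbering each other's data. The isolation provided by the transaction system is such that for any given operation, the Subversion repository code sees a static view of the database (not a database that is constantly changing at the hand of some other process) and can make decisions based on that view. If a decision happens to conflict with what another process is doing, the entire operation is rolled back as if it never happened, and Subversion retries the operation against a new, updated (and yet still static) view of the database.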

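Another great feature of Berkeley DB is hot backups: the ability to back up the database environment without taking it "offline". We'll discuss how to back up your repository in the section "Repository Backup", but the benefits of being able to make complete copies of your repositories without any downtime should be obvious.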

Berkeley DB is also a very reliable database system when properly used. Subversion uses Berkeley DB's logging facilities, which means that the database first writes to on-disk log files a description of any modifications it is about to make, and then makes the modification itself. This ensures that if anything goes wrong, the database system can back up to a previous checkpoint (a location in the log files known not to be corrupt) and replay transactions until the data is restored to a usable state. See the section "Managing Disk Space" for more about Berkeley DB log files.
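The log-then-modify discipline described above is commonly called write-ahead logging. The following is a toy sketch of the idea only, not Berkeley DB's implementation or API: the `TinyStore` class, its one-record-per-line JSON log, and its method names are all assumptions made for illustration, and checkpoints are left out entirely.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self.recover()

    def recover(self):
        """Rebuild the in-memory state by replaying the on-disk log."""
        self.data = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                try:
                    record = json.loads(line)
                except ValueError:
                    # A torn final record marks where a crash happened;
                    # everything before it is still usable.
                    break
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Describe the modification in the log and force it to disk.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then make the modification itself.
        self.data[key] = value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        store = TinyStore(os.path.join(scratch, "log"))
        store.put("rev", 1)
        print(TinyStore(store.log_path).data)
```

Because every change reaches the log before it is applied, a process that starts after a crash can reconstruct a consistent state from the log alone.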

But every rose has its thorn, and so we must note some known limitations of Berkeley DB. First, Berkeley DB environments are not portable. You cannot simply copy a Subversion repository that was created on a Unix system onto a Windows system and expect it to work. While much of the Berkeley DB database format is architecture independent, there are other aspects of the environment that are not. Secondly, Subversion uses Berkeley DB in a way that will not operate on Windows 95/98 systems—if you need to house a BDB-backed repository on a Windows machine, stick with Windows 2000 or newer.

While Berkeley DB promises to behave correctly on network shares that meet a particular set of specifications, [30] most networked filesystem types and appliances do not actually meet those requirements. And in no case can you allow a BDB-backed repository that resides on a network share to be accessed by multiple clients of that share at once (which quite often is the whole point of having the repository live on a network share in the first place).

Warning

If you attempt to use Berkeley DB on a non-compliant remote filesystem, the results are unpredictable—you may see mysterious errors right away, or it may be months before you discover that your repository database is subtly corrupted. You should strongly consider using the FSFS data store for repositories that need to live on a network share.

Finally, because Berkeley DB is a library linked directly into Subversion, it's more sensitive to interruptions than a typical relational database system. Most SQL systems, for example, have a dedicated server process that mediates all access to tables. If a program accessing the database crashes for some reason, the database daemon notices the lost connection and cleans up any mess left behind. And because the database daemon is the only process accessing the tables, applications don't need to worry about permission conflicts. These things are not the case with Berkeley DB, however. Subversion (and programs using Subversion libraries) access the database tables directly, which means that a program crash can leave the database in a temporarily inconsistent, inaccessible state. When this happens, an administrator needs to ask Berkeley DB to restore to a checkpoint, which is a bit of an annoyance. Other things can cause a repository to "wedge" besides crashed processes, such as programs conflicting over ownership and permissions on the database files.

Note

Berkeley DB 4.4 brings (to Subversion 1.4 and later) the ability for Subversion to automatically and transparently recover Berkeley DB environments in need of such recovery. When a Subversion process attaches to a repository's Berkeley DB environment, it uses some process accounting mechanisms to detect any unclean disconnections by previous processes, performs any necessary recovery, and then continues on as if nothing happened. This doesn't completely eliminate instances of repository wedging, but it does drastically reduce the amount of human interaction required to recover from them.

So while a Berkeley DB repository is quite fast and scalable, it's best used by a single server process running as one user, such as Apache's httpd or svnserve (see Chapter 6), rather than accessed by many different users via file:// or svn+ssh:// URLs. If using a Berkeley DB repository directly as multiple users, be sure to read the section "Supporting Multiple Repository Access Methods".

FSFS

In mid-2004, a second type of repository storage system, one that doesn't use a database at all, came into being. An FSFS repository stores the changes associated with a revision in a single file, and so all of a repository's revisions can be found in a single subdirectory full of numbered files. Transactions are created in separate subdirectories as individual files. When complete, the transaction file is renamed and moved into the revisions directory, thus guaranteeing that commits are atomic. And because a revision file is permanent and unchanging, the repository also can be backed up while "hot", just like a BDB-backed repository.
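The write-then-rename pattern behind that atomicity guarantee can be sketched as follows. This illustrates the general technique, not FSFS's actual on-disk format or code: the `commit_revision` helper, the `.txn` suffix, and the directory names used here are assumptions chosen for the example.

```python
import os
import tempfile


def commit_revision(repo_dir, revision, payload):
    """Write payload to a transaction file, then publish it atomically."""
    txn_dir = os.path.join(repo_dir, "transactions")
    rev_dir = os.path.join(repo_dir, "revs")
    os.makedirs(txn_dir, exist_ok=True)
    os.makedirs(rev_dir, exist_ok=True)

    final_path = os.path.join(rev_dir, str(revision))
    # Revision files are permanent, so never overwrite one. (A real
    # repository would also serialize committers with a lock.)
    if os.path.exists(final_path):
        raise FileExistsError(f"revision {revision} already exists")

    # Build the complete file in a place readers never look.
    txn_path = os.path.join(txn_dir, f"{revision}.txn")
    with open(txn_path, "wb") as txn:
        txn.write(payload)
        txn.flush()
        os.fsync(txn.fileno())

    # On POSIX systems a rename within one filesystem is atomic: readers
    # see either no revision file or the whole one, never a partial write.
    os.replace(txn_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        print(os.path.basename(commit_revision(repo, 1, b"first change")))
```

If the process dies before the rename, the only residue is a stray file in the transactions directory; the revisions directory is untouched.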

The FSFS revision files describe a revision's directory structure, file contents, and deltas against files in other revision trees. Unlike a Berkeley DB database, this storage format is portable across different operating systems and isn't sensitive to CPU architecture. Because there's no journaling or shared-memory files being used, the repository can be safely accessed over a network filesystem and examined in a read-only environment. The lack of database overhead also means that the overall repository size is a bit smaller.

FSFS has different performance characteristics too. When committing a directory with a huge number of files, FSFS is able to more quickly append directory entries. On the other hand, FSFS writes the latest version of a file as a delta against an earlier version, which means that checking out the latest tree is a bit slower than fetching the fulltexts stored in a Berkeley DB HEAD revision. FSFS also has a longer delay when finalizing a commit, which could in extreme cases cause clients to time out while waiting for a response.

The most important distinction, however, is FSFS's imperviousness to "wedging" when something goes wrong. If a process using a Berkeley DB database runs into a permissions problem or suddenly crashes, the database can be left in an unusable state until an administrator recovers it. If the same scenarios happen to a process using an FSFS repository, the repository isn't affected at all. At worst, some transaction data is left behind.

The only real argument against FSFS is its relative immaturity compared to Berkeley DB. Unlike Berkeley DB, which has years of history, its own dedicated development team and, now, Oracle's mighty name attached to it, [31] FSFS is a much newer bit of engineering. Prior to Subversion 1.4, it was still shaking out some pretty serious data integrity bugs which, while only triggered in very rare cases, nonetheless did occur. That said, FSFS has quickly become the back-end of choice for some of the largest public and private Subversion repositories, and promises a lower barrier to entry for Subversion across the board.



[27] Whether founded in ignorance or in poorly considered concepts about how to derive legitimate software development metrics, global revision numbers are a silly thing to fear, and not the kind of thing you should weigh when deciding how to arrange your projects and repositories.

[28] The trunk, tags, and branches trio is sometimes referred to as "the TTB directories".

[29] Often pronounced "fuzz-fuzz", if Jack Repenning has anything to say about it. (This book, however, assumes that the reader is thinking "eff-ess-eff-ess".)

[30] Berkeley DB requires that the underlying filesystem implement strict POSIX locking semantics, and more importantly, the ability to map files directly into process memory.

[31] Oracle bought Sleepycat and its flagship software, Berkeley DB, on Valentine's Day in 2006.