Latest Posts

Archives [+]

Categories [+]

Authors [+]

CRX Tar PM

Day CRX by default stores the data in tar files, using the Tar PM (Tar Persistence Manager). This is quite different from Jackrabbit (the JCR reference implementation), which uses a SQL database. What is the Tar PM, and why does it use the old tar file format?

The Tar PM is a transactional storage engine that is specially made for JCR content. Like relational databases, it supports ACID (atomicity, consistency, isolation, durability), but it doesn't know SQL, or relational integrity, and stores the data in another way.

Append Only

The biggest difference to a regular database is: the Tar PM is append-only. It doesn't do any in-place updates. Whenever there is a change, the Tar PM appends an entry to the newest file. Even when deleting content, an (empty) entry is appended. While this seems wasteful, it is actually faster than a regular database, because each change results in only one write operation. A database first stores a change in the log file (in many cases the old data and then the new data), and then writes the data again, this time in the main area. The Tar PM only writes the data once. Unused, old data is removed in a separate optimize process during off-peak hours, for example at night.

A regular database appends changes in a log file first (>>>), and then stores it again in the main data file using in-place updates (+-). The Tar PM only appends changes to the latest data file.

Linear Multi-Segment Index

Searching for a particular record in an append-only storage is tricky if you don't know where the record is stored. Scanning all tar files would take too long. For quick access, you need to have an index. Regular databases usually use a B-tree index for this. The problem is: B-trees are not append-only.

The Tar PM uses a special append-only index. Keys are stored in one or multiple index segments. Each segment is a sorted list, and each persistent segment is a file (a tar file of course!). Modifications to the index are kept in-memory until this structure grows too large, then sorted and written to a new file. There could be multiple persisted index segments, but index segments are later merged, ultimately into one large list.

A key lookup goes like this: First, the Tar PM looks in the in-memory index segment, which is implemented as a hash table. If the key is not found, then a lookup in the cache is made, which is also just a hash table. Afterwards, the Tar PM checks the persistent index segments, newest segment first.

Entries are sorted by key (131-903) in the index segment. The index contains pointers ([1]-[9]) where the actual content is stored in the data file.

As the index segments are sorted by key, you could do a binary search. But there is a faster way: the expected position of the key in the list is calculated directly from the key. This is possible because Jackrabbit generates keys (UUIDs) randomly, so entries are distributed almost evenly. In the example above, if you know there are entries 131-903, you can guess where the key 211 is stored, even if there are many keys. To further improve the accuracy of the lookup, the list is split into a number of groups, and the item count of each group is taken into account. If the expected position is off, the file is scanned similar to binary search, but in reality that is very seldom.

Like the data area, the index is append-only, that means data is never overwritten. Old, unused index segments are removed once they are no longer needed.

Why the Tar File Format

The file format of most database systems is proprietary, which makes it hard or impossible to read. The Tar PM uses the standard tar file format. If you are interested, you can inspect the files using one of the many tools that support this format. The tar format is future proof, and has a number of other advantages. One example is point-in-time recovery: while not directly implemented in the Tar PM yet, it is actually quite simple to do manually - just truncate the tar file at a given entry.

Backup

Another advantage of the append-only nature is 'backup-ability': Files are never modified after they are written, so backing up the repository is very simple - just copy all files.

 
(optional)
15 comments
Interesting. Tar stands for "Tape Archive" and I really got the image of that ol' music freak, recording jam sessions at day and cutting that tape at night while reading your article :)
0 Replies » Reply
What happens when you lose power in between the metadata write and the data write?

Do you have some kind of integrity checking on each record you append to the tar file?
0 Replies » Reply
Yes, there are checksums. There is a checksum for entry headers in the tar file format specification, and additionally the Tar PM adds a checksum to the data. Then there is a checksum for each index segment. Basically a checksum for everything.

In the recovery process after a power failure, the checksums are compared, and what doesn't match is truncated (actually a backup is created first, just in case). Uncommitted transactions are rolled back as well here.
0 Replies » Reply
So where do I get Tar PM?
0 Replies » Reply
It's included in CRX Quickstart (download here )
0 Replies » Reply
Wow, interesting. I do something similar for Cyrus IMAPd backups. I've got a parser that can read the index files directly from the disk format, and use that to map to the SHA1 of the actual data file, providing de-duplication. There's also a TAR file library (written in Perl, all our stuff is in Perl!) which handles streaming the tar file.

When the file gets below 80% "liveness" it gets re-compressed by streaming through the file and only copying the records which are still "alive" according to the index.

Deletions are handled by symlink records to a blank path.

the 512 byte blocking makes the format relatively easy to do disaster-recovery on, and we store the sha1 behind a NULL in the linkname part of the header - gnu tar can handle the files fine, but we can still do integrity checking if needed.
0 Replies » Reply
How do you handle replication? Do CRX/Tar PM support master/master style clustering?
0 Replies » Reply
Bron, that actually sounds like a great tool. Is it something you guys want to keep in-house, or can you through it up on Sourceforge?
0 Replies » Reply
Clustering is supported. It's not so easy to describe how it works :-) I will write an article about that when I have time.
0 Replies » Reply
Thomas, I know diskspace is not very expensive anymore, but still it is something which has limits. How do you handle content with has a high update frequencies? Append only sounds like a lot of outdated stuff will remain lying around.

Are there cleanup procedures to take old stuff pout of the live CRX possibly to a backup/archive location?
0 Replies » Reply
Hi,

I'm sorry, I didn't see your comment before.

> content with high update frequencies

Using generational garbage collection. Example: if a new tar file doesn't contain 'active' data, then it is deleted right away.
0 Replies » Reply
What are the advantages over database?
0 Replies » Reply
That is extremely cool, in that it is perfect for the space of storing binary content for long periods of time.

Question 1: Can you use it with Jackrabbit? :-)

Question 2: Is it designed such that if you migrated a workspace/repository to optical media (CD/DVD) that the TarPM would continue function in a read-only mode without any issues?
0 Replies » Reply
That was an inspiring post,

in this way no data will get lost

Thanks for writing about it
0 Replies » Reply
Very helpful post. What are the side effects of the optimization step when unused, old data is removed from the tar? I assume there are some side effects since you recommend that this operation be done off hours. Does the operation disable some functionality within the repository?

Thanks
0 Replies » Reply