Archive for 'November 2008'
I'm excited to point you to the announcement of CQ5's General Availability.
CQ5.1 sets new standards for fun to develop, maintainability, usability, and fun to use in the enterprise software space, especially in the WCM/ECM market for large enterprises.
As readers of this blog know, we've been working to validate our vision for CQ5 through our product release program and its quality gates. In addition to surpassing our quality milestones, we received great feedback from our Tech Preview and Beta customers: some have shown us ways to go even further down our road, and others have confirmed that our objectives hold up in real live production, authoring, and development. We had to add slots to our oversubscribed beta program, and all participants showed early and high levels of interest. This trend has only accelerated over the last few weeks.
For those of you who were not able to travel to Basel for our Worldwide Customer Summit, see the pointers here for some of our beta customers' feedback.
If you haven't yet, I encourage you to try it for yourself. For more information you can also contact firstname.lastname@example.org.
There is quite a difference between JCR-based applications and RDBMS-based applications in terms of access control (ACLs). These differences exist not only in the ACL-handling of the respective storage, but (as a consequence) also in the ACL handling of web apps that are based on JCRs or RDBMSs, respectively.
First, let's look at the repository level. In the JCR model each node possesses ACL settings for reading, deleting, modification, etc. A piece of content (i.e. JCR node) knows by itself which groups of users have which rights. Because the JCR content model is also hierarchical this ACL approach is comparable to the approach taken in (Unix) file systems.
The situation is different in relational databases, where access rights are usually scoped by table (or view). The individual content elements (rows in the table) do not possess individual access rights.(*)
The table-level granularity of RDBMS access rights has consequences for RDBMS-based web apps: the DB connection is made as a technical user. This technical user needs to have the necessary rights to access or modify the tables required by the application (i.e. the technical user needs to have the cumulative access rights of all web users). As a consequence, the web app itself needs to implement the content access rights. Rather than having access control as a feature that is implemented only once (by the infrastructure developer) and tested a million times (with each application), it is a feature that is implemented again and again with each application. Moreover, two or more applications that operate on the same data run the risk of implementing different ACLs.
JCR-based web applications like Sling (or apps that are based on Sling) behave differently: the web user's credentials are used to create a session in the repository. As such the web app does not perform any access control at all, but rather delegates this task to the JCR repository. The obvious advantages are that the app developer is not bothered with implementing access control and that the ACLs are consistent between multiple apps running on the same JCR (you might have two apps already: consider that Jackrabbit and CRX come with a built-in WebDAV server that constitutes a web app in this sense). Again, the JCR-model can be compared to Unix where commands like "ls" or "tail" do not implement their own file access control, but rather inherit the rights of the user that executes them.
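The contrast above can be sketched as a toy model. This is not the JCR API (a real application would call `javax.jcr.Repository.login()` with the user's credentials); it is just a minimal illustration, with made-up paths and user names, of the point that the repository itself enforces per-node ACLs, so application code never re-implements access control:

```python
# Toy model of the JCR approach: the repository stores an ACL per node
# and every session carries the web user's identity, so access checks
# happen in the repository, not in the application.

class Repository:
    def __init__(self):
        self.nodes = {}   # path -> content
        self.acls = {}    # path -> set of users allowed to read

    def add_node(self, path, content, readers):
        self.nodes[path] = content
        self.acls[path] = set(readers)

    def login(self, user):
        # The web user's credentials create the session; no shared
        # "technical user" with cumulative rights is needed.
        return Session(self, user)

class Session:
    def __init__(self, repo, user):
        self.repo, self.user = repo, user

    def read(self, path):
        # The check is delegated to the repository's per-node ACL.
        if self.user not in self.repo.acls.get(path, set()):
            raise PermissionError(f"{self.user} may not read {path}")
        return self.repo.nodes[path]

repo = Repository()
repo.add_node("/content/public", "hello", readers={"alice", "bob"})
repo.add_node("/content/secret", "top secret", readers={"alice"})

print(repo.login("bob").read("/content/public"))   # -> hello
# repo.login("bob").read("/content/secret")        # raises PermissionError
```

Two different applications using this repository would automatically agree on the ACLs, because neither of them contains any access-control logic of its own.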
Here is an overview of the situation in both cases:
(*) It should be noted that this can be fixed, e.g. by implementing triggers or stored procedures. However, in this post I would like to look at the standard out-of-the-box case, i.e. the infrastructure that the storage software actually provides. The application developer should not be burdened with infrastructure development.
Day CRX by default stores the data in tar files, using the Tar PM (Tar Persistence Manager). This is quite different from Jackrabbit (the JCR reference implementation), which uses a SQL database. What is the Tar PM, and why does it use the old tar file format?
The Tar PM is a transactional storage engine that is specially made for JCR content. Like relational databases, it supports ACID (atomicity, consistency, isolation, durability), but it doesn't know SQL, or relational integrity, and stores the data in another way.
The biggest difference from a regular database is that the Tar PM is append-only. It doesn't do any in-place updates. Whenever there is a change, the Tar PM appends an entry to the newest file. Even when deleting content, an (empty) entry is appended. While this seems wasteful, it is actually faster than a regular database, because each change results in only one write operation. A database first stores a change in the log file (in many cases the old data and then the new data), and then writes the data again, this time in the main area. The Tar PM only writes the data once. Unused, old data is removed in a separate optimize process during off-peak hours, for example at night.
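The append-only idea can be sketched in a few lines. This is a hypothetical simplification (the real Tar PM stores serialized JCR items in tar entries), but it captures the three behaviors described above: every change is one appended write, deletes append a tombstone, and an optimize pass later rewrites only the live entries:

```python
# Minimal sketch of an append-only store in the spirit of the Tar PM.

class AppendOnlyStore:
    def __init__(self):
        self.log = []      # the append-only "data file"
        self.index = {}    # key -> position of the newest entry

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))   # one write, no in-place update

    def delete(self, key):
        self.put(key, None)             # deletes append a tombstone entry

    def get(self, key):
        pos = self.index.get(key)
        if pos is None or self.log[pos][1] is None:
            raise KeyError(key)
        return self.log[pos][1]

    def optimize(self):
        # The off-peak "optimize" pass: keep only the newest live
        # value per key and rewrite the log without dead entries.
        live = [(k, self.log[p][1]) for k, p in self.index.items()
                if self.log[p][1] is not None]
        self.log, self.index = [], {}
        for k, v in live:
            self.put(k, v)

s = AppendOnlyStore()
s.put("a", 1); s.put("a", 2); s.delete("a"); s.put("b", 3)
print(len(s.log))   # 4 appended entries, nothing overwritten
s.optimize()
print(len(s.log))   # 1 entry left: only "b" is still live
```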
A regular database appends changes in a log file first (>>>), and then stores it again in the main data file using in-place updates (+-). The Tar PM only appends changes to the latest data file.
Linear Multi-Segment Index
Searching for a particular record in an append-only storage is tricky if you don't know where the record is stored. Scanning all tar files would take too long. For quick access, you need to have an index. Regular databases usually use a B-tree index for this. The problem is: B-trees are not append-only.
The Tar PM uses a special append-only index. Keys are stored in one or multiple index segments. Each segment is a sorted list, and each persistent segment is a file (a tar file of course!). Modifications to the index are kept in-memory until this structure grows too large, then sorted and written to a new file. There could be multiple persisted index segments, but index segments are later merged, ultimately into one large list.
A key lookup goes like this: First, the Tar PM looks in the in-memory index segment, which is implemented as a hash table. If the key is not found, then a lookup in the cache is made, which is also just a hash table. Afterwards, the Tar PM checks the persistent index segments, newest segment first.
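The three-stage lookup order can be expressed directly in code. In this sketch all three stores are simplified to plain dicts (the real in-memory segment and cache are hash tables; persisted segments are sorted lists in tar files):

```python
# Sketch of the Tar PM key lookup order: in-memory segment, then the
# cache, then the persisted index segments, newest segment first.

def lookup(key, in_memory, cache, persisted_segments):
    if key in in_memory:        # 1. in-memory index segment (hash table)
        return in_memory[key]
    if key in cache:            # 2. cache (also just a hash table)
        return cache[key]
    for segment in reversed(persisted_segments):   # 3. newest first
        if key in segment:
            return segment[key]
    return None

# persisted_segments is ordered oldest .. newest; the newer segment
# shadows the older one for the same key.
segments = [{"old": 1}, {"old": 2, "new": 3}]
print(lookup("old", {}, {}, segments))   # -> 2, newest segment wins
```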
Entries are sorted by key (131-903) in the index segment. The index contains pointers (-) where the actual content is stored in the data file.
As the index segments are sorted by key, you could do a binary search. But there is a faster way: the expected position of the key in the list is calculated directly from the key. This is possible because Jackrabbit generates keys (UUIDs) randomly, so entries are distributed almost evenly. In the example above, if you know there are entries 131-903, you can guess where the key 211 is stored, even if there are many keys. To further improve the accuracy of the lookup, the list is split into a number of groups, and the item count of each group is taken into account. If the expected position is off, the file is scanned similarly to a binary search, but in practice that is seldom needed.
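The position-guessing idea is essentially interpolation search. Below is a generic sketch over sorted random keys (it omits the grouping refinement the Tar PM adds, and it falls back to narrowing the range rather than a local scan, but the core trick is the same: estimate the position from the key value, assuming an even distribution):

```python
import uuid

# Interpolation lookup: guess the expected position of the target
# from its value, which works well when keys are evenly distributed,
# as randomly generated UUIDs are.

def interpolation_find(keys, target):
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        # Estimate the position by linear interpolation over the range.
        span = keys[hi] - keys[lo] or 1
        pos = lo + (hi - lo) * (target - keys[lo]) // span
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1   # not found

# Random keys behave like UUIDs: nearly uniform, so the first
# guess usually lands very close to the true position.
keys = sorted(uuid.uuid4().int for _ in range(1000))
assert all(interpolation_find(keys, k) == i for i, k in enumerate(keys))
```

For uniformly distributed keys this takes O(log log n) steps on average, versus O(log n) for binary search.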
Like the data area, the index is append-only, that means data is never overwritten. Old, unused index segments are removed once they are no longer needed.
Why the Tar File Format
The file format of most database systems is proprietary, which makes it hard or impossible to read. The Tar PM uses the standard tar file format. If you are interested, you can inspect the files using one of the many tools that support this format. The tar format is future proof, and has a number of other advantages. One example is point-in-time recovery: while not directly implemented in the Tar PM yet, it is actually quite simple to do manually - just truncate the tar file at a given entry.
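To illustrate that inspectability, here is a small sketch using Python's standard-library tarfile module. The entry names are made up; the point is only that a standard tar file, like the ones the Tar PM writes, can be created and listed with any off-the-shelf tar tool:

```python
import io
import tarfile

# Build a tiny tar archive in memory with two hypothetical entries...
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("entry-0001", b"old data"),
                          ("entry-0002", b"new data")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# ...and list its contents, just as `tar -tvf` would on disk.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    print([m.name for m in tar.getmembers()])
    # -> ['entry-0001', 'entry-0002']
```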
Another advantage of the append-only nature is 'backup-ability': Files are never modified after they are written, so backing up the repository is very simple - just copy all files.
ApacheCon US starts today in New Orleans. If you happen to visit it (lucky you!) please consider the talks given by Day's engineers:
Bertrand Delacretaz: Rapid JCR applications development with Sling: Sling is an OSGi-based scriptable applications layer, based on REST principles, that runs on top of a JCR content repository. In this talk, we'll see how Sling enables rapid development of JCR-based content applications, by leveraging the JSR 223 scripting framework along with the rich set of OSGi components provided by Sling. We will create a simple application from scratch in a few minutes, and explain a more complex multimedia application that does a lot with few lines of code. This talk will help you get started with Sling and understand how the different components fit together.
Open Source Collaboration Tools are Good For You!: What are the core requirements for a set of team collaboration tools? Looking at how ASF project communities collaborate online, we have identified four core drivers that help these projects succeed. We will show how the collaboration tools used by the ASF can allow any project team to move from an "ask around the office" collaboration model to our efficient "distributed self service information" model, while focusing on those core drivers to avoid being distracted by the tools themselves. Our analysis will help you estimate the effort and expected benefits of such a move.
Carsten Ziegeler: Apache Felix - A Standard Plugin Model for Apache: OSGi technology is becoming the preferred approach for creating highly modular and dynamically extensible applications. The Eclipse IDE was the first highly visible project to adopt OSGi technology a few years ago, but more and more projects are moving in the same direction (e.g., Spring, JOnAS) or considering it (e.g., Directory, Geronimo, JAMES, Jackrabbit). With Apache Felix readily available, there is no better time to start moving to OSGi technology. This talk will provide a status update on the Apache Felix project and will show in detail how to launch and embed the Apache Felix framework into your own projects and the issues around doing so. By attending this talk, you will receive enough information to immediately start using Apache Felix as a dynamically extensible plugin mechanism in your own applications, additionally providing them the benefits of module version management, dependency resolution, and life cycle control.
Jukka Zitting: Introduction to JCR and Apache Jackrabbit: Apache Jackrabbit is a fully conforming implementation of the Content Repository for Java Technology API (JCR). JCR is a standard for managing rich hierarchical content models with features like full text search, versioning, and transactions. This presentation introduces you to the key concepts of JCR and shows you how to use Apache Jackrabbit and related projects to build various types of content applications like wiki and blog engines, email archives, image galleries, etc.