Joint post of Henri Bergius and Michael Marth cross-posted here and here.
Web Content Repositories are more than just plain old relational databases. In fact, the requirements that arise when managing web content have led to a class of content repository implementations that are comparable on a conceptual level. During the IKS community workshop in Rome we got together to compare JCR (the Jackrabbit implementation) and Midgard's content repository. While in some cases the terminology might be different, many of the underlying ideas are identical. So we came up with a list of common traits and features of our content repositories. For comparison, there is also Apache CouchDB.
So, why use a Content Repository for your application instead of the old familiar RDBMS? Repositories provide several advantages:
- Common rules for data access mean that multiple applications can work with the same content without breaking consistency of the data
- Signals about changes let applications know when another application using the repository modifies something, enabling collaborative data management between apps
- Objects instead of SQL mean that developers can deal with data using APIs more compatible with the rest of their desktop programming environment, and without having to fear issues like SQL injection
- Data model is scriptable when you use a content repository, meaning that users can easily write Python or PHP scripts to perform batch operations on their data without having to learn your storage format
- Synchronization and sharing features can be implemented on the content repository level meaning that you gain these features without having to worry about them
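To make the "signals about changes" and "objects instead of SQL" points concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the API of JCR, Midgard or CouchDB: the class `ToyRepository` and its methods `subscribe`, `save` and `get` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical callback signature: (event name, object id, properties)
Listener = Callable[[str, str, dict], None]


class ToyRepository:
    """In-memory stand-in for a content repository (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, dict] = {}
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, event: str, callback: Listener) -> None:
        """Register a callback for 'created' or 'updated' events."""
        self._listeners[event].append(callback)

    def save(self, object_id: str, properties: dict) -> None:
        """Store an object and notify every listener of the change."""
        event = "updated" if object_id in self._objects else "created"
        self._objects[object_id] = dict(properties)
        for callback in self._listeners[event]:
            callback(event, object_id, dict(properties))

    def get(self, object_id: str) -> dict:
        """Return a copy of the stored properties."""
        return dict(self._objects[object_id])


# Two "applications" share one repository: one writes, one listens.
repo = ToyRepository()
seen = []
repo.subscribe("created", lambda event, oid, props: seen.append((event, oid)))
repo.subscribe("updated", lambda event, oid, props: seen.append((event, oid)))

repo.save("article-1", {"title": "Hello"})
repo.save("article-1", {"title": "Hello again"})
print(seen)
```

The writing application never addresses the listener directly. Both only agree on the repository, which is the property the list above describes.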
| feature | JCR / Jackrabbit | Midgard | CouchDB |
| --- | --- | --- | --- |
| content type system | In JCR structured or unstructured nodes are supported and can be mixed at will in a content tree. | Content types are defined in MgdSchema types. All content must be stored to an MgdSchema type, but types can be extended on content instance level using the "parameter" triplets | Type-free |
| type hierarchy | Structured node types support inheritance of types, additional cross-cutting aspects can be added with "mixins". Node types can define allowed node types for child nodes in the content hierarchy. | MgdSchemas allow inheritance, and an extended type can be instantiated either using the extended type or the base type | Type-free |
| IDs | Nodes with mixin "referenceable" have a GUID | Every object has a GUID used for referencing. Objects located in trees that have a "name" property can also be referred to using the path | All objects can be accessed via a UUID |
| References | Nodes can reference each other with hard link (special property type) or soft link (by referring to the node path) | MgdSchema types can have properties linking to other objects of same or different type. A link of "parentfield" type places an MgdSchema type in a tree. | No reference support built-in |
| content hierarchy | All content is hierarchical / in a tree | Content can exist in tree, or independently of it depending on the MgdSchema type definition | flat structure |
| interesting property types | Multi-valued (like an array), binary properties (e.g. for files), nodes have an implicit sort-order | Binary properties stored using the Midgard Attachment system | Support for binary properties |
| transactions | Multiple content modifications are written in transactions. | Transactions can be used optionally. | |
| events | JCR Observers can register for content changes on different paths and/or for different node types and/or CRUD, receive notification of changes as serialized node | All transactions cause both process-internal GObject signals, and interprocess DBus signals | Support for one external event notification shell script |
| workspaces | Workspaces provide separate root trees. | No workspaces support in Midgard 9.03, coming in next version | Multiple databases within one CouchDB instance |
| import and export | Nodes or parts of the repository (or the whole repo) can be imported or exported in XML. 2 formats: docview for human-friendly representation, sysview including all technical aspects | Objects can be exported and imported in XML format. There are tools supporting replication via HTTP, tarballs, XMPP, and the CouchDB replication protocol | JSON serialization is the standard way of accessing the repository. CouchDB replication protocol supports full synchronization between instances |
| versioning | Checkin/checkout model to create new versions of nodes, optionally versions complete sub-trees, supports branching of versions. | No versioning | All versions of content are stored and accessible separately, no branching |
| locking | Nodes can be locked and unlocked | Objects can be locked and unlocked | |
| object mapping | Not in standard, but implemented in Jackrabbit. Rarely used in practice. | Object mapping is the standard way of accessing the repository | All content is accessed via JSON objects |
| queries | In JCR1 SQL or XPath, in JCR2 also QueryBuilder. | Query Builder | JavaScript map/reduce |
| access control | Done on repository level, i.e. all access control is independent of application. In Jackrabbit: pluggable authentication/authorization handlers. | No access control in Midgard repository, usually implemented on application level. Midgard provides a user authentication API | No access control |
| persistence | In Jackrabbit different Persistence Managers can be plugged in (RDBMS, tar file, ...) | libgda allows storage to different RDBMS like MySQL, SQLite and Postgres | CouchDB has its own storage |
| architecture | Jackrabbit: library (jar), JEE resource, OSGi bundle or standalone server | Library | Erlang-based daemon |
| APIs | Standard: Java-based, PHP coming up. In Jackrabbit: also WebDAV and HTTP-based API | C, Objective-C, PHP, Python | HTTP+JSON |
| full-text search | Included in repository. In Jackrabbit: Lucene bundled | No (SOLR used on application level) | Plugin for using Lucene, not installed by default |
| standard metadata | All nodes have access rights, jcr:primaryType and jcr:mixinTypes properties. JCR 2.0 standardizes a set of optional metadata properties. | All objects have a set of standard metadata including creator, revisor, timestamps etc | No standard properties |
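
The "queries" row lists map/reduce for CouchDB. As a rough illustration of that query style over flat, type-free documents, here is a sketch in plain Python. CouchDB's own views are written in JavaScript and its reduce functions take more arguments than shown here; the helper `run_view` and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict

# Flat, schema-free documents, as in the "type-free" / "flat structure" cells.
documents = [
    {"_id": "a1", "type": "article", "tags": ["jcr", "midgard"]},
    {"_id": "a2", "type": "article", "tags": ["couchdb"]},
    {"_id": "c1", "type": "comment", "tags": ["jcr"]},
]


def map_tags(doc):
    """Map step: emit one (key, value) pair per tag on the document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_count(values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Hypothetical helper: group mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


tag_counts = run_view(documents, map_tags, reduce_count)
print(tag_counts)
```

Because the documents carry no schema, the map function has to tolerate missing fields, which is why it uses `doc.get("tags", [])`.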

Two random initial thoughts, though:
Content Hierarchy - Why not a graph? Or simply tagged? Content Trees are like the Apple Trees in the Garden of Eden - the root of all evil.
Workspaces - How is a Workspace different to a version branch?
I'm not sure if this is useful, but I started collecting various XML representations from various repositories.
http://jonontech.com/projects/xml-export-formats/
The aim was to make a table like the above but including many more vendors. Anyone interested in help with this? Maybe a Google Doc?
glad you like it.
Re your thoughts:
- for JCR the hierarchical model has turned out to be the most useful for web content (for our purposes). Nothing wrong with other models, of course. It really depends on the problem to solve.
Other than that, your example "tagged" is covered quite well by a flat hierarchy IMO.
- a workspace is a completely different space with its own root, access rights, search results etc. More like a completely different repository than just a branch. Can be used e.g. in a WCM staging-publish scenario.
Collecting XML representations looks useful (e.g. to help with content migrations). I am happy to help. Google Doc sounds good to me.
JCR is primarily a tree, Midgard a graph, and CouchDB tagged.
In addition to the primary model, both JCR and Midgard have pretty good support for all of the models.
There are a number of benefits you get from a tree hierarchy, most notably the straightforward URI mapping and the simple access control model. The tree model also has theoretical benefits related to partitioning and things like MVCC.
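A small sketch of those two benefits, assuming a toy tree model in Python. The names `Node`, `resolve` and `can_read` are hypothetical and do not correspond to the JCR or Midgard APIs. A URI path maps onto the tree segment by segment, and a node without its own access rule inherits the rule of its nearest ancestor.

```python
from typing import Dict, Optional, Set


class Node:
    """Toy content node; 'readers' set to None means: inherit from parent."""

    def __init__(self, name: str, parent: Optional["Node"] = None,
                 readers: Optional[Set[str]] = None) -> None:
        self.name = name
        self.parent = parent
        self.readers = readers
        self.children: Dict[str, "Node"] = {}
        if parent is not None:
            parent.children[name] = self

    def path(self) -> str:
        """Build the node's path by walking up to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def resolve(root: Node, uri_path: str) -> Optional[Node]:
    """Map a URI path onto the tree, one segment per level."""
    node = root
    for segment in filter(None, uri_path.split("/")):
        node = node.children.get(segment)
        if node is None:
            return None
    return node


def can_read(node: Node, user: str) -> bool:
    """Use the nearest ancestor (or the node itself) that defines readers."""
    current: Optional[Node] = node
    while current is not None:
        if current.readers is not None:
            return user in current.readers
        current = current.parent
    return False


root = Node("", readers={"alice", "bob"})
news = Node("news", root)                         # inherits from root
drafts = Node("drafts", news, readers={"alice"})  # overrides for the subtree
post = Node("launch", drafts)                     # inherits from drafts

print(resolve(root, "/news/drafts/launch") is post)
print(can_read(post, "alice"), can_read(post, "bob"))
```

A single rule on `drafts` covers everything below it, which is what keeps the access control model simple in a tree.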
We've been discussing these things in London pubs a lot recently. In particular with Justin Cormack (@justincormack) who will correct me if I'm talking rubbish.
But, we like the model where WCM staging-publish models are simply part of the (distributed) revision control system. Each "environment"/"workspace" is a branch. Publishing is like merging changes from one to another.
And yes, I agree that a tree can of course work. But defining the aspect as "content hierarchy" to me, at least, implies a tree. Maybe it should be "content structure" or something ...
Looking forward to the Google Doc.
Jon
> hierarchy" to me, at least, implies a tree
Agreed. We should have named it "model" or "structure".
Professional distortion of mine :)
I believe the branches / workspaces / environments are a good candidate for "properly name and define the concepts" (from your first comment).
Nodes with mixin "referenceable" have GUID.
The tree issue is interesting. Having a base tree for human organizational purposes (and as you mention for things like permissions where human validation of assignment is important) is very useful. It also enables a mapping to file systems, WebDAV and so on. I think the mapping to URLs is only valid for simple applications and you should expect to have to remap differently in the application layer in many cases. I would argue that application level URI mapping is not the content repository's responsibility, although it may need to have supporting tools.
I am the @justincormack who has been discussing versioning a lot with Jon. Everyone who has tried knows that reconciling terminology and concepts on versioning is difficult, but it needs to be pursued. Alas, I couldn't make this IKS workshop, but I was at the first one and should be at the next one. The big terminology issue was raised then, and I do think it is important that we all understand where the differences are merely terminological and where they are technical.
Terminology wise, I dislike "references" and prefer "relations". This is web content management after all. But also it lets you ask questions like what metadata can relations have (just names, or properties too), and can you have standalone relations (like an external RDF file or equivalent) or not. Also it is useful to know what the query model is for relations (if any; although I would count traversal tools as queries like neo4j does).
It would be useful to add CMIS, as it supports many of the web content repository requirements even though that is not its use case (e.g. it has standalone relation support, with metadata on the relations); it gives some more comparison points.
I am still working on a REST repository spec, as referenced here: http://blog.technologyofcontent.com/2009/10/restful-daydream-4/ and I hope to do some more concrete drafting shortly, so will try to fit things into this type of terminology.
I had not looked at the Midgard stuff for a while, good to have a summary and will look more.
thanks for the insightful comment.
Re the CouchDB versioning: Bergie and I actually had an argument about this issue when compiling the list. I took your view, but the counter argument was that the versions remain available unless you explicitly purge them. So we settled for "supported".
It seems to be another case for "properly name and define the concepts".
Looking forward to your REST repository proposal.
I think that a repo should expose a graph. In JCR 1.0 we initially specified a graph but at the last minute a lot of vendors did not find that very easy to implement, so we postponed it to JCR 2.0. In JCR 2.0 the graph is back in the shape of "Shareable Nodes" [1]
See you soon ;)
regards,
David
[1] http://www.day.com/specs/jcr/2.0/14_Shareable_Nodes.html