Structured or unstructured? In JCR you do not have to choose

Recently, I read up on quite a number of NoSQL protagonists. Of course, one dominant theme in NoSQL land is "schemaless" as opposed to the full-schema nature of relational databases. As usual, both approaches have their specific pros and cons. A common criticism of schemaless data stores is that the entropy of the data creates problems in the long run, once too much unstructured data has been amassed. On the other hand, full-schema databases are much less flexible, or downright the wrong tool, for unstructured data.

In this post I would like to point out that you do not necessarily have to choose between those extremes: JCR-based data stores allow you to store unstructured data, fully structured data and anything in between. For lack of a better term I would like to call this a "schema-optional" data store with "semi-structured" data.

  • The JCR node type nt:unstructured is designed to accept any properties, so you can dump strings, dates or even binaries into such a node at will. This node type is very useful for getting started with coding an application when you do not yet know what the end result should look like. It allows for a development approach coined "data first, structure later", where structure emerges from the data rather than being defined a priori.
  • On the other end of the spectrum you can have rigidly defined node types. JCR allows you to specify, for example, mandatory properties, default values or the allowed child node types in a node hierarchy. The Apache Jackrabbit site has a good overview of the Compact Namespace and Node Type Definition (CND), the notation used to define such structure.
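
To give an impression of the rigid end of the spectrum, here is a minimal CND sketch. The ex prefix, its namespace URI and the node types ex:article and ex:comment are made up for illustration; only the notation itself is taken from CND.

```
// Hypothetical namespace, for illustration only
<ex = 'http://example.com/ns/1.0'>

// A comment must carry an author and a text
[ex:comment] > nt:base
  - ex:author (string) mandatory
  - ex:text (string) mandatory

// An article has a mandatory title, a status that defaults to 'draft',
// and accepts only ex:comment nodes as children
[ex:article] > nt:base
  - ex:title (string) mandatory
  - ex:status (string) = 'draft' autocreated
  + * (ex:comment)
```

A node of type ex:article cannot be saved without ex:title, and any property or child node not listed here is rejected. With nt:unstructured, by contrast, none of these restrictions apply.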

In between these two extreme cases any middle ground is possible in JCR repositories:

  • First, a rigid node type definition for a specific node can define "residual" properties. Such an approach allows the application to set not only the properties that were defined a priori in the node type definition, but also anything else. This is particularly useful for scenarios where only part of the requirements is known beforehand or where the requirements are known to evolve over time. You can define the known parts, but an application can still freely write anything into the node as if it were unstructured.
  • Second, it should also be noted that these structured, unstructured and semi-structured nodes can happily live next to each other in the same repository tree. So different parts of your application can make use of different levels of structure not only through different node types, but also through different parts in the node hierarchy.
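
As a rough sketch of the residual approach in CND: the node type below fixes what is known up front and leaves the rest open. The ex prefix, the namespace URI and the type ex:product are hypothetical names chosen for this example.

```
// Hypothetical namespace, for illustration only
<ex = 'http://example.com/ns/1.0'>

[ex:product] > nt:base
  // The part of the structure that is known a priori
  - ex:name (string) mandatory
  // Residual definitions: any other property, of any type,
  // single-valued or multi-valued, is accepted as well
  - * (undefined)
  - * (undefined) multiple
```

The two residual lines are the same kind of property definitions that nt:unstructured itself relies on, which is why such a node behaves like an unstructured one for everything beyond ex:name.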

With JCR 2.0 it has become quite a bit easier to evolve the structure (after all, the mantra is "data first, structure later", not "structure never"): one can now change the node types of existing nodes. That facilitates a migration from, say, nt:unstructured nodes to more structured types.

 

COMMENTS

  • By James Stansell - 3:28 PM on May 04, 2010
    The JCR is so flexible that a property in one node can have a different type than the property of the same name in a different node! (my team had some recent hilarity with this one) ;-)