cq5 content models: the tags

In software engineering, modularity often leads to hard choices about how big or small things should be. In a JCR content repository, the question is: how granular should my content be? A more granular structure contains more information, but too much granularity might slow things down.

Inside a JCR node, we can create a simple or complex hierarchy of content atoms and metadata. But how far should we go? Should we think in terms of files, mini-databases, or simple name-value pairs?

JCR beginners often have a hard time figuring out the best content models for their problem, so we thought we'd share some of our experience here.

Starting with this post, we will walk through some of the cq5 content structures. We'll skip the theoretical details and simply describe and explain those structures.

Today, we'll have a look at the cq5 tags, used as semi-structured metadata, mostly for content pages. In cq5, tags like stockphotography/animals/birds can be added to content pages. Tags belong to namespaces (stockphotography in our example), and can be arranged hierarchically within their namespace.

cq5 tags - user view

Looking at the tags from the cq5 site admin console, we see a simple tree of concepts, grouped in namespaces (Marketing) and categories (Interest). Each tag has a unique TagID, visible in the first column on the right, that will later be used to connect content with those tags.

Nothing surprising here, except maybe the fact that our tags live in a hierarchical space, as opposed to a flat one. This creates simple namespaces for our tags, allowing several "worlds" of tags to be combined without conflicts.

How are we going to store this in JCR? In cq5, the tags are stored as a tree of JCR nodes, with a structure similar to the one above, using the cq:Tag node type. The content model simply reflects the reality of the tags and their natural organization.

The cq:Tag node type

Here's the definition of the cq:Tag node type in CQ5.2:

The tag node is required to have a sling:resourceType property with a default value of tagging/tag. That property is used by the Sling rendering system to select the appropriate components to render the tags, in the cq5 site admin console for example.

The node can contain nt:base child nodes which have the cq:Tag type by default. The cq:Tag node can also contain any number of additional ("residual" in JCRspeak) properties, single or multi-valued.

The cq:Tag node type also uses the mix:title mixin, which defines two optional String properties, jcr:title and jcr:description. The jcr:title property is used to allow tags to be renamed without changing their identifier. The cq5 user interface displays the jcr:title value, which can change over time, but it's the path of the cq:Tag node that is used as the tag identifier.

There's no specific node type for tag namespaces: a cq:Tag node that doesn't have a cq:Tag parent is considered a namespace. In cq5, tag definitions are stored under /etc/tags, and that node is not a cq:Tag, so cq:Tag child nodes like /etc/tags/marketing define tag namespaces.

At Day we like to keep things open whenever possible: the cq:Tag node type is not designed to put strong constraints on the content, and that's in line with David's model rule #1:

Data First, Structure Later. Maybe

We haven't reached the maybe stage yet. The cq:Tag node type is clearly here to help, not to restrict what we can do.

Tags content model

Switching to the CRX Explorer, we notice that the tree structure under /etc/tags simply maps the namespace/category/tag structure of our tags. Nothing surprising again, and that's a good thing. Obvious content structures will help others understand what we're doing.

Looking at the properties of the /etc/tags/stockphotography/animals/baby_animals node, we see that the TagID property that's visible in the cq5 site admin console is not explicitly stored in the content - it is simply defined by the storage path of the tag node under /etc/tags, to avoid redundant information.

Don't you love the Principle of Least Surprise?
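To make the path-to-TagID idea concrete, here is a minimal Java sketch of how a TagID could be derived from a node path. The TagIds class and its fromPath method are hypothetical names of ours, not part of the cq5 API, and the namespace:category/tag form (including the bare "namespace:" form for a namespace node) is an assumption based on the tag values shown in this post.

```java
import java.util.Objects;

/** Hypothetical helper for illustration only; not part of the cq5 API. */
final class TagIds {

    /** Root under which tag definitions are stored, as described in this post. */
    static final String TAG_ROOT = "/etc/tags";

    private TagIds() {
    }

    /**
     * Derives a tag ID from the storage path of a tag node.
     * Assumption: the first path segment under /etc/tags is the namespace,
     * and the remaining segments form the local part after a colon.
     */
    static String fromPath(String path) {
        Objects.requireNonNull(path, "path");
        String prefix = TAG_ROOT + "/";
        if (!path.startsWith(prefix) || path.length() == prefix.length()) {
            throw new IllegalArgumentException("Not a tag path: " + path);
        }
        String relative = path.substring(prefix.length());
        int slash = relative.indexOf('/');
        if (slash < 0) {
            // A direct child of /etc/tags is a namespace (assumed ID form: "namespace:").
            return relative + ":";
        }
        String namespace = relative.substring(0, slash);
        String local = relative.substring(slash + 1);
        return namespace + ":" + local;
    }
}
```

With these assumptions, TagIds.fromPath("/etc/tags/stockphotography/animals/baby_animals") yields stockphotography:animals/baby_animals. Nothing has to be stored on the node for this to work, which is the point of deriving the identifier from the path.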

At this point you're probably thinking that all this is quite obvious - and you're right! The beauty of a JCR content repository is that you can in most cases store information without any structural transformations. Tags are items grouped in namespaces and categories, so a tree of namespace/category/tag nodes makes perfect sense, and is largely self-explaining.

Tagging content

To tag content, we simply add a multi-value cq:tags property to the _jcr_content nodes of cq5 pages, or to other pieces of content. A page might have:

cq:tags = [
    marketing:interest/business,
    marketing:interest/investor,
    marketing:interest/services
]

if it was tagged with the business, investor, and services tags of the interest category in the marketing namespace.

We don't use JCR references, but simply store paths in properties, as this gives us more flexibility when restructuring things. It's hard to say what will happen to those tags, and to the very concept of tagging, over the expected lifetime of our product, so we accept potentially dangling references (and cope with them at the application level) to gain content agility.
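Coping with dangling references at the application level can be as simple as skipping tag values that no longer resolve to a tag node. The sketch below is one way to do that, not how cq5 itself does it: DanglingTagFilter, toPath and resolvable are hypothetical names, and the set of existing paths stands in for a real repository lookup (for example a check with Session.itemExists).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the cq5 API. */
final class DanglingTagFilter {

    /** Root under which tag definitions are stored, as described in this post. */
    static final String TAG_ROOT = "/etc/tags";

    private DanglingTagFilter() {
    }

    /**
     * Maps a value like "marketing:interest/business" to
     * "/etc/tags/marketing/interest/business".
     * Returns null when the value has no namespace part.
     */
    static String toPath(String tagId) {
        int colon = tagId.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        String namespace = tagId.substring(0, colon);
        String local = tagId.substring(colon + 1);
        return local.isEmpty()
                ? TAG_ROOT + "/" + namespace
                : TAG_ROOT + "/" + namespace + "/" + local;
    }

    /**
     * Keeps only the tag values whose tag node still exists.
     * existingPaths is a stand-in for asking the repository whether a node exists.
     */
    static List<String> resolvable(List<String> cqTags, Set<String> existingPaths) {
        List<String> kept = new ArrayList<>();
        for (String tagId : cqTags) {
            String path = toPath(tagId);
            if (path != null && existingPaths.contains(path)) {
                kept.add(tagId);
            }
        }
        return kept;
    }
}
```

If the investor tag from the example above were deleted or moved, the page would still render with its business and services tags, and the stale value could be cleaned up later. That is the trade-off we accept in exchange for not being bound by hard JCR references.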

Coda

That's it for now! We hope to write more about our content models in the near future, to help our readers see how simple JCR content models can be - and should be.

As usual, feedback is very welcome - let us know if this information is useful to you!

 

COMMENTS

  • By Christian Sprecher - 2:38 PM on Apr 23, 2009
    Just out of curiosity: is this tag concept part of a broader taxonomy model within CQ5?

    And what is your (resp. Day's) stance reg. "auto-tagging"? Is it worth it?
  • By Bertrand Delacretaz - 7:20 AM on Apr 24, 2009
    @Christian, the tags *are* the taxonomy, we use a few (Sling/OSGi) tagging services to manage them and the tagged content.

    IMO, auto-tagging can be very useful...once that works! We are exploring options, especially in the context of the IKS project [1]. One could either try to extract all tags from content, or just suggest additional tags based on the tags that you select. The latter might give better results for now, until content extraction algorithms are good enough for arbitrary content. I *think* some of our customers already do auto-tagging or something similar, tuned to their types of content and vocabularies. The problem is much simpler with a controlled vocabulary than with unrestricted content, of course.

    [1] http://dev.day.com/microsling/content/blogs/main/iksfirst3.html
  • By Renaud Richardet - 7:34 PM on May 02, 2009
    Thanks Bertrand for the clear explanation. In terms of performance, how does it scale up when one queries for all documents with the tag "marketing:interest/services"?
  • By Bertrand Delacretaz - 8:29 AM on May 04, 2009
    Hi Renaud, as the path of the tag "marketing:interest/services" is stored as a property of each document that has that tag, the query is quite efficient and scalable.

    I'm not an expert in JCR query performance, but I would assume that looking for all nodes that have a property with a specific string value is fast. That's certainly the case in Jackrabbit which uses Lucene's inverted indexes.
  • By Bertrand Delacretaz - 11:49 AM on Jul 06, 2010
    In the meantime we have started work on autotagging as part of the IKS FISE project, see http://wiki.iks-project.eu/index.php/FISE