Adobe CRX is an extremely versatile content store that can handle a wide range of content types (structured and unstructured) and reliably store many millions of objects. In fact, the system's ultimate storage limits are set not by CRX itself but by the underlying persistence manager. You can choose from a number of persistence types (DB2, MySQL, Oracle, TarPM; see the CRX documentation), each with its own limitations.
In general, the default TarPM persistence manager gives better performance than most RDBMS alternatives for the typical CRX use cases (involving web content and user management). But in certain situations, with certain use cases, performance with TarPM can take a hit. The most common problem? Big Flat Lists.
Although read performance remains good, write performance can suffer when you need to store, say, thousands of sibling nodes under one parent node. This has to do with the fact that TarPM is an append-only store in which objects are immutable and never overwritten, only rewritten. What it means is that the cost of adding (or updating) node number N-thousand-plus-one can be quite high.
Of course, the answer is to divide and conquer: Break the nodes up into smaller groups, preferably hierarchical groups.
Suppose you have a large number of users whose data you want to store in CRX, and you'd like to store users by name. The naive way (we'll keep the example simple and assume no name collisions) would be to store Joe Smith under a node named /users/joe_smith, Lee Jones under /users/lee_jones, etc. But after a thousand names or so, performance will start to suffer noticeably as new entries are written to the repository. Far better performance will result if container nodes (buckets) are created for each letter of the alphabet and for each last name, so that you can add Joe Smith as /users/S/Smith/Joe, for example.
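The letter-and-last-name bucketing described above can be sketched in a few lines of plain ECMAScript. This is only an illustration: the function name bucketPathForName and the ROOT constant are made up for this example and are not part of any CRX or Jackrabbit API, and the sketch only builds the path string (creating the intermediate nodes is left out).

```javascript
// Hypothetical helper: build a bucketed path such as /users/S/Smith/Joe.
// ROOT and bucketPathForName are illustrative names, not a CRX API.
var ROOT = "/users";

function bucketPathForName(firstName, lastName) {
    if (!firstName || !lastName) {
        throw new Error("both first and last name are required");
    }
    // First bucket: upper-cased initial of the last name
    var initial = lastName.charAt(0).toUpperCase();
    // Second bucket: the last name itself; the leaf is the first name
    return [ROOT, initial, lastName, firstName].join("/");
}

// Usage:
// bucketPathForName("Joe", "Smith")  -> "/users/S/Smith/Joe"
// bucketPathForName("Lee", "Jones")  -> "/users/J/Jones/Lee"
```

Note that this scheme only spreads the load as evenly as the names themselves are spread: a very common last name still produces one crowded bucket.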
A more sophisticated approach would be to hash user IDs and chunk the hash to form an ad-hoc hierarchy. For example, "Joe Smith" might give a hash of ab12cd34. The user data for Joe Smith can then be stored at /users/ab/12/cd/34. When the time comes to look up data for Joe Smith, you would first hash the name (to obtain ab12cd34), then build the path from the hash, and look up the data.
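Here is a rough sketch of the hash-and-chunk idea. The choice of hash is an assumption made for this example: it uses 32-bit FNV-1a simply because it fits in a few lines and yields eight hex digits, like the ab12cd34 above. Any stable hash would do, and the names fnv1a32Hex, hashToPath and pathForUser are hypothetical, not part of CRX.

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units, as 8 hex digits.
// (Assumed hash for illustration; the scheme works with any stable hash.)
function fnv1a32Hex(text) {
    var hash = 0x811c9dc5; // FNV offset basis
    for (var i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        // Multiply by the FNV prime 16777619 using shifts and adds,
        // then truncate to an unsigned 32-bit value
        hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) +
                (hash << 8) + (hash << 24)) >>> 0;
    }
    var hex = hash.toString(16);
    while (hex.length < 8) {
        hex = "0" + hex; // left-pad so every hash gives four 2-char chunks
    }
    return hex;
}

// Split a hex string into 2-character path segments under root.
function hashToPath(root, hex) {
    var parts = [root];
    for (var i = 0; i < hex.length; i += 2) {
        parts.push(hex.substring(i, i + 2));
    }
    return parts.join("/");
}

// Hypothetical convenience wrapper: user ID in, storage path out.
function pathForUser(userId) {
    return hashToPath("/users", fnv1a32Hex(userId));
}

// Usage:
// hashToPath("/users", "ab12cd34")  -> "/users/ab/12/cd/34"
// pathForUser("Joe Smith")          -> "/users/xx/xx/xx/xx" (four hex pairs)
```

With two hex digits per level, each container node has at most 256 children, and lookups recompute the same path from the same ID, so no index is needed.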
As it turns out, the Jackrabbit API (which of course is built into CRX) offers yet another alternative for efficient hierarchical storage of arbitrary data, in the form of the BTreeManager. This class provides B+ tree-like behavior in allocating subtrees of nodes that are always balanced, with a fixed limit on how many siblings any given node can have. (You provide the minimum and maximum number of children as arguments to the constructor.)
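To get a feel for how shallow such a tree stays, here is a back-of-the-envelope calculation. It is not how BTreeManager works internally; it only assumes that no node may have more than maxChildren children, and asks how many levels are needed at the very least to hold a given number of items. The function name minLevels is made up for this example.

```javascript
// Smallest number of levels below the root such that a tree with at most
// maxChildren children per node can hold itemCount leaf items.
// This is a lower bound only: a real B-tree keeps nodes partly empty.
function minLevels(itemCount, maxChildren) {
    var levels = 0;
    var capacity = 1; // how many items fit with the current number of levels
    while (capacity < itemCount) {
        capacity *= maxChildren;
        levels++;
    }
    return levels;
}

// Usage, with a sibling limit of 40:
// minLevels(1000, 40)  -> 2   (40 * 40 = 1600 >= 1000)
// minLevels(5000, 40)  -> 3   (40 * 40 * 40 = 64000 >= 5000)
```

Integer multiplication is used instead of logarithms to avoid floating-point rounding at exact powers of the limit.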
I wrote a very short test script (in ECMAScript) to show how the BTreeManager operates:
<html>
<body>
<%
/* Create a new TreeManager instance rooted at the current node.
Splitting of nodes takes place
when the number of children of a node exceeds 40 and is done such that each new
parent node has >= 10 child nodes. Keys are ordered according to the natural
order of java.lang.String. */
var treeManager = new Packages.org.apache.jackrabbit.commons.flat.BTreeManager( this.currentNode, 10, 40, Packages.org.apache.jackrabbit.commons.flat.Rank.comparableComparator(), true);
// Create a new NodeSequence with that tree manager
var nodes = Packages.org.apache.jackrabbit.commons.flat.ItemSequence.createNodeSequence(treeManager);
var totalNodes = 100;
// Time the insertions (timestamps in milliseconds):
var start = new Date().getTime();
// Add totalNodes nodes through the sequence; the tree manager places each one
for (var i = 0; i < totalNodes; i++) {
nodes.addNode( "MyNode" + i,
Packages.javax.jcr.nodetype.NodeType.NT_UNSTRUCTURED);
}
var end = new Date().getTime();
%>
<%= "Total time: " + (end - start) + " millisecs" %>
</body>
</html>
I called this script tree.esp and placed it under /apps/tree in CRX, then created a dummy node under /content and gave the dummy node a sling:resourceType of "tree" (to trigger the script when navigating to /content/dummyNode.tree).
The performance benefits of BTreeManager are notable. On my (decrepit Dell) laptop, adding 100 nodes as a flat list took 1.6 seconds (which includes about 200 milliseconds for servlet compilation). Adding 1000 nodes as a flat list (no B-tree) took 22 seconds. Adding 5000 nodes took 289 seconds. Note that adding five times as many entries took about 13 times as long.
By contrast, using BTreeManager (set to a maximum sibling breadth of 40), adding 1000 nodes took 14 seconds and adding 5000 took 86 seconds. (Five times the data takes roughly six times as long, which is much closer to linear.)
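For anyone who wants to check the arithmetic, the slowdown factors can be computed directly from the timings quoted above. The growthExponent function is an extra added for illustration: it assumes time grows roughly as a power of the node count and estimates that power from the two measurements, so treat it as a rough indicator from just two data points, not a benchmark.

```javascript
// Timings in seconds, taken from the measurements quoted in this post
var timings = {
    flat:  { n1000: 22, n5000: 289 },
    btree: { n1000: 14, n5000: 86 }
};

// How many times longer 5000 nodes took than 1000 nodes
function slowdown(t) {
    return t.n5000 / t.n1000;
}

// Assuming time ~ n^k, estimate k from the two data points
// (k = 1 would be perfectly linear scaling)
function growthExponent(t) {
    return Math.log(slowdown(t)) / Math.log(5000 / 1000);
}

// Usage:
// slowdown(timings.flat).toFixed(1)   -> "13.1"
// slowdown(timings.btree).toFixed(1)  -> "6.1"
// growthExponent(timings.flat)        -> about 1.6
// growthExponent(timings.btree)       -> about 1.1
```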
The real lesson here is: If your content is hierarchical (or can be made to look hierarchical), by all means capitalize on that fact! Don't try to treat your content as a Big Flat List, especially if you'll be doing a lot of updates. (If you're doing mostly reads and few writes, on the other hand, it doesn't much matter.) Introducing a bit of hierarchy to your content organization scheme will go a long way toward promoting fast update performance.
(Many thanks to Felix Meschberger and Marcel Reutegger for input into this blog.)