CRX Clustering

CRX provides built-in clustering capability that enables multiple repository instances running on separate machines to be joined together to act as a single repository. The repository instances within a cluster are referred to as cluster nodes (not to be confused with the unrelated concept of the JCR node, which is the unit of storage within a repository).

Each node of the cluster is a complete, fully functioning CRX instance with a distinct address on the network. Any operation (read or write) can be addressed to any cluster node.

When a write operation updates the data in one cluster node, that change is immediately and automatically applied to all other nodes in the cluster. This guarantees that any subsequent read from any cluster node will return the same updated result.

How Clustering Works

A CRX cluster consists of one master instance and some number of slave instances. A single standalone CRX instance is simply a master instance with zero slave instances.


Typically, each instance runs on a distinct physical machine. The instances must all be able to communicate directly with each other over TCP/IP.

Each cluster instance retains an independent identity on the network, meaning that each can be written to and read from independently. However, whenever a write operation is received by a slave instance, it is redirected to the master instance, which makes the change to its own content while also guaranteeing that the same change is made to all slave instances, ensuring synchronization of content.

In contrast, when a read request is received by any cluster instance (master or slave) that instance serves the request immediately and directly. Since the content of each instance is synchronized by the clustering mechanism, the results will be consistent regardless of which particular instance is read from.

The usual way to take advantage of this is to have a load-balancing webserver (for example, the Dispatcher) mediate read requests and distribute them across the cluster instances, thus increasing read responsiveness.
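
The read and write paths described above can be illustrated with a small in-memory model. This is only a sketch: the class name ClusterNode and the dictionary standing in for repository content are hypothetical, and the real product synchronizes content through its journal rather than by direct assignment.

```python
class ClusterNode:
    """One repository instance; 'content' stands in for the repository."""

    def __init__(self, name, master=None):
        self.name = name
        self.master = master          # None means this node is the master
        self.slaves = []
        self.content = {}
        if master is not None:
            master.slaves.append(self)

    def write(self, path, value):
        if self.master is not None:
            # A slave does not apply the write itself; it redirects it.
            self.master.write(path, value)
            return
        # The master applies the change, then makes it on every slave.
        self.content[path] = value
        for slave in self.slaves:
            slave.content[path] = value

    def read(self, path):
        # Reads are served locally by whichever node received them.
        return self.content.get(path)


master = ClusterNode("master")
slave_a = ClusterNode("slave-a", master)
slave_b = ClusterNode("slave-b", master)

slave_a.write("/content/page", "v1")   # redirected to the master
print(master.read("/content/page"), slave_b.read("/content/page"))
```

A write sent to any slave ends up on every node, while a read never leaves the node that received it. This is why adding nodes helps reads but not writes.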

Architecture and Configuration

When setting up a clustered installation, there are two main deployment decisions to be made:

  • How the cluster fits into the larger architecture of the installation
  • How the cluster is configured internally

The cluster architecture concerns how the cluster is used within the installation. The two main options are:

  • Active/active clustering
  • Active/passive clustering

The cluster configuration defines which internal storage mechanisms are used and how these are shared or synchronized across cluster nodes. The three most commonly used configurations are:

  • Shared nothing
  • Shared data store
  • Shared external database

First we will look at cluster architecture. Cluster configuration will be covered later on.

Clustering Architecture

Clustering architectures usually fall into one of two categories: active/active or active/passive.

Active/Active

In an active/active setup the processing load is divided equally among all nodes in the cluster using a load balancer such as the CQ Dispatcher. In normal circumstances, all nodes in the cluster are active and running at the same time. When a node fails, the load balancer redirects the requests from that node across the remaining nodes.

Active/Passive

In an active/passive setup, a front-end server passes all requests back to only one of the cluster nodes (called the primary), which then serves all of these requests itself. A secondary node (there are usually just two) runs on standby. Because it is part of the cluster, the secondary node remains in sync with the primary; it simply does not serve any requests itself. However, if the primary node fails, the front-end server detects this and a "failover" occurs, in which the server redirects all requests to the secondary node instead.
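
The difference between the two architectures comes down to how the front end picks a node for each request. The following sketch shows both policies side by side. The class names and the "set of nodes currently up" argument are hypothetical; a real load balancer or web server detects failures through its own health checks.

```python
class ActiveActive:
    """Round-robin over all nodes, skipping any that are down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.next_index = 0

    def pick(self, up):
        # Try each node at most once, starting where the last pick left off.
        for _ in range(len(self.nodes)):
            node = self.nodes[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.nodes)
            if node in up:
                return node
        raise RuntimeError("no cluster node is available")


class ActivePassive:
    """Always use the primary; use the secondary only on failover."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def pick(self, up):
        if self.primary in up:
            return self.primary
        if self.secondary in up:
            return self.secondary
        raise RuntimeError("no cluster node is available")


balancer = ActiveActive(["n1", "n2", "n3"])
print([balancer.pick({"n1", "n2", "n3"}) for _ in range(4)])

failover = ActivePassive("n1", "n2")
print(failover.pick({"n1", "n2"}), failover.pick({"n2"}))
```

In the active/active case a failed node is simply skipped, so its share of requests falls to the remaining nodes. In the active/passive case the secondary receives no traffic at all until the primary is down.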

CRX Clustering and CQ

Because CQ is built on top of the CRX repository, CRX clustering can be employed in CQ installations to improve performance.

To understand the various ways that CRX clustering can fit into the larger CQ architecture we will first take a look at two common, non-clustered CQ architectures: single publish and multiple publish.

Single Publish

A CQ installation consists of two separate environments: author and publish.

The author environment is used for adding, deleting and editing pages of the website. When a page is ready to go live, it is activated, causing the system to replicate the page to the publish environment from where it is served to the viewing audience.

In the simplest case, the author environment consists of a single CQ instance running on a single server machine and the publish environment consists of another single CQ instance running on another machine. In addition, a Dispatcher is usually installed between the publish server and the web for caching.

Multiple Publish

A common variation on the installation described above is to install multiple publish instances (usually on separate machines). When a change made on the author instance is ready to go live, it is replicated simultaneously to all the publish instances.

As long as all activations and deactivations of pages are performed identically on all publish instances, the content of the publish instances will stay in sync. Depending on the configuration of the front-end server, requests from web surfers are dealt with in one of two ways:

  • The incoming requests are distributed among the publish instances by the load-balancing feature of the Dispatcher.
  • The incoming requests are all forwarded to a primary publish instance until it fails, at which point all requests are forwarded to the secondary publish instance (in this arrangement there are usually only two instances).

Note

The dispatcher is an additional piece of software provided by Adobe in conjunction with CQ that can be installed as a module within any of the major web servers (Microsoft IIS, Apache, etc.). Load balancing is a feature of the dispatcher itself, and its configuration is described in the dispatcher documentation.

The configuration of failover, on the other hand, is typically a feature of the web server within which the dispatcher is installed. For information on configuring failover, therefore, please consult the documentation specific to your web server.

Is Multiple Publish a Form of Clustering?

Note

The architecture described above allows two possible configurations for a multiple publish system: load balancing and failover.

The first setup is sometimes described as a form of active/active clustering and the second as a form of active/passive clustering. However, these architectures are not considered true CRX clustering because:

  • This solution is specific to the CQ system since the concept of activation and replication of pages is itself specific to CQ. So this is not a generic solution for all CRX applications.
  • Synchronization of the publish instances is dependent on an external process (the replication process configured in the author instance), so the publish instances are not in fact acting as a single system.
  • If content is written to a publish instance from the external web (as is the case with user-generated content such as forum comments), CQ uses reverse replication to copy the content to the author instance, from where it is replicated back to all the publish instances. While this system is sufficient in many cases, it lacks the robustness of content synchronization under true clustering.

For these reasons the multiple publish arrangement is not a true clustering solution as the term is usually employed.

Publish Clustering

There are a number of options for using true CRX clustering within a CQ installation. The first one we will look at is publish clustering.

In the publish clustering arrangement, the publish instances of the CQ installation are combined into a single cluster. The front-end server (including the dispatcher) is then configured either for active/active behavior (where load is equally distributed across cluster nodes) or active/passive behavior (where load is only redirected on failover).

The following diagram illustrates a publish cluster arrangement:

[Diagram: publish cluster arrangement]

Note

Hot Backup

A variation on publish clustering with failover is to have the secondary publish server completely disconnected from the web, functioning instead as a continually updated and synchronized backup server. If at any time the backup server needs to be brought online, this can be done manually by reconfiguring the front-end server.

Author Clustering

Clustering can also be employed to improve the performance of the CQ author environment. An arrangement using both publish and author clustering is shown below:

[Diagram: combined publish and author clustering]

Other Variations

It is also possible to set up other variations on the author clustering theme by pairing the author cluster with either multiple (non-clustered) publish instances or with a single publish instance. However, such variants are rarely used in practice.

Clustering and Performance

When faced with a performance problem in a single-instance CQ system (at either the publish or the author level), clustering may provide a solution. However, the extent of the improvement, if any, depends upon where the performance bottleneck is located.

Under CQ/CRX clustering, read performance scales linearly with the number of nodes in the cluster. However, additional cluster nodes will not increase write performance, since all writes are serialized through the master node.

Clearly, the increase in read performance will benefit the publish environment, since it is primarily a read system. Perhaps surprisingly, clustering can also benefit the author environment because even in the author environment the vast majority of interactions with the repository are reads. In the usual case 97% of repository requests in an author environment are reads, while only 3% are writes.

This means that despite the fact that CQ clustering does not scale writes, clustering can still be a very effective way of improving performance both at the publish and author level.

However, while increasing the number of cluster nodes will increase read performance, a bottleneck will still be reached when the frequency of requests gets to the point where the 3% of requests that are writes overwhelm the capabilities of the single master node.

In addition, while the percentage of writes under normal authoring conditions is about 3%, this can rise in situations where the authoring system handles a large number of more complex processes like workflows and multisite management.

In cases where a write bottleneck is reached, additional cluster nodes will not improve performance. In such circumstances the correct strategy is to increase the hardware performance through improved CPU speed and increased memory.
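
The ceiling imposed by the master can be estimated with a simple back-of-the-envelope model. The function below is a sketch under stated assumptions, not a measurement: it assumes reads are spread evenly over all nodes, every write is handled by the master, reads and writes cost the same, and each node can handle node_capacity requests per second. The function name and parameters are hypothetical.

```python
def max_request_rate(nodes, node_capacity, write_fraction=0.03):
    """Highest total request rate (requests/s) before the master saturates.

    Assumed model: reads are spread evenly over all nodes, all writes go to
    the master, and a read costs the same as a write.
    """
    if nodes < 1:
        raise ValueError("a cluster has at least one node")
    read_fraction = 1.0 - write_fraction
    # Fraction of total traffic the master handles: its share of the reads
    # plus every write.
    master_share = read_fraction / nodes + write_fraction
    return node_capacity / master_share


for n in (1, 2, 4, 8, 16, 64):
    print(n, round(max_request_rate(n, node_capacity=1000)))
```

Under this model the sustainable rate grows with each added node but can never exceed node_capacity / write_fraction, however many nodes are added. Raising the write fraction (for example, under heavy workflow use) lowers that ceiling, and only a faster master raises it.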

CRX Storage Overview

Clustering in CRX can be configured in a number of ways, depending on the implementation chosen for each element of the storage system. To understand the options available, a brief overview of CRX storage may therefore be helpful.

Storage Elements

Storage in CRX is made up of the following major elements:

  • Persistence store
  • Data store
  • Journal
  • Version storage
  • Other file-based storage

Persistence Store

The repository's primary content store holds the hierarchy of JCR nodes and properties. This storage is handled by a persistence manager (PM) and is therefore commonly called the persistence store (despite the fact that all the storage elements are in fact persistent). It is also sometimes referred to as the workspace store because it is configured per workspace (see below).

While a number of different PMs are available, each using a different underlying storage mechanism, the default PM is the Tar PM, a high-performance database built into CRX and designed specifically for JCR content. It derives its name from the fact that it stores its data in the form of standard Unix-style tar files in the file system of the server on which you are running CRX. Other PMs, such as the MySQL PM and the Oracle PM, store data in conventional relational databases, which must be installed and configured separately.

All non-binary property values, and all binary property values under a certain (configurable) size, are stored directly by the PM in the content hierarchy, in the manner specific to that PM. However, binary property values above the threshold size are redirected to the data store (DS, see below), and the PM stores only a reference to the value in the data store.

Depending on the cluster configuration, each cluster node may have its own separate PM storage (the shared nothing and shared data store configurations) or it may share the same PM storage as the other cluster nodes (the shared external database arrangement). In the case where each cluster node has its own separate PM storage, these are kept synchronized across instances through the journal (see below). For more details on persistence managers, see Persistence Managers.

Note

The CRX repository actually supports multiple content hierarchies, which are called workspaces in the terminology of JCR. In theory, different workspaces can be configured with different PMs. However, since multiple workspaces are rarely used in practice and do not have a role in the architecture of CQ, this will not typically be an issue.

Data Store

The data store (DS) holds binary property values over a given, configurable, size. On write, these values are streamed directly to the DS and only a reference to the value is written by the PM to the persistence store. By providing this level of indirection, the DS ensures that large binaries are only stored once, even if they appear in multiple locations within a workspace. In a clustered environment the data store can be either per repository instance (per cluster node) or shared among cluster nodes in a commonly accessible network file system directory. For more details on the data store, see Data Store.
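
The split between inline values and data store references can be sketched as follows. This is an illustration only: the class SketchRepository is hypothetical, and keying the data store by a SHA-1 digest of the content is an assumption made here to show how "stored only once" can be achieved, not a statement about the product's internals.

```python
import hashlib

MIN_RECORD_LENGTH = 4096  # bytes; the default threshold used in this document


class SketchRepository:
    def __init__(self, min_record_length=MIN_RECORD_LENGTH):
        self.min_record_length = min_record_length
        self.persistence_store = {}   # path -> inline bytes or ("ref", key)
        self.data_store = {}          # key -> bytes, one entry per distinct value

    def set_binary(self, path, value):
        if len(value) < self.min_record_length:
            # Small values stay inline in the persistence store.
            self.persistence_store[path] = value
            return
        # Large values go to the data store; identical content maps to the
        # same key, so it is stored only once.
        key = hashlib.sha1(value).hexdigest()
        self.data_store.setdefault(key, value)
        self.persistence_store[path] = ("ref", key)

    def get_binary(self, path):
        stored = self.persistence_store[path]
        if isinstance(stored, tuple):
            return self.data_store[stored[1]]
        return stored


repo = SketchRepository()
repo.set_binary("/a/thumbnail", b"x" * 100)
repo.set_binary("/a/video", b"y" * 5000)
repo.set_binary("/b/video-copy", b"y" * 5000)
print(len(repo.data_store))
```

Two paths holding the same large binary share a single data store entry, while the small value never reaches the data store at all.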

Journal

Whenever the repository writes data it first records the intended change in the journal. Maintaining the journal helps ensure data consistency and helps the system to recover quickly from crashes. In a clustered environment the journal plays the critical role of synchronizing content across cluster instances.
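
The synchronizing role of the journal can be pictured as an append-only log that each node replays from the last revision it has applied. The sketch below is a simplified model with hypothetical class names; it uses a single shared journal object, whereas in a shared-nothing cluster each node keeps its own journal copy that is synchronized over the network.

```python
class Journal:
    """Append-only list of change records."""

    def __init__(self):
        self.records = []

    def append(self, node_id, path, value):
        self.records.append((node_id, path, value))
        return len(self.records)      # revision number after this record

    def since(self, revision):
        return self.records[revision:]


class Node:
    def __init__(self, node_id, journal):
        self.node_id = node_id
        self.journal = journal
        self.revision = 0             # last journal revision applied locally
        self.content = {}

    def write(self, path, value):
        self.sync()                   # catch up before adding a new record
        # Record the intended change in the journal first, then apply it.
        self.revision = self.journal.append(self.node_id, path, value)
        self.content[path] = value

    def sync(self):
        # Replay every record added since this node last synchronized.
        for _node_id, path, value in self.journal.since(self.revision):
            self.content[path] = value
        self.revision = len(self.journal.records)


journal = Journal()
node_a = Node("a", journal)
node_b = Node("b", journal)
node_a.write("/x", "1")
node_b.sync()
print(node_b.content)
```

Because each node tracks its own revision, a node that was briefly unreachable only has to replay the records it missed.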

Version Storage

The version storage is held in a single, per-repository, hidden workspace only accessible through the versioning API. Though the version storage is reflected into each normal workspace under /jcr:system/jcr:versionStorage, the actual version information resides in the hidden workspace.

Since the version information is stored in the same way as content in a normal workspace, any changes to it in one cluster instance are propagated to the other instances in the same way as changes in a normal workspace. 

The PM used for version storage is configured separately from those used for normal workspaces. By default the Tar PM is used.

Other File-Based Storage

In addition to the above elements, CRX also stores some repository state information in plain files in the file-system. This includes the search indices, namespace registry, node type registry and access control settings. In a cluster, this information is automatically synchronized across cluster nodes.

Cluster Configuration

As mentioned above, there are three commonly used cluster configurations, which differ according to how the various storage elements are configured. These are:

Shared Nothing

This is the default configuration. In this configuration all elements of CRX storage are held per cluster node and synchronized over the network. No shared storage is used. The Tar PM is used for the persistence storage, the Tar Journal is used for the journal and the Cluster Data Store is used for the data store.

Shared Data Store

In this configuration the workspace stores and the journal are maintained per-cluster node as above, using the Tar PM and Tar Journal, but the data store is held in a shared directory accessible to all cluster nodes and uses the File Data Store.

Shared External Database

In this configuration all cluster nodes share a common persistence store, journal and data store, which are all held in a single external RDBMS.

In addition, the external database storage is sometimes used to store the version storage as well (again, a single store shared across all cluster nodes).

When configuring CRX to use an external database, typically, a single backend database system is used to store all the elements (Workspaces, Data Store, Journal and Version Storage) for all instances in the cluster. Note that nothing prevents this backend database system from itself being a clustered system.

repository.xml

This section describes the details of cluster configuration for the three common patterns mentioned above and also serves as a general guide to other possible variations on clustering.

The configuration parameters for clustering are found primarily in the file crx-quickstart/repository/repository.xml. The sections of that file relevant to cluster configuration are described below.

Data Store Configuration

The element DataStore holds the parameters that govern the data store which holds large binary objects. By default the data store is configured to use shared-nothing clustering, meaning that each instance maintains its own copy of the store and they are all kept in sync. The default configuration looks like this:

<DataStore class="com.day.crx.core.data.ClusterDataStore">
<param name="minRecordLength" value="4096"/>
</DataStore>
  • minRecordLength: The minimum size for an object to be stored in the data store rather than inline within the regular PM. The default is 4096 bytes. The maximum supported value is 32000.

It is possible to maintain shared-nothing clustering for the workspace stores and journal while using a shared directory for the data store. This may be done to reduce disk space requirements, since all cluster nodes then share the same data store. To use a shared data store, change the data store configuration in the repository.xml file as follows:

<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
<param name="minRecordLength" value="4096"/>
<param name="path" value="${rep.home}/shared/datastore"/>
</DataStore>
  • path: The path where the data store files are stored. All cluster nodes must point to the same physical directory.
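
Before restarting an instance with an edited data store configuration, it can be useful to check the values mechanically. The following sketch reads a DataStore element with Python's standard XML parser and checks minRecordLength against the limit given above. The function names are hypothetical, and the check covers only what this section states.

```python
import xml.etree.ElementTree as ET

MAX_MIN_RECORD_LENGTH = 32000  # maximum supported value stated above


def read_data_store(xml_text):
    """Return (class name, params dict) for the first DataStore element."""
    root = ET.fromstring(xml_text)
    element = root if root.tag == "DataStore" else root.find(".//DataStore")
    if element is None:
        raise ValueError("no DataStore element found")
    params = {p.get("name"): p.get("value") for p in element.findall("param")}
    return element.get("class"), params


def check_min_record_length(params):
    """Return the threshold in bytes, using 4096 when the param is absent."""
    value = int(params.get("minRecordLength", "4096"))
    if not 0 < value <= MAX_MIN_RECORD_LENGTH:
        raise ValueError(f"minRecordLength out of range: {value}")
    return value


SAMPLE = """
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
  <param name="minRecordLength" value="4096"/>
  <param name="path" value="${rep.home}/shared/datastore"/>
</DataStore>
"""

store_class, store_params = read_data_store(SAMPLE)
print(store_class, check_min_record_length(store_params), store_params.get("path"))
```

For a shared data store, the same check could be run on every node's repository.xml to confirm that the path values all resolve to the same physical directory.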

Journal Configuration

The element <Cluster> holds the parameters that govern the journal.

<Cluster syncDelay="2000">
<Journal class="com.day.crx.persistence.tar.TarJournal">
<param name="bindAddress" value=""/>
</Journal>
</Cluster>
  • bindAddress: Used if the synchronization between cluster nodes should be done over a specific network interface. Default: empty (meaning all network interfaces are used).
  • syncDelay: Events that were issued by other cluster nodes are processed after at most this many milliseconds. Optional; the default is 5000 (5 seconds).
  • maxFileSize: The maximum file size per journal tar file. If the current data file grows larger than this number (in megabytes), a new data file is created (if the last entry in a file is very big, a data file can actually be bigger, as entries are not split among files). The maximum file size is 1024 (1 GB). Data files are kept open at runtime. The default is 256 (256 MB).
  • maximumAge: Age specified as duration in ISO 8601 or plain format. Journal files that are older than the configured age are automatically deleted. The default is "P1M", which means files older than one month are deleted.
  • portList: The list of listener ports to use by this cluster node. When using a firewall, the open ports must be listed. A list of ports or ranges is supported, for example: 9100-9110 or 9100-9110,9210-9220. By default, the following port list is used: 8088-8093.
  • preferredMaster: Flag indicating whether this cluster node should be a preferred master. If this flag is set, the node will become the master upon startup. Default: false.
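
When opening firewall ports it helps to expand a portList value into the individual port numbers it covers. The helper below is a small sketch of that expansion for the list-and-range syntax shown above; the function name is hypothetical and the validation rules (ports between 1 and 65535, ranges in ascending order) are assumptions.

```python
def parse_port_list(port_list):
    """Expand a portList value such as "9100-9110,9210-9220" into port numbers."""
    ports = []
    for part in port_list.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start   # a single port is a range of one
        if not (0 < start <= end <= 65535):
            raise ValueError(f"invalid port range: {part!r}")
        ports.extend(range(start, end + 1))
    return ports


print(parse_port_list("8088-8093"))
print(len(parse_port_list("9100-9110,9210-9220")))
```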

Workspace Store Configuration

The shared nothing and shared data store configurations use the Tar PM for workspace storage. Changing this PM is also possible, but since that is not specific to clustering (single-instance installations may also change workspace PMs), it is covered in the Persistence Managers section rather than here.

In the shared external database configuration, by contrast, the data store, the workspace persistence manager and the journal all point to the same external database. The following example uses MySQL:

<Repository>
...
<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
<param name="url" value="jdbc:mysql://192.168.2.34:3306/crx1"/>
<param name="user" value="crx"/>
<param name="password" value="crxpassword"/>
<param name="minRecordLength" value="4096"/>
<param name="maxConnections" value="30"/>
</DataStore>
...
<Workspace name="${wsp.name}" simpleLocking="true">
...
<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
<param name="url" value="jdbc:mysql://192.168.2.34:3306/crx1"/>
<param name="user" value="crx"/>
<param name="password" value="crxpassword"/>
<param name="schemaObjectPrefix" value="${wsp.name}"/>
</PersistenceManager>
...
</Workspace>
...
<Cluster>
<Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
<param name="revision" value="${rep.home}/revision.log" />
<param name="driver" value="com.mysql.jdbc.Driver"/>
<param name="url" value="jdbc:mysql://192.168.2.34:3306/crx1" />
<param name="user" value="crx"/>
<param name="password" value="crxpassword"/>
<param name="databaseType" value="mysql"/>
</Journal>
</Cluster>
...
</Repository>

The version storage can be held in the same database by configuring the persistence manager of the Versioning element in the same way:

<Repository>
...
<Versioning rootPath="${rep.home}/version">
...
<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
<param name="url" value="jdbc:mysql://192.168.2.34:3306/crx1"/>
<param name="user" value="crx"/>
<param name="password" value="crxpassword"/>
<param name="schemaObjectPrefix" value="version_"/>
</PersistenceManager>
...
</Versioning>
...
</Repository>

Cluster Properties and Node ID

The file crx-quickstart/repository/cluster_node.id contains the cluster node ID, which must be unique for each cluster node. This file is created automatically by the system. By default it contains a randomly generated UUID, but it can be any name. When copying a cluster node, this file should not be copied unchanged: if two cluster nodes contain the same cluster node ID, only the first cluster node can connect.

Example file:

08d434b1-5eaf-4b1c-b32f-e9abedf05f23

The file crx-quickstart/repository/cluster.properties contains cluster configuration properties. The file is automatically updated by the system if the cluster configuration is changed in the GUI.

Example file:

cluster_id=86cab8df-3aeb-4985-8eb5-dcc1dffb8e10

addresses=10.0.2.2,10.0.2.3

members=08d434b1-5eaf-4b1c-b32f-e9abedf05f23,fd11448b-a78d-4ad1-b1ae-ec967847ce94

The cluster_id property contains the cluster ID, which must be the same for all cluster nodes that participate in this cluster. By default this is a randomly generated UUID, but it can be any name.

The addresses property contains a comma separated list of the IP addresses of all nodes in this cluster. This list is used at the startup of each cluster node to connect to the other nodes that are already running. The list is not needed if all cluster nodes are running on the same computer (which may be the case in certain circumstances, such as testing).

The members property contains a comma-separated list of the cluster node IDs that participate in the cluster. This property is not required for the cluster to work; it is for informational purposes only.
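
For scripts that need to inspect a node's cluster membership, the file can be read with a few lines of code. The parser below is a simplified sketch that handles only plain key=value lines as in the example file; it does not implement the full Java properties syntax, and the function name is hypothetical.

```python
def parse_cluster_properties(text):
    """Parse cluster.properties text; addresses and members become lists."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    for key in ("addresses", "members"):
        if key in props:
            props[key] = [v.strip() for v in props[key].split(",") if v.strip()]
    return props


EXAMPLE = """cluster_id=86cab8df-3aeb-4985-8eb5-dcc1dffb8e10

addresses=10.0.2.2,10.0.2.3

members=08d434b1-5eaf-4b1c-b32f-e9abedf05f23,fd11448b-a78d-4ad1-b1ae-ec967847ce94
"""

properties = parse_cluster_properties(EXAMPLE)
print(properties["cluster_id"], properties["addresses"])
```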

System Properties

The following system properties affect the cluster behavior:

socket.connectTimeout: The maximum number of milliseconds to wait for the master to respond (default: 1000). A timeout of zero means infinite (block until connection established or an error occurs).

socket.receiveTimeout: The maximum number of milliseconds to wait for a reply from the master or slave (SO_TIMEOUT; default: 60000). A timeout of zero means infinite.

com.day.crx.core.cluster.DisableReverseHostLookup: Disable the reverse lookup from the master to the slave when connecting (default: false). If not set, the master checks if the slave is reachable using InetAddress.isReachable(connectTimeout).

Clustering Setup

In this section we describe the process of setting up a CRX cluster.

Clustering Requirements

The following requirements should be observed when setting up clustering:

  • Each cluster node (CRX instance) should be on its own dedicated machine. During development and testing one can install multiple instances on a single machine, but for a production environment this is not recommended.
  • For shared-nothing clustering, the primary infrastructure requirement is that the network connecting the cluster nodes have high reliability, high availability and low latency.
  • For shared data store and shared external database clustering, the shared storage (be it file-based or RDBMS-based) should be hosted on a high-reliability, high-availability storage system. For file storage the recommended technologies include enterprise-class SAN systems, NFS servers and CIFS servers. For database storage a high-availability setup running either Oracle or MySQL is recommended. Note that since, in either case, the data is ultimately stored on a shared system, the reliability and availability of the cluster in general depends greatly on the reliability and availability of this shared system.

GUI Setup of Shared Nothing Clustering

By default a freshly installed CRX instance runs as a single-master, zero-slave, shared-nothing cluster. Additional slave nodes can be added to the master easily through the cluster configuration GUI.

If you wish to deploy a shared data store or shared external database cluster, or if you wish to tweak the settings of the default shared-nothing cluster, you will have to perform a manual configuration. In this section we describe the GUI deployment of a shared-nothing cluster. Manual configuration is covered in the next section.

  1. Install two or more CRX instances. In a production environment, each would be installed on a dedicated server. For development and testing, you may install multiple instances on the same machine.

  2. Ensure that all the instance machines are networked together and visible to each other over TCP/IP.

    Caution

    Cluster instances communicate with each other through port 8088. Consequently this port must be open on all servers within a cluster. If one or more of the servers is behind a firewall, that firewall must be configured to open port 8088.

    If you need to use another port (for example, because of the firewall setup), use the Manual Cluster Setup approach. You can configure cluster communication to use another port number, in which case that port must be reachable through any firewall that may be in place.

    To configure the cluster communications port, change the portList parameter in the <Journal> element of repository.xml, as described under Journal Configuration.

  3. Decide which instance will be the master instance. Note the host name and port of this instance. For example, if you are running the master instance on your local machine, its address might be localhost:4502.

  4. Every instance other than the master will be a slave instance.

    You will need to connect each slave instance to the master by going to their respective cluster configuration pages here:

    http://<slave-address>/libs/granite/cluster/content/admin.html

  5. In the Cluster Configuration page enter the address of the master instance in the field marked Master URL, as follows:

        http://<master-address>/

    For example, if both your slave and master are on your local machine you might enter

        http://localhost:4502/

    Once you have filled in the Master URL, enter your Username and Password on the master instance and click Join. You must have administrator access to set up a cluster.

  6. Joining the cluster may take a few minutes.

    Allow some time before refreshing the master and slave UIs in the browser. Once the slave is properly connected to the master you should see something similar to the following on the master and slave cluster UIs:

    [Screenshot: master and slave cluster UIs after a successful join]

    Note

    In some cases, a restart of the slave instance might be required to avoid stale sessions.
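
Before attempting the join, the network requirement in step 2 can be checked from each machine with a short script. The sketch below simply tries to open a TCP connection to the cluster communication port of another instance. The function name is hypothetical, and a successful connection only shows that the port is reachable, not that a cluster listener is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call, to be run from a slave machine against the master's address
# (the address here is taken from the sample cluster.properties file):
#     port_is_open("10.0.2.2", 8088)
```

If the check fails while the master is running, the most likely causes are a firewall between the machines or a portList setting that differs from the default.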

Note

When configuring file-based persistence managers such as the Tar PM (as opposed to database-based PMs), the file system location specified (for example, the location where the Tar PM is configured to store its tar files) should be true local storage, not network storage. The use of network storage in such cases will degrade performance.

Manual Cluster Setup

In some cases a user may wish to set up a cluster without using the GUI. There are two ways to do this: manual slave join and manual slave clone.

The first method, manual slave join, is the same as the standard GUI procedure except that it is done without the GUI. Using this method, when a slave is added, the content of the master is copied over to it and a new search index on the slave is built from scratch. In cases where a pre-existing instance with a large amount of content is being "clusterized", this process can take a long time.

In such cases it is recommended to use the second method, manual slave clone. In this method the master instance is copied to a new location either at the file system level (i.e., the crx-quickstart directory is copied over) or using the online backup feature and the new instance is then adjusted to play the role of slave. This avoids the rebuilding of the index and for large repositories can save a lot of time.

Manual Slave Join

The following steps are similar to joining a cluster node using the GUI. That means the data is copied over the network, and the search index is re-created (which may take some time):

Master

  • Copy the files crx-quickstart-2.2.*.jar and license.properties to the desired directory.
  • Start the instance:
     java -Xmx512m -jar *.jar
  • If a shared data store should be used: stop the instance, change crx-quickstart/repository/repository.xml, move the datastore directory to the required place, start the instance, and verify it still works.

Slave

  • Copy the files crx-quickstart-2.2.*.jar and license.properties to the desired directory (usually on a different machine from the master, unless you are just testing).
  • Unpack the JAR file:
    java -Xmx512m -jar crx-quickstart-2.2.*.jar -unpack
  • Copy the files repository.xml and cluster.properties from the master:
    cp ../n1/crx-quickstart/repository/repository.xml crx-quickstart/repository/
    cp ../n1/crx-quickstart/repository/cluster.properties crx-quickstart/repository/
  • Copy the namespaces and node types from the master:
    cp -r ../n1/crx-quickstart/repository/repository/namespaces/ crx-quickstart/repository/repository/
    cp -r ../n1/crx-quickstart/repository/repository/nodetypes/ crx-quickstart/repository/repository/
  • If this new slave is on a different machine from the master, append the IP address of the master to the cluster.properties file of the slave:
    echo "addresses=x.x.x.x" >> crx-quickstart/repository/cluster.properties
    where x.x.x.x is replaced by the correct address. At the master, the IP address of the slave should be added to the cluster.properties file as well.
  • Start the instance:
    java -Xmx512m -jar crx-quickstart-2.2.*.jar
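
Appending with echo adds a new addresses line even when the cluster.properties file copied from the master already contains one. How a duplicate key is treated is not stated here, so a cautious alternative is to merge the new address into the existing line. The helper below is a sketch of that approach; the function name is hypothetical and it assumes the plain key=value layout shown in the example file.

```python
def add_address(properties_text, address):
    """Return properties text with address merged into the addresses line."""
    lines = properties_text.splitlines()
    for index, line in enumerate(lines):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "addresses":
            current = [a.strip() for a in value.split(",") if a.strip()]
            if address not in current:
                current.append(address)
            lines[index] = "addresses=" + ",".join(current)
            break
    else:
        # No addresses line yet: add one at the end.
        lines.append("addresses=" + address)
    return "\n".join(lines) + "\n"


print(add_address("cluster_id=abc\naddresses=10.0.2.2\n", "10.0.2.3"))
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back to crx-quickstart/repository/cluster.properties.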

Manual Slave Cloning

The following steps clone the master instance and change that clone into a slave, preserving the existing search index:

Master

Your existing repository will be the master instance.

If it is feasible to stop the master instance:

  • Stop the master instance either through the GUI switch or the command line stop script.
  • Copy the crx-quickstart directory of the master over to the location where you want the slave installed, using a normal filesystem copy (cp, for example).
  • Restart the master.

If it is not feasible to stop the master instance:

  • Do an online backup of the instance to the new slave location. The online backup tool can be made to write the copy directly into another directory or to a zip file which you can then unpack in the new location. See the online backup documentation for details. The process can be automated using curl or wget. For example:

    curl -c login.txt "http://localhost:7402/crx/login.jsp?UserId=admin&Password=xyz&Workspace=crx.default"

    curl -b login.txt -f -o progress.txt "http://localhost:7402/crx/config/backup.jsp?action=add&zipFileName=&targetDir=<targetDir>"

Slave

In the new slave instance directory:

  • Modify the file crx-quickstart/repository/cluster_node.id so that it contains a unique cluster node ID. This ID must differ from the IDs of all other nodes in the cluster.
  • Add the node ID of the master instance and all other slave nodes (apart from this one) separated by commas to the file crx-quickstart/repository/cluster.properties. For example, you could use something like the following command (with the capitalized items replaced with the actual IDs used):

    echo "members=MASTER_NODE_ID,SLAVE_NODE_1_ID" >> crx-quickstart/repository/cluster.properties

  • Add the master instance IP address and the IP addresses of all other slave instances (apart from this one), separated by commas, to the file crx-quickstart/repository/cluster.properties. For example, you could use something like the following command (with the capitalized items replaced with the actual IP addresses used):

    echo "addresses=MASTER_IP_ADDRESS,SLAVE_1_IP_ADDRESS" >> crx-quickstart/repository/cluster.properties

  • Start the slave instance. It will join the cluster without re-indexing. Note: Once the slave is started, the master cluster.properties file will automatically be updated by appending the node ID and IP address of the slave.
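
The three configuration steps above can be combined into one small script. This is a sketch under stated assumptions, not a CRX tool: the function name configure_cloned_slave and its arguments are hypothetical, the node ID is supplied by the caller because the required ID format is not described here, and the file is written with a trailing newline, which may or may not match what CRX itself writes.

```shell
#!/bin/sh
# Hypothetical helper, not part of CRX.
# Usage: configure_cloned_slave SLAVE_DIR NODE_ID MEMBERS ADDRESSES
#   SLAVE_DIR  directory that contains crx-quickstart/
#   NODE_ID    unique ID for this new slave (chosen by the caller)
#   MEMBERS    comma-separated node IDs of all other cluster nodes
#   ADDRESSES  comma-separated IP addresses of all other cluster nodes
configure_cloned_slave() {
    repo="$1/crx-quickstart/repository"
    node_id="$2"
    members="$3"
    addresses="$4"

    # The new ID must differ from the IDs of all other nodes.
    case ",$members," in
        *",$node_id,"*)
            echo "node ID $node_id is already used by another cluster node" >&2
            return 1
            ;;
    esac

    # Overwrite the ID inherited from the cloned master.
    printf '%s\n' "$node_id" > "$repo/cluster_node.id"

    # Append the other nodes' IDs and addresses, as in the echo commands above.
    printf 'members=%s\n' "$members" >> "$repo/cluster.properties"
    printf 'addresses=%s\n' "$addresses" >> "$repo/cluster.properties"
}
```

Run it once per cloned slave, before the first start of that slave. The duplicate-ID check only covers the IDs passed in MEMBERS, so the list must be complete for the check to be meaningful.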

Out-of-Sync Cluster Instances

In some cases, when the master instance is stopped while the other cluster instances are still running, the master instance cannot re-join the cluster after being restarted.

This can occur in cases where a write operation was in progress at the moment that the master node was stopped, or where a write operation occurred a few seconds before the master instance was stopped. In these cases, the slave instance may not receive all changes from the master instance. When the master is then restarted, CRX will detect that it is out of sync with the remaining cluster instances and the repository will not start. Instead, an error message is written to the server.log saying the repository is not available, and the following or a similar error message is written to the files crx-quickstart/logs/crx/error.log and crx-quickstart/logs/stdout.log:

ClusterTarSet: Could not open (ClusterTarSet.java, line 710)
java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.
...
java.io.IOException: Init failed
...
RepositoryImpl: failed to start Repository: Cannot instantiate persistence manager
...
RepositoryStartupServlet: RepositoryStartupServlet initializing failed
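
A quick way to confirm this condition after a failed start is to search the log files for the distinctive line of the message. The sketch below assumes the message text shown above appears verbatim in the log; since the text may vary ("the following or a similar error message"), a miss does not rule the problem out. The function name is_out_of_sync is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of CRX.
# Usage: is_out_of_sync LOGFILE...
# Succeeds (status 0) if any of the given log files contains the
# out-of-sync message quoted above.
is_out_of_sync() {
    grep -q "This cluster node and the master are out of sync" "$@" 2>/dev/null
}
```

For example: is_out_of_sync crx-quickstart/logs/crx/error.log crx-quickstart/logs/stdout.log && echo "instance is out of sync".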

Avoiding Out-of-Sync Cluster Instances

To avoid this problem, ensure that the slave cluster instances are always stopped before the master is stopped.

If you are not sure which cluster instance is currently the master, open the page http://localhost:port/crx/config/cluster.jsp. The master ID listed there will match the contents of the file crx-quickstart/repository/cluster_node.id of the master cluster instance.
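
Once you have read the master ID from that page, the comparison with the local file can be scripted. This is a minimal sketch: the function name is_master and its arguments are hypothetical, and the master ID is passed in by hand because the markup of cluster.jsp is not described here.

```shell
#!/bin/sh
# Hypothetical helper, not part of CRX.
# Usage: is_master MASTER_ID [INSTANCE_DIR]
#   MASTER_ID     the master ID shown on the cluster.jsp page
#   INSTANCE_DIR  directory that contains crx-quickstart/ (default: .)
# Returns 0 if this instance is the master, 1 if it is not,
# and 2 if the cluster_node.id file cannot be found.
is_master() {
    master_id="$1"
    id_file="${2:-.}/crx-quickstart/repository/cluster_node.id"

    [ -f "$id_file" ] || return 2

    # Ignore surrounding whitespace and any trailing newline in the file.
    local_id=$(tr -d '[:space:]' < "$id_file")

    [ "$local_id" = "$master_id" ]
}
```

Running this on each machine before a planned shutdown shows which instance must be stopped last.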

Recovering an Out-of-Sync Cluster Instance

To re-join a cluster instance that is out of sync, there are a number of solutions:

  • Create a new repository and join the cluster node as normal.
  • Use the Online Backup feature to create a cluster node. In many cases this is the fastest way to add a cluster node.
  • Restore an existing backup of the cluster instance node and start it.
  • As described in the error message, delete the index and data tar files that are out of sync on this cluster node and restart. Note that the Lucene search index may still be out of sync unless it is also deleted. This procedure is discouraged because it requires more knowledge of the repository and may be slower than using the online backup feature (especially if the Lucene index needs to be rebuilt).

Important Notes

Locking in an active cluster: An active cluster does not support session-scoped locks. Open-scoped locks or application-side solutions for synchronizing write operations should be used instead.

Tar PM Optimization

The Tar PM stores its data in standard Unix-style tar files. Occasionally, you may want to optimize these storage files to increase the speed of data access. Optimizing a Tar PM clustered system is essentially identical to optimizing a stand-alone Tar PM instance (see Optimizing Tar Files).

The only difference is that to optimize a cluster, you must run the optimization process on the master instance. If the optimization is started on the master instance, the shared data as well as the local cache of the other cluster instances is optimized automatically. There is a small delay before the changes are propagated (a few seconds). If one cluster instance is not running while optimizing, the tar files of that cluster instance are automatically optimized the next time the instance is started.



COMMENTS

  • By Raman - 11:47 AM on Apr 14, 2010   Reply
    How do cluster nodes find each other/communicate with each other: udp, broadcast, tcp? Are there any networking requirements based on this?
    • By tmueller - 3:51 PM on Apr 14, 2010   Reply
      How do cluster nodes find each other: when starting, the master writes its IP address and port to the shared storage (in a properties file). The slave(s) read this information from there. How do they communicate: using TCP/IP.
    • By robert.a.brown2 - 12:58 AM on Jul 16, 2010   Reply
      How does one disable clustering? And if clustering is already enabled how do you disable it?
      • By robert.a.brown2 - 2:01 AM on Jul 17, 2010   Reply
        Sorry my question was redundant. I meant to ask is it possible to install without clustering enabled? And if it is already enabled how do you disable it?
        • By gklebus - 5:07 PM on Aug 18, 2010   Reply
          CRX & CQ5 are by default *ready* to be clustered - they have journaling enabled. When you install one instance, however, it's not a cluster yet - cluster, by definition, requires at least two instances.

          It is possible, although not generally recommended to disable journal on a single instance. Please get in touch with our support to learn more about it.
      • By Oliver - 2:14 PM on May 19, 2011   Reply
        "Avoiding Out-of-Sync Cluster Instances

        To avoid this problem, ensure that the slave cluster instances are always stopped before the master is stopped."

        This means: when we have to perform a maintenance task on our master server that requires stopping the master server in our shared-nothing cluster, we always have to stop our slave servers.
        So a planned outage of our master server will be an outage of our whole cluster?

        Is this really your suggestion?

        What would be your solution for 7x24?

        • By alvawb - 7:17 PM on Jun 24, 2011   Reply
          Please see the reply below. The hotfix fixes this issue.
          • By clay - 5:44 PM on Dec 22, 2011   Reply
            I have set up a clustered author environment and I applied all the cluster patches, feature packs, and hotfix packages recommended by daycare. I had to restart my author instances in order to activate some new parameters and when I restarted the instances the slave cluster node did not come back up.

            In a cluster, the members should not depend on each other being up, nor should there be any order in which the instances are restarted. What happens if there is an outage on the master node, like a server or jvm crash? What happens in a production environment where each node has to be taken down one at a time to perform maintenance so that the cluster stays on line for use?

            Is there a plan on when these problems are really going to be fixed?
            • By alvawb - 1:46 PM on May 02, 2012   Reply
              Thanks for your feedback. If you're still experiencing this issue, please go to daycare at daycare.day.com to contact Customer Support.
        • By raman - 12:36 PM on May 20, 2011   Reply
          Oh man, is this ever gonna work ???
          • By Thomas Mueller - 9:36 AM on May 24, 2011   Reply
            There is a CRX feature pack available that solves the known problems (including 'avoiding out-of-sync instances').
            • By aheimoz - 8:29 AM on May 25, 2011   Reply
              And if you're still having problems please give us some more information about the specific issues - and/or an email address where we can contact you.
          • By rkent - 6:40 PM on Jun 01, 2011   Reply
            Hi, I assume this is the crx featurepack/hotfix pack you are talking about which solves the 'known problems'?

            http://www.day.com/content/kb/home/Crx/Hotfixes/crx-2-2/hotfixpack-2-2-0-12.html
            • By alvawb - 7:15 PM on Jun 24, 2011   Reply
              Yes, that is the correct featurepack.
            • By amaggi - 3:31 PM on Aug 03, 2011   Reply
              Which parameters are needed to configure the FileJournal for sharing the journal? This is not described the way the Data Store sharing is: can you provide more details?
              • By ppiegaze - 9:43 PM on Aug 12, 2011   Reply
                In CRX 2.2 the recommended journal is the TarJournal. If you wish to use the older FileJournal implementation --which should still work in a 2.2 installation-- you should consult the CRX 2.1 documentation here: http://dev.day.com/content/docs/en/crx/2-1/administering/cluster.html.
                • By amaggi - 9:47 AM on Aug 18, 2011   Reply
                  Hi,
                  thanks for your input.
                  One other question: since the TarJournal is the recommended one, can I use this when sharing the Journal?
                  Reading the documentation it seems that in this case I should use the FileJournal instead.
                  • By ppiegaze - 12:11 AM on Aug 19, 2011   Reply
                    I assume that by "sharing the journal" you mean using a shared file system directory, correct?

                    The FileJournal was the default journal implementation in CRX 2.1. By design it stores the journal data in a shared file system directory accessible by all cluster instances. That is basically what makes it the "FileJournal".

                    In CRX 2.2 the default journal implementation was changed to the TarJournal, which works differently (and better). It does not use a shared directory, but rather synchronizes automatically across cluster nodes directly via a TCP/IP connection.

                    So, to answer your question, the TarJournal cannot be used with a shared directory because that's simply not how it works. The File Journal must be used with a shared directory because that is intrinsic to its design.

                    Unless you have a very pressing reason to use the older technology, we recommend that you use the default TarJournal. More info on the older Filejournal set up can be found in the docs for CRX 2.1, here: http://dev.day.com/content/docs/en/crx/2-1/administering/cluster.html
              • By vfinet - 5:14 PM on Sep 15, 2011   Reply
                Hi,

                I have 2 CQ Author 5.4 instances within a cluster.
                I have one web server in front of those two instances, dispatching the requests with the CQ Dispatcher Cache 4.0.10.

                It worked for 1 day and then "kaboom", one of the instances is out of sync (surely because we stopped one of the instances while modifications were in progress...).
                I will follow your recommendation on how to recover the situation, but my question is as follows....

                ==> does the dispatcher knows when an instance of CQ is out of sync and thus is there an impact for users that uses the web server? is it transparent for them?

                Best regards,
                VIncent
                • By yogesh - 12:05 AM on Oct 11, 2011   Reply
                  does the dispatcher knows when an instance of CQ is out of sync and thus is there an impact for users that uses the web server? is it transparent for them?

                  --- No there is no way for dispatcher to know this.


                • By Peter Mac Courtney - 7:50 PM on Sep 23, 2011   Reply
                  How can I get the cluster instance id from within a bundle? I assume that this is the same as the value that is added to the "application" property of a sling job when it is not local, and I need to do something similar elsewhere...
                  • By alvawb - 1:51 PM on May 02, 2012   Reply
                    Thanks for your feedback. Feel free to leave your comment on http://forums.adobe.com/community/digital_marketing_suite/cq5.
                  • By Chris Trubaini - 8:06 PM on Nov 23, 2011   Reply
                    The steps listed for setting up a shared data store on master involve:
                    1 - start the server
                    2 - stop the server
                    3 - change repository.xml to point to the shared location for the data store
                    4 - start the server and make sure it works

                    It should be added that if you don't copy the repository that was created on the initial start to the shared location, then it will throw tons of errors when you start the server again.
                    • By Anonymous - 8:41 AM on Jan 07, 2012   Reply
                      You also need to copy all data from <default-repository>/datasource into your new shared datasource
                      • By ppiegaze - 1:51 PM on May 02, 2012   Reply
                        Thanks for pointing these problems out. We will update the documentation ASAP
                    • By lchis - 5:43 PM on Dec 26, 2011   Reply
                      Can you have two clustered nodes in which all data gets replicated between the two except for user data?
                      • By jkautzma - 1:52 PM on May 02, 2012   Reply
                        Hi, you cannot exclude data from being synced within a cluster.
                      • By piyush - 10:31 AM on Feb 21, 2012   Reply
                        Can we do clustering with HTTPS ?
                        • By alvawb - 1:55 PM on May 02, 2012   Reply
                          Clustering is not at the http level. One thing that you could do is route the traffic via a VPN tunnel.
                        • By steve - 2:02 AM on Apr 14, 2012   Reply
                          Is there documentation on active/passive cluster configuration?
                          • By alvawb - 3:53 PM on Apr 16, 2012   Reply
                            There is documentation about active/passive cluster configuration on this page. Is there particular information that you're looking for that you can't find?
                          • By Cindy Navarro - 6:04 PM on May 10, 2012   Reply
                            Do you have links for this section:
                            For more details on the journal, see XX.
