Posted by Ben Peter JAN 28, 2011
Posted in ajax, cms, development and performance
In your typical CMS setup, most of the content is actively managed as such. But you often come across scenarios where other data needs to appear on a page, e.g. prices or product data that are provided by external sources and change on their own schedule.
There are various ways to handle such a requirement, each with its own upsides and downsides.
One obvious way would be to access the data from within the CMS domain, i.e. build a component that reaches out to the data source and renders the appropriate data accordingly. It’s straightforward to implement; the only variable is the complexity of the data access itself, which you have to deal with in any of the approaches.
The one issue with this approach is that the page has to stay uncacheable if the data is to be up to date at all times. Every request to the page needs to hit the Publisher so that the component can reach out to the data source and pull the most recent data. Caching that page on a Dispatcher or CDN level is out of the question.
If you want a page that doesn’t hit the Publisher and can be cached by the Dispatcher and on the CDN, you can take a slightly different approach: build a component that in edit mode will allow you to pull updated data from the data source, and store it as part of the page’s content. In publish mode the data will be rendered just as the rest of the page’s content.
The issue with data updates is not eliminated, but it’s now pushed to the authoring side. On the publisher, the page is fully cacheable, but you need to make sure that whenever the data changes, the page is activated. That can happen automatically, if you have technical ways to be notified of data changes, or it can be organizational (read: phone call). Although it is rarely a technical problem, the automated update is often impossible because the pages involved contain other content that may or may not be ready for activation and still needs to be checked by a human. Or they are simply part of a review and approval workflow that is not certain to complete within the time that the data is allowed to be out of date on the public-facing systems.
From an implementation perspective, this option is slightly more complex than the previous option.
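The automated variant of this second option hinges on noticing that the snapshot stored in the page has drifted from the data source. Here is a minimal sketch of that check in Python, assuming the data can be serialised to JSON. The helper names `fingerprint` and `needs_activation` and the sample records are made up for illustration; the actual activation call and any workflow checks are left out because they depend on your installation.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(data: Dict[str, Any]) -> str:
    """Return a stable hash of the data, independent of key order."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_activation(stored: Dict[str, Any], current: Dict[str, Any]) -> bool:
    """True when the snapshot stored in the page differs from the source data."""
    return fingerprint(stored) != fingerprint(current)


# Hypothetical snapshot stored with the page at authoring time
stored_snapshot = {"sku": "A-100", "price": 19.99}
# Hypothetical current state of the external data source
current_data = {"price": 21.49, "sku": "A-100"}

print(needs_activation(stored_snapshot, current_data))
```

Comparing hashes rather than raw records means only the fingerprint needs to be kept next to the page, and a mere reordering of fields in the source does not trigger an activation.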
If there is a need to update the data in the page in a fully automated fashion, there are more options available that merge content and data not at the time the page is baked, but on the webserver or in the browser.
Bringing data and content together in the browser is easily done through AJAX requests, as long as you can expose the data source in a way that gives you the right data per page, for example as JSON. For both performance and information control reasons, you want to put a layer on top of the data source that will not simply spit out all of the data, but just the data that is required for that particular page.
This approach works well, is very simple to implement, and matches the second approach in terms of cacheability: the page can be cached at all levels. A request for the page can be offloaded at a CDN layer and from the Dispatcher cache. Only an activation of the page due to content changes will require these layers to be purged or invalidated. The delivery layer on top of the data source can follow its own caching strategy.
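The delivery layer for the browser-side approach could look roughly like the following Python sketch. It only illustrates the filtering idea: the product records, the page-to-product mapping, the field whitelist, and the function name `data_for_page` are all assumptions invented for this example, and a real layer would sit behind an HTTP endpoint and talk to the actual data source.

```python
import json
from typing import Any, Dict, List

# Hypothetical full data set held by the external source
ALL_PRODUCTS: Dict[str, Dict[str, Any]] = {
    "A-100": {"name": "Widget", "price": 19.99, "cost": 7.50},
    "B-200": {"name": "Gadget", "price": 34.00, "cost": 12.10},
    "C-300": {"name": "Doohickey", "price": 5.25, "cost": 1.90},
}

# Hypothetical mapping of page paths to the products they display
PAGE_PRODUCTS: Dict[str, List[str]] = {
    "/content/shop/widgets": ["A-100", "C-300"],
}

# Only these fields ever leave the delivery layer
PUBLIC_FIELDS = ("name", "price")


def data_for_page(page_path: str) -> str:
    """Return JSON holding only the public fields of the products on one page."""
    skus = PAGE_PRODUCTS.get(page_path, [])
    payload = {
        sku: {field: ALL_PRODUCTS[sku][field] for field in PUBLIC_FIELDS}
        for sku in skus
        if sku in ALL_PRODUCTS
    }
    return json.dumps(payload, sort_keys=True)


print(data_for_page("/content/shop/widgets"))
```

The whitelist covers the information control concern (the internal `cost` field never reaches the browser), and the per-page lookup keeps the response small.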
The approach may be inappropriate if the display of data must not be deferred until the request for the data is complete, or if it must not depend on JavaScript being available. While today such restrictions are typically not considered important from a user experience point of view, legal requirements can demand that a page be either displayed completely, with accurate information and independent of JavaScript, or not at all.
That leads to the fourth option, which allows for good cacheability, data accuracy and user agent compatibility. Instead of aggregating in the browser, the aggregation is performed on the webserver.
For that, a layer is built on top of the data source that renders HTML fragments that go into the page. The respective CQ component does nothing but render an appropriate SSI statement that fetches the HTML fragment from the data rendering layer and plugs it into the page. Since SSI typically does not allow you to include remote sources, a reverse proxy is required to make the data available as a local path if the data source is not deployed within the same virtual host.
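To make the two halves concrete, here is a small Python sketch of what each side emits. The directive uses the `<!--#include virtual="..." -->` form known from Apache's mod_include; the path layout `/data/price/...`, the function names, and the fragment markup are assumptions for illustration only, and in CQ the component itself would of course be written as a JSP or similar rather than in Python.

```python
from html import escape


def ssi_include(fragment_path: str) -> str:
    """Return an SSI include directive pointing at a local path."""
    # This sketch only accepts absolute local paths, since the fragment is
    # expected to live on (or be proxied into) the same virtual host.
    if not fragment_path.startswith("/"):
        raise ValueError("expected an absolute local path")
    return f'<!--#include virtual="{escape(fragment_path, quote=True)}" -->'


def price_fragment(name: str, price: float) -> str:
    """HTML fragment as the data rendering layer might return it."""
    return f'<span class="price">{escape(name)}: {price:.2f}</span>'


# What the component writes into the cached page
print(ssi_include("/data/price/A-100.html"))
# What the data rendering layer serves under that path
print(price_fragment("Widget", 19.99))
```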
That leaves the page cacheable on the publisher: the page including the SSI statement is pulled from the Dispatcher cache, the SSI statements are evaluated, and the page is then returned to the requesting layer, be it the CDN, a proxy, or a browser. It cannot, however, be cached at the CDN level, because for data consistency the webserver needs to re-evaluate the SSI on each request.
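The division of labour can be simulated in a few lines of Python. This is not how a webserver implements SSI, just a toy model under stated assumptions: `resolve_includes` stands in for the webserver's include processing, the `prices` dictionary stands in for the data rendering layer, and the page string stands in for the copy in the Dispatcher cache.

```python
import re
from typing import Callable

# Matches include directives of the form <!--#include virtual="/path" -->
INCLUDE = re.compile(r'<!--#include virtual="([^"]+)" -->')


def resolve_includes(cached_page: str, fetch_fragment: Callable[[str], str]) -> str:
    """Replace each include directive with a freshly fetched fragment."""
    return INCLUDE.sub(lambda match: fetch_fragment(match.group(1)), cached_page)


# Page as it sits in the cache: static content plus one directive
cached_page = '<h1>Widget</h1><!--#include virtual="/data/price/A-100.html" -->'

# Hypothetical data rendering layer, keyed by fragment path
prices = {"/data/price/A-100.html": "19.99"}
first = resolve_includes(cached_page, lambda path: prices[path])

# The data changes; the cached page is neither touched nor invalidated
prices["/data/price/A-100.html"] = "21.49"
second = resolve_includes(cached_page, lambda path: prices[path])

print(first)
print(second)
```

The cached page is identical for both requests; only the include step runs again, which is why the result of that step must not be stored further out at the CDN.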
In terms of implementation this option is pretty simple, but you want to make sure you have someone handy who knows your webserver well.
For completeness’ sake, there’s an option that has similar characteristics to the first option in terms of cacheability and process interdependencies but is slightly more complex to implement. If you can afford to hit the Publisher on each request and your data source has a hierarchical structure, you can choose to make it visible to Sling as resources within the repository. This is typically only worth the effort if that data is used in many different scenarios and if it is semantically part of your content, but not managed as such because it comes from an established source that is outside of the system’s domain.
Which of these options is appropriate depends on the concrete situation. None of them can be generally ruled out or recommended, although it may be considered bad manners to introduce interdependencies between content publication and data update processes, which the first two and the fifth option do.

