Entries filed under 'data first'

    Posted by Michael Marth MAY 04, 2010

    Posted in agile, data first, davids model, jcr and modelling Comments 2

    Recently, I read up on quite a number of NoSQL protagonists. Of course, one dominant theme in NoSQL land is "schemaless" as opposed to the full-schema nature of relational databases. As usual, both approaches have their specific pros and cons. A common criticism of schemaless data stores is that the entropy of the data will create problems in the long run, once too much unstructured data has been amassed. On the other hand, full-schema databases are much less flexible, or downright the wrong tool, for unstructured data.

    In this post I would like to point out that you do not necessarily have to choose between those extremes: JCR-based data stores allow you to store unstructured data, fully structured data and anything in between. For lack of a better term I would like to call this a "schema-optional" data store with "semi-structured" data.

    • The JCR node type nt:unstructured is designed to accept any properties, so you can dump strings, dates or even binaries into such a node at will. This node type is very useful to get started with coding an application when you do not know what the end result should look like. It allows for a development approach coined "data first, structure later" where structure emerges from data, rather than being defined a priori.
    • On the other end of the spectrum you can have rigidly defined node types. JCR allows you to specify e.g. mandatory properties, default values or the allowed child node types in a node hierarchy. The Apache Jackrabbit site has a good overview of the Compact Namespace and Node Type Definition (CND), a notation used to define such structure.

    In between these two extreme cases any middle ground is possible in JCR repositories:

    • First, a rigid node type definition for a specific node can define "residual" properties. Such an approach allows the application to set not only the properties that were defined a priori in the node type definition, but also anything else. This is particularly useful for scenarios where only a part of the requirements is known beforehand or where the requirements are known to evolve over time. You can define the known parts, but an application can still freely write anything into the node as if it were unstructured.
    • Second, it should also be noted that these structured, unstructured and semi-structured nodes can happily live next to each other in the same repository tree. So different parts of your application can make use of different levels of structure not only through different node types, but also through different parts in the node hierarchy.

    With JCR 2.0 it has become quite a bit easier to evolve the structure (after all, the mantra is "data first, structure later", not "structure never"): one can now change the node types of existing nodes. That facilitates a migration from, say, nt:unstructured nodes to more structured types.
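
    The spectrum from unstructured to rigid, with residual properties in the middle, can be sketched as a toy model. This is not the JCR API: the NodeType class, the demo:article and demo:openArticle names and the violations() helper are all assumptions made up for illustration. Only nt:unstructured is a real JCR node type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """Toy stand-in for a node type definition (hypothetical, not javax.jcr)."""
    name: str
    mandatory: frozenset = frozenset()  # properties that must be present
    declared: frozenset = frozenset()   # optional properties known up front
    residual: bool = False              # True: any other property is accepted too

    def violations(self, props):
        """Return a list of problems; an empty list means the properties fit."""
        present = set(props)
        problems = [f"missing mandatory property '{p}'"
                    for p in sorted(self.mandatory - present)]
        if not self.residual:
            known = self.mandatory | self.declared
            problems += [f"undeclared property '{p}'"
                         for p in sorted(present - known)]
        return problems


# Accepts anything, like nt:unstructured.
UNSTRUCTURED = NodeType("nt:unstructured", residual=True)
# Rigid: only the listed properties are allowed (hypothetical type name).
STRICT = NodeType("demo:article",
                  mandatory=frozenset({"title"}),
                  declared=frozenset({"published"}))
# Middle ground: known parts are enforced, the rest is free (hypothetical type name).
OPEN = NodeType("demo:openArticle",
                mandatory=frozenset({"title"}),
                residual=True)

props = {"title": "Data first", "mood": "curious"}
for node_type in (UNSTRUCTURED, STRICT, OPEN):
    print(node_type.name, node_type.violations(props))
```

    In this model, a migration from nt:unstructured to a more structured type amounts to running violations() against the target type first and fixing whatever it reports before switching the type.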

    Posted by Michael Marth SEP 25, 2009

    Posted in data first and jcr Comment 1

    My colleague Cedric Huesler gave a talk "Data First in Cloud Persistence" at yesterday's CloudCamp in London. Missed it? You'll have another chance next week at the Frankfurt CloudCamp. Meanwhile, here's the slide deck (I love the second slide):



    Posted by Michael Marth JUL 29, 2009

    Posted in data first and lotd Add comment

    A couple of days ago, I wrote about the NoSQL movement and my conviction that data storage models will soon be better fitted to the data at hand (rather than have any data shoehorned into a relational model). It turns out that Scott Leberknight has come up with a nice buzz word for this phenomenon: "Polyglot Persistence". Quote from a presentation abstract of his:

    Polyglot persistence is all about considering your persistence requirements and selecting a persistence mechanism that best meets those requirements, as opposed to selecting an RDBMS as the default choice.

    InfoQ has more information on his talk. From the article:

    The types of data managed in the applications is very different as well. It can be either Structured (relational data), Semi-Structured (for example, documents in a medical records system) or Unstructured (audio/video stream).

    (SCNR: if you have all of the above you might want to look at content repositories).

    Related to this: For an introduction to ACID vs BASE you might also enjoy the talk "Drop ACID and think about data" from PyCon.

    Posted by Michael Marth JUL 16, 2009

    Posted in data first, ecm and wcm Comments 10

    OK, I admit it, declaring that "the RDBMS is dead" is a meme that has been going around the software industry for a while. Remember object-oriented data bases that were supposed to replace the relational ones? Well, guess who is still here. However, despite the RDBMS's amazing survival skills I would like to propose a related prediction:

    I believe that the year 2009 will go down in history as the year when the "relational model default" ended. The term "relational model default" was coined by me to describe a peculiar thing that goes on in application development: start talking to your average application developer about some arbitrary business requirement and chances are that simultaneously he mentally constructs a relational model to fit those requirements.

    That relational approach to modeling your problem may or may not be suitable. The real problem is that all too often this default does not get challenged. As a consequence, whatever the fitting data model might be, it gets shoehorned into tables and relations.

    This default "thinking" has not yet changed for the masses, but I believe that it has changed for the early adopters (which means that invariably it will change for the masses in some years).

    I see the default changing from:

    "I need to store some data i.e. I need a relational database"

    to:

    "I need to store something, let me see the data to decide how to store it."

    The most concrete and visible manifestation of the rising interest in non-relational data stores is the "NoSQL" movement. NoSQL denotes a group of people interested in exploring and comparing alternatives to traditional relational data stores like MySQL or Postgres. The inaugural get-together was covered in Computerworld (see also Johan Oskarsson's post), and there is, of course, a hashtag.

    Other than the NoSQL group I have a second data point to offer: there is a Cambrian Explosion happening in terms of projects exploring non-relational data stores. During the Cambrian Explosion a major diversification of organisms took place. Similarly, a plethora of new projects that explore alternatives to relational models continues to gain interest. Here is an incomplete list:

    AllegroGraph, Amazon's SimpleDB, Cassandra, CouchDB, Dynomite, Google's App Engine datastore, HBase, Hypertable, Kai, MemcacheDB, Mongo DB, Neo4J, OpenRDF, Project Voldemort, Redis, Ringo, Scalaris, ThruDB, Tokyo Cabinet (and Tokyo Tyrant and LightCloud)

    Last, but certainly not least, there are Apache Jackrabbit and Apache Sling.

    From my perspective there are three main areas of innovation in this Cambrian Explosion of data stores:

    1. Models
    In the relational model you break down your data into tables and relations. This model implies that the data is somewhat tabular. However, in some cases the data simply is not tabular.

    Consider web content, which is hierarchical and mixes fine-granular data with binary files (this model is implemented in Jackrabbit). Other (not mutually exclusive) alternative models are document-oriented, key-value pairs, or Graphs/RDF.

    One very important aspect of many alternative models is that they are schemaless. That means that they accommodate Data First approaches, where it is not required to define the data structure before one can actually store any data. This enables agile approaches to software development in the short term as well as more flexibility in the long-term evolution of business requirements.

    Without defining a data structure first it is not possible to store anything at all in an RDBMS. This fact is probably one of the root causes of the relational default thinking. An RDBMS-based developer simply cannot develop anything without thinking about table structure.
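
    The contrast can be made concrete with a small sketch. It uses SQLite from the Python standard library as a stand-in for an RDBMS and a plain dict as a stand-in for a schemaless store. The table name, the node path and the property names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational store: writing before any table has been defined fails.
try:
    conn.execute("INSERT INTO posts (title) VALUES (?)", ("Data first",))
    insert_without_schema_worked = True
except sqlite3.OperationalError as err:
    insert_without_schema_worked = False
    print("RDBMS refused the write:", err)

# Only after the structure has been declared does the same write succeed.
conn.execute("CREATE TABLE posts (title TEXT)")
conn.execute("INSERT INTO posts (title) VALUES (?)", ("Data first",))
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

# Schemaless stand-in: data goes in first, structure can emerge later.
store = {}
store["/content/posts/first"] = {"title": "Data first", "tags": ["jcr"]}
print(row_count, store)
```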

    2. Scalability
    A second area of innovation is scalability. This can be split into two parts. One is scalability achieved by distributing the data store across separate machines, the approach pioneered by Google. As opposed to classical clustering of RDBMSs, the number of machines considered is in the hundreds rather than the tens. Obviously, different trade-offs regarding consistency and availability of individual cluster nodes must be made when architecting for such a high number of cluster nodes. Eventual consistency is one of the interesting concepts invented in this space.

    While the commoditization of server hardware triggered this first approach to scalability, a second area is related to the rise of multi-core processors. For a number of years CPUs have not gotten faster, but rather the number of cores has increased. There is no explicit contradiction in running a classical RDBMS on a multi-core machine and even having the RDBMS take advantage of the additional cores. However, it seems to me that the SQL language is a poor fit for queries in a multi-core environment when compared with alternatives such as Map/Reduce, which are parallel by design.
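
    What "parallel by design" means for Map/Reduce can be shown with a minimal word-count sketch. The function names and the chunking scheme are assumptions for illustration, not taken from any particular Map/Reduce framework. A thread pool is used only to show the shape of the computation; in CPython, threads will not give real multi-core speed-up for this kind of work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial results; the order does not matter."""
    return left + right


def word_count(lines, workers=4):
    # Split the input into independent chunks, one per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Fold the partial counts into the final result.
    return reduce(merge, partials, Counter())


print(word_count(["data first", "structure later", "data wins"]))
```

    Because each map call touches only its own chunk and the merge is associative, the chunks can be processed on as many cores or machines as are available without changing the query itself.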

    3. Web
    The third area of innovation revolves around the fact that the web is the dominant paradigm for computing in our time. This is also acknowledged by the two considerations discussed above. However, a third one is that HTTP is used for accessing the data. Other types of connectivity that were typically implemented as JDBC or ODBC drivers are no longer needed or used. In many cases the data store exposes its resources through a RESTful API. An obvious benefit is the ubiquitous availability of clients, including the browser itself. The classical RDBMS approach involving a dedicated driver looks like a client-server architecture mindset in comparison (I wrote about this 1.5 years ago).

    At this point let me reiterate that RDBMSs are here to stay, just like mainframes never went away. Moreover, a couple of the innovation areas cited above are not that new at all, especially when it comes to non-relational data models (for example, I recently dug into the foundations of the Lotus Notes document store and came out very impressed). However, it is only now that the relational model default will disappear.

    What about content management systems?

    Considering the content management system industry as a whole I am extremely happy about this shift away from RDBMSs. Especially the model aspect is crucial: RDBMSs embody a fundamentally wrong model for content. There are varying opinions in the industry about what "content" really is, but one thing is more or less universally accepted: it is (at least partially) unstructured. Well, RDBMSs are designed for structured data. Duh.

    So why are there one gazillion LAMP-based CMSs? I blame the relational model default. But as this default vanishes we will see more and more CMSs that are not based on an RDBMS (see the Jackrabbit wiki for a list of JCR-based ones, as well as the recent PHP-based JCR implementations: Jackalope, the one for Typo3, or the Midgard content repository).

    Don't laugh, but I truly envision a better (CMS) world once more CMSs are built upon proper tools and not forced into a relational model anymore. It will be a better world for developers and consequently for the CMS users.

    What about Day?

    REST and content repositories were invented and evangelized by Day's Chief Scientist Roy and Day's CTO David years ago already. So it is no surprise that Day's content management systems are in excellent shape with respect to these considerations. CQ5 is built upon Apache Jackrabbit, i.e. a data store that implements a content-centric model, and Apache Sling, a web framework designed to be RESTful right from the start.

    When it comes to scaling: a week ago we gave a live demonstration of how to install and cluster CQ5 on Amazon's EC2 service. But expect even more exciting news in this area.

    Posted by Michael Marth JAN 12, 2009

    Posted in ab testing, crx quickstart, data first, javascript, sling and tracking Comments 6

    John Resig of jQuery fame has written an interesting article about a JavaScript library called Genetify by Greg Dingle, which is for A/B testing web sites. Wikipedia explains A/B testing as:

    A/B testing, or split testing, is a method of advertising testing by which a baseline control sample is compared to a variety of single-variable test samples in order to improve response rates. A classic direct mail tactic, this method has been recently adopted within the interactive space to test tactics such as banner ads, emails and landing pages.

    Significant improvements can be seen through testing elements like copy text, layouts, images and colors. However, not all elements produce the same improvements, and by looking at the results from different tests, it is possible to identify those elements that consistently tend to produce the greatest improvements.

    In the context of a web page one might for example change the colors or the texts, display each variation to a subset of the site's visitors and determine the most successful variant by the number of page views or sold items.

    There are two things to note about Genetify: first, it takes this process to the client, i.e. the served HTML page already contains all possible variants and a particular variant is chosen on the client side. Second, over time the optimal variation will be shown more often than suboptimal versions. This is the "genetic" part (as in Genetic Algorithms).
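
    The idea of showing better variants more often can be sketched as weighted random selection. This is not Genetify's actual algorithm: the function names, the (views, goals) input format and the smoothing are assumptions made for illustration.

```python
import random


def selection_weights(stats):
    """Map each variant to a smoothed goal rate, (goals + 1) / (views + 2).

    stats: {variant: (views, goals)}. The smoothing keeps variants that have
    rarely or never been shown in play instead of giving them weight zero.
    """
    return {variant: (goals + 1) / (views + 2)
            for variant, (views, goals) in stats.items()}


def pick_variant(stats, rng=random):
    """Pick one variant at random, with probability proportional to its weight."""
    weights = selection_weights(stats)
    variants = list(weights)
    return rng.choices(variants, weights=[weights[v] for v in variants], k=1)[0]


# Hypothetical numbers: (times shown, goals reached) per variant.
stats = {"rock": (100, 30), "paper": (100, 10), "scissors": (100, 20)}
print(selection_weights(stats))
print(pick_variant(stats))
```

    With these numbers "rock" gets the largest weight, so it is chosen most often, while the other variants still appear now and then and can recover if their results improve.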

    John provides a good overview of the library and also points to Genetify's instructive demo. After John's post Genetify's author Greg Dingle has open-sourced Genetify on GitHub including a PHP/RDBMS-based backend which is announced and discussed in the comments of John's post. In another comment of that thread Rob Howell says:

    Also, would be very cool to see it integrated server-side into a decent CMS.

    Hmm, I happen to know a decent CMS, so I had a look at how Genetify could be ported (to Apache Sling actually, which makes it suitable for CQ5 or any other Sling-based web application):

    Originally, I planned to simply re-implement the PHP backend and leave the JS untouched. But I realized that the style of interaction between the JS script and the PHP backend was so much out of tune with how one would design the interaction in a RESTful framework like Sling that I decided to tweak the JS script as well. As such, this exercise became more interesting in the sense that some differences between PHP/RDBMS backends (I should rather say: the way PHP-based backends are usually designed) and Sling/JCR backends became visible.

    The first difference was in recording "variants" and "goals". The variants are the permutations of the genes that are shown to a specific user. The goals are the desired outcomes that shall be measured, like buying something. Both need to be persisted, obviously. In the original version both are recorded by sending a GET request to the backend. I changed this to the (arguably more "correct") POST method. The original version sends a random number parameter with each request. As far as I understand the code, this is needed to get around caching issues. Using POST makes it possible to drop this parameter. What's more, Sling requires no backend code at all for writing a new node when the request is sent using the POST method.

    The second change involves the layout of the stored data. In an RDBMS-based system one (obviously) puts the different entities into different tables (which need to be defined beforehand). In a JCR-based system one possible, if not the natural, approach is to utilize the hierarchy and potentially not define any node types at all, like I did. Since I store all variants and goals in nodes of type nt:unstructured there is no need to define a data schema or the like beforehand. One can start writing into the empty repository.

    For example, the variants are stored in one node of type nt:unstructured that stores all the properties, like on which domain the variant was shown. The actual genes are stored in a child node below. A similar approach is taken for the goals, where there is a node for each goal (named like the goal) and child nodes for the achieved goals.

    It is actually possible to create a node hierarchy like this in one POST request by simply setting parameter names accordingly:

    ./param1=value2&./childnode/childparam2=value2
    

    (this approach is also used in the blog sample application where a blog post can have an attachment which is stored as a child node of the blog post's node).
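
    How such parameter names map onto a node hierarchy can be sketched in a few lines. The snippet only builds and inspects the form body and does not send a request. build_post_body() and as_node_tree() are hypothetical helpers written for this illustration, and the nested dict is just a picture of the resulting nodes, not something Sling returns.

```python
from urllib.parse import parse_qsl, urlencode


def build_post_body(fields):
    """URL-encode a flat {relative path: value} mapping as a form body."""
    # Keep "/" unescaped so the relative paths stay readable.
    return urlencode(fields, safe="/")


def as_node_tree(body):
    """Show the node hierarchy that the parameter names describe."""
    root = {}
    for name, value in parse_qsl(body):
        # Drop the leading "." and split the rest into child nodes and property.
        *parents, prop = [part for part in name.split("/") if part not in ("", ".")]
        node = root
        for child in parents:
            node = node.setdefault(child, {})
        node[prop] = value
    return root


body = build_post_body({
    "./param1": "value2",
    "./childnode/childparam2": "value2",
})
print(body)
print(as_node_tree(body))
```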

    As said above, this part did not involve server-side scripting. However, the Genetify JS script not only writes the goals, but also retrieves information about the previous performance of the genes when it starts (in order to lean towards more successful genes in the long run). I have (hopefully correctly) reverse-engineered the PHP scripts that generate this response and written an ESP script (server-side JS) that should do the same. It should be noted that the original Genetify server-side scripts do a lot more error checking which is not implemented in the ESP.

    If you want to check out the Sling'ed Genetify version, grab the attached zip file, unzip it into your CRX repository at /apps/gen and point your browser to http://localhost:7402/apps/gen/index.html. The upper part of the page displays the values of two genes (the first one is "rock", "paper", or "scissors"). If you click the "vary" link below, the genes will change (because always keeping the same state on one particular browser with a cookie is switched off for development). Clicking one of the two links further down, "want it!" or "badly!", will be counted as an achieved goal for the genes that are currently displayed. If you click one of them and reload the page afterward, the stats table will have changed. The stats table represents the success of particular genes on a particular page. To restart, just delete the results stored in /content/gen.

    While it's fun to look at how to do things in Sling and how little code is needed to get things running, it needs to be said that the approach presented above will not scale very well. For one, all variations are stored flat, i.e. without a hierarchy. Since each page view creates a variation, the number of child nodes will quickly become much too large to be handled efficiently. The second scaling problem is the calculation of the previous results, which will take much too long as well. Both problems could be remedied by another JCR-typical approach: observation. A listener for /content/gen could be registered to move old variations into a properly structured archive as well as pre-calculate the previous results table.
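
    The two remedies can be sketched independently of the JCR observation API. Both helpers below are assumptions for illustration: the date-based layout under /content/gen/archive is a hypothetical choice, and record() only shows the idea of keeping running totals instead of recomputing them from every stored variation.

```python
from datetime import datetime


def archive_path(variation_id, shown_at, root="/content/gen/archive"):
    """Bucket a variation by day so that no single parent collects every node."""
    return f"{root}/{shown_at:%Y/%m/%d}/{variation_id}"


def record(results, genes, goal_reached):
    """Update running (views, goals) totals per gene for one page view."""
    for gene in genes:
        views, goals = results.get(gene, (0, 0))
        results[gene] = (views + 1, goals + (1 if goal_reached else 0))
    return results


print(archive_path("v42", datetime(2009, 1, 12, 9, 30)))

results = {}
record(results, ["rock", "red"], goal_reached=True)
record(results, ["rock", "blue"], goal_reached=False)
print(results)
```

    A listener would call something like these two functions whenever a new variation or goal node appears, so that the stats table can be read directly instead of being calculated on each page load.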