NoSQL: A long-time relation(ship) comes to an end
OK, I admit it, declaring that "the RDBMS is dead" is a meme that has been going around the software industry for a while. Remember object-oriented databases that were supposed to replace the relational ones? Well, guess who is still here. However, despite the RDBMS's amazing survival skills I would like to propose a related prediction:
I believe that the year 2009 will go down in history as the year when the "relational model default" ended. I coined the term "relational model default" to describe a peculiar thing that goes on in application development: start talking to your average application developer about some arbitrary business requirement and chances are that he is already mentally constructing a relational model to fit those requirements.
That relational approach to modeling your problem may or may not be suitable. The real problem is that all too often this default does not get challenged. As a consequence, whatever the fitting data model might be, it gets shoehorned into tables and relations.
This default "thinking" has not yet changed for the masses, but I believe that it has changed for the early adopters (which means that inevitably it will change for the masses in some years).
I see the default changing from:
"I need to store some data, i.e. I need a relational database"
to:
"I need to store something, let me see the data to decide how to store it."
The most concrete and visible manifestation of the rising interest in non-relational data stores is the "NoSQL" movement. NoSQL denotes a group of people interested in exploring and comparing alternatives to traditional relational data stores like MySQL or Postgres. The inaugural get-together has been covered in Computerworld, see also Johan Oskarsson's post and there is, of course, a hashtag.
Other than the NoSQL group I have a second data point to offer: there is a Cambrian Explosion happening in terms of projects exploring non-relational data stores. During the Cambrian Explosion a major diversification of organisms took place. Similarly a plethora of new projects that explore alternatives to relational models continue to gain interest. Here is an incomplete list:
AllegroGraph, Amazon's SimpleDB, Cassandra, CouchDB, Dynomite, Google's App Engine datastore, HBase, Hypertable, Kai, MemcacheDB, MongoDB, Neo4J, OpenRDF, Project Voldemort, Redis, Ringo, Scalaris, ThruDB, Tokyo Cabinet (and Tokyo Tyrant and LightCloud)
Last, but certainly not least, there are Apache Jackrabbit and Apache Sling.
From my perspective there are three main areas of innovation in this Cambrian Explosion of data stores:
1. Models
In the relational model you break down your data into tables and relations. This model implies that the data is somewhat tabular. However, in some cases the data simply is not tabular.
Consider web content, which is hierarchical and mixes fine-grained data with binary files (this model is implemented in Jackrabbit). Other (not mutually exclusive) alternative models are document-oriented, key-value pairs, or graphs/RDF.
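To make the hierarchical model a bit more tangible, here is a minimal Python sketch. It is not the Jackrabbit or JCR API; the tree layout, the property names and the `resolve` helper are all assumptions made up for illustration. It only shows fine-grained properties and a binary sitting side by side in one tree, addressed by path.

```python
# Hypothetical content tree: nested dicts stand in for nodes,
# plain values stand in for properties (including a binary one).
content = {
    "content": {
        "blog": {
            "post1": {
                "title": "NoSQL",
                "image": {"mimeType": "image/png", "data": b"\x89PNG"},
            }
        }
    }
}


def resolve(tree, path):
    """Walk the tree one path segment at a time and return what is found."""
    node = tree
    for segment in (s for s in path.split("/") if s):
        node = node[segment]  # raises KeyError if the path does not exist
    return node


title = resolve(content, "/content/blog/post1/title")
binary = resolve(content, "/content/blog/post1/image/data")
```

Note that nothing here had to be flattened into rows and columns; the path itself carries the structure.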
One very important aspect of many alternative models is that they are schemaless. That means that they accommodate data-first approaches where it is not required to define the data structure before one can actually store any data. This enables agile approaches to software development in the short term as well as more flexibility in the long-term evolution of business requirements.
Without defining a data structure first it is not possible to store anything at all in an RDBMS. This fact is probably one of the root causes of the relational default thinking. An RDBMS-based developer simply cannot develop anything without thinking about table structure.
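The difference can be sketched in a few lines of Python. The relational side uses the standard `sqlite3` module; the schemaless side is a plain dictionary standing in for a document store, so the `put` helper and the document shapes are assumptions for illustration, not any particular product's API.

```python
import sqlite3

# Relational side: an insert fails until a table has been defined.
conn = sqlite3.connect(":memory:")
try:
    conn.execute("INSERT INTO articles (title) VALUES (?)", ("Hello",))
    insert_without_schema_worked = True
except sqlite3.OperationalError:
    # sqlite reports that the table does not exist
    insert_without_schema_worked = False

# Only after declaring the structure can the row be stored.
conn.execute("CREATE TABLE articles (title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES (?)", ("Hello",))
row_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]

# Schemaless side (hypothetical in-memory store): data first, structure later.
documents = {}


def put(doc_id, doc):
    """Store a document of any shape under the given id."""
    documents[doc_id] = doc


put("a1", {"title": "Hello"})
put("a2", {"title": "Photo", "tags": ["web"], "image": {"width": 640, "height": 480}})
```

The second document carries fields the first one never heard of, and nothing had to be migrated to allow that.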
2. Scalability
A second area of innovation is scalability. This can be split into two parts. One is scalability achieved by distributing the data store across separate machines, the approach pioneered by Google. As opposed to classical clustering of RDBMSs, the number of machines considered is in the hundreds rather than the tens. Obviously, different trade-offs regarding consistency and availability of individual cluster nodes must be made when architecting for such a high number of cluster nodes. Eventual consistency is one of the interesting concepts invented in this space.
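For readers who have not met eventual consistency before, here is a toy Python sketch. It assumes a last-write-wins rule with explicit version numbers, which is a simplification chosen for brevity; real distributed stores typically use more elaborate schemes. The `Replica` class and the `sync` function are hypothetical names.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Last-write-wins (an assumption of this sketch): keep the higher version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the two replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("x", "v1", version=1)
stale_read = r2.read("x")      # r2 has not heard about the write yet
sync(r1, r2)
converged_read = r2.read("x")  # after the exchange both replicas agree
```

The point is the window between the write and the sync: a reader on the second replica sees old data for a while, and the system accepts that in exchange for keeping every node available.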
While the commoditization of server hardware triggered this first approach to scalability, a second area is related to the rise of multi-core processors. For a number of years CPUs have not gotten faster, but rather the number of cores has increased. There is no explicit contradiction in running a classical RDBMS on a multi-core machine and even having the RDBMS take advantage of the additional cores. However, it seems to me that the SQL language is a poor fit for queries in a multi-core environment when compared with alternatives such as Map/Reduce, which is parallel by design.
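What "parallel by design" means can be sketched with a word count in Python. This is only an illustration of the shape of Map/Reduce, not any framework's API: the function names are made up, and a thread pool is used merely to show that the map step can be handed out to independent workers. Actually spreading the work over cores or machines would need processes or a distributed framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independent of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


def word_count(lines, workers=4, chunk_size=2):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))  # chunks are processed independently
    return reduce(reduce_counts, partials, Counter())


totals = word_count(["a b", "b c", "c c"])
```

Because no chunk depends on another, adding workers does not require changing the query itself, which is the property the comparison with SQL is about.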
3. Web
The third area of innovation revolves around the fact that the web is the dominant paradigm for computing in our time. This is also acknowledged by the two considerations discussed above. However, a third one is that HTTP is used for accessing the data. Other types of connectivity that were typically implemented as JDBC or ODBC drivers are no longer needed. In many cases the data store exposes its resources through a RESTful API. An obvious benefit is the ubiquitous availability of clients including the browser itself. The classical RDBMS approach involving a dedicated driver looks like a client-server architecture mindset in comparison (I wrote about this 1.5 years ago).
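As a rough illustration of data access over plain HTTP, the following Python sketch starts a tiny in-memory store and talks to it with nothing but the standard library. Everything about it is an assumption for the sake of the example: the `STORE` dictionary, the `StoreHandler` class, the `/content/post1` path and the JSON representation do not correspond to any of the products mentioned above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}  # resource path -> decoded JSON document (hypothetical store)


class StoreHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # The URL identifies the resource; the body is its new state.
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = json.loads(self.rfile.read(length))
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        if self.path not in STORE:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(STORE[self.path]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def demo():
    server = HTTPServer(("127.0.0.1", 0), StoreHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/content/post1"
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass proxies
    try:
        put = urllib.request.Request(
            url,
            data=json.dumps({"title": "NoSQL"}).encode("utf-8"),
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        opener.open(put).close()
        with opener.open(url) as response:
            return json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()
```

No driver is installed on the client side; any HTTP client, including a browser for the GET, could read the same resource.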
At this point let me reiterate that RDBMSs are here to stay, just like mainframes never went away. Moreover, a couple of the innovation areas cited above are not that new at all, especially when it comes to non-relational data models (for example, I recently dug into the foundations of the Lotus Notes document store and came out very impressed). However, it is only now that the relational model default will disappear.
What about content management systems?
Considering the content management system industry as a whole I am extremely happy about this shift away from RDBMSs. Especially the model aspect is crucial: RDBMSs embody a fundamentally wrong model for content. There are varying opinions in the industry about what "content" really is, but one thing is more or less universally accepted: it is (at least partially) unstructured. Well, RDBMSs are designed for structured data. Duh.
So why are there one gazillion LAMP-based CMSs? I blame the relational model default. But as this default vanishes we will see more and more CMSs that are not based on an RDBMS (see the Jackrabbit wiki for a list of JCR-based ones, as well as the recent PHP-based JCR implementations such as Jackalope, the one for Typo3, or the Midgard content repository).
Don't laugh, but I truly envision a better (CMS) world once more CMSs are built upon proper tools and not forced into a relational model anymore. It will be a better world for developers and consequently for the CMS users.
What about Day?
REST and content repositories were invented and evangelized by Day's Chief Scientist Roy and Day's CTO David years ago already. So it is no surprise that Day's content management systems are in excellent shape with respect to these considerations. CQ5 is built upon Apache Jackrabbit, i.e. a data store that implements a content-centric model, and Apache Sling, a web framework designed to be RESTful right from the start.
When it comes to scaling: a week ago we gave a live demonstration of how to install and cluster CQ5 on Amazon's EC2 service. But expect even more exciting news in this area.

That is my viewpoint on where the industry is and where it is going. If you have different views I am happy to discuss them.
Cheers
Michael
As an old Notes/Domino developer I was happy to notice your mention of Lotus Notes, that fossily, pre-Cambrian, NoSQL technology.
An interesting thing about Notes application development projects in the early days is that they all too often employed relational model thinking, despite the schemaless, document orientation of the underlying technology.
I think this can be attributed to the SQL mindset naturally prevalent in the mostly large corporate environments where Lotus Notes installations were initially adopted.
Because early versions of Notes allowed any user (a la 1-2-3) to design applications, small, departmental applications designed by non-corporate developers began to spring up everywhere.
The users designing these applications didn't care a fig about (or know about) database design methodologies; they were interested in supporting the loosely structured, constantly evolving kinds of *processes* they had to deal with every day.
Because of the propagation of so many of these applications (and the splitting of design capabilities out of the standard Notes client), IT departments found themselves supporting these weirdo designs.
Although there was much gnashing of teeth and rending of garments ('how are we supposed to produce reports from this mush?!?'), eventually corporate developers began to see the light, and found a tool to support business processes that could not be easily shoe-horned into the SQL mold.
I wonder if these new technologies will go through a similar evolution?
Are people looking at Sling or CouchDB and thinking, oh that's just a niche technology for managing website pages or supporting blogs and wikis?
Or will they see a whole new way of enabling organic business processes as they really happen 'in the wild'?
Thanks for the article!
Tim
Thanks for commenting. Jackrabbit, the data store used in CRX and CQ5, is architected such that it can be used with various "Persistence Managers". For example, CRX ships with the TarPersistenceManager (which stores the data in a tar file) whereas Jackrabbit comes with Derby DB. Other persistence managers could be distributed key-value stores, such as the ones offered in the Amazon cloud services. But the idea is not limited to Amazon's services. You might just as well run one of the mentioned open source projects in that area in your internal IT infrastructure.
Having said that, from my very limited experience I agree that the sweet spot of EC2 is not cost, but scalability.
Cheers
Michael
Thank you very much for sharing your thoughts and experience.
Wide adoption of non-relational data stores in enterprises is not something I see happening in the near future. But I believe that will change once there is a sufficient supply of developers who are able to think and design in a non-relational way. Eventually, these ideas will make it into large corps.
So re your question:
> I wonder if these new technologies
> will go through a similar evolution?
I believe they will. "Unlearning relational" and "Recognizing the value of non-relational models (when they fit)" might again be part of this long evolution - just as you described it for Notes.
Cheers
Michael
Interesting to note that Jackrabbit uses Derby DB, a relational db. However, my intention is not to pick on that.
Some time ago (8 months or so) I did a comparison between Derby and H2 Database (http://www.h2database.com/) and found that H2 performs many times better than Derby DB. You might want to consider it.
Thanks for pointing that out. You are right that when it comes to Jackrabbit/JCR one has to be careful about whether the data model or the persistence store is being discussed. Jackrabbit allows you to plug in several persistence stores, the default being Derby I believe. In Day's commercial package of Jackrabbit (CRX) we use the "TarPersistenceManager" which stores the data in a tar file. However, I believe the crucial point is that whatever persistence store you use, the data modeling will be done in terms of the hierarchical model defined by JCR, i.e. certainly not the relational model, even if an RDBMS is used as a persistence store.
Thanks for your pointer about H2. The db's author Thomas Mueller will be glad to hear you like it - he works for Day on the Tar persistence manager mentioned above ;)
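The separation between data model and persistence store discussed in this thread can be sketched in Python as follows. This is not Jackrabbit's Java interface; the class and method names only borrow the term "persistence manager" and are otherwise assumptions for illustration. The application addresses properties by path in both cases, even when a relational database sits underneath.

```python
import sqlite3
from abc import ABC, abstractmethod


class PersistenceManager(ABC):
    """Hypothetical storage backend: knows nothing about the content model."""

    @abstractmethod
    def store(self, path, value): ...

    @abstractmethod
    def load(self, path): ...


class InMemoryPersistence(PersistenceManager):
    def __init__(self):
        self._items = {}

    def store(self, path, value):
        self._items[path] = value

    def load(self, path):
        return self._items.get(path)


class SqlitePersistence(PersistenceManager):
    """A relational database used purely as a place to put bytes."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE items (path TEXT PRIMARY KEY, value TEXT)")

    def store(self, path, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO items (path, value) VALUES (?, ?)", (path, value)
        )

    def load(self, path):
        row = self._conn.execute(
            "SELECT value FROM items WHERE path = ?", (path,)
        ).fetchone()
        return None if row is None else row[0]


class Repository:
    """What the application sees: nodes and properties addressed by path."""

    def __init__(self, persistence):
        self._persistence = persistence

    def set_property(self, node_path, name, value):
        self._persistence.store(f"{node_path}/{name}", value)

    def get_property(self, node_path, name):
        return self._persistence.load(f"{node_path}/{name}")


results = []
for backend in (InMemoryPersistence(), SqlitePersistence()):
    repo = Repository(backend)
    repo.set_property("/content/post1", "title", "NoSQL")
    results.append(repo.get_property("/content/post1", "title"))
```

Swapping the backend changes nothing in the calling code, which is the sense in which the modeling stays hierarchical regardless of the store.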