Data First is gaining traction in the industry [updated]

A while ago Stefano Mazzocchi wrote an excellent post titled "Data First vs. Structure First". In it he describes a strategy called "Data First", in which the data structures of an information system are, well, not defined in advance but are allowed to emerge over time.

He proclaims that:

1. Data First is how we learn and how languages evolve. We build rules, models, abstractions and categories in our minds after we have collected information, not before. This is why it's easier to learn a computer language from examples than from its theory, or a natural language by just being exposed to it instead of knowing all rules and exceptions.

2. Data First is more incrementally reversible, complexity in the system is added more gradually and it's more easily rolled back.

3. Because of the above, Data First's Return on Investment is more immediately perceivable, thus lends itself to be more easily bootstrappable.

And he gives these real-life examples of Data First approaches:

But look around now: the examples of 'data emergence' are multiplying and we use them every day. Google's PageRank, Amazon's co-shopping, Citeseer's co-citation, and Flickr co-tagging, Clusty clustering, these are all examples of systems that try to make structure emerge from data, instead of imposing the structure and pretend that people fill it up with data.

The opposite approach is Structure First. Stefano asks:

But then, one might ask, why is everybody so obsessed with design and order? Why is it so hard to believe that self-organization could be used outside the biological realm as a way to manage complex information systems?

One important thing can be noted:

On a local time-scale and once established, "Structure First" systems are more efficient.

This is a great and thought-provoking post, because I am, like many others, trained to think about data in terms of structures (first). But I realize that this way of thinking can also limit what can be achieved.

I would actually like to add one more aspect to Stefano's question of why we are "so obsessed with design and order": our tools. In many developers' minds, thinking about data is equivalent to mentally setting up tables and rows in a relational model. To a large extent it is the tools that shape our thinking.

But there are tools that do NOT force us to structure the data in advance or, even better, that allow us to structure as much or as little as we like. As you might expect on this blog, one tool to mention is a Java Content Repository like CRX. In a JCR you can go the full structure route and define every node type, but you can also leave all your data unstructured (like David suggests in his model) or do anything in between. That is why I have been suggesting that JCRs are well-suited for rapid application development. The structure is allowed to emerge as you go along.

(see Stefano again:)

But there is more: we all know that a complete mess is not a very good way to find stuff, so "data first" has to imply "structure later" to be able to achieve any useful capacity to manage information. Here is where things broke down in the past: not many believed that useful structures could emerge out of collected data.
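The "structure later" step can be illustrated with a small sketch. It is not taken from Stefano's post: the function name `emerging_structure`, the 50% threshold, and the sample records are assumptions made up for illustration. The idea is simply to collect records without a schema and then see which attributes turn out to be common enough to count as structure.

```python
from collections import Counter


def emerging_structure(records, threshold=0.5):
    """Return the attribute names used by at least `threshold` of the records.

    Hypothetical helper: the threshold is an arbitrary illustration value.
    """
    # Count in how many records each attribute name occurs.
    counts = Counter(key for record in records for key in record)
    total = len(records)
    return sorted(name for name, n in counts.items() if n / total >= threshold)


# Records collected "data first": no two need to share the same shape.
records = [
    {"title": "Post A", "author": "alice"},
    {"title": "Post B", "author": "bob", "tags": "databases"},
    {"title": "Post C", "author": "alice", "summary": "short"},
]

print(emerging_structure(records))  # ['author', 'title']
```

Here `title` and `author` emerge as the de facto structure, while `tags` and `summary` remain optional extras that never had to be planned for.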

Now, I am pleased to see that these ideas are gaining traction within the IT industry. Only recently two alternative implementations of these concepts have surfaced:

Amazon SimpleDB

Like all of the Amazon web services, SimpleDB is a large (massively scalable, I presume) hosted service. Amazon describes it as a spreadsheet, but to me it looks more like a hash map. What is important is that the value part of each key-value entry can hold multiple attributes:

In Amazon SimpleDB, to add the items above, you would PUT the three itemIDs into your domain along with the attribute-value pairs for each of the items. Without the specific syntax, it would look something like this:

- PUT (item, 123), (description, sweater), (color, blue), (color, red)
- PUT (item, 456), (description, dress shirt), (color, white), (color, blue)
- PUT (item, 789), (description, shoes), (color, black), (material, leather)

Amazon SimpleDB differs from tables of traditional databases in several important ways. First, you have the flexibility to easily go back later on and add new attributes that only apply to certain items - for example, sleeve length for dress shirts. Additionally there is no need to pre-define data types.[...]

Amazon SimpleDB automatically indexes all of your data, enabling you to easily query for an item based on attributes and their values. In the above example, you could submit a query for items where (color = blue AND description = dress shirt), and Amazon SimpleDB would quickly return item 456 as the result.

Note that there is no schema or data structure to set up. In fact, defining one is not even possible (as opposed to a JCR, where it is optional).
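To make the data model in the quote concrete, here is a minimal in-memory sketch. It is emphatically not the SimpleDB API: the `Domain` class and its `put` and `query` methods are hypothetical names, and the query only supports equality conditions joined by AND, as in Amazon's example.

```python
from collections import defaultdict


class Domain:
    """Toy model of a schema-free domain: item id -> attribute -> set of values."""

    def __init__(self):
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item_id, *pairs):
        # Each (name, value) pair is added; repeating a name makes it multi-valued.
        for name, value in pairs:
            self.items[item_id][name].add(value)

    def query(self, **conditions):
        # An item matches when every condition value is among its attribute values.
        return sorted(
            item_id
            for item_id, attrs in self.items.items()
            if all(value in attrs.get(name, ()) for name, value in conditions.items())
        )


domain = Domain()
domain.put("123", ("description", "sweater"), ("color", "blue"), ("color", "red"))
domain.put("456", ("description", "dress shirt"), ("color", "white"), ("color", "blue"))
domain.put("789", ("description", "shoes"), ("color", "black"), ("material", "leather"))

# New attributes can be added later, and only to the items they apply to.
domain.put("456", ("sleeve_length", "long"))

print(domain.query(color="blue", description="dress shirt"))  # ['456']
```

Nothing in the sketch declares which attributes exist or what type they have, which is the point the quoted description makes.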

David Dossot had the same idea I had when I stumbled across this: there should be a JCR interface to SimpleDB.

I would personally be interested in a JCR adapter for SimpleDB: this would enable a semantically meaningful data storage layer to be plugged on top of the Amazon service. Think about massively distributed content management system...


CouchDB

If you want to put big corporate Amazon at one end of the IT spectrum, you might put CouchDB at the opposite end: it is an experimental, geeky project in alpha state. It describes itself like this:

What CouchDB is

- A document database server, accessible via a RESTful JSON API.
- Ad-hoc and schema-free with a flat address space.

And further:

Unlike SQL databases which are designed to store and report on highly structured, interrelated data, CouchDB is designed to store and report on large amounts of semi-structured, document oriented data.[...].

In an SQL database, as needs evolve the schema and storage of the existing data must be updated. This often causes problems as new needs arise that simply weren't anticipated in the initial database designs, and makes distributed "upgrades" a problem for every host that needs to go through a schema update.

With CouchDB, no schema is enforced, so new document types with new meaning can be safely added alongside the old. [...]

You get the picture. The key phrase is "no schema" again.

I welcome these new(*) approaches to storing data. While they will certainly not make relational databases obsolete, they will broaden our minds when it comes to thinking about data. And they add another tool to our tool chest.
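The "new document types alongside the old" idea can be sketched in a few lines. This is an in-memory illustration, not CouchDB code: the `save` and `view` helpers, the `type` field convention, and the sample documents are all assumptions made for this example.

```python
import json

# Flat address space: document id -> JSON text. No schema is declared anywhere.
database = {}


def save(doc_id, document):
    database[doc_id] = json.dumps(document)


def view(map_fn):
    """Run a map function over every document and collect the rows it yields."""
    rows = []
    for doc_id in sorted(database):
        rows.extend(map_fn(doc_id, json.loads(database[doc_id])))
    return rows


# The original document type.
save("post-1", {"type": "post", "title": "Data First"})

# A document type invented later; the existing post is not touched or migrated.
save("comment-1", {"type": "comment", "post": "post-1", "author": "Lars"})


def comments_by_post(doc_id, doc):
    # Documents of other shapes are simply skipped.
    if doc.get("type") == "comment":
        yield (doc["post"], doc_id)


print(view(comments_by_post))  # [('post-1', 'comment-1')]
```

Structure lives in the functions that read the data rather than in the store itself, so adding a document type requires no schema update on any host.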

(*) Well, "new". JCRs have been around for quite a while. The rest of the industry has woken up. I am tempted to quote "Imitation is the sincerest form of flattery" :)


While we are "watching industry trends": it should also be noted that the two persistence technologies from above both expose a REST interface to applications. For JCRs this is implemented through Apache Sling or Microjax.

While this is not a real surprise given REST's success, it is still worth noting. Compare it to the situation a few years ago, when accessing data invariably meant installing a driver and opening a socket connection.
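As a rough illustration of what "no driver" means in practice, the sketch below builds a plain HTTP request for storing a JSON document, using only the Python standard library. The URL layout (base URL, then database, then document id) follows the pattern CouchDB documents; the host, port, database name, document id, and the `put_document` helper are assumptions for this example, and the request is only constructed, not sent.

```python
import json
from urllib.request import Request


def put_document(base_url, database, doc_id, document):
    """Build (but do not send) an HTTP PUT request carrying a JSON document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{base_url}/{database}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical server address and names, chosen only for illustration.
request = put_document(
    "http://localhost:5984", "blog", "post-1", {"title": "Data First"}
)

print(request.get_method(), request.full_url)
# PUT http://localhost:5984/blog/post-1
```

Passing the request to `urllib.request.urlopen` would send it; any HTTP client in any language could do the same, which is what makes the driver unnecessary.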

Update (3/1/2008)

Seems like IBM has "bought" CouchDB and plans to donate the code to Apache.



  • By Lars Trieloff - 10:01 AM on Dec 21, 2007
    Instead of a trackback: Data First and Enterprise 2.0.
  • By Alexander Klimetschek - 12:31 PM on Dec 21, 2007
    Ditto: Amazon SimpleDB vs. JCR