Entries filed under 'modelling'

    Posted by Kas Thomas JUL 16, 2010

    Posted in content management, crx, crx gems, development, http, javascript, modelling, rest and tutorial

    In a recent blog, I talked about how easy it is to store snippets of text from OpenOffice in a CRX repository using a little bit of JavaScript and the Sling REST API. While being able to store arbitrary bits of text this way is certainly useful, it would be even more useful to be able to store spreadsheet data. Of course, storing a spreadsheet in CRX, per se, is not much of a challenge: with WebDAV, it's a matter of drag and drop. But storing an entire spreadsheet as a single monolithic content item doesn't necessarily give you the greatest content-management bang for the buck. Often, what you really want to do is granularize the spreadsheet into records (or row data), and store individual rows as content items. (You could take it further and store individual cells as content items, but that would probably be overkill for most situations, although there's certainly nothing preventing you from doing it.)

    In the database world, where decisions often have to be made as to how best to decompose an XML document when mapping it to tables in a database, this general process (of decomposing a large document along the lines of its natural internal fine-structure) is known as shredding. What would be handy is to have an OpenOffice macro that could shred a spreadsheet into rows, and push the rows into nodes in CRX. That's what I propose to show you right now.

    It turns out to be pretty easy to parse a spreadsheet in an OpenOffice macro. Using JavaScript:

       // First, get the document object
       // from the scripting context
       oDoc = XSCRIPTCONTEXT.getDocument();

       // Next, get the XSpreadsheetDocument
       // interface from the document
       xSDoc = UnoRuntime.queryInterface(XSpreadsheetDocument, oDoc);

       // Then get a reference to the sheets for this doc
       var sheets = xSDoc.getSheets();

       // get Sheet1
       var sheet1 = sheets.getByName("Sheet1");

    Once you've gotten the sheet reference, you can use it to obtain a cell reference:

    var cell = sheet1.getObject().getCellByPosition( column, row );

    The cell, in turn, contains data, which (depending on whether you're dealing with a native OpenOffice spreadsheet versus a freshly imported CSV file) can be a floating-point value, a string, or something else. For purposes of this discussion I'm going to assume that you've just imported a CSV or tab-delimited file into OpenOffice, in which case all cells will automatically contain string data. To get the string data from a cell in a freshly imported CSV file, you have to do:

    var content = cell.getFormula();

    At least, that's what works in OpenOffice 3.2.

    The general plan of attack, then, is to come up with a function that can parse a row's worth of data out of a spreadsheet; and have another function that can persist a row of data as a content item in CRX. Then it should be possible to create a macro that simply loops over all rows in a spreadsheet and pushes them out to the repository.

    The row-parsing function is pretty straightforward:

    function getRow( sheet, rownumber, startColumn, endColumn )  {

        var obj = sheet.getObject();
        var record = [];

        for (var k = startColumn; k < endColumn ; k++) {
             var cell = obj.getCellByPosition( k, rownumber );
             var content = cell.getFormula();
             record.push( content );
        }

        return record;
    }

    Given a reference to a Sheet, along with a row number and the starting and ending column numbers, this function loops through cells and pushes cell values into an array. The returned array represents a row's worth of data.
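    To see the indexing convention in isolation, here is a small sketch that runs the same loop against a stand-in for the sheet. The makeFakeSheet() helper and its sample data are hypothetical (they are not part of the OpenOffice API); they only mimic the getObject().getCellByPosition( column, row ).getFormula() call chain so the function can be tried outside OpenOffice. Note that endColumn is exclusive: the loop stops one column before it.

```javascript
// Hypothetical stand-in for an OpenOffice sheet, backed by a plain 2-D array.
// It mimics only the three calls that getRow() relies on.
function makeFakeSheet( grid ) {
    return {
        getObject: function( ) {
            return {
                getCellByPosition: function( column, row ) {
                    return {
                        getFormula: function( ) { return grid[ row ][ column ]; }
                    };
                }
            };
        }
    };
}

// Same logic as the getRow() function above
function getRow( sheet, rownumber, startColumn, endColumn ) {
    var obj = sheet.getObject();
    var record = [];
    for (var k = startColumn; k < endColumn; k++) {
        record.push( obj.getCellByPosition( k, rownumber ).getFormula() );
    }
    return record;
}

// Made-up sample data: row 0 holds the column names
var fakeSheet = makeFakeSheet([
    [ "Year", "Length", "Title" ],
    [ "1999", "120",    "Example Film" ]
]);

var header   = getRow( fakeSheet, 0, 0, 3 ); // columns 0, 1 and 2
var firstTwo = getRow( fakeSheet, 1, 0, 2 ); // column 2 is not included
```

    With eight columns of data, then, the call would pass 0 and 8, which reads columns 0 through 7.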

    To persist a row to CRX, we have a function that looks like this:

    function persistRow( sheet, rownumber, startColumn, endColumn ) {

       // get first row of data (column names)
       var columnNames = getRow( sheet, 0, startColumn, endColumn );

       // get specified record
       var row = getRow( sheet, rownumber, startColumn, endColumn );

       // build the request
       var request = {};
       request[":nameHint"] = row[2]; // Title
       request["sling:resourceType"] = "films";
       for ( var i = 0; i < columnNames.length; i++) {
           request[ columnNames[ i ] ] = row[ i ];
       }   
       var data = createRequest( request );

       // where to store it
       var url = "http://localhost:7402/content/films/";

       // finally, hit the repository
       var response = doJavaPOST( url, data );

       return response;
    }

    Notice that the code assumes that the first row of "data" in the spreadsheet contains the column names. This was in fact the case with the spreadsheet I used for testing this macro, namely a spreadsheet called a1-film.csv, representing 1741 movies catalogued by Georgia Tech's College of Computing. Each row in the spreadsheet has information for a particular film, such as the film's title, the year the film was made, its genre, the name of the director, major actors and actresses, etc.
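    The header-row convention can be sketched on its own, outside the macro. The function name toRecord() and the sample values below are assumptions for illustration. The sketch also skips columns whose header cell is empty, which the macro itself does not do; that could be worth considering if the column count entered is larger than the number of named columns.

```javascript
// Pair column names (taken from row 0) with the values of one data row.
// Columns with an empty header are skipped, since an empty
// property name would be of no use in the repository.
function toRecord( columnNames, row ) {
    var record = {};
    for ( var i = 0; i < columnNames.length; i++ ) {
        var name = String( columnNames[ i ] );
        if ( name.length === 0 ) {
            continue;
        }
        record[ name ] = row[ i ];
    }
    return record;
}

// Made-up sample data; the fourth column has no header
var columnNames = [ "Year", "Length", "Title", "" ];
var row         = [ "1999", "120", "Example Film", "stray value" ];
var record      = toRecord( columnNames, row );
```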

    Without further ado, here is the complete code for the OpenOffice macro:



    // Spreadsheet2CRX Macro
    // Kas Thomas, 15 July 2010
    // Public domain. Use at your own risk.
    // Tested with v3.2 of OpenOffice.org

    importClass(Packages.com.sun.star.uno.UnoRuntime);
    importClass(Packages.com.sun.star.sheet.XSpreadsheetDocument);

    // Do a POST
    function doJavaPOST( url, content ) {
            var reply = "";
            var responseCode = "";
            try {
                    var URL = new java.net.URL( url );
                    var urlConn =
                       URL.openConnection( );
                    urlConn.setDoOutput ( true );
                    urlConn.setRequestMethod( "POST" );
                    urlConn.setUseCaches( false );
                    urlConn.setRequestProperty ("Content-Type",
                    "application/x-www-form-urlencoded" );
                    var printout =
                    new java.io.DataOutputStream ( urlConn.getOutputStream ( ) );
                    printout.writeBytes ( content );
                    printout.flush ( );
                    printout.close ( );
                    responseCode = urlConn.getResponseCode();
            }
            catch(exception) {
                    java.lang.System.out.println( exception.toString() );
            }

            return responseCode;
    }

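    // munge together the form data
    // into "name1=value1&name2=value2" etc,
    // URL-encoding names and values so that
    // characters like "&" or "=" in a cell
    // cannot corrupt the request
    function createRequest( object ){

            var data = [];
            for ( var i in object )
            data.push( encodeURIComponent( i ) + "=" +
                    encodeURIComponent( String( object[ i ] ) ) );

            var dataString = data.join( "&" );
            return dataString;
    }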

    // Modal dialog with OK/cancel and a text field
    function prompt( msg ) {
            var swing = Packages.javax.swing;
            var text = swing.JOptionPane.showInputDialog(
            new java.awt.Frame(), msg );
            return ( null == text ) ? "" : text; // always return a string
    }

    // a Swing UI for displaying console info
    function EditorPane( ) {

            Swing = Packages.javax.swing;
            this.pane = new Swing.JEditorPane("text/html","" );
            this.jframe = new Swing.JFrame( );
            this.jframe.setBounds( 100,100,500,400 );
            var editorScrollPane = new Swing.JScrollPane(this.pane);
            editorScrollPane.setVerticalScrollBarPolicy(
            Swing.JScrollPane.VERTICAL_SCROLLBAR_ALWAYS);
            editorScrollPane.setPreferredSize(new java.awt.Dimension(250, 250));
            editorScrollPane.setMinimumSize(new java.awt.Dimension(10, 10));
            // add the scroll pane before showing the frame,
            // so the frame is laid out with its content
            this.jframe.getContentPane().add( editorScrollPane );
            this.jframe.setVisible( true );

            // public methods
            this.getPane = function( ) { return this.pane; }
            this.getJFrame = function( ) { return this.jframe; }
    }

    function getRow( sheet, rownumber, startColumn, endColumn )  {

            var obj = sheet.getObject();
            var record = [];

            for (var k = startColumn; k < endColumn ; k++) {
                    var cell = obj.getCellByPosition( k, rownumber );
                    var content = cell.getFormula();
                    record.push( content );
            }

            return record;
    }

    function persistRow( sheet, rownumber, startColumn, endColumn ) {

            // get first row of data (column names)
            var columnNames = getRow( sheet, 0, startColumn, endColumn );

            // get specified record
            var row = getRow( sheet, rownumber, startColumn, endColumn );

            // build the request
            var request = {};
            request[":nameHint"] = row[2]; // Title
            request["sling:resourceType"] = "films";
            for ( var i = 0; i < columnNames.length; i++) {
                    request[ columnNames[ i ] ] = row[ i ];
            }
            var data = createRequest( request );

            // where to store it
            var url = "http://localhost:7402/content/films/";

            // finally, hit the repository
            var response = doJavaPOST( url, data );

            return response;
    }

    ( function main( ) {

            //get the document object from the scripting context
            oDoc = XSCRIPTCONTEXT.getDocument();

            //get the XSpreadsheetDocument interface from the document
            xSDoc = UnoRuntime.queryInterface(XSpreadsheetDocument, oDoc);

            // get a reference to the sheets for this doc
            var sheets = xSDoc.getSheets();

            // get Sheet1
            var sheet1 = sheets.getByName("Sheet1");

            // construct a new EditorPane
            var editor = new EditorPane( );
            var pane = editor.getPane( );

            var size = prompt("Enter total rows and total columns, separated by a comma (e.g., '100,8')");
            if ( !size )
            return "No row/column info supplied.";

            var rows = Number( size.substring(0,size.indexOf(",")) );
            var cols = Number( size.substring( size.indexOf(",")+1) );

            var errors = 0;
            for ( var i = 1; i <= rows; i++) {
                    var response = persistRow( sheet1, i, 0, cols );
                    var text = pane.getText();
                    pane.setText( text + "\nProcessing: " + i );
                    if ( response.toString().indexOf("5")==0 )
                    errors++;
                    // provide a little bit of throttling:
                    java.lang.Thread.sleep( 200 );
            }
            pane.setText( pane.getText() + "\n" + errors + " errors" );
    })();




    You'll notice that the code creates a JEditorPane window to act as an error console. When you run the macro, a JOptionPane dialog appears, asking you to supply the number of rows and columns in the spreadsheet. (For the Georgia Tech spreadsheet, you can enter "1741,8", minus quotes.) Once you dismiss the dialog, the code goes to work looping over all the rows in the spreadsheet, posting each row to CRX at a path of http://localhost:7402/content/films/.
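    The macro splits the dialog input with indexOf() and substring(), so input without a comma ends up as zero rows and the loop never runs. A slightly more defensive parser could look like the following sketch. The name parseSize() and the choice to return null on bad input are assumptions for illustration, not part of the macro.

```javascript
// Parse input such as "1741,8" into row and column counts.
// Returns null when the text is not two positive whole numbers.
function parseSize( text ) {
    var match = /^\s*(\d+)\s*,\s*(\d+)\s*$/.exec( String( text ) );
    if ( !match ) {
        return null;
    }
    var rows = parseInt( match[ 1 ], 10 );
    var cols = parseInt( match[ 2 ], 10 );
    if ( rows < 1 || cols < 1 ) {
        return null;
    }
    return { rows: rows, cols: cols };
}

var size = parseSize( "1741,8" );  // rows 1741, cols 8
var bad  = parseSize( "1741" );    // null: no comma, no column count
```

    A caller could then stop early with a message when the result is null, instead of reporting "0 errors" for a run that processed nothing.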

    Each new node is named according to a :nameHint parameter based on the Title of the film.

    Notice also, we designate a sling:resourceType for each node of "films." (This happens in the persistRow() function.) This fact will be important in a later blog when I show how to write server-side scripts that handle various types of requests for film data.
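    Because the film title ends up both in :nameHint and in a regular property, it is worth seeing what the form body looks like when a title contains reserved characters. The following is a minimal sketch in plain JavaScript; the function name encodeForm() and the sample title are made up for illustration. It percent-encodes names and values with the standard encodeURIComponent() function.

```javascript
// Build an application/x-www-form-urlencoded body from a plain object,
// percent-encoding names and values so reserved characters survive.
function encodeForm( fields ) {
    var pairs = [];
    for ( var name in fields ) {
        if ( Object.prototype.hasOwnProperty.call( fields, name ) ) {
            pairs.push( encodeURIComponent( name ) + "=" +
                        encodeURIComponent( String( fields[ name ] ) ) );
        }
    }
    return pairs.join( "&" );
}

// Made-up example row
var body = encodeForm({
    ":nameHint": "Example & Film",
    "sling:resourceType": "films",
    "Title": "Example & Film"
});
// body is now:
// %3AnameHint=Example%20%26%20Film&sling%3AresourceType=films&Title=Example%20%26%20Film
```

    Without the encoding step, the "&" in the title would be read as a separator between two parameters, and the stored title would be cut short.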

    And that's about it: Now you know how to shred a spreadsheet (say that 3 times in a row fast...) and store the results in CRX, using OpenOffice.

    Posted by Michael Marth JUN 10, 2010

    Posted in cms, everything is content and modelling

    Last week I uploaded a Twitter clone application to Day's Package Share. The application's content package not only contains some sample content and the jsp files with the application code. It also includes sample users and their respective access rights on different JCR nodes. Putting all this information in one content package is possible (and even simple) because users, ACLs, etc. are stored in the content repository as JCR nodes.

    The experience of putting together this package nicely reminded me of the power of the concept of storing all of a web application's artefacts in the content repository - which can be considered the technical implementation of Day's mantra "everything is content".

    Classically, the image of web content management systems one has in mind looks something like this:

    [Figure: content as the input, a web page as the end result of a rendering process]

    Content is the input and a web page is the end result of some rendering process. There is nothing wrong with that image, but considering "everything is content" an alternative prototypical image of a CMS came to my mind:

    [Figure: the repository as the place where all aspects of the web site are stored and managed]

    A web content management system's repository is the place to store and manage all aspects that make up your web site. The web page is not only the end result, but also the source.

    Posted by Michael Marth MAY 04, 2010

    Posted in agile, data first, davids model, jcr and modelling

    Recently, I read up on quite a number of NoSQL protagonists. Of course, one dominant theme in NoSQL land is "schemaless" as opposed to the full-schema nature of relational databases. As usual, both approaches have their specific pros and cons. A common criticism of schemaless data stores is that the entropy of the data would create problems in the long run when too much unstructured data has been amassed. On the other hand, full-schema databases are much less flexible or downright the wrong tool for unstructured data.

    In this post I would like to point out that you do not necessarily have to choose between those extremes: JCR-based data stores allow you to store unstructured data, fully structured data and anything in between. For lack of a better term I would like to call this a "schema-optional" data store with "semi-structured" data.

    • The JCR node type nt:unstructured is designed to accept any properties, so you can dump at will strings, dates or even binaries into such a node. This node type is very useful to get started with coding an application when you do not know what the end result should look like. It allows for a development approach coined "data first, structure later" where structure emerges from data, rather than be defined a priori.
    • On the other end of the spectrum you can have rigidly defined node types. JCR allows you to specify e.g. mandatory properties, default values or the allowed child node types in a node hierarchy. The Apache Jackrabbit site has a good overview of the Compact Namespace and Node Type Definition which is a notation used to define such structure.

    In between these two extreme cases any middle ground is possible in JCR repositories:

    • First, a rigid node type definition for a specific node can define "residual" properties. Such an approach allows the application to set not only the properties that were defined a priori in the node type definition, but also anything else. This is particularly useful for scenarios where only a part of the requirements is known beforehand or where the requirements are known to evolve over time. You can define the known parts but an application can still freely write anything into the node as if it were unstructured.
    • Second, it should also be noted that these structured, unstructured and semi-structured nodes can happily live next to each other in the same repository tree. So different parts of your application can make use of different levels of structure not only through different node types, but also through different parts in the node hierarchy.
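    The middle ground can be illustrated with a toy model in plain JavaScript. This is not the JCR API: the filmType definition, the residual flag and the validate() function are all hypothetical, and a real repository enforces node types itself. The sketch only shows the rule: declared properties are checked, and anything else is accepted only if the type allows residual properties.

```javascript
// Toy node type: one mandatory declared property, one optional one.
// "residual: true" plays the role of a residual property definition.
var filmType = {
    properties: {
        title: { mandatory: true },
        year:  { mandatory: false }
    },
    residual: true
};

// Return a list of problems; an empty list means the node is acceptable.
function validate( type, node ) {
    var problems = [];
    var name;
    for ( name in type.properties ) {
        if ( type.properties[ name ].mandatory && !( name in node ) ) {
            problems.push( "missing mandatory property: " + name );
        }
    }
    for ( name in node ) {
        if ( !( name in type.properties ) && !type.residual ) {
            problems.push( "property not allowed: " + name );
        }
    }
    return problems;
}

// "mood" is not declared, but the residual definition lets it through
var flexible = validate( filmType, { title: "Example Film", mood: "noir" } );

// The same node against a type without residual properties is rejected
var strictType = { properties: filmType.properties, residual: false };
var strict = validate( strictType, { title: "Example Film", mood: "noir" } );
```

    Here flexible is an empty list, while strict contains one entry complaining about the undeclared "mood" property.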

    With JCR 2.0 it has become quite a bit easier to evolve the structure (after all, the mantra is "data first, structure later", not "structure never"): one can now change the node types of existing nodes. That facilitates a migration from, say, nt:unstructured nodes to more structured types.

    Posted by Lars Trieloff JAN 25, 2008

    Posted in documentation, jcr and modelling

    Since content-centric applications are content-driven, modeling the content structure is the most crucial part of documenting the architecture of your application. A big part of the general architecture is usually determined by the framework you choose: if you are using Sling, it is Content-Behavior-Appearance; if you are using Apache Cocoon, it is content pipelining; and so on. What makes your application special is the content structure, or content model. Because understanding the content structure is crucial for communicating the architecture of your application, you should spend a considerable amount of time on designing, documenting and communicating the content structure to other developers. In JCR, content has two general properties that deserve documentation. The first is the location of nodes in the content tree. The most straightforward way of documenting this is to express the tree structure in a diagram like the one below, or to use a JCR repository browser such as the CRX explorer that comes with Day's CRX repository or the open source tool JCR Explorer.

    There are multiple downsides to this approach. First, these autogenerated tree models communicate the importance and relation of portions of the content tree poorly, as they can only express parent-child relationships and, to a certain degree, node types. Second, as the tree grows, it becomes increasingly complex and confusing to the observer. If you really care about communicating your content structure, then drive structure documentation; do not let it happen.

    The second aspect of content modeling for JCR is the node type. JCR has a complex node typing system that allows multiple inheritance, mixins, child-nodes and references. For real-world application documentation three approaches can be found:

    • using standard CND notation - this is the most obvious approach as you have to write the CND files anyway and it provides a very compact notation that is able to express every aspect of the node type. Unfortunately, this CND notation is optimized for writes, not readability or comprehensibility. In order to make it easy to understand, the following two approaches are being used.
    • automatically generated HTML nodetype documentation, using a tool like Jackrabbit-NTdoc, which basically takes the node type definitions and automatically translates them into a number of HTML pages that are browsable similar to Javadoc and document every aspect that can be found in the node type definition.
    • ad-hoc graphical notations. These notations often are inspired by UML or entity relationship diagrams, but seldom reused or documented. While they are more readable than the CND notation or browsable HTML documentation, the lack of standardization and meta-documentation makes them hardly portable.

    A main advantage of these graphical notations however is that you as the architect can decide what is important, what is related and what is obvious and does not need to be documented at a high level. This again shows that you should drive your content model documentation and not let it happen.

    The notation proposed below uses a combination of a graphical treemap notation for describing the content tree and a UML-class-diagram inspired notation for documenting node types, node type inheritance and node references. A main advantage of this notation, besides the re-use of existing notations like UML or Fundamental Modeling Concepts (FMC), is that it offers a connection between tree structure and node type.

    The upper part of the chart features an example content tree in treemap notation. Speaking in FMC terms, this content tree is a set of nested places and this nesting can be driven by the architect in order to express relation (places are next to each other), containment (one place in another) and importance (place is bigger). You can even "zoom in" on parts of the chart to explain content structure more in-depth. A good example for variable content can be found in /apps/wiki/themes where any number of themes can be stored, but two, "default" and "extra", are mentioned as examples.

    This treemap structure is both visually compelling and compact, so it can be combined with the UML-inspired node type notation at the bottom of the chart. This notation uses UML class diagrams to express node types (bold font, shaded background) and Mixins (italic font, white background). Node types can have three types of relations: inheritance, containment and reference. For inheritance the default solid line with a hollow triangle arrowhead at the super type is used. For child nodes and associations a basic "association" line without arrowheads is used. For the cardinality of relationships: as there is only one parent node or referencing node, only the cardinality indicator at the child or referenced node type is used. Here we use a simple-regular-expressions inspired syntax where * means any number of nodes, + means at least one node, n means exactly n nodes, and so on.

    Using a dotted line you can map node types to places in the treemap where this node type can be used.
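    The cardinality indicators described above are simple enough to check mechanically. The following sketch is only an illustration of the three cases the text spells out; the function name satisfiesCardinality() is made up, and any further indicators covered by "and so on" are not handled.

```javascript
// Check a child or reference count against a cardinality indicator:
//   "*"  any number of nodes (including none)
//   "+"  at least one node
//   "3"  exactly that many nodes
function satisfiesCardinality( indicator, count ) {
    if ( indicator === "*" ) {
        return count >= 0;
    }
    if ( indicator === "+" ) {
        return count >= 1;
    }
    if ( /^\d+$/.test( indicator ) ) {
        return count === parseInt( indicator, 10 );
    }
    throw new Error( "Unknown cardinality indicator: " + indicator );
}

var anyThemes   = satisfiesCardinality( "*", 0 ); // true
var needsOne    = satisfiesCardinality( "+", 0 ); // false
var exactlyFour = satisfiesCardinality( "4", 4 ); // true
```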

    To sum it up, the proposed notation is a tool that helps understanding and communicating content-centric software systems. It is not intended to be used to automatically generate code or to be generated automatically from code, instead it is a second description of your software system that lives beside the code of your system (as the primary description) and is suited for technical communication with humans.

    Posted by Michael Marth DEC 07, 2007

    Posted in jcr, modelling and poll

    While writing about how this blog was built (soon to be published) I stumbled over a little question: how should I describe the way the content is structured? Sure, the blog has only a tiny content model, but still I wondered what would be the "right" way.

    One difficulty in describing a content model lies in the fact that you can structure your content with node types or you can use unstructured nodes and utilize the content hierarchy. Well, and then there are all sorts of mixtures of the two in between. This gives you basically two separate hierarchies (the node type hierarchy and the content hierarchy) that might be partially interwoven (ahm, no cms pun intended).

    There are also some communication aspects of describing the model. One aspect is the audience one tries to address. For example the documentation of a project might require a very formal way of describing the model. On the other hand, in a blog or in an email the main purpose of describing the model is getting across the essence, rather than being 100% accurate. Also, it probably makes a difference if one is concerned with a very large model with, say, 30 different node types or a very small one. However, these two aspects are not specific to content repository models, but also present in e.g. relational data modeling.

    So I started to investigate this question a bit and look around at what others are doing to describe their content models. On one end of the spectrum there is the Compact Namespace and Node Type Definition (CND). It is suitable for formally describing the node type hierarchy, but not the content hierarchy. The CRX Node Type Admin application exports definitions in this format. The CND is an offspring of the description language used in the JSR 170 specs, which is considered to be more verbose.

    An obvious advantage of using this formal approach is that there is no room for interpretation and that it can be used by machines.

    On the other end of the spectrum is some ad-hoc ascii art like this:

    blog [nt:unstructured]
    |  +sling:resourceType[string]
    |--post [nt:unstructured]
    |    +title[string]
    |    +body[string]
    |    +sling:resourceType[string]
    |----comment [nt:unstructured]
    |      +body[string]
    |      +sling:resourceType[string]
    |----attachment [nt:file]

    It is not formalized but has the advantage that it is easy to understand. It is also capable of explaining a structure of interwoven hierarchies (node type and node hierarchy), at least for small content models. For larger or more complex models this approach will break down.
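    An outline of this kind can also be generated from a small description of the model, which keeps it consistent when the model changes. The sketch below is one possible way to do that, not a tool mentioned in this post: the model object, its field names and the render() function are hypothetical, the layout is simplified to plain indentation, and it assumes that comment and attachment are children of post.

```javascript
// Hypothetical in-memory description of the blog content model shown above
var model = {
    name: "blog", type: "nt:unstructured",
    properties: [ "sling:resourceType[string]" ],
    children: [ {
        name: "post", type: "nt:unstructured",
        properties: [ "title[string]", "body[string]", "sling:resourceType[string]" ],
        children: [
            { name: "comment", type: "nt:unstructured",
              properties: [ "body[string]", "sling:resourceType[string]" ] },
            { name: "attachment", type: "nt:file" }
        ]
    } ]
};

// Render a node and its descendants as indented text lines.
// Properties are prefixed with "+", child nodes are not.
function render( node, depth ) {
    var indent = new Array( depth + 1 ).join( "  " );
    var lines = [ indent + node.name + " [" + node.type + "]" ];
    ( node.properties || [] ).forEach( function( p ) {
        lines.push( indent + "  +" + p );
    } );
    ( node.children || [] ).forEach( function( child ) {
        lines = lines.concat( render( child, depth + 1 ) );
    } );
    return lines;
}

var lines   = render( model, 0 );
var outline = lines.join( "\n" );
```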

    Jukka Zitting has used a graphical ad-hoc method in this presentation (see slides 8-10):

    He displays both the node type hierarchy and the content hierarchy. I like this approach; it is quite easy to understand (and will take you further regarding complex models than ascii art, I guess).

    I would really like to know what you people are using for communicating your content models. Below, there is a little poll. Please let us know what you do. Cheers.