In a recent blog, I talked about how easy it is to store snippets of text from OpenOffice in a CRX repository using a little bit of JavaScript and the Sling REST API. While being able to store arbitrary bits of text this way is certainly useful, it would be even more useful to be able to store spreadsheet data. Of course, storing a spreadsheet in CRX, per se, is not much of a challenge: with WebDAV, it's a matter of drag and drop. But storing an entire spreadsheet as a single monolithic content item doesn't necessarily give you the greatest content-management bang for the buck. Often, what you really want to do is granularize the spreadsheet into records (or row data), and store individual rows as content items. (You could take it further and store individual cells as content items, but that would probably be overkill for most situations, although there's certainly nothing preventing you from doing it.)
In the database world, where decisions often have to be made as to how best to decompose an XML document when mapping it to tables in a database, this general process (of decomposing a large document along the lines of its natural internal fine-structure) is known as shredding. What would be handy is to have an OpenOffice macro that could shred a spreadsheet into rows, and push the rows into nodes in CRX. That's what I propose to show you right now.
It turns out to be pretty easy to parse a spreadsheet in an OpenOffice macro. Using JavaScript:
// First, get the document object
// from the scripting context
oDoc = XSCRIPTCONTEXT.getDocument();
// Next, get the XSpreadsheetDocument
// interface from the document
xSDoc = UnoRuntime.queryInterface(XSpreadsheetDocument, oDoc);
// Then get a reference to the sheets for this doc
var sheets = xSDoc.getSheets();
// get Sheet1
var sheet1 = sheets.getByName("Sheet1");
Once you've gotten the sheet reference, you can use it to obtain a cell reference:
var cell = sheet.getObject().getCellByPosition( column, row );
The cell, in turn, contains data, which (dependening on whether you're dealing with a native OpenOffice spreadsheet versus a freshly imported CSV file) can be a floating-point value, a string, or something else. For purposes of this discussion I'm going to assume that you've just imported a CSV or tab-delimited file into OpenOffice, in which case all cells will automatically contain string data. To get the string data from a cell in a freshly imported CSV file, you have to do:
var content = cell.getFormula();
At least, that's what works in OpenOffice 3.2.
The general plan of attack, then, is to come up with a function that can parse a row's worth of data out of a spreadsheet; and have another function that can persist a row of data as a content item in CRX. Then it should be possible to create a macro that simply loops over all rows in a spreadsheet and pushes them out to the repository.
The row-parsing function is pretty straightforward:
function getRow( sheet, rownumber, startColumn, endColumn ) {
var obj = sheet.getObject();
var record = [];
for (var k = startColumn; k < endColumn ; k++) {
var cell = obj.getCellByPosition( k, rownumber );
var content = cell.getFormula();
record.push( content );
}
return record;
}
Given a reference to a Sheet, along with a row number and the starting and ending column numbers, this function loops through cells and pushes cell values into an array. The returned array represents a row's worth of data.
To persist a row to CRX, we have a function that looks like this:
function persistRow( sheet, rownumber, startColumn, endColumn ) {
// get first row of data (column names)
var columnNames = getRow( sheet, 0, startColumn, endColumn );
// get specified record
var row = getRow( sheet, rownumber, startColumn, endColumn );
// build the request
var request = {};
request[":nameHint"] = row[2]; // Title
request["sling:resourceType"] = "films";
for ( var i = 0; i < columnNames.length; i++) {
request[ columnNames[ i ] ] = row[ i ];
}
var data = createRequest( request );
// where to store it
var url = "http://localhost:7402/content/films/";
// finally, hit the repository
var response = doJavaPOST( url, data );
return response;
}
Notice that the code assumes that the first row of "data" in the spreadsheet contains the column names. This was in fact the case with the test-spreadsheet I used for testing this macro, namely a spreadsheet called a1-film.csv, representing 1741 movies catalogued by Georgia Tech's College of Computing. Each row in the spreadsheet has information for a particular film, such as the film's title, the year the film was made, its genre, the name of the director, major actors and actresses, etc.
Without further ado, here is the complete code for the OpenOffice macro:
// Spreadsheet2CRX Macro
// Kas Thomas, 15 July 2010
// Public domain. Use at your own risk.
// Tested with v3.2 of OpenOffice.org
importClass(Packages.com.sun.star.uno.UnoRuntime);
importClass(Packages.com.sun.star.sheet.XSpreadsheetDocument);
// Do a POST
function doJavaPOST( url, content ) {
var reply = "";
var responseCode = "";
try {
var URL = new java.net.URL( url );
var urlConn =
URL.openConnection( );
urlConn.setDoOutput ( true );
urlConn.setRequestMethod( "POST" );
urlConn.setUseCaches( false );
urlConn.setRequestProperty ("Content-Type",
"application/x-www-form-urlencoded" );
var printout =
new java.io.DataOutputStream ( urlConn.getOutputStream ( ) );
printout.writeBytes ( content );
printout.flush ( );
printout.close ( );
responseCode = urlConn.getResponseCode();
}
catch(exception) {
java.lang.System.out.println( exception.toString() );
}
return responseCode;
}
// munge together the form data
// into "name1=value1&name2=value2" etc
function createRequest( object ){
var data = [];
for ( var i in object )
data.push( i + "=" + object[ i ].toString( ) );
var dataString = data.join( "&" );
return dataString;
}
// Modal dialog with OK/cancel and a text field
function prompt( msg ) {
var swing = Packages.javax.swing;
var text = swing.JOptionPane.showInputDialog(
new java.awt.Frame(), msg );
return ( null == text ) ? "" : text; // always return a string
}
// a Swing UI for displaying console info
function EditorPane( ) {
Swing = Packages.javax.swing;
this.pane = new Swing.JEditorPane("text/html","" );
this.jframe = new Swing.JFrame( );
this.jframe.setBounds( 100,100,500,400 );
var editorScrollPane = new Swing.JScrollPane(this.pane);
editorScrollPane.setVerticalScrollBarPolicy(
Swing.JScrollPane.VERTICAL_SCROLLBAR_ALWAYS);
editorScrollPane.setPreferredSize(new java.awt.Dimension(250, 250));
editorScrollPane.setMinimumSize(new java.awt.Dimension(10, 10));
this.jframe.setVisible( true );
this.jframe.getContentPane().add( editorScrollPane );
// public methods
this.getPane = function( ) { return this.pane; }
this.getJFrame = function( ) { return this.jframe; }
}
function getRow( sheet, rownumber, startColumn, endColumn ) {
var obj = sheet.getObject();
var record = [];
for (var k = startColumn; k < endColumn ; k++) {
var cell = obj.getCellByPosition( k, rownumber );
var content = cell.getFormula();
record.push( content );
}
return record;
}
function persistRow( sheet, rownumber, startColumn, endColumn ) {
// get first row of data (column names)
var columnNames = getRow( sheet, 0, startColumn, endColumn );
// get specified record
var row = getRow( sheet, rownumber, startColumn, endColumn );
// build the request
var request = {};
request[":nameHint"] = row[2]; // Title
request["sling:resourceType"] = "films";
for ( var i = 0; i < columnNames.length; i++) {
request[ columnNames[ i ] ] = row[ i ];
}
var data = createRequest( request );
// where to store it
var url = "http://localhost:7402/content/test/";
// finally, hit the repository
var response = doJavaPOST( url, data );
return response;
}
( function main( ) {
//get the document object from the scripting context
oDoc = XSCRIPTCONTEXT.getDocument();
//get the XSpreadsheetDocument interface from the document
xSDoc = UnoRuntime.queryInterface(XSpreadsheetDocument, oDoc);
// get a reference to the sheets for this doc
var sheets = xSDoc.getSheets();
// get Sheet1
var sheet1 = sheets.getByName("Sheet1");
// construct a new EditorPane
var editor = new EditorPane( );
var pane = editor.getPane( );
var size = prompt("Enter total rows and total columns, separated by a comma (e.g., '100,8')");
if ( !size )
return "No row/column info supplied.";
var rows = Number( size.substring(0,size.indexOf(",")) );
var cols = Number( size.substring( size.indexOf(",")+1) );
var errors = 0;
for ( var i = 1; i <= rows; i++) {
var response = persistRow( sheet1, i, 0, cols );
var text = pane.getText();
pane.setText( text + "\nProcessing: " + i );
if ( response.toString().indexOf("5")==0 )
errors++;
// provide a little bit of throttling:
java.lang.Thread.sleep( 200 );
}
pane.setText( pane.getText() + "\n" + errors + " errors" );
})();
You'll notice that the code creates a JEditorPane window to act as an error console. When you run the macro, a JOptionPane dialog appears, asking you to supply the number of rows and columns in the spreadsheet. (For the Georgia Tech spreadsheet, you can enter "1741,8", minus quotes.) Once you dismiss the dialog, the code goes to work looping over all the rows in the spreadsheet, posting each row to CRX at a path of http://localhost:7402/content/films/.
Each new node is named according to a :nameHint parameter based on the Title of the film.
Notice also, we designate a sling:resourceType for each node of "films." (This happens in the persistRow() function.) This fact will be important in a later blog when I show how to write server-side scripts that handle various types of requests for film data.
And that's about it: Now you know how to shred a spreadsheet (say that 3 times in a row fast...) and store the results in CRX, using OpenOffice.