Archive for 'November 2009'

    Posted by Michael Marth NOV 26, 2009

    Drinking with David has inspired Jon Marks (aka @McBoof) to produce a brilliant drawing of the content technology landscape. Beer :)

    Posted by Joerg Hoh NOV 23, 2009

    I have often thought about what a perfect monitoring solution should look like; the following requirements come to mind:

    • 1. consistent: When the monitoring indicates a problem, there really is a problem with the application.
    • 2. reliable: When there is a problem that keeps users from properly working with the application, the monitoring will indicate it before the support line is overwhelmed by calls reporting that the application has a problem.
    • 3. informative: When the monitoring indicates a problem, there should be a recommendation for how the problem can be fixed with minimal impact. This recommendation can be documented either offline (operations guidelines) or in the monitoring itself.
    • 4. proactive: The monitoring should detect problems before they happen. That sounds strange, but many problems (e.g. excessive memory use) can be reported before they actually have a big impact.

    Requirement 2 is a hard one; essentially it requires that all problem situations are known in advance and that the presence of such a situation can be detected by the monitoring. In fact, most situations are not known to anybody until they happen and have been fully analyzed and understood.

    Requirement 1 can be fulfilled very easily: never report a problem. From a purely academic point of view the requirement is then fulfilled, but that is not really usable. A more useful approach is to report only when it is 99.99% clear that no user can work anymore (e.g. when a vital OSGi service on a CQ authoring system is no longer available). But the more subtle problems cannot be caught that way.

    Requirement 3 requires a certain amount of experience with the application and the will to write good documentation that is kept up to date. And requirement 4 requires knowledge of typical problems and their early warning signs.

    All these requirements usually call for an extension to the application: an interface from which the monitoring can fetch data and then decide what to do with it. But monitoring is the neglected stepchild of application development. It is neither a functional requirement nor a non-functional requirement with the perceived importance of usability, performance or availability (by the way: how should we measure that? No one cares, as long as you guarantee 99.9% ...), but only a requirement of the operations team. So nobody spends time or money on it until operations asks for it: "Oops, we already spent our budget on other things."

    Only a few operations teams have the management standing to refuse to run such applications on their machines; in most cases they are simply overruled by management. The operations team then usually comes up with some thoughts and tries to fill the gap itself, but most of the time that does not work very well. Requirement 2 in particular is then often violated, and 3 and 4 are not implemented at all. But at least operations can show management some green lights in the monitoring system.

    So, as a last requirement (which should really be the very first one):

    • 5. there should be proper monitoring at all. A monitor that only checks for a running process cannot detect that the process has internally deadlocked and is no longer doing any work.

    Finally, when you implement a complex application, make sure that some of its internal state can be exposed to an external monitoring solution; this helps to operate your application. Treat this as a normal "must-have" feature with a specification, an implementation and tests. You will make your operations team really happy.
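    For a Java application, a minimal sketch of such a monitoring interface could be a JMX MBean. The names and checks below are illustrative assumptions, not from any particular product:

      // HealthMonitorMBean.java - the management interface polled by the monitoring tool
      public interface HealthMonitorMBean {
          boolean isHealthy();      // requirement 1: true only when there is no real problem
          long getPendingJobs();    // requirement 4: a rising backlog warns before users notice
          String getLastFailure();  // requirement 3: a hint at where to look for the fix
      }

      // HealthMonitor.java - implementation, registered with the platform MBean server
      public class HealthMonitor implements HealthMonitorMBean {
          public boolean isHealthy()     { return true; }  // wire up to real internal checks
          public long getPendingJobs()   { return 0L; }
          public String getLastFailure() { return "none"; }

          public static void main(String[] args) throws Exception {
              java.lang.management.ManagementFactory.getPlatformMBeanServer()
                      .registerMBean(new HealthMonitor(),
                              new javax.management.ObjectName("com.example.app:type=Health"));
              Thread.sleep(Long.MAX_VALUE); // keep the JVM alive so a monitoring tool can poll
          }
      }

    Any JMX-capable monitoring system (or plain jconsole) can then poll these values and alert on them.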

    If you do not do that, your users become your actual monitoring system: they complain about broken functionality, which your operations team must then fix. That brings costs (service calls) and negative management attention. Nothing one wants to have.

    Posted by Michael Marth NOV 21, 2009

    RT @joannekh: Day Ignite presentations now available at www.day.com/ignite (dayignite on slideshare)

    Posted by Michael Marth NOV 19, 2009

    Joint post of Henri Bergius and Michael Marth cross-posted here and here.

    Web Content Repositories are more than just plain old relational databases. In fact, the requirements that arise when managing web content have led to a class of content repository implementations that are comparable on a conceptual level. During the IKS community workshop in Rome we got together to compare JCR (the Jackrabbit implementation) and Midgard's content repository. While the terminology sometimes differs, many of the underlying ideas are identical. So we came up with a list of common traits and features of our content repositories. For comparison, we have also included Apache CouchDB.

    So, why use a Content Repository for your application instead of the old familiar RDBMS? Repositories provide several advantages:

    • Common rules for data access mean that multiple applications can work with the same content without breaking the consistency of the data
    • Signals about changes let applications know when another application using the repository modifies something, enabling collaborative data management between apps
    • Objects instead of SQL mean that developers can deal with data using APIs more compatible with the rest of their desktop programming environment, and without having to fear issues like SQL injection
    • A scriptable data model means that users can easily write Python or PHP scripts to perform batch operations on their data without having to learn your storage format
    • Synchronization and sharing features can be implemented on the content repository level, meaning that you gain these features without having to worry about them
    Feature comparison of JCR/Jackrabbit, Midgard, and CouchDB:

    content type system
      JCR/Jackrabbit: structured and unstructured nodes are supported and can be mixed at will in a content tree.
      Midgard: content types are defined as MgdSchema types. All content must be stored in an MgdSchema type, but types can be extended at the content instance level using "parameter" triplets.
      CouchDB: type-free.

    type hierarchy
      JCR/Jackrabbit: structured node types support inheritance of types; additional cross-cutting aspects can be added with "mixins". Node types can define the allowed node types for child nodes in the content hierarchy.
      Midgard: MgdSchemas allow inheritance, and an extended type can be instantiated using either the extended type or the base type.
      CouchDB: type-free.

    IDs
      JCR/Jackrabbit: nodes with the mixin "referenceable" have a UUID. In practice the node path is often used to reference nodes.
      Midgard: every object has a GUID used for referencing. Objects located in trees that have a "name" property can also be referred to by path.
      CouchDB: all objects can be accessed via a UUID.

    references
      JCR/Jackrabbit: nodes can reference each other with hard links (a special property type) or soft links (by referring to the node path).
      Midgard: MgdSchema types can have properties linking to other objects of the same or a different type. A link of "parentfield" type places an MgdSchema type in a tree.
      CouchDB: no built-in reference support.

    content hierarchy
      JCR/Jackrabbit: all content is hierarchical / in a tree.
      Midgard: content can exist in a tree or independently of it, depending on the MgdSchema type definition.
      CouchDB: flat structure.

    interesting property types
      JCR/Jackrabbit: multi-valued properties (like an array), binary properties (e.g. for files); nodes have an implicit sort order.
      Midgard: binary properties are stored using the Midgard Attachment system.
      CouchDB: support for binary properties.

    transactions
      JCR/Jackrabbit: multiple content modifications are written in transactions.
      Midgard: transactions can be used optionally.
      CouchDB: n/a

    events
      JCR/Jackrabbit: JCR Observers can register for content changes on different paths and/or for different node types and/or CRUD operations, and receive notifications of changes as serialized nodes.
      Midgard: all transactions cause both process-internal GObject signals and interprocess D-Bus signals.
      CouchDB: support for one external event notification shell script.

    workspaces
      JCR/Jackrabbit: workspaces provide separate root trees.
      Midgard: no workspace support in Midgard 9.03; coming in the next version.
      CouchDB: multiple databases within one CouchDB instance.

    import and export
      JCR/Jackrabbit: nodes or parts of the repository (or the whole repository) can be imported and exported as XML, in two formats: docview for a human-friendly representation and sysview including all technical aspects.
      Midgard: objects can be exported and imported in XML format. There are tools supporting replication via HTTP, tarballs, XMPP, and the CouchDB replication protocol.
      CouchDB: JSON serialization is the standard way of accessing the repository. The CouchDB replication protocol supports full synchronization between instances.

    versioning
      JCR/Jackrabbit: checkin/checkout model to create new versions of nodes; optionally versions complete sub-trees; supports branching of versions.
      Midgard: no versioning.
      CouchDB: all versions of content are stored and accessible separately; no branching.

    locking
      JCR/Jackrabbit: nodes can be locked and unlocked.
      Midgard: objects can be locked and unlocked.
      CouchDB: n/a

    object mapping
      JCR/Jackrabbit: not in the standard, but implemented in Jackrabbit; rarely used in practice.
      Midgard: object mapping is the standard way of accessing the repository.
      CouchDB: all content is accessed via JSON objects.

    queries
      JCR/Jackrabbit: in JCR 1.0 SQL or XPath; in JCR 2.0 also a QueryBuilder.
      Midgard: Query Builder.
      CouchDB: JavaScript map/reduce.

    access control
      JCR/Jackrabbit: done at the repository level, i.e. all access control is independent of the application. In Jackrabbit: pluggable authentication/authorization handlers.
      Midgard: no access control in the Midgard repository; usually implemented at the application level. Midgard provides a user authentication API.
      CouchDB: no access control.

    persistence
      JCR/Jackrabbit: in Jackrabbit, different persistence managers can be plugged in (RDBMS, tar file, ...).
      Midgard: libgda allows storage in different RDBMSs like MySQL, SQLite and Postgres.
      CouchDB: CouchDB has its own storage.

    architecture
      JCR/Jackrabbit: Jackrabbit is available as a library (jar), JEE resource, OSGi bundle or standalone server.
      Midgard: library.
      CouchDB: Erlang-based daemon.

    APIs
      JCR/Jackrabbit: the standard is Java-based, with PHP coming up. Jackrabbit also offers WebDAV and an HTTP-based API.
      Midgard: C, Objective-C, PHP, Python.
      CouchDB: HTTP+JSON.

    full-text search
      JCR/Jackrabbit: included in the repository. Jackrabbit bundles Lucene.
      Midgard: no (Solr is used at the application level).
      CouchDB: plugin for using Lucene, not installed by default.

    standard metadata
      JCR/Jackrabbit: all nodes have access rights, jcr:primaryType and jcr:mixinTypes properties. JCR 2.0 standardizes a set of optional metadata properties.
      Midgard: all objects have a set of standard metadata including creator, revisor, timestamps etc.
      CouchDB: no standard properties.
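    To make the "objects instead of SQL" point from the list above concrete, here is a minimal JCR sketch. It assumes Jackrabbit's TransientRepository with default credentials; the paths and property names are made up:

      import javax.jcr.Node;
      import javax.jcr.Repository;
      import javax.jcr.Session;
      import javax.jcr.SimpleCredentials;
      import org.apache.jackrabbit.core.TransientRepository;

      public class RepositoryHello {
          public static void main(String[] args) throws Exception {
              Repository repository = new TransientRepository();
              Session session = repository.login(
                      new SimpleCredentials("admin", "admin".toCharArray()));
              try {
                  // content lives in a tree of nodes, not in rows and columns
                  Node root = session.getRootNode();
                  Node page = root.addNode("articles", "nt:unstructured")
                                  .addNode("hello", "nt:unstructured");
                  page.setProperty("title", "Hello, repository");
                  session.save();

                  // nodes are addressed by path - no SQL involved
                  System.out.println(root.getNode("articles/hello")
                          .getProperty("title").getString());
              } finally {
                  session.logout();
              }
          }
      }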

    Posted by Michael Marth NOV 18, 2009

    French IT mag LeMagIT has published an article about the IKS project including quotations from Bertrand Delacretaz. Bertrand emphasizes the need for concrete results:

    pour décoller, les technologies sémantiques ont besoin de cas d'utilisateur concrets

    (roughly: "to take off, semantic technologies need concrete use cases")

    In the comments section Bertrand mentions his tag line for semantic technologies that I can very well relate to:

    La sémantique "sous le capot" oui, la sémantique "dans la figure", non

    This roughly translates as: "semantics under the hood yes, semantics in your face, no".

    In Computerworld UK, open source blogger Glyn Moody has described his first-hand impressions from the IKS workshop in Rome. He comes to a similar conclusion:

    Paradoxically, semantic search will only ever really take off once it has receded so far into the fabric of computing that people aren't even aware it's there.

    Posted by Bertrand Delacretaz NOV 13, 2009

    Update to "The IKS semantic engine - a pragmatist's view": here are the slides:

    The presentation went well, and will hopefully lead to a sprint to actually implement something along these lines. The two demos that used UIMA at the workshop made me think that UIMA should be part of that picture, at least as a plugin for semantic lifting. And I did the presentation in less than 8 minutes out of the 10 that were allocated. Bonus points?

    Posted by Carsten Ziegeler NOV 13, 2009

    This year's WJAX in Munich was (again) a great success. The conference area was crowded to the hotel's maximum capacity, I guess. There were around 150 talks and several special days covering topics like persistence, OSGi, Scala, and the never-dying SOA. My two talks, about JCR and Apache Sling, were well attended; some interesting questions came up and I could spread interest in these cool technologies. Now looking forward to JAX 2010 :)

    Posted by Michael Marth NOV 12, 2009

    CQ5 search comes with some improvements over JCR's search capabilities, e.g. adapting result rankings to what users choose or faceted search. Within the IKS project Bertrand and I have experimented with another possibility: link-based ranking, i.e. adjusting search results based on the content of link tags. For example: if page A links to page B with the link text "lorem ipsum" then page B should get a higher ranking when a user searches for "lorem ipsum". This is essentially what Google does, but we wanted to apply it to internal links (within the same site) only.

    To give away the results up front: for many web sites the results will probably not improve dramatically, because there are not enough internal links. However, it might help for some projects, so our implementation approach is described below in case you want to give it a try in your own project.

    In order to extract links from a node we opted to parse the complete rendered HTML representation of the node rather than looking only at its rich-text properties. That way we also catch programmatically generated links coming from templates. So we ended up setting up a little spider on the publish server that retrieves the HTML representations of all pages. The spider is deployed as an OSGi bundle within the server, so it gets the locations of all pages from an internal repository query. For each page the HTML is retrieved and parsed, and the found links are stored as child nodes below the page that is linked to. In the example from above: if page A links to page B with the link text "lorem ipsum", then page B gets a child node with the properties source=A and linkedText="lorem ipsum". Implemented that way, we could basically use the Jackrabbit indexer without further changes.
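    In repository terms, storing one extracted link could look roughly like the sketch below; targetPath, sourcePath and anchorText are assumed to come from the spider, and the backlinks / linkedText names match the indexing configuration shown further down:

      // sketch: persist one extracted link below the page it points to
      Node target = (Node) session.getItem(targetPath);
      Node backlinks = target.hasNode("backlinks")
              ? target.getNode("backlinks")
              : target.addNode("backlinks", "nt:unstructured");
      Node link = backlinks.addNode("link", "nt:unstructured");
      link.setProperty("source", sourcePath);       // page A, the page containing the link
      link.setProperty("linkedText", anchorText);   // e.g. "lorem ipsum"
      session.save();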

    We have also implemented a JCR Observer that catches changes to pages and fixes the corresponding link nodes. Template updates are not caught yet.
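    Registering such an observer uses the standard JCR observation API; a sketch (the watched path and event flags are assumptions, not the exact code from the attachment):

      // sketch: re-process pages whenever they change (javax.jcr.observation.*)
      ObservationManager om = session.getWorkspace().getObservationManager();
      om.addEventListener(new EventListener() {
          public void onEvent(EventIterator events) {
              while (events.hasNext()) {
                  Event event = events.nextEvent();
                  // re-spider the changed page and update its backlink nodes
              }
          }
      },
      Event.NODE_ADDED | Event.NODE_REMOVED | Event.PROPERTY_CHANGED,
      "/content", // watch the site content tree
      true,       // deep: include the whole subtree
      null, null, // any UUIDs, any node types
      false);     // noLocal=false: also see changes made by this session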

    The sources are attached to this post. The Java program can be used as a standalone application or deployed as an OSGi bundle. The standalone program takes a couple of optional arguments, e.g. for running a full upfront spidering or for deleting all found link nodes. In case you want to give it a try, please be aware:

    • The standalone program requires RMI to be enabled on the repository, which is not the case by default (the code uses port 1235).
    • The searches must take the new properties of the link nodes into account. One possibility is to re-configure the Jackrabbit indexing, which in CQ5 is done in the crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/indexing_config.xml file, by adding:
      <index-rule nodeType="nt:unstructured"
        condition="parent::backlinks/@jcr:primaryType='{http://www.jcp.org/jcr/nt/1.0}unstructured'">
        <property boost="5.0">linkedText</property>
      </index-rule>

    The boost factor in this configuration can be adjusted to give links a proper weight relative to the other properties of a node.

    For reindexing, delete these directories:
    • crx-quickstart/repository/repository/index
    • crx-quickstart/repository/workspaces/crx.default/index
    • crx-quickstart/repository/workspaces/crx.system/index

    Results

    We tested the approach on the content of our corporate website (a rather small content corpus). Overall, the search results improved slightly, but not much (although we did not spend a lot of time tweaking the boost factor). As stated above, I believe that corporate websites in general will not benefit much from link-based ranking, because the majority of their links often just reflect the navigation (i.e. the hierarchical structure of the site) and thus provide little additional information. On the other hand, of course, there is no harm in using links for search relevance either.

    Alternative approach

    Marcel Reutegger (the MAN when it comes to JCR searches) gave a lot of great input to our experiment (thanks a lot for this). He also hinted at what an alternative implementation could look like: an output filter, which can process HTML content as it is being generated. In CQ5 the validity of links is already checked that way, so storing them would fit there naturally. He also suggested storing the links not below the pages themselves, but in a separate part of the repository. A background job could then aggregate these links and eventually write the most relevant keywords into the page nodes.

    Posted by Greg Klebus NOV 11, 2009

    After the Ignite in Zurich came the Ignite in Chicago, where our American customers, prospects, partners, and Day staff met to share information and experiences, to network, and simply to have a very good time. The event itself was slightly bigger than the one in Zurich, both in the number of participants and in the available room.


    Ignite was hosted by Day customers in more than one way: by the City of Chicago itself, and by the grand InterContinental Hotel of the IHG group, on Chicago's famous shopping avenue, the Magnificent Mile.


    Again, we had a lot of great presentations, panels, Q&A sessions, as well as informal chats. And the Foreigner concert at the end was the icing on the cake.


    The conference hashtag was #dayignite, and there are some Ignite pictures on Flickr, with lots of new coverage from Chicago:

    Looking forward to next year's Day Customer Summit!

    Posted by Michael Marth NOV 09, 2009

    Day's CMO Kevin Cochrane has been interviewed by Matthew Aslett of the 451 Group about Day's open source strategy. I particularly liked:

    While many other vendors have chosen to retain control over their open source projects for commercial reasons, Day opted to relinquish control with the aim of ubiquity.

    Full interview here.

    Posted by Bertrand Delacretaz NOV 09, 2009

    As work on the IKS project progresses, my (extremely) pragmatic mind keeps going back to the "how can we make this simpler?" question.

    One of the major goals of IKS is to create semantic extensions for content management systems, but what does that mean? The exact use cases are still vague, and in such a situation it is too easy to over-engineer things, just in case.

    We have been talking about RESTful interfaces to IKS components for a while now, but what does this mean exactly? How can we make a concrete step towards defining such interfaces?

    I'm a big fan of small concrete steps that lead us towards pragmatic solutions, so let's try to take one such step.

    Machine-level use cases

    Let's start by defining a few simple use cases at the "machine level": a content management system is the client, and the IKS semantic engine is the server. We have discussed this already within IKS; here's a brief summary:

    Semantic lifting
    Let IKS extract semantic information from (multimedia) content: person and place names, structured links between content items, etc. Optionally make this information editable/confirmable by the client system, as a human user might have to refine the system's suggestions.
    At the machine interface level, this requires registering content with the IKS semantic engine, reading the resulting semantically lifted document, and optionally modifying it.
    Classification and auto-tagging
    Let IKS suggest categories and/or tags for pieces of multimedia content. If an author validates the suggestions, inform IKS of what choices were made.
    From the machine interface point of view, this is very similar to semantic lifting.
    Query building assistance
    Let IKS assist users in formulating search queries, interactively.
    From the machine interface point of view, this is very similar to semantic lifting.
    Similarities, correlation
    Let IKS find similarities between pieces of multimedia content. The axes on which those similarities are found can vary: images, for example, can be graphically similar, or similar in terms of the real world entities that they display.
    At the machine interface level, this requires registering content with the IKS semantic engine, and later running queries against this content.

    This simple list already hides significant complexity, yet those use cases should be understandable by Joe Author.

    Enabling those four use cases could add a lot of value to existing and future content management systems, depending on the quality of the semantic components.

    RESTful interface

    Let's design a RESTful interface based on the machine interactions required to implement the above use cases.

    Remember that, in what follows, client designates a content management system that wants to use the IKS engine.

    Register content with IKS

    To build knowledge about our content, IKS needs to be able to find it. In RESTful terms this means providing IKS with a URL that points to said content, so we have:

    Rule #1: Content is registered with the IKS server by HTTP POST requests, containing lists of URLs that point to (created or modified) content items.

    Rule #2: IKS reads content by making HTTP GET requests to registered pieces of content. Those URLs must return Content-Types that IKS understands. Some Content-Types are preferred and allow IKS to better understand the content.
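    As an illustration of Rules #1 and #2, registration could look like this on the wire. The /iks paths and the text/uri-list payload are invented for the example, not part of the proposal:

      POST /iks/content HTTP/1.1
      Host: iks.example.com
      Content-Type: text/uri-list

      http://cms.example.com/content/press/2009/11/release.html
      http://cms.example.com/content/products/overview.html

      HTTP/1.1 200 OK
      Content-Type: text/uri-list

      http://iks.example.com/iks/content/4711
      http://iks.example.com/iks/content/4712

    The returned URLs would serve as the IKS identifiers used below; IKS then issues GET requests to the registered CMS URLs to read the content itself.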

    Semantic Lifting

    Once content is registered, the client can request a semantic view of that content from IKS. That view lists semantic entities that have been extracted from the content.

    Depending on the IKS implementation, the semantic view can be editable. It is retrieved by a GET request that contains the IKS identifier (provided by IKS when content is registered) of the content item, and modified using an HTTP PUT request.

    The Content-Type and data formats use existing standards, as far as possible.

    The semantic view includes IKS-specific metadata, for example to indicate that some parts of the semantic view are still being computed.

    Rule #3: The semantic view of a content item is retrieved with a GET request, and if editable can be modified by a PUT request of the modified version.
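    A hypothetical exchange for Rule #3, reusing an identifier from the registration sketch above (the /semantic suffix and RDF/XML are placeholders for the "existing standards" mentioned before):

      GET /iks/content/4711/semantic HTTP/1.1
      Host: iks.example.com
      Accept: application/rdf+xml

      HTTP/1.1 200 OK
      Content-Type: application/rdf+xml

      ... semantic view: extracted persons, places, links, plus IKS metadata ...

      PUT /iks/content/4711/semantic HTTP/1.1
      Host: iks.example.com
      Content-Type: application/rdf+xml

      ... the modified semantic view ...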

    Semantic queries

    Semantic queries are implemented using GET requests on various query URLs that define how the query is interpreted.

    Results are returned with similar Content-Types and data formats as used for semantic lifting.

    Rule #4: Semantic queries are executed via GET requests, and return the identifiers (URLs) of the selected content items, optionally with some contextual info to display on query result pages.
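    A hypothetical query along the lines of Rule #4 (the query URL and parameters are, once more, invented for illustration):

      GET /iks/query?similarTo=4711&axis=entities HTTP/1.1
      Host: iks.example.com

      HTTP/1.1 200 OK
      Content-Type: text/uri-list

      http://cms.example.com/content/products/overview.html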

    IKS engine status

    Semantic lifting and indexing operations might take some time, so it's useful for the client to have information on the engine's status, in machine-readable form.

    Rule #5: The IKS server reserves part of its URL space for system status information, and provides status information in a structured format.

    Is that it?

    I think that's it - these simple RESTful interactions should be sufficient to implement our use cases.

    What's left is to define the Content-Types used, and for this we can most certainly use existing formats, no need to reinvent any wheel here.

    RESTful IKS framework

    The proof of the pudding is in the eating, and if we wait too long the pudding might lose its taste... so why not start building this right away?

    Purists might (rightly) argue that the above is not a design, just a somewhat vague set of principles. Yet, combined with a prototype implementation, it might be a very good way of taking a step in the right direction and of clarifying requirements and interfaces.

    My suggestion for the next steps is as follows:

    1. Implement the above interface, using dummy semantic components.
    2. Provide system interfaces to integrate actual semantic components (semantic lifting, classification, auto-tagging, querying) as plugins.
    3. Researchers can work on the semantic lifting components, and integrate them without requiring significant changes on the client side.

    Conclusion

    The best way to go forward with this is probably to create an open source project to collaborate on this RESTful IKS framework.

    Even if that framework is thrown away later as the IKS architecture progresses, it would allow IKS consortium members to build a much better understanding of what's actually needed to add "semantic value" to existing and future content management systems.

    Posted by Michael Marth NOV 09, 2009

    Posted by David Nuescheler NOV 05, 2009

    Today I had the opportunity to speak at the JBoye conference in Aarhus. It was a pleasure, as it is every year, since the audience and speakers really constitute a who's who of WCM visionaries and insiders. I am definitely looking forward to coming back next year.

    Posted by Michael Marth NOV 04, 2009

    The Tuberculosis Project is one of the Sling users registered on the Sling user wiki page. This is an interview with developer Audrey Colbrant, who worked on the project.

    Audrey, can you please tell us a bit about the TibTec Tuberculosis Project? What are the project's aims and background?

    The TB project is developed by TibTec, a nonprofit technology center based in Dharamsala (India) and directed by Mr. Phuntsok Dorjee. The aim of the project is to build a system to monitor tuberculosis among Tibetan communities in India, Nepal and Bhutan. Thanks to advances in mobile and web computing, it is now possible to design a recording and reporting web portal supporting the WHO DOTS protocol.

    The project of monitoring tuberculosis among Tibetan communities in India was born a year ago thanks to four actors: the DoH (Department of Health, Tibetan Government in Exile), the Tibetan Delek Hospital (Gangchen Kyishong, India), AISPO (Italian Association for Solidarity Of Persons), and the Johns Hopkins University (USA). TibTec is building the system for these four actors.

    The main goal of the project is to build a simple, low-cost and versatile framework so that communities all over the world can benefit from it. The system can also easily be customized for other uses, since it is based on open source software.

    If you want to take a look at the architecture, follow the guide.

    So how did you end up using Sling? Did you compare Sling against some other frameworks?

    The implementation of the TB project was part of my master's project in computer science at my university. Jacques Lemordant, a researcher in the WAM project at INRIA, had been in contact with Mr. Dorjee, CEO of TibTec, for several years. Together they defined the outline of the project and chose the most efficient technologies to use.

    Sling was chosen because we are very familiar with XML technologies (RELAX NG, XPath, XSLT...) and with hierarchical representations of data.

    Another point was that we wanted to access the data from Android (via the Apache HTTP client), and a full REST API was the simplest way to access a JCR repository and manipulate data represented as trees. Since XML is very well supported on Android, Sling is a perfect match for designing an agile mobile web framework.
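    For illustration, fetching a node through Sling's default JSON rendering with the Apache HTTP client (the one available on Android) might look like the sketch below; the host and content path are made up:

      import org.apache.http.HttpResponse;
      import org.apache.http.client.HttpClient;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.DefaultHttpClient;
      import org.apache.http.util.EntityUtils;

      public class SlingJsonClient {
          public static void main(String[] args) throws Exception {
              HttpClient client = new DefaultHttpClient();
              // Sling renders any JCR node as JSON via the .json extension;
              // ".1.json" would include one level of child nodes as well
              HttpGet get = new HttpGet(
                      "http://localhost:8888/content/tb/patients/p0001.json");
              HttpResponse response = client.execute(get);
              System.out.println(EntityUtils.toString(response.getEntity()));
          }
      }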

    Sling is also part of a course on mobile and web technologies at the master's level at the University Joseph Fourier of Grenoble.

    Now that you have completed an implementation project with Sling are there any lessons learned you would like to share with the community?

    The Sling approach is fairly new, and I had not seen the same kind of approach anywhere before. The concept is simple, but it takes a little time to get used to working with it. So never give up; solutions come slowly with perseverance.

    If you had one free wish from the Sling committers...

    Sling is a very interesting and powerful way to work with resources, but it is difficult for beginners to handle when you have a full, composite website to implement, mostly because of the lack of information on the internet.

    The thing that gave me the most headaches was finding the right syntax to use, which changes according to the technologies you mix.

    So I think it would be helpful to have more tutorials on the syntax to use in each case, on what is better to do or not to do, and advice on programming choices (for example, I faced choices about protecting access to the repository, and about which kind of link is better to use: reference or path, etc.).

    It would also be good to finalize all the links on this useful webpage.

    Posted by Michael Marth NOV 02, 2009

    ApacheCon US 09 starts today in Oakland. A couple of Day's engineers will give talks, not just about the usual suspects Sling and Jackrabbit, but also about Tika and POI (details below).

    Also, Jukka Zitting has helped organize a NoSQL meetup in Oakland starting tonight where Bertrand Delacretaz will talk about JCR.

    Bertrand Delacretaz: Life in Open Source communities: Open Source communities often seem to have their own unwritten rules of operation and communication, their own jargon and their own etiquette, which sometimes make them appear obscure and closed to outsiders. In this talk, we'll provide recommendations on how to get in touch with, and how to join, Open Source communities. Based on ten years of experience in various Open Source projects, we will provide practical information on how to communicate effectively on mailing lists, how to formulate questions in an effective way, how to contribute in ways that add value to the project, and generally how to interact with Open Source communities in ways that are mutually beneficial. This talk will help Open Source beginners get closer to the communities that matter to them, and help more experienced community members understand how to welcome and guide newcomers.

    Carsten Ziegeler: JCR in Action - Content-based Applications with Jackrabbit: The Java Content Repository API (JCR) is the ideal solution to store hierarchical structured content, and to develop content-oriented applications. This session provides a practical introduction to help you get started using JCR in your own application. To demonstrate the basic architecture of such applications, a sample content-based application will be developed during the session. Basic techniques will be explained, including navigation, searching, and observations, using the Apache Jackrabbit project.

    Embrace OSGi - A Developer's Quickstart: In theory, the first choice for highly modular, dynamic, and extensible applications is OSGi technology. The theory sounds very tempting, but what about the real world? Starting with the basics of OSGi, this session is focused on practical examples, tools, and procedures for a rapid adoption of OSGi in your own projects. Learn how to avoid the typical traps and how to get the most out of OSGi.

    Felix Meschberger: Rapid JCR applications development with Sling: Apache Sling is an OSGi-based, scriptable applications layer, using REST principles, that runs on top of a JCR content repository. In this talk, we'll see how Sling enables rapid development of JCR-based content applications, by leveraging the JSR 223 scripting framework. We'll also look at the rich set of OSGi components provided by Sling. We will create a simple application from scratch in a few minutes, and explain a more complex multimedia application that does a lot with just a few lines of code. This talk will help you get started with Sling and understand how the different components fit together.

    Jukka Zitting: MIME Magic with Apache Tika: Apache Tika aims to make it easier to extract metadata and structured text content from all kinds of files. Tika is a subproject of Apache Lucene, and leverages libraries like Apache POI and Apache PDFBox to provide a powerful yet simple interface for parsing dozens of document formats. This makes Tika an ideal companion for Apache Lucene, or for any search engine that needs to be able to index metadata and content from many different types of files. This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity. The audience is expected to have a basic understanding of Java programming and MIME media types.

    Paolo Mottadelli: Apache POI recipes: The Apache POI project provides Open Source Java APIs for the manipulation of Microsoft Office format files. It was developed to provide OLE2 Compound Document format support. POI support for the new format was necessitated by the proliferation of new Office Open XML (OOXML) documents, due to its standardization. As a result, a common challenge emerged for projects that leverage POI to read and write Excel, Word, and PowerPoint documents: supporting the new format while maintaining backward compatibility with the earlier one. This session provides an overview of how the new POI architecture makes that challenge easier, using the common interfaces package and their double implementation. Participants will also learn about the main new features provided by POI towards support of the new OOXML format. To demonstrate POI's features, this session will also drive through a collection of practical recipes to solve the tough problems of integrating Office documents in your enterprise applications.