## +1! ™ – a rosy financial future for the Apache Software Foundation

Google recently announced their +1 button, which will without a doubt make the Internet a better place. What’s not to like about +1?

As everybody knows, the +1 concept was invented at the Apache Software Foundation (ASF) – and it seems there’s even an ASF patent pending on it. (Update: see also here, via @jaaronfarr). Our voting process makes extensive use of this simple and very effective concept.

If you do the math, the bandwidth (and thus power, greenhouse gas, etc.) saved by using +1 instead of “I agree” in all our emails does make a difference for the planet – it’s not just a fun gimmick.

In recognition of this invention, usually well-informed sources tell us, Google is going to donate 3.141592654 cents (yeah, that’s Pi – they’re Google, you know) to the ASF every time someone uses their +1 button, starting today!

That’s excellent news for the ASF – as with any volunteer organization, more funds mean more action, more power and more fun! I haven’t yet been able to estimate how much money those N*Pi +1 clicks represent in a year, but it’s certainly in the pile of money range.

A small downside is that we’ll need to use +1(tm), with the trademark sign, from now on. That’s a small price to pay for what looks like a rosy financial future for the ASF.

A very impressive move – thanks, Google! The Open Source world should mark today’s very special date with a white stone, as we say in French.

+1(tm)!

## glow.mozilla.org: smoke and mirrors, and RESTful design

When I was a kid, my aunt gave me a book called The Art of Engineering. The title sounded weird to me at first – isn’t engineering the opposite of art?

It’s not – artful design can be visible in the best pieces of software, and not only at the user interface level. I find the realtime display of Firefox 4 downloads by glow.mozilla.org fascinating, and being my curious self I wondered how the data is transferred.

Starting with the requirement of broadcasting real-time data to millions of clients simultaneously, many of us would end up with expensive message queuing systems, RPC, WebSockets, SOAP^H^H^H^H (not SOAP – don’t make me cry). Lots of fun ways to add some powers of ten to your budget.

Don’t believe anyone who tells you that software has to be complicated, or that engineering cannot be artful. Simplicity always wins, and glow.mozilla.org is an excellent example of that.

The first thing that I noticed when looking at how glow gets its data (which was very easy thanks to the use of sane http/json requests) is that glow is not real-time.

I’d call it smoke-and-mirrors real-time: the client simply requests a new batch of data points every minute, and the server can change this interval at any time, which can be very handy if traffic increases. Fetching slightly outdated data every minute is more than enough for a human user, who doesn’t care if the data is a bit old, and it makes the system quite a bit simpler.

The first of these two regular data requests is to a URL like http://glow.mozilla.org/data/json/2011/03/21/14/42/count.json. The path already tells you a lot about what this is, which although not required is often a sign of good RESTful design.

The response contains an array of data points (number of downloads per minute), along with two very important items that control the data transfer:

```json
{
  "interval": 60,
  "data": [
    [ [2011,3,21,13,43], 1349755 ],
    [ [2011,3,21,13,44], 1350332 ],
    ...
  ],
  "next": "2011/03/21/14/43/count.json"
}
```


The interval tells the client when to ask for data next, and the next item is the path to the next batch of data. At least that’s what I assume – I haven’t checked the client code in detail, but it seems obvious.

Using URLs and data that seem obvious is the essence of the Web, and of a good RESTful design. Using RPC, WebSockets or any other supposedly more sophisticated mechanism would bring nothing to the user, and would only make things more complicated. Being able to throttle data requests from the server-side using the interval and next items is very flexible, obvious, and does not require any complicated logic on the client side.

The second data URL looks like http://glow.mozilla.org/data/json/2011/03/21/14/42/map.json, and if my quick analysis is correct it returns geographic coordinates of the dots that represent geolocated downloads. It uses the same interval/next mechanism for throttling requests.
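Out of curiosity, here’s how such a polling client could look in Java – a sketch of my own (the class and helper names are made up, and a real client would of course use a proper JSON parser and HTTP client rather than regexes):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a minimal client for glow-style interval/next
 *  throttling. The server controls both when and where the client
 *  polls next, via the interval and next items in each response. */
public class GlowPoller {

    /** Extract the polling interval (seconds) from the JSON payload. */
    static int extractInterval(String json) {
        Matcher m = Pattern.compile("\"interval\"\\s*:\\s*(\\d+)").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("no interval");
        return Integer.parseInt(m.group(1));
    }

    /** Extract the path of the next batch of data. */
    static String extractNext(String json) {
        Matcher m = Pattern.compile("\"next\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("no next");
        return m.group(1);
    }

    public static void main(String[] args) {
        String sample = "{\"interval\":60,\"data\":[],"
            + "\"next\":\"2011/03/21/14/43/count.json\"}";
        // Wait extractInterval(sample) seconds, then fetch the next batch:
        System.out.println(extractInterval(sample)); // 60
        System.out.println("http://glow.mozilla.org/data/json/" + extractNext(sample));
    }
}
```

The point is that the server stays in control: changing interval or next in a single response is all it takes to throttle or redirect every client, with no extra logic on the client side.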

All in all, an excellent example of engineering smoke and mirrors applied in the right way, and of simple and clean RESTful design. No need for “sophisticated” tools when the use case doesn’t really require them. Kudos to whoever designed this!

Update: The Mozilla team has more details on their blog. Thanks to Alex Parvulescu for pointing that out.

## Transforming Maven POM properties with Groovy

In Sling we’re moving to fragment bundles, instead of system properties, to do things like exporting packages from the JVM’s classpath.

If you have no idea what I’m talking about, bear with me – this is just about a simple Maven trick to transform POM properties using bits of Groovy script.

Basically, an OSGi fragment bundle is a jar file that contains just metadata under META-INF, especially META-INF/MANIFEST.MF that contains the OSGi bundle headers.

One of these headers is Bundle-Version, which does not support values like 5.4.2-SNAPSHOT that are common in Maven. The dash is invalid in an OSGi bundle version number, so that value needs to be converted to 5.4.2.SNAPSHOT.

To avoid having a separate bundle.version property in your POM, which if you’re like me you’ll forget to update before a release, here’s how to transform the value using a bit of Groovy scripting:

```xml
<plugin>
  <groupId>org.codehaus.groovy.maven</groupId>
  <artifactId>gmaven-plugin</artifactId>
  <version>1.0</version>
  <executions>
    <execution>
      <phase>generate-resources</phase>
      <goals>
        <goal>execute</goal>
      </goals>
      <configuration>
        <properties>
          <rawVersion>${pom.version}</rawVersion>
        </properties>
        <source>
          // Convert POM version to valid OSGi version identifier
          project.properties['osgi.version'] =
            (project.properties['rawVersion'] =~ /-/).replaceAll('.')
        </source>
      </configuration>
    </execution>
  </executions>
</plugin>
```

As usual in Maven POMs (though I think Maven 3.x can improve on that, feedback welcome) that’s a bit verbose to write – the actual Groovy code is just

```groovy
project.properties['osgi.version'] =
  (project.properties['rawVersion'] =~ /-/).replaceAll('.')
```

But even with the verbosity it’s cool to be able to do that without having to write a plugin. You can then use the ${osgi.version} property for the Bundle-Version header.
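The transformation itself boils down to a single regex replacement. Just to illustrate the rule outside of Maven, here’s the same conversion restated in plain Java (a throwaway snippet of mine, not part of the build):

```java
/** Illustration: the same dash-to-dot conversion the Groovy snippet does,
 *  turning a Maven version into a valid OSGi version identifier. */
public class OsgiVersion {

    static String toOsgi(String mavenVersion) {
        // OSGi versions use a dot before the qualifier, not a dash
        return mavenVersion.replaceAll("-", ".");
    }

    public static void main(String[] args) {
        System.out.println(toOsgi("5.4.2-SNAPSHOT")); // 5.4.2.SNAPSHOT
    }
}
```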

For the sake of completeness, here’s the other interesting part of that POM, which sets the required OSGi headers to create a fragment bundle. com.example,whatever is the package that we need the system bundle to export.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <index>true</index>
      <manifest>
      </manifest>
      <manifestEntries>
        <Bundle-Version>${osgi.version}</Bundle-Version>
        <Bundle-Description>${project.description}</Bundle-Description>
        <Bundle-Name>${project.name}</Bundle-Name>
        <Bundle-DocURL>http://www.example.com/</Bundle-DocURL>
        <Bundle-ManifestVersion>2</Bundle-ManifestVersion>
        <Bundle-Vendor>YourCompanyAG</Bundle-Vendor>
        <Fragment-Host>system.bundle;extension:=framework</Fragment-Host>
        <Bundle-SymbolicName>${project.artifactId}</Bundle-SymbolicName>
        <Export-Package>
          com.example,whatever;version=1.0,
        </Export-Package>
      </manifestEntries>
    </archive>
  </configuration>
</plugin>
```


Update: a complete sample pom is available at http://svn.apache.org/repos/asf/sling/trunk/samples/framework-fragment/pom.xml

## Class.forName ? Probably not ...

After subscribing to the OSGi Planet feed I felt like reading some old blog posts, and stumbled upon a series of posts by BJ Hargrave about the issues with the Eclipse ContextFinder caused by the Class.forName methods.

So these posts prompted me to look at how we behave in Apache Sling ... hoping, of course, that we would be clean.

Well, hmm, turns out we are not ... I found nine classes using Class.forName.

So we probably have to clean this up. Maybe, maybe not – these uses may be the cause of some strange failures we have had over time. I cannot really tell, but I cannot exclude the possibility either.
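For reference, the problematic form is the single-argument Class.forName, which resolves against the caller’s defining class loader. The usual fix is to pass an explicit ClassLoader, either via loadClass or the three-argument Class.forName – a minimal sketch of my own, not actual Sling code:

```java
/** Sketch: loading a class through an explicit ClassLoader instead of the
 *  caller-sensitive one-argument Class.forName, which resolves against the
 *  caller's defining loader and misbehaves in multi-classloader (OSGi) setups. */
public class LoaderExample {

    static Class<?> load(String name, ClassLoader loader) throws ClassNotFoundException {
        // The caller decides which loader resolves the name; roughly
        // equivalent to loader.loadClass(name), without class initialization
        return Class.forName(name, false, loader);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader cl = LoaderExample.class.getClassLoader();
        System.out.println(load("java.lang.String", cl).getName()); // java.lang.String
    }
}
```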

BTW, this is what I did to find the classes:
```shell
$ find . -name "*.java" -exec fgrep -l Class.forName {} \;
```

## Regular expression matching in <100 lines of code

The recent discussion about the “Yacc is dead” paper on Lambda the Ultimate sparked my interest in regular expression derivatives. The original idea goes back to the paper “Derivatives of Regular Expressions”, published in 1964 (!) by Janusz A. Brzozowski. For a more modern treatment of the topic see “Regular-expression derivatives reexamined”.

The derivative of a set of strings with respect to a character is the set of strings which results from removing the first character from all the strings in the set which start with that character. For example, let $S = \{foo, bar, baz\}$; then $\partial_b S = \{ar, az\}$.

It turns out that regular languages are closed under derivatives. That is, any derivative of a regular language is again a regular language. Furthermore, it is possible to extend the notion of derivatives to regular expressions such that, given a regular expression $r$ which generates the language $\mathcal{L}(r)$ and a character $c$, one can derive a regular expression $\partial_c r$ such that $\mathcal{L}(\partial_c r) = \partial_c(\mathcal{L}(r))$.

This is a key ingredient for a very elegant regular expression matching algorithm: to match a string against a regular expression, repeatedly calculate the derivative of the regular expression for each character in the string. When no character is left, check whether the last derivative accepts the empty string. If so, we have a match; otherwise not. For example, matching the string $ab$ against the expression $ab$: $\partial_a(ab) = b$, then $\partial_b(b) = \epsilon$, which is nullable, so the string matches.

The exact algorithm for deciding whether a regular expression is nullable (i.e. accepts the empty string) is given in “Regular-expression derivatives reexamined”, as is the algorithm for calculating derivatives of regular expressions. Below is a direct implementation of that algorithm in Scala (with a slight modification to allow for strings instead of individual characters).
```scala
trait RegExp {
  def nullable: Boolean
  def derive(c: Char): RegExp
}

case object Empty extends RegExp {
  def nullable = false
  def derive(c: Char) = Empty
}

case object Eps extends RegExp {
  def nullable = true
  def derive(c: Char) = Empty
}

case class Str(s: String) extends RegExp {
  def nullable = s.isEmpty
  def derive(c: Char) =
    if (s.isEmpty || s.head != c) Empty
    else Str(s.tail)
}

case class Cat(r: RegExp, s: RegExp) extends RegExp {
  def nullable = r.nullable && s.nullable
  def derive(c: Char) =
    if (r.nullable) Or(Cat(r.derive(c), s), s.derive(c))
    else Cat(r.derive(c), s)
}

case class Star(r: RegExp) extends RegExp {
  def nullable = true
  def derive(c: Char) = Cat(r.derive(c), this)
}

case class Or(r: RegExp, s: RegExp) extends RegExp {
  def nullable = r.nullable || s.nullable
  def derive(c: Char) = Or(r.derive(c), s.derive(c))
}

case class And(r: RegExp, s: RegExp) extends RegExp {
  def nullable = r.nullable && s.nullable
  def derive(c: Char) = And(r.derive(c), s.derive(c))
}

case class Not(r: RegExp) extends RegExp {
  def nullable = !r.nullable
  def derive(c: Char) = Not(r.derive(c))
}
```

Having these constructors, we need a way to match strings against regular expressions.

```scala
object Matcher {
  def matches(r: RegExp, s: String): Boolean = {
    if (s.isEmpty) r.nullable
    else matches(r.derive(s.head), s.tail)
  }
}
```

Here are some pimps to make usage of the regular expression constructors more convenient.

```scala
object Pimps {
  implicit def string2RegExp(s: String) = Str(s)

  implicit def regExpOps(r: RegExp) = new {
    def | (s: RegExp) = Or(r, s)
    def & (s: RegExp) = And(r, s)
    def % = Star(r)
    def % (n: Int) = rep(r, n)
    def ? = Or(Eps, r)
    def ! = Not(r)
    def ++ (s: RegExp) = Cat(r, s)
    def ~ (s: String) = Matcher.matches(r, s)
  }

  implicit def stringOps(s: String) = new {
    def | (r: RegExp) = Or(s, r)
    def | (r: String) = Or(s, r)
    def & (r: RegExp) = And(s, r)
    def & (r: String) = And(s, r)
    def % = Star(s)
    def % (n: Int) = rep(Str(s), n)
    def ? = Or(Eps, s)
    def ! = Not(s)
    def ++ (r: RegExp) = Cat(s, r)
    def ++ (r: String) = Cat(s, r)
    def ~ (t: String) = Matcher.matches(s, t)
  }

  def rep(r: RegExp, n: Int): RegExp =
    if (n <= 0) Star(r)
    else Cat(r, rep(r, n - 1))
}
```

And finally here is how to use it:

```scala
object Test {
  import Pimps._

  val digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
  val int = ("+" | "-").? ++ digit.%(1)
  val real = ("+" | "-").? ++ digit.%(1) ++ ("." ++ digit.%(1)).? ++
    (("e" | "E") ++ ("+" | "-").? ++ digit.%(1)).?

  def main(args: Array[String]) {
    val ints = List("0", "-4534", "+049", "99")
    val reals = List("0.9", "-12.8", "+91.0", "9e12", "+9.21E-12", "-512E+01")
    val errs = List("", "-", "+", "+-1", "-+2", "2-")

    ints.foreach(s => assert(int ~ s))
    reals.foreach(s => assert(!(int ~ s)))
    errs.foreach(s => assert(!(int ~ s)))

    ints.foreach(s => assert(real ~ s))
    reals.foreach(s => assert(real ~ s))
    errs.foreach(s => assert(!(real ~ s)))
  }
}
```

Now that’s 48 + 6 + 32 = 86 lines of code for a regular expression matching library!

Filed under: Uncategorized Tagged: Regular Expression, Scala

## On speaking in URLs

I’ve seen about five examples just today where speaking in URLs (I spoke about that before, slide 27) would have saved people from misunderstandings, and avoided wasting our collective time.

When writing about something that has a URL, on a project’s mailing list for example, pointing to it precisely makes a big difference. You will save people’s time, avoid misunderstandings, and over time create a goldmine of linked information on the Web. It’s not a web without links, ok?

Writing https://issues.apache.org/jira/browse/SLING-931 (or at least SLING-931 if that’s the local convention) is so much clearer than writing about “the jcrinstall web console problem”. You might know what the latter is right now, but how about 6 months later, when someone finds your message in the mailing list archives?

Of course, all your important technical things have stable URLs, right?
## The case for the digital Babel fish

“Just like Arthur Dent, who after inserting a Babel fish in his ear could understand Vogon poetry, a computer program that uses Tika can understand Microsoft Word documents.” This is how Tika in Action, our book on Apache Tika, introduces its subject. Download the freely available first chapter to read the full introduction.

Chris Mattmann and I started writing the Tika in Action book for Manning at the beginning of this year, and we’re now well past the half-way post. If we keep up this pace, the book should be out in print by next summer!

And thanks to the Manning Early Access Program (MEAP), you can already pre-order and access an early access edition of the book at the Tika in Action MEAP page. If you’re interested, use the “tika50” code to get a 50% early access discount when purchasing the MEAP book. You’ll still receive updates on all new chapters and of course the full book when it’s finished. Note that this discount code is valid only until December 17th.

We’re also very interested in all comments and other feedback you may have about the book. Use the online forum or contact us directly, and we’ll do our best to make the book more useful to you!

## Why the ASF disagrees with Oracle, straight from the Anonymous Coward’s mouth

An Anonymous Coward (as they call them) on Slashdot provides the clearest explanation I’ve seen so far. I’m only quoting the original comment here, see the discussion on Slashdot for follow-ups:

The problem is that to be a compatible Java implementation you must pass the TCK. To get a hold of the TCK you must agree that your Java implementation has a limited field of use, namely desktop computers. That means you have to add a clause to your licence that tells your users where they can use the software – no such clause exists in any open source licence I’m aware of.
Sure you can use the OpenJDK, you can even fork it, but therein lies the problem… you can’t, because if you do and you want to claim it’s a compatible implementation you have to pass the TCK. So you have to licence the TCK, then you have to add a field of use restriction to your licence, but that’s incompatible with the GPL that the OpenJDK requires you to licence under. End result: you can have Oracle Java or ‘Open’JDK.

The ASF don’t have a political axe to grind with the GPL, and aren’t firing a salvo in some imaginary war based on their view of free; it’s about a contractual obligation Oracle has to release the TCK to the ASF. An obligation Sun had and failed to meet, and that Oracle continues to fail to meet.

The ASF was re-elected to the JCP with 95% of the vote. No other elected member had anywhere near that. The members spoke with their vote, and consequently the ASF leaving the JCP would be big news in a war with Oracle, nobody else. The ASF is, outside core Java and the work of the JCP, probably the biggest single contributor to the Java ecosystem. Their threat to leave the JCP would seriously damage it, and the credibility of Oracle’s commitment to open source. You can only have Oracle Java or ‘Open’JDK – there’s no way out until Oracle honors the agreement.

I have also started collecting a list of links about the whole thing, at delicious.com/bdelacretaz/oraclemess.

## Open sourcing made easy

Open sourcing a closed codebase can be difficult. The typical approach is to decide that you’ll go open source, make big news about it and then try to figure out how to proceed. It’s no wonder many open source transitions end up being more painful than expected, and fail to generate as much community interest and involvement as hoped. How can you do better?

0. Start small

Even though your marketing people will be eager to use a good story, you should avoid the temptation to make a big deal about your shiny new open source project.
Instead, start with small, reversible steps that allow you to get comfortable with the new way of developing software before making public commitments. In other words, learn to walk before you try to run. The next sections outline how to do this.

1. Clean up the codebase

Do you really know what’s inside your existing codebase? Do you have the rights to use and redistribute all the included intellectual property? Are there trade secrets or other bits in the codebase that you’d rather not show everyone? Do you wish to keep parts of the codebase closed, so you can keep selling them as add-on components on top of the open source offering?

Answering these questions should be your first task. You’ll need to spend some time auditing and possibly refactoring your code to prepare it for the public eye. Depending on the codebase this could be anything from a trivial exercise to a significant project. The nice thing is that the increased understanding and potential modularity you gain from this work will be quite valuable even if you never take the next step.

2. Open up your tools

Now that your codebase is clean and ready for public view, you can (and should!) start using public tools to develop the code. You can either make your existing version control, issue tracking and other tools public, or migrate to a new set of public tools. There are plenty of excellent free hosting services for open source projects, so you have a good opportunity to both lower your maintenance costs and improve your productivity through better tooling!

There’s no need yet to worry about external users or contributors. In fact, the fewer people you attract at this stage, the better! The main purpose of this step is to make your developers comfortable with the idea that anyone could come and see all their code and all the mistakes they are making. This is a big cultural change for many developers, and you’ll want to start small to give them time to adapt in peace.

3. Engage the community

If you’ve followed the steps so far, you’ve actually already open sourced your codebase. Are you and your developers comfortable with the situation? It’s still possible to switch back to closed source with minimal disruption and no lost reputation if you’re having second thoughts.

But if you are willing to move forward, now is the time to start enjoying the benefits of open development! Call in your marketing people to do their magic. Tell the world about the code you’re sharing, and invite everyone to participate! If your product is in any way useful to someone, you’ll start seeing people come in, ask questions, submit bug reports and perhaps even contribute fixes. At this point it is useful to have a few people ready to help such new users and contributors, but it’s surprising how quickly the community can become self-sufficient. More on that in a later post…

## Models of corporate open source

There are many different ways and reasons for companies to develop their software as open source. Here’s some brief commentary on the main approaches you’ll encounter in practice.

0. Closed source

Well, closed source is obviously not open, but I should mention it as not all software can or should be open. The main benefit of closed source software is that you can sell it. If you are working for profit, then you should only consider open sourcing your software if the benefits of doing so outweigh the lost license revenue.

1. Open releases

Also known as code drops. You develop the software internally, but you make your releases available as open source to everyone who’s interested. This allows you to play the “open source” card in marketing, and makes for a great loss leader for a “pro” or “enterprise” version with a higher price tag. And no changes are needed from more traditional closed source development processes.
Unfortunately your users don’t have much of an incentive to get involved in the development unless they decide to fork your codebase, which usually isn’t what you’d want.

2. Open development

Making it easy for your users to get truly involved in your project requires changes in the way you approach development. You’ll need to open up your source repositories, issue trackers and other tools, and make it easy for people to interact directly with your developers instead of going through levels of support personnel. Do that, and you’ll start receiving all sorts of contributions for free: bug reports, patches, new ideas, documentation, support, advocacy and sales leads. You can even allow trusted contributors to commit their changes directly to your codebase without losing control of the project.

3. Open community

Control, or the illusion of it, is a double-edged sword. If you’re the “owner” of the project, why should others invest heavily in developing or supporting “your” code? To avoid this inherent limitation and to unlock the full potential of the open source community, you’ll need to let go of the idea of the project being yours. Instead, you’re just as much a user and a contributor to the project as everyone else, with no special privileges. The more you contribute, the more you get to influence the direction of the project. This is the secret sauce of most truly successful and sustainable open source projects, and it’s also a key ingredient of the Apache Way.

So what’s the right way?

There’s no single best way to do open (or closed) source, and the right model for your project depends on many factors like your business strategy and environment. The right model can even vary between different codebases within the same company.
For example, in the “open core” model you increase the level of innovation in, and adoption of, your core technologies by open sourcing them (or using existing open source components), but you make money and maintain your competitive edge through closed source add-ons or full layers on top of the open core. This is the model we’ve been using quite successfully at Day (now a part of Adobe).

If you’ve decided to go open source and you don’t have a strong need to maintain absolute control over your codebase (like I suppose Oracle now has over the OpenJDK!), I would recommend going all the way to the open community model. It can be a tough cultural change and often requires changes in your existing development processes and practices, but the payback can be huge. In military terms, the community can act as a force multiplier not just for your developers, but also for the QA and support personnel, and often even your sales and marketing teams!

If you’re interested in pursuing the open community model as described above, the Apache Incubator is a great place to start!

## Generic array factory in Java: recipe for disaster

Let’s implement a generic factory method for arrays in Java, like this:

```java
static <T> T[] createArray(T... t) {
    return t;
}
```

We can use this method to create any array. For example, an array of strings:

```java
String[] strings = createArray("some", "thing");
```

Now let’s add another twist:

```java
static <T> T[] crash(T t) {
    return createArray(t);
}

String[] outch = crash("crash");
```

Running this code will result in a ClassCastException on the last line:

```
Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
```

At first this seems strange: there is no cast anywhere here. So what is going on? Basically the Java compiler is lying to us: when we call the crash method with a string argument, it tells us that we get back an array of strings. Now looking at the exception we see that this is not true.
What we really get back is an array of objects! Actually, the Java compiler issues a warning on the createArray call in the crash method:

```
Type safety: A generic array of T is created for a varargs parameter
```

This is how it tells us about its lying: “Since I don’t know the actual type of T, I’ll just return an array of Object instead.” I think this is wrong. And others seem to think along the same lines.

Filed under: Uncategorized Tagged: Bug, Java

## Chongqing on the rise

“The largest city you’ve never heard about.” That’s how the Foreign Policy magazine labeled Chongqing in a recent story about the city. Today the Finnish television showed an interesting documentary that centered on the same city, and I recall seeing it mentioned in the Economist recently as well. A sign of things to come?

I find it interesting that many of the above stories give the impression of Chongqing as a megacity of 30+ million people, when in fact (or at least according to Wikipedia) the urban population is “just” 5+ million people, and a majority of the rest are farmers living in the surrounding areas that are administratively part of the city.

## Generating hard to guess content URLs in Sling

In RESTful apps it is often useful to create hard to guess URLs, as a simple privacy device. Here’s a self-explaining example (with hardcoded parameters) of how to do that in Sling. After installing this component, an HTTP POST to a node named ‘foo’ creates a child node with a somewhat long hex string as its name, instead of the usual simple names generated by Sling.
```java
package foo;

import java.util.Random;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.servlets.post.NodeNameGenerator;

/** Example that generates hard-to-guess node names in Sling,
 *  for nodes added under nodes named 'foo'.
 *
 *  To test, build and install a bundle that includes this component,
 *  and run
 *  <pre>
 *  curl -X MKCOL http://admin:admin@localhost:4502/foo
 *  curl -F title=bar http://admin:admin@localhost:4502/foo/
 *  </pre>
 *  The second curl call should return something like
 *  <pre>
 *  Content created /foo/dd712dd234637bb9a9a3b3a10221eb1f
 *  </pre>
 *  which is the path of the created node.
 */
@Component
@Service
public class FooNodeNameGenerator implements NodeNameGenerator {

    private static final Random random = new Random(System.currentTimeMillis());

    /** @inheritDoc */
    public String getNodeName(
            SlingHttpServletRequest request,
            String parentPath,
            boolean requirePrefix,
            NodeNameGenerator defaultNng) {
        if (parentPath.endsWith("/foo")) {
            final StringBuilder name = new StringBuilder();
            for (int i = 0; i < 2; i++) {
                name.append(Long.toHexString(random.nextLong()));
            }
            return name.toString();
        }
        return null;
    }
}
```

## Pragmatic validation metrics for third-party software components

Earlier this week at the IKS general assembly, I was asked to present a set of industrial validation metrics for the open source software components that IKS is producing. Being my pragmatic self, I decided to avoid any academic/abstract stuff and focus on concrete metrics that help us provide value-adding solutions to our customers in the long term.

Here’s the result, for a hypothetical FOO software component. Metrics are numbered VMx to make it clear what we’ll be arguing about when it comes to evaluating IKS software.

- VM1 Do I understand what FOO is?
- VM2 Does FOO add value to my product?
- VM3 Is that added value demonstrable/sellable to my customers?
- VM4 Can I easily run FOO alongside or inside my product?
- VM5 Is the impact of FOO on runtime infrastructure requirements acceptable?
- VM6 How good is the FOO API when it comes to integrating with my product?
- VM7 Is FOO robust and functional enough to be used in production at the enterprise level?
- VM8 Is the FOO test suite good enough as a functionality and non-regression “quality gate”?
- VM9 Is the FOO licence (both copyright and patents) acceptable to me?
- VM10 Can I participate in FOO’s development and influence it in a fair and balanced way?
- VM11 Do I know who I should talk to for support and future development of FOO?
- VM12 Am I confident that FOO is still going to be available and maintained once the IKS funding period is over?

VM1 can be surprisingly hard to fulfill when working on researchy/experimental stuff ;-)

Suggestions for improvements are welcome in this post’s comments, as usual. Thanks to Alex Conconi, who contributed VM11.

## Twitter is the new CB…but it’s missing the channels!

When I was a kid, Citizen Band Radio (aka “CB”) was all the rage, if you could afford it. Those small unlicensed two-way radios have a relatively short range – extremely short if you compare it to the global range of Twitter today. And they don’t have that many channels, 40 in most cases if I remember correctly. That works as long as the density of CB users in a given area is not too high.

For general chat, CB etiquette requires you to start by calling on a common channel for whoever you want to talk to and, once you find your partner(s), quickly agree on a different channel to move to, to avoid hogging the common channel. That “agree on a different channel to move to” feature is key to sharing a limited medium efficiently.

As the Twitter population grows, the timeline that I’m getting is more and more crowded, with more and more stuff that I’m not interested in, although I’m willing to follow the general flow of a lot of people.
The global reach of services like Twitter and ubiquitous Internet access make CB mostly obsolete today. Twitter is the new CB, in many ways. What Twitter lacks, however, are the channels, as in:

Could you guys at SXSW move to the #c.sxsw channel and stop boring us with your conference chitchat? We’re jealous, ok? Thanks.

Direct messages don’t work for that, as they are limited to two users – a bit like a point-to-point channel, like the telephone, as opposed to the multipoint channels the CB provides.

Twitter channels could also be very useful for data, like weather stations or other continuous data sources that can benefit from hierarchically organized channels. But let’s keep that discussion for another post. Like my mom said, one topic, one post (not sure it was her, actually).

##### What does Twitter need to support channels?

I think the following rule is sufficient:

Any message that contains a hashtag starting with #c. is not shown in the general timeline, except to people who are explicitly mentioned with their @id in the message.

Such messages can then be retrieved by searching for channel hashtags, including partial hashtag values to support hierarchies. Using hierarchical channel names by convention opens interesting possibilities. The ApacheCon conference general channel would be #c.apachecon for example, the Java track #c.apachecon.j, etc.

This channel filtering could of course be implemented in Twitter clients (@stephtara, remember you said you were going to mention that to @loic?), but in my opinion implementing it on the server side makes more sense, as it’s a generally useful feature. Then again, I’m a server-side guy ;-) Opinions welcome, of course.

## Age discrimination with Clojure

Michael Dürig, a colleague of mine and a big fan of Scala, wrote a nice post about the relative complexity of Scala and Java.
Such comparisons are of course highly debatable, as seen in the comments that Michi’s post sparked, but for the fun of it I wanted to see what the equivalent code would look like in Clojure, my favourite post-Java language.

```clojure
(use '[clojure.contrib.seq :only (separate)])

(defstruct person :name :age)

(def persons
  [(struct person "Boris" 40)
   (struct person "Betty" 32)
   (struct person "Bambi" 17)])

(let [[minors majors] (separate #(<= (% :age) 18) persons)]
  (println minors)
  (println majors))
```

The output is:

```
({:name Bambi, :age 17})
({:name Boris, :age 40} {:name Betty, :age 32})
```

I guess the consensus among post-Java languages is that features like JavaBean-style structures and functional collection algorithms should either be a built-in part of the language, or at least trivially implementable in supporting libraries.

## So Java is more complex than Scala? You must be kidding

My esteemed colleague Michael Duerig posts about Scala code being simpler than Java. His Scala example is beautiful, no question about it:

```scala
object ScalaMain {
  case class Person(name: String, age: Int)

  val persons = List(
    Person("Boris", 40),
    Person("Betty", 32),
    Person("Bambi", 17))

  val (minors, majors) = persons.partition(_.age <= 18)

  def main(args: Array[String]) = {
    println(minors.mkString(", "))
    println(majors.mkString(", "))
  }
}
```

Though I wonder how many Scala programmers are actually able to come up with such concise and elegant code.

Michi’s corresponding Java example, however, is…let’s say horrible. Like making things as complex and bloated as they can be.
Here’s my (slightly) more elegant Java version:

```java
import java.util.List;
import java.util.ArrayList;
import java.util.HashMap;

public class Person extends HashMap<String, Object> {

    public Person(String name, int age) {
        put("name", name);
        put("age", age);
    }

    public static void main(String args[]) {
        final Person[] persons = {
            new Person("Boris", 40),
            new Person("Betty", 32),
            new Person("Bambi", 17),
        };

        List<Person> minors = new ArrayList<Person>();
        List<Person> majors = new ArrayList<Person>();

        for (Person p : persons) {
            if ((Integer) p.get("age") <= 18) {
                minors.add(p);
            } else {
                majors.add(p);
            }
        }

        System.out.println(minors);
        System.out.println(majors);

        // Output:
        // [{age=17, name=Bambi}]
        // [{age=40, name=Boris}, {age=32, name=Betty}]
    }
}
```

Not bad, hey? 37 lines all included, and although Java does require more boilerplate code, it’s not too bad.

All this is kinda tongue in cheek, ok? We could start all sorts of flame wars about type safety, generics and dynamic programming – my point is just that elegant and ugly code can be written in any language. Scala definitely helps with conciseness, but in my opinion Java does not require things to be as bloated as some of those language war examples show.

I’m on my way to Michi’s office to sort this out face to face as well ;-)

Update: the face to face discussion went well, we agreed not to start religious wars…and in the meantime, here are two additional (and more serious) posts on the subject:

## So Scala is too complex?

There is currently lots of talk about Scala being too complex. Instead of more arguing, I implemented the same bit of functionality in Scala and in Java and let everyone decide for themselves.

There is some nice example code in the manual of The Scala 2.8 Collections API which partitions a list of persons into two lists of minors and majors. Below are the fleshed-out implementations in Scala and Java.
First Scala:

```scala
object ScalaMain {
  case class Person(name: String, age: Int)

  val persons = List(
    Person("Boris", 40),
    Person("Betty", 32),
    Person("Bambi", 17))

  val (minors, majors) = persons.partition(_.age <= 18)

  def main(args: Array[String]) = {
    println(minors.mkString(", "))
    println(majors.mkString(", "))
  }
}
```

And now Java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class Person {
    private final String name;
    private final int age;

    public Person(String name, int age) {
        super();
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        } else if (other instanceof Person) {
            Person p = (Person) other;
            return name == null ? p.name == null : name.equals(p.name) && age == p.age;
        } else {
            return false;
        }
    }

    @Override
    public int hashCode() {
        int h = name == null ? 0 : name.hashCode();
        return 39*h + age;
    }

    @Override
    public String toString() {
        return new StringBuilder("Person(")
            .append(name).append(",")
            .append(age).append(")").toString();
    }
}

public class JavaMain {
    private final static List<Person> persons = Arrays.asList(
        new Person("Boris", 40),
        new Person("Betty", 32),
        new Person("Bambi", 17));

    private static List<Person> minors = new ArrayList<Person>();
    private static List<Person> majors = new ArrayList<Person>();

    public static void main(String[] args) {
        partition(persons, minors, majors);
        System.out.println(mkString(minors, ","));
        System.out.println(mkString(majors, ","));
    }

    private static void partition(List<? extends Person> persons,
            List<? super Person> minors, List<? super Person> majors) {

        for (Person p : persons) {
            if (p.getAge() <= 18) minors.add(p);
            else majors.add(p);
        }
    }

    private static <T> String mkString(List<T> list, String separator) {
        StringBuilder s = new StringBuilder();
        Iterator<T> it = list.iterator();
        if (it.hasNext()) {
            s.append(it.next());
        }
        while (it.hasNext()) {
            s.append(separator).append(it.next());
        }
        return s.toString();
    }
}
```

Impressive huh? And the Java version is not even entirely correct, since its equals() method might not cope correctly with subclasses of Person.

Filed under: Uncategorized Tagged: Java, Scala

## Adobe, Day and Open Source: a dream and a nightmare

What does the acquisition of Day by Adobe mean for Day’s open source activities? Some people are disappointed by the lack of comments about this in the official announcements to date.

Thankfully, Erik Larson, senior director of product management and strategy at Adobe, commented on Glyn Moody’s blog post quite early in the frenzy of tweets and blog posts that followed yesterday’s announcement. Quoting him:

…we are very excited for Day’s considerable “open source savvy” to expand Adobe’s already significant open source efforts and expertise. That is a strategic benefit of the combination of the two companies. I have personally learned a lot from David Nuscheler and his team in the past few months as we put the deal together.

Not bad for a start, but we’re engineers, right? Used to considering the worst case, to make sure we’re prepared for it. Me, I’m an engineer but also an optimist, and I’m used to starting with the ideal, happy case when analyzing situations. It helps focus my efforts on a worthy goal.

So let’s do this and dream about the best and worst cases. This is absolutely 100% totally my own dreams, I’m not speaking for anyone here, not wearing any hat. Just dreamin’, y’know?

##### The Dream

This is late 2011.
The last few months have more than confirmed that Day’s acquisition by Adobe, one year ago, happened for strategic reasons: a big part of the deal was filling gaps in Adobe’s enterprise offering, but Day’s open source know-how and network have brought a lot of value as well.

Day folks have played an important role in expanding the open development culture inside Adobe; Photoshop will probably never be fully open source, but moving more key components of the Adobe technology stack to open source, and most importantly open development, has paid off nicely: in terms of reaching out to developers and customers, in getting much better feedback at all levels, and in terms of software quality of course. It’s those eyeballs.

The Apache Software Foundation’s Incubator has been quite busy in the last few months. The new platinum sponsor enjoys a fruitful relationship with the foundation.

With JCR moving to their core, Adobe’s enterprise applications are starting to reach a new level of flexibility. Customers are enthusiastic about being able to access their data via simple and standards-based interfaces. Enterprise-level mashups, anyone?

JCR is not just that minor content repository API pushed by that small Swiss software vendor anymore: being adopted by a major player has made a huge difference in terms of market recognition (I’m sure my friends at Hippo, Jahia and Sakai, among others, will love that one). The added resources have also helped improve the implementations, and people love the book!

With this, Apache Jackrabbit and Apache Sling have reached new levels of community participation and quality. Although quite a few new committers are from Adobe, a number of other companies have also pushed their developers to participate more, due to the increased market visibility of JCR.

Adobe’s additional resources, used wisely to take advantage of the Day team’s strengths, have enabled them to fully realize the CQ5 vision. Everything is content, really.
As in all fairy tales, the former Day team and Adobe live happily ever after. (Editor’s note: this is not Disney, can we strike that one please?)

##### The Nightmare

This is late 2011, and I can hear the programmers complaining in their bland cubicles. Aaarrggghhhhh.

The few Day folks who still work at Adobe did try to convince their management to continue on the open source and open development track. No luck – you can’t argue with a US company making 4 billion a year, can you?

CQ5 customers are too busy converting their websites to native PDF (this is about documents, right?) to realize what’s going on. The most desperate just switched to DrooplaPress, the newest kid on the block of LISP-based CMSes. That won’t help business much, but at least it’s fun to work with. If you love parentheses, that is.

Adobe’s competitors who really jumped on the open source and open development train are gone for good; it is too late to catch up. You should have sold your shares a year ago.

Luckily, Apache Jackrabbit and Apache Sling are still alive, and the increased involvement of the “Benelux Gang” (ex-Day folks spread over a few Benelux content management companies) in those projects means there’s still hope.

You wake up wondering why you didn’t accept that job at the local fast food joint. Computers are so boring.

##### Coda

I know life is more complicated than dreams sometimes, but I like dreams much better than nightmares, and I’m a chronic optimist. So you can easily guess which scenario I’m going to work towards!

I’ll keep you posted about what really happens next. Once I wake up, that is. Just dreamin’, y’know?

##### Related reading

Open Source at Adobe by my colleague and fellow Apache Member Jukka Zitting.

Open innovation in software means Open Source, a recent post of mine.

See also my collected links related to the announcement at http://delicious.com/bdelacretaz/adobeday.

## Open Source at Adobe?
The news is just in about Adobe being set to acquire Day Software (see also the FAQ). Assuming the deal goes through, it looks like I’ll be working for Adobe by the end of this year.

I’m an open source developer, so I’m looking forward to finding out how committed Adobe is to supporting the open development model we’re using for many parts of Day’s products. The first comments from Erik Larson, a senior director of product management and strategy at Adobe, seem promising, and he also asked what the deal should mean for open source. This is my response, from the perspective of the open source projects I’m involved in.

First and foremost, I’m looking forward to continuing the open and standards-based development of our key technologies like Apache Jackrabbit and Apache Sling. There’s no way we’d be able to maintain the current level of innovation and productivity in these key parts of our product infrastructure without our symbiotic relationship with the open source community.

Second, I’m hoping that our experience and involvement with open source projects will help Adobe better interact with the various open source efforts that leverage Adobe standards and technologies like XMP, PDF and Flash. The Apache Software Foundation is home to a growing collection of digital media projects like PDFBox, FOP, Tika, Batik and Sanselan, all of which are in one way or another related to Adobe’s business. For example, as a committer and release manager of the Apache PDFBox project, I’d much appreciate better access to Adobe’s deep technical PDF know-how. Similarly, in Apache Tika we’re considering using XMP as our metadata standard, and better access to and co-operation with the people behind Adobe’s XMP toolkit SDK (see more below) would be highly valuable.

It would be great to see Adobe becoming more proactive in reaching out and supporting such grass-roots efforts that leverage their technologies.
I’ve dealt with Adobe lawyers on such cases before with good results, but it did take some time before I found the correct people to contact.

Another area of improvement would be to make freely redistributable Adobe IP more easily accessible to external developers by pushing it out to central repositories like Maven Central, RubyGems or CPAN – for example, like I did when making PDF core font information available on Maven Central.

Finally, it would be great to see Adobe going further in embracing an open development model for some of their codebases, like the XMP toolkit SDK that they already release under open source licenses. I’d love to champion or mentor the effort, should Adobe be willing to bring the XMP toolkit to the Apache Incubator!

## Dear Oracle, can we have our nice javadoc URLs back?

If you support this request, please vote for it in the comments below and/or on Twitter using the #E17476 hashtag!

Update (2010/07/24): it looks like the old java.sun.com URLs are back, thanks Oracle and especially @mreinhold!

Update (2010/07/27): see also Good Feedback and Happy Endings – The Ugly URLs.

Dear Oracle,

A while ago you bought Sun, and IIRC promised to do good things for Java. Or at least indicated you would. Or something like that.

Now, a bad thing happened a few days ago. Not a bad bad bad thing, just a tiny annoying change in the cool URLs that Sun used to publish the JDK’s javadocs. Not annoying annoying annoying, but not nice.

Even Google remembers: today if I search for IndexOutOfBoundsException on Google it returns the following URL:

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html

Which is a cool URL that shouldn’t change. Now, requesting this URL today causes a redirect to:

http://download.oracle.com/docs/cd/E17476_01/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html

Which is also somewhat cool, but not as much. Factor 10 down in coolness.
It makes me assume that you’re serving javadocs from a CD, and that CD’s identifier is E17476_01. That’s useful info if you’re the filesystem driver that’s reading the CD, but I doubt filesystem drivers are searching for javadocs on Google. Also, I’m not looking at downloading anything. Just browsing, okay?

Cool URLs shouldn’t change. Can we have the old one back? Ok, maybe with java.oracle.com instead of java.sun.com – you bought them anyway. But please please please, leave the poor CD filesystem driver alone! Thanks.

P.S. we’re having a little vote on Twitter about this, check it out at http://search.twitter.com/search?q=%23E17476

## This is how we work at Apache

I just had to (re-)explain how the Apache way of working makes a difference by enabling a continuous flow of information between developers. No more begging for reports, no more boring meetings where you only exchange information: who could say no to that?

Here it is for your enjoyment. This is the same thing that I’ve been saying in my recent talks on this topic, reduced to the bare minimum.

• All technical discussions and decisions happen on public mailing lists.
• Speak in URLs: if you reference something (discussion, vote, code…anything) include its URL, which must be permanent.
• Shared code repository; commit early, commit often (as in: daily at least, from day one).
• Commit events are sent to mailing lists and/or RSS feeds to which people can subscribe.
• Shared issue tracker and “if you’re working on something it must be an issue in the tracker”, so that progress reports are automatic. Also generates mail/RSS events.
• Commits are linked to tracker issue IDs – by speaking in URLs in your commit messages, mostly.
• Automatic archiving of all this information, for self-service access.

All this is public and centrally accessible of course, so everybody gets the same information.
The main reluctance that I see when trying to convince people to work this way is the fear of exposing your mistakes and initial bad designs in public. My answer is to just get over it: you’d find tons of such blunders if you were to analyze my work at Apache over the last ten years, yet I’m reasonably alive and kicking.

## List all your Maven dependencies

Here’s a one-liner (well, two) that neatly lists all the Maven dependencies of your project. Useful to check their licenses, for example.

```shell
# first grab all dependencies
mvn dependency:resolve

# then list them with -o to keep noise low,
# remove extra information and duplicates
mvn -o dependency:list \
  | grep ":.*:.*:.*" \
  | cut -d] -f2- \
  | sed 's/:[a-z]*$//g' \
  | sort -u
```


The output looks like this:

```
asm:asm:jar:1.5.3
asm:asm:jar:3.1
biz.aQute:bnd:jar:0.0.169
cglib:cglib:jar:2.1_3
classworlds:classworlds:jar:1.1
classworlds:classworlds:jar:1.1-alpha-2
...
```


It’s also useful for detecting multiple versions of the same dependency in a multi-module project.
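A small variation on the same pipeline makes those duplicates explicit, by printing only the group:artifact pairs that appear with more than one version. The sample dependency list from above is inlined here so the snippet is self-contained:

```shell
# Report group:artifact pairs present in more than one version,
# using the sample dependency list from above as input.
printf '%s\n' \
  'asm:asm:jar:1.5.3' \
  'asm:asm:jar:3.1' \
  'biz.aQute:bnd:jar:0.0.169' \
  'cglib:cglib:jar:2.1_3' \
  'classworlds:classworlds:jar:1.1' \
  'classworlds:classworlds:jar:1.1-alpha-2' \
| cut -d: -f1-2 | sort | uniq -d
```

For the sample input this reports asm:asm and classworlds:classworlds, the two dependencies present in two versions.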

## Open innovation in software means open source

Here’s a “reprint” of an article that I wrote recently for the H, to introduce my talk at TransferSummit last week.

According to Henry Chesbrough[1], Open Innovation consists of using external ideas as well as internal ideas, and internal and external paths to market, to advance a company’s technology.

Software architects and developers are usually not short of ideas, but which of those ideas are the really good ones? How do you select the winning options and avoid wasting energy and money on the useless ones?

Feedback is the key to separating the wheat from the chaff. Fast, good-quality feedback is required to steer any fast vehicle or sports device, and it works the same in software: without an efficient feedback loop, you’re bound to fall on your face – or at least to be slower than your competitors on the road to success.

##### Innovation is not invention – it’s about value

In a recent blog post on the subject, Christian Verstraete, CTO at HP, rightly notes that innovation is not invention. Whereas the value of a new invention might be unknown, the goal of innovation is to produce value, often from existing ideas.

The output of our feedback loop must then be a measurement of value – and what better value for a software product than happy stakeholders? Other developers adopting your ideas, field testers happy with performance, experts suggesting internal changes which will make them feel good about your software’s structure. That kind of feedback is invaluable in steering your innovative software product in the right direction, quickly.

##### How fast is your feedback loop?

If you have to wait months to get that high-quality feedback, as you might in a corporate setting, your pace of innovation will be accordingly slow.

In the old world of committees, meetings and reports, things move at the speed of overstuffed schedules and overdue reports – slowly. In the new world of agile open source projects, fast and asynchronous Internet-based communication channels are your friends, helping people work at their own pace and on their own schedule, while collectively creating value quickly.

Open source organizations like the Apache Software Foundation provide standardised tools and best practices to foster efficient communications amongst project members. Shared source code repositories generate events to which project members can subscribe, to be informed immediately of any changes in modules that they’re interested in. Web-based issue trackers also use events and subscriptions to make it easy to collaborate efficiently on specific tasks, without requiring the simultaneous online presence of collaborators. Mailing lists also allow asynchronous discussions and decisions, while making all the resulting information available in self-service to new project members.

It is these shared, event-based and asynchronous communications channels that build the quick feedback loop that is key to software innovation. It is not uncommon for a software developer to receive feedback on a piece of code that they wrote, from the other end of the world, just a few minutes after committing that code to the project’s public code repository. Compared to a written problem report coming “from above” a few weeks later, when the developer has moved on to a different module, the value of that fast feedback is very high. It can feel a bit like a bunch of field experts looking over your shoulder while you’re working – scary but extremely efficient.

##### How good are your feedback “sensors”?

Fast feedback won’t help if it’s of low quality, and fortunately open source projects can also help a lot here. Successful projects can help bring together the best minds in the industry, to collectively solve a problem that benefits all of them. The Apache HTTP server project is one of the best examples, with many CTO-level contributors including a few that were involved in defining the protocols and the shape of today’s Web. If software developers (God forbid) were sold between companies the way soccer players are transferred between teams, we’d see millions of dollars flowing around.

Open source projects are very probably the best way to efficiently bring software experts together today. Industry associations and interest groups might fulfill that role in other industries, but developers like to express themselves in code, and open source projects are where that happens today.

You could of course hire experts to give feedback on your software inside your company, but it’s only a handful of companies who have enough money to bring in the level and number of experts that we are talking about – and that might well turn out to be much slower than the open source way of working.

##### What’s the right type of project?

Creating or joining an open source project that helps your business and attracts a community of experts is not that easy: the open source project space is somewhat crowded today, and those experts are busy people.

Judging from the Apache Software Foundation’s achievements in the last ten years, infrastructure projects have by far the highest success rate. If you can reduce (part of) your problem to a generalised software infrastructure that appeals to a wide range of software developers, those experts will see value in joining the project. Apache Hadoop is another very successful example of software architects and developers from different companies joining forces to solve a hard problem (large scale distributed computing) in a way that can benefit a whole industry. On a smaller scale, Apache Jackrabbit, one of the projects in which my employer is very active, brings together many experts from the content management world, to solve the problem of storing, searching and retrieving multimedia content efficiently. Those types of software modules are used as central infrastructure components in systems that share a similar architecture, while offering very different services to their end users.

Projects closer to the user interface level are often harder to manage in an open group, partly because they are often more specific to the exact problem that they solve, and also because it is often hard for people coming from different companies and cultural backgrounds to agree on the colour of the proverbial bike shed. An infrastructure software project can be well defined by an industry specification (such as JCR in Jackrabbit’s case), and/or by automated test suites. These are usually much easier to agree on than user interface mock-ups.

##### Where next?

I hope to have convinced you that open source projects provide the best feedback loop for innovative software. As a next step, I would recommend getting involved in open source projects that matter to you. There are many ways to contribute, from reporting bugs in a useful way, to writing tutorials, contributing new modules or extensions, or simply reporting on your use of the software in various environments.

Contributing, in any small or big way, to a successful open source project is the best way to see this high-quality feedback loop in action. You might also try to use the open source ways of working inside your company, to create or improve your own high-quality “innovation feedback loop”.

I cannot pretend to have the definitive answer to the “how do you select and execute the right ideas to innovate?” question. When it comes to software, however, the fast and high-quality feedback loop that open source projects provide is, in my opinion, the best selection tool.

[1] Chesbrough, H.W. (2003). Open Innovation: The new imperative for creating and profiting from technology. Boston: Harvard Business School Press

## My new Flyer ebike: fast and fun!

I recently bought a new Flyer electric bike: a faster T8 HS ex-demo that I got for a good price.

I sold the previous C8+ to my nephew, happy that it’s staying in the family! In about 15’000km of all-year riding in any weather (including snow, and the accompanying salt on the road) over 4 1/2 years, I had exactly zero problems with the C8, which says a lot about the build quality and maturity of those bikes. It needed just the usual bike maintenance, and an expected change of battery after about 600 charging cycles, but zero maintenance related to the electronics or motor. Not to mention only two punctures in 15’000km, thanks to the Schwalbe Marathon puncture-proof tires. Over time, those tires get covered in small superficial holes, which are mostly punctures that didn’t happen – very cool!

The new bike is an HS model, as in high speed: contrary to the old one, which would only assist me up to 25kmh (so you’d be faster on the flat with a good bike), this one happily helps up to 45kmh or more, and it’s also a better bike to start with: 28″ wheels, a more rigid frame and thinner, higher-pressure tires. As with all so-called pedelec bikes, the Flyers don’t go anywhere if you don’t pedal; the assistance only kicks in (very naturally) when you ride normally, and gives you more power.

And this thing is fast: I just beat my record on the commute back from the office, 350m elevation over 12km, getting home in 29:30 which means 24kmh average speed on Lausanne’s steep hills. Not bad – you do have to pedal hard to reach such speeds uphill, but it’s a lot of fun and I get home almost as fast as any other transportation, considering the traffic density – and I don’t need to spend time at the gym after that, so I’m probably saving time all in all! The morning downhill ride takes about 20 minutes, unbeatable at 8AM unless you ride a helicopter.

The equipment is very good: Magura hydraulic rim brakes (almost as good as discs; I guess newer models have those), LED lights, a lockable front fork and the SRAM dual drive, which combines a 3-gear hub with an 8-gear rear derailleur to get 24 usable combinations (no “forbidden” ones, unlike dual derailleurs). You cannot have a front derailleur on the Flyer, due to the motor wheel which drives the chain, and the dual drive is really the best of both worlds in the city: the gear hub for quick downshifting when stopping or when surprised, and the derailleur for fine-tuning.

All in all, an excellent commuter’s bike if your ride is steep, or just for the fun of riding faster. The big plus with the ebike is that you can use less of your own energy if you’re tired or if conditions are bad, while still getting to your destination in a reasonable amount of time.

You do have to ride very carefully as sleepy car drivers and pedestrians often don’t realize how fast you ride on that thing, nor that you’re actually faster than cars in many tight or bumpy places. After years of motorcycling and cycling I’m used to being very clear about my intentions on the road, using obvious positioning in lanes, and that helps a lot! The city of Lausanne is also doing an excellent job in helping cyclists find safe space to ride, and most of my commute is on very low-traffic roads as a result.

Do I sound enthusiastic? That’s because I am – electric bikes are by far the best way of commuting in a steep city like Lausanne. They are somewhat expensive to buy, but maintenance costs almost nothing, and you save a lot on gym costs (and doctor’s fees I guess – cycling is good for your health). And if you drive a car or motorbike to work, you should really calculate how much that costs and draw the right conclusions!

## Type Level Programming: Equality

Apocalisp has a great series on Type Level Programming with Scala. At some point the question came up whether it is possible to determine equality of types at run time by having the compiler generate types representing true and false respectively. Here is what I came up with.

```scala
trait True { type t = True }
trait False { type t = False }

case class Equality[A] {
  def check(x: A)(implicit t: True) = t
  def check[B](x: B)(implicit f: False) = f
}

object Equality {
  def witness[T] = null.asInstanceOf[T]
  implicit val t: True = null
  implicit val f: False = null
}

// Usage:
import Equality._

val test1 = Equality[List[Boolean]] check witness[List[Boolean]]
implicitly[test1.t =:= True]
// Does not compile since test1.t is True
// implicitly[test1.t =:= False]

val test2 = Equality[Nothing] check witness[AnyRef]
// Does not compile since test2.t is False
// implicitly[test2.t =:= True]
implicitly[test2.t =:= False]
```


Admittedly this is very hacky. For the time being I don’t see how to further clean this up. Anyone?

Filed under: Uncategorized Tagged: Meta-Programming, Scala

## Can I haz web?

How many people today still think information is only valid or “serious” when represented on an A4 piece of paper?

Way too many, if you ask me.

I’m always disappointed when people push out important content as PDF documents (or much worse…I won’t even name that format) attached to web pages or email messages, instead of just including the content in those web pages or messages, as a first-class citizen.

For some reason, people seem to think that information presented in A4 format has more value than the same information presented as a simple and clean web page. It is quite the opposite actually: web pages can be linked to, easily indexed, reformatted for efficient reading (thanks readability), etc.

Ted Nelson, the inventor of hypertext, wrote in 1999 already [1]:

We must overthrow the paper model, with its four prison walls and peephole one-way links

And also, in the same paper:

WYSIWYG generally means “What You See Is What You Get” — meaning what you get when you print it out. In other words, paper is the flat heart of most of today’s software concepts.

Granted, we haven’t fully solved the two-way links problem yet, but I hope you get the idea. Who needs paper or A4 pages? This is 2010, and this is the Web.

Please think about it next time you publish an important piece of information. Does it really need to live in the prison walls of a “document”? In what ways is that more valid than a web page or plain text email message?

Most of the time, almost always, the answer is: it’s not more valid, it’s just less usable.

Can I haz web? kthxbye.

## Working around type erasure ambiguities (Scala)

In my previous post I showed a workaround for the type erasure ambiguity problem in Java. The solution uses vararg parameters for disambiguation. As Paul Phillips points out in his comment, this solution doesn’t directly port over to Scala. Java uses Array to pass varargs, Scala uses Seq. Unlike Array, Seq is not reified so Seq[String] and Seq[Int] again erase to the same type putting us back to square one.

However, there is another way to add disambiguation parameters to the methods: implicits! Here is how:

// The implicit parameter lists give the two methods different erased
// signatures, while the implicit values keep the call sites unchanged
implicit val x: Int = 0
def foo(a: List[Int])(implicit ignore: Int) { }

implicit val y: String = ""
def foo(a: List[String])(implicit ignore: String) { }

foo(1 :: 2 :: Nil)      // resolves to the List[Int] overload via the implicit Int
foo("a" :: "b" :: Nil)  // resolves to the List[String] overload via the implicit String


Filed under: Uncategorized Tagged: Puzzle, Scala

## Working around type erasure ambiguities

In an earlier post I already showed how to work around ambiguous method overloads resulting from type erasure. In a nutshell, the following code won’t compile since both overloaded methods foo erase to the same type.

Scala:

def foo(ints: List[Int]) {}
def foo(strings: List[String]) {}


Java:

void foo(List<Integer> ints) {}
void foo(List<String> strings) {}


It turns out that there is a simple though somewhat hacky way to work around this limitation: to make the ambiguity go away, we need to change the signature of foo in such a way that 1) the erasures of the two foo methods differ and 2) the call site is not affected.

Here is a solution for Java:

void foo(List<Integer> ints, Integer... ignore) {}
void foo(List<String> strings, String... ignore) {}


We can now call foo passing either a list of ints or a list of strings without ambiguity:

foo(new ArrayList<Integer>());
foo(new ArrayList<String>());
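For reference, here is the same trick as a complete, compilable sketch; the class name and the string return values are mine, added only to make visible which overload gets picked:

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {

    // The vararg parameter changes the erased signatures to
    // foo(List, Integer[]) vs. foo(List, String[]), so they no longer clash
    static String foo(List<Integer> ints, Integer... ignore) { return "ints"; }
    static String foo(List<String> strings, String... ignore) { return "strings"; }

    public static void main(String[] args) {
        // The call sites look exactly as before; zero varargs are passed
        System.out.println(foo(new ArrayList<Integer>()));  // ints
        System.out.println(foo(new ArrayList<String>()));   // strings
    }
}
```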


This doesn’t directly port over to Scala (why?). However, there is a similar hack for Scala. I leave this as a puzzle for a couple of days before I post my solution.

Filed under: Uncategorized Tagged: Java, Puzzle, Scala

## Second FISE Hackathon

At this week's IKS meeting at Paderborn the second FISE Hackathon took place. FISE is an open source semantic engine that provides semantic annotation algorithms like semantic lifting. The actual annotation algorithms are pluggable through OSGi. Existing CMSs can integrate the engine through an HTTP interface (inspired by Solr). Last week, Bertrand gave an introductory talk about FISE that is available online.

There was no explicitly set goal for the second Hackathon. Rather, the existing code base was extended in various different directions. Some examples:

• a language detection enhancement engine (I am particularly glad to see this - automatic language detection in CMSs is a pet passion of mine)
• a UI for FISE users that allows humans to resolve ambiguities
• myself, I coded a JCR-based storage engine for the content and annotations

There was also a good amount of work done on the annotation structure used by FISE and documented on the IKS wiki.

A complete report of the Hackathon is available on the IKS wiki (the only thing it fails to mention: the event's good spirit).

One major non-code step was to get many participants up to speed with the FISE engine, enabling them to deploy it and to get accustomed to the architecture and code base.

It was only last week that I took a deeper look into FISE. I like its architecture a lot. The HTTP interface makes it easy to play with FISE as well as integrate it. Even more important, the pluggable architecture that is mostly inherited from the OSGi services architecture makes FISE very flexible and extensible. This is particularly important given the different natures of the enhancement engines that we want to be able to deploy (hosted services, proprietary, open source, etc). I consider FISE to be a particularly well suited use case for OSGi.

(cross-posting from here)

## Forking a JVM

The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example in Apache Tika we’re looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.

In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn’t support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is to start up a completely new process. To create a mirror copy of your current process you’d need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on predefined knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.

But there’s another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the “child JVM” with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
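The parent/child round trip over standard input and output can be illustrated with a toy sketch. All names below are mine, and unlike Tika's fork feature this naive version simply reuses the parent's classpath rather than serving classes over the communication link:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class ForkDemo {

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("child")) {
            // Child side: read one line from the parent and echo a reply on stdout
            BufferedReader in =
                    new BufferedReader(new InputStreamReader(System.in));
            System.out.println("child got: " + in.readLine());
        } else {
            System.out.println(roundTrip("hello"));
        }
    }

    // Parent side: start a completely new JVM running this class in child
    // mode, then talk to it over its standard input and output streams
    public static String roundTrip(String message) throws Exception {
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        Process child = new ProcessBuilder(
                java, "-cp", System.getProperty("java.class.path"),
                "ForkDemo", "child").start();
        PrintWriter out = new PrintWriter(child.getOutputStream(), true);
        BufferedReader in =
                new BufferedReader(new InputStreamReader(child.getInputStream()));
        out.println(message);
        String reply = in.readLine();
        child.waitFor();
        return reply;
    }
}
```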

My code is still far from production-ready, but I believe I’ve already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would be useful also to other projects beyond Tika. Initial searching didn’t bring up other implementations of the same idea, but I wouldn’t be surprised if there are some out there. Pointers welcome.

## Apache meritocracy vs. architects

Ceki Gülcü recently wrote an interesting post on the Apache community model and its vulnerability in cases where consensus cannot be reached with reasonable effort. The discussion in the comments is also interesting.

Ceki’s done some amazing work especially on Java logging libraries, and his design vision shines through the code he’s written. He’s clearly at the high edge of the talent curve even among a community of highly qualified open source developers, which is why I’m not surprised that he dislikes the conservative nature of the consensus-based development model used at Apache. And the log4j history certainly is a sorry example of conservative forces more or less killing active development. In hindsight Ceki’s decision to start the slf4j and logback projects may have been the best way out of the deadlock.

Software development is a complex task where the best results are achieved when a clear sense of architecture and design is combined with hard work and attention to detail. A consensus-based development model is great for the latter, but can easily suffer from the design-by-committee syndrome when dealing with architectural changes or other design issues. From this perspective it’s no surprise that the Apache Software Foundation is considered a great place for maintaining stable projects. Even the Apache Incubator is geared towards established codebases.

Even fairly simple refactorings like the one I’m currently proposing for Apache Jackrabbit can require quite a bit of time-consuming consensus-building, which can easily frustrate people who are proposing such changes. In Jackrabbit I’m surrounded by highly talented people so I treat the consensus-building time as a chance to learn more and to challenge my own assumptions, but I can easily envision cases where this would just seem like extra effort and delay.

More extensive design work is almost always best performed mainly by a single person based on reviews and comments by other community members.  Most successful open and closed source projects can trace their core architectures back to the work of a single person or a small tightly-knit team of like-minded developers. This is why many projects recognize such a “benevolent dictator” as the person with the final word on matters of project architecture.

The Apache practices for resolving vetos and other conflicts work well when dealing with localized changes where it’s possible to objectively review two or more competing solutions to a problem, but in my experience they don’t scale that well to larger design issues. The best documented practice for such cases that I’ve seen is the “Rules for revolutionaries” post, but it doesn’t cover the case where there are multiple competing visions for the future. Any ideas on how such situations should best be handled in Apache communities?

## ICSE Most Influential Paper award

On the same day that Liam was born, I received news that one of my two papers published at the ICSE 2000 conference has been given the International Conference on Software Engineering’s Most Influential Paper Award for its impact on software engineering research over the past decade. The paper, A case study of open source software development: the Apache server, is co-authored by Audris Mockus, myself, and James Herbsleb. The MIP is an important award within the academic world; my thanks to the award committee and congrats to Audris and Jim. I wish I could have been there in South Africa for the presentation. This year’s award is shared with a paper by Corbett et al. on Bandera.

Interestingly, my other paper in ICSE 2000 was the first conference paper about REST, co-authored with my adviser, Dick Taylor. That must have caused some debate within the awards committee. As I understand it, the MIP award is based on academic citations of the original paper and any follow-up publication in a journal. Since I encouraged people to read and cite my dissertation directly, rather than the ICSE paper’s summary or its corresponding journal version, I am not surprised that the REST paper is considered less influential. However, it does make me wonder what would have happened if I had never published my dissertation on the Web. Would that paper have been cited more, or would nobody know about REST? shrug. I like the way it turned out.

The next two International Conferences on Software Engineering will be held in Hawaii (ICSE 2011), with Dick as the general chair, and Zürich (ICSE 2012). That is some fine scheduling on the part of the conference organizers! Fortunately, I have a pretty good excuse to attend both.

## Some people call him Liam

After years of planning and hoping and preparing and learning and worrying and just getting on with life, I became a Daddy in March. It came as a bit of a shock, in spite of the eight months of watching the ultrasounds and taking classes and helping Cheryl as the little pod grew. We had just moved to a bigger place, still had dozens of boxes left to unpack before the weekend’s baby shower, and I had only been asleep for a few hours when Cheryl woke me up with the news: Hospital, now!

Three weeks early. Twenty-two days early, to be exact. All the books say that the range of 38-42 weeks is “normal”, so he was only eight days ahead of the curve and (thank goodness) beyond the stage of preemie health concerns. 2600 grams (5.732 lbs.) of joy, and a healthy Mommy as well. Woohoo! Of course, that also meant we were tossed out of the hospital about 40 hours after birth, thanks to our wonderful US healthcare system.

The staff and facilities at Hoag Hospital were excellent, but the whole experience was marred by the rush out of the hospital and then a corresponding rush back to the hospital three days later after a test for jaundice turned up in the critical range. We really weren’t prepared for that one; I am still peeved that the test wasn’t automatically scheduled for day 4 (instead of waiting for our pediatrician to see him on day 5). However, a night in the ICU tanning bed, with extra feeding to help evacuate the bilirubin, was enough to get him back to a safe zone and he was good to go home again.

Twenty-two days early doesn’t sound like much, but it is huge. Most of our friends went long for their first baby, so I had this schedule in the back of my mind of all the things that I was going to finish by April so that I could take a long, relaxing break into parenthood. Bzzt! The Anaheim IETF meeting was being held the following week, just twelve miles from my house, and my fellow HTTP standard editors had planned a whole week of editing httpbis at or near my place. Bzzt! We had delayed buying a bunch of baby things until after the shower. Bzzt! We had all these classes on what to expect in terms of sensing the arrival and onset of labor. Bzzt!

None of those plans truly mattered, in the grand scheme of things, but it taught me a quick lesson about my limitations as a working Daddy. At least some of my planning worked out, such as saving my vacation time so that I could spend the better part of six weeks at home. He is almost at two months now and still has to eat every three hours. I usually take the night shift and catch up with email while he sleeps on my shoulder. This weekend I discovered that I can actually type this way, with Liam sliding down a bit to warm his legs on my laptop, though I have to watch out when his little feet brush over the multitouch trackpad.

I’ll be catching up on the backblog soon. Now, if I can just get him to sleep long enough to edit a specification …

BTW, Liam is his nickname.

## Buzzword conference in June

Like the Lucene conference I mentioned earlier, Berlin Buzzwords 2010 is a new conference that fills in the space left by the decision not to organize an ApacheCon in Europe this year. Going beyond the Apache scope, Berlin Buzzwords is a conference for all things related to scalability, storage and search. Some of the key projects in this space are Hadoop, CouchDB and Lucene.

I’ll be there to make a case for hierarchical databases (including JCR and Jackrabbit) and to present the Apache Tika project. The abstracts of my talks are:

The return of the hierarchical model

After its introduction the relational model quickly replaced the network and hierarchical models used by many early databases, but the hierarchical model has lived on in file systems, directory services, XML and many other domains. There are many cases where the features of the hierarchical model fit the needs of modern use cases and distributed deployments better than the relational model, so it’s a good time to reconsider the idea of a general-purpose hierarchical database.

The first part of this presentation explores the features that differentiate hierarchical databases from relational databases and NoSQL alternatives like document databases and distributed key-value stores. Existing hierarchical database products like XML databases, LDAP servers and advanced filesystems are reviewed and compared.

The second part of the presentation introduces the Content Repository for Java Technology (JCR) standard as a modern take on standardizing generic hierarchical databases. We also look at Apache Jackrabbit, the open source JCR reference implementation, and how it implements the hierarchical model.

and:

Text and metadata extraction with Apache Tika

Apache Tika is a toolkit for extracting text and metadata from digital documents. It’s the perfect companion to search engines and any other applications where it’s useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.

This presentation introduces Apache Tika and shows how it’s being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity.

I hear there are still some early bird tickets available. See you in Berlin!

## Commit early, commit often!

A huge commit was made in a log4j branch yesterday. The followup discussion:

“I haven’t had a chance to review the rest of the commit, but it seems like a substantial amount of work that was done in isolation. While things are still fresh, can you walk through the whats in this thing and the decisions that you made.”

“I didn’t want to commit code until I had the core of something that actually functioned. I struggled for a couple of weeks over how to attack XMLConfiguration. [...] See below for what I came up with.”

Followed by ten bullet points about the changes made. Unfortunately the only thing our version control system now knows about these changes is “First version”.

## Lucene conference in May

This year there is no ApacheCon Europe, but a number of more focused events related to projects at Apache and elsewhere are showing up to fill the space.

The first one is Apache Lucene EuroCon, a dedicated Lucene and Solr user conference on 18-21 May in Prague. That’s the place to be if you’re in Europe and interested in Lucene-based search technology (or want to stop by for the beer festival). I’ll be there presenting Apache Tika, and the abstract of my presentation is:

Apache Tika is a toolkit for extracting text and metadata from digital documents. It’s the perfect companion to search engines and any other applications where it’s useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.

This presentation introduces Apache Tika and shows how it’s being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity.

The rest of the conference program is also now available. See you there!

## “SIMPLE”.toLowerCase() is simple, right?

It turns out that "SIMPLE".toLowerCase().equals("simple") is not true if your default locale is Turkish, but your code is written in English. Turkish has two “i” characters, one with a dot and one without, which throws the above code off balance. The fix is to write the expression either as "SIMPLE".toLowerCase(Locale.ENGLISH).equals("simple") or even better as "SIMPLE".equalsIgnoreCase("simple").

I just stumbled on this issue with Apache Tika (see TIKA-404), and it seems like I’m not the only one.
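A quick way to see all three variants side by side is to switch the default locale to Turkish; the demo class name is mine:

```java
import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        // Simulate running on a machine with a Turkish default locale
        Locale.setDefault(new Locale("tr", "TR"));

        // In Turkish, "I" lowercases to dotless "ı" (U+0131), so this is false
        System.out.println("SIMPLE".toLowerCase().equals("simple"));

        // Fix 1: pin the locale used for the conversion
        System.out.println("SIMPLE".toLowerCase(Locale.ENGLISH).equals("simple"));

        // Fix 2: even better, avoid case conversion entirely
        System.out.println("SIMPLE".equalsIgnoreCase("simple"));
    }
}
```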

## Simple LRU caches with LinkedHashMap

Recently I discovered two very nice features of java.util.LinkedHashMap: accessOrder and removeEldestEntry(Entry). Combined, these features let you implement a simple LRU cache in under two minutes.

accessOrder

The accessOrder flag is set when creating the LinkedHashMap instance using the LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder) constructor. This boolean flag specifies how the entries in the map are ordered:

accessOrder=true

The elements are ordered according to their access: when iterating over the map, the least recently accessed entry is returned first and the most recently accessed entry is returned last. Only the get, put, and putAll methods influence this ordering.

accessOrder=false

The elements are ordered according to their insertion. This is the default if any of the other LinkedHashMap constructors is used. In this ordering read access to the map has no influence on element ordering.
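The difference between the two orderings can be seen in a few lines (the demo names are mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessOrderDemo {
    public static void main(String[] args) {
        // accessOrder=true: iteration runs from least to most recently accessed
        Map<String, Integer> accessOrdered = new LinkedHashMap<>(16, 0.75f, true);
        accessOrdered.put("a", 1);
        accessOrdered.put("b", 2);
        accessOrdered.put("c", 3);
        accessOrdered.get("a"); // moves "a" to the end of the iteration order
        System.out.println(accessOrdered.keySet()); // [b, c, a]

        // default constructor: insertion order, unaffected by reads
        Map<String, Integer> insertionOrdered = new LinkedHashMap<>();
        insertionOrdered.put("a", 1);
        insertionOrdered.put("b", 2);
        insertionOrdered.put("c", 3);
        insertionOrdered.get("a"); // no effect on the ordering
        System.out.println(insertionOrdered.keySet()); // [a, b, c]
    }
}
```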

removeEldestEntry(Entry)

The second feature of interest is the removeEldestEntry(Entry) method. This method is called with the eldest entry whenever an element is added to the map. Eldest means the element which is returned first when iterating over the map, so the notion of eldest is influenced by the accessOrder set on the map. The default implementation of removeEldestEntry just returns false to indicate that nothing should happen. An extension of LinkedHashMap may override the default implementation to do whatever is required:

• If the implementation decides to remove the eldest element for any reason, say a size limitation, it just returns true and the eldest element is removed from the map.

• The implementation may also decide to modify the map itself in one way or another. In this case, however, it must return false, otherwise the eldest element will be removed as well.

A simple LRU Cache

Taking the two features together, a very simple LRU Cache may be implemented in just a few lines of code:

public class LRUCache<K, V> extends LinkedHashMap<K, V> {

    private final int limit;

    public LRUCache(int limit) {
        super(16, 0.75f, true);
        this.limit = limit;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > limit;
    }
}

The mechanism is very simple: the LRUCache(int) constructor initializes the map with the default initial size and load factor and puts the map into accessOrder mode. removeEldestEntry just checks the current map size (after the addition of a new entry) against the limit and returns true if the limit has been exceeded.

A real-world implementation would of course have to validate the limit value in the constructor.
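A quick usage sketch (with the cache class inlined so the example is self-contained) shows the eviction behavior:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LRUCacheDemo {

    // Same LRU cache as above, nested here to keep the demo self-contained
    static class LRUCache<K, V> extends LinkedHashMap<K, V> {
        private final int limit;

        LRUCache(int limit) {
            super(16, 0.75f, true);
            this.limit = limit;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > limit;
        }
    }

    public static void main(String[] args) {
        LRUCache<String, Integer> cache = new LRUCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a"); // touch "a", so "b" becomes the eldest entry
        cache.put("c", 3); // exceeds the limit: evicts "b", the least recently used
        System.out.println(cache.keySet()); // [a, c]
    }
}
```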

To see a LinkedHashMap-based LRU cache in action, have a look at BundleResourceCache.BundleResourceMap, which implements a simple entry cache to speed up access to OSGi bundle entries. To avoid wasting memory, the size of the cache is limited.

## [ANN] Talking at Scala Days 2010 in Lausanne next Thursday

I’ll be talking at Scala Days 2010 in Lausanne on April 15th about the Scala scripting engine for Apache Sling. While my talk at Jazoon 09 was mainly about using Scala from Sling, this session will be more focused on internals of the Scala scripting engine.

Unfortunately (or fortunately depending on the point of view) the conference is sold out already. Watch my Scala for scripting page for the session slides and other upcoming support material.

Filed under: Uncategorized Tagged: Conference, Scala

## NoSQL talk at Developer Summit

Three days ago I had the chance to talk about NoSQL at the Internet Briefing's Developer Summit. On top of general ideas and concepts like the CAP theorem, I chose to talk about Apache Jackrabbit, CouchDB and Cassandra. My slides are embedded below.

It was a really good event with interesting speakers and a knowledgeable audience. I was especially pleased that when I talked about CouchDB's HTTP API someone from the audience mentioned that Apache Sling does something very similar for Jackrabbit.

Special kudos to Christian Stocker of Liip for daring to do a live demo of the "real-time web" - he took a picture from his phone and had it pop up on Jabber and Twitter in about 5 secs.

Vlad Trifa has posted a good summary of the whole event (part 1, part 2) - he also gave a great presentation about the application of the REST architectural style to the "Web of Things".


## True size of Finland

Whenever you see a map, the chances are that it uses the Mercator projection. It’s a fine enough projection especially on a local scale, but I’ve always disliked the way it makes places that are far from the equator seem much larger than they really are. Since I’ve lived most of my life in Finland (i.e. above 60° N or as high up north as Alaska), I find that this distortion heavily affects my ability to accurately estimate distances in other parts of the world even when I’m well aware of this problem.

To illustrate this issue, I’ve constructed the below image that shows how Finland compares to Central Europe and Southern China (the areas I’m most interested in) in the Mercator projection and the Goode homolosine projection that accurately represents the relative areas of any two places on the earth. The difference is really quite striking:

I’m considering purchasing a poster with such an equal-area world map and hanging it on a wall somewhere I can see it every day. That way I could perhaps overcome the systematic error that the Mercator projection has taught me.

## CMS vendors now and then

CMS analyst Janus Boye has just published a post on CMS vendors that discontinue their products (because they get bought out or similar):

“During the past 10 years, a number of software products used by online professionals have been discontinued”

That sentence remindedme that I had given a talk almost 10 years ago (in 2001, to be exact) that contained a slide on the CMS market at that time:

The circles denote vendors that were part of CMS market overview articles by popular German IT magazines in that year (I wanted to show how differently the market place could be perceived). A vendor placed in any of the circles had enough attention to be part of at least one evaluation. The vendors outside of the circles were not part of any of these overview articles, but somehow present in the market place - at least I knew their names back then.

It is interesting to look at the landscape from that time. Of course there are a number of well-known vendors that got bought (Vignette, Obtree, Gauss), but the majority still seems to linger on - at least, a web site still exists, for example iRacer, Schema Text, or Contens.

On the other hand, one can ask how many vendors that were important enough to make it into a (German) market overview are still relevant in the market place today. I have used Janus Boye's spreadsheet of relevant European CMS vendors as a benchmark and checked which vendors on today's list were already in the 2001 presentation: Day, Coremedia and Open Text were "in the circles". Tridion was there, but outside of the circles. The rest of the vendors that Janus considers relevant today were not on my radar in 2001.

The end of my presentation involved a couple of CMS-related predictions. Let's see how I did. I predicted:
• product borders between CMS, DMS and app servers will blur further - my take now: wrong. I do not think that these borders are more blurry than they were in 2001
• more standards and standards-based software (Java, JSP/ASP, XML, XSL) - true. The underlying technologies of CMSs are more homogeneous than they were at that time. Remember TCL?
• But no true compatibility. True. Nothing more to say.
• Improved Personalization. Improved Multi-Channel support. Both not really true, but rather fads of those days.
• Improved DMS features and Office integration. Don't ask me why I said that.
• No quick market consolidation in sight. Right on the money here.
Mostly correct on general market considerations, mostly wrong on features.

## The new BASIC

I’m seeing many posts that worry about computing devices like iPhones and the new iPad preventing people from having direct control over the hardware. Mark is telling us about a Ctrl+Reset and a BASIC prompt. Nowadays you get started with the following on an HTML page:

    <script type="text/javascript">
      document.write("Hello, World!");
    </script>

And you can do anything! Don’t tell me the days of tinkering are over.

## Scala type level encoding of the SKI calculus

In one of my posts on type level meta programming in Scala the question of Turing completeness already came up. The question is whether Scala’s type system can be used to force the Scala compiler to carry out any calculation which a Turing machine is capable of. Several of my older posts show how Scala’s type system can be used to encode addition and multiplication on natural numbers and how to encode conditions and bounded loops.

Motivated by the blog post More Scala Typehackery, which shows how to encode in Scala’s type system a version of the lambda calculus limited to abstraction over a single variable, I set out to explore the topic further.

##### The SKI combinator calculus

Looking for a calculus which is relatively small, easily encoded in Scala’s type system and known to be Turing complete I came across the SKI combinator calculus. The SKI combinators are defined as follows:

$Ix \rightarrow x$,
$Kxy \rightarrow x$,
$Sxyz \rightarrow xz(yz)$.

They can be used to encode arbitrary calculations. For example reversal of arguments. Let $R \equiv S(K(SI))K$. Then

$R x y \equiv$
$S(K(SI))K x y \rightarrow$
$K(SI)x(Kx)y \rightarrow$
$SI(Kx)y \rightarrow$
$Iy(Kxy) \rightarrow$
$Iyx \rightarrow yx$.

Self application is used to find fixed points. Let $\beta \equiv S(K\alpha)(SII)$ for some combinator $\alpha$. Then $\beta\beta \rightarrow \alpha(\beta \beta)$. That is, $\beta\beta$ is a fixed point of $\alpha$. This can be used to achieve recursion. Let $R$ be the reversal combinator from above. Further define

$A_0 x \equiv c$ for some combinator $c$ and
$A_n x \equiv x A_{n-1}$.

That is, combinator $A_n$ is the combinator obtained by applying its argument to the combinator $A_{n-1}$. (There is a bit of cheating here: I should actually show that such combinators exist. However since the SKI calculus is Turing complete, I take this for granted.) Now let $\alpha$ be $R$ in $\beta$ from above (That is we have $\beta \equiv S(KR)(SII)$ now). Then

$\beta\beta A_0 \rightarrow c$

and by induction

$\beta\beta A_n \rightarrow \beta\beta A_{n-1} \rightarrow \dots \rightarrow c$.

##### Type level SKI in Scala

Encoding the SKI combinator calculus in Scala’s type system seems not too difficult at first. It turns out however that some care has to be taken regarding the order of evaluation. To guarantee that for all terms which have a normal form, that normal form is actually found, a lazy evaluation order has to be employed.

Here is a Scala type level encoding of the SKI calculus:

trait Term {
  type ap[x <: Term] <: Term
  type eval <: Term
}

// The S combinator
trait S extends Term {
  type ap[x <: Term] = S1[x]
  type eval = S
}
trait S1[x <: Term] extends Term {
  type ap[y <: Term] = S2[x, y]
  type eval = S1[x]
}
trait S2[x <: Term, y <: Term] extends Term {
  type ap[z <: Term] = S3[x, y, z]
  type eval = S2[x, y]
}
trait S3[x <: Term, y <: Term, z <: Term] extends Term {
  type ap[v <: Term] = eval#ap[v]
  type eval = x#ap[z]#ap[y#ap[z]]#eval
}

// The K combinator
trait K extends Term {
  type ap[x <: Term] = K1[x]
  type eval = K
}
trait K1[x <: Term] extends Term {
  type ap[y <: Term] = K2[x, y]
  type eval = K1[x]
}
trait K2[x <: Term, y <: Term] extends Term {
  type ap[z <: Term] = eval#ap[z]
  type eval = x#eval
}

// The I combinator
trait I extends Term {
  type ap[x <: Term] = I1[x]
  type eval = I
}
trait I1[x <: Term] extends Term {
  type ap[y <: Term] = eval#ap[y]
  type eval = x#eval
}


Further, let’s define some constants to act upon. These are used to test whether the calculus actually works.

trait c extends Term {
  type ap[x <: Term] = c
  type eval = c
}
trait d extends Term {
  type ap[x <: Term] = d
  type eval = d
}
trait e extends Term {
  type ap[x <: Term] = e
  type eval = e
}


Finally, the following definition of Equals lets us check types for equality:

case class Equals[A >: B <: B, B]()

Equals[Int, Int]     // compiles fine
Equals[String, Int] // won't compile


Now let’s see whether we can evaluate some combinators.

// Ic -> c
Equals[I#ap[c]#eval, c]

// Kcd -> c
Equals[K#ap[c]#ap[d]#eval, c]

// KKcde -> d
Equals[K#ap[K]#ap[c]#ap[d]#ap[e]#eval, d]

// SIIIc -> Ic
Equals[S#ap[I]#ap[I]#ap[I]#ap[c]#eval, c]

// SKKc -> Ic
Equals[S#ap[K]#ap[K]#ap[c]#eval, c]

// SIIKc -> KKc
Equals[S#ap[I]#ap[I]#ap[K]#ap[c]#eval, K#ap[K]#ap[c]#eval]

// SIKKc -> K(KK)c
Equals[S#ap[I]#ap[K]#ap[K]#ap[c]#eval, K#ap[K#ap[K]]#ap[c]#eval]

// SIKIc -> KIc
Equals[S#ap[I]#ap[K]#ap[I]#ap[c]#eval, K#ap[I]#ap[c]#eval]

// SKIc -> Ic
Equals[S#ap[K]#ap[I]#ap[c]#eval, c]

// R = S(K(SI))K  (reverse)
type R = S#ap[K#ap[S#ap[I]]]#ap[K]
Equals[R#ap[c]#ap[d]#eval, d#ap[c]#eval]


Finally, let’s check whether we can do recursion using the fixed point operator from above. First, let’s define $\beta$.

// b(a) = S(Ka)(SII)
type b[a <: Term] = S#ap[K#ap[a]]#ap[S#ap[I]#ap[I]]


Further, let’s define some of the $A_n$s from above.

trait A0 extends Term {
  type ap[x <: Term] = c
  type eval = A0
}
trait A1 extends Term {
  type ap[x <: Term] = x#ap[A0]#eval
  type eval = A1
}
trait A2 extends Term {
  type ap[x <: Term] = x#ap[A1]#eval
  type eval = A2
}


Now we can do iteration on the type level using a fixed point combinator:

  // Single iteration
type NN1 = b[R]#ap[b[R]]#ap[A0]
Equals[NN1#eval, c]

// Double iteration
type NN2 = b[R]#ap[b[R]]#ap[A1]
Equals[NN2#eval, c]

// Triple iteration
type NN3 = b[R]#ap[b[R]]#ap[A2]
Equals[NN3#eval, c]


Finally, let's check whether we can do ‘unbounded’ iteration.

trait An extends Term {
  type ap[x <: Term] = x#ap[An]#eval
  type eval = An
}
// Infinite iteration: Smashes scalac's stack
type NNn = b[R]#ap[b[R]]#ap[An]
Equals[NNn#eval, c]


Well, not quite:

$ scalac SKI.scala
Exception in thread "main" java.lang.StackOverflowError
        at scala.tools.nsc.symtab.Types$SubstMap.apply(Types.scala:3165)
        at scala.tools.nsc.symtab.Types$SubstMap.apply(Types.scala:3136)
        at scala.tools.nsc.symtab.Types$TypeMap.mapOver(Types.scala:2735)



## [LOTD] Standards Diagram for Content Management

Following up on Jon Marks' post on standards relevant for content management, Justin Cormack has put together a "Standards Diagram for Content Management" Prezi landscape. Nice work!

The "structuring" part of Justin's presentation contains DocBook and DITA. Theresa compared these two standards a while ago:

## Declarative Services: Configuration

OSGi Declarative Services components are configured by properties defined by the component developer in the component descriptor and by configuration managed by system administrators using the OSGi Configuration Admin Service. The combined set of properties is traditionally made available to the component as a Dictionary obtained by calling the ComponentContext.getProperties() method. The component context is provided to the activate method, which is called when the component is activated.

This makes configuration of components very simple, since the component itself does not have to care where configuration comes from and how it is maintained. In addition, a component always knows it is starting from scratch when the activate method is called. This is because a component instance is never reused and a component reconfiguration means the component is deactivated and activated with the new configuration.

There are some drawbacks to this solution, though:

• Reactivation of a component may be an expensive operation, particularly if the component provides a service which is heavily used and relied upon, such as for example the SlingRepository service in Communiqué 5.
• If a component keeps internal state, e.g. gathering statistics, reactivation of the component will cause loss of this state.

This is where the new Declarative Services specification version 1.1 kicks in and improves things considerably. First of all, configuration may be updated dynamically without reactivating the component. Second, a component may declare itself as not needing configuration, or even as requiring configuration, thus only activating the component if configuration is actually available from the Configuration Admin service.

To use dynamic reconfiguration you have to declare a method that receives the updated configuration, named in the modified attribute of the component element. When using the Apache Felix Maven SCR Plugin, use the modified attribute of the @scr.component tag:

 @scr.component modified="modified"

and define a method taking the configuration, for example:

 private void modified(Map config) {
     // apply the configuration dynamically
 }

Unless the configuration causes the component's references to be modified, the component configuration is now provided dynamically without reactivating the component.

The method named in the modified attribute has the same requirements as the activate method: it may have any access modifier (though it should preferably not be public) and it may take any combination of ComponentContext, BundleContext, and Map arguments. Of course, for a modified method the primarily useful type is Map, as in the example above.
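To illustrate why this matters, here is a minimal sketch of the pattern (plain Java, not the real Declarative Services runtime; the component, its "threshold" property and the statistics state are invented for the example): because the modified method is called instead of deactivate/activate, internal state survives a configuration update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical component whose internal state (a request counter)
// survives reconfiguration; plain Maps stand in for the real
// ComponentContext properties.
class StatisticsComponent {
    private volatile int threshold;   // configurable property
    private long requestCount;        // internal state, kept across reconfiguration

    // like the DS activate method: called once with the initial configuration
    protected void activate(Map<String, Object> config) {
        modified(config);
    }

    // like the DS modified method: called on configuration updates,
    // without creating a new component instance
    protected void modified(Map<String, Object> config) {
        Object t = config.get("threshold");
        this.threshold = (t instanceof Integer) ? (Integer) t : 10;
    }

    public void handleRequest() { requestCount++; }
    public long getRequestCount() { return requestCount; }
    public int getThreshold() { return threshold; }

    public static void main(String[] args) {
        StatisticsComponent c = new StatisticsComponent();
        Map<String, Object> cfg = new HashMap<>();
        cfg.put("threshold", 5);
        c.activate(cfg);
        c.handleRequest();
        c.handleRequest();
        cfg.put("threshold", 20);
        c.modified(cfg);   // reconfigure: requestCount is preserved
        System.out.println(c.getRequestCount() + " " + c.getThreshold()); // prints "2 20"
    }
}
```

With the pre-1.1 behaviour the second configuration would have forced a new instance, and the counter would have restarted at zero.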

If you know your component cannot be configured or if you absolutely need configuration of your component, you can declare this desire using the configuration-policy attribute of the component element (or the policy attribute of the @scr.component JavaDoc tag):

optional
Configuration from the Configuration Admin Service is provided to the component if available. This is the default setting and is the same as in the previous Declarative Services specification.
require
Configuration from the Configuration Admin Service is required for the component to be activated. If the configuration is deleted, the component will be deactivated. This setting allows for a component to be controlled by the existence of configuration.
ignore
Configuration from the Configuration Admin Service is never retrieved on behalf of and provided to the component. If your component has no configurable properties, this setting makes sense.
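Putting both features together, a Declarative Services 1.1 component descriptor might look like the following sketch (the component name, implementation class and service interface are invented for illustration):

```xml
<scr:component xmlns:scr="http://www.osgi.org/xmlns/scr/v1.1.0"
    name="sample.statistics"
    modified="modified"
    configuration-policy="require">
  <implementation class="sample.StatisticsComponent"/>
  <service>
    <provide interface="sample.Statistics"/>
  </service>
</scr:component>
```

Note the 1.1 namespace: the modified and configuration-policy attributes are only recognized when the descriptor declares the v1.1.0 namespace.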

Note: To use the functionality described in this article, you have to use a Declarative Services 1.1 implementation such as Apache Felix SCR 1.2.0 or newer.

## Declarative Services: Delayed Components

The OSGi Declarative services specification defines three types of components:

1. Immediate Components are created immediately when the providing bundle is started and may or may not provide services.
2. Delayed Components provide services but are only created when used by a service consumer.
3. Factory Components are created on demand by calling the ComponentFactory.newInstance(Dictionary) method of the Component Factory service registered for the component.

This blog post is about a special behaviour of delayed components which may seem unexpected at first: delayed components are created (activated) on demand, when they are first requested, and are deleted (deactivated) as soon as there is no user any longer (Chapter 112.5.4, Delayed Component, in the Compendium Spec; see also FELIX-1825).

The intent of this behaviour is to reduce the system (or bundle) startup time in that delayed components are only instantiated when really used. In addition, memory consumption may be reduced at times when the service is not used. The drawback is that the service is repeatedly activated and deactivated if there is a single consumer which uses the service for short periods of time only. An example of such an often-used service is an OSGi EventHandler service, which is retrieved by the Event Admin service whenever an event must be delivered and released after the delivery.

How does a component become a delayed component?

By default a component providing a service is a delayed component unless it is explicitly declared as an immediate component. If you are using the Apache Felix Maven SCR plugin, a component is (by default) delayed if the @scr.service tag is used. Thus the above activation and deactivation rules apply. To turn a service component into an immediate component, you have to set the immediate attribute to true, as in:

 @scr.component immediate="true"

Shall I change all my components to be immediate components?

The short answer is: It depends.

Here are some general rules of thumb:

• If your service can be expected to be immediately used and not released until system shutdown, defining the component as delayed does not make much sense. In this case, it is probably better to explicitly define the component as immediate. An example of such a component is a Servlet service in Sling, which is immediately used by the Sling Servlet Registry.
• If your service is in fact a service factory (using the servicefactory attribute), you cannot declare the component immediate, because service factory components are always delayed.
• If your component is used a lot for short periods of time, you should probably define your service as an immediate component. An example of such a service is an OSGi EventHandler service.
• If you want to maintain state in your component and make that state available to clients by registering a service, the component should be defined as immediate. An example of such a service might be a statistics provider, which gathers statistics and provides them through its service API.
• If your component is only seldom used it would be best to define it as a delayed component. An example of such a component in Communiqué 5 might be a workflow step service, which implements very specific behaviour.

## mp3tagger on GitHub

On the mp3 tagger post I have received quite a bit of feedback and feature requests. Therefore, I thought it might be a good idea to do "social coding" and put the code on GitHub where it can easily be forked (and the forks can be watched).

Other than that, the latest version of the tagger contains these improvements:
• the Last.fm keys and secret are not stored in the code anymore, but entered on the first run and stored in ~/.mp3tagger.cfg
• you can run the script in two additional modes: simulation and ask. In simulation mode no changes to mp3s will be saved, in ask mode you will be asked to save each change. Start the script with flags "-m simulation" or "-m ask", respectively.
• It is now possible to specify a list of genre tags that will be considered (in addition to the default mp3 genre tags). The list needs to be stored in a config file at ~/.mp3tagger_genres.cfg (in the "generic" section of the file). The full format this file needs to have is shown below.
• The last improvement is a tricky one: after tagging all my mp3s I ended up with hundreds of albums tagged with genre Electronic or Indie. I wanted to refine these genres into sub-genres. This again works by putting a list of possible sub-genres into ~/.mp3tagger_genres.cfg and running the tagger with flag "-r genre", e.g. "-r Electronic". You would run this option when you find that you have too many albums of one genre and want to split them up.
So in summary my config file ~/.mp3tagger_genres.cfg looks like:

[generic]
genres=Shoegaze,Dubstep,Grime,Dub,Drum And Bass

[refinements]
Electronic=Idm,Turntableism,Techno,Minimal,Dub,Big Beat,Ambient,Breakbeat,House,Lounge,Electroclash,Drum And Bass,Chillout
Indie=Indie Rock,Indie Pop,Singer-Songwriter,Indie Pop,Shoegaze,Post-Rock,Americana,New Wave,Alt-Country
Reggae=Dancehall,Dub,Ska

## Identification of OSGi Bundles

Looking at the Bundle details in the Web Console you will notice a number of entries providing information about the bundles. In this article I will explain a bit more about the information regarding the identification of bundles.

Looking at the following screenshot you will notice a number of identification details of a bundle:

1. The bundle identification number. This is a number uniquely identifying the bundle. It is assigned to the bundle at installation time and never changes. These numbers are also unique within the framework: no bundle will ever be assigned the same number as another, even after the currently installed bundle is uninstalled.
2. The bundle name. This is a descriptive name of the bundle which is ignored by the OSGi framework. This name is provided by the bundle developer as the contents of the Bundle-Name manifest header.
3. The bundle symbolic name. This is a symbolic name of the bundle, which is used by the OSGi framework together with the bundle version number to identify the bundle. That is, no two bundles with the same symbolic name and version may be installed in a single OSGi framework. But multiple bundles may be installed which have the same symbolic name but differ in their version number. This symbolic name is provided by the bundle developer as the contents of the Bundle-SymbolicName manifest header.
4. The bundle version. The version of the bundle is used to convey the development state of the bundle as a whole. Together with the bundle symbolic name the version number must be unique within an OSGi framework (see the description of the bundle symbolic name). Note that the OSGi framework places only syntactic restrictions on version numbers. This version is provided by the bundle developer as the contents of the Bundle-Version manifest header.
5. The bundle location. The bundle location is basically just a string. It is recommended that this string follows the syntax of a URL, and it may even be used as a URL to update the bundle from: the Bundle.update() method uses the bundle location to try to access an updated bundle version. This is not used as such in Communiqué 5, though, where the bundle location merely indicates where the bundle was initially installed from. This value is provided by the administrator as the location parameter when installing the bundle through the BundleContext.installBundle(String location) or BundleContext.installBundle(String location, InputStream input) method. This value will not change when the bundle is updated.
6. The last modification time. The last modification time is set by the framework when the bundle is modified. A bundle is modified when it is installed, updated or uninstalled. Starting and stopping the bundle does not change the last modification time. This information can be used to verify that a bundle update has really been executed by the OSGi framework.

As can be seen, some identification details are set at installation time and never change afterwards, while others may change over time when a bundle is updated.

So here are the rules regarding these details:

• The bundle identification number is assigned by the framework and is never reused.
• The bundle location is assigned from the location parameter when the bundle is installed and never changes during the existence of the bundle. No two bundles will ever be installed at the same time with the same bundle location. As such this location is also a unique identifier for bundles. But in contrast to the bundle identification number, the location may be "reused". That is, once a bundle with a given location has been uninstalled another bundle may be installed with the same location.
• The bundle symbolic name and version together also uniquely identify an installed bundle. These values are taken from the bundle manifest and therefore may change when a bundle is updated. But it is not allowed for multiple bundles to be installed with the same symbolic name and version at the same time.

So, whenever you update a bundle, either placing it into an install folder in the repository or through the Web Console, expect the bundle version and/or symbolic name to change. But both the bundle identification number and the bundle location will not be modified. The bundle last modification time will reflect the time at which the bundle has actually been updated by the framework.
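As a toy illustration of these rules, here is a sketch of the bookkeeping described above (plain Java, not the real OSGi API; the registry class and location strings are invented): identification numbers are never recycled, while a location becomes available again once its bundle is uninstalled.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the framework's bundle bookkeeping: ids are never reused,
// locations are unique among *installed* bundles but may be reused later.
class ToyBundleRegistry {
    private long nextId = 0;
    private final Map<String, Long> installedByLocation = new HashMap<>();

    long install(String location) {
        if (installedByLocation.containsKey(location))
            throw new IllegalStateException("location already in use: " + location);
        long id = nextId++;            // fresh id, never recycled
        installedByLocation.put(location, id);
        return id;
    }

    void uninstall(String location) {
        installedByLocation.remove(location);
    }

    public static void main(String[] args) {
        ToyBundleRegistry fw = new ToyBundleRegistry();
        long first = fw.install("inputstream:my-bundle.jar");
        fw.uninstall("inputstream:my-bundle.jar");
        // same location may be reused, but a fresh id is handed out
        long second = fw.install("inputstream:my-bundle.jar");
        System.out.println(first + " " + second); // prints "0 1"
    }
}
```

Installing a second bundle at an already-occupied location fails in this model, mirroring the rule that no two installed bundles share a location.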

## Custom CQ5 workflow step that integrates Twitter and Jabber

As part of the IKS project each CMS vendor completes a couple of benchmarks in order to establish a baseline against which future semantic improvements can be measured. For benchmark 3 "Workflow Service" Bertrand and I chose to implement the task "Create a multi-channel (email, SMS, instant messaging, Twitter,...) notification service for workflow transitions". We have created an automated workflow step that can be inserted into a custom workflow and either send an e-mail, send a direct message on Twitter or send a chat message on GTalk/Jabber. The corresponding message's payload is the path to the content node in the workflow plus an optional custom text.

Below follows a description of how this functionality was implemented in CQ5. The complete code is attached to this post as a CQ5 package. I will outline some of the considerations and gotchas regarding this particular feature, but some issues apply to CQ5 development in general as well. The environment I used for development was CRXDE Lite (the web-based IDE available at /crxde of your CQ5 installation) and a beta version of the upcoming CQ5 release 5.3. It is probably helpful to install the package (see setup section below) and read the code alongside this post.

###### OSGi services

A good way to hook up external services like Twitter etc. is to create a custom OSGi service that exposes only the business functionality and hides the internal classes. Moreover, it is good practice to provide a Java interface and to separate the implementation of the service (allowing replacement of the implementation without affecting relying parties). The services will show up in the Sling configuration console at /system/console/configMgr. This allows the administrator to configure the service's private parameters at deployment time (in our case Twitter account credentials and Jabber user credentials). The configuration is consumed by the service like this:

/** @scr.property */
public static final String GTALK_USER = "gtalk.service.user";
/** @scr.property */

protected void activate(ComponentContext context) {
    Dictionary config = context.getProperties();
    user = (String) config.get(GTALK_USER);
}
###### 3rd party libraries

In order to use Twitter and Jabber I utilized the open source libraries Twitter4J and Smack, respectively. With CRXDE (Lite) it is very simple to include such 3rd party jars in a custom OSGi bundle: just drop them in the bundle's /libs folder. When building the bundle CRXDE will embed them. Compilation and deployment is done by executing "Build Bundle" (right-click on the .bnd file in the bundle root).

###### A note on 3rd party jar's dependencies

It might well be that the bundle compiles and deploys, but does not start. Check the OSGi console at /system/console/bundles to find out if your bundle's state is "Active" (good) or just "Installed" (not good). The latter happens e.g. when the embedded jar has dependencies on other jars that are not embedded. In such a case check the bundle's details page in the Sling console to find out which dependencies are missing and either add them to /libs as well or take them out of the OSGi imports. That is achieved by editing the .bnd file's import directive, e.g.

Import-Package: !com.sun.syndication.*, !dalvik.system, *

###### Workflow action

The last needed piece is a workflow step that can be added to a custom workflow. For that purpose one simply needs to create a class that implements the JavaProcessExt interface. The execute method will receive the workflow's payload; from there it is trivial to obtain the services described above and pass them the content. CQ Workflow Actions can be customized for each particular workflow they are used in. I use this feature to customize the accounts to which a message shall be sent (the custom format is explained in the setup section below). The customization string is passed to the execute method as well: comma-separated values will arrive as a String[] array.

###### Setting up the package

To get this up and running download the attached CQ5 package and install it through the package manager. In the Sling console configure the services com.day.cq.mailer.impl.MailerService, com.day.iks.service.impl.TwitterServiceImpl and com.day.iks.service.impl.GtalkServiceImpl. For Twitter and GTalk you need to supply the credentials of the (technical) user that shall send the DMs or chat messages, respectively. In the case of e-mail you need to configure your mail server.

Next, create a custom workflow in the CQ5 workflow section and add the workflow action (name). The configuration options are:

• for sending an e-mail: email,user@mydomain.com,some_message

• for sending a chat message on Gtalk: gtalk,user@gmail.com,some_message

The (optional) message will be appended with the content item's path.

Here is an example for GTalk:

In the cases of Twitter DM and GTalk make sure that the recipient has opted-in to receive messages from the technical user you have configured as a sender.

## JCR and Rails revisited

Pengchao Wang of ThoughtWorks has published his approach to using JCR as a backend for a JRuby-based Rails app. Interestingly, Pengchao uses unstructured nodes (at least, that is what I gather from reading the code), so he can take full advantage of the schema-free side of JCR and freely add properties to JCR-persisted Ruby objects. The code is on GitHub; have a look at jcr-rails-demo/lib/jcr/record_base.rb, from which the model classes inherit. If Rails and JCR are of interest to you, also have a look at what Ngoc Dao wrote on how to use JCR in Rails applications.

## [LOTD] The Skinny on JCR, CMIS and OSGi

Remember Jon Marks' overview picture of content technologies I mentioned the other day? Jon has just published his first column on CMSWire where he delivers a brilliant textual description of his diagram, spanning JCR, CMIS and OSGi.

Jon rightfully points out that CMIS is of very limited value for managing web content and explains that further in the comments section. It is at best tricky to store in a CMIS repository an HTML document that contains references to other items in the same repository, i.e. you will struggle to use CSS, images or hyperlinks.

## Differences between JCR 1 and JCR 2 on API level

My colleague Sebastian Hoogenberk has run JDiff over the JCR 1 and JCR 2 Javadocs. The results are useful to get a clear overview of the changes at the API level. Note that JDiff unfortunately seems to get confused with some methods and marks them both as "added" and "replaced".

## Open Innovation in Software means Open Source

I’m giving a talk today at the Open Source, Open Development, Open Innovation workshop in Oxford:

Open source software is more than just a licence, it is also a software development methodology that allows companies to share resources and collaborate on non-core parts of their software/service offering. When managed well, open development enables a reduction in cost, and an increase in innovation as a result of the convergence of the best minds in the problem space. In this presentation Bertrand Delacretaz will describe how Day Software has embraced open development by positioning itself as the leaders in both open standards and open source software. We will examine how Day’s active engagement with 25 open source projects and numerous standards groups has enabled the company to become a world leader in their market and in the open source projects they participate in.

The funny thing is that the above abstract was written by Ross Gardler while waiting for my own version of it – and it says exactly what I was trying to say, only better ;-)

The event is covered by a live blog, and you can ask questions there.

To put it simply, my conclusion is that quick feedback from users and customers is key to open innovation – and open source, if done right, provides lots of feedback, fast.

## Daily Shoot, week 3

Another week of @dailyshoot:

PS. Check out the updated dailyshoot.com web site.

## Top 10 Features in the Upcoming CQ 5.3 Release

Here is a short presentation with my personal top ten features in our upcoming release. Of course, this would ideally be accompanied by short, fast-paced demos, so if you are interested in getting a personalized demo or a video, please reach out to us.

## Daily Shoot, week 2

As I mentioned last week, I’ve been following @dailyshoot for a series of daily photo assignments. Here’s what I shot this week:

## Sling over HTTP

A few days ago I posted about Jackrabbit, and now it’s time to follow up with Sling as a means of accessing a content repository over HTTP. Apache Sling is a web framework based on JCR content repositories like Jackrabbit, and among other things it adds some pretty nice ways of accessing and manipulating content over HTTP.

The easiest way to get started with Sling is to download the “Sling Standalone Application” from the Sling downloads page. Unpack the distribution package and start the Sling application with “java -jar org.apache.sling.launchpad.app-5-incubator.jar”. Like Jackrabbit, Sling can by default be accessed at http://localhost:8080/. There’s a 15 minute tutorial that you can check out to learn more about Sling.

Since Sling comes with an embedded Jackrabbit repository, it also supports much of the WebDAV functionality covered in my previous post. Instead of rehashing those points, this post takes a look at the additional HTTP content access features in Sling.

CR1: Create a document

Like with Jackrabbit, all documents in Sling have a path that is used to identify and locate the document. Sling solves the problem of having to come up with the document name by supporting a virtual “star resource” that’ll automatically generate a unique name for a new document. Thus instead of having to think of a URL like “http://localhost:8080/hello” in advance, the new document can be created by simply posting to the star resource at “http://localhost:8080/*”.

The Sling POST servlet is a pretty versatile tool, and can be used to perform many content manipulation operations using normal HTTP POST requests and the application/x-www-form-urlencoded format used by normal HTML forms. With the POST servlet, the example document can be created like this:

$ curl --data 'title=Hello, World!' --data 'date=2009-11-17T12:00:00.000Z' \
    --data 'date@TypeHint=Date' --user admin:admin \
    http://localhost:8080/*

The 201 Created response will contain a Location header that points to the newly created document. In this case the returned URL is “http://localhost:8080/hello_world_” based on some document title heuristics included in Sling. If you run the command again you’ll get a different URL since the Sling star resource will automatically avoid overwriting existing content.

Pros:

• A single standard POST request is enough
• The HTML form format is used for the POST body
• Automatically generated clean and readable document URL

Cons:

• The star resource URL pattern is fixed and creates an unnecessarily tight binding between the client and the server

CR2: Read a document

Sling contains multiple ways of accessing the document content in different renderings. In fact much of the power of Sling comes from the extensive support for rendering underlying content in various different and easily customizable ways. Unfortunately at least the latest 5-incubator version of the Sling Application doesn’t support any reasonable default rendering at the previously returned document URL. The client needs to explicitly know to add a “.json” or “.xml” suffix to the document URL to get a JSON or XML rendering of the document.

$ curl http://localhost:8080/hello_world_.json
{
"title":           "Hello, World!",
"date":            "Tue Nov 17 2009 12:00:00 GMT+0100",
"jcr:primaryType": "nt:unstructured"
}
$ curl http://localhost:8080/hello_world_.xml
<?xml version="1.0" encoding="UTF-8"?>
<hello_world_ xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:fn_old="http://www.w3.org/2004/10/xpath-functions"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:jcr="http://www.jcp.org/jcr/1.0"
    xmlns:mix="http://www.jcp.org/jcr/mix/1.0"
    xmlns:sv="http://www.jcp.org/jcr/sv/1.0"
    xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
    xmlns:rep="internal"
    xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="nt:unstructured"
    date="2009-11-17T12:00:00.000+01:00"
    title="Hello, World!"/>

The JCR document view format is used for the XML rendering.

Pros:

• A single GET request is enough
• Both the JSON and XML formats are easy to consume

Cons:

• Simply GETting the document URL doesn’t return anything useful
• The “.json” and “.xml” URL patterns create an unnecessary binding between the client and the server
• Neither rendering contains property type information
• The XML rendering contains unnecessary namespace declarations

CR3: Update a document

The Sling POST servlet also supports document updates, so we can just POST the updated properties to the document URL:

$ curl --data 'history=Document date updated' \
    --data 'date=2009-11-18T12:00:00.000Z' \
    http://localhost:8080/hello_world_

Pros:

• A single standard POST request is enough
• The HTML form format is used for the POST body

Cons:

• None.

CR4: Delete a document

You can either use the special “:operation=delete” feature of the Sling POST servlet or a standard DELETE request to delete a document:

$ curl --data ':operation=delete' --user admin:admin \
    http://localhost:8080/hello_world_

$ curl --request DELETE --user admin:admin \
    http://localhost:8080/hello_world_

Pros:

• A standard DELETE or POST request is all that’s needed

Cons:

• None.

## Increasing Interest in JCR and Apache Sling

I’ve been attending tech conferences in the Java space for several years now; for the last few years I’ve been trying to push JCR, Apache Jackrabbit, and Apache Sling at various occasions – mostly through talks. It seems to me that there is a change, an increase, in interest for these topics.
While two years ago people were not really that interested in JCR and usually asked questions along the lines of “why should I use this? I have a database, it works fine” etc., this has definitely changed now. There are more and more people interested in alternatives to a POD (plain old database). It seems to me that the pain with traditional dbs has become too much and they’re searching for NoSQL solutions. Don’t get me wrong, JCR is not the golden hammer for data storage – there are valid use cases for PODs, but there are also many use cases where an alternative like JCR is much better suited.
Today people ask questions like “I looked at JCR, I have this problem, and I think I could do it this way. What do you think?” and of course variations on this theme.
I don’t think that this is motivated by the NoSQL hype; I rather think that this is a parallel movement which has the same origin. In addition, these people seem to know a lot about Apache Jackrabbit and usually ask in-depth questions. As I’m not working on Jackrabbit itself – I’m a user of Jackrabbit – these questions usually give me a hard time.
While I have a hard time with Jackrabbit questions, people often seem to have a hard time understanding the need for Apache Sling. For one, this might be because they are still living in their POD-driven world and know how to handle applications based on databases. Building applications on top of NoSQL solutions is a slightly different thing. When it comes to Apache Sling, which is a web framework, it is immediately compared against web application frameworks with all the nice UI features, widget libraries and whatnot. And as Sling does not provide a UI library it is often immediately discarded.
But on the bright side, as soon as people realize that they need something like JCR, and now need a way to get all this content stored in the repository out to the users in a nice, elegant and flexible way, they also realize that Apache Sling helps a lot.
So my hope is that the adoption of JCR is increasing (and I think that is already the case) and that this drives the adoption of Apache Sling as well.
And of course we are not daydreaming in Apache Sling – we will continue its development and add missing pieces or provide bridges etc.

## [LOTD] CMIS, JCR and OSGi for Idiots

Drinking with David has inspired Jon Marks (aka @McBoof) to produce a brilliant drawing of the landscape of content technologies. Beer :)

## Jackrabbit over HTTP

Last week I posted a simple set of operations that a “RESTful content repository” should support over HTTP. Here’s a quick look at how Apache Jackrabbit meets this challenge.

To get started I first downloaded the standalone jar file from the Jackrabbit downloads page, and started it with “java -jar jackrabbit-standalone-1.6.0.jar”. This is a quick and easy way to get a Jackrabbit repository up and running. Just point your browser to http://localhost:8080/ to check that the repository is there.

Jackrabbit comes with a built-in advanced WebDAV feature that gives you pretty good control over your content. The root URL for the default workspace is http://localhost:8080/server/default/jcr:root/ and by default Jackrabbit grants full write access if you specify any username and password.

Note that Jackrabbit also has another, filesystem-oriented WebDAV feature that you can access at http://localhost:8080/repository/default/. This entry point is great for dealing with simple things like normal files and folders, but for more fine-grained content you’ll want to use the advanced WebDAV feature as outlined below.

CR1: Create a document

All documents (nodes) in Jackrabbit have a pathname just like files in a normal file system. Thus to create a new document, we first need to come up with a name and a location for it. Let’s call the example document “hello” and place it at the root of the default workspace, so we can later address it at the path “/hello”. The related WebDAV URL is http://localhost:8080/server/default/jcr:root/hello/.

You can use the MKCOL method to create a new node in Jackrabbit. An MKCOL request without a body will create a new empty node, but you can specify the initial contents of the node by including a snippet of JCR system view XML that describes your content. In our case we want to specify the “message” and “date” properties. Note that JCR does not support date-only properties, so we need to store the date value as a more accurate timestamp.
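Assembling system view XML by hand quickly gets tedious. Here is a minimal sketch of generating such a body with Python's standard library (the `sysview_node` helper and its property encoding are my own illustration, not part of Jackrabbit):

```python
import xml.etree.ElementTree as ET

SV_NS = "http://www.jcp.org/jcr/sv/1.0"

def sysview_node(name, properties):
    """Build a JCR system view XML snippet usable as an MKCOL body.

    properties maps a property name to a (JCR type, value) pair.
    """
    ET.register_namespace("sv", SV_NS)
    node = ET.Element("{%s}node" % SV_NS, {"{%s}name" % SV_NS: name})
    for prop_name, (prop_type, value) in properties.items():
        prop = ET.SubElement(node, "{%s}property" % SV_NS,
                             {"{%s}name" % SV_NS: prop_name,
                              "{%s}type" % SV_NS: prop_type})
        ET.SubElement(prop, "{%s}value" % SV_NS).text = value
    return ET.tostring(node, encoding="unicode")

body = sysview_node("hello", {
    "message": ("String", "Hello, World!"),
    "date": ("Date", "2009-11-17T12:00:00.000Z"),
})
print(body)
```

The resulting string can then be piped to curl as the MKCOL request body.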

The full request looks like this:

$ curl --request MKCOL --data @- --user name:pass \
  http://localhost:8080/server/default/jcr:root/hello/ <<END
<sv:node sv:name="hello" xmlns:sv="http://www.jcp.org/jcr/sv/1.0">
  <sv:property sv:name="message" sv:type="String">
    <sv:value>Hello, World!</sv:value>
  </sv:property>
  <sv:property sv:name="date" sv:type="Date">
    <sv:value>2009-11-17T12:00:00.000Z</sv:value>
  </sv:property>
</sv:node>
END

The resulting document is available at the URL we already constructed above, i.e. http://localhost:8080/server/default/jcr:root/hello/.

Pros:

• A single standard WebDAV MKCOL request is enough
• The standard JCR system view XML format is used for the MKCOL body
• The XML format is easy to produce

Cons:

• We need to decide the name and location of the document before it can be created
• The name of the document is duplicated, once in the URL and once in the sv:name attribute
• The date property must be specified down to the millisecond
• While standardized, the MKCOL method is not as well known as PUT or POST
• While standardized, the JCR system view format is not as well known as JSON, Atom or generic XML
• The system view XML format is quite verbose

CR2: Read a document

Now that the document is created, we can read it with a standard GET request:

$ curl --user name:pass http://localhost:8080/server/default/jcr:root/hello/
<?xml version="1.0" encoding="UTF-8"?>
<sv:node sv:name="hello"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:fn_old="http://www.w3.org/2004/10/xpath-functions"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:jcr="http://www.jcp.org/jcr/1.0"
xmlns:mix="http://www.jcp.org/jcr/mix/1.0"
xmlns:sv="http://www.jcp.org/jcr/sv/1.0"
xmlns:rep="internal"
xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
<sv:property sv:name="jcr:primaryType" sv:type="Name">
<sv:value>nt:unstructured</sv:value>
</sv:property>
<sv:property sv:name="date" sv:type="Date">
<sv:value>2009-11-17T12:00:00.000Z</sv:value>
</sv:property>
<sv:property sv:name="message" sv:type="String">
<sv:value>Hello, World!</sv:value>
</sv:property>
</sv:node>

Note that the result includes the standard jcr:primaryType property that is present on all JCR nodes. Also, all namespaces registered in the repository are included, even though strictly speaking they add little value to the response.

Pros:

• A single GET request is enough
• The XML format is easy to consume

Cons:

• The system view format is a bit verbose and generally not that well known

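Because the response is plain system view XML, consuming it from client code is straightforward. A minimal sketch with Python's standard library (the `parse_sysview` helper is illustrative, not a Jackrabbit API):

```python
import xml.etree.ElementTree as ET

SV = "{http://www.jcp.org/jcr/sv/1.0}"

def parse_sysview(xml_text):
    """Extract property name -> value(s) from a JCR system view node.

    Multi-valued properties come back as a list, single values as a string.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall(SV + "property"):
        values = [v.text for v in prop.findall(SV + "value")]
        name = prop.get(SV + "name")
        props[name] = values[0] if len(values) == 1 else values
    return props

example = """<sv:node sv:name="hello" xmlns:sv="http://www.jcp.org/jcr/sv/1.0">
  <sv:property sv:name="message" sv:type="String">
    <sv:value>Hello, World!</sv:value>
  </sv:property>
</sv:node>"""

print(parse_sysview(example))  # {'message': 'Hello, World!'}
```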
CR3: Update a document

The WebDAV feature in Jackrabbit does not support setting multiple properties in a single request, so we need to use separate requests for each property change. The easiest way to update a property is to PUT the new value to the property URL. The only tricky part is that unless the node type explicitly says otherwise the new value is by default stored as a binary stream. You need to specify a custom jcr-value/… content type to override that default.

$ curl --request PUT --header "Content-Type: jcr-value/date" \
  --data "2009-11-18T12:00:00.000Z" --user name:pass \
  http://localhost:8080/server/default/jcr:root/hello/date

$ curl --request PUT --header "Content-Type: jcr-value/string" \
  --data "Document date updated" --user name:pass \
  http://localhost:8080/server/default/jcr:root/hello/history

GETting the document after these changes will give you the updated property values.

Pros:

• Standard PUT requests are used
• No XML or other wrapper format needed, just send the raw value as the request body

Cons:

• More than one request needed
• Need to use non-standard jcr-value/… media types for non-binary values

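The jcr-value/… convention is easy to wrap in a small helper. A sketch covering just the two media types used above (the helper itself is my own illustration, not a Jackrabbit API):

```python
import datetime

def jcr_value(value):
    """Map a Python value to a (media type, payload) pair for the
    jcr-value/... PUT convention shown above. Only the date and string
    types from the examples are handled here."""
    if isinstance(value, datetime.datetime):
        # JCR dates need millisecond precision (see CR1's cons list)
        return "jcr-value/date", value.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return "jcr-value/string", str(value)

print(jcr_value("Document date updated"))
# ('jcr-value/string', 'Document date updated')
print(jcr_value(datetime.datetime(2009, 11, 18, 12, 0)))
# ('jcr-value/date', '2009-11-18T12:00:00.000Z')
```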
CR4: Delete a document

Deleting a document is easy with the DELETE method:

$ curl --request DELETE --user name:pass \
  http://localhost:8080/server/default/jcr:root/hello/

That’s it. Trying to GET the document after it’s been deleted gives a 404 response, just as expected.

Pros:

• A standard DELETE request is all that’s needed

Cons:

• None.

## The perfect monitoring

I have often thought about what a perfect monitoring solution should look like; these requirements come up:

1) Consistent: when the monitoring indicates a problem, there really is a problem with the application.

2) Reliable: when there is a problem that hinders users from properly working with the application, the monitoring will indicate this before the support line is overwhelmed by calls reporting that your application has a problem.

3) Informative: when the monitoring indicates a problem, there should be a recommendation for how this problem can be fixed with minimal impact. This recommendation can be documented either offline (operations guidelines) or in the monitoring itself.

4) Proactive: a monitoring resource should detect problems before they happen. Sounds strange, but many problems (excessive use of memory, for example) can be reported before they actually have a big impact.

Requirement 2) is a hard one; essentially it requires that all problem situations are known in advance and that the presence of such a problem situation can be indicated by the monitoring resource. In fact most situations are not known to anybody until they happen and are fully analyzed and understood.

Requirement 1) can be fulfilled very easily: never report a problem. From a purely academic point of view the requirement is then fulfilled, but it is not really usable. A more usable approach is to report only when it's 99.99% clear that no user can work anymore (e.g. on a CQ authoring system, when a vital OSGi service isn't available any more). But the more subtle problems cannot be caught that way.

Requirement 3) requires a certain amount of experience with the application and the will to write good documentation that is kept up to date.
And 4) requires knowledge about typical problems and the early signs of problems.

All these requirements often call for an extension to the application that provides an interface to the monitoring, from which the monitoring can fetch data and decide what to do with it. But monitoring is the poor child of application development; it is neither a functional requirement nor a non-functional requirement with the importance of usability, performance or availability (ah, by the way: how should we measure that? No one cares, as long as you guarantee 99.9% ...), but only a requirement of the operations team – no one spends time or money on it until operations ask for it. "Oops, we already spent our budget on other things." Only a few operations teams have the management standing to refuse to run such applications on their machines; in most cases they are simply overruled by management. The operations team then usually comes up with some thoughts and tries to fill the gap themselves, but most of the time that does not work very well. Requirement 2) in particular is then often violated, and 3) and 4) are not implemented at all. But operations can show some green bulbs in the monitoring system to the management.

So, as a last requirement (which should really be the very first one):

5) There should be proper monitoring at all. A monitoring solution that only watches for a running process cannot detect that the process has internally deadlocked and isn't working anymore.

Finally, when you need to implement a complex application, make sure that some of its internal state can be exposed to an external monitoring solution, which helps to operate your application. Treat this as a normal "must-have" feature with specification, implementation and tests. You will make your operations team really happy. If you do not do that, your monitoring system is actually your users, who complain about non-working functionality, which must then be fixed by your operations team.
And that brings costs (service calls) and negative management attention – nothing one wants to have.

## Daily Shoot, week 1

A week ago James Duncan Davidson and Mike Clark launched @dailyshoot, a Twitter feed that posts daily photo assignments. The idea is to encourage people who want to learn photography to practice it every day with the help of a simple assignment that fits in a single tweet.

I’m following Duncan’s blog, so I found out about Daily Shoot the day it was launched. So far I’ve completed all the assignments and I’ve already learned quite a bit doing so.

It’s very interesting to see how other people interpret the same assignments. I avoid looking at other responses before completing an assignment so that I don’t end up just copying someone else’s approach. Once I’m done I look at what others have done for some nice insight into what I could have done differently. The process is quite educational.

Here’s what I’ve shot this week. You can click on the pictures for more background on each assignment and how I approached it.

For more information on Daily Shoot, see the recently launched website.

## Ignite slides

RT @joannekh: Day Ignite presentations now available at www.day.com/ignite

## What is a content repository

Joint post of Henri Bergius and Michael Marth, cross-posted here and here.

Web content repositories are more than just plain old relational databases. In fact, the requirements that arise when managing web content have led to a class of content repository implementations that are comparable on a conceptual level. During the IKS community workshop in Rome we got together to compare JCR (the Jackrabbit implementation) and Midgard's content repository. While in some cases the terminology might be different, many of the underlying ideas are identical. So we came up with a list of common traits and features of our content repositories. For comparison, there is also Apache CouchDB.
So, why use a content repository for your application instead of the old familiar RDBMS? Repositories provide several advantages:

• Common rules for data access mean that multiple applications can work with the same content without breaking the consistency of the data
• Signals about changes let applications know when another application using the repository modifies something, enabling collaborative data management between apps
• Objects instead of SQL mean that developers can deal with data using APIs more compatible with the rest of their desktop programming environment, and without having to fear issues like SQL injection
• The data model is scriptable when you use a content repository, meaning that users can easily write Python or PHP scripts to perform batch operations on their data without having to learn your storage format
• Synchronization and sharing features can be implemented on the content repository level, meaning that you gain these features without having to worry about them

| Feature | JCR / Jackrabbit | Midgard | CouchDB |
| --- | --- | --- | --- |
| content type system | Structured or unstructured nodes are supported and can be mixed at will in a content tree. | Content types are defined in MgdSchema types. All content must be stored in an MgdSchema type, but types can be extended on the content instance level using the "parameter" triplets. | Type-free |
| type hierarchy | Structured node types support inheritance of types; additional cross-cutting aspects can be added with "mixins". Node types can define allowed node types for child nodes in the content hierarchy. | MgdSchemas allow inheritance, and an extended type can be instantiated either using the extended type or the base type. | Type-free |
| IDs | Nodes with the mixin "referenceable" have a UUID. In practice the node path is often used to reference nodes. | Every object has a GUID used for referencing. Objects located in trees that have a "name" property can also be referred to using the path. | All objects can be accessed via a UUID |
| references | Nodes can reference each other with a hard link (special property type) or a soft link (by referring to the node path). | MgdSchema types can have properties linking to other objects of the same or a different type. A link of "parentfield" type places an MgdSchema type in a tree. | No reference support built in |
| content hierarchy | All content is hierarchical / in a tree | Content can exist in a tree, or independently of it, depending on the MgdSchema type definition | Flat structure |
| interesting property types | Multi-valued (like an array), binary properties (e.g. for files); nodes have an implicit sort order | Binary properties stored using the Midgard attachment system | Support for binary properties |
| transactions | Multiple content modifications are written in transactions. | Transactions can be used optionally. | – |
| events | JCR observers can register for content changes on different paths and/or for different node types and/or CRUD operations, and receive notifications of changes as serialized nodes | All transactions cause both process-internal GObject signals and interprocess DBus signals | Support for one external event notification shell script |
| workspaces | Workspaces provide separate root trees. | No workspace support in Midgard 9.03, coming in the next version | Multiple databases within one CouchDB instance |
| import and export | Nodes or parts of the repository (or the whole repo) can be imported or exported in XML. Two formats: docview for a human-friendly representation, sysview including all technical aspects | Objects can be exported and imported in XML format. There are tools supporting replication via HTTP, tarballs and XMPP. | JSON serialization is the standard way of accessing the repository. The CouchDB replication protocol supports full synchronization between instances. |
| versioning | Checkin/checkout model to create new versions of nodes, optionally versions complete sub-trees, supports branching of versions | No versioning | All versions of content are stored and accessible separately, no branching |
| locking | Nodes can be locked and unlocked | Objects can be locked and unlocked | – |
| object mapping | Not in the standard, but implemented in Jackrabbit. Rarely used in practice. | Object mapping is the standard way of accessing the repository | All content is accessed via JSON objects |
| queries | In JCR 1 SQL or XPath, in JCR 2 also QueryBuilder | Query Builder | JavaScript map/reduce |
| access control | Done on the repository level, i.e. all access control is independent of the application. In Jackrabbit: pluggable authentication/authorization handlers. | No access control in the Midgard repository, usually implemented on the application level. Midgard provides a user authentication API. | No access control |
| persistence | In Jackrabbit different persistence managers can be plugged in (RDBMS, tar file, ...) | libgda allows storage in different RDBMSs like MySQL, SQLite and Postgres | CouchDB has its own storage architecture |
| deployment | Jackrabbit: library (jar), JEE resource, OSGi bundle or standalone server | Library | Erlang-based daemon |
| APIs | Standard: Java-based, PHP coming up. In Jackrabbit: also WebDAV and an HTTP-based API | C, Objective-C, PHP, Python | HTTP + JSON |
| full-text search | Included in the repository. In Jackrabbit: Lucene bundled | No (Solr used on the application level) | Plugin for using Lucene, not installed by default |
| standard metadata | All nodes have access rights, jcr:primaryType and jcr:mixinTypes properties. JCR 2.0 standardizes a set of optional metadata properties. | All objects have a set of standard metadata including creator, revisor, timestamps etc. | No standard properties |

## Content Repository over HTTP

Two weeks ago during the BarCamp at ApacheCon US I chaired a short session titled “The RESTful Content Repository”.
The idea of the session was to discuss the various ways existing content repositories support RESTful access over HTTP, and perhaps to find some common ground from which a generic content repository protocol could be formulated. The REST architectural style was generally accepted as a useful set of constraints for the architecture of distributed content-based applications, but as an architectural style it doesn’t define what the bits on the wire should look like. This is what we set out to define, with the HTTP protocol as a baseline. We didn’t get too far, but see below for some collected thoughts and a useful set of “test cases” that I hope to use to further investigate this idea.

###### Existing solutions

Many existing content repositories and related products already support one or more HTTP-based access patterns:

• Apache Jackrabbit exposes two slightly different WebDAV-based access points.
• Apache Sling adds the SlingPostServlet and default JSON and XML renderings of content.
• Apache CouchDB uses JSON over HTTP as the primary access protocol.
• Apache Solr uses XML over HTTP.
• Midgard doesn’t have a built-in HTTP binding for content, but makes it very easy to implement such bindings.

This list just scratches the surface… There are even existing generic protocols that match at least parts of what we wanted to achieve. WebDAV has been around for ten years already, but the way it extends HTTP with extra methods makes it harder to use with existing HTTP clients and libraries. The AtomPub protocol solves that issue, but being based on the Atom format and leaving much of the server behaviour undefined, AtomPub may not be the best solution for generic content repositories.

###### Content repository operations over HTTP

To better understand the needs and capabilities of existing solutions, we should come up with a simple set of content operations and find out if and how different systems support those operations over HTTP. The most basic such set of operations is CRUD, i.e. how to create, read, update, and delete a document, so let’s start with that. I’m giving each operation a key (CRn, as in “Content Repository operation N”) and a brief description of what’s expected. In later posts I hope to explore how these operations can be implemented with curl or some other simple HTTP client accessing various kinds of content repositories. I’m also planning to extend the set of required operations to cover features like search, linking, versioning, transactions, etc.

CR1: Create a document

Documents with simple properties like strings and dates are basic building blocks of all content applications. How can I create a new document with the following properties?

• title = “Hello, World!” (string)
• date = 2009-11-17 (date)

At the end of this operation I should have a URL that I can use to access the created document.

CR2: Read a document

Given the URL of a document (see CR1), how do I read the properties of that document? The retrieved property values should match the values given when the document was created.

CR3: Update a document

Given the URL of a document (see CR1), how do I update the properties of that document? For example, I want to update the existing date property and add a new string property:

• date = 2009-11-18 (date)
• history = “Document date updated” (string)

When the document is read (see CR2) after this update, the retrieved information should contain the original title and the above updated date and history values.

CR4: Delete a document

Given the URL of a document (see CR1), how do I delete that document? Once deleted, it should no longer be possible to read (see CR2) or update (see CR3) the document.

## [LOTD] IKS in the press

French IT mag LeMagIT has published an article about the IKS project, including quotations from Bertrand Delacretaz.
Bertrand emphasizes the need for concrete results: pour décoller, les technologies sémantiques ont besoin de cas d'utilisateur concrets ("to take off, semantic technologies need concrete use cases").

In the comments section Bertrand mentions his tag line for semantic technologies, which I can very well relate to: La sémantique "sous le capot" oui, la sémantique "dans la figure", non. This roughly translates as: "semantics under the hood yes, semantics in your face, no".

In Computerworld UK, open source blogger Glyn Moody has described his first-hand impressions from the IKS workshop in Rome. He comes to a similar conclusion: Paradoxically, semantic search will only ever really take off once it has receded so far into the fabric of computing that people aren't even aware it's there.

## What Makes Apache Tick?

Looking at the diversity of Apache Software Foundation communities, one can see a recipe for failure: people from different cultural backgrounds, different mother tongues, different employers, different timezones… all working together to create some of the best software on the planet? You must be kidding. How can this very loose collage of disparate people pump out dozens of high-quality releases every year, often working better than more structured corporate teams?

This “mystery” has been on my mind for a while, and I have identified four drivers that influence the way we use collaboration tools and that play a major part in our success.

The first driver is a common vision amongst project members. The Biblical saying, “Without a vision, my people perish”, is quite valid for our projects. Both using a central development mailing list for each project and spending time to collectively define our project’s charter help us foster this common vision amongst project members. Every member should have the same answer to the “what are our goals?” question, so it’s important to get them to talk in a central place, where they all get the same information, as opposed to undocumented, one-to-one discussions.
Secondly, providing real-time status updates to project members is key in helping them stay on track. At Apache, this is implemented by the many events generated by our collaboration tools: commit events to indicate code changes, issue tracker events to provide updates about the status of bugs and new features, success/failure events from continuous build systems, and standardized ways of announcing releases so that other projects are informed. Project members subscribe freely to as few or as many event channels as they want, so as to stay on top of things in near real-time, and without having to actively ask others about what happened. Status meetings? No need for those, as the information is flowing all the time.

The third success driver lies in enabling real-time help requests. In an immediate crisis of the “we need to deliver this by tomorrow” type, especially when working with a big team, you need to be able to ask for help without necessarily 1) knowing who specifically will help, and 2) bothering others with direct person-to-person requests, especially if they work in a different timezone. The key here is using issue trackers, where one web page stores key data and parts of the dialog that leads to resolving an issue. Posting an issue on the tracker, with sufficiently detailed instructions about how to reproduce the problem, along with attributes such as severity level, affected modules, etc., is the best way to expose a problem to the group quickly and with precision. Using an issue tracker also allows you to quickly and efficiently change priorities as well as re-assign issues and tasks – key elements that make all the difference in a crisis.

Finally, having searchable archives of this information allows new project members, or those returning after a period of absence, to learn what transpired and why things have been done in a certain way.
Without self-service archives, new participants would have to talk to everybody to find out about the project’s history, past decisions, conventions, etc., which is neither efficient nor scalable. Most of our archives are built automatically as project activities progress: mailing lists are archived, source code control history is kept forever, and issue trackers record the full history of the project’s micro-decisions.

Combined with Apache’s principles of meritocracy and consensus-based decision making, these four collaboration drivers allow our project teams to work very efficiently – in many cases even more so than structured teams that do not establish those central hubs of information exchange.

Does your project team foster a common vision and provide tools for real-time status updates, real-time help requests and self-service archives to its members? If yes, congratulations: you’re on a good track to becoming as effective as an Apache project!

Many thanks to Sally Khudairi for reviewing and copy editing. ApacheCon US 2009 picture by Ted Leung / Creative Commons License (CC BY-NC-SA 2.0). See also my Open Source Tools are Good For You presentation, which discusses the tools that Apache projects use to implement this.

## What does Apache provide that other code repositories don’t?

People thinking about creating an open source project might rightly consider hosting on one of the various hosting services available: Google Code, SourceForge, Kenai, Bitbucket and GitHub come to mind. Quick and easy: create a repository or request some resources and you’re in business.

Incubating a project at the Apache Software Foundation (ASF) takes a lot more effort than just requesting hosting space on one of those services, so why would you do that? One can perfectly well host code on one of those services under an Apache License, so what’s the difference?

I think the big difference lies in the governance model, and in fact calling the ASF just a code repository is very wrong.
Let’s discuss some key elements of that.

The Apache voting process has been tried and tested since 1999, or even earlier. This is one of the things that projects coming through the Incubator have to learn, led by their mentors. Learning is usually very easy, as people quickly see the benefits of those simple no-nonsense rules.

The ASF also provides a well-defined structure for managing projects, and the foundation as a whole, in a fair and consensus-driven way. One could argue that structure gets in the way, and sometimes it does, but when things go wrong, having a well-defined way of getting back on track helps tremendously. And this structure leaves a lot of freedom to the project’s management committee (PMC); there’s a lot of room for adapting a project’s way of working to its community and goals.

Creating an Apache project is certainly not required for all open source projects (and the foundation couldn’t scale to thousands of projects right now anyway), but for the critical infrastructure parts of one’s business (what’s sometimes called “open core”), having an established governance model makes all the difference.

The governance model is just one of the benefits that Apache projects get – there’s also the visibility, brand recognition, nice build services and other tools, and, last but not least, the many friends that you make along the way! As everybody now knows, there are no jerks at Apache!

## Running the iTunes genre tagger script with OS X Automator

Due to public demand, here's a little recipe for running the last post's MP3 tagger without using the command line on OS X:

• Open Automator
• Start a new "Application" project
• Drag the "Run Shell Script" action into the right workflow panel, set the "pass input" drop-down to "as arguments" and edit the script to (see screenshot below):

```shell
for f in "$@"
do
  /opt/local/bin/python /Users/michaelmarth/Development/Code/mp3tagger/tag_groupings.py -d "$f"
done
```

• Save the application and happily start dropping mp3 folders onto the application's icon.

## Back from W-JAX 09

This year's W-JAX in Munich has been (again) a great success. The conference area was crowded to the maximum capacity of the hotel, I guess. Around 150 talks, and various special days covering topics like persistence, OSGi, Scala, and the never-dying SOA. My two talks about JCR and Apache Sling were well attended, some interesting questions came up, and I could spread interest in these cool technologies. Now looking forward to JAX 2010 :)

## Update: The IKS semantic engine - a pragmatist's view

Update to "The IKS semantic engine - a pragmatist's view": here are the slides:

The presentation went well, and will hopefully lead to a sprint to actually implement something along these lines. The two demos that used UIMA at the workshop made me think that UIMA should be part of that picture, at least as a plugin for semantic lifting. And I did the presentation in less than 8 minutes out of the 10 that were allocated. Bonus points?

## IKS Search Benchmark

CQ5 search comes with some improvements over JCR's search capabilities, e.g. adapting result rankings to what users choose, or faceted search. Within the IKS project, Bertrand and I have experimented with another possibility: link-based ranking, i.e. adjusting search results based on the content of link tags. For example: if page A links to page B with the link text "lorem ipsum", then page B should get a higher ranking when a user searches for "lorem ipsum". This is essentially what Google does, but we wanted to apply it to internal links (within the same site) only.

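As a toy illustration of the idea (the scoring formula and weights here are entirely made up; the real implementation relies on the Jackrabbit indexer):

```python
def link_boosted_score(base_score, query, incoming_link_texts, boost=0.5):
    """Toy illustration of link-based ranking: raise a page's score for
    each incoming internal link whose text matches the query."""
    matches = sum(1 for text in incoming_link_texts
                  if query.lower() in text.lower())
    return base_score + boost * matches

# page B has two incoming links whose text matches "lorem ipsum"
print(link_boosted_score(1.0, "lorem ipsum",
                         ["lorem ipsum", "Lorem Ipsum", "other"]))  # 2.0
```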
To give the results away right away: for many web sites the results will probably not improve dramatically, because there are not enough internal links. However, it might help for some projects, so our implementation approach is described below in case you want to give it a try in your own project.

In order to extract links from a node we opted for parsing the complete rendered HTML presentation of a node rather than looking only at the rich text properties of the node. That way we could also catch programmatically generated links from templates. So we ended up setting up a little spider on the publish server that retrieves the HTML representations of all pages. The spider is deployed as an OSGi bundle within the server, so it gets the locations of all pages from an internal repository query. For each page the HTML is retrieved and parsed. The found links are stored as child nodes below the page that is linked to. In the example from above: if page A links to page B with the link text "lorem ipsum", then page B gets a child node with properties source=A and text="lorem ipsum". Implemented this way, we could basically use the Jackrabbit indexer without further changes.

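The link-extraction step performed by the spider can be sketched with Python's standard library HTML parser (our actual implementation is the Java program attached to this post; this is just an illustration of the idea):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from rendered HTML, roughly as
    the spider described above does (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor()
parser.feed('<p>See <a href="/content/b.html">lorem ipsum</a></p>')
print(parser.links)  # [('/content/b.html', 'lorem ipsum')]
```

Each extracted pair would then be stored as a child node (source, text) below the linked-to page.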
We have also implemented a JCR observer that catches changes to pages and fixes the corresponding links. Template updates are not caught yet.

The sources are attached to this post. The Java program can be used as a standalone application or deployed as an OSGi bundle. The standalone program takes a couple of optional arguments for running a full upfront spidering, deleting all found link nodes etc. In case you want to give it a try please be aware:

• The standalone program requires RMI to be enabled on the repository which is not the case by default (in the code port 1235 is used).

• The searches must take into account the new properties of the link nodes. One possibility is to re-configure the Jackrabbit indexing, which in CQ5 is done in the crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/indexing_config.xml file, by adding:

<index-rule nodeType="nt:unstructured">
  <!-- illustrative: the property name and boost value are assumptions -->
  <property boost="5">text</property>
</index-rule>


The boost factor in this configuration can be adjusted to give links a proper weight relative to the other properties of a node.
For reindexing, delete these directories:
crx-quickstart/repository/repository/index
crx-quickstart/repository/workspaces/crx.default/index
crx-quickstart/repository/workspaces/crx.system/index

###### Results

We tested the approach on the content of our corporate website (a rather small content corpus). Overall, the search results improved slightly, but not much (although we did not spend a lot of time tweaking the boost factor). As stated above, I believe that corporate websites in general will not benefit much from link-based ranking, as the majority of their links simply reflect the navigation (i.e. the hierarchical structure of the site) and therefore provide little additional information. On the other hand, there is no harm in using links for search relevance either.

###### Alternative approach

Marcel Reutegger (the MAN when it comes to JCR searches) gave a lot of great input to our experiment (thanks a lot for that). He also hinted at what an alternative implementation could look like: using an output filter, which can process HTML content as it is being generated. In CQ5 the validity of links is already checked that way, so storing them would fit there naturally. He also suggested storing the links not below the pages themselves, but in a separate part of the repository; a background job could then aggregate these links and eventually write the most relevant keywords into the page nodes.

## Ignite Chicago

After the Ignite in Zurich came Ignite in Chicago, where our American customers, prospects, partners, and Day staff met to share information and experiences, to network, and simply to have a very good time. The event was slightly bigger than the one in Zurich, both in the number of participants and in the available space.

Ignite was hosted by Day customers, in more than one way: by the City of Chicago itself, and by the grand Intercontinental Hotel, of the IHG Group, on Chicago's famous shopping avenue, the Magnificent Mile.

Again, we had a lot of great presentations, panels, Q&A sessions, as well as informal chats. And the Foreigner concert at the end was the icing on the cake.

The conference hashtag was #dayignite, and there are some Ignite pictures on Flickr, with lots of new coverage from Chicago:

Looking forward to next year's Day Customer Summit!

## On Version Numbers

I have been thinking about version numbers lately while working on some API extensions to the Sling Engine bundle. So here is what I think versions are all about, and why we should all be very careful when changing code and assigning versions to it.

On a high level versions have various aspects:

Syntax

There is no global agreement on the correct syntax of versions. I tend to like the OSGi syntax specification: the version has four parts separated by dots. The first three parts are numbers, called the major, minor and micro version. The fourth part is a plain (reduced character set) string which may be used to describe a particular version. Version numbers are compared as you would expect, except that the fourth part uses case-sensitive string comparison of the actual Unicode code points of the characters.
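As a minimal sketch of this comparison rule (my own illustration, not code from the OSGi specification):

```python
def parse_version(v):
    """Split an OSGi-style version string into (major, minor, micro, qualifier).
    Missing numeric parts default to 0, a missing qualifier to "".
    The resulting tuples compare exactly as described above: the three
    numeric parts numerically, the qualifier as a code-point string."""
    parts = v.split(".", 3)
    numbers = [int(p) for p in parts[:3]] + [0] * (3 - min(len(parts), 3))
    qualifier = parts[3] if len(parts) > 3 else ""
    return (numbers[0], numbers[1], numbers[2], qualifier)
```

For example, `parse_version("1.10.0") > parse_version("1.9.9")` holds because the minor parts compare numerically, not as strings.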

Semantics

The semantics of a version define what it means to increment each place of a version. In the world of software development there is even less agreement on the semantics of version numbers than there is agreement on the syntax. The OSGi specification just defines suggested semantics.

Expectations

When seeing product version numbers, people tend to have expectations about the products. For example, when Firefox went from 2.x to 3.0 we expected a major change. Likewise, when Day bumped the version number to 5 for the newest version of Communiqué, the expectation that it is a major new version of the product was correct: in fact we completely rewrote Communiqué for the 5.0 release.

Version Items

When it comes to applying version numbers, there are quite a number of things in a single product which may be numbered. Take Day Communiqué 5, for example. There is the product - the thing you take out of the box and install on your server. Then there are OSGi bundles. Finally, there are Java packages shared between the bundles and used by the application scripts.

So here are my definitions of the version number aspects laid out above.

Syntax
IMHO the syntax for version numbers as defined in the OSGi Core specification (Section 3.2.4, Version) is good enough and clear for most uses. The nice thing about this specification is that Section 3.2.5, Version Ranges, defines a syntax for ranges of versions. Such ranges are of great use when depending on other items - most importantly, of course, in the list of imported Java packages.

Semantics
As for the semantics, the main problem comes from the fact that not all versioned items understand version numbers in the same way. For example, on the product level, cf. Day Communiqué, the version number of a release is generally defined by marketing and/or product management.

I will not dive into how product version numbers are to be defined. This is outside my working knowledge and beyond my abilities ;-)

On the OSGi bundle level, on the other hand, and even more so on the Java package level (for OSGi package exports), the version number is more the developer's call. Version numbers on this level are intended to convey to other developers something about the evolution of the bundle and/or package.

Let's start with exported Java packages. I tend to attribute the following semantics to the parts of a version number:

• Increasing the major version number means the API has been modified in an incompatible way. Mostly this means public classes, interfaces, methods, fields have been removed or renamed. As a consequence code using and implementing the API will break and has to be modified.

• Increasing the minor version number means the API has been enhanced in a way that is backwards-compatible for users of the API. Code implementing the API, though, might have to be modified to comply with the added API, for example by implementing newly defined methods.

• Increasing the micro version number means that there have been some bug fixes. Generally, a pure API consisting of just interfaces has little room for bugs that do not amount to a minor or even major version number increase. But if the exported packages of a bundle happen to contain concrete or abstract classes with implementation code, bugs cannot be excluded; as such, it is conceivable that the micro version number of an exported package might be increased.

• As for the qualifier part, as the fourth part of a version number is called by the OSGi specification, its meaning is completely free. On the package export level I would go as far as to say it should generally not be used. The qualifier may be interesting on the OSGi bundle level to create inter-release builds.

Expectations
People's expectations regarding version numbers are not easy to pin down; most people expect different things. But I think one thing is common to all: if a version number increases, something must have changed.

So, I think it is important for us developers to understand that we only increase the version number of an item if there is a change -- though I am not sure whether a fixed spelling error in some Java comment is change enough. Again, your mileage may vary if you happen to be the product manager for a product to be sold ....

Recommendations
Based on how I understand the version number parts in terms of exported packages, here are my recommendations for package imports and bundle versions.

• If you implement the exported API of another bundle, import the API package using a version range of the form [x.y,x.y+1). This means accepting any increment in the micro and qualifier parts, but considering any change of the minor version number an incompatibility.

• If you merely use an exported API, import the API package using a version range of the form [x.y,x+1). This means accepting any version starting with a minimum number up to (but excluding) the next breaking API change identified by a new major version number.

• Don't increase the version number of an API package if nothing in that package has changed at all.

• Bundles should be versioned following the versioning of their exported packages. So if at least one of the exported packages has a major version number increase, the bundle's version should also see a major version number increase; likewise for the minor number. The use of qualifiers is optional and sometimes helpful.

• Apart from being driven by versioning of exported packages, bundle versions may also be increased depending on the extent of changes in the bundle. For example in the case of a pure implementation bundle, greatly increasing the functionality might give rise to a major version number increase of the bundle.

• If you are using Maven to build your projects, always depend on the lowest version of a dependent module which has the API functionality you need.
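As a sketch, the two import styles above would look like this in a bundle manifest (the package names and version numbers are made up for illustration):

```
Import-Package: org.example.api;version="[1.2,2)",
 org.example.spi;version="[1.2,1.3)"
```

The first import is the API-user style, accepting everything up to the next major version; the second is the API-implementer style, locked to a single minor version.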

The Eclipse site contains a very interesting and, IMHO, very battle-tested text about versioning of products, bundles and packages: Version Numbering

## [LOTD] How Day Software stumbled upon an open source business strategy

Day's CMO Kevin Cochrane has been interviewed by Matthew Aslet of the 451 group about Day's open source strategy. I particularly liked:

While many other vendors have chosen to retain control over their open source projects for commercial reasons, Day opted to relinquish control with the aim of ubiquity.

Full interview here.

## The IKS semantic engine - a pragmatist's view

As work on the IKS project progresses, my (extremely) pragmatic mind keeps going back to the how can we make this simpler? question.

One of the major goals of IKS is to create semantic extensions for content management systems, but what does that mean? The exact use cases are still vague, and in such a situation it is too easy to over-engineer things, just in case.

We have been talking about RESTful interfaces to IKS components for a while now, but what does this mean exactly? How can we make a concrete step towards defining such interfaces?

I'm a big fan of small concrete steps that lead us towards pragmatic solutions, so let's try to take one such step.

###### Machine-level use cases

Let's start by defining a few simple use cases at the "machine level": a content management system is the client, and the IKS semantic engine is the server. We have already discussed this within IKS; here's a concise summary:

Semantic lifting
Let IKS extract semantic information from (multimedia) content: person and place names, structured links between content items, etc. Optionally make this information editable/confirmable by the client system, as a human user might have to refine the system's suggestions.
At the machine interface level, this requires registering content with the IKS semantic engine, reading the resulting semantically lifted document, and optionally modifying it.
Classification and auto-tagging
Let IKS suggest categories and/or tags for pieces of multimedia content. If an author validates the suggestions, inform IKS of what choices were made.
From the machine interface point of view, this is very similar to semantic lifting.
Query building assistance
Let IKS assist users in formulating search queries, interactively.
From the machine interface point of view, this is very similar to semantic lifting.
Similarities, correlation
Let IKS find similarities between pieces of multimedia content. The axes on which those similarities are found can vary: images, for example, can be graphically similar, or similar in terms of the real world entities that they display.
At the machine interface level, this requires registering content with the IKS semantic engine, and later running queries against this content.

This simple list already hides significant complexity, yet those use cases should be understandable by Joe Author.

Enabling those four use cases could add a lot of value to existing and future content management systems, depending on the quality of the semantic components.

###### RESTful interface

Let's design a RESTful interface based on the machine interactions required to implement the above use cases.

Remember that, in what follows, client designates a content management system that wants to use the IKS engine.

Register content with IKS

To build knowledge about our content, IKS needs to be able to find it. In RESTful terms this means providing IKS with a URL that points to said content, so we have:

Rule #1: Content is registered with the IKS server by HTTP POST requests, containing lists of URLs that point to (created or modified) content items.

Rule #2: IKS reads content by making HTTP GET requests to registered pieces of content. Those URLs must return Content-Types that IKS understands. Some Content-Types are preferred and allow IKS to better understand the content.

Semantic Lifting

Once content is registered, the client can request a semantic view of that content from IKS. That view lists semantic entities that have been extracted from the content.

Depending on the IKS implementation, the semantic view can be editable. It is retrieved by a GET request that contains the IKS identifier (provided by IKS when content is registered) of the content item, and modified using an HTTP PUT request.

The Content-Type and data formats use existing standards, as far as possible.

The semantic view includes IKS-specific metadata, for example to indicate that some parts of the semantic view are still being computed.

Rule #3: The semantic view of a content item is retrieved with a GET request, and if editable can be modified by a PUT request of the modified version.

Semantic queries

Semantic queries are implemented using GET methods on various query URLs, that define how the query is interpreted.

Results are returned with similar Content-Types and data formats as used for semantic lifting.

Rule #4: Semantic queries are executed via GET requests, and return the identifiers (URLs) of the selected content items, optionally with some contextual info to display on query result pages.

IKS engine status

Semantic lifting and indexing operations might take some time, so it's useful for the client to have information on the engine's status, in machine-readable form.

Rule #5: The IKS server reserves part of its URL space for system status information, and provides status information in a structured format.

Is that it?

I think that's it - these simple RESTful interactions should be sufficient to implement our use cases.

What's left is to define the Content-Types used, and for this we can most certainly use existing formats, no need to reinvent any wheel here.
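To make this concrete, here is a minimal client sketch in Python. Everything in it is hypothetical: the `IksClient` class, the endpoint paths and the payload shapes are my own invention, meant only to show how few interactions the five rules require. The sketch composes the requests as (method, URL, body) tuples instead of sending real HTTP:

```python
class IksClient:
    """Hypothetical client for the RESTful IKS interface sketched above.
    Each method returns a (method, url, body) tuple instead of doing real
    HTTP, so the interaction pattern is visible without a running server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def register_content(self, urls):
        # Rule #1: POST a list of content URLs; IKS then GETs them itself (Rule #2)
        return ("POST", self.base_url + "/content", {"urls": urls})

    def get_semantic_view(self, content_id):
        # Rule #3: GET the semantic view of a registered content item
        return ("GET", "%s/content/%s/semantic" % (self.base_url, content_id), None)

    def update_semantic_view(self, content_id, view):
        # Rule #3: PUT back an edited view, if the implementation allows it
        return ("PUT", "%s/content/%s/semantic" % (self.base_url, content_id), view)

    def query(self, **params):
        # Rule #4: semantic queries are plain GETs on query URLs
        qs = "&".join("%s=%s" % kv for kv in sorted(params.items()))
        return ("GET", self.base_url + "/query?" + qs, None)

    def status(self):
        # Rule #5: reserved URL space for machine-readable engine status
        return ("GET", self.base_url + "/system/status", None)
```

That a complete client fits in a couple of dozen lines is, I think, a good sign that the interface is small enough.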

###### RESTful IKS framework

The proof of the pudding is in the eating, and if we wait too long the pudding might lose its taste... so why not start building this right away?

Purists might (rightly) argue that the above is not a design, just a somewhat vague set of principles. Yet, combined with a prototype implementation, this might be a very good way of making a step in the right direction, and of clarifying requirements and interfaces.

My suggestion for the next steps is as follows:

1. Implement the above interface, using dummy semantic components.

2. Provide system interfaces to integrate actual semantic components (semantic lifting, classification, auto-tagging, querying) as plugins.

3. Researchers can work on the semantic lifting components, and integrate them without requiring significant changes on the client side.

###### Conclusion

The best way to go forward with this is probably to create an open source project to collaborate on this RESTful IKS framework.

Even if that framework is thrown away later as the IKS architecture progresses, it would allow IKS consortium members to build a much better understanding of what's actually needed to add "semantic value" to existing and future content management systems.

## Slides from the NoSQL Meetup and ApacheCon US 09

ApacheCon US 09 in Oakland and the NoSQL meetup are over. Find below the slides of the talks given by Day's engineers covering Apache projects Sling, Jackrabbit, Tika and POI as well as OSGi.

## JBoye Presentation: WCM Trends for 2010

Today I had the opportunity to speak at the JBoye conference in Aarhus. As every year, it was a pleasure, since the audience and speakers really constitute a who's who of WCM visionaries and insiders. I am definitely looking forward to coming back next year.

## Sling users: the Tuberculosis Project

The Tuberculosis Project is one of the Sling users registered on the Sling user wiki page. This is an interview with developer Audrey Colbrant, who worked on the project.

Audrey, can you please tell us a bit about the TibTec Tuberculosis Project? What are the project's aims and background?

The TB project is developed by TibTec, a nonprofit technology center based in Dharamsala (India) and directed by Mr. Phuntsok Dorjee. The aim of the project is to build a system to monitor tuberculosis among Tibetan communities in India, Nepal and Bhutan. Thanks to advances in mobile and web computing, it is now possible to design a recording and reporting web portal supporting the WHO DOTS protocol.

The project of monitoring tuberculosis among Tibetan communities in India was born a year ago thanks to four actors: the DoH (Department of Health, Tibetan Government in Exile), the Tibetan Delek Hospital (Gangchen Kyishong, India), AISPO (Italian Association for Solidarity Of Persons), and Johns Hopkins University (USA). TibTec is building the system for these four actors.

The main goal of the project is to build a simple, low-cost and versatile framework so that communities all over the world can benefit from it. The system can also easily be customized for other uses, since it is based on open source software.

If you want to take a look at the architecture, follow the guide.

So how did you end up using Sling? Did you compare Sling against some other frameworks?

The implementation of the TB project was part of my university master's project in computer science. Jacques Lemordant, a researcher in the WAM project at INRIA, had been in contact with Mr. Dorjee, CEO of TibTec, for several years. Together they defined the main lines of the project and chose the most efficient technologies to use.

Sling was chosen because we are very familiar with XML technologies (RELAX NG, XPATH, XSLT...) and hierarchical representation of data.

Another point was the fact that we wanted to access the data from Android (via the Apache HTTP client), and a full REST API was the simplest way to access a JCR repository and manipulate data represented as trees. XML being very well supported on Android, Sling is a perfect match for designing an agile mobile web framework.

Sling is also part of a course on mobile and web technologies at the master's level at the University Joseph Fourier of Grenoble.

Now that you have completed an implementation project with Sling are there any lessons learned you would like to share with the community?

The Sling approach is fairly new, and I haven't seen any other approach of the same kind before. The concept is simple, but it takes a little time to get used to it. So never give up: solutions come slowly, with perseverance.

If you had one free wish from the Sling committers...

Sling is a very interesting and powerful way to work with resources, but it is difficult for Sling beginners to handle when you have a full, composite website to implement, mostly because of the lack of information on the internet.

The hardest thing, which gave me a lot of headaches, was finding the right syntax to use, which changes depending on the technologies you mix.

So I think it would be helpful to have more tutorials on the syntax to use in each different case, on what is better to do or not, and advice on design choices (for example, I faced choices about protecting access to the repository, and about which kind of link is better to use: reference or path, etc.).

It would also be good to fix all the links on this useful webpage.

## Python script to set genre in iTunes with Last.fm tags

Now that I have started to seriously use iTunes, I figured it might be nice to have the genre tag set in a meaningful way. Since I have a reasonably large collection of mp3s, doing that manually was out of the question, so I wrote myself a Python script to do it. There seems to be a large demand for such functionality (at least I found a lot of questions on how to automatically set the genre tag), so maybe someone else will find the script useful. It is pasted below.

General Strategy

The basic idea is to use Last.fm's tags for genre tagging. In iTunes, the genre tag is IMO best used with one single genre, i.e. something like "Electronica", not something like "Electronica / Dance". On the other hand, dropping all but one tag would lose a lot of information, so I decided to use the grouping tag for the additional information contained in the list of tags an artist has on Last.fm. In the example above that would be something like "Electronica, Dance, 80s, German". That way it is simple to use iTunes' Smart Playlist feature to create playlists of all, say, dance music. This approach is probably not suitable for classical music.

The ID3 field that is exposed in iTunes' UI as "grouping" is defined in the ID3v2 spec as:
TIT1
The 'Content group description' frame is used if the sound belongs to a larger category of sounds/music. For example, classical music is often sorted in different musical sections (e.g. "Piano Concerto", "Weather - Hurricane").
So, the strategy I described above seems to be kind of in line with the spec. In general, it is a good idea to have a look at the ID3v2 spec if you consider dabbling with mp3 tags.

Practical Considerations

If one just took an artist's highest-rated Last.fm tag for the genre, one would end up with pretty inconsistent genre tags (think "hip-hop", "hip hop", and "hiphop"). Therefore, I chose to use a fixed set of values for the genre. In a previous version of ID3 the list of possible genres was fixed; while that was clearly a terrible idea to start with, it comes in handy in this case: I used it as the fixed list of genres.

The second practical consideration was which Last.fm tags to include. In Last.fm parlance, each artist tag comes with a weight (values from 0 to 100). Selecting only the tags with a weight larger than 50 worked out fine for me (usually I got 1-5 tags per artist).

A third thing you might want to be aware of: if you programmatically change tags in an mp3, iTunes will not pick up these changes automatically. A simple way of letting it know is to select the "Get Info" command on the affected items; this triggers a reload of the new tag values.

Script

To run the script you will need the Python libraries mutagen and pylast installed. Run it with the option
-d directory_with_mp3s
The script will walk this directory and modify all mp3s it finds. You will also need a Last.fm API key; set API_KEY and API_SECRET accordingly in the script.

#!/usr/bin/env python
# encoding: utf-8
"""
tag_groupings.py

Created by Michael Marth on 2009-11-02.
Copyright (c) 2009 marth.software.services. All rights reserved.
"""

import sys
import getopt
import pylast
import os.path
from mutagen.id3 import TCON, ID3, TIT1

help_message = '''Adds ID3 tags to mp3 files for genre and groupings. Tag values are retrieved from Last.FM. Usage:
-d mp3_directory'''

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

all_genres = TCON.GENRES
genre_cache = {}
groupings_cache = {}
API_KEY = "your key here"
API_SECRET = "your secret here"
network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET)

def artist_to_genre(artist):
    if genre_cache.has_key(artist):
        return genre_cache[artist]
    else:
        tags = network.get_artist(artist).get_top_tags()
        for tag in tags:
            if all_genres.__contains__(tag[0].name.title()):
                genre_cache[artist] = tag[0].name.title()
                print "%20s %s" % (artist, tag[0].name.title())
                return tag[0].name.title()

def artist_to_groupings(artist):
    if groupings_cache.has_key(artist):
        return groupings_cache[artist]
    else:
        tags = network.get_artist(artist).get_top_tags()
        relevant_tags = []
        for tag in tags:
            if int(tag[1]) >= 50:
                relevant_tags.append(tag[0].name.title())
        groupings = ", ".join(relevant_tags)
        groupings_cache[artist] = groupings
        print "%20s %s" % (artist, groupings)
        return groupings

def walk_mp3s():
    for root, dirs, files in os.walk('.'):
        for name in files:
            if name.endswith(".mp3"):
                audio = ID3(os.path.join(root, name))
                artist = audio["TPE1"]
                genre = artist_to_genre(artist[0])
                grouping = artist_to_groupings(artist[0])
                if genre != None:
                    audio["TCON"] = TCON(encoding=3, text=genre)
                if grouping != None:
                    audio["TIT1"] = TIT1(encoding=3, text=grouping)
                audio.save()

def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "ho:vd:", ["help", "output="])
        except getopt.error, msg:
            raise Usage(msg)
        # option processing
        for option, value in opts:
            if option == "-v":
                verbose = True
            if option in ("-h", "--help"):
                raise Usage(help_message)
            if option in ("-o", "--output"):
                output = value
            if option in ("-d"):
                try:
                    os.chdir(value)
                except Exception, e:
                    print "error with directory " + value
                    print e
        walk_mp3s()
    except Usage, err:
        print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
        print >> sys.stderr, "\t for help use --help"
        return 2

if __name__ == "__main__":
    sys.exit(main())

## JCR in 15 minutes

We had a great NoSQL meeting yesterday evening colocated with ApacheCon. Thanks Jukka for organizing!

I was in track B for the second part, and found it very interesting to compare three different approaches to non-relational content storage: MarkLogic server, JCR and Pier Fumagalli’s Lucene+DAV technique.

I also quite liked Steve Yen’s “horseless carriage” way of looking at NoSQL. Defining things by what they are, as opposed to what they are not, sounds like a good idea.

I gave a short talk about JCR, find the slides below. Of course, as usual, they're not as good as when I'm there to present them ;-)

## Interviewed by Internet Briefing Blog

Reto Hartinger of the Internet Briefing group has interviewed me about what I work on these days. The interview is in German.

## NoSQL Meetup and ApacheCon 09 in Oakland

ApacheCon US 09 starts today in Oakland. A couple of Day's engineers will give talks, not just about the usual suspects Sling and Jackrabbit, but also Tika and POI (details below).

Also, Jukka Zitting has helped organize a NoSQL meetup in Oakland starting tonight where Bertrand Delacretaz will talk about JCR.

Bertrand Delacretaz: Life in Open Source communities: Open Source communities often seem to have their own unwritten rules of operation and communication, their own jargon and their own etiquette, which sometimes make them appear obscure and closed to outsiders. In this talk, we'll provide recommendations on how to get in touch with, and how to join, Open Source communities. Based on ten years of experience in various Open Source projects, we will provide practical information on how to communicate effectively on mailing lists, how to formulate questions in an effective way, how to contribute in ways that add value to the project, and generally how to interact with Open Source communities in ways that are mutually beneficial. This talk will help Open Source beginners get closer to the communities that matter to them, and help more experienced community members understand how to welcome and guide newcomers.

Carsten Ziegeler: JCR in Action - Content-based Applications with Jackrabbit: The Java Content Repository API (JCR) is the ideal solution to store hierarchical structured content, and to develop content-oriented applications. This session provides a practical introduction to help you get started using JCR in your own application. To demonstrate the basic architecture of such applications, a sample content-based application will be developed during the session. Basic techniques will be explained, including navigation, searching, and observations, using the Apache Jackrabbit project.

Embrace OSGi - A Developer's Quickstart: In theory, the first choice for highly modular, dynamic, and extensible applications is OSGi technology. The theory sounds very tempting, but what about the real world? Starting with the basics of OSGi, this session is focused on practical examples, tools, and procedures for a rapid adoption of OSGi in your own projects. Learn how to avoid the typical traps and how to get the most out of OSGi.

Felix Meschberger: Rapid JCR applications development with Sling: Apache Sling is an OSGi-based, scriptable applications layer, using REST principles, that runs on top of a JCR content repository. In this talk, we'll see how Sling enables rapid development of JCR-based content applications, by leveraging the JSR 223 scripting framework. We'll also look at the rich set of OSGi components provided by Sling. We will create a simple application from scratch in a few minutes, and explain a more complex multimedia application that does a lot with just a few lines of code. This talk will help you get started with Sling and understand how the different components fit together.

Jukka Zitting: MIME Magic with Apache Tika: Apache Tika aims to make it easier to extract metadata and structured text content from all kinds of files. Tika is a subproject of Apache Lucene, and leverages libraries like Apache POI and Apache PDFBox to provide a powerful yet simple interface for parsing dozens of document formats. This makes Tika an ideal companion for Apache Lucene, or for any search engine that needs to be able to index metadata and content from many different types of files. This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity. The audience is expected to have basic understanding of Java programming and MIME media types.

Paolo Mottadelli: Apache POI recipes: The Apache POI project provides Open Source Java APIs for the manipulation of Microsoft Office format files. It was developed to provide OLE2 Compound Document format support. POI support for the new format was necessitated by the proliferation of new Office Open XML (OOXML) documents, due to its standardization. As a result, a common challenge emerged for projects that leverage POI to read and write Excel, Word, and PowerPoint documents: supporting the new format while maintaining backward compatibility with the earlier one. This session provides an overview of how the new POI architecture makes that challenge easier, using the common interfaces package and their double implementation. Participants will also learn about the main new features provided by POI towards support of the new OOXML format. To demonstrate POI's features, this session will also drive through a collection of practical recipes to solve the tough problems of integrating Office documents in your enterprise applications.

## Life in Open Source Communities, live at ApacheCon!

I have just finished my slides for next week at ApacheCon. Though the topic of how to “survive” in our open source communities has been on my mind for a while, this is a totally new presentation, which is both great (in the blank slate sense) and a lot of work.

Having recently read Presentation Zen (highly recommended if you do presentations and/or like beautiful books), I started adding full-screen pictures to the first few slides, and couldn’t stop! The presentation will then consist of me ad-libbing (or more precisely trying to tell stories) over a series of nice pictures grabbed from morguefile.com (don’t worry about that name).

I’ll post the slides here later, for now they are super secret, so you’ll just get the teasers…images courtesy of morguefile.com (update: slides added now).

Hope to see you next week! In any case I have collected a number of useful links in my delicious bookmarks, I’ll point people to them in the presentation.

## Bertrand on the telly (brush up your French)

Bertrand Delacretaz was interviewed at the OpenWorld Forum in Paris about the Apache Software Foundation. It is an ASF primer - apparently the ASF and its ways of working are relatively unknown in France.

## Author-centric feature scoping and integration

Won't somebody please think of the ~~children~~ authors?
(almost Helen Lovejoy)

Have you seen the component for multi-variate testing (MVT, aka A/B testing) in the upcoming CQ5.3 release? I saw it demoed by David at Ignite and was completely blown away: CMS users (authors) can simply drag a couple of alternative banners onto the component right from within their regular editing interface. The CMS then shows the different versions to different users and counts the click-through rates, so that eventually the best-performing banner is determined.
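The mechanics behind such an MVT component can be sketched in a few lines: assign each visitor a variant at random, count impressions and clicks, and compare click-through rates. This is a hypothetical stdlib-only sketch, not CQ5's actual implementation; all names are invented:

```java
import java.util.Random;

// Sketch of the core of an A/B (multi-variate) test: random variant
// assignment, per-variant impression and click counters, and a
// click-through-rate comparison to find the best performer.
public class AbTest {
    private final int[] impressions;
    private final int[] clicks;
    private final Random random;

    public AbTest(int variants, long seed) {
        this.impressions = new int[variants];
        this.clicks = new int[variants];
        this.random = new Random(seed);
    }

    // Pick which banner variant a visitor sees and count the impression.
    public int assignVariant() {
        int v = random.nextInt(impressions.length);
        impressions[v]++;
        return v;
    }

    public void recordClick(int variant) {
        clicks[variant]++;
    }

    // Click-through rate: clicks divided by impressions for a variant.
    public double clickThroughRate(int variant) {
        return impressions[variant] == 0 ? 0.0
                : (double) clicks[variant] / impressions[variant];
    }

    // The "winning" banner is the one with the highest CTR so far.
    public int bestVariant() {
        int best = 0;
        for (int v = 1; v < impressions.length; v++) {
            if (clickThroughRate(v) > clickThroughRate(best)) best = v;
        }
        return best;
    }
}
```

The point of the CQ5 component, of course, is that authors never see any of this machinery: they just drag banners into place.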

The MVT feature reminded me of two other CQ5 features: personalisation and analytics. All three are truly and seamlessly integrated into the authors' user interface, and they all provide less functionality than full-blown standalone solutions. To give you a concrete example: for each page, the editors see right in the site admin (i.e. in their daily user interface) how many views the page got in the last 30 days. Clearly, this is no match for the kinds of reports you get from, say, Google Analytics, not even the same ballpark. Yet I still think the authors get something that is valuable for them: they see right away what is of interest to their audience.

(There is a screencast available that demonstrates the personalisation features from an author's perspective; registration needed.)

When I compare this author-centric evaluation of functionality with my usual point of view as a system architect, it seems that the business value of a feature for an author is determined by:

• ease of use rather than feature richness and

• seamless integration into the UI

This is probably true for most systems that have non-technical users, but I believe the effect is amplified in CMSs because many CMS users use the system only once in a while rather than regularly.

Of course, this author-centric view on features should not necessarily dictate the underlying system architecture, especially when you look at a complete content management solution encompassing analytics, personalization etc. The architecture might still be full-stack or best-of-breed, and I do not want to postulate one being better than the other. However, I believe that one way of knowing you got the author's user experience right is when you can no longer see the system architecture reflected in the UI. Or to put it the other way round: the UI should not enable you to guess which box a given feature is running on.(*)

Thanks to Lars for providing discussions and ideas on this post.

(*) This idea is adjacent to a pet subject of mine: the user interface for basic content management (CRUD stuff) should not enable you to guess the underlying content/data schema. Sadly, even today many CMS UIs still resemble ERP-style data-entry masks.

## How well does the French-speaking world know the Apache Software Foundation?

I was at OpenWorldForum in Paris a few weeks ago, together with fellow Apache members Sander Striker and Emmanuel Lécharny.

My first impressions (apart from the fact that Paris is always nice – I knew that already) were that the French tend to wear suits and say “vous” (polite form of “you”) instead of “tu” (the familiar form) which I would tend to use in geeky circles. Cultural differences…

But more seriously, how well does the French-speaking world know the Apache Software Foundation? Not well, it seems to me. In most of our discussions people could associate the ASF with the Apache HTTP Server project, but not much more. 2’000 committers? 300 members? Really?

To help improve this, I hope that the ASF can take a more active role in the conference next year; I’ll bring this up next week at ApacheCon with our conference people.

In the meantime, French-speaking folks are welcome to learn a bit more about it thanks to TiViPRO’s interviews of Emmanuel and myself, shot during the conference.

## NoSQL interests

We’re organizing a NoSQL meetup in Oakland on Monday next week. In addition to helping set the meetup agenda, the “Topics you are interested in” question in the sign up form provides some interesting insight on the current interests of the NoSQL community. Here’s a quick breakdown of the key terms distilled from the 88 signups we’ve received so far.

Note that the data is biased towards Apache projects due to the meetup being organized at ApacheCon US 2009.

#### Projects

The following open source projects were mentioned. The list is in alphabetical order, as the data set is too small to make any reasonable ordering by popularity.

#### Topics

Many responses were about the “big data” aspect of the NoSQL movement. Some frequent keywords: distributed storage, large transactional data, consistency, failover, availability, reliability, stability, failure detection, failed node replacement, (petabyte) scalability, consistency levels, storage technology, performance, benchmarks, optimization, backup and recovery, map/reduce

Another common theme was the various database types and the NoSQL “development model”. Keywords: document stores, key/value stores, consistent hashing, graph databases, object databases, persistent queues, content modeling, migration from the relational model, social graphs, streaming, software as a service, offline applications, full text search, natural language processing

Beyond the above big themes, I found it interesting that the following technologies were specifically named: Erlang, Java, WebSimpleDB, WebDAV

In addition to specific topics, many people were asking for case studies or “lessons learned” -type presentations.

## Colayer's approach to collaboration software

Chances are you have not heard of Colayer, a Swiss-Indian company producing SaaS-based collaboration software. I did a small project with them; that is how I got to know the company. When I first saw their product I immediately thought they were onto something good, so it is worthwhile to share a bit of their application concepts. They follow an approach I have not seen anywhere else.

At first glance, Colayer seems to be a mixture of wikis and forums: the logical data structure resembles a hierarchical web-based forum, and the forum entries are editable and versioned like in a wiki. But there is more: presence and real-time. All users that are currently logged in are visible, and one can have real-time chats within the context of the page one is on, or see updates to the page in real time (similar to Google Docs). These chats are treated as atomic page elements (called sems in Colayer parlance) just like the forum entries or other texts. Through this mechanism, all communication around one topic stays on one page and in the same context.

There are two more crucial elements: time and semantics. Each sem's visibility is controlled by its age and importance. As such, a simple chat is given less weight than a project decision and will fade out of view after some time. All new items from all pages (i.e. discussions or topics) are aggregated on a personal home page and shown within the context where they occurred.

Below is a screenshot of such different sems in one page. One page corresponds to one topic or forum or wiki page. You can see the hierarchical model and the different semantics (denoted by the colors).

Here is an example screen shot that aggregates different recent sems on one page (essentially a context-aware display of new items including time and context in the same display). Note that this way of displaying new items manages to map importance, time and context into a two-dimensional page, which I find a very cool achievement.

The funny thing about Colayer's product (especially when compared to Google Wave) is that one "gets it" when first looking at it. It solves a problem I am facing in my work on a daily basis: where to put or find crucial information - on an internal mailing list or on the wiki?

The Colayer application is delivered as a browser-based SaaS solution (mainly targeted towards company-internal collaboration). This limits potential usage scenarios outside of the firewall. It would be cool if Colayer found a way of opening up their application to other data sources or consumers. It would be worth it, the app rocks.

## Ignite Zurich

Ignite in Zurich was a blast. Lots of good presentations (which I will hopefully be able to share later) and excellent discussions on content management. My favourite quote was from Newsweek's Meshach Jackson. Their editors on the CMS selection process:

If you don't select CQ we're all quitting.

I also enjoyed what David Nuescheler had to say about separation of content and layout. The standard CMS architecture thinking goes like this:
a) we need to separate content and layout
b) therefore, the user interface for the authors cannot look anything like the rendered pages and we will make it look like a database entry mask

David demonstrated that a) simply does not lead to b). Or in his own words:

Separating content and layout does not mean you have to confuse the authors

To get a taste of the event: The conference hashtag was #dayignite and here are some Ignite pictures on Flickr:

So, in case you missed Ignite Zurich there is still Ignite Chicago coming up...

## Putting POI on a diet

The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.

However, there’s one problem with POI that I’d like to see fixed: It’s too big.

More specifically, the ooxml-schemas jar used by POI for the pre-generated XMLBeans bindings for the Office Open XML schemas is taking up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.

Does anyone have good ideas on how to best trim down this OOXML dependency?

## [ANN] Sling Users List now available

Apache Sling keeps on growing up. Today's step: Felix Meschberger announced a users mailing list (before there was only one mailing list for developers and users). The new list is also available in the Discussion Groups section of dev.day.com.

## Re: RESTful daydream #4

Justin Cormack has written a notable blog post titled "RESTful daydream #4" (pun intended?) about RESTful content repositories. I disagree with some of what Justin writes about JCR and Sling. However, I completely share his vision about REST's role in content management and I am with him regarding the overall theme of his post.

Essentially, Justin asks for a "web content repository":

The odd thing is that a web content repository alone surely lends itself to a simple REST architecture. Content is after all lots of small resources with relations. [...] It takes content, relates it to other content, and serves it back, with authentication and versioning. Everything else is in other system layers, transforming it and so on. Not simple, but well defined; lower level than JCR + Sling say.

Apache Sling provides a RESTful interface onto content, but being a web application framework it provides much more, especially scripting. So I can understand well why Justin dismisses Sling as being too powerful. However, I believe that Jackrabbit's native HTTP layer is pretty much on the mark (Justin also dismisses it in the comments as being not RESTful; I do not know why). As far as I know the Jackrabbit HTTP remoting layer is still a work in progress, so one might argue about the details of its RESTfulness, but overall it fits what I understand to be a RESTful web content repository.

There is one aspect regarding a web content repository I would like to add to the discussion: I think it is crucial that the representations of resources can be consumed by browsers. HTML forms for writing, and presenting resources in JSON, should be part of the equation. Adding models or semantics on top of that might do more harm than good.
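To make that concrete, a browser-consumable resource representation might look something like the following. This JSON is purely illustrative; the paths and field names are invented, not taken from Jackrabbit, Sling, or Justin's post:

```json
{
  "path": "/content/articles/restful-daydream",
  "type": "article",
  "title": "RESTful daydream #4",
  "version": "1.2",
  "relations": {
    "parent": "/content/articles",
    "comments": "/content/articles/restful-daydream/comments"
  }
}
```

The same resource would be writable through a plain HTML form POSTing to its path, keeping the repository usable from a browser with no client-side tooling.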

Do we need a formal standard for web content repositories? From my perspective: not yet. At the moment we still need to learn more, i.e. more repository implementations and users that have built a CMS on top of a web content repository.

## Flex layout is asynchronous

Here is a small bit about the asynchronous nature of the Flex layout mechanisms that I learned while slapping together a presales demo yesterday:

When changing properties of UIComponents, listen for FlexEvent.UPDATE_COMPLETE events; they are fired when the change is actually done. In my case I needed to get the textWidth of a Label after changing the label text: right after calling the setter of text, the getter of textWidth will still return the old value.

## [LOTD] Why JCR is good for Content Management?

eXo's Peter Nedonosko discusses on his blog why JCR is good for content management. I could not agree more with his conclusion: it is useful for CMS developers.

JCR standard is useful for CMS [...]

## Talk at Java User Group

Yesterday, I gave a talk at the Java User Group Switzerland (JUGS) titled "Agile RESTful Web Development". It was about the REST style in principle and hands-on RESTful development with Apache Sling. I enjoyed giving the talk and think that it was well received. Here's the slide deck:

## [ANN] Talking at openworldforum in Paris this Friday

I'll be talking at Open World Forum in Paris later this week, presenting a condensed version of my Open Source Collaboration Tools are Good for You talk, this Friday at 13:40 (and no, the topic is not "Coming Soon" as indicated on the program ;-)

Later on Friday afternoon, I'll take part in a forum (in French) on the future of open source forges. I haven't seen the list of participants yet, but the topic looks promising.

Sander Striker, Executive Vice President of the Apache Software Foundation, will also be there for a roundtable this Thursday after the 16:30 keynote.

And I'll be staying with fellow Apache member Emmanuel Lecharny - meeting the locals is always nice, and Emmanuel's a city cyclist as well, so we'll be able to trade city jungle tricks I guess.

The conference looks quite busy with lots of interesting presentations, roundtables and workshops.

Looking forward to it, and make sure to say bonjour if you're around!

## Data First in Cloud Persistence

My colleague Cedric Huesler gave a talk "Data First in Cloud Persistence" at yesterday's CloudCamp in London. Missed it? You'll have another chance next week at the Frankfurt CloudCamp. Meanwhile, here's the slide deck (I love the second slide):