Posted by Lars Trieloff APR 07, 2009
As you all know, CQ5 supports both tagging and taxonomies, side by side. Taxonomies are great because they allow multi-dimensional classification of content, but sometimes there are things that do not fit into the taxonomy. This is where it comes in handy that you can just type a new tag and add it to the folksonomy in the standard tag namespace. Using this feedback from the folksonomy, you can enhance and improve your taxonomy. But what happens if you do not start with a neatly organized taxonomy, but with a wild-west folksonomy that has been created by numerous authors, and you want to bring order into the chaos?
Actually, you are in a very good position. Starting with the data gives you the ability to come up with a meaningful taxonomy that is relevant to your content on the first attempt. Using a folksonomy as a starting point to create a taxonomy is what I (and others) call "Folksonomy Mining". To illustrate my Folksonomy Mining technique, I will use the folksonomy created by the last.fm community.
- Start with a folksonomy of viable size in a well-defined domain. You need at least 1000 tagged items, and the domain should not be all-encompassing like "web pages on the internet". The last.fm folksonomy is certainly large enough, and with music we have a domain that is restricted enough.
- Get the 100 most popular tags out of the folksonomy. With CQ5 tagging you have the "count" column that tells you how popular a tag is. With last.fm, there is an API method for that.
- Remove obvious duplicates. "favorite", "favorites", "favourites" and "favourite" for instance need to be merged.
- Create dimensions for groups of similar tags. Examples that I can find in the last.fm folksonomy are: time (60s, 70s, 80s, 90s, 00s), origin (american, australian, british), mood (ambient, atmospheric, chillout), vocals (female vocalists, male vocalists, instrumental), ownership (albums i own, seen live), origin (soundtrack, covers, live, remix), season (christmas, summer), preference (amazing, awesome, beautiful)
- There is usually low co-occurrence between different tags in the same dimension: an item is rarely tagged with both "80s" and "90s".
- For more complex dimensions such as genre (rock, pop, country, folk) you might want to create sub- and super-categories. For example: rock > metal > death metal > brutal death metal (yes, this is part of the top 100).
- There is usually high co-occurrence between super- and sub-categories. And the super-category usually has more entries than the sub-category.
- Fill in "holes" in the taxonomy. For instance, in the time dimension add 30s, 40s, 50s; in the season dimension add spring, fall, winter.
- For categories with many sub-categories, add grouping categories where they become helpful. For instance, in the origin dimension it might help to add american, european, asian, african to group origins.
- A number of tags will remain uncategorized. Just leave them that way, and leave the folksonomy open so that new tags can be added over time.
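As a rough illustration of the counting and merging steps from the list above, here is a minimal Python sketch. It is not CQ5 or last.fm code: the sample items, the `CANONICAL` variant map, and the function names are all made up for this example. In practice the counts would come from the CQ5 "count" column or the last.fm API, and the variant map would be built by hand while looking through the top tags.

```python
from collections import Counter

# Hypothetical sample data: each item maps to the tags authors gave it.
TAGGED_ITEMS = {
    "track-1": ["rock", "favorites", "80s"],
    "track-2": ["rock", "favourite", "seen live"],
    "track-3": ["pop", "favourites", "80s"],
    "track-4": ["rock", "metal", "favorite"],
}

# Hand-maintained map from variant spellings to one canonical tag.
# This is an assumption of the sketch; duplicates are spotted by inspection.
CANONICAL = {
    "favorite": "favorites",
    "favourite": "favorites",
    "favourites": "favorites",
}


def normalize(tag):
    """Trim, lower-case, and map known variants to their canonical tag."""
    cleaned = tag.strip().lower()
    return CANONICAL.get(cleaned, cleaned)


def top_tags(tagged_items, limit=100):
    """Count each tag once per item after merging variants.

    Returns a list of (tag, count) pairs, most popular first.
    """
    counts = Counter()
    for tags in tagged_items.values():
        # A set ensures an item tagged with two variants is counted once.
        counts.update({normalize(tag) for tag in tags})
    return counts.most_common(limit)
```

Merging before counting matters: the four spellings of "favorites" each appear only once in the sample data, but together they form the most popular tag.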
With these ten simple rules you have managed to grow trees out of the tag cloud and added structure where it is needed, structure that can be reused in query builders and other places in the system.
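The two co-occurrence rules of thumb from the list (low co-occurrence within a dimension, high co-occurrence between super- and sub-categories) can be checked with a small script. The following Python sketch is only one possible way to do it: the sample data, the overlap measure, and the `high` and `low` thresholds are assumptions chosen for illustration. Low co-occurrence is only a hint that two tags might sit side by side in one dimension, so the result still needs a human eye.

```python
# Hypothetical sample data: item id -> set of tags (duplicates already merged).
ITEMS = {
    "a": {"rock", "metal", "80s"},
    "b": {"rock", "metal", "death metal", "90s"},
    "c": {"rock", "90s"},
    "d": {"pop", "80s"},
    "e": {"rock", "metal", "death metal"},
    "f": {"pop", "90s"},
}


def items_with(tag, items):
    """Return the ids of all items carrying the given tag."""
    return {item_id for item_id, tags in items.items() if tag in tags}


def overlap(tag_a, tag_b, items):
    """Share of the less popular tag's items that also carry the other tag."""
    with_a = items_with(tag_a, items)
    with_b = items_with(tag_b, items)
    smaller = min(len(with_a), len(with_b))
    return len(with_a & with_b) / smaller if smaller else 0.0


def relation(tag_a, tag_b, items, high=0.8, low=0.2):
    """Guess how two tags relate, based on co-occurrence and popularity.

    The thresholds are arbitrary starting points, not tuned values.
    """
    score = overlap(tag_a, tag_b, items)
    count_a = len(items_with(tag_a, items))
    count_b = len(items_with(tag_b, items))
    if score >= high and count_a != count_b:
        # High overlap: the tag with fewer entries is the likely sub-category.
        child, parent = (tag_a, tag_b) if count_a < count_b else (tag_b, tag_a)
        return ("sub-category", child, parent)
    if score <= low:
        # Low overlap: the tags may be siblings in one dimension.
        return ("same-dimension candidates", tag_a, tag_b)
    return ("unclear", tag_a, tag_b)
```

With the sample data, `relation("metal", "rock", ITEMS)` returns `("sub-category", "metal", "rock")`, while `relation("80s", "90s", ITEMS)` returns `("same-dimension candidates", "80s", "90s")`.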

