Thursday, 9 October 2008, data mining and mashups

I've recently been putting together a Guide to Web 2.0 for The Helix Magazine, and one of the most interesting aspects has been exploring the various mashups and applications of one online music service in particular. It is a brilliant service and currently my favourite "web 2.0" application. Once you download a plugin for iTunes (or whatever music player you have) that "scrobbles" each song you play (that is, tells the service what you are listening to), a picture of your music taste builds up, and people with similar listening tastes are found. Artists are recommended to you according to your tastes, charts of your songs are built up, and "radio stations" perfectly tailored to you can be streamed online. It is better than radio, too, as there are no ads and you like every song.

By the way, my username there is westius.

Millions of songs are scrobbled every day by users. This data builds up a massive database of user music preferences and, because the service offers an API, it is possible to access that information and develop interesting tools.

Because users can tag their music with the genres they think aptly describe their songs and artists, it is possible to generate your own tag cloud of musical preferences. Using an excellent script, I came up with my own tag cloud, as you can see here.
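
For the curious, here is a rough sketch of how a tag cloud like this could be computed. It is not the script I used: the function name `build_tag_cloud`, the idea of weighting tags by play count, and the listening data are all assumptions made up for illustration.

```python
from collections import Counter

def build_tag_cloud(artist_plays, artist_tags, top_n=10):
    """Weight each tag by the play counts of the artists that carry it.

    Hypothetical helper: the inputs are plain dicts, not real API responses.
    """
    weights = Counter()
    for artist, plays in artist_plays.items():
        for tag in artist_tags.get(artist, []):
            weights[tag] += plays
    total = sum(weights.values())
    if total == 0:
        return {}
    # Normalise by the total tag weight; a renderer could map these
    # values to font sizes.
    return {tag: count / total for tag, count in weights.most_common(top_n)}

# Made-up listening data, purely for illustration.
plays = {"Artist A": 30, "Artist B": 10}
tags = {"Artist A": ["indie", "rock"], "Artist B": ["electronic", "indie"]}
cloud = build_tag_cloud(plays, tags)
# cloud == {"indie": 0.5, "rock": 0.375, "electronic": 0.125}
```

The biggest tags in the cloud are simply the ones attached to the artists you play most.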

From such tag clouds it is possible to examine how listeners fall into different categories, through a process known as data mining. Data mining is essentially the process of sorting through enormous amounts of data and picking out the relevant stuff. Using principal components analysis - a mathematical technique which reduces multidimensional data sets to lower dimensions for analysis - and k-means clustering - an algorithm to cluster n objects into k groups - Liekens came up with five broad groups of listeners:
  1. Electronic/pop
  2. Rock
  3. Indie
  4. Metal
  5. Hip-hop
Clearly this list does not reflect everyone on the service (where are the classical music listeners?), but it does reflect the majority. I was surprised that Indie is a group in itself and am intrigued by the bundling of electronic and pop together. There are some tweaks to the maths that could produce different groups, and better results might be possible with a bigger data set. Hip-hop listeners were the most clearly defined group. You can read more about the maths and how these groups are separated in the original article.
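
To give a feel for the two techniques, here is a minimal sketch in plain Python. It is not Liekens's code: the listener data is invented, the function names are my own, PCA is done with a simple power iteration rather than a proper library routine, and k-means starts from the first k points, which is a simplification that real implementations usually avoid.

```python
import math

def center(rows):
    """Subtract each column's mean so the data is centred on the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centred):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def principal_components(centred, n_components, iterations=200):
    """Leading eigenvectors of the covariance matrix, found by power
    iteration with deflation (a simple stand-in for a library PCA)."""
    cov = covariance(centred)
    d = len(cov)
    components = []
    for _ in range(n_components):
        vec = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = mat_vec(cov, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0:
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(cov, vec)))
        components.append(vec)
        # Deflation: subtract the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return components

def project(centred, components):
    """Express each row in the lower-dimensional space of the components."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centred]

def kmeans(points, k, iterations=50):
    """Cluster points into k groups; returns one label per point."""
    centroids = [list(p) for p in points[:k]]  # assumption: naive initialisation
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented play counts per tag: [rock, metal, hip-hop, electronic]
listeners = [
    [50, 40, 0, 5],
    [2, 0, 60, 30],
    [45, 50, 3, 0],
    [0, 5, 55, 35],
    [55, 45, 2, 4],
    [3, 1, 50, 40],
]
centred = center(listeners)
components = principal_components(centred, n_components=2)
points = project(centred, components)
labels = kmeans(points, k=2)
```

With this toy data the four tag dimensions are squashed down to two, and the rock/metal listeners end up with one label while the hip-hop/electronic listeners get the other. Changing k, or the number of components kept, is the kind of tweak that could produce different groups.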

Another interesting thing you can do is compare your music tastes with your friends'. This pic is a difference cloud comparing my music tastes with those of my good friend intranation. We have roughly 40% similarity in music genre tastes. The green tags are the genres I have more of in my collection, and the red tags are the genres that intranation listens to more than I do. No real surprises there.
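
I don't know exactly how the similarity figure behind this cloud is calculated, but one plausible way to get a number like it is cosine similarity between two tag clouds. The sketch below assumes that measure, and the tag weights in it are made up rather than taken from our real profiles.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two tag-weight dicts (0 = nothing shared)."""
    tags = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in tags)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def difference_cloud(mine, theirs):
    """Positive values: I listen to the tag more ("green").
    Negative values: my friend listens to it more ("red")."""
    tags = set(mine) | set(theirs)
    diff = {t: mine.get(t, 0.0) - theirs.get(t, 0.0) for t in tags}
    return {t: d for t, d in diff.items() if d != 0}

# Invented tag weights, purely for illustration.
mine = {"indie": 0.5, "rock": 0.3, "electronic": 0.2}
theirs = {"hip-hop": 0.6, "rock": 0.4}

similarity = cosine_similarity(mine, theirs)
diff = difference_cloud(mine, theirs)
```

Here `diff["indie"]` comes out positive, so it would be drawn green, while `diff["hip-hop"]` is negative and would be drawn red.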

Mashups are all the rage at the moment. The term refers to web applications that combine data from more than one source into a single integrated tool. For instance, Domain, an Australian real-estate site, adds data from Google Maps to provide location information. My current favourite mashup is idiomag. idiomag is a digital music magazine that personalises its content according to your interests in music, which it learns from your profile. It gives you stories and reviews of the artists and genres you like, helps you discover new music and mashes in video and audio from YouTube and other sources. idiomag aggregates music articles from over 100 different sources. You can also tweak the articles you like, so if you receive something you don't like, you won't get it again. I subscribe to the RSS feed of my personalised idiomag magazine and so far it's been great, including reviews of music DVDs by artists I like and schedules of when bands will be playing and appearing on TV. Good stuff.

I will probably put out a few more blog posts like this as I explore this world of mashups. And for podcast listeners: yes, hopefully I will get one of those out soon too!
