Friday, 18 June 2010

How do you spell goal?

It's not often I get the chance to pursue three of my passions - sport, mathematics and online social media - at the same time. The 2010 Football World Cup combines these facets of life in a way we probably haven't seen before, providing numerous opportunities for data mining, funky visualisations and general nerd-indulgence, as well as knocking out Twitter for a time.

One creative exploration I saw recently was by @neilkod, who collected data from 30 GB of tweets on how people spelt the word "goal". Data mining Twitter is an evolving field - see our recent story on how Twitter data can be used to predict the success of a film. You can find the full goal data table here, and I have listed the top and bottom few below:

Rank  Word      Count
1     goal      50225
2     Goal      11727
3     GOAL      4202
4     goAl      798
5     goall     340
6     GOAAAL    92
7     goaL      88
8     Goall     75
9     GoaL      69
10    GOAl      66
11    goaaaal   61
12    GOal      50
....  ....      ....
1249  GGGGGGGGGGGGGGGGOOOOOOOOOOOOOOOOOOOOOOOOAAAAAAAAAAAAALLLLLLLLLLLLLLLL  1
1250  GGGGGGGGGGGGGGGGGGGOOOOOOOOOOOOOOOOOOAAAAAAAAAAAAAALLLLLLLLLLLLLLLLL  1
1251  GGGGGGGGGGGGGGGGGGGOOOOOOOOOOOOOOOALLLLLLLL  1
1252  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOAAAAAAAAAAAAAAAAAAAAAAAAAALLLLLLLLLLLLLLLLLLLLLLLLL  1

As expected, on top is the word "goal" (71%) followed by "Goal" (17%) and "GOAL" (6%). Then there are various misspellings, before the excited tweets come in, including the 140 character "Goooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooal".

Interestingly, the caps lock induced "gOAL" is only used once.
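
For anyone wanting to try this at home, here is a minimal sketch of how such a tally might be built. It is not @neilkod's actual method, which isn't described here: the regular expression, the function name count_goal_spellings and the sample tweets are all my own assumptions, and this pattern only catches spellings that keep the letters g, o, a, l in order.

```python
import re
from collections import Counter

# Hypothetical pattern: one or more of each letter g, o, a, l, in order,
# as a whole word. Case is ignored when matching but kept when counting.
GOAL_PATTERN = re.compile(r"\bg+o+a+l+\b", re.IGNORECASE)


def count_goal_spellings(tweets):
    """Tally each distinct, case-sensitive spelling of "goal" in the tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(GOAL_PATTERN.findall(text))
    return counts


# Made-up sample tweets, purely for illustration.
sample_tweets = [
    "goal!!!",
    "GOOOAL what a goal",
    "Goall by Germany",
    "the goalkeeper saved it",
]

ranked = count_goal_spellings(sample_tweets).most_common()
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
```

Because the pattern insists on a word boundary after the final "l", "goalkeeper" is not counted, while "goal!!!" is.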

You can also visualise this in a chart - note that in the chart below, the x-axis is a log scale because the leading terms are so far ahead:

Distribution of the word "goal"

Now it's time to get our nerd on....

Zipf's law is a curious law that arose out of an analysis of language by the linguist George Kingsley Zipf, who theorised that given a large body of language (say, a long book), the frequency of each word is close to inversely proportional to its rank in the frequency table. That is, writing f(r) for the frequency of the word ranked r:

$$f(r) \propto \frac{1}{r^{a}}$$

where a is close to 1. This is known as a "power law" and suggests that the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. There is an excellent take on this over at Plus Magazine. As you can see in the log/log chart below, after the 5th version of "goal" a Zipf curve fits remarkably well.

Distribution of the word "goal" - log/log chart
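
If you want to play with the fit yourself, the sketch below estimates the exponent a by drawing a least-squares straight line through the points (log rank, log count). It uses only ranks 5 to 12 from the table above, since those are the only rows past the 5th version listed here, so the number it produces should not be expected to match a fit to the full 1,252-row data set. The function name fit_power_law and the choice of plain least squares in log space are my assumptions, not necessarily how the chart's curve was produced.

```python
import math

# Ranks 5 to 12 and their counts, copied from the table above.
ranks = [5, 6, 7, 8, 9, 10, 11, 12]
counts = [340, 92, 88, 75, 69, 66, 61, 50]


def fit_power_law(ranks, counts):
    """Least-squares line through (log rank, log count).

    Returns (a, c) for the model count ~ c / rank**a.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(n) for n in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A straight line with slope -a in log/log space is a power law
    # with exponent a.
    return -slope, math.exp(intercept)


a, c = fit_power_law(ranks, counts)
print(f"estimated exponent a = {a:.2f}, constant c = {c:.0f}")
```

A straight line on a log/log chart is exactly what a power law looks like, which is why the fit is done on the logarithms rather than on the raw counts.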

There is no generally accepted explanation of why Zipf's law should apply to languages, and there is controversy surrounding whether it gives any meaningful insight. Power laws relating rank to frequency have been demonstrated to occur naturally in many places - the size of cities, the number of hits on websites, the magnitude of earthquakes and the diameters of moon craters have all been shown to follow power laws.

Wentian Li demonstrated in his paper Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution, published in IEEE Transactions on Information Theory, that words generated by simply combining letters at random fit the Zipf distribution. Li showed mathematically that the power law distribution of frequency against rank is a natural consequence of the word length distribution: any given word of length 1 occurs more frequently than any given word of length 2, and so forth, so long words tend to be rare whilst short words are common. Li argues that, as Zipf distributions arise in randomly generated texts with no linguistic structure, the law may be a statistical artifact rather than a meaningful linguistic property.
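
The random-text idea is easy to try for yourself. The sketch below is a simplified toy version, not Li's own setup: the four-letter alphabet, the equal chance of hitting each key (including the space bar), the text length and the function name random_text_word_counts are all assumptions of mine. It types characters at random, splits the result on spaces and ranks the resulting "words" by frequency.

```python
import random
from collections import Counter


def random_text_word_counts(n_chars=200_000, letters="abcd", seed=0):
    """Type n_chars symbols uniformly at random from letters plus a space,
    then count the resulting "words" (runs of letters between spaces)."""
    rng = random.Random(seed)
    alphabet = letters + " "
    text = "".join(rng.choices(alphabet, k=n_chars))
    return Counter(text.split())


ranked = random_text_word_counts().most_common()

# Print a few ranks spread across the table to see how frequency and
# word length change as the rank rises.
for rank in (1, 4, 5, 20, 21, 84):
    word, count = ranked[rank - 1]
    print(rank, word, count)
```

With these settings the four one-letter words take the top four ranks and the sixteen two-letter words take the next sixteen, so frequency drops in steps as word length grows. Plotting rank against count on log/log axes shows those steps tracking a roughly straight line.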

Our results mirror Li's quite closely. It is clear that the most used versions of the word "goal" - where goal is spelt correctly with various capitalisations - should not fit the Zipf distribution, as these words are not random: they are the correct spellings people intend to write in their tweets. However, after the initial few words, random spelling errors, and then the simple randomness of how long people hold their fingers on the keys in the excitement of a goal, take hold. From this point, we see the same pattern as Li: the Zipf distribution arises from random words, with longer words less common than shorter words.

If you would like to see some truly insightful Twitter World Cup visualisations, check out The Guardian's World Cup 2010 Twitter replay. I haven't been replaying the Australia vs. Germany game very often....