## Thursday 3 September 2009

### Stats in the media

I love to see statistics in the media. This week there were a couple of stories that caught my eye:

### Town hall bans staff from using Facebook after they each waste 572 hours in ONE month.

I found this article courtesy of the excellent science blog, The Lay Scientist. On first reading that headline from the UK's Daily Mail, 572 hours seems an extremely large amount of time for one person to spend on Facebook in one month. The article says that over the last year, the 4500 employees of Portsmouth City Council have spent on average 413 hours a month on Facebook while at work, peaking at 572. If we read that as 413 hours each and assume there are 30 working days a month (that is, the employees are very keen and work weekends), this means on average they spend almost 14 hours a day on Facebook. Each!

If we look a little more carefully and assume a 40-hour week and a 22-working-day month, a full month comes to about 176 working hours, well below the 413-hour Facebook average! The problem of course is the word EACH in the title, and The Daily Mail has since removed it from the online story. If we remove the offending word, as The Lay Scientist has done, the 572 hours in the peak month are shared among all 4500 staff, and the average Portsmouth City Council employee spends 20.8 seconds using Facebook each working day. If anything I think the council should be commended!
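For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two readings of the headline. The figures are the ones quoted above; the function names are my own, and the 20.8-second figure is reproduced on the assumption that it comes from the 572-hour peak month spread over 22 working days.

```python
# Figures quoted from the article
EMPLOYEES = 4500
AVG_HOURS_PER_MONTH = 413
PEAK_HOURS_PER_MONTH = 572


def hours_per_day_if_each(hours_per_month, days_per_month):
    """Daily Facebook hours if the monthly total applied to EACH employee."""
    return hours_per_month / days_per_month


def seconds_per_employee_per_day(total_hours, employees, working_days):
    """Daily Facebook seconds per person if the total is shared by all staff."""
    return total_hours * 3600 / employees / working_days


# Reading 1: 413 hours each, over a (very keen) 30-day working month
each_reading = hours_per_day_if_each(AVG_HOURS_PER_MONTH, 30)

# Reading 2: the total shared across all staff, 22 working days a month
shared_peak = seconds_per_employee_per_day(PEAK_HOURS_PER_MONTH, EMPLOYEES, 22)
shared_avg = seconds_per_employee_per_day(AVG_HOURS_PER_MONTH, EMPLOYEES, 22)

print(f"'Each' reading: {each_reading:.1f} hours per day")
print(f"Shared, peak month: {shared_peak:.1f} seconds per day")
print(f"Shared, average month: {shared_avg:.1f} seconds per day")
```

One misplaced word is the difference between hours per day and seconds per day.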

This would be an example of some bad stats (or more to the point, bad arithmetic).

### League's bad boys might just be acting their age, statistics suggest

The Sydney Morning Herald published an interesting article on the recent spate of off-field incidents by Australian Rugby League players. 2009 has been a horror year for League, with new players seemingly in trouble with the law every week. The article used data from the NSW Bureau of Crime Statistics and Research to compare the rate at which players are charged with assault against the prevailing rate for all males aged 18 to 34 in NSW. They found that across the state, young men are charged with assault at a rate of about 700 per 100,000 each year - you can look this up yourself on the NSW Bureau of Crime Statistics and Research website, although I couldn't find it split by age. In the 12 months to March 31, out of the 400 players who play in the NRL, only three were charged with assault, which works out to 750 per 100,000. The article then suggests that this rate is only slightly above the NSW figure and that League players therefore really aren't that bad.

The biggest problem with this statistic is sample size, which the article itself concedes. You can't draw too many conclusions when just three more assault charges would double the NRL rate - to show a statistically significant difference between two rates this low, you need a large sample. With a small sample and a very low assault rate, even if the two rates look similar, you can't conclusively say very much.
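To see how jumpy the rate is with a pool this small, here is a quick sketch that converts counts into rates per 100,000. The 400 players and the NSW figure of about 700 come from the article; the function and variable names are just for illustration.

```python
def rate_per_100k(count, population):
    """Convert a raw count into a rate per 100,000 people."""
    return count * 100_000 / population


NRL_PLAYERS = 400   # pool size quoted in the article
NSW_RATE = 700      # approximate NSW figure per 100,000 young men per year

# Start from the three players actually charged, then add one at a time
for charged in range(3, 7):
    rate = rate_per_100k(charged, NRL_PLAYERS)
    print(f"{charged} charged -> {rate:.0f} per 100,000 "
          f"({rate / NSW_RATE:.2f}x the NSW figure)")
```

Each extra player charged adds 250 per 100,000 to the NRL rate, which is more than a third of the entire NSW figure.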

Another problem is that the data doesn't cover the period after March 31 - which has been the NRL's annus horribilis - except for a passing reference that it could be a bad year for the NRL, given that three players have already been charged with assault and there are seven months left till next March. The data also does not take into account that many of the offences NRL players have been charged with are not assault but fall into other categories.

However, the article concedes all this - so why was it published? Despite the fact that the word "statistics" has been used in the headline to add weight to the argument, the correct interpretation of these particular statistics is to say that very little has been proven. If you sampled your local gaol or down-town Johannesburg, a sample size of 400 for a crime that usually has a very low rate might prove significant, but that's not the case here. Perhaps we have proven that the NRL is better than a bunch of criminals - I guess that's something!

The issue here is that people remember the headline. Even though the article was entirely correct in what it said - it mentioned all the statistical concessions we've listed here, and even put the word "might" in the headline - readers will remember that "statistics showed" that League players are just like you and me. I posted about a similar topic a couple of years back when The Independent on Sunday presented a graph of the oil price between 2000 and 2008, and on the same chart plotted the Nasdaq technology index between 1992 and 2000. The two plots showed a startling similarity, even though they are completely unrelated and even though the article conceded this point very early. However, at first glance you are misled, and this is what people remember. I somewhat cheekily plotted the performance of the Australian cricket team on the same chart to make this point - it was an even better correlation!

You can draw your own conclusions on the behaviour of League players. Of course, Wikipedia has a list of off-field incidents involving League players for you to peruse!

Edit: I thought it worth taking a look at some of the stats. Using a two-tailed t-test with a sample size of 400, if 9 players were charged with assault in a year, then you could say that NRL players are more likely to be charged with assault than the general public. This corresponds to a rate of 2250 per 100,000 - three times what we had before. This could essentially be a big night out for NRL players at the end of the season! What this suggests is that a small change in the count makes a big change in the result, and this is why we can't draw too many conclusions about League behaviour from this data. If there were a pool of 4000 NRL players, you could start to draw conclusions on NRL behaviour if roughly 40 were charged with assault - this is a rate of 1000 per 100,000, considerably lower than the 2250 per 100,000 needed with a pool of 400. It is dangerous to quote "rates" when you don't have much data.
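For the curious, here is a rough Python sketch of one way such a test could be set up. The details are my assumptions for illustration: each player is treated as a 0/1 observation (charged or not), the sample mean is compared against the NSW figure of 700 per 100,000 with a one-sample t statistic, and the cut-off is an approximate two-tailed 5% critical value for about 399 degrees of freedom. Under those assumptions the statistic first crosses the cut-off at nine players.

```python
import math


def one_sample_t(charged, n, baseline_rate):
    """One-sample t statistic for 0/1 data against a fixed baseline rate.

    Assumes 0 < charged < n, so the sample variance is non-zero.
    """
    p_hat = charged / n
    # Sample variance of n zeros and ones, with the usual n - 1 denominator
    sample_var = n / (n - 1) * p_hat * (1 - p_hat)
    std_err = math.sqrt(sample_var / n)
    return (p_hat - baseline_rate) / std_err


BASELINE = 700 / 100_000   # NSW figure quoted in the article
T_CRIT = 1.966             # assumed: approx. two-tailed 5% value, ~399 d.f.

for charged in (3, 8, 9):
    t = one_sample_t(charged, 400, BASELINE)
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{charged} of 400 charged: t = {t:.2f} ({verdict})")
```

A t-test is a blunt tool for counts this small. An exact binomial test is a different technique and may well put the threshold somewhere else, which only reinforces how little three charges out of 400 can tell us.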

The other point is that failing to show that two data sets are significantly different - here, NRL players and the state as a whole - doesn't mean that they are the same. As we have seen, when only a few more assaults would make a very large difference, the estimate is not very stable.
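One way to picture that instability is to put an interval around the NRL rate instead of quoting a single number. The sketch below uses a Wilson score interval at roughly 95% (z = 1.96). That is my choice of method for illustration, not something the article or the original data source used.

```python
import math


def wilson_interval(count, n, z=1.96):
    """Approximate Wilson score interval for a proportion (z = 1.96 ~ 95%)."""
    p_hat = count / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    ) / denom
    return centre - half_width, centre + half_width


# Three players charged out of a pool of 400
low, high = wilson_interval(3, 400)
print(f"Plausible NRL rate: {low * 100_000:.0f} to "
      f"{high * 100_000:.0f} per 100,000")
```

The interval contains the NSW figure of 700 per 100,000, but it also stretches to rates around three times that, so "not significantly different" covers a lot of ground here.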