Sunday 12 February 2012

The Big Swim

Recently I competed in one of Australia's biggest ocean swims, The Big Swim. Now I'm not particularly good, just stupid and competitive, and the results provide a nice sporting dataset with which to play. I've wanted to teach myself some mapping / visualisation techniques for a while, so I took the opportunity to investigate this data to find out where the competitors came from, and which areas produced the quickest swimmers.

I have created the following interactive chart using Google Fusion Tables. From the swim results, I extracted the competitors' times and the suburbs they came from, and then mapped the suburbs to their postcode using the aus-emaps postcode finder. From this table I worked out the average, minimum, maximum and median times for each postcode. I've only plotted New South Wales postcodes.
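As a rough sketch of that per-postcode summary, something like the following Python would do the job. The function name postcode_stats and the sample rows are made up for illustration, and this is not necessarily how the figures in the chart were produced.

```python
from collections import defaultdict
from statistics import mean, median


def postcode_stats(results):
    """Summarise swim times per postcode.

    `results` is an iterable of (postcode, time_in_minutes) pairs.
    The name and input shape are assumptions for this sketch.
    """
    times_by_postcode = defaultdict(list)
    for postcode, minutes in results:
        times_by_postcode[postcode].append(minutes)

    return {
        postcode: {
            "count": len(times),
            "average": mean(times),
            "minimum": min(times),
            "maximum": max(times),
            "median": median(times),
        }
        for postcode, times in times_by_postcode.items()
    }


# Made-up rows, NOT real results from the swim.
sample = [("2026", 45.0), ("2026", 55.0), ("2026", 50.0), ("2075", 40.0)]
stats = postcode_stats(sample)
print(stats["2026"])
```

Keeping the competitor count alongside the times is handy, since the map is coloured by count and the comparisons further down only make sense for postcodes with enough swimmers.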

The tricky part was mapping the postcode boundaries. Thankfully, the Australian Bureau of Statistics has a couple of files you can use. However, to use these with Google Maps, you need to convert them to the kml file type, and MyGeodata Converter provides such a service. This left me with two files: one with the swimmer statistics per postcode, and one with the boundary coordinates. It is easy to merge these tables with Google Fusion Tables, and voila, you have an intensity map.

The map below is coloured by the number of competitors from each postcode: red is the most and green the least. The most swimmers came from postcode 2026, which is Bondi and surrounds. Many postcodes, including my own, only had one competitor. If you click on a postcode, it will give you that postcode's statistics. Note that the times are in decimal minutes (Google Fusion Tables has some issues with data types, so it was easiest to treat the times as decimals rather than as a date/time format). So 51.58 minutes means 51 minutes 35 seconds.
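If you want to turn those decimal times back into minutes and seconds, the arithmetic is just the fractional part multiplied by 60. Here is a minimal Python sketch; the function name is my own invention.

```python
def decimal_minutes_to_min_sec(decimal_minutes):
    """Convert a time in decimal minutes to whole (minutes, seconds).

    Rounds to the nearest second before splitting, so that
    59.999 minutes becomes (60, 0) rather than (59, 60).
    """
    total_seconds = round(decimal_minutes * 60)
    return divmod(total_seconds, 60)


minutes, seconds = decimal_minutes_to_min_sec(51.58)
print(f"{minutes} minutes {seconds} seconds")  # 51 minutes 35 seconds
```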

The quickest postcode (that had over 10 competitors) was 2075 (St. Ives and surrounds). The slowest with over 10 competitors was 2153 (Baulkham Hills and surrounds). One might postulate that Baulkham Hills is too far from the beach, and that everyone in St. Ives has a private swimming coach. Or it could just be random, as there really aren't enough swimmers per postcode to draw too many conclusions.
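For what it's worth, picking out the quickest and slowest postcodes with a minimum number of competitors is a simple filter and sort. The sketch below assumes postcodes are ranked by average time (median would work just as well), and the postcode labels and numbers in it are invented placeholders rather than real results.

```python
def rank_postcodes(stats, min_competitors=10):
    """Return postcodes with MORE than `min_competitors` swimmers,
    ordered from quickest to slowest average time.

    `stats` maps postcode -> {"count": int, "average": float}.
    Ranking by average is an assumption made for this sketch.
    """
    eligible = {
        postcode: s for postcode, s in stats.items()
        if s["count"] > min_competitors
    }
    return sorted(eligible, key=lambda postcode: eligible[postcode]["average"])


# Placeholder data, NOT the actual swim statistics.
sample_stats = {
    "A": {"count": 25, "average": 48.2},
    "B": {"count": 12, "average": 44.9},
    "C": {"count": 1, "average": 39.5},   # quick, but only one swimmer
    "D": {"count": 10, "average": 41.0},  # exactly 10 is not "over 10"
}

ranked = rank_postcodes(sample_stats)
quickest, slowest = ranked[0], ranked[-1]
print(quickest, slowest)
```

The threshold keeps one-swimmer postcodes from topping the table just because a single fast person happens to live there.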

The biggest bug in this is the "Sydney" postcode which is, I'm fairly sure, heavily over-represented because people put "Sydney" down instead of their suburb in their swim registration. Not that many people live in the city.



The following chart shows the distribution of times, which looks quite like a normal distribution with a slight right skew. The skew comes from the fact that there is a hard limit on the quickest you can possibly complete the swim, whilst you can take as long as you like to finish. Large public sporting events tend to have a long tail, as people may come out once a year and jump in the ocean without particularly caring how quickly they go. This is especially true for running events, where you often have people dressed up as Snoopy out the back. Ocean swim events tend to have less of this as, unlike running, if you stop, you drown! So without a very long tail, the Central Limit Theorem kicks in and gives you a normal-ish (or log-normal) distribution.


References:
  1. The results come from the Ocean Swims website (which is an excellent source of information for ocean swimming in Australia) - the Ocean Swim Series website is also a good data source.
  2. Make your own tables and maps at Google Fusion Tables.
  3. The postcode information came from the Australian Bureau of Statistics and aus-emaps.
  4. I converted the ABS data to a kml file using MyGeodata Converter.
  5. All Things Spatial is a great resource for data mapping.