The Tour de France (TDF) is the marquee cycling event on the calendar for any top international pro cyclist, as well as for their squads. Everyone wants to do well here because it's arguably the biggest and most glamorous stage for displaying athletic talent. The competition is tough, the fans are many, the stages are epic and the prize money is fat.
In this post, I'm trying to figure out what kind of statistical distribution shows up in the finishing times from this year's Tour de France prologue time trial (TT). I will also try to quantify the probability of getting close to the fastest time trialist in the world. Alberto Contador did pretty darn well. How well?
Only one way to find out these things.
So here's what I did.
Step 1 : I obtained Cyclingnews.com data for the TDF Prologue TT on July 4, 2009. I obtained 180 data points corresponding to all the competing cyclists.
Step 2 : To make sense of this data clutter, I put them into Microsoft Excel 2007 and ran a descriptive statistics analysis on it. Here's what I obtained. What you're about to see is powerful.
So is my sample set taken from a normal distribution or something different? Let's try to answer that reasonably with the table above.
The mean, median and mode are very close to each other, which MAY indicate the data is normally distributed. On average, a cyclist's time deviated from the mean by 0.63 min, or 37.8 seconds. The minimum time belonged to Fabian Cancellara, with a blitzy 19.53 min (about 19 min 32 s), whereas the maximum time belonged to Yauheni Hutarovich. The Kurtosis and Skewness came out at 0.558 and -0.068 respectively.
Positive Kurtosis indicates a relatively appreciable peak, which makes me suspect the distribution is leptokurtic (more sharply peaked than a normal curve). The book Using Multivariate Statistics (Tabachnick & Fidell, 1996) explains that if my Kurtosis statistic is more than 2 times sqrt(24/180), i.e. more than 0.73, the data is not normally distributed. Since 0.558 is less than 0.73, we're ok.
Negative Skewness indicates that my data is left skewed. The same book explains that if my Skewness statistic is more than 2 times sqrt(6/180), i.e. more than 0.365, the distribution is not normal. What matters here is the size of the statistic, not its sign. Since 0.068 is less than 0.365, we're ok here as well.
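The two rule-of-thumb checks above are easy to reproduce. Here is a minimal Python sketch that only uses the numbers quoted in this post (N = 180, Skewness = -0.068, Kurtosis = 0.558). The function and variable names are my own, and comparing absolute values is my reading of the rule, not something taken from the book.

```python
import math

N = 180            # number of finishers in the prologue data
SKEWNESS = -0.068  # from the descriptive statistics output
KURTOSIS = 0.558   # from the descriptive statistics output


def rule_of_thumb_limit(numerator, n):
    """Return 2 * sqrt(numerator / n), the cutoff used in the post."""
    return 2 * math.sqrt(numerator / n)


skew_limit = rule_of_thumb_limit(6, N)    # 2 * sqrt(6/180)
kurt_limit = rule_of_thumb_limit(24, N)   # 2 * sqrt(24/180)

# Compare magnitudes, so a negative skewness is judged by its size.
skew_ok = abs(SKEWNESS) < skew_limit
kurt_ok = abs(KURTOSIS) < kurt_limit

print(f"skewness limit {skew_limit:.3f}, within limit: {skew_ok}")
print(f"kurtosis limit {kurt_limit:.3f}, within limit: {kurt_ok}")
```

Both statistics fall inside their limits, which matches the "we're ok" verdicts above.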
Step 3 : The above only gives rough indications of the type of distribution. Nothing beats setting up a visual of the spread. So I made a histogram, with a chosen bin width of 0.20 min.
The graph agrees with the skewness and kurtosis statistics. The data has central tendency but is ever so slightly skewed towards the left. This is the data for the best cyclists in the world. Not really a Gaussian, but not too far away from it either. What kind of distribution it is will take more analysis and tests for goodness of fit, which I'm going to tackle some other time.
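If you want to redo the binning from Step 3 yourself, something like the following Python sketch would work. It assumes you have the finishing times as a list of decimal minutes. The function name is hypothetical, and the five times in the example are made-up placeholders, not the real results.

```python
import math
from collections import Counter


def histogram_counts(times_min, bin_width=0.20):
    """Return (left_edge, count) pairs for bins of the given width in minutes."""
    # Snap the first bin edge down to a multiple of the bin width.
    start = math.floor(min(times_min) / bin_width) * bin_width
    counts = Counter(int((t - start) // bin_width) for t in times_min)
    last_bin = max(counts)
    return [
        (round(start + i * bin_width, 2), counts.get(i, 0))
        for i in range(last_bin + 1)
    ]


# Placeholder values only, to show the shape of the output.
placeholder_times = [19.55, 19.65, 20.10, 20.15, 20.90]
bins = histogram_counts(placeholder_times)
for left_edge, count in bins:
    print(f"{left_edge:5.2f} min | {'#' * count}")
```

Times that sit exactly on a bin edge can land on either side because of floating point rounding, so for real data it may be safer to work in whole seconds.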
So What Does All This Mean?
Looking at the data and Fig 2, we can say that the course conditions in Monaco on that July day were such that nearly 48% of all 180 cyclists managed to get times below the average, which might mean they were pretty fit and came well prepared (or something else worked in their favor which I can't quantify). Thus, the 48th percentile is the average time, i.e. 21 min and 30 seconds.
To put it another way, the probability of a world class cyclist racing this course in less than the average time is 0.48.
The other 52% of the 180 did no better than the average, with about 8% of that 52% posting exactly the average time. The probability is 0.52 that a cyclist is at the average time or slower on this course.
We can also say that 72% of the 180 cyclists lie within one standard deviation on either side of the average, 93% lie within two standard deviations and 99% lie within three. Pretty close to the 68-95-99.7 rule obeyed by normal distributions, eh?
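To see how close "pretty close" is, here is a small Python sketch that sets the observed shares (72%, 93%, 99%) against what a perfect normal distribution would give. The normal values come from the standard identity P(|Z| < k) = erf(k / sqrt(2)). The function and variable names are mine.

```python
import math


def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Observed shares of the 180 riders, as quoted in the post.
observed = {1: 0.72, 2: 0.93, 3: 0.99}

for k, share in observed.items():
    expected = normal_coverage(k)
    print(
        f"within {k} SD: observed {share:.0%}, "
        f"normal {expected:.1%}, difference {share - expected:+.1%}"
    )
```

The biggest gap is at one standard deviation, where the field is a few points more bunched around the average than a true Gaussian would be. That fits with the positive kurtosis found earlier.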
Alberto Contador Vs Fabian Cancellara As Time Trialists
Our last question is the most interesting. So if you're a top pro at the peak of your abilities, what are your chances of ever getting close to Fabian Cancellara's blitzkrieg results? Then the next question would be, how close do you want to get to 'Spartacus'? Within 2%? 3%?
Let's do 2% as a start. 2% of Cancellara's time works out to a gap of about 23 seconds. Now that's probably the limit of what a time trialist can accept to cap the gap, so to speak!!
Let's look at what Contador obtained that day from the data. Bert raced the course 18 seconds slower than Cancellara for an amazing second place. In other words, there was a mere 1.54% time difference between the best all-round cyclist in the world and the fastest time trialist in the world. Just 4 cyclists managed to come within 2% of Cancellara's time - Contador, Wiggins, Kloden and Evans. 4/180 = 0.022, or roughly 2%.
In other words, just 2% of the 180 cyclists got a time less than or equal to 19 minutes and 55 seconds (this 2% window we're talking about).
Put in another way, this is the 2nd percentile. This is where the glory is at. And the money. And the kisses from the long legged European girls.
The probability that you're in this 23 second window from the best man on the bike is low. Just 0.022, or a 1 in 45 chance. Keep in mind this is for the best in the world.
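For anyone who wants to check the arithmetic behind the 2% window, here is a short Python sketch. It uses only figures from this post: the winning time of 19.53 min, Contador's 18 second gap, and 4 riders out of 180 inside the window. The variable names are my own.

```python
WINNER_MIN = 19.53     # Cancellara's time in decimal minutes
FIELD_SIZE = 180       # riders in the prologue
CONTADOR_GAP_S = 18    # Contador's deficit in seconds
RIDERS_IN_WINDOW = 4   # Contador, Wiggins, Kloden, Evans

winner_s = WINNER_MIN * 60        # winning time in seconds
window_s = 0.02 * winner_s        # width of the 2% window in seconds
cutoff_s = winner_s + window_s    # slowest time still inside the window
cutoff_min, cutoff_sec = divmod(int(cutoff_s), 60)

contador_pct = 100 * CONTADOR_GAP_S / winner_s
share_in_window = RIDERS_IN_WINDOW / FIELD_SIZE

print(f"2% window: {window_s:.1f} s")
print(f"cutoff time: {cutoff_min} min {cutoff_sec} s")
print(f"Contador's gap: {contador_pct:.2f}%")
print(f"share inside window: {share_in_window:.3f} (1 in {1 / share_in_window:.0f})")
```

Keep in mind that 4 out of 180 is an observed frequency from a single race, so treating it as a probability is a rough estimate at best.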
Now you know why you and I are not racing in the Tour de France. Let's just scratch our butts and cheer these beasts on.
* * *