*A guest post by Bernard Kachoyan*

Ever thought of batting as a life and death struggle against hostile forces? It always seemed that way when I batted. Well, you might be more accurate than you think.

The experience of a batsman can be described as a microcosm of life: when you go out to bat you are “born”, when you get out you “die”. But what happens when you are Not Out (NO)? In that case you simply leave the sample pool: you live for a while, then you stop being measured. In the parlance of statistics, this is “censored” data. In medical research the “born” moment is equivalent to when a patient first comes under observation (e.g. survival times of cancer patients after diagnosis). The question in medicine becomes: what is the “survival function”, the probability that a patient survives for *X* years after the start of observation? And how does the life expectancy curve of one population differ from another; in particular, do people treated in a particular way fare differently from a control group?

These types of problems are commonly addressed using Kaplan-Meier (KM) estimators. In economics, they can be used to measure the length of time people remain unemployed after a job loss. In engineering, they can be used to measure the time until failure of machine parts. Here we will apply those ideas to batting in cricket.

An important property of the KM estimate is that it is non-parametric, in the sense that it does not assume that the data follow a Normal (or any other parametric) distribution, an assumption that would be patently untrue for this type of data. It uses only the data itself to generate a survival curve (the term given to the survival function once it is drawn on a chart) and associated confidence limits. Hence the KM survival curve may look odd: it declines in a series of steps at the observation times, and is constant between sampled observations. However, when a large enough sample is taken, the KM estimate approaches the true survival function for that population.

An important advantage of the KM method is that it can take censored data into account, in particular the censoring that occurs when a patient withdraws from a study, i.e. is lost from the sample before the final outcome is observed. This makes it perfect for dealing with the NOs as described above.

When referring to batsmen, “death” means getting out, being “censored” means completing the innings before getting out (remaining Not Out) and “time” means the number of runs scored ($t_j$ = scoring $j$ runs). The idea of the KM estimator is pretty simple.

- The conditional probability that an individual dies in the time interval from $t_i$ to $t_{i+1}$, given survival up to time $t_i$, is estimated as $d_i / n_i$, where $d_i$ is the number who die at time $t_i$, and $n_i$ is the number alive just before time $t_i$, including those who will die at time $t_i$.
- Then the conditional probability that an individual survives beyond $t_{i+1}$ is $(n_i - d_i) / n_i$.
- When there is no censoring, $n_i$ is just the number of survivors just prior to time $t_i$. With censoring, $n_i$ is the number of survivors minus the number of losses (censored cases). It is only those surviving cases that are still being observed (have not yet been censored) that are "at risk" of an observed death.
- The KM estimator of the survivor function at time $t$, for $t_j \le t < t_{j+1}$, is then formally the product of these conditional survival probabilities:

$$\hat{S}(t) = \prod_{i=1}^{j} \frac{n_i - d_i}{n_i}$$
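The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the figures: the function name `kaplan_meier`, its arguments and the toy scores are all assumptions made for the example, and a Not Out on a given score is treated as still at risk at that score.

```python
from collections import Counter


def kaplan_meier(scores, dismissed):
    """Return [(t, S(t))] at every score t where at least one dismissal occurred.

    scores[k]    -- runs scored in innings k (the "time")
    dismissed[k] -- True if the batsman got out, False if Not Out (censored)
    """
    # d_i: number of dismissals on each distinct score
    deaths = Counter(s for s, out in zip(scores, dismissed) if out)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # n_i: innings that reached score t, i.e. were still at risk just before t
        n_at_risk = sum(1 for s in scores if s >= t)
        d = deaths[t]
        survival *= (n_at_risk - d) / n_at_risk  # conditional survival past t
        curve.append((t, survival))
    return curve


# Hypothetical toy data, not taken from any real batsman
toy_scores = [0, 5, 5, 12, 30, 30, 47, 100]
toy_dismissed = [True, True, False, True, True, False, True, False]

for runs, s in kaplan_meier(toy_scores, toy_dismissed):
    print(f"S({runs}) = {s:.3f}")
```

Passing `dismissed=[True] * len(scores)` treats every Not Out as a dismissal, which gives the uncensored curve for the same innings.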

Such KM curves have attractive properties, which perhaps explain their popularity in medical research for over half a century. They are fairly easy to calculate and they provide a visual depiction of all of the raw data, including the times of actual failure, yet still give a sense of the underlying probability model.

Let’s now apply the KM estimator to some cricket statistics. In this case I have arbitrarily chosen the batting statistics of Steve Waugh, Sachin Tendulkar (up to 2010, to keep roughly the same number of innings as Waugh) and Don Bradman. If the censored data (the Not Outs) are not taken into account, the curve simply reverts to the percentage of scores greater than a certain number of runs, the value on the x axis. This is shown in Figure 1. Bradman of course is still clearly in a class of his own.

If we now properly include the NOs in the formulation we get the survival curves shown in Figure 2. I have omitted the Tendulkar curves here for clarity. As expected, the survival rates go up, as the NOs do not indicate a true “death”. In Steve Waugh’s case, the increase is noticeable (I didn’t say “significant”!) since he has a large number of NOs compared to most batsmen with a similar number of Test innings.

This is shown more starkly in Figure 3, where I have plotted both the censored and uncensored curves for Waugh and Tendulkar. I have plotted them on a logarithmic scale to highlight differences. It can be seen that Waugh’s censored survival curve (compared with his raw curve) tracks Tendulkar’s very closely until a score of about 100. This reflects the large number of Waugh’s Not Outs (43 vs 29 in roughly the same number of innings, 260 vs 278). The divergence of the curves after that reflects not only the propensity of Tendulkar to go on to big scores, but also that a large number of Tendulkar’s Not Outs came after he had already scored a century (15 vs 2 for Waugh).

*Figure 1 and Figure 2*

*Figure 3*

The basic KM methodology has been around since the 1950s and of course has been extended in various ways by professional statisticians, and alternative methods have been proposed. But its simplicity means it is still widely used.

There are several drawbacks, some of which can be seen in Figures 1-3. Firstly, the vertical drops at specific scores are drawn from the data, and should not be seen as indicating particular “danger times”. This is particularly evident at larger scores, where the naturally small sample size means that there are fewer data points (i.e. scores where a batsman actually gets out). So some sort of smoothing of the curve is necessary to provide an estimate of the true underlying functional dependency.

This reduction in the sample size at large values also means that the effect of each individual failure on the size of the step-down increases.

Another drawback of the KM method is that the estimate of the probability of surviving each “danger time” depends only on the number of patients at risk at that time. So if there are censored values the actual time between the last failure and the time of censoring is not considered.

It is natural at this point to question the underlying assumption of the KM method that the patients (i.e. innings) are independent. It is common in cricket to talk about form slumps or purple patches. This can be examined statistically by considering the autocorrelation function of the scores, shown in Figure 4 assuming stationarity, where Waugh has been omitted for clarity. The figure clearly shows no evidence of correlation between innings, and although strictly speaking a lack of correlation does not imply true independence, it is evidence that the innings can be considered independent for the purposes of this analysis.
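For readers who want to try this check on their own data, here is a rough sketch of a sample autocorrelation calculation. The function name `autocorrelation` and the list of scores are hypothetical, and the ±1.96/√n band is the usual rule of thumb for uncorrelated data rather than anything specific to the figure.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation r_k for lags k = 0..max_lag (assumes stationarity)."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    total = sum(c * c for c in centred)  # lag-0 sum of squares, so r_0 = 1
    return [
        sum(centred[i] * centred[i + k] for i in range(n - k)) / total
        for k in range(max_lag + 1)
    ]


# Hypothetical scores in chronological order, for illustration only
scores = [12, 0, 45, 7, 103, 23, 5, 61, 0, 34]
band = 1.96 / math.sqrt(len(scores))  # approximate 95% band for uncorrelated data

for lag, r in enumerate(autocorrelation(scores, 3)):
    flag = "" if lag == 0 or abs(r) <= band else "  <- outside band"
    print(f"lag {lag}: r = {r:+.3f}{flag}")
```

Lags whose value stays inside the band are consistent with no innings-to-innings correlation, which is the pattern described for Figure 4.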

*Figure 4*

The question that naturally arises is whether we can say anything statistically about whether the difference between survival curves is significant (cf. treated vs control groups in medicine). Confidence intervals can be placed on the derived curves using the so-called *Greenwood formula*, dating back to the 1920s, or its more modern variations. These suffer the drawback of being less accurate in the tails of the curves, where by definition the sample size is smallest. Not only do the formulas return a larger error there, but their validity per se comes into question, as the expressions rely on a normal approximation (through the central limit theorem) and hence can only be considered valid when the number of remaining innings is bigger than, say, 20 or so.
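For reference, the Greenwood variance estimate in its usual textbook form, using the same $n_i$ (number at risk) and $d_i$ (number of deaths) as in the KM estimator, is

$$\widehat{\mathrm{Var}}\left[\hat{S}(t)\right] = \hat{S}(t)^2 \sum_{i:\, t_i \le t} \frac{d_i}{n_i \,(n_i - d_i)}$$

with an approximate 95% interval of

$$\hat{S}(t) \pm 1.96 \sqrt{\widehat{\mathrm{Var}}\left[\hat{S}(t)\right]}$$

The 1.96 multiplier comes from the normal approximation mentioned above, which is exactly why the interval becomes unreliable once $n_i$ is small.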

Unfortunately, as we have seen above, it is in the tails of the curve where the distinctions between very good and great batsmen are often found.

Similarly, a number of ways of comparing curves exist in the statistical literature, such as the Kolmogorov–Smirnov test, the log-rank test or the Cox proportional hazards test. These can rapidly become very mathematically complicated, especially if we want to try to distinguish one part of the curve specifically (say the high end).
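To give a flavour of the simplest of these, below is a sketch of the standard two-sample log-rank test written from the textbook definition. It is not an analysis of the Waugh and Tendulkar data: the function name `log_rank` and the toy scores are made up for illustration, and the p-value relies on the large-sample chi-square approximation with one degree of freedom.

```python
import math


def log_rank(scores_a, out_a, scores_b, out_b):
    """Two-sample log-rank test. Returns (chi-square statistic, p-value)."""
    # Every distinct score at which a dismissal occurred in either group
    death_times = sorted(
        {s for s, out in zip(scores_a, out_a) if out}
        | {s for s, out in zip(scores_b, out_b) if out}
    )
    observed_a = expected_a = variance = 0.0
    for t in death_times:
        n_a = sum(1 for s in scores_a if s >= t)  # at risk in group A
        n_b = sum(1 for s in scores_b if s >= t)  # at risk in group B
        d_a = sum(1 for s, out in zip(scores_a, out_a) if out and s == t)
        d_b = sum(1 for s, out in zip(scores_b, out_b) if out and s == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected deaths in A if the curves were equal
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical toy innings for two batsmen (True = out, False = Not Out)
a_scores, a_out = [3, 8, 15, 22, 40, 51], [True, True, True, False, True, True]
b_scores, b_out = [10, 35, 60, 88, 120, 150], [True, True, False, True, True, False]

stat, p = log_rank(a_scores, a_out, b_scores, b_out)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

The test weights every dismissal equally, so it says nothing specific about the high end of the curve, which is where the interesting differences tend to sit.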

Although I haven’t done the hard yards in this article, my intuition tells me we might be hard pressed to prove statistically significant differences between the Waugh and Tendulkar corrected survival curves. This is the drawback of applying statistical tests in areas where their applicability is not clear.

In any case, it can be seen that batting can most certainly be considered a true life and death struggle.
