This afternoon I saw an interesting talk by Lawrence Brown on baseball statistics, as part of the UCI statistics seminar.

The title of the talk is "In-Season Prediction of Batting Averages: A Field-test of Simple Empirical Bayes and Bayes Methodologies". In simpler words, the problem Brown considers is the following: at a certain point in the season, your favorite player has accumulated a certain batting average (ratio of the number of hits he made to the number of times he was at bat). What should you predict as the probability of a hit for his next at-bat?

The statistical nuggets I got out of the talk:

  • The assumption behind the statistics is that every player has a fixed probability of getting a hit, and that each at-bat is successful independently with this probability. The statistical task is to estimate the probability for each batter.

  • The batting average itself makes a poor estimator of the batter's probability: it's overwhelmed by the noise in the data. You'd do much better simply by predicting any individual batter's probability to be equal to the global average success probability of all batters (about 0.260).

  • Better predictors can be formed by shrinking the individual batting averages towards the global average success probability.

  • The data came from the 2005 season. As a form of cross-validation, he fit his estimates on the data from the first three months of the season and tested their quality against the data from the remaining three months.

  • The assumption of independence is very likely untrue over short time periods: intuitively, at-bats within the same game should be correlated with each other, because you'll be up against the same pitcher multiple times. But Brown did some empirical analysis showing that the assumption holds up very well over time scales of a month or more. On intermediate time scales of a week or so, there is some evidence of streakiness in the batters' records. The streakiest batter: former Angel David Eckstein.

  • Rather than using hits/at-bats, it works a little better to use (hits + 1/4)/(at-bats + 1/2). Any function of the form (hits + c)/(at-bats + 2c) tends to smooth out the data for batters who have seldom come up to bat, but there are technical reasons for choosing c = 1/4 involving unbiased estimators. This smoothed average is still a poor estimator, but it forms a better input to the more sophisticated statistical techniques for shrinking the averages that Brown described.

  • The number of hits for a given player, viewed as a random variable determined by his batting probability and his number of at-bats, is binomially distributed, so his batting average is a scaled binomial. To convert these variables to the approximately normal distributions that are so much more convenient statistically, Brown uses a technique known as angular transformation or variance stabilization: he takes the arcsin of the square root of the smoothed batting average. The choice of c = 1/4 in the smoothing cancels a linear error term, leaving this transformed value within a quadratic error term of being an unbiased estimator of the arcsin of the square root of the batter's probability. He uses this arcsin-sqrt normalized batting average as the basic data for his algorithms; for instance, he measures the error in estimation as the sum of squares of the differences between his estimated arcsin-sqrt-probabilities and the batters' true arcsin-sqrt-probabilities. He was asked after the talk whether this error measure made any sense, and he said that he had tried measuring the error directly in terms of the probabilities and it didn't make much difference.

  • The variance of each variable after this transformation is (to within a quadratic error term) 1/(4n), where n is the number of at-bats of the individual batter. This simple formula helps in the later estimation steps and is, I think, the main reason for choosing this particular form of normalization. The problem of estimating batting probabilities is heteroskedastic, because different hitters have very different numbers of at-bats and therefore very different variances. In particular, pitchers don't get up to bat very often.

  • Apart from the two trivial estimators described above (use the batting average itself, or the global hit probability), most of the statistical estimators Brown describes are "empirical Bayes": they use the data to estimate the distribution of (arcsin-sqrt normalized) batter probabilities. Once this distribution is fixed, they apply Bayes' rule to it, together with the individual batter's normalized batting average and known variance, to find the best estimator for each batter.

  • Several of the methods he tried (for instance, hierarchical Bayes approaches) assumed that the arcsin-sqrt normalized batter probabilities are themselves normally distributed. This doesn't work very well because they aren't. One reason for this is that the pitchers throw off the bell curve — they're much worse as batters than the other players.

  • The best method he tried was based on nonparametrically estimating the distribution of normalized batter probabilities.
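
To make the steps in the list above a little more concrete, here is a minimal Python sketch of the pipeline as I understood it: smooth with c = 1/4, apply the arcsin-sqrt transformation, treat each transformed value as having variance 1/(4n), and shrink towards the group mean. The function names are my own, the sample records are made up, and the shrinkage step assumes a normal prior fitted by a crude method of moments. That is the simplest empirical Bayes variant I could write down, not the nonparametric method that Brown found to work best, and (as noted above) the normality assumption doesn't fit real batters very well.

```python
import math


def smoothed_average(hits, at_bats, c=0.25):
    """Smoothed batting average (hits + c) / (at_bats + 2c)."""
    return (hits + c) / (at_bats + 2 * c)


def arcsin_sqrt(hits, at_bats):
    """Arcsin of the square root of the smoothed batting average."""
    return math.asin(math.sqrt(smoothed_average(hits, at_bats)))


def shrink(records):
    """Shrink each batter towards the group mean on the arcsin-sqrt scale.

    records: list of (hits, at_bats) pairs, each with at_bats > 0.
    Assumes (for illustration only) a normal prior whose mean and variance
    are estimated from the data. Returns estimated hit probabilities.
    """
    xs = [arcsin_sqrt(h, n) for h, n in records]
    vs = [1.0 / (4 * n) for _, n in records]  # approximate sampling variances

    # Precision-weighted mean: batters with more at-bats count for more.
    weights = [1.0 / v for v in vs]
    mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)

    # Crude moment estimate of the prior variance: observed spread minus
    # the average sampling variance, floored at zero.
    spread = sum((x - mu) ** 2 for x in xs) / max(len(xs) - 1, 1)
    tau2 = max(spread - sum(vs) / len(vs), 0.0)

    estimates = []
    for x, v in zip(xs, vs):
        b = v / (v + tau2)               # noisier batters shrink more
        theta = b * mu + (1 - b) * x     # posterior mean under the normal prior
        estimates.append(math.sin(theta) ** 2)  # back to a probability
    return estimates


if __name__ == "__main__":
    # Made-up (hits, at_bats) records, not real 2005 data.
    sample = [(3, 10), (30, 100), (80, 300), (1, 20)]
    for (h, n), p in zip(sample, shrink(sample)):
        print(f"{h}/{n}: raw {h / n:.3f}, shrunk {p:.3f}")
```

The heteroskedasticity shows up in the factor b: a batter with only a handful of at-bats has a large variance 1/(4n) and is pulled almost all the way to the group mean, while a regular with hundreds of at-bats mostly keeps his own average.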

I don't see a paper by Brown on this subject on his home page, but Google finds some kind of preprint here. If you want any more detail about this, you'll have to look there, or wait until he puts up a real paper.

ETA April 10, 2008: Julie Rehmeyer has also posted about Brown's baseball research. Brown's paper has now been published: Ann. Appl. Stat. 2 (1): 113–152, 2008.