Eduardo linked to this interesting paper by Walter Mebane on using Benford’s Law (the distribution of digits that arises from numbers that are sampled uniformly on a logarithmic scale) to investigate election fraud. I’ll give my thoughts, but first here’s the abstract:

How can we be sure that the declared election winner actually got the most votes? Was the election stolen? This paper considers a statistical method based on the pattern of digits in vote counts (the second-digit Benford’s Law, or 2BL) that may be useful for detecting fraud or other anomalies. The method seems to be useful for vote counts at the precinct level but not for counts at the level of individual voting machines, at least not when the way voters are assigned to machines induces a pattern I call “roughly equal division with leftovers” (REDWL). I [Mebane] demonstrate two mechanisms that can cause precinct vote counts in general to satisfy 2BL. I use simulations to illustrate that the 2BL test can be very sensitive when vote counts are subjected to various kinds of manipulation. I use data from the 2004 election in Florida and the 2006 election in Mexico to illustrate use of the 2BL tests.

My main suggestion is that the method should be applied to lots of existing fair elections (or, at least, elections that can be presumed to be fair) in order to get a baseline for the digit distributions. As the article discusses, there are lots of reasons to expect Benford’s Law (either for the first or second digits) not to exactly apply even if the election is entirely fair. So these studies really need to be calibrated to the actual distributions of what happens, not to the theoretical model. As alluded to in the paper, some of this has to do with variation in precinct sizes.

A related point is that the chi-squared statistic is probably not the best summary to use. The problem is that, when the model does not actually apply to the data, the chi-squared statistic is more a measure of sample size than of discrepancy from the model: for a fixed discrepancy in the digit proportions, it grows roughly in proportion to the number of counts. This can be seen, for example, in Table 22, where there is a strong correlation between the chi-squared statistics and the number of voters. It would be better to use an unscaled discrepancy that more directly measures the difference between data and model. I’d probably start with something simple, as follows (for the second-digit Benford’s Law): let y_i be the proportion of cases with second digit i, for i=0,…,9. Then regress y_i on i. The slope of this regression can be compared with the theoretical slope (-0.0038, as calculated from the table on page 27). Other things could be tried too, of course; the point is to get a sense of what these deviations are, not just to see their statistical significance.
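Here is a minimal sketch of that slope comparison in Python. It is only an illustration, not anything from the paper: the function names are my own, and I am assuming the input is a list of positive integer vote counts and that counts below 10 (which have no second digit) are simply dropped. The theoretical second-digit probabilities come from the standard Benford formula, summing log10(1 + 1/(10*d1 + d2)) over first digits d1 = 1,…,9.

```python
import math

def second_digit_benford():
    """Theoretical second-digit Benford probabilities for digits 0..9 (base 10)."""
    return [
        sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    ]

def second_digit(n):
    """Second significant digit of a positive integer, or None if n < 10."""
    s = str(n)
    return int(s[1]) if len(s) >= 2 else None

def digit_proportions(counts):
    """Proportion y_i of counts whose second digit is i, for i = 0..9.

    Assumption: counts with fewer than two digits are dropped.
    """
    digits = [d for d in map(second_digit, counts) if d is not None]
    if not digits:
        raise ValueError("no counts with at least two digits")
    return [digits.count(i) / len(digits) for i in range(10)]

def ols_slope(y):
    """Least-squares slope from regressing y_i on i = 0..len(y)-1."""
    n = len(y)
    x_bar = (n - 1) / 2
    y_bar = sum(y) / n
    sxy = sum((i - x_bar) * (yi - y_bar) for i, yi in enumerate(y))
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    return sxy / sxx

# Slope implied by the theoretical second-digit distribution.
theoretical_slope = ols_slope(second_digit_benford())
print(round(theoretical_slope, 4))  # -0.0038
```

For a given set of precinct counts, `ols_slope(digit_proportions(counts))` gives the observed slope to set against the theoretical one. Unlike the chi-squared statistic, this number does not grow with the number of precincts, so it can be compared across elections of different sizes.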

Another suggestion is to use graphs instead of tables, which should allow patterns to be seen more clearly. This is super-clear in something like Tables 5 and 9-16, but I’d argue that every one of the tables would be better as a graph. I’d also recommend listing the Mexican states in order of increasing number of voters or some other meaningful measure, rather than alphabetically.

Anyway, this is a very innovative and thought-provoking paper, and I look forward to seeing more work in the area.

Benford's law can be given for bases other than 10. It might be a lot easier to work with base 2 or 3.
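For reference, here is a rough sketch of the digit probabilities in a general base b, using the standard generalization: the first digit d has probability log_b(1 + 1/d), and the second-digit probabilities sum log_b(1 + 1/(b*d1 + d2)) over first digits d1 = 1,…,b-1. The function names are my own. Whether a small base is actually easier to work with for vote counts is an open question that this does not settle.

```python
import math

def first_digit_probs(base):
    """Benford probabilities for first digits 1..base-1 in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2")
    return [math.log(1 + 1 / d, base) for d in range(1, base)]

def second_digit_probs(base):
    """Benford probabilities for second digits 0..base-1 in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2")
    return [
        sum(math.log(1 + 1 / (base * d1 + d2), base) for d1 in range(1, base))
        for d2 in range(base)
    ]

for b in (2, 3, 10):
    print(b, first_digit_probs(b), second_digit_probs(b))
```

One thing this makes visible: in base 2 the first digit is always 1, so the first-digit distribution carries no information, and only the second and later digits could be used for a test.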