I always think it’s funny when people go around saying their statistical methods have some sort of “guaranteed” performance. I mean, sure, guarantees are fine—but a guarantee comes from an assumption. If you want to say that your method has a guarantee but my method doesn’t, what you’re really saying is that you’re making an assumption and I’m not.

Assumptions are fine—they’re necessary—so that’s all cool. Let’s just remember where guarantees come from.

The phrasing is a bit too glib for me. When someone says their method has a “guarantee,” what they mean is that they have identified a set of assumptions under which the method works, and proved it. When they say a method has “no guarantee,” what they mean is that no such set of assumptions has been identified, and possibly the method doesn’t work under any set of assumptions! So all methods have assumptions, implicitly or explicitly, but it is good to identify them whenever we can.

Evan:

That’s a useful clarification. I agree that guarantees are not empty. A guarantee is a theorem, a logical implication from assumptions. My point is that a guarantee implies an assumption. It’s fine when people state guarantees—as long as they recognize the assumptions underlying them.

There are guarantees that come from, in essence, fancy counting arguments, that do not involve assumptions about the data generating mechanism.

https://normaldeviate.wordpress.com/2012/07/04/statistics-without-probability-individual-sequences/

For large N, almost every N-digit binary sequence is a random sequence in the Kolmogorov / Per Martin-Löf sense.

In case it’s not obvious, the point of mentioning that is this: whatever the counting argument is, it seems likely to be equivalent to some randomness assumption, because basically all the sequences are random.

This guarantee applies even if an adversary searches for the worst possible sequence — a decidedly non-random data generation mechanism!

That’s pretty interesting, but if even the worst ball in the urn is sorta red then the probability of picking a red ball is very high even if you search for the least red ball.

One of the things about Kolmogorov random sequences is that there is no program that outputs the sequence and is itself substantially shorter than the sequence. If the sequence is extremely long, then even writing a computer program to do the search for the worst possible sequence is potentially writing a very long computer program (it has to be long enough to output all the sequences you are checking). Basically I suspect there is some aspect to the guarantee here which has to do with this kind of complexity argument.

Part of the reason to mention all of this is that it’s part of my opinion that the assumptions about “random” outcomes that are appropriate for doing frequentist tests etc. fail to hold in most actual cases in which they are used. There are a number of reasons for this; one of the most important is the size of the dataset:

If you collect 25 or 100 data points, you have a sequence of 25 or 100 values. Suppose you have 3 decimal digits of precision in your data collection; then with 100 values you have 1000^100 = 10^300 possible sequences. Now suppose you collect 25000 values: you have 1000^25000 = 10^75000 possible sequences. 10^300 / 10^75000 = 10^-74700, which is 0 to roughly 75000 decimal places.

You can extend sequences of 100 data points to sequences of 25000 data points by just choosing an arbitrary value, say 0, for all the remaining values. This shows that the set of 100-data-point sequences is a meaninglessly tiny subset of the 25000-data-point sequences. If almost all 25000-data-point sequences are random, this does not in any way imply that almost all 100-data-point sequences are random, and still less that almost all 25-data-point sequences are.
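The counting here is easy to check with exact integer arithmetic. The following is just a sketch of that check; the names `LEVELS` and `num_sequences` are made up for illustration, and the 1000 levels correspond to the assumed 3 decimal digits of precision.

```python
from fractions import Fraction

LEVELS = 1000  # assumption: 3 decimal digits of precision per data point


def num_sequences(n_points, levels=LEVELS):
    """Count the distinct sequences of n_points values, each taking one of `levels` values."""
    return levels ** n_points


short = num_sequences(100)     # sequences of 100 data points
long_ = num_sequences(25000)   # sequences of 25000 data points

# Share of the long sequences that are just a short sequence padded with zeros.
share = Fraction(short, long_)
```

Python integers are arbitrary precision, so `share` comes out as exactly 1 / 10^74700 rather than underflowing to 0.0 the way a floating-point ratio would.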

The assumptions that are made about “random data” (independent identically distributed data from a usually assumed known distribution) are usually inappropriate in real-world data sets, for lots and lots of reasons. But when it comes to long sequences, the assumptions may be less problematic. How long is long enough is the question in any given example.

Frequentist statistics assumes that the “randomness” is *a property of the data*. Checking that requires long sequences because the properties you’d test all become more checkable as the sequence increases in length, out to thousands or millions of data points. The Dieharder tests use billions of data points to check the properties of pseudo-random number generators.

Bayesian statistics doesn’t assume randomness of the data at all. It assumes instead a description of the information state of the observer in terms of a weight over subsets of possible sequences that the model would consider plausible vs not plausible (the weight is the plausibility). Testing whether a sequence passes high-complexity statistical testing procedures to see if it “really is random” is not part of Bayesian logic. Each sequence is assigned a plausibility based on the model assumptions, and this is a fact *about the model*.

You can reject the idea that a sequence comes from a particular random number generator in a Frequentist analysis. In real-world cases, you will *always* do this if you get a long enough sequence.

On the other hand, for each sequence, a Bayesian model assigns a particular weight, deterministically, and the only way you can reject this weight is to reject the model. The weight is a logical consequence of the model, independent of the frequency with which one would get certain subsequences, etc.

This difference in meaning of what the model implies in a Frequentist vs a Bayesian model is something that’s critical to understanding why Frequentist rejections of “statistical hypotheses” are rarely actually of interest, while Bayesian weighting of different data sequences is not subject to the same issues.

The nature of the guarantee is that a particular way of combining predictions from different algorithms will not do much worse than the best of those algorithms. So the program that outputs the worst possible sequence *given the prediction algorithms* is quite short; the complexity, if there is any, will be found in the prediction algorithms.

I assume the worst case finding algorithm works by basically running each prediction algorithm to get the output and then choosing the 0 or 1 value that makes the overall error in prediction maximal… all we need to do is make each prediction algorithm take longer than the age of the universe to compute and we can foil the attacker! ;-)
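To make the adversary described above concrete, here is a minimal sketch. It assumes the combiner is the deterministic weighted majority rule of Littlestone and Warmuth with penalty factor 1/2, which may or may not be the combiner the linked post has in mind, and the four toy experts are hypothetical. The adversary runs the combiner and then picks whichever bit makes it wrong.

```python
from math import log2


def weighted_majority_vs_adversary(experts, rounds, beta=0.5):
    """Run weighted majority against an adversary that always flips the combined guess.

    experts: functions mapping the history (list of past bits) to a 0/1 prediction.
    Returns (combined_mistakes, per_expert_mistakes).
    """
    weights = [1.0] * len(experts)
    expert_mistakes = [0] * len(experts)
    combined_mistakes = 0
    history = []
    for _ in range(rounds):
        preds = [expert(history) for expert in experts]
        weight_for_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        weight_for_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if weight_for_1 >= weight_for_0 else 0
        outcome = 1 - guess  # adversary: choose the bit the combiner did not predict
        combined_mistakes += 1
        for i, p in enumerate(preds):
            if p != outcome:
                expert_mistakes[i] += 1
                weights[i] *= beta  # shrink the weight of every expert that was wrong
        history.append(outcome)
    return combined_mistakes, expert_mistakes


# Hypothetical toy experts, chosen only for illustration.
experts = [
    lambda h: 0,                    # always predict 0
    lambda h: 1,                    # always predict 1
    lambda h: h[-1] if h else 0,    # repeat the last bit
    lambda h: len(h) % 2,           # alternate 0, 1, 0, 1, ...
]

combined, per_expert = weighted_majority_vs_adversary(experts, rounds=200)
# Standard weighted majority bound for beta = 1/2: roughly 2.41 * (best + log2(n)).
bound = (min(per_expert) + log2(len(experts))) / log2(4 / 3)
```

The combiner is wrong in every round by construction, yet the bound still holds: the only way the adversary can hurt the combiner is to hurt the best expert too.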

Anyway, I like this kind of stuff, but I think my point actually winds up tangential to your main point.

“If you want to say that your method has a guarantee but my method doesn’t, what you’re really saying is that you’re making an assumption and I’m not.”

I don’t get this. Let’s say I have some nice algorithm for a clustering problem with a defined cost function. For a dataset of size n, CarefulCluster guarantees to terminate within n^2 steps, and produce an outcome with cost no more than 1.5 times the cost of the optimal outcome. A different algorithm QuickCluster guarantees to terminate within n log n steps and to produce an outcome with cost no more than 3 times the cost of the optimal outcome. CarefulCluster and QuickCluster have different guarantees, not different assumptions. Maybe they don’t have “assumptions” at all, beyond the specification of the clustering problem and the cost function.

But maybe such things aren’t “statistical methods” in the sense you mean?

James:

Good point. Algorithms can have guarantees of this sort. When I see guarantees in statistics or machine learning, they’re typically guarantees about accuracy of estimates or predictions, and these guarantees depend on particular models.

I think Andrew’s talking about results like the central limit theorem. In the simple form, the variables y[n] are required to be i.i.d. with finite mean and variance and the “guarantee” is that the mean of the sequence converges to the expectation, with errors that for large N are approximately distributed as normal(0, sd[y] / sqrt(N)).
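A quick simulation shows what that guarantee looks like when the assumptions do hold. This is only a sketch: the uniform(0, 1) draws, the sample size, the replicate count, and the function name are all illustrative choices, not anything from the discussion above.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_mean_errors(n, replicates, seed=0):
    """Return (sample mean - true mean) for repeated i.i.d. uniform(0, 1) samples of size n."""
    rng = random.Random(seed)
    true_mean = 0.5
    return [sum(rng.random() for _ in range(n)) / n - true_mean
            for _ in range(replicates)]


N = 100
errors = simulate_mean_errors(N, replicates=2000)

predicted_sd = sqrt(1 / 12) / sqrt(N)  # sd[y] / sqrt(N); uniform(0, 1) has variance 1/12
observed_sd = pstdev(errors)
average_error = mean(errors)
```

Under i.i.d. sampling, `observed_sd` should land close to `predicted_sd`. Replace the i.i.d. draws with, say, strongly autocorrelated ones and the same formula no longer describes the errors, which is the sense in which the guarantee is tied to the assumption.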

In ML work in computer science, I see two kinds of guarantees. One form is a regret bound for strategies in bandit or other learning problems. Like the CLT, you need assumptions like i.i.d. Bernoulli returns from each bandit in the simplest case. The other is convergence bounds for optimizers. For instance, the big one here is from Robbins and Monro, concerning convergence of stochastic gradient descent assuming a certain form of learning rate schedule. Without that learning rate schedule, the proof wouldn’t go through.
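As a toy illustration of why the schedule matters, here is a sketch of stochastic gradient descent on the loss 0.5 * (x - y)^2, where y is a noisy observation. The usual Robbins-Monro conditions ask for step sizes whose sum diverges while the sum of their squares stays finite; 1/t is the textbook example. The function name and the Gaussian data are assumptions made for this example.

```python
import random


def sgd_estimate_mean(samples, step):
    """Stochastic gradient descent on 0.5 * (x - y)^2, one sample y per step."""
    x = 0.0
    for t, y in enumerate(samples, start=1):
        x -= step(t) * (x - y)  # (x - y) is the noisy gradient at the current iterate
    return x


rng = random.Random(1)
samples = [rng.gauss(3.0, 1.0) for _ in range(10000)]  # illustrative data, true mean 3.0

# Schedule 1/t satisfies the conditions; here it reproduces the running sample mean.
decaying = sgd_estimate_mean(samples, step=lambda t: 1.0 / t)

# A constant step of 1.0 violates them; the iterate just tracks the latest sample.
constant = sgd_estimate_mean(samples, step=lambda t: 1.0)
```

With the 1/t schedule the iterate settles near 3.0, while with the constant step it never stops bouncing around with the noise, so there is nothing for a convergence proof to grab onto.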