In ML work in computer science, I see two kinds of guarantees. One is a regret bound for strategies in bandit or other online learning problems. As with the CLT, you need assumptions, e.g. i.i.d. Bernoulli returns from each arm in the simplest case. The other is a convergence bound for optimizers. The big one here is from Robbins and Monro, concerning convergence of stochastic gradient descent under a certain form of learning rate schedule. Without that learning rate schedule, the proof wouldn’t go through.
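To make the Robbins and Monro point concrete: the conditions are that the step sizes eta_t satisfy sum(eta_t) = infinity and sum(eta_t^2) < infinity, and eta_t = 1/t is the classic schedule satisfying both. A toy sketch (my own example, a noisy quadratic, not from the original paper): with eta_t = 1/t on the loss (x - z_t)^2 / 2, the SGD recursion is algebraically just the running mean of the samples, which is exactly why it converges.

```python
import random
import statistics

random.seed(0)

# Minimize E[(x - z)^2 / 2] from noisy samples z_t, using the
# Robbins-Monro schedule eta_t = 1/t (sum eta_t diverges, sum eta_t^2 converges).
zs = [random.gauss(3.0, 1.0) for _ in range(1000)]

x = 0.0
for t, z in enumerate(zs, start=1):
    grad = x - z            # stochastic gradient of (x - z)^2 / 2
    x -= (1.0 / t) * grad   # eta_t = 1/t

# With eta_t = 1/t this recursion is algebraically the running mean of zs,
# so x tracks the sample mean exactly (up to float rounding).
print(x, statistics.mean(zs))
```

With a different schedule, say constant eta, the iterate keeps bouncing around the optimum and the convergence proof fails, which is the point about the schedule being an assumption.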

]]>Good point. Algorithms can have guarantees of this sort. When I see guarantees in statistics or machine learning, they’re typically guarantees about accuracy of estimates or predictions, and these guarantees depend on particular models.

]]>I don’t get this. Let’s say I have some nice algorithm, call it CarefulCluster, for a clustering problem with a defined cost function. For a dataset of size n, CarefulCluster guarantees to terminate within n^2 steps and to produce an outcome with cost no more than 1.5 times the cost of the optimal outcome. A different algorithm, QuickCluster, guarantees to terminate within n log n steps and to produce an outcome with cost no more than 3 times the cost of the optimal outcome. CarefulCluster and QuickCluster have different guarantees, not different assumptions. Maybe they don’t have “assumptions” at all, beyond the specification of the clustering problem and the cost function.

But maybe such things aren’t “statistical methods” in the sense you mean?
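For what it’s worth, a real algorithm with exactly this shape of guarantee is the greedy farthest-point heuristic for k-center (Gonzalez): it uses O(nk) distance evaluations and returns centers whose cost is at most 2 times optimal, assuming only that the distance is a metric, with no statistical assumptions about the data at all. A minimal sketch (the example points are mine):

```python
# Greedy farthest-point heuristic for k-center (Gonzalez).
# Guarantee: cost <= 2 * optimal cost, provided dist is a metric.
# No distributional assumptions about the points whatsoever.

def k_center_greedy(points, k, dist):
    centers = [points[0]]  # arbitrary first center
    while len(centers) < k:
        # next center: the point farthest from its nearest current center
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    # cost = max over points of distance to the nearest center
    cost = max(min(dist(p, c) for c in centers) for p in points)
    return centers, cost

def euclid(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# three well-separated pairs; with k=3 the greedy cost is 1.0
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
centers, cost = k_center_greedy(pts, k=3, dist=euclid)
print(centers, cost)
```

The 2x bound is a theorem about the worst case over all metric inputs, which is what makes it a guarantee rather than a statistical claim.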

]]>Anyway, I like this kind of stuff, but I think my point actually winds up tangential to your main point.

]]>If you collect 25 or 100 data points, you have a sequence of 25 or 100 values. Suppose you have 3 decimal digits of precision in your data collection; then you have 1000^100 = 10^300 possible sequences of length 100. Now suppose you collect 25000 values: you have 1000^25000 = 10^75000 possible sequences. The ratio 10^300 / 10^75000 = 10^-74700 is zero to roughly 74700 decimal places.
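The counting is exact in integer arithmetic:

```python
# Each data point recorded to 3 decimal digits has 1000 possible values,
# so a sequence of n points has 1000**n possible realizations.
assert 1000 ** 100 == 10 ** 300        # sequences of length 100
assert 1000 ** 25000 == 10 ** 75000    # sequences of length 25000

# The 100-point sequences are a 10**-74700 fraction of the 25000-point ones.
assert (10 ** 75000) // (10 ** 300) == 10 ** 74700
print("ratio of counts: 10**-74700")
```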

You can extend sequences of 100 data points to sequences of 25000 data points by just padding with some arbitrary value, say 0, for all the remaining entries. This shows that the 100-point sequences form a meaninglessly tiny subset of the 25000-point sequences. So even if almost all 25000-point sequences are random, that in no way implies that almost all 100-point sequences are random, let alone all 25-point sequences, etc.

The assumptions that are made about “random data” (independent, identically distributed data from a usually-assumed-known distribution) are usually inappropriate in real-world data sets, for lots and lots of reasons. But when it comes to long sequences, the assumptions may be less problematic. How long is long enough? That is the question in any given example.

Frequentist statistics assumes that the “randomness” is *a property of the data*. Checking that requires long sequences, because the properties you’d test all become more checkable as the sequence grows in length, out to thousands or millions of data points. The dieharder tests use billions of data points to check the properties of pseudo-random number generators.
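As a toy illustration of why length matters, here is a sketch of about the simplest such property check, a monobit frequency test in the style of NIST SP 800-22 (the numbers are mine): a coin biased 55/45 produces an unremarkable z-score at n = 100 but an astronomical one at n = 100000.

```python
from math import sqrt

def monobit_z(bits):
    """z-score of the one-count against the fair-coin expectation n/2."""
    n = len(bits)
    return (2 * sum(bits) - n) / sqrt(n)

# Deterministically biased "data": exactly 55% ones.
pattern = [1] * 55 + [0] * 45

z_short = monobit_z(pattern)           # n = 100
z_long = monobit_z(pattern * 1000)     # n = 100,000

# Same bias, wildly different verdicts: z = 1.0 vs z ~ 31.6.
print(z_short, z_long)
```

At n = 100 the bias is statistically invisible; at n = 100000 the fair-coin hypothesis is rejected beyond any doubt. The serious test batteries stack dozens of far more powerful checks, hence the billions of data points.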

Bayesian statistics doesn’t assume randomness of the data at all. Instead it assumes a description of the information state of the observer, in terms of a weight over subsets of possible sequences that the model would consider plausible vs. not plausible (the weight is the plausibility). Testing whether a sequence passes high-complexity statistical testing procedures to see if it “really is random” is not part of Bayesian logic. Each sequence is assigned a plausibility based on the model assumptions, and this is a fact *about the model*.

You can reject the idea that a sequence comes from a particular random number generator in a Frequentist analysis. In real-world cases, you will *always* do this if you get a long enough sequence.

On the other hand, for each sequence, a Bayesian model assigns a particular weight, deterministically, and the only way you can reject this weight is to reject the model. The weight is a logical consequence of the model, independent of the frequency with which one would get certain subsequences, etc.

This difference in meaning of what the model implies in a Frequentist vs. a Bayesian analysis is critical to understanding why Frequentist rejections of “statistical hypotheses” are rarely actually of interest, while Bayesian weightings of different data sequences are not subject to the same issues.

]]>One of the things about Kolmogorov-random sequences is that there is no computable function substantially shorter than the sequence that outputs it. If the sequence is extremely long, then even writing a computer program to search for the worst possible sequence potentially means writing a very long program (it has to be long enough to enumerate all the sequences you are checking). Basically, I suspect the guarantee here involves some kind of complexity argument along these lines.
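Kolmogorov complexity itself is uncomputable, but compression gives a crude, computable upper bound on description length and makes the asymmetry vivid (a rough sketch, with zlib standing in for “shortest program”):

```python
import random
import zlib

random.seed(0)

structured = b"ab" * 5000  # 10,000 bytes, trivially describable
noisy = bytes(random.randrange(256) for _ in range(10000))  # random-looking

# Compressed length is an upper bound on description length:
# a short description exists for the structured string, while the
# noisy one admits no substantially shorter description that zlib can find.
print(len(zlib.compress(structured)))  # tiny
print(len(zlib.compress(noisy)))       # about as long as the input
```

A sequence that no program much shorter than itself can produce is, in this operational sense, incompressible, which is the Kolmogorov notion of random.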

]]>https://normaldeviate.wordpress.com/2012/07/04/statistics-without-probability-individual-sequences/

]]>That’s a useful clarification. I agree that guarantees are not empty. A guarantee is a theorem, a logical implication from assumptions. My point is that a guarantee implies an assumption. It’s fine when people state guarantees—as long as they recognize the assumptions underlying them.
