Selection bias leads to confusion about the relative stability of deterministic and stochastic algorithms

If you do any statistical computing at all, one thing you soon realize is that simple algorithms are more stable than complicated algorithms, and deterministic algorithms are more stable than stochastic algorithms.

Least squares is super-stable (unless you have collinearity or near-collinearity, in which case you don’t want to be fitting unregularized least squares anyway); regularized least squares is even more stable and is no more complicated than least squares; logistic regression with maximum likelihood is super-stable unless you have collinearity or separation; regularized logistic regression is just as simple and is even more stable. Getting to more complicated algorithms: variational inference can run into problems; and Markov chain Monte Carlo is wonderful but it can have trouble mixing and you have to be careful to monitor its convergence.

What I want to argue here is that the relative stability of simple and deterministic methods is not a property of the algorithms but rather a reflection of the problems to which they are applied.

Least squares is not a stable algorithm at all! Try to use it to fit a hierarchical model or a mixture model or any latent-variable model and you will get a disaster. But we don’t use least squares for such problems. Similarly for maximum likelihood logistic regression: you can use it on simple problems but it will fail on more complicated item-response models. We use regularized least squares or regularized maximum likelihood for slightly harder problems but it too will fail on many latent-variable problems. When we apply complicated stochastic algorithms to difficult problems, we get some difficulties. Complicated stochastic algorithms would work just fine on simple linear and logistic regressions; we just usually use simpler algorithms for such problems, at least when they are small problems.

The point is that there’s an “ecological correlation” (as they say in social science) between difficulty of problem and complexity of computation algorithms. More complicated algorithms get applied to harder problems. Conditional on the problem, it’s the complicated stochastic algorithms that are typically more stable. Those steps of iteration help avoid getting stuck in bad places, and the randomness allows us to easily monitor mixing. Deterministic algorithms, when they don’t work, often fail in ungraceful ways.

This is not to say you should never use simple deterministic algorithms, just that you should be aware that reason they typically have such good performance in the wild is that they are typically applied to relatively easy problems.

63 thoughts on “Selection bias leads to confusion about the relative stability of deterministic and stochastic algorithms

  1. How does today’s advice regarding algorithms and selection bias pertain to the following oft-repeated admonition in this blog?:

    “Never play cards with a man called Doc. Never eat at a place called Mom’s. Never sleep with a woman whose troubles are worse than your own.” ― Nelson Algren, A Walk on the Wild Side

    Does Algren’s advice have “such good performance in the wild” because it is “typically applied to relatively easy problems.” When will we expect the advice/algorithm to fail? And, of course, how to measure success and failure?

    • Paul:

      The Algren advice is relevant to statistics! I say this because betting is sometimes presented as a foundation of probability, but, as Yuling and I wrote, every bet needs a taker, and the existence of someone willing to lay money on an event supplies information about its probability. See here for more on this point.

      • How to reconcile this with the Ulam definition of a coward as someone who will not bet even when you offer him two- to-one odds and let him choose his side.

        • Roger:

          I’ve never heard that quote before, but, yeah, if you’re allowed to choose the side then there’s no good reason not to take that bet, unless you’re concerned that the person on the other side won’t pay off. The concern that I was referring to is when you don’t get to choose the side.

          For example, suppose I fit a model and say that some event has a 2/3 chance of occurring, and then someone offers me a 2-to-1 bet on it. Should I take that bet? The answer is not so clear: the fact that this person is offering to bet on this could itself convey information. I’d want a vig, and how much vig depends on what sorts of knowledge other people can have.

        • The folklore about the Ulam definition arises from a Samuelson paper in 1963 that describes offering a bet on a coin flip to E. Cary Brown, the chair of the MIT economics department at the time. The payoffs were $200 if coin landed heads, $-100 if it landed tails. Brown refused the bet but said he would be willing to take 100 such bets. Samuelson was outraged about this, claiming that it violates expected utility principles. It doesn’t as shown later by Pratt. For me the story undermines the Dutch book arguments that claim that willingness to bet is strictly determined by EU computations, leaving no room for either optimism or pessimism.

        • > Samuelson was outraged about this, claiming that it violates expected utility principles. It doesn’t as shown later by Pratt.

          It seems that you mean that Pratt noted that it does if we consider “proper” utility functions but it doesn’t if we allow for “improper” utility functions.

          If that’s the case it seems misleading to say that Samuelson claim is false just because it doesn’t hold when we take words to mean different things.

        • This coin flip thing makes zero sense to me. They’re obviously completely different lotteries.

          Looking it up, the argument seems to go “well if I present the bets to you individually, one at a time…” That argument obviously makes no sense. Yeah, I would turn down the first bet only because I wouldn’t assume the counterparty isn’t going to offer it to me again. Definitely not so many times I’m guaranteed to win. It’s a different decision space.

        • > Samuelson was outraged about this, claiming that it violates expected utility principles. It doesn’t as shown later by Pratt.

          I’ve been rereading Samuelson’s “Theorem. If at each income or wealth level within a range, the expected utility of a certain investment or bet is worse than abstention, then no sequence of such independent ventures (that leaves one within the specified range of income) can have favorable expected utility.”

          There are no explicit assumptions about the form of the utility function – I’ll have to check in what sense Samuelson outrage is unwarranted (if we “stand with Daniel Bernouilli, Bentham, Ramsey, v. Neumann, Marschak, and Savage”).

        • > Definitely not so many times I’m guaranteed to win.

          You are not guaranteed to win. Samuelson’s colleague only asks for the guarantee to be able to play 100 times. But there is no value in the having the option of doing the hundredth flip because he says he would reject that single flip!

        • If you’re given the ability to play 100 simultaneous flips, the bet is an obvious win.

          If on the other hand you must play serially, the bet is not so obvious. Suppose you have $300 to play with, there’s a 1/8 chance you lose three times in a row and now have no chance to regain. By being “guaranteed” 100 plays you’re actually playing serially but in effect simultaneously because you need not pay out anything until the 100th play. By then you’re virtually guaranteed a win.

          Serial bets with bankruptcy terminating the plays is a very different game. Failure to acknowledge this is a structural problem in some economic theories. There’s a heterodox branch that call themselves “ergodicity economics”

          And there’s a site collecting info on it. This introduction mentions Samuelson actually.

          https://ergodicityeconomics.com/2024/02/05/ergodicity-economics-a-history-2/

          As pointed out in that site the “right” rule is to maximize expected growth rate. So for example you may prefer a long series of small bets to a game you have high probability of going bankrupt on. Even if the growth rate conditional on not going bankrupt is quite high.

        • Yeah but that’s a materially different bet.

          “Do it 100 times and sum the payoff”
          is different from
          “Do it 100 times with the option to cash out/in at any time”
          is different from
          “Do it 100 times in sequence and pay in/out each time”

          Furthermore, the initial endowment matters. That said, as Andrew points out, the stakes are so small here that not taking the bet, either once or iterated, is wrong for anyone not in extreme circumstances.

        • Somebody:

          The stakes of that bet are indeed so small for me. But, as Daniel points out, for some people, $100 is real money! On the other hand, if your financial situation is such that $100 is real money, nobody’s gonna offer you this bet. Or, if they are, you should be very very suspicious that some trick is involved. Which gets us back to our original point that only some bets are available to you, and if someone offers you a seemingly-attractive bet, you have to look at it carefully: the fact that the bet is being offered to you provides some additional information.

        • > If you’re given the ability to play 100 simultaneous flips, the bet is an obvious win.

          The thing is that given the option between playing 100 simultaneous flips and playing 99 simultaneous flips Samuelson’s colleague would take the latter if he has to be consistent with the choice of no flips over one.

        • Carlos. That might be true but is not obvious to me. Are you saying it’s because 99 flips plus 1 flip is 100 flips, and when he hits the 99th flip he’ll be offered 0 or 1 more flips and by his logic he should refuse the extra flip… we can then recurse down to “he shouldn’t take any number of flips”?

          I think the point of the “ergodicity economics” people is that what you’re willing to take as “simultaneous” flips need not be particularly related to what you’d take as a serial game.

          If he’s guaranteed 100 flips, even if he does them in series, it’s really “simultaneous” because there’s no intermediate payouts or bankruptcy possible. In the “simultaneous” case beyond a few (like 2 or 3 or 4 maybe) he should be willing to take as many as anyone would give him. In the serial case, he might not want to play the game at all (I haven’t simulated serial outcomes so I’m not sure).

        • To be clear, Samuelson’s discussion assumes that the colleague has much more than $300 and his rejection of a single bet is not sensitive to his wealth being precisely $X: he would also reject the bet if he had between $X-$10k and $X+$20k (i.e. he wouldn’t accept any single bet in the whole sequence of bets that he would accept).

          The “simultaneous” or “sequential” distinction doesn’t matter if the preferences are to be represented by a utility function that depends only on the outcome.

          Could he accept two simultaneous bets while rejecting a single bet? That would require the expected utility of taking two simultaneous bets 1/4 U(X-200) + 1/2 U(X+100) + 1/4 U(X+400) to be higher than the utility of not taking them U(X).

          However the rejection of single bets means that

          (A): 1/2 U(X-100) + 1/2 U(X+200) is less than U(X)

          (B): 1/2 U(X-200) + 1/2 U(X+100) is less than U(X-100)

          (C): 1/2 U(X+100) + 1/2 U(X+400) is less than U(X+200)

          and writing down 1/2 (B) + 1/2 (C) and taking (A) into account you can find that the expected utility of taking two simultaneous bets is lower than the utility of not taking them.

          Exercise for the reader: extend the calculation above to one hundred simultaneous bets.

          (The “ergodicity economics” angle is completely irrelevant. It’s not an extension of expected utility or an alternative to expected utility – it’s actually more restrictive because what it does is to fix the utility function.)

        • The thing is that given the option between playing 100 simultaneous flips and playing 99 simultaneous flips Samuelson’s colleague would take the latter if he has to be consistent with the choice of no flips over one.

          I don’t see why that would be true in any of the versions of this game.

          Let your utility function be log(cost of a sandwich). Start with an initial endowment of 250 sandwich.

          0.5 Log[200 + 250] + 0.5 Log[250 – 100] > Log[250]

          so this hypothetical person would take the bet. Let’s say they’re offered the opportunity to play the game again. If they had won the first round

          0.5 Log[200 + 450] + 0.5 Log[450 – 100] > Log[450]

          so they would play again. If they had lost the round

          0.5 Log[200 + 150] + 0.5 Log[150 – 100] < Log[150]

          So they'd keep what they had left and walk away. This is for the version of the game where you iterate once at a time and have to payout intermediate quantities. But I can't think of any version of the game that prohibits different decisions for different iterates a priori.

        • Ah, I found the source in “The Sqrt[N] Law and Repeated Risk Taking” which clarifies that it’s limited to “a maximizer who is risk averse enough to refuse a specified favorable bet at every W0 level…” or that wealth effects are irrelevant. I suppose that’s implied in the motivating example by (I’m assuming) E. Cary Brown’s ludicrous wealth relative to the $100 stakes. That makes more sense now.

          I was also finally able to find the origin point of that claim economists like to make about “insurance isn’t about adding risks, it’s about dividing risks”. It appears to be based on a minimax risk framework, which I find unconvincing.

      • Thanks Carlos, sometimes it’s best to just write out the math!

        I agree that it seems weird to say you’d take 100 simultaneous bets and not 1 bet under those conditions.

        • Andrew: 1963 $100 is roughly 1000 in current dollars so losing $1000 would be something to explain at home over dinner, no? I’m pessimistic, so I would be “sure” to be unlucky on the single flip. On 100 flips, as noted in the other comments, the probability of losing anything is tiny, and the expected gain $5000 is quite large especially in current dollars.

          Samuelson’s claim was that if one is willing to accept 100 simultaneous flips then for _any_ U, EU implies that one should be willing to take the one flip bet. Pratt showed that this was wrong. Only for “proper” U is this true, But why should one be proper? Or, why indeed, should one be constrained by making decisions solely on the basis of some EU straightjacket.

        • > Samuelson’s claim was that if one is willing to accept 100 simultaneous flips then for _any_ U, EU implies that one should be willing to take the one flip bet.

          Samuelson’s claim was that “If at each income or wealth level within a range, the expected utility of a certain investment or bet is worse than abstention, then no sequence of such independent ventures (that leaves one within the specified range of income) can have favorable expected utility.”

          > Pratt showed that this was wrong.

          In what sense? Is there a counter-example for the 100 +200/-100 bets of function u(x) such that his assumptions are verified and his conclusions is false?

  2. Complicated stochastic algorithms would work just fine on simple linear and logistic regressions; we just usually use simpler algorithms for such problems, at least when they are small problems.

    This is what I used to believe. In essence it was just a very inefficient approach to such problems. But it was shown to be wrong in that “HMC fails at the mode” thread. The algorithm can’t distinguish between overfitting to noise and getting the exactly correct answer.

    This is not limited to MCMC, it shows up in general when nparam ~ ndata:

    The general principle is that as the model complexity increases, the test risk of trained models
    first decreases and then increases (the standard U-shape), and then decreases again. The peak in test risk
    occurs in the “critical regime” when the models are just barely able to fit the training set. The second
    descent occurs in the “overparameterized regime”, when the model capacity is large enough to contain
    several interpolants on the training data. This phenomenon appears to be fairly universal among natural
    learning algorithms, and is observed in simple settings such as linear regression, random features regres-
    sion, classification with random forests, as well as modern neural networks.

    […]

    When n ≥ d, we are in the “underparameterized” regime and there is a unique minimizer of the objective in Equation 1. When n < d, we are “overparameterized” and there are many minimizers of Equation 1. In fact, since X is full rank with probability 1, there are many minimizers which interpolate, i.e. X β̂ = y. In this regime, gradient descent finds the minimum with smallest L2 norm ||β̂||^2 .

    https://arxiv.org/abs/1912.07242v1

    • > This is what I used to believe. In essence it was just a very inefficient approach to such problems. But it was shown to be wrong in that “HMC fails at the mode” thread.

      No, that’s not the lesson to take. The better statement is HMC requires much smaller time steps if initialized at the mode. It doesn’t really “fail” it is just more inefficient.

      There does not exist any algorithm which can distinguish between “this is the precisely correct parameter vector” and “this parameter vector precisely fits the data but that is mere over fitting”

      That’s because it’s not up to the fitting algorithm to distinguish, it’s up to the modeler to specify in their model.

      Suppose we have zero measurement error in our measurements (for example we are trying to fit the results of a computer simulation) then we merely search for the parameters that precisely fit the results to the level of roundoff error (assuming those parameters are unique)

      But suppose there is error in measurement and yet we still are able to fit the model precisely through all the data points. Here we know we have overfit because our model specifies measurement error and so we know the probability our precise fit is correct is near zero. Measurement error actual outcomes would need to have been precisely zero at each data point. Improbable.

      This is closer to the case with the “starting at the mode”. We have minimized errors but in all likelihood made them smaller than our model believes they should be (nearly zero every time). MCMC should explore the region where the individual errors are more like what the model expects. Hence, get away from the mode.

      • There does not exist any algorithm which can distinguish between “this is the precisely correct parameter vector” and “this parameter vector precisely fits the data but that is mere over fitting”

        Any algorithm that checks the model against new data (collected after the training data) can do this. Overfitting is detected all the time.

        MCMC should explore the region where the individual errors are more like what the model expects. Hence, get away from the mode.

        But it does not. It either blows up or takes so long as to be useless. Thus the hack of “don’t initialize to the mode”, which is becoming “we never want the mode anyway” or even “mode is bad, stay away from the mode”.

        • I’ve initialized to the mode and had the runs be successful, it mostly requires a longer run and more initialization time. It’s not “so long as to be useless” it’s just “longer than if you had initialized nearer equilibrium” when Bob gave his example he initialized with a particular timestep which is *in general* stable (ie. once away from the mode), but starting from the mode was not. He showed that the timesteps could be made smaller and it would eventually work.

          Overfitting is only really a concept that comes from “find the mode” (or find any single point) type fits. You always overfit when you’re finding a point. The high probability region you MCMC sample from defines the region that is all more or less equally reasonable. unless that’s a tiny region, *any* single point in it is overfit compared to an ensemble of them

        • Luckily, Bayesian inference is calibrated in expectation if the model is correct. So overfitting isn’t nearly the big deal it is in point estimate with a reasonable model. Doing these kinds of tests, as we always recommend to Stan users, can catch model errors.

          Thus the hack of “don’t initialize to the mode”, which is becoming “we never want the mode anyway” or even “mode is bad, stay away from the mode”.

          Mode is bad is independent of MCMC or HMC. This is just the way the math works in high dimensions. If you sample from a standard multivariate normal in N dimensions, the squared distance of draws from the mode follows a chiSquare(N) distribution (it’s the sum of independent standard normals, hence chi-square). This is well bounded away from the mode as soon as N > 10 or so. I go over this concept and provide a plot of the distance to the mode for standard normal in section 4 of this case study: n a Stan case study: https://mc-stan.org/users/documentation/case-studies/curse-dims.html. The result is unexpected by people used to thinking about optimization.

          I wouldn’t call the advice to not initialize at the mode a “hack”. It’s not a good draw, so there’s no reason to expect it to be a good initialization. Also, it’s much much easier to initialize randomly.

        • > It’s not a good draw, so there’s no reason to expect it to be a good initialization.

          It’s a better draw than any other though.

          It’s bad in the context of a sampling algorithm that uses a draw as initialization for the next draw and works better when both have similar density.

          Not being a good initialization is no reason to say it’s a bad draw. (Of course a sampling method that returns the mode in every draw would be bad!)

        • Carlos, I guess the question is “better for what purpose”

          It’s in some sense the best single draw, but it’s a terrible member of any ensemble of draws. (In what follows I assume high dimensions, for concreteness say 20 or more) The ensemble represents probability mass in some sense from the whole distribution whereas a single draw represents only the mass in an epsilon neighborhood of the draw. There is no mass near the mode, but there’s no mass near any single point! So the two facts “cancel out” but when it comes to an ensemble of 100 points now we have a kind of scale of the epsilon neighborhood of each point (let’s say the ball of radius half the distance to the nearest neighbor) each point represents the mass in that ball. Maybe not super well but there’s no other info about that ball other than the point at the center.

          The distance between the point at the mode and the nearest neighbor is going to follow that chisq distro Bob mentions, so the mode point is representing a huge ball with a high density. It’s doing so inaccurately since a large ball has all it’s mass on the surface, where the density is much lower, but the modal point represents the mass as volume time modal density, which is far too high a mass compared to the true mass.

          It’d be interesting to draw from a unit normal in 20D and look at the distribution of half-nearest-neighbor radii and calculate the “representative mass” of each point. I’ll do it later today if I have a chance.

        • > It’d be interesting to draw from a unit normal in 20D and look at the distribution of half-nearest-neighbor radii and calculate the “representative mass” of each point. I’ll do it later today if I have a chance.

          I don’t know what you expect to find – and maybe I misunderstand what you try to do – but I’d say that the “representative mass” is the density and for a multivariate standard normal the expected distance to a random point (or the expected distance to the nearest of K random points) is lower at the origin than anywhere else.

        • Yeah, I guess what I meant is you’re estimating the mass of the ball as (volume * density at the center), you may be right about the distance to the nearest neighbor being smallest from the center ball, but what matters is the ratio of estimated mass to actual mass.

          (volume * density at the center) / (integrated density * dv over the ball)

          think of that as “overweighting ratio”

          The density decreases in every direction from the center, and the volume is mostly on the surface, so integral(density * dv) over the ball centered at the mode should be much much smaller than (density at the center * volume) meaning the overweighting ratio is very large

          whereas for points away from the center, the density increases in some directions and decreases in others, the (center density * volume) / (integral(density *dv)) ratio may be much closer to 1. Or at least, it may be close to constant across the random sample and very much smaller than the same ratio for point at the mode

          If this intuition is right, then sampling at the mode in any ensemble in essence vastly overweights that point

        • > for points away from the center, the density increases in some directions and decreases in others

          We could say that in N dimensions there are 2N directions. The density increases in one direction only – and decreases in the other 2N-1 directions.

          > sampling at the mode in any ensemble in essence vastly overweights that point

          I don’t understand what “sampling at the mode” means, let alone “in an ensemble”. But I agree that it sounds bad.

        • by “sampling at the mode” I mean “drawing a sample point at the mode”

          Suppose you have an infinite sequence of points {s_i}, drawn using a high quality MCMC method. suppose you start this sequence at the mode {s_1 = [0,0,…]}

          Now we use the prefix of our infinite sequence s_{1…N} as a sample of size N for estimating things.

          Let’s take N = 2, when we calculate an average:

          1/2 * f(s_1) + 1/2 * f(s_2)

          in some sense we are trying to approximate an integral int(f(x) p(x)dx) the 1/2 is playing the role of p(x) but the proper weights should be more or less int(p(x) dv) for dv some volume around s_i, that is “the volume weighted average of p(x)”

          For a point s_1 at the mode, the volume weighted average for a region of finite radius r will be much smaller than p(s_1) V(r)

          When you say “The density increases in one direction only – and decreases in the other 2N-1 directions.” I don’t think that’s right.

          Restating things in terms of N, suppose we pick a random point from the normal N dimensional distribution. We know it’ll be at a distance from 0 described by chi-squared distribution. We can also think of it as a point on a sphere at that chi-squared radius.

          There’s an N-1 dimensional subspace that is locally *tangent* to the sphere and density is related only to the radius. So for small movements in that N-1 dimensional subspace nearby points are at constant p(x). If we move only a small distance in a spherically symmetric proposal, we can decompose the movement into normal and tangent subspaces…

          a * dN + b * dT

          where dN is a unit normal vector in a single direction, and dT is some unit vector any direction in the tangent space. So if we move by a distance epsilon, the p(x) value will depend entirely on the value of a and not at all on the value of b. We can think of our epsilon movement as epsilon * (a * dN + b * dT) where now a and b give the relative size of the movements in the two directions.

          The dot product of our vector will be epsilon^2 *( a^2 + b^2) and the a^2 + b^2 is chisq_N distributed, but a depends on a single dimension so a^2 is chisq_1 distributed which has a sharp peak at 0 (an 95% coverage interval is [0.00098, 5.024]

          While b^2 is chisq_N-1 distributed, which for let’s say N-1 = 19 has a broad peak around 19 ish (95% coverage interval is [8.9,32.85])

          According to a simulation rand(Chisq(1))/rand(Chisq(19)) is on average about 0.057 and median about .024

          so typically any random not-too-big proposal will take us to a point very close to the same p(x) when we’re starting at a typical value

          On the other hand, at the mode, there is no tangent space, and every direction you move goes downhill

          So, volume averaging in a small ball around a “typical” point will be averaging a near-constant function, whereas averaging in a small ball around the mode will be averaging in a strongly decreasing function.

          This suggests that the “overweighting factor” will be constant among the “typical” samples, but be very large for the sample at the mode

        • There’s an N-1 dimensional subspace that is locally *tangent* to the sphere and density is related only to the radius. So for small movements in that N-1 dimensional subspace nearby points are at constant p(x).

          Just like for small movements in the N dimensional space around the origin nearby points are at constant p(x).

          > On the other hand, at the mode, there is no tangent space, and every direction you move goes downhill

          Actually at the origin all the directions are in the tangent space and the probability is “constant” in every direction. If it goes “downhill” in every direction from the origin it goes just as “downhill” (in relative terms) in the tangent directions everywhere else.

          At any point the density is C exp(-x^2) where without loss of generality the coordinate system is oriented so only the first coordinate at most is non-zero and C is a constant.

          If we move epsilon in any direction (when x is zero) or in any direction within your N-1 dimensional tangent subspace (when x is non-zero) the new density is proportional to C exp(-1/2 (x^2+epsilon^2)) = C exp(-1/2 x^2) exp(-1/2 epsilon^2).

          The relative change is exp(-epsilon^2) when moving within the tangent subspace from any point – including the origin.

        • You’re right, of course it’s smooth at the origin! hah The fact that there wasn’t a gradient made me immediately think of it going downhill everywhere, but that’d be a cusp. It’s actually just constant in all directions for small perturbations.

          Thanks Carlos, you often catch when I’ve made a mistake, and it’s appreciated.

          I guess the real situation in terms of an ensemble isn’t that the behavior of p(x) near 0 is bad. The behavior is bad for MCMC with large stepsizes, but that’s a sampling algorithm issue.

          The real issue with sampling the origin is if you attempt to calculate integrate(f(x) * p(x) dV) as a sum then you will need to integrate over a LOT of dV values away from 0 and only 1 dV very close to 0.

          If you’re selecting just *one* point, the mode is the “best” but if you’re selecting an ensemble of points to calculate an approximation of an integral you want the points to be in the set where the mass is

          for 20 dimensions, it’s like r^20 different little dV values at a distance r and the density drops like exp(-r^2/2) and the combination of the two, we can estimate the behavior of the mass at a certain distance…

          r^20*exp(-r^2/2)

          and if you plot that it’s got a peak near 3 (this just recapitulates the basic shape of the correct thing which is the Chisq density which has a peak but its not at this location its farther out)

          here I do it with an arbitrary rescaling to keep it all in the same plot and put everything in the exponent for numerical safety

          https://www.desmos.com/calculator/nvbikzhvqi

          Intuitively there is just a lot more mass away from the mode so that even though the density declines rapidly for a while the volume explodes fast enough that most of the mass is away from the mode.

        • > if you’re selecting an ensemble of points to calculate an approximation of an integral you want the points to be in the set where the mass is

          One way to define “the set where the mass is” is the high density region. Other sets may be defined but they will contain less mass for a given volume. In any case, what you want is to have an ensemble of points sampled from that distribution according to their density.

          In principle you don’t want to exclude the origin, you don’t want to exclude the points along the axes, you don’t want to exclude any point. What you want is to have an ensemble of points sampled from that distribution according to their density.

          > most of the mass is away from the mode

          Sure. That doesn’t make the mode worse than any other point. Most of the mass is even more away from any other point.

        • The mode is a good single point but is a terrible member of any ensemble of usual size (say 100-10000 points)

          Of course the unit 20D normal we can draw independent samples from but usually we can’t. So we use MCMC or another scheme to get our collection of 100-1000 vectors of parameters. If we use such a scheme and wind up with samples near the mode, the quality of estimates will increase if we simply drop those samples.

          We use these samples to estimate integral(f(x)p(x)dV) by 1/N sum(f(x_i))

          In the integral there are very few dV volumes near the mode, but they are weighted by large p(x)

          A scheme for sampling that purposefully excludes the mode (for example by intentionally initializing away from it, or initializing at it but intentionally throwing away any prefix that includes regions near it) will do better at representing the integral in ~1000 samples than ones that don’t do that.

          Of course you’re not wrong that what we want is a sample from the distribution randomly according to its density. The point is just that only samples of length very very long have points near the mode, so in fixed size samples of usual size the fact that you have some near the mode indicates youve biased your inference via your method and eliminating such samples (often by throwing out a prefix) is better.

          If we could take independent random samples the issue would just never arise as the probability that a perfect sample of size ~10000 even has samples near the mode is so small as to be irrelevant.

        • > The mode is a good single point but is a terrible member of any ensemble of usual size (say 100-10000 points)

          Any other point, say the point [1.2 0.3 -0.8 -0.7 …], is a not so good single point and even more terrible member of any ensemble.

          > If we use such a scheme and wind up with samples near the mode, the quality of estimates will increase if we simply drop those samples.

          I agree, if we end with samples near the mode that’s bad. If we ended up with samples near [1.2 0.3 -0.8 -0.7 …] that would be even worse. (I understand that you’re talking about issues with sampling schemes that make the former situation a real danger but then the biased sampling method is to blame – it’s not the point’s fault.)

          > We use these samples to estimate integral(f(x)p(x)dV) by 1/N sum(f(x_i))

          > In the integral there are very few dV volumes near the mode, but they are weighted by large p(x)

          There are many fewer dV volumes near [1.2 0.3 -0.8 -0.7 …] and they are weighted by a much lower p(x).

          Now, in much of the discussion on this subject there is the implicit assumption that [0 0 0 ..] is “bad” and [1.2 0.3 -0.8 …] is “good” because the former “never happens” and the latter “happens all the time” because even though [1.2 0.3 -0.8 …] is much less likely than [0 0 0 0 …] one could also have [-1.2 0.3 -0.8 …] and [0.3 1.2 -0.8 …] and a ton of permutations and reflections and other things that cannot be obtained by swapping coordinates and changing their signs but are equally likely because the thing that determines the probability for a standard normal is the sum of their square.

          However, that’s absolutely irrelevant when we want to use the samples to estimate integral(f(x)p(x)dV because in general f([1.2 0.3 -0.8 …]), f([1.2 0.3 -0.8 …]), f([-1.2 0.3 -0.8 …]), f([0.3 1.2 -0.8 …]), etc. have no relation whatsoever with each other.

        • > in general f([1.2 0.3 -0.8 …]), f([1.2 0.3 -0.8 …]), f([-1.2 0.3 -0.8 …]), f([0.3 1.2 -0.8 …]), etc. have no relation whatsoever with each other.

          Well, the first two do indeed look similar because of my sloppy editing job. I think the point is clear though.

        • > There are many fewer dV volumes near [1.2 0.3 -0.8 -0.7 …] and they are weighted by a much lower p(x).

          That doesn’t make much sense either… I guess that in the integral there are the same dV volumes near every point – whatever that means – but they are weighted by a much lower p(x) for the points far from the mode.

        • You’re right that always starting at say [1,0.2,3.8,-1.29…] would equally bias your sample but also this point has no “special property” which makes it apparently attractive to start at. If we say we should always start at “one of the many points like [1,0.2,3.8…] that has p(x) near mean(p(x))” this rule is not a strong bias, since there are many such points sprinkled around many such places.

          As for the “many fewer dV” that can be interpreted as: “there are many fewer dV at a given radius from the mode when the radius is large, than when the radius is small” but the radius is precisely what controls the level-sets of p(x) so it could be rephrased more usefully and generally (for non-gaussians) perhaps as “there are many fewer dV in the level-set of p(point near mode) than in the level set of p(point farther from mode)”

          We can imagine approximating the integral(f(x) p(x) dV) as a sum:

          sum(sum(f(x_j) p_i dVij, j, 1, k_i), i, 1, N)

          where p_i is a partition of the p(x) range into small slices, and dVij is a partition of the level set of volume where p(x) is in the range of the p_i so that dVij all have p(x_in_center) ~ p_i a constant across the inner sum over j.

          Furthermore we can imagine ordering the p_i such that the total mass of the slice sum(p_i dVij,j,1,k_i) is in descending order (ie. small values of i correspond to larger values of total mass slice).

          So basically we’ve sliced the p(x) range into thin slices, we’ve found the slices where the total mass is largest, and starting there we begin to integrate across each mass slice, and then move on to the next mass slice.

          We can imagine doing the integral within each “level set” by a recursive random sampling

          Also note that the k_i reflects the size of the mass of each slice, so that the mass of that slice is reflected in the number of points we choose within the slice.

          We can calculate a truncated approximation to the expectation taken by looking at p_i V_i for each of the slices, and taking the largest of the slices one by one until we’ve gotten close to total mass 1.

          Consider a long and potentially biased by initialization MCMC run (a trillion samples). Using the log(p(x)) values, calculate Z = sum(exp(logp(x))) and calculate p(x) = exp(log(p) – log(Z)) for each sample (an estimate of p(x) dVij), do a histogram of the p(x) values, and then start at the histogram with the largest bar and reorder the sum to have those terms first, then from the next largest bar, then from the next largest bar… finally truncate the process when the sum of the total mass in all remaining samples is less than a sufficiently small value say .000001

          We can represent this throwing away also by a sub-sampling process. For each histogram bar we take all the points within it and sample a fraction of them, let’s say we want to drop our trillion samples to 10000 so we sample with probability 1e-8 from each histogram bar. Eventually some of the bars have less than 1e8 samples and we expect to take less than 1 point from that entire level-set.

          If we do this for a gaussian, this truncated sum will have “thrown away the mode” it will also have thrown away anything in the deep tails of large radii. Plus it will have randomly thrown away a lot of points in each level set. Still, nothing much of value was lost. The thrown away samples represent entire regions where there “is no mass to speak of” or where our representative sub-samples give us sufficient information “for our purposes”

          I hope someone other than us reads these comments, because I don’t think either of us is unclear on what we want etc, it’s more like unclear on how to word the description to make it accurate and sufficiently formal.

        • > We can calculate a truncated approximation to the expectation […] taking the largest of the slices one by one until we’ve gotten close to total mass 1.

          Right. We can also take the slices in decreasing order of density until we cover the same mass – for a slightly better accuracy/points ratio in general. Or we could do worse starting from the other end.

        • When you have trillions of samples, yes you can order it in any way you like and you’ll get more accurate answers if you truncate the very lowest density regions.

          But when you want a sample of 10k points, if you start with the highest density region and move outward until you get 5000 points you’ll do a terrible job. you will find that the method I described is relatively close to the kind of result you get from MCMC. Starting with trillions of samples and then downsampling randomly by a factor of a billion or whatever means that regions whose mass comes from high density but the region is small will not show up in the downsample. We can see that in the left hand quantiles of the Chisq distribution using the multivariate normal as the archetype. That’s the mode area. Also areas where the mass comes from a region where the density is so very very small that there are few samples in the trillion point sample will have zero points in the result. That’s the right hand tail of the Chisq. In the end, because you’re downsampling the trillion samples, you get only points where you started out with a lot of them, which is the meat of the Chisq distribution.

          A procedure for generating “a trillion” samples can be quite different and more flexible than the procedure for generating 5000, because “no matter where you start” in the trillion point sample, if your algorithm is ok, you’ll still get a good result. Not so true if you’re getting “bad” samples for the first 500 points in your 5000 point sample.

        • > But when you want a sample of 10k points, if you start with the highest density region and move outward until you get 5000 points you’ll do a terrible job.

          Quoting your previous message, “note that the k_i reflects the size of the mass of each slice, so that the mass of that slice is reflected in the number of points we choose within the slice.”

        • Right, maybe I didn’t explain right.

          Suppose you have a trillion points, so the mass of the region near 0 is small, but because there’s a trillion points it’s still say a 100k points… So if you want to reduce your TOTAL number of points down to 10k from 1e12, and you order them by probability density and then truncate the trillion points to the first 10k then you’re getting entirely points near the mode… a terrible idea. I’m 100% sure you know and agree with that.

          So instead, let’s downsample at random. for example by ordering the points in a random order, and then taking the first 10k from this random shuffle… Now we’re taking a sample that’s 10000/1e12 = 1e-8 as large, so the number of points from the mode slice is expected to be 100e3*1e-8 = .001 so there’s very unlikely to be any at all.

          The process “shuffle a trillion points into random order and take the first 10k” is very similar to “start with the high mass slices and calculate the expectation using those and truncate all the low mass slices”… Because only the high mass slices will wind up contributing points to the downsample.

          “starting with a trillion points and downsampling” is conceptually kind of like MCMC. That is, it walks around in the “high mass slices”

        • Do you see how subsampling a very large sample (trillions) results in truncating the contribution of some of the “mass slices” down to zero? And the ones it doesn’t truncate to zero are the ones that are the biggest mass slices?

          And how MCMC started “within one of the big mass slices” and allowed to walk around until a moderate size sample is generated will also “truncate” out the small mass slices? How both the “random subsampling from an ideal trillion sample set” and “MCMC started in one of the high mass slices” both perform an expectation that is analogous to “slice up the range of p(x) into thin slices, calculate the level sets, then integrate starting with the highest mass level set and proceeding until you get to quite small level sets” (ie. the sum of the remaining level sets is mass less than something like 0.0001)

          ?

        • I also see how if you generate a trillion points “taking the largest of the slices one by one until we’ve gotten close to total mass” and then “subsample” keeping the first ten thousand only – as you suggested to conclude that taking the slices in a different order is a terrible idea – that will truncate down to zero the contribution of each one of the “mass slices” except for the largest.

          All that subsampling, MCMC, etc. stuff is not related to
          https://statmodeling.stat.columbia.edu/2024/06/03/selection-bias-leads-to-confusion-about-the-relative-stability-of-deterministic-and-stochastic-algorithms/#comment-2373483 so I guess we do agree on that mathematical fact.

        • Ok, so I don’t know if there’s anything we disagree on anymore, but to summarize:

          1) If you have a way to randomly and independently sample from a distribution then do it
          2) Most of the time we don’t have that
          3) We want samples that give good expectation estimates from mean(f(x) for x in oursample)
          4) We want samples that are small enough for RAM purposes (usually 100-10000 range)
          5) Such samples if thought of as a fully random subsample of a much larger sample, will almost never have points near the mode or out in the tails. They will have many samples from the “high mass slices”
          6) MCMC gives samples that are conceptually similar to (5), mainly spending time in “high mass slices”
          7) MCMC fails to give samples conceptually similar to (5) if started at the mode, unless we let it walk a good distance away from the mode before starting to record samples.
          8) When started near the mode HMC variants tend to require very small step sizes that won’t be required later in the sample, and so it’s inefficient
          9) A conceptually good way to start MCMC is to start from anywhere and get the sampler to converge into a “high mass slice” where log(p(x)) is close to mean(log(p(x))), and then record samples from that point onward.
          10) There’s nothing wrong with the mode appearing in very large samples, that’s normal.
          11) There’s nothing wrong with the modal region per-se it’s the frequency with which it occurs in our sample that’s of concern. For samples of size 100-100k points it should almost never occur because that’s naturally what happens with a perfect sampler.

        • Maybe the only disagreement was on whether most directions were “downhill” in a standard multivariate normal.

          I don’t think that anything in that summary is incompatible with my first comment – nor with any later comment of mine for that matter.

        • At one point you said:


          > It’s not a good draw, so there’s no reason to expect it to be a good initialization.

          It’s a better draw than any other though.

          I think we agree in an abstract sense it’s a better *single* draw than any other, but if we think of MCMC as if it were slicing p(x) into thin slices, and then finding level sets and doing integrals by walking around in the level sets of high mass… then it’s not a draw that’s compatible with that procedure. So in this sense it’s a bad draw because it comes from a level set that has low mass but MCMC with a “short” runtime should walk around mostly in level sets with high mass.

          If Bob’s wording were rephrased to “it’s not a draw from a good level set” his statement would be reasonable I think.

          Also we were both slightly confused about whether normals decrease in all but one directions, only one direction, or no directions.

          To summarize there:

          1) At the mode, p(x) is locally constant and stays the same for an infinitesimal movement in any direction. However to second-order for movement of length O(1) it decreases by O(1) in any direction.

          2) Away from the mode p(x) is locally constant in an N-1 dimensional tangent subspace and stays the same for an infinitesimal movement in that subspace, and varies only in when moving in the 1 dimensional radial subspace. However, in a high mass slice, considering second order effects… the p(x) is already near 0, and for movements of O(1) it stays near 0, resulting in relatively little change in p(x). I did some math on that using taylor expansions but I don’t think I ever posted it.

        • > At one point you said: […] “It’s a better draw than any other though.”
          > I think we agree in an abstract sense it’s a better *single* draw than any other, but if we think of MCMC […] then it’s not a draw that’s compatible with that procedure.

          Note that at the very same point I also said – right after the fragment that you quoted – that “It’s bad in the context of a sampling algorithm that uses a draw as initialization for the next draw and works better when both have similar density.”

          So yes, I can agree that “if we think of MCMC […] then it’s not a draw that’s compatible with that procedure” – I essentially said as much in the second sentence of my first comment.

        • So you did. However you also said “Not being a good initialization is no reason to say it’s a bad draw. (Of course a sampling method that returns the mode in every draw would be bad!)”

          So, I guess where the disagreement comes from is maybe “is no reason to say it’s a bad draw”. Is Bob justified in some sense in saying “it’s a bad draw”?

          And the disagreement comes from asking the question “bad in what sense and for what purposes?”. It’s a bad draw to have in any ensemble of typical size such as 10k or less points, which is mostly what almost everyone in the world actually uses… some kind of sampling scheme in which in the end they wind up with a few hundred to thousand points. It’s a bad draw in the sense of including it in such sized samples biases your inference compared to throwing it out.

          Whether it’s because of an initialization scheme, or because of some kind of approximation scheme (variational inference or something) or because of some kind of sampling method that converges slowly or whatever… it’s a bad draw to have in your typical 100 to 10k draw sample

          Note, Carlos, I hope I’m not frustrating you a lot here. It’s hard to know via text messaging, and if you are frustrated maybe we should just call it a day?

          I never know who is in the audience, and this is days later, and I don’t even remember the content of all these posts so it might just be you and me talking about irrelevancies.

        • > Also we were both slightly confused about whether normals decrease in all but one directions, only one direction, or no directions.

          It seems that we still are confused. If the decrease that you care about is the relative change – because what you care about is the acceptance ratio – it has the same magnitude (for a given movement in N-1 dimensional tangent subspace) everywhere.

          I don’t know what do you mean by “relative” when you write that “the p(x) is already near 0, and for movements of O(1) it stays near 0, resulting in relatively little change in p(x)” but just in case I’ll repeat it:

          for the standard multivariate normal the ratio p(x + movement orthogonal to x)/p(x) has the same value exp(-1/2 |movement|^2) everywhere – it depends only on the magnitude of the movement and the value of p(x) is irrelevant.

        • I don’t have any math worked out on that question I can refer to, I’m happy to defer to your calculation for the moment. What I was particularly thinking of is whether or not you tend to wind up sampling points at p(x)/p(mode) near 1, and the answer is no. Any jump away from the mode results in going from a starting point of p(x)/p(mode) = 1 to something much smaller, whereas once you’re away from the mode you wind up with p(x)/p(mode) ~ 0 and staying there for most of the time. Or in the log space log(p(x)) – log(p(0)) is some near-constant value, you’re always sampling in a narrow band of log(p)-log(pmode) even if you’ve got a highly non-gaussian distribution.

  3. I lead a good clean life, so when Andrew’s reply to Roger Koenker contained the word, “vig,” I had to scurry for its meaning:
    ————————————————
    https://fluentslang.com/vig-meaning/

    What Does Vig Mean?

    The slang term vig is a shortened version of the term “vigorish,” which refers to the fee paid to a bookie in sports betting or the interest paid on a loan obtained from an individual rather than a bank. The term “vigorish” originated from the Russian word “vyigrysh,” meaning “winnings or gains.” It was later adopted into Yiddish before becoming a common term in English. The slang term “vig” likely emerged as a quicker and more convenient way to refer to vigorish with the rise of the internet and text messaging. Apart from the two meanings mentioned above, there are no other known definitions for the slang term “vig.”
    —————————————-

  4. Andrew, could you explain this a little bit?

    “[If] you have collinearity or near-collinearity […] you don’t want
    to be fitting unregularized least squares anyway”

    Are you suggesting that regularized least squares, for example lasso
    or ridge, are an option? Or regularization using a prior distribution?

    I understand your statement in the following way: In settings with
    high multicollinearity parameter estimates of least squares are
    unreliable. Adding or removing predictors results in different
    parameter estimates of other predictors; coefficient’s magnitude and
    sign change. If there is no good theoretical reason to include or
    exclude some of those collinear predictors, our model depends
    arbitrarily, and meaningfully, upon hard-to-justify modeling
    decisions.

    I guess an informative prior distribution might help, however this
    assumes that we do have knowledge which predictors to include or
    exclude. But if we know this already, we would not face a
    multicollinearity problem anyway. Furthermore I do not see how lasso
    or ridge could help. Lasso will pick predictors in an arbitrary way;
    ridge will bias coefficients of all predictors in the same way, thus
    we do not really learn anything.

    Would be great to hear your – or somebody else’s – thoughts on this.

    • Huan:

      Yes, if you have collinearity or near-collinearity it can make sense to use a method such as horseshoe (or lasso or ridge) which regularizes by partially pooling coefficient estimates toward zero. I was also thinking about methods such as deep nets, which I’ve heard can be very effective for problems with many predictors (or, as they’re called in that lingo, “features”).

      • Thanks for your reply.

        I am working on such a problem and I used lasso. However, it horribly failed:

        I ran an ecological regression, with many dozens (in some settings some hundred) predictors. Collinearity is often perfect or near perfect. Sometimes they are just linear functions of one another. For example population = population of males + populations of females. For many correlation is high because they are correlated with population. I model migrant flows between pairs of regions, there are 400 regions. Thus there are 400² = 160.000 observations per year.

        The data has about 20 years. For every year I ran lasso on the data. Then I checked which predictors were selected and how large their coefficients were. When plotting years vs coefficients I expected that the paths look “regular”. For example it would be weird when in 2000 the coefficient for GDP per capita is large and positive, in 2001 it is near 0 and and in 2002 it is large and positive again. But this is basically what happened.

        I concluded that lasso is confused which predictors to select and slight variations in the data from year to year result in very different selections. I now think it is better to manually construct predictors such that collinearity is much lower.

        Maybe you have an idea?

        • Lasso can’t identify coefficients for collinear covariates. Imagine you have a covariate x and its coefficient is beta. Then if you duplicate x into x1 and x2 and have coefficients beta1 and beta2, all that lasso will do is make sure that beta1 + beta2 = beta, but otherwise the optimization surface is flat.

          If you use ridge regression, it will make sure that beta1 + bet2 = beta / 2.

          You can combine the L1 regularization of lasso with the L2 regularization of ridge, using the elastic net, which is a mixture of the two.

          But it won’t solve the problem of lasso being unstable. One instability is with respect to the strength of the lasso—the stronger it is, the more coefficients get set to zero, but if you look at the path algorithms which plot coefficients vs. strength of lasso, you’ll see that it’s unstable even without changing data. That’s just a feature of trying to take a point estimate over an unstable problem.

          If instead you just use a ridge-type prior and fit a Bayesian posterior, the coefficients that don’t have much effect will wash out without being set to exactly zero.

          My number one suggestion is to simulate in cases where you know the answer to play around with how these algorithms behave.

        • Thanks Bob,

          just for clarification, for ridge it should be beta1 = beta2 = beta / 2 instead of beta1 + beta2 = beta / 2, right?

Leave a Reply

Your email address will not be published. Required fields are marked *