Read this to change your entire perspective on statistics: Why inversion of hypothesis tests is not a general procedure for creating uncertainty intervals

Dave Choi writes:

A reviewer has pointed me to something that you wrote on your blog about inverting test statistics.

Specifically, the reviewer is interested in what can happen if the test can reject the entire assumed family of models, and has asked me to consider discussing whether it applies to a paper that I am working on. By any chance, do you know of any papers that discuss this and should be cited? Would you like me to cite your blog post directly?

My reply: I’ve never written an article on this, so yes, you can cite the blog post.

I would like to write more about this some day.

42 thoughts on “Read this to change your entire perspective on statistics: Why inversion of hypothesis tests is not a general procedure for creating uncertainty intervals”

  1. If L(\hat{a}, x) is the likelihood evaluated at the maximum, then one can construct a confidence interval for a by perturbing the likelihood to L(\hat{a} - d, x), setting d so that the likelihood ratio test rejects at some level once d is exceeded (there are papers written on this and there is a name for this technique–I’m in a meeting and can’t look it up–it’s from the ’60s or so). Now, even if L is the wrong likelihood it will generally converge (quasi-likelihood, Huber, White, etc.), and the confidence interval will exist even under the wrong specification. This isn’t really a hypothesis test but it uses the same underlying idea. How is this wrong?
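
    A minimal sketch in Python of the generic likelihood-ratio inversion this describes, with a one-parameter exponential model and a 95% chi-squared cutoff as illustrative stand-ins (not the specific perturbation recipe from those papers):

    import numpy as np
    from scipy import stats, optimize

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=50)       # stand-in data

    def negloglik(rate):
        return -np.sum(stats.expon.logpdf(x, scale=1.0 / rate))

    rate_hat = 1.0 / x.mean()                     # maximum-likelihood estimate
    crit = stats.chi2.ppf(0.95, df=1) / 2.0       # likelihood-ratio cutoff

    def excess(rate):
        # log-likelihood drop relative to the maximum, minus the cutoff;
        # the interval endpoints are where this changes sign
        return (negloglik(rate) - negloglik(rate_hat)) - crit

    lower = optimize.brentq(excess, 1e-8, rate_hat)
    upper = optimize.brentq(excess, rate_hat, 50 * rate_hat)
    print(lower, rate_hat, upper)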

  2. Andrew,

    I’m surprised (it would be more correct to say that I’m shocked): the fact that the union of the parameter values accepted by a family of tests of level $\alpha$ is a valid $1-\alpha$ confidence set is a theorem (9.2.2 of Casella and Berger; I was so surprised that I had to go and look it up). But I also follow your argument. So what should I conclude? That such a confidence set is one of the numerous confidence sets which all have the correct coverage probability, but which have very bad properties? Something like the confidence set which is equal to the real line with probability $1-\alpha$ and equal to the empty set with probability $\alpha$, irrespective of the data – coverage probability is perfect, usefulness in estimation is zero.
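
    For reference, the inversion result being cited can be sketched as follows: for each $\theta_0$ let $A(\theta_0)$ be the acceptance region of a level-$\alpha$ test of $H_0: \theta = \theta_0$, and define

    $$C(x) = \{\theta_0 : x \in A(\theta_0)\}.$$

    Then $P_\theta(\theta \in C(X)) \ge 1-\alpha$ for every $\theta$ in the assumed family. The guarantee is conditional on the family: it says nothing about what the set looks like when the data are incompatible with every member of the family, which is the situation raised in the post.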

  3. I take it that most if not all statisticians do not believe that their models are true. They reflect reality very coarsely and are best thought of as approximations, but as approximations to what? An aside: many years ago I told Henry Wynn that I regarded statistical models as approximations. He replied that approximate means close to the truth, which silenced me for a long time. In the following the word approximation applies to the data, as in `the model P is an adequate approximation to the data x’. There is no attempt to approximate some unknown `true’ data generating process. It is `approximation without truth’. The challenge is to provide an account of statistics as the study and application of approximate stochastic models. This has consequences. As there are no true parameter values, there are no confidence intervals, as there are no true parameter values to be covered. Furthermore, as the approximation is to the given data x, there can be no appeal to repetitions of the data. Bayesian statistics is no longer possible, as the additivity of Bayesian priors over the parameter space is based on truth. Replacing truth by adequacy means that the prior is no longer additive, but see
    https://itschancy.wordpress.com/2015/03/28/laurie-davies-and-betting-on-adequacy/
    In the following the word `model’ refers to a single probability measure, not a parametric family of such. The latter usage is an indication of sloppy thinking. A model P is an adequate approximation to the data x if `typical’ data X(P) generated under P `look like’ x. The idea of comparing data x with simulations under a model P is not new. The first reference I have is work by Neyman, Scott and Shane on the spatial distribution of galaxies in 1953. If anybody has earlier references I would be grateful.

    In any concrete application the words `typical’ and `look like’ must be made precise. The word `typical’ is made precise by specifying a number alpha, 0 < alpha <= 1, such that 100alpha% of samples are regarded as typical. It may be interpreted as the degree of approximation required: the larger alpha, the lower the degree of approximation. The words `look like’ are replaced by a subset E(P) such that P(X(P) in E(P)) = alpha. The interpretation is that typical samples X(P) generated under P lie in E(P). If the data x lie in E(P) then they `look like’ a typical sample X(P) and P is regarded as an adequate approximation to the data x. Given a family {\mathcal P} of models the approximation region is defined by {\mathcal A}(x, alpha, {\mathcal P}) = {P in {\mathcal P}: x \in E(P)}.

    To take a very simple example, the family of models is i.i.d. N(mu,1) and the concept of adequacy will be based simply on the mean. On putting alpha = 0.95 the mean {\bar X}_n(mu) of (X_1(mu),…,X_n(mu)) typically lies in E(mu) = (mu - 1.96/sqrt(n), mu + 1.96/sqrt(n)). Thus N(mu,1) will be an adequate approximation to the data x_n if the mean {\bar x}_n also lies in E(mu) = (mu - 1.96/sqrt(n), mu + 1.96/sqrt(n)). Going through the motions it is seen that {\mathcal A}(x_n, 0.95, {\mathcal N}) = ({\bar x}_n - 1.96/sqrt(n), {\bar x}_n + 1.96/sqrt(n)), which is the standard 0.95 confidence interval for mu. The interpretation is completely different. There is no claim that this interval contains the `true’ mu because there is no assumption that the data x_n are some N(mu,1), so there is no `true’ mu to be seen. The interpretation is the following. Take any mu in ({\bar x}_n - 1.96/sqrt(n), {\bar x}_n + 1.96/sqrt(n)), generate lots of data using this mu and calculate their means. Throw away the atypical means, that is the smallest and largest 2.5%. The remainder give the interval of typical means, and the mean of the data {\bar x}_n will lie in this interval.

    If one is interested in the full normal models N(mu,sigma^2) then a decision will have to be made as to what features are of interest. Only a small finite number of features are possible. These could include the mean, the variance, the absence of outliers and the Kuiper distance between the model and the empirical measure, to name four. These can all be combined to define an approximation region. It is clear from the N(mu,1) example that the approximation region can also be interpreted as a confidence region. This is always the case. Consider the approximation region for all normal models N(mu,sigma^2) based on the four features just mentioned. Let us start with, say, a N(0,1) sample of size 50 and let us transform it in a continuous manner to end up with an i.i.d. exponential sample with mean 1. Initially the approximation region for (mu,sigma) will be quite large, but as the sample becomes less and less like a normal sample the approximation region will become smaller. Eventually the approximation will be worse than the limit specified by alpha and the approximation region is empty.

    In a previous discussion on this blog the approximation region was treated as a confidence region. The approximation region becoming smaller was interpreted as an increase in precision. Eventually the region becomes empty, at which point there is a switch from highest precision, one value only, to no precision at all. This is nonsense. Conclusion: think approximation, not truth. In order to provide a reasonable account of statistics the concept of approximation region is not sufficient. Other ingredients are required: regularization, topologies, functionals and stability. For those still reading here is a book: https://www.crcpress.com/Data-Analysis-and-Approximate-Models-Model-Choice-Location-Scale-Analysis/Davies/9781482215861
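
    For concreteness, a small Python sketch of the N(mu,1) example, judging adequacy on the mean alone with alpha = 0.95 (the data and grid bounds are arbitrary stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 27
    x = rng.normal(2.0, 1.0, size=n)              # stand-in data
    xbar = x.mean()

    # mu is adequate if the observed mean lies in the typical set E(mu),
    # i.e. the central 95% interval of means under N(mu, 1)
    mus = np.linspace(xbar - 2.0, xbar + 2.0, 4001)
    adequate = mus[np.abs(xbar - mus) <= 1.96 / np.sqrt(n)]
    print(adequate.min(), adequate.max())         # numerically the usual 95% interval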

    • @laurie re “`the model P is an adequate approximation to the data x’”:

      That is one view. There is another view where the model is considered to be, indeed, an approximate description of the real world, not just of the data at hand, but of the “population” from which the data are drawn.

  4. Martha, indeed models very often are chosen to reflect some aspect of reality. There are no rules for this but sometimes if you are lucky there may be theoretical reasons for a particular family of models. At other times you may simply choose a standard model, linear regression for example. No matter how you get your family of models you will still face the problem of deciding which particular models or parameter values if any are consistent with the data. Choosing a model and then comparing the model and the data are two different activities which may be, and often are, linked in a cycle. They are both necessary; it is not a case of one or the other. Or have I misunderstood you?

    • “No matter how you get your family of models you will still face the problem of deciding which particular models or parameter values if any are consistent with the data”

      I would say the question is whether the data are consistent with the model. And the model, as I use the term, needs to be consistent with the data collection process or any hierarchical or other structures in the context from which the data arise. And the model needs to be one which has a hope of answering the question(s) of interest.

    • I presume Martha means something like: simulated data X are consistent with the data at hand x if X = x + measurement error for some measurement error/tolerance/precision.

      X themselves are the result of an (approximate) stochastic (say) model.

      So an example hierarchical model is

      x|X,e0 ~ p(x|X,e0)
      X|mu,e1~ p(X|mu,e1)
      mu ~ p(mu)

      for ‘error’ parameters e0, e1. The ‘standard’ (non-hierarchical) Bayesian approach corresponds to e0 -> 0, i.e. p(x|X) -> delta(x-X), but there is no reason not to consider the more general structure.
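
      A rough forward simulation of this structure, with normal distributions at every layer chosen purely for illustration (the forms of p are not fixed above):

      import numpy as np

      rng = np.random.default_rng(0)

      def simulate(n, e0=0.01, e1=0.1):
          mu = rng.normal(0.0, 1.0)           # mu ~ p(mu)
          X = rng.normal(mu, e1, size=n)      # X | mu, e1
          x = rng.normal(X, e0)               # x | X, e0 (measurement layer)
          return x, X, mu

      # e0 -> 0 collapses the measurement layer, recovering x = X as in the
      # 'standard' non-hierarchical case
      x, X, mu = simulate(27, e0=1e-8)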

        • OK sorry. I meant the hierarchical model as an example of trying to capture

          “And the model, as I use the term, needs to be consistent with the data collection process or any hierarchical or other structures in the context from which the data arise.”

          in a way that relates to Laurie’s emphasis on approximation in data space. I suppose I missed the mark.

  5. Quoting from linked post:

    > as your fit gets gradually worse, the inference from your confidence interval
    > becomes more and more precise and then suddenly, discontinuously has no
    > precision at all

    Does “gradually worse fit” in this context mean that the assumptions used to
    derive our confidence intervals are inconsistent with the “real” process that
    generated our data? In other words, we are not really working with confidence
    intervals with respect to the underlying process generating the data.

    Would the following be a good example: mean estimation with the additional
    condition that the mean is contained in a specific range, say [0,1] (and the
    CI respects this range), but the real mean is outside this interval?

  6. Martha, ojm, take two simple examples. The first consists of 27 measurements of the amount of copper in drinking water.
    2.16,2.21,2.15,2.05,2.06,2.04,1.90,2.03,2.06,2.02,2.06,1.92,2.08,2.05,1.88,1.99,2.01,1.86,1.70,1.88,1.99,1.93,2.20,2.02,1.92,2.13,2.13
    The second consists of 20 measurements of the speed of light taken by Michelson (see the article by Stephen Stigler): 880,880,880,860,720,720,620,860,970,950,880,910,850,870,840,840,850,840,840,840.
    The family of models is the normal family N(mu,sigma^2). The problem is to specify those (mu,sigma) which are regarded as acceptable for the data, if any. This is typically done by specifying a confidence region or something similar.

    • “The problem is to specify those (mu,sigma) which are regarded as acceptable for the data, if any.”

      How (in a real, applied context) would a problem like this arise?

    • I’m not at my computer, but what happens if one tries a simple ABC rejection algorithm?*

      Note that it has an additional parameter for measuring the distance between simulated data and observed data (or summary statistics of the datasets). I tried to indicate above that this might be thought of as an additional layer in a hierarchical model.

      *https://en.m.wikipedia.org/wiki/Approximate_Bayesian_computation#The_ABC_rejection_algorithm
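
      Something like the following minimal rejection sampler, with a normal location-scale model and (mean, sd) summaries chosen purely for illustration:

      import numpy as np

      rng = np.random.default_rng(0)

      def abc_rejection(obs, prior_sampler, simulator, summary, eps, n_draws=50_000):
          # keep parameter draws whose simulated summaries land within eps of the observed ones
          s_obs = summary(obs)
          kept = []
          for _ in range(n_draws):
              theta = prior_sampler()
              sim = simulator(theta, len(obs))
              if np.linalg.norm(summary(sim) - s_obs) < eps:
                  kept.append(theta)
          return np.array(kept)

      obs = rng.normal(2.0, 0.1, size=27)            # stand-in data
      kept = abc_rejection(
          obs,
          prior_sampler=lambda: rng.uniform([1.5, 0.01], [2.5, 0.5]),
          simulator=lambda th, m: rng.normal(th[0], th[1], size=m),
          summary=lambda y: np.array([y.mean(), y.std(ddof=1)]),
          eps=0.02,
      )
      print(len(kept))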

  7. Martha: How (in a real, applied context) would a problem like this arise?

    Here is the context for the water data. In interlaboratory tests laboratories are given prepared samples with a known quantity of some chemical, in this case copper. The laboratories return their readings which are then analysed by the institute conducting the test. Suppose in the data I gave the sample was prepared with the value 1.9. The question is then whether the measurements the laboratory returned are consistent with the known amount of contamination 1.9. It is a form of quality control. The evaluation of such data is more difficult than simply basing the analysis on the Gaussian model. There are national and international standards for the statistical evaluation of interlaboratory tests. In other words there is a protocol for evaluating such data. I was involved in the formulation of the German standard
    @misc{DIN38402,
    Howpublished = {Deutsches Institut f\"ur Normung},
    Title = {{DIN} 38402-45:2003-09 {G}erman standard methods for the examination of water, waste water and sludge – {G}eneral information (group {A}) – {P}art 45: {I}nterlaboratory comparisons for proficiency testing of laboratories ({A} 45).},
    Year = {2003}}
    Data on contamination are evaluated all the time, not only in interlaboratory tests and not only for drinking water. The amount of contamination in exhaust fumes is a case of topical interest. In spite of its simplicity the problem as I formulated it is instructive, as the ideas involved can be applied to more complicated problems. I could not have asked you to provide a general protocol for the evaluation of interlaboratory tests. It is however possible and quite simple to formulate a protocol based on the Gaussian model. I have now given you a simplified version of a real, applied and very important context and would be interested in your protocol.

    ojm: This is interesting. You can if you wish formulate my problem as how to measure the distance between simulated data and observed data. In the present case I would not simulate data as the relevant quantities can be directly calculated, but in general simulations would be required. In the reference you gave one makes one simulation for every parameter value drawn from the prior. This seems to me to be a bad idea but there are no doubt versions where one simulates many times. In the present case it is quite easy to specify bounds for both parameters based on the data. One can then place a grid on this region and check all parameters on the grid. This seems to be like placing a uniform prior over the region. If one interprets it as a prior then it is explicitly a data-dependent one. One can of course choose a prior based on knowledge of the situation which allows you to specify beforehand what you think are reasonable parameter values. Looked at in this way the prior becomes part of a numerical procedure with the goal of providing adequate parameter values. This is what you did in the paper on tissue repair. Figures 3 – 6 seem to me to be in this vein. The same applies to the paper on Bayesian mixtures by Marin et al., where the relevant figures are 15 and 23.

    Here are my results for the copper data with alpha=0.9. The standard confidence interval for mu is (1.978,2.054). My set of adequate values is (1.945,2.086). Now change the smallest value 1.7 to 1.6. The confidence region is (1.970,2.054), the approximation region is (1.935,2.089). For 1.5 the results are (1.962,2.055) and (1.931,2.086) respectively; for 1.4 (1.954,2.056) and (1.929,1.978) respectively; for 1.36 (1.950,2.056) and (1.928,1.936); for 1.35 (1.949,2.056) and the empty set.
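
    For anyone who wants to play with the numbers, here is a rough Python version of the two calculations, using only the mean as the adequacy feature (so it will not reproduce the intervals above, which also use the standard deviation, the outliers and the Kuiper distance):

    import numpy as np
    from scipy import stats

    copper = np.array([2.16,2.21,2.15,2.05,2.06,2.04,1.90,2.03,2.06,2.02,2.06,1.92,2.08,
                       2.05,1.88,1.99,2.01,1.86,1.70,1.88,1.99,1.93,2.20,2.02,1.92,2.13,2.13])
    n = len(copper)
    xbar, s = copper.mean(), copper.std(ddof=1)

    # standard 90% t interval for mu
    tcrit = stats.t.ppf(0.95, df=n - 1)
    print(xbar - tcrit * s / np.sqrt(n), xbar + tcrit * s / np.sqrt(n))

    # crude adequacy region: keep mu if, for some sigma on the grid, the observed
    # mean is a typical mean under N(mu, sigma^2)
    zcrit = stats.norm.ppf(0.975)
    mus = np.linspace(1.7, 2.3, 601)
    sigmas = np.linspace(0.02, 0.40, 191)
    keep = [mu for mu in mus
            if any(abs(xbar - mu) <= zcrit * sig / np.sqrt(n) for sig in sigmas)]
    print(min(keep), max(keep))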

    • Laurie: The information “In interlaboratory tests laboratories are given prepared samples with a known quantity of some chemical, in this case copper. The laboratories return their readings which are then analysed by the institute conducting the test. Suppose in the data I gave the sample was prepared with the value 1.9. The question is then whether the measurements the laboratory returned are consistent with the known amount of contamination 1.9,” partly answers the question I was trying to ask.

      But still missing is the connection between this information and your earlier statement “The family of models is the normal family N(mu,sigma^2). The problem is to specify those (mu,sigma) which are regarded as acceptable for the data, if any. This is typically done by specifying a confidence region or something similar.”

      In particular:
      Why were normal models chosen?
      Why does the problem “whether the measurements the laboratory returned are consistent with the known amount of contamination 1.9” translate to “specify those (mu,sigma) which are regarded as acceptable for the data, if any” ?
      What does “acceptable for the data” mean in this context?
      Why is this “typically done by specifying a confidence region or something similar”?

      In other words, I was trying to ask for the justifications/rationale/definitions/reasoning behind the modeling and protocol you outlined.

      To possibly clarify more:
      1. My questions partly go back to your earlier comment: “At other times you may simply choose a standard model.”

      I am uneasy with the “choose a standard model” part. In the copper example, was the Gaussian model chosen just because it is “standard”?

      2. I agree with you that models can only be approximate, but I believe that in problems such as the copper data, there is a “truth” out there (e.g., the concentration in the samples given is known!), and the modeling needs to aim at describing the “truth” as well as practically possible.

  8. Stanislaw: sorry I missed your contribution. Perhaps the last section of my reply to Martha and ojm goes some way to answering your question. I start off with the real data, and for what it is worth they are real data, and then slowly decrease the smallest value from 1.7 to 1.36. This creates, so to speak, an outlier which becomes less and less consistent with any normal model. This is shown by the decrease in size of my approximation interval. It is to be interpreted as follows: the range of values of mu which are consistent with the data becomes smaller and smaller. For the value 1.35 there are no values of mu which are consistent with the data and the region is empty. The problems start when people interpret this region as a confidence region. As the length of a confidence region is taken as a measure of precision, and as the confidence interval contains (with a certain probability) the `true’ parameter values, it seems as though the data are getting worse and worse but we are getting ever more precise estimates of this true value. This seems nonsense to me. The interpretation as an approximation region makes sense, at least to me. Although I did not do it, you can give a measure of how good the approximation is. I do this using P values but their interpretation is different from the usual one, assuming a usual one exists. Anyway the P values for 1.7, 1.6, 1.5, 1.4, 1.36 are respectively 0.305, 0.185, 0.131, 0.046 and 0.0252. When the P value drops below 0.0125, which is the case for 1.3584, the approximation region is empty. The smaller the P value, the worse the approximation. The value 0.0125 derives from the alpha=0.9. As a guide, if you start with normal data of size n=27, simulated not real, then in 90% of the cases the P-value is larger than 0.305. This is already an indication, although a weak one, that the normal model is not all that good for the data. In 98.5% of the cases it is larger than 0.185. Thus if you replace 1.7 by 1.6 you are on the edge, so to speak, of acceptability of the normal model for the water data.

  9. Stanislaw: Just reread your post.

    Does “gradually worse fit” in this context mean that the assumptions used to
    derive our confidence intervals are inconsistent with the “real” process that
    generated our data? In other words, we are not really working with confidence
    intervals with respect to the underlying process generating the data.

    This is not the way I think. I dislike “real” in quotes, and I dislike “true” in quotes. If you have to put “true” in quotes something is wrong. I avoid “true” but am prepared to use true. I think there is indeed a true but unknown amount of copper in the sample of drinking water. I do not believe it has a true mu from the normal distribution attached to it. Here is a sentence from my first contribution.

    There is no attempt to approximate some unknown `true’ data generating process.

    • Laurie:

      I believe what Stanislaw is referring to is a Measurement Model used to assign values to amounts of copper in the water. In Structural Equation Models a Measurement Model is estimated to allow for the assessment of measurement error separately from Residual Error in the Predictive Model of a health outcome (as an example).

      • Wow, that was not very clear. Allow me to rephrase:

        In structural equation modeling, there are typically two parts of a model. One is the measurement model which, as the name suggests, models the data generating process, and the other is the structural model which models the causal relationships at the latent variable level. One of the argued benefits of this, though not universally held, is that it allows us to model measurement error separately from the residual error of the causal relationship between latent predictor and outcome variables.

  10. RE the original subject of the post and my comments above – interesting that ABC and/or similar sampling implementations of Bayes so clearly involve accept/reject decisions for the construction of the posterior from prior/model samples.

    Doesn’t seem so dissimilar to hypothesis test inversion after all?

  11. Martha, James: Deeper and deeper. Let us take the simple coin toss. To simplify matters I shall assume that the coin is tossed by a machine and not some human being: I do not want to model the human being as there are some expert coin tossers amongst them. In Germany the Lottozahlen are `chosen’ by a machine and the whole process is televised. What is the `true generating mechanism’? It depends at what level you want to discuss this. We could start with Newtonian mechanics. The model would then be that of a deterministic chaotic process, chaotic in the sense that very small changes in the initial conditions result in different outcomes, heads or tails: http://perlikowski.kdm.p.lodz.pl/papers/PR2008.pdf. You may try to model this chaotic system when analysing the results of coin tossing or you may decide to use the simple binomial model. But if you choose the latter option, in what sense is the simple binomial model an approximation to the deterministic chaotic process behind the results? You may not like the deterministic part and would prefer an explanation at the quantum level with a random collapse of the wave packet. This does not help. Bohmian mechanics is a deterministic chaotic version of quantum mechanics.

    Modelling real data is very difficult so we can simplify matters if we stay within mathematics. Take the binary expansion of any complex (Solomonoff-Kolmogorov-Chaitin) number in (0,1). Martin-Löf has shown that it will pass all tests for independence. Take pi, which is not a complex number, so there is no Martin-Löf result for pi. People have studied the binary expansion of pi up to several million places and again all tests for independence have been passed, at least as far as I am aware. If anybody has a reference where this is not the case I would be grateful for it. This is also the case for a random number generator. These are chaotic deterministic algorithms which produce ‘random’ numbers. Finally there are probabilistic theorems in number theory. In all these cases the binomial model is accepted simply on the grounds that the numbers produced by these deterministic chaotic systems ‘look like’ binomial random variables. Can you go beyond this and give me a deeper sense in which the binomial model approximates the true data generating mechanism? I suspect not, but it would be very interesting if someone could give one. When the hurlyburly’s done you are still going to have to decide whether the data are compatible with the model.
    Let me take the following model for the water data: X = mu + sigma*epsilon, where the epsilon are i.i.d. with distribution function F. On putting F = Phi we get X ~ N(mu,sigma^2). For the water data I now speculatively and provisionally identify mu with the true quantity of copper in the water sample. There is no true value of mu because the data do not come with a true (F,mu,sigma^2) attached. We are now going to specify a range of plausible values for mu on the basis of the data and the model. These plausible values can then be compared with the known amount of copper, and then using the identification we can decide whether the laboratory has done its job properly or not. Specifying a range of values will depend on F and on what we do with the data given F.

    Given two distribution functions F and G, how well can we distinguish between them on the basis of the data? One would be reluctant to use the binomial model for the copper data simply on the basis of the data alone. We can generate data with the F distribution by F^-1(U) where U is uniform on (0,1). The same for G with the same U. If F^-1(U) and G^-1(U) are very close together then you will not be able to distinguish between F and G on the basis of the data. In fact if the data are given to finite precision it may be that F^-1(U) = G^-1(U). Roughly speaking (I can be more exact but this would alter nothing) F and G are going to be indistinguishable at the level of the data if the Kuiper (or Kolmogorov) distance between them is very small, d_{ku}(F,G) < 10^-6 for example. None of this is altered if I restrict everything to distributions F with an infinitely differentiable density function, which I now do.

    Suppose a distribution function F passes a goodness-of-fit test, Kolmogorov or chi-squared for example; it is now standard practice for frequentists to calculate the maximum likelihood and for Bayesians to postulate a prior and then calculate the posterior. Goodness-of-fit is based on F; the formal analysis is based on the density f of F. F and f are connected by the differential operator D(F) = f, that is, differentiate F to get f. D is an invertible but unbounded linear operator and therefore pathologically discontinuous. The result is that different models, all of which fit the data, can give very different results. Here are the 95% confidence intervals for the normal, Laplace and comb models for the copper data based on maximum likelihood: (1.973,2.063), (1.991,2.072), (1.9684,1.9694). The log-likelihoods are 20.31, 20.09, 31.37. All three models fit the data. One notices here the extremely small confidence interval for the comb model. A Bayesian analysis does not help. The posterior for the comb model is essentially concentrated on the interval (1.9684,1.9694). From a mathematical point of view this is not surprising. The differential operator D is pathologically discontinuous and consequently the choice of F is ill-posed; the problem must be regularized. Without going into further details I decide to regularize by minimizing Fisher information. To be more specific I restrict consideration to all distributions with a finite variance and then minimize the Fisher information. The answer is a Gaussian model. The net result is that the model for the data is X ~ N(mu,sigma^2), and once again how would you specify those mu and sigma consistent with the data? I would be reluctant to accept mu=100, sigma=10^-6 on the grounds that the data do not look anything like that.
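
    For concreteness, a small Python function for the Kuiper distance between the empirical distribution of a sample and a candidate model F (a sketch; the comb model itself is not reproduced here):

    import numpy as np
    from scipy import stats

    def kuiper_distance(x, cdf):
        # d_ku = D+ + D-, the largest deviations of the empirical distribution
        # function above and below the model distribution function
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        F = cdf(x)
        d_plus = np.max(np.arange(1, n + 1) / n - F)
        d_minus = np.max(F - np.arange(0, n) / n)
        return d_plus + d_minus

    # e.g. the copper data against a fitted normal model:
    # kuiper_distance(copper, stats.norm(copper.mean(), copper.std(ddof=1)).cdf)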

    In my first post I wrote that other ingredients beside approximation regions are required and mentioned explicitly regularization, topologies, functionals and stability. The above gives some indication of what I meant by regularization and topologies. I have not answered all your questions but this is already long and I may have completely misinterpreted them. I will leave it at that for today.

    • Also, regarding your point about the goodness of a distribution: there are two senses of “good”.

      1) A sequence of numbers can be considered to be a sample where the distribution D is the frequency distribution.

      2) A distribution D can be considered to be a reasonable choice to model the plausibility that a particular value coming out of a physical process will eventually be found to equal a given value x.

      (1) is impossible to determine very well with “small” datasets (say less than a million or a billion values or something like that). (1) is inherently a statement about infinite sequences. Infinite sequences are non-physical.

      (2) simply requires that none of the small finite number of data values be outside the “typical set”.

      (2) is a kind of regularization: we’ll accept anything that has the actual data in its typical set. Of course that could be a variety of discrete distributions, so we typically regularize further than that. Maximum entropy is frequently appealed to.

      The key difference between inference as you posit it and Bayesian inference is that Bayesian inference works with the plausibility of the data under the model without reference to sampling. There is no actual requirement in the likelihood portion of the Bayesian model that each data point be the result of an iid sample from the SAME distribution whose frequency distribution is equal to its probability distribution, whereas your inference works with the plausibility of the model as a generator of the data by iid sampling from a frequency distribution.
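
      One rough way to operationalise sense (2), reading “typical set” as “the average log density of the data is not extreme relative to simulated samples of the same size” (only one of several possible readings):

      import numpy as np
      from scipy import stats

      def in_typical_set(x, dist, alpha=0.95, n_sim=10_000, seed=0):
          # compare the observed mean log-density with its simulated distribution
          rng = np.random.default_rng(seed)
          n = len(x)
          obs = dist.logpdf(np.asarray(x)).mean()
          sims = dist.logpdf(dist.rvs(size=(n_sim, n), random_state=rng)).mean(axis=1)
          lo, hi = np.quantile(sims, [(1 - alpha) / 2, (1 + alpha) / 2])
          return lo <= obs <= hi

      print(in_typical_set(np.random.default_rng(1).normal(0, 1, 27), stats.norm(0, 1)))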

    • @Laurie: “I have not answered all your questions but this is already long and I may have completely misinterpreted them.”

      Yes, I don’t think you have answered my questions; we seem to be talking past each other. I don’t really know how to explain my questions any better, so I’ll just drop out of this discussion.

  12. Daniel Lakeland: (0) pi. I wrote:
    Take pi which is not a complex number so there is no Martin-Löf result for pi.
    In other words it is clear that I know that there are short computer algorithms for computing pi. Maybe you just wanted to give me a reference but I was already aware of it.

    You may only have two senses of the word good; I have more, and they are all for finite samples such as the copper data. The dogma is to give an adequate approximation to the data at hand, which includes the sample size.
    (1) I agree so this is why I never talk about infinite sequences, see above.
    (2) I don’t understand your first (2) but if what you mean by this is your second (2) then we seem to agree. Your typical set is presumably my E(P) \subset R^n. This in itself is not sufficient and you suggest regularizing by maximum entropy for discrete distributions. This is not always appropriate but it certainly can be a sensible suggestion. It applies only to discrete distributions, as the entropy of any continuous distribution is infinite. For a continuous distribution with density f you could use \int f \log f, which is also acceptable but not a measure of entropy. It is the case that minimum Fisher information and using \int f \log f give the same answer, the normal distribution, but \int f \log f is a fiddle under the name of maximum entropy. It is the first order correction to an expression whose first term is infinity. I prefer minimum Fisher information, which is clearly better. If I minimize Fisher information over all distributions including the discrete ones the answer is finite and results in a distribution with an absolutely continuous density. If I do the same for maximum entropy the result is infinite and results in any distribution which is not discrete. I have the impression that those who use densities have a double identity. Sometimes they are particles using a weak topology and hence can apply the central limit theorem when going from discrete to continuous. At other times they are waves and cannot use the central limit theorem for going from discrete to continuous.
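
    For reference, the location-family version of the minimum Fisher information claim can be sketched as follows: with

    $$I(f) = \int \Big(\frac{f'(x)}{f(x)}\Big)^2 f(x)\,dx,$$

    Stam’s inequality gives $I(f)\,\mathrm{Var}(f) \ge 1$, with equality exactly for the Gaussian, so among all distributions with variance $\sigma^2$ the minimum Fisher information is $1/\sigma^2$, attained by $N(\mu,\sigma^2)$.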

    whereas your inference works with the plausibility of the model as a generator of the data by iid sampling from a frequency distribution.

    Well yes it does work then but are you trying to say that it only works then? This is the impression I get. Are you saying for example that my ideas will not work in the area of non-parametric regression?

    • RE pi, I was just pointing out an interesting formula, which others might not know.

      RE entropy in continuous distributions… I think this is an interesting topic and I will write a blog on it using nonstandard analysis. http://models.street-artists.org look for it in a few days.

      Re inference: I was mainly pointing out that your view identifies inference on the parameter with inference on whether the specified distribution is “satisfactory” for the data, whereas Bayesian analysis takes the “satisfactoriness” of the distribution as given and calculates the probability of the data under the model.

      What alterations to the axioms of Cox are required to show that your method is the method that satisfies the new axioms? I would be interested in the answer to that question.

  13. Two comments on maximum entropy. (1) If you maximize subject to a typical set you will end up with the coarsest discrete distribution which satisfies the condition. (2) Any number which claims to measure the entropy of a distribution must be invariant with respect to 1-1 transformations. This is the case for discrete measures. It is not the case for \int f \log f. It is not even invariant if you change the units.
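
    For concreteness, the lack of invariance under a change of units can be written out: if Y = aX with a > 0 and X has density f, then

    $$-\int f_Y(y)\log f_Y(y)\,dy \;=\; -\int f(x)\log f(x)\,dx + \log a,$$

    so rescaling the units shifts the value by $\log a$, whereas the discrete entropy $-\sum p_i \log p_i$ is unchanged under any 1-1 relabelling of the outcomes.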

    I agree with your description of Bayesian statistics. It also applies to much of frequentist statistics, for example David Cox’s book Principles of Statistical Inference. In this world there are two phases of statistical analysis. The first is EDA for want of a better expression. This is followed by formal statistical inference. You can of course iterate this. EDA is based on distribution functions and has a weak topology. Formal inference is based on densities with a strong topology. There is a double flip, from distributions to densities and from weak to strong. I never felt comfortable with statistics and it took some time before I understood why, at least to my own satisfaction. I use the word model in the sense of a single probability distribution. This is not the standard meaning in statistics although it is the standard meaning in the machine learning world.

    Let us suppose you want to know whether to base your formal statistical analysis on the family of Gaussian models. Suppose one observation has the value 10, to take your own example. Then you convincingly argued that this is not consistent with the standard N(0,1) model. It is however consistent with the N(10,1) model. This is exactly what I do. I specify those parameter values which are consistent in a well-defined sense with the data. To be explicit I consider the mean, the standard deviation, the size of outliers and the Kuiper distance between the empirical and model distributions. Each acceptable parameter pair comes with four numbers and this is where I stop. I take it that although you agree that some parameters are not consistent with the data (the N(0,1) and N(100,10^-6) cannot both be consistent) you now move to the second phase and accept the model in your sense. But your meaning of the word model is not mine. When you go to the second stage you take all the parameters with you, including those which are not consistent with the data. Furthermore, by using likelihood you will now base any further analysis on the mean and standard deviation of the data. The information given by the outlier measure and the Kuiper distance will be thrown away. Even the information given by the mean and standard deviation will be reduced to a single number. Moreover the likelihood is blind. If the data are 0-1 and the model is N(mu,sigma^2) and I give you the likelihood based on the model, you have no way of knowing that the model is completely false. It seems to me that you use EDA to guide you into an area close to the data, but to get the fine detail you reduce everything to a single number which is blind. This seems very odd to me. If there is a conflict between the result of your likelihood based analysis and a post hoc EDA evaluation then the EDA result takes precedence.

    Consider non-parametric regression using wavelets. Suppose one does an analysis; then he/she will not report the likelihood. Instead a particular result, say the mean or median result, will be taken and a figure included showing the original data and the resulting function. If this function is way off, say due to an undetected outlier, then a search begins for the reason and a correction. In other words the post hoc EDA is decisive. From this it should be clear that I reject all attempts to formalize expressions such as statistical evidence, for example that done by Michael Evans, the relative belief ratio. The whole project is misguided in my opinion. Given an event A with probability P(A), suppose the event C occurs. Then if P(A|C) > P(A) we have evidence for the truth of A. This considers arbitrary sets A and C (subject to measurability). For what subsets A can we usefully calculate probabilities? Vapnik argues that you can only learn from Vapnik-Cervonenkis classes. I think this is correct, which is one, but not the only, reason why I use a weak topology based on V-C classes for all my statistical analysis.

    • “You reduce everything to a single number [a likelihood function based on assumptions which can never be questioned/changed] which is blind” was, I believe, Fisher’s biggest early mistake [more exactly, as in brackets], and why I think a heavy focus on sampling distributions is misplaced.

  14. Martha (and Laurie): Not sure whether you’re still reading this. I can’t give Laurie’s answer but I can give mine (which may not be to Laurie’s liking but I’d be happy to have his point of view):
    “Why were normal models chosen?” Laurie addressed this one, I think.

    “Why does the problem “whether the measurements the laboratory returned are consistent with the known amount of contamination 1.9” translate to “specify those (mu,sigma) which are regarded as acceptable for the data, if any” ?”
    I’d think that we need to be in some sense frequentist here; I’m not so sure whether this is what Laurie intends. I think for this to make sense we need to have a “frequentist mindset”, by which I mean that we interpret probability models as mechanisms generating data in a frequentist way. We do *not* have to believe that reality really works like this, but we should accept that this is a useful way for human beings to think about reality and to reduce the infinite complexity that reality has when thinking about all the details. If we then specify for example a confidence interval (or a Davies-style adequacy region), we have a set of models that are consistent with the data, whereas we can exclude many other models. To make use of this we need to adopt “frequentist thinking” consciously, and knowing that this is a mode of thinking, not a statement about how the world really is. So we can say that the values in the adequacy region are “plausible values” under this way of thinking, which connects a parameter value that may have a clear physical interpretation with an idealised idea about a mechanism generating observations.

    I always thought that it is a problem of frequentism that it is often presented in a way that takes the models and the frequentist interpretation too literally. If you look at frequentist teaching, you see that things are presented as “we’re assuming that N(a,\sigma^2) is true and this needs to be checked.” However, if you ask the statisticians who teach this whether they really believe that the assumptions hold in reality, most will tell you “of course not”. I’m interested in exploring how to use frequentist thinking without claiming that this is how reality really works; making clear that the model assumptions do *not* indicate how reality has to be for a method to work, but rather characterise an idealised situation in which it works; and more work, including idealisation, is needed to connect reality to this. As far as I understand Laurie, his approach does this, too, although he may be reluctant to use the term “frequentist” because of all the bad associations and misunderstandings that it may involve.

    Using models consciously as tools for thinking without implying naive ideas about how the world works is not for the Bayesians alone.

    There are still two further questions:
    “What does “acceptable for the data” mean in this context?” (This may have been addressed by the above.)
    “Why is this “typically done by specifying a confidence region or something similar”?” The idea here is to check model by model whether it is “adequate”, i.e., consistent with the data, or not, and this leads to something that can formally be described as a confidence set (although the interpretation is somewhat different).

    • Hmm, “I always thought that it is a problem of frequentism [Bayesianism] that it is often presented in a way that takes the models and the frequentist [posterior probability] interpretation too literally.”

      I believe models are less implicit in frequentism, but taking probabilities as _real_ rather than just models/representations to emulate what happens empirically is more common in Bayes.

      Of course, it depends heavily on your sampling frame – currently reviewing a frequentist analysis where the authors state they only did standard ANOVA on outcome measures that were proven to be normally distributed by various tests of normality.

  15. Daniel Lakeland: Jaynes propagated maximum entropy. Did he use this concept to justify the normal distribution? He always insisted on taking limits in the correct manner. If you do this for maximum entropy the result for the normal distribution is infinity. Did this worry him or am I missing something here?
    I describe the second phase of statistical analysis as the ‘behaving as if true’ phase which I think correctly describes it. Of course behaving as if true does not imply believing to be true. Bayesians and frequentists both do it. It seems to be standard.

    • Laurie, I think Jaynes was concerned about it, but I’m not an expert on Jaynes. I hope you will like my blog post. The gist of it will be that we can define the nonstandard maximum entropy problem absolutely without any logical problems, and it actually maps better to what applied people are doing (approximating say the output of a 32 bit D/A converter “as if” it were continuous for example) and the nonstandard maximum entropy problem will pick out a nonstandard distribution whose standardization is the normal distribution.

    • Laurie:

      http://models.street-artists.org/2016/01/14/differential-entropy-and-nonstandard-analysis/

      It can be helpful to have some background in IST and nonstandard analysis. The basics are probably on my blog somewhere if you search, but also the following:

      1) With the added axioms of IST we can prove that there is an integer larger than any standard integer (an unlimited integer), and there is a number greater than 0 but smaller than any standard positive number (an infinitesimal).

      2) Every limited number has a “nearest standard number” called its standardization. Using this you can also find standardizations of functions, i.e. given a nonstandard function f(x) with limited values (i.e. none of the values are “unlimited”) we can find a standard function g(x) which at any standard argument is only infinitesimally far from f(x), and this g is the standardization of f.

      The gist then is that by following the template of entropy maximization on a finite sample space we can pick out p_i for all the nonstandard infinitesimal intervals between some range -N to N with N unlimited. We can then construct a nonstandard “step” function, and find that its standardization is the one we expect (the normal(0,1) curve for example in the case where we maximize entropy subject to mean 0 and sd = 1).

      At the end I show how Jaynes’s solution, “the limiting density of discrete points”, was the same conceptually as using a nonstandard uniform distribution to rescale the entropy by a nonstandard constant so that the entropy value being maximized becomes standard.

  16. Christian: this is addressed to your comments but not, so to speak, to you personally, as it includes things we have already discussed. How does one choose models? To start with, the copper data. Such data often have outliers, so the Gaussian model I postulated is much too simple. Nevertheless let us assume that it has proved satisfactory in the past. Moreover past experience has shown that if the measurements are taken with care then they can be well approximated by some Gaussian model. Moreover the mean of the data provides a good estimate of the true amount of copper in the sample, in the sense that a 95% confidence interval covers the true value in 95% of the cases. Furthermore the standard deviation provides a good measure of how precisely the laboratory works. Sometimes however even the data from good laboratories have the occasional deviant observation. In the data I gave the value 1.7 looks somewhat deviant. This will have to be taken into account. All the foregoing means that the choice of the model is based on previous data and so is in part frequency based. The approximation region I use for the normal model takes all this into account as well as the Kuiper distance. For n=27 the latter is so weak that one could dispense with it. Finally one can provide an overall measure of the degree of approximation based on the family of normal models. For the copper data this is 0.3. Normal data exceed this value in about 90% of the cases so the approximation is not brilliant. This low value is caused by the 1.7.

    It is not always the case that a model is based on past experience. Sometimes a special model is required for a specific data set. At other times there is sufficient well-founded knowledge for the model to be based on this alone. At other times one may wish to know if A influences the probability of B without any more motivation than that it might possibly do so. One then takes a standard model, say logistic regression, and includes A in the covariates. The process of arriving at a model is too heterogeneous for there to be any methodology for it.

    My original contribution to this blog was on the topic stated in the title: “Why inversion of hypothesis tests is not a general procedure for creating uncertainty intervals” and “Why it doesn’t make sense in general ….” The figure in the heading is context free. My first contribution was context free, but it didn’t take long before I felt the need to provide simple examples and ended up at one point talking about chaotic systems. I suppose in retrospect that this was inevitable. The concept of truth is so wedded to statistical philosophy – a confidence interval contains the true value etc., additivity of Bayesian priors and posteriors – that I suppose it is not easy to think in terms of approximation: ban the word truth when talking about parameter values and reserve it for the real world. The word approximation was interpreted in terms of how well your model reflects, approximates, corresponds to the subject matter, the data collecting process and so on. Well, this can be and was discussed, and I have/had no objections to doing so, but it is not what I meant. At the end of the day you have the data and the model and you have to decide which, if any, parameter values are consistent with the data; that is, in my language, you have to specify an approximation region. It is possible to do this in a context free manner. No matter, the discussion is interesting even if it took a turn I was not expecting.
