A question of infinite dimensions

Constantine Frangakis, Ming An, and Spyridon Kotsovilis write:

Problem: suppose we conduct a study of known design (e.g., a completely random sample) to measure *just a scalar* (say income, or gene expression, an example from Rafael Irizarry), and suppose we get full response.
Question: what data do we actually observe?
Answer: we observe an infinite-dimensional variable, which can carry extra information about how we should analyze the scalar (say to estimate the population mean).

Logic:

1. Suppose we believe that if we had applied the same measurement device to the whole population, we would have had some non-response. Then, in the sample we actually got, the mere fact that we observed *all the data* is itself information: it means that we got a non-representative sample of the population (drawn only from the responders).

2. If we believe that (1) can be true, then we should worry. Conversely, if we do not worry, it implies we believe (1) is false. But there is no measurement device that is a priori guaranteed to work for all units, so we must worry.

3. The key issue now is that we usually believe that, by incorporating the indicator of observation as a new column in the data, we have fully described what we observed. But I suggest we have not. This is because we can now iterate the logic of (1) on the “new data”: the fact {that we observed that we had full response} is also a nontrivial observation, as long as it is measured with a device that can sometimes be fallible. When we iterate this logic, we conclude that we actually observe an infinite sequence of variables.

This is very similar to Gödel’s incompleteness argument, applied to statistics, if we treat a measurement device as a Turing machine. Its practical implication is that it is extremely important to understand the *variation in how* exactly each and every measurement was made, because that variation is extra information **even if (and not only if) we observe all measurements!!**

I didn’t really follow, so I asked Constantine to clarify. He wrote:

Here is an example of the first level.

1. Setting: suppose we are studying the income Y of a city’s population, and in truth Y follows a log-normal distribution and we know that. We are to conduct measurements on units, with a measurement device (e.g., an interviewer) that can *possibly* give no response (if it gives a response, we assume it is true).

2. Data: we now conduct a simple random sample, and with the measurement *device* we use, we get 100% response in the sample. Also, say that from the data we get MLE(median of pr(Y)) = $54K and MLE(SD(log(income))) = 0.43 (i.e., among the full-response sample), or MLE(SD(income)) = $27K.

Question: Should we worry about non-response even if we got full response?
Answer: YES, because under some consideration of non-response we would get a DIFFERENT RESULT than $54K, EVEN IF WE GOT FULL RESPONSE.

3. Example: What can that consideration be, and what answer could we get?

Suppose that if the *same measurement device had been applied to the whole population*, we would have gotten 20% non-response (R=0). Moreover, suppose that this non-response depends on the outcome, in the sense that the ratio of the median income among responders to that among non-responders is .7, which occurs because all incomes < median respond, but only a random 60% of the incomes > median respond. Suppose also that we know this; it gives a model for pr(R|Y).

What is the MLE now, with the same log-normal model but also the pr(R|Y) model? It is $60K. It is important to note that the above MLEs involve no randomness, in the following sense: under the outcome model pr(Y) in part 1 and the pr(R|Y) model in part 3, we have:

A) the median of the true distribution of Y is $60K, but
B) the median of the distribution pr(Y|R=1) is $54K. So, the MLE of $54K if we get full response is not a happenstance but the value we expect to get if we ignore part 3.
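
A minimal sketch of the arithmetic behind A and B, for readers who want to check the numbers. It uses only the log-normal model of part 1 and the response model of part 3. The SD of log(income) in the full population is not given above, so the value 0.42 below is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_quantile(median, sigma, p):
    """p-th quantile of a log-normal with the given median and SD of log(Y)."""
    return median * math.exp(sigma * NormalDist().inv_cdf(p))


def responder_and_nonresponder_medians(true_median, sigma, p_respond_above=0.6):
    """Medians implied by the part 3 response model (requires p_respond_above < 1)."""
    # Everyone below the true median responds; a random fraction
    # p_respond_above of those above the true median responds.
    response_rate = 0.5 + 0.5 * p_respond_above
    # The responder median lies below the true median, where everyone responds,
    # so it is the population quantile at probability 0.5 * response_rate.
    responder_median = lognormal_quantile(true_median, sigma, 0.5 * response_rate)
    # Non-responders are a random subset of the upper half of the population,
    # so their median is the population 75th percentile.
    nonresponder_median = lognormal_quantile(true_median, sigma, 0.75)
    return response_rate, responder_median, nonresponder_median


TRUE_MEDIAN_K = 60.0  # from A), in thousands of dollars
ASSUMED_SIGMA = 0.42  # assumption: SD of log(income) in the full population

rate, med_r, med_nr = responder_and_nonresponder_medians(TRUE_MEDIAN_K, ASSUMED_SIGMA)
print(f"response rate if applied to everyone: {rate:.2f}")
print(f"median among responders (thousands):  {med_r:.1f}")
print(f"responder / non-responder median:     {med_r / med_nr:.2f}")
```

Under that assumed SD, the responder median comes out at about $54K against a true median of $60K, with 20% non-response and a median ratio of roughly 0.68, in line with the figures in part 3.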

Since A and B differ, it matters whether we consider the observation of {the fact that we observed all the data} as important information.
The new observation I am making here is that this is not complete: we also have to consider (otherwise we are making assumptions) the observation that we observed that we observed ..., and this can be iterated to infinity.

The key results are that
Result 1) from a plan to measure just a scalar, we are actually observing infinitely many variables; and
Result 2) there is a bound (like Heisenberg’s uncertainty bound) on how much of this information we can actually use.

I still don’t really understand what Constantine is saying here, but he’s a smart guy, so I’m passing this along in case it interests any of you out there.

3 thoughts on “A question of infinite dimensions”

  1. Geez, I'm in the real world. I'll worry about this argument IF I ever get no nonresponse AND I am somehow able to determine that nonresponse is required. This is not a problem I expect to consider in my lifetime.

    Otherwise, isn't this just a statement that data may be biased, and that we may need to deal with this bias? This would be one type of bias to find, but we're awash in more common types.

  2. I think I understand the argument. Suppose that
    *nobody* with income over $1 million is willing
    to admit this to an interviewer. Then the fact
    that you got 100% responses means that you don't
    have anyone with income over $1 million in the
    sample. If there are actually substantial numbers
    of these people, then your sample may have (by
    chance) given you a substantially wrong result.

    I'm not convinced, however. Don't you know that
    you have no one with income over $1 million in
    the sample even without considering the
    possibility of non-response by such people? It's
    not clear to me that anything is added by the
    infinite number of additional variables…

  3. Seems like the guy has just recreated the concept of selection bias and put a few layers of it one on top of another.
