I guess that one can upload the data, access data that others have posted, and perform some simple types of analysis. It might not sound much, but having a database of data will remove the need for people to provide summaries of it. Anyone interested in the problem can perform the summaries for himself. This will make data analysis much more approachable than before. This can also become competition to existing spreadsheet and statistical software, and a platform for deploying recent research: it is often frustrating for a researcher in statistical methodology how difficult it is to actually enable users to benefit from the most recent advances in the research sphere.

]]>One of the most frequently asked questions in statistical practice, and indeed in general quantitative investigations, is “What is the size of the data?” A common wisdom underlying this question is that the larger the size, the more trustworthy are the results. Although this common wisdom serves well in many practical situations, sometimes it can be devastatingly deceptive. This talk will report two of such situations: a historical epidemic study (McKendrick, 1926) and the most recent debate over the validity of multiple imputation inference for handling incomplete data (Meng and Romero, 2003). McKendrick’s mysterious and ingenious analysis of an epidemic of cholera in an Indian village provides an excellent example of how an apparently large sample study (e.g., n=223), under a naive but common approach, turned out to be a much smaller one (e.g., n<40) because of hidden data contamination. The debate on multiple imputations reveals the importance of the self-efficiency assumption (Meng, 1994) in the context of incomplete-data analysis. This assumption excludes estimation procedures that can produce more efficient results with less data than with more data. Such procedures may sound paradoxical, but they indeed exist even in common practice. For example, the least-squared regression estimator may not be self-efficient when the variances of the observations are not constant. The morale of this talk is that in order for the common wisdom "the larger the better" be trusted, we not only need to assume that data analyst knows what s/he is doing (i.e., an approximately correct analysis), but more importantly that s/he is performing an efficient, or at least self-efficient, analysis.

This reminds me of the blessing of dimensionality, in particular Scott de Marchi’s comments and my reply here. I’m also reminded of the time at Berkeley when I was teaching statistical consulting, and someone came in with an example with 21 cases and 16 predictors. The students in the class all thought this was a big joke, but I pointed out that if they had only 1 predictor, it wouldn’t seem so bad. And having more information should be better. But, as Xiao-Li points out (and I’m interested to hear more in his talk), it depends what model you’re using.

I’m also reminded of some discussions about model choice. When considering the simpler or the more complicated model, I’m with Radford that the complicated model is better. But sometimes, in reality, the simple model actually fits better. Then the problem, I think, is with the prior distribution (or, equivalently, with estimation methods such as least squares that correspond to unrealistic and unbelievable prior distributions that do insufficiant shrinkage).

]]>