1. I finished a few.

2. I was not particularly successful here. Easier said than done, I suppose.

3. I’d give myself a solid B on this one.

4. The guys at the bike shop did this one. Also fixed my front brake, which apparently was hanging on by literally one strand of cable.

5. Yep.

6. No comment.

7. Nope. But did get a paper accepted on my struggles in this area.

8. Not yet. This one’s a push to 2007.

9. Don’t recall.

10. Nope. But I didn’t get any further behind, either. Gotta work harder recording the amusing incidents.

11. Didn’t do so well on this one (even though, by the look of it, it seems like the easiest of resolutions).

12. Enough unfinished business here that I don’t think I need anything new for 2007. Well, ok, here’s one: I’d like to go to the movies at least once.

More interesting are the different perspectives that one can have on high-dimensional data analysis. Donoho’s presentation (which certainly is still relevant six years later) focuses on computational approaches to data analysis and says very little about models. Bayesian methods are not mentioned at all (the closest is slide #44, on Hidden Components, but no model is specified for the components themselves). It’s good that there are statisticians working on different problems using such different methods.

Donoho also discusses Tukey’s ideas of exploratory data analysis and explains why Tukey’s approach of separation from mathematics no longer makes sense. I agree with Donoho on this, although perhaps from a different statistical perspective: my take on exploratory data analysis is that (a) it can be much more powerful when used in conjunction with models, and (b) as we fit increasingly complicated models, it will become more and more helpful to use graphical tools (of the sort associated with “exploratory data analysis”) to check these models. As a latter-day Tukey might say, “with great power comes great responsibility.” See this paper and this paper for more on this.

I was also trying to understand the claim on page 14 of Donoho’s presentation that the fundamental roadblocks of data analysis are “only mathematical.” From my own experiences and struggles (for example, here), I’d interpret this from a Bayesian perspective as a statement that the fundamental challenge is coming up with reasonable classes of models for large problems and large datasets–models that are structured enough to capture important features of the data but not so constrained as to restrict the range of reasonable inferences. (For a non-Bayesian perspective, just replace the word “model” with “method” in the previous sentence.)

I recently came across your “Statistical Modeling, Causal Inference, and Social Science” website in my attempt to determine the best analysis for my research. As there were some inquiries about whether GEE is a better approach than multilevel modeling, I was hoping you could help with my dilemma.

I am interested in neighborhood (defined as census tract) influences on childhood diabetes risk in the city of Chicago. Although I have a little over 1200 cases, ~40% of my tracts have only 1 case, and the average number of cases per tract is 5. GEE has been suggested as a better approach than HLM, but I am not getting much support for this option…. Any suggestions for the best approach, or articles that might provide some insight?

My quick response: see here.

My longer response:

I think of GEE and multilevel (hierarchical) models as basically the same thing, with the main difference being that GEEs focus on estimating a nonvarying (or average) coefficient in the presence of clustering, whereas MLMs (HLMs) focus on estimating the aspects of the model that vary by group.

Looking further, there are differences in taste: GEEs appeal to people who don’t like distributional assumptions, whereas MLMs appeal to people who like generative models. I prefer MLMs because I like to set up an explicit model for the data; others prefer GEEs because they like to have a procedure that estimates parameters in the absence of assumptions for how the coefficients vary. To give myself the last word on this: I like MLMs because they are expandable to more complex models, and also because often I actually am interested in the varying coefficients (particularly varying slopes, as here).

**Technical issues**

To get to your technical question: there is no problem whatsoever in fitting MLMs when most of your groups have only 1 case, and the average number of cases per group is 5. No problem at all, and I have no idea why anyone would say otherwise. (See Section 12.9 of our new book for more on this, in particular the second full paragraph on page 276.)
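To see why singleton groups pose no problem, here is a minimal sketch (my own illustration, not the example from the book) of the classical partial-pooling estimate with known variance components: each group's estimate is a precision-weighted average of its own mean and the grand mean, so a tract with a single case is simply shrunk harder toward the overall average.

```python
import statistics

def partial_pool(groups, sigma2, tau2):
    """Shrink each group's sample mean toward the grand mean.

    groups: dict mapping group id -> list of observations
    sigma2: within-group (data) variance, assumed known
    tau2:   between-group variance, assumed known
    """
    all_obs = [y for ys in groups.values() for y in ys]
    grand_mean = statistics.fmean(all_obs)
    pooled = {}
    for g, ys in groups.items():
        n = len(ys)
        # Precision-weighted average: groups with few cases
        # (even n = 1) are shrunk more toward the grand mean.
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)
        pooled[g] = w * statistics.fmean(ys) + (1 - w) * grand_mean
    return pooled

# Tracts with a single case are no obstacle: their estimates
# just lean heavily on the grand mean.
data = {"tract_a": [2.0],
        "tract_b": [1.0, 3.0, 2.0, 4.0, 5.0],
        "tract_c": [0.0]}
est = partial_pool(data, sigma2=1.0, tau2=1.0)
```

Nothing here breaks down as group sizes shrink; the weight `w` just approaches the prior weight, which is exactly the behavior you want.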

You also might be interested in this article, which compares GEE and hierarchical logistic regression.

The probability of getting brain cancer is determined by the number of younger siblings. So claim some scientists, according to an article published in the current issue of The Economist.

I have ordered your book so that I can read more about controlling for intermediate outcomes, but I am not yet confident enough to tackle it myself. Perhaps you might blog this?

I’ll give my thoughts, but first here’s the scientific paper (by Altieri et al. in the journal Neurology), and here are the key parts of the news article that Bruce forwarded:

Younger siblings increase the chance of brain cancer

IT IS well known that many sorts of cancer run in families; in other words you get them (or, at least, a genetic predisposition towards them) from your parents. . . . Dr Altieri was looking for evidence to support the idea that at least some brain cancers are triggered by viruses and that children in large families are therefore at greater risk, because they are more likely to be exposed to childhood viral infections. . . .

Dr Altieri describes what he discovered when he analysed the records of the Swedish Family Cancer Database. This includes everyone born in Sweden since 1931, together with their parents even if born before that date.

More than 13,600 Swedes have developed brain tumours in the intervening decades. In small families there was no relationship between an individual’s risk of brain cancer and the number of siblings he had. However, children in families with five or more offspring had twice the average chance of developing brain cancer over the course of their lives compared with those who had no brothers and sisters at all.

Digging deeper, Dr Altieri found a more startling result. When he looked at those people who had had their cancer as children or young teenagers he found the rate was even higher–and that it was particularly high for those with many younger siblings. Under-15s with three or more younger siblings were 3.7 times more likely than only children to develop a common type of brain cancer called a meningioma, and at significantly higher risk of every other form of the disease that the researchers considered. . . . the mechanisms by which younger siblings have more influence than elder ones are speculative. . . . An alternative theory is that a first child may experience a period when his immune system is particularly sensitive to certain infections at about the age when third and fourth children are typically born. . . .

OK, now my thoughts. There are two issues to address here: first, what exactly did Altieri et al. find in their data analysis, and, second, how can we think about causal inference for birth order and the number of siblings?

**What did Altieri et al. find?**

The main results in the paper appear to be in Table 2, where the brain cancer risk is slightly higher among people with more siblings. The overall risk ratios, normalized at 1 for only children, are 1.03, 1.06, 1.10, and 1.06 for people with 1, 2, 3, or 4+ siblings, respectively. The table gives a p value for the trend as 0.005, but I think they made a mistake, because, in R:

> x <- 0:4
> y <- c(1, 1.03, 1.06, 1.10, 1.06)
> summary(lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.012000   0.019950  50.727 1.69e-05 ***
x           0.019000   0.008145   2.333    0.102

Residual standard error: 0.02576 on 3 degrees of freedom
Multiple R-Squared: 0.6446, Adjusted R-squared: 0.5262
F-statistic: 5.442 on 1 and 3 DF, p-value: 0.1019

The p-value seems to be 0.10, not 0.005.
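For readers without R handy, the same arithmetic can be checked by hand; a pure-Python sketch of the simple-regression slope, its standard error, and the t statistic:

```python
import math

x = [0, 1, 2, 3, 4]
y = [1.00, 1.03, 1.06, 1.10, 1.06]  # risk ratios from Table 2
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

slope = sxy / sxx                    # least-squares slope: 0.019
intercept = ybar - slope * xbar      # 1.012
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r ** 2 for r in resid) / (n - 2)  # residual variance on 3 df
se_slope = math.sqrt(s2 / sxx)       # about 0.0081
t = slope / se_slope                 # about 2.33

# The two-sided 5% critical value of t on 3 df is 3.182, so
# |t| = 2.33 corresponds to p of roughly 0.10, not 0.005.
```

The t statistic of about 2.33 on 3 degrees of freedom matches the R output above and falls well short of significance at the 5% level.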

Stronger results appear in Tables T-1, T-2, and T-3 (referred to in the paper and included in the supplementary material at the article’s webpage). Risk ratios for brain cancer are quite a bit higher for kids with 3 or more younger siblings, and lower for kids with 3 or more older siblings.

In all these tables, results are broken down by type of cancer, but sample sizes are small enough that I don’t really put much trust into these subset analyses. A multilevel model would help, I suppose.

**Causal inference for birth order**

How to think about this causally? We can think about the number of younger siblings as a causal treatment: if little Billy’s parents have more kids, how does this affect the probability that Billy gets brain cancer? But how do we think about older siblings? I’m a little stuck here: if I try to compare Billy as an only child to Billy as the youngest of three children, it’s hard to think of a corresponding causal “treatment.”

Thinking forward from treatments, suppose a couple has a kid and is considering having another. One could imagine the effect of this on the first child’s probability of brain cancer. One can also consider the probability that the second child has brain cancer, but the comparison of the two kids would not be “causal” (in the Rubin sense). This is not to dismiss the comparison–I’m just laying out my struggle with thinking about these things. Similar issues arise in other noncausal comparisons (for example, comparing boys to girls).

**The reluctant debunker**

Finally, I’m sensitive to Andrew Oswald’s comment that I’m too critical of innovative research–even if there are methodological flaws in a paper, its conclusions could still be correct. My only defense is that I’m responding to Bruce’s request: I wasn’t going out looking for papers to debunk.

In any case, I’m not “debunking” this paper at all. I don’t see major statistical problems with their comparisons; I’m just struggling to understand it all.

So, I was happy to notice Ilya Grigorik’s analysis of the distributions of the dataset. In particular, the average user seems to be centered at 3.8 (on a scale from 1-5), indicating that people do try to watch movies they like. But the uneven distribution of score variance across users indicates that one could model the type of user, perhaps with a mixture model:

I must also note that NetFlix users have an incentive to score movies even with lukewarm scores, which moderates the above distribution. On most internet sites that allow users to rank content, the extreme scores (1 or 5) are overrepresented: some people make the effort to write a review only when they are very unhappy and want to punish someone, or when they are very happy and want to reward or recommend the work to others.

Another interesting source of rating distributions is the Interactive Fiction Competition results page: it has numerous histograms of scores for individual IF works.

The entries themselves were pretty funny, but I also liked the comment on the atomic energy kit entry by the guy with “a comfortable six-figure salary.” Maybe if he’d had a little less radiation exposure as a child, he’d have a comfortable seven-figure salary by now . . .

The left column of maps above shows our estimate of the states supporting Bush and Gore based on the voters in each of 5 income categories. The most striking pattern is that, unsurprisingly, Gore does better if the vote is restricted to lower-income voters, and Bush does better at the high end.

The next pattern–which I think is really cool–is that the “red state, blue state” pattern of the coasts vs. the south and center of the country basically disappears for the poorest voters. At that extreme, it’s just not true that the rich states support Gore and the poor states support Bush. For rich voters, however, the pattern is clear: Gore wins in California and a few rich northeastern states, and Bush wins the rest. These graphs dramatize that the “red and blue state” patterns are most relevant for the richest voters.

The column of graphs on the right shows which states support Bush and Gore *more than the national average* within each income category. Again, the pattern for the richest voters is similar to the national map (with the Pacific coast and northeast/upper-midwest being blue), but the states show a different pattern at low incomes.

Just to be clear: I’m not claiming that these patterns are, or should be, some kind of surprise. Rather, I’m pointing out that many of the familiar distinctions between red and blue states are most relevant for the richest voters. It doesn’t have to be this way, and I don’t think it was that way before 1992, but this is what we’re seeing now, and it’s a way to understand the pattern that we discussed in our paper, that income is more predictive of Republican vote in poor states than in rich states.


The Current Index of Statistics lists all statistics articles published since 1960 by author, title, and key words. The CIS includes articles from a multitude of journals in various fields—medical statistics, reliability, environmental, econometrics, and business management, as well as all of the statistics journals. Searching under anything that contained the word “data” in 1995–1996 produced almost 700 listings. Only eight of these mentioned Bayes or Bayesian, either in the title or key words. Of these eight, only three appeared to apply a Bayesian analysis to data sets, and in these, there were only two or three parameters to be estimated.

Actually, our toxicology paper appeared in the Journal of the American Statistical Association in 1996—how could Breiman have missed that one (our model had 90 parameters, and the paper had a detailed discussion of why the prior distribution was needed in order to get reasonable results)? Was he restricting himself to papers with “data” in their keywords? Putting “data” as a keyword in an applied statistics paper is something like putting “physics” as a keyword in a physics paper!

**OK, OK . . .**

My point here isn’t to pick on Breiman, who isn’t around to defend himself (when we were both at Berkeley, I tried to talk with him about Bayesian methods, but we never found the time for the conversation, something I strongly regret in retrospect), but rather to reiterate a point I’ve made elsewhere, which is how our attitudes toward methods are so strongly shaped by our direct experiences. Continuing my quoting from the Breiman article:

I [Breiman] spent 13 years as a full-time consultant and continue to consult in many fields today—air-pollution prediction, analysis of highway traffic, the classification of radar returns, speech recognition, and stockmarket prediction, among others. Never once, either in my work with others or in anyone else’s published work in the fields in which I consulted, did I encounter the application of Bayesian methodology to real data.

. . .

All it would take to convince me [about Bayesian methods] are some major success stories in complex, high-dimensional problems where the Bayesian approach wins big compared to any frequentist approach. . . . A success story is a tough problem on which numbers of people have worked where a Bayesian approach has done demonstrably better than any other approach.

Now that these success stories are out there (and are reachable with almighty Google—which puts the Current Index of Statistics to shame—or by flipping through various textbooks), I suppose Breiman would have been convinced. What’s funny is that he couldn’t just say that he had made great contributions to statistics, and others had made important contributions to applied problems using Bayesian methods. He had to say that “when big, real, tough problems need to be solved, there are no Bayesians.”

**Pluralism**

I think that a more pluralistic attitude is more common in statistics today, partly through the example of people like Brad Efron who’ve had success with both Bayesian and non-Bayesian methods, and partly through the pragmatic attitudes of computer scientists, who neither believe the extreme Bayesians who told them that they must use subjective Bayesian probability (or else—gasp—have incoherent inferences) nor the anti-Bayesians who talked about “tough problems” without engaging with research outside their subfields.

My impression is that there’s a lot more openness now, and a willingness in evaluating methods to go beyond the two poles of pure subjectivism (like those Bayesians at the 1991 Valencia meeting who were opposed *in principle* to checking model fit) and barren significance testing (like those papers that used to appear in the statistical journals with tables and tables of simulations of coverage probabilities). It’s refreshing to see the errors of even the experts of a decade ago—perhaps this will give us courage to make our own rash statements which can in their turn be overtaken by reality.

While the conventional way for making inferences from observations goes through the use of conditional probabilities (via the Bayes identity), there is an alternative. It consists in introducing some new definitions in Probability Theory (image and reciprocal image of a probability, intersection of two probabilities), that are accompanied by a compatibility property. The resulting theory is simple, accepts a clear Bayesian interpretation, and naturally incorporates the Popperian notion of falsification (for us, falsification of models, not of theories). The applications of the theory in the domain of inverse problems shall be discussed.

Unfortunately I can’t make the talk. I can’t figure out what he’s saying in the abstract, but the topic interests me. If anybody knows more about this, please let me know.

P.S. Brian Borchers writes,

Tarantola has been writing about Bayesian approaches to geophysical inverse problems for some time. He has recently (2005) published a book on inverse problem theory (Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM 2005) that you might find interesting.

The “image of a probability” doesn’t appear in the SIAM book, but it is the topic of Tarantola’s new book, “Mapping of Probabilities”. You can download a draft (or at least the first two chapters) from his web site at http://www.ipgp.jussieu.fr/~tarantola/

My favorite part of this graph is the title–it really personalizes the data. See more here.

(Although I can’t figure out why it’s classified under Topology.)

There’s lots of other cool stuff there, including this cascade of bifurcations:


I [Jessee] show that most people do in fact have some level of policy ideology that has an important effect on their voting behavior. The influence of party identification, however, is also quite strong. Judging from the baseline of Downsian policy voting, I show that independents, even those with lower levels of political sophistication, perform quite well on average, and engage in essentially unbiased spatial policy voting. Partisans of similar levels of sophistication, by contrast, are systematically pushed away from more rational decision rules and seem to be making biased choices in translating their policy preferences into vote choices. On the whole, it seems clear that party identification operates more as a systematic bias than a profitable heuristic.

Continuing, Jessee writes,

One school sees party identification as informing and shaping people’s views of the political elements they encounter—a lens through which citizens see the political world. The other side, by contrast, sees party identification as a product of citizens’ experiences with the two parties. . . . Bartels (2002) refutes the claim that people incorporate new information into their political beliefs in a rational, mostly unbiased manner. He argues that Republicans and Democrats differ in the way in which they update their beliefs based on new information. In addition to this, Bartels demonstrates that this phenomenon cannot be solely the result of differing preferences, since respondents’ statements about objective facts, such as whether unemployment increased or decreased during the Reagan administration, are also heavily colored by party identification.

OK, now my thoughts:

1. The basic message reminds me of Joe Bafumi’s paper on “the stubborn American voter.” Joe’s paper also has the time dimension–stubbornness has increased since the 1970s–and I’d like to see that in the Jessee model also.

2. The particular point–that Dem identifiers vote more for the Dem, and Rep identifiers vote more for the Rep, even after accounting for ideological distances from the candidates–is something that Jeff Cai and I noticed here (see the intercepts in the estimated models on page 10 of our paper). However, we didn’t really take much notice of this pattern, since we’d expect to see it. Jessee goes a bit further than we do in considering a more elaborate model as well as using vote choices for senators as well as president.

3. The idea of putting together an individual-voting model and an ideal-point model looks cool. I haven’t looked at the details carefully, but I’m surprised there’s no distance model (for example, (ideal point of voter – ideal point of candidate 1)^2 – (ideal point of voter – ideal point of candidate 2)^2).

4. I don’t really like the description on pages 27-28 of Bayesian inference in terms of prior and posterior beliefs. I know that a lot of people think of it this way, but I’d rather think of Bayesian inference as a way of fitting a model to data and working out its implications (see our Bayesian book, especially chapters 1 and 2 (for more on where prior distributions come from) and chapter 6 (for more on the use of Bayesian methods to check model fit)).

5. I like all the graphs! Regarding the tables: Table 1 should lose the horizontal lines, also I’d order the bills from the most liberal to the most conservative ideal point (as indicated by the roll call votes). Table 2 should be a graph, of course, or else just a simple display of fitted equations. All those significant figures aren’t needed.

6. Is Josh Clinton really a Jesuit priest or is that S.J. just a peculiarity in how the references were done?
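To be concrete about point 3, here is a hypothetical sketch of the kind of quadratic-distance voting rule I have in mind (the logistic link and all the names here are my own assumptions for illustration, not anything from Jessee’s paper):

```python
import math

def p_vote_candidate1(voter, cand1, cand2, beta=1.0):
    """Probability of voting for candidate 1 under quadratic spatial utility.

    The predictor is the difference in squared ideological distances,
    (voter - cand2)^2 - (voter - cand1)^2, so being closer to
    candidate 1 pushes the probability above 1/2.
    """
    d1 = (voter - cand1) ** 2
    d2 = (voter - cand2) ** 2
    return 1.0 / (1.0 + math.exp(-beta * (d2 - d1)))

# A voter at -0.5 with candidates at -1 (cand1) and +1 (cand2)
# is closer to candidate 1, so the probability exceeds 1/2.
p = p_vote_candidate1(-0.5, -1.0, 1.0)
```

Party-ID “bias” in this framework would then show up as an intercept added to the linear predictor, shifting the curve for partisans even when the distance term is zero.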

**The big issue**

Returning to point 2 above, the big thing that Jessee is emphasizing is that party ID is predictive of vote, even after conditioning on issue positioning. This is something that’s so familiar that I’ve never before thought of it as something worth commenting on. Democrats (mostly) vote for Democrats, Republicans mostly vote for Republicans–that’s what it’s all about. Jessee is putting a twist on this by calling party voting biased, with his key assumption being that if voting were unbiased, then people would vote only on the issues, and party ID would add no predictive power.

I’m not quite sure what to think about this. It’s an audacious claim–taking a familiar observation about party ID and turning it into a stylized fact that needs explaining. I have a couple of problems with the reasoning, though. First, I could imagine that it would be rational to vote for my party’s presidential candidate, even if I were otherwise indifferent on the issues, for other reasons, including nominations to the Supreme Court and other appointive positions, the desire to put a check on the other party’s power in Congress, and anticipated future policy questions. Thinking more generally, if I take representative government seriously, I might vote for the representative I like/trust (from the party I trust) rather than voting on specific issues or ideological positions, which might not be relevant for the future.

Beyond this, I’m not so comfortable with calling these differences “biases” or even “heuristics.” Party ID just seems so fundamental, in so many ways. But maybe I’m just thinking in an old-fashioned way; I’m willing to be convinced by Jessee, Rivers, and others on this.


The exciting thing (for me) is that we’re hiring Fellows in statistics. The position would involve interdisciplinary science teaching to Columbia undergrads, and research with me (and my collaborators) in applied statistics. This is a great postdoc opportunity, especially for people who want to move into an academic career.

Americans today “know” that a majority of the population supports the death penalty, that half of all marriages end in divorce, and that four out of five prefer a particular brand of toothpaste. Through statistics like these, we feel that we understand our fellow citizens. But remarkably, such data–now woven into our social fabric–became common currency only in the last century. Sarah Igo tells the story, for the first time, of how opinion polls, man-in-the-street interviews, sex surveys, community studies, and consumer research transformed the United States public. . . . Tracing how ordinary people argued about and adapted to a public awash in aggregate data, she reveals how survey techniques and findings became the vocabulary of mass society–and essential to understanding who we, as modern Americans, think we are.

As a survey researcher, this looks interesting to me. Parochially, I’m reminded of our own observation that in the 1950s it was more rational to answer a Gallup poll than to vote. Nowadays, most of us are participants as well as consumers of surveys.

A coauthor and I recently encountered a bit of uncertainty regarding an underlying assumption of the negative binomial regression (NBREG) and were wondering if anyone had any advice on how to proceed. Our question centers on whether the NBREG model is capable of handling interdependence between counts, and, if so, what kind of interdependence is it designed to capture?

In several texts authors suggest using an NBREG model instead of a Poisson model when overdispersion is present. In examples overdispersion is often attributed to one of two causal mechanisms. The first, which we call an “omitted variable effect”, occurs when there is some unobserved variable present in the data that makes some units/subjects have higher counts than others. A common example is the number of published papers an assistant professor produces in a year. We cannot assume the rate of publication is constant because professors will vary in their productivity for a number of reasons that are specific to each individual. A similar example has to do with how well sports teams perform across a season. Some teams will score at a higher rate than others because of a variable we cannot observe. In these examples, there is an interdependence within individual professors and within individual teams.

The second causal mechanism could be called “success breeds success”. In this case, the individual counts are not independent of one another because success in one period might encourage the subject to make another attempt. For example, a successful sales pitch on Wednesday for a door-to-door salesman may encourage him to try again on Thursday. Another example might be the number of violent episodes mentally ill patients undergo in a given year. One hypothesis might be that a violent episode in time t leads to an increased probability of a violent episode in time t+1 (a cathartic effect is also possible, where a violent episode in time t reduces the probability that the patient will undergo a violent episode in time t+1). Under this causal mechanism the contagion effect or interdependence is across time.
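The “omitted variable” mechanism above can be simulated directly: if each unit’s Poisson rate is drawn from a gamma distribution, the marginal counts are negative binomial, and the variance exceeds the mean. A minimal sketch (the parameters here are arbitrary, chosen only to make the overdispersion visible):

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Knuth's algorithm; fine for the small rates used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(1)
shape, scale = 2.0, 2.0  # gamma heterogeneity: mean rate = shape * scale = 4

counts = []
for _ in range(20000):
    rate = rng.gammavariate(shape, scale)  # unit-specific "omitted variable"
    counts.append(poisson_draw(rate, rng))

m = statistics.fmean(counts)
v = statistics.variance(counts)
# Marginally the counts are negative binomial: theoretical mean 4 and
# variance mean + mean^2/shape = 12, far above the Poisson value
# (variance = mean), which is exactly the overdispersion NBREG targets.
```

The “success breeds success” mechanism is different: it induces serial dependence within a unit over time, which this cross-sectional mixing story does not capture.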

After searching the literature, we are left with two questions.

1. Are NBREG models meant to handle interdependence? (While there seems to be a consensus of “yes” on this answer, several publications suggest the exact opposite. One paper, in fact, went to great lengths to demonstrate why and how current NBREG models need to be modified to be capable of handling non-independence).

2. If NBREG models can handle non-independence, which kind are they meant to handle? Non-independence within subjects, where some omitted variable accounts for why some subjects have higher counts than others, or non-independence across time, where a success in time t leads to a second attempt in time t+1?

Constructing Efficient MCMC Methods Using Temporary Mapping and Caching

I [Radford] describe two general methods for obtaining efficient Markov chain Monte Carlo methods – temporarily mapping to a new space, which may be larger or smaller than the original, and caching the results of previous computations for re-use. These methods can be combined to improve efficiency for problems where probabilities can be quickly recomputed when only a subset of `fast’ variables have changed. In combination, these methods also allow one to effectively adapt tuning parameters, such as the stepsize of random-walk Metropolis updates, without actually changing the Markov chain transitions used, thereby avoiding the issue that changing the transitions could undermine convergence to the desired distribution. Temporary mapping and caching can be applied in many other ways as well, offering a wide scope for development of useful new MCMC methods.

This reminds me of a general question I have about simulation algorithms (and also about optimization algorithms): should we try to have a toolkit of different algorithms and methods and put them together in different ways for different problems, or does it make sense to think about a single super-algorithm that does it all?

Usual hedgehog/fox reasoning would lead me to prefer a toolkit to a superalgorithm, but the story is not so simple. For one thing, insight can be gained by working within a larger framework. For example, we used to think of importance sampling, Gibbs, and Metropolis as three different algorithms, but they can all be viewed as special cases of Metropolis (see my 1992 paper, although I guess the physicists had been long aware of this). Anyway, Radford keeps spewing out new algorithms (I mean “spew” in the best sense, of course), and I wonder where he thinks this is all heading.
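As a toy instance of the caching idea: in random-walk Metropolis one can store the log density of the current state, so the target is evaluated only once per iteration, at the proposal. A sketch on a standard normal target (this is just the textbook algorithm, not Radford’s full mapping-and-caching framework):

```python
import math
import random

def log_target(x):
    # Unnormalized standard normal log density
    return -0.5 * x * x

def metropolis(n_iter, stepsize, rng):
    x = 0.0
    logp_x = log_target(x)  # cached: updated only when a move is accepted
    draws = []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, stepsize)
        logp_prop = log_target(prop)
        # Accept with probability min(1, p(prop)/p(x))
        if math.log(rng.random()) < logp_prop - logp_x:
            x, logp_x = prop, logp_prop
        draws.append(x)
    return draws

rng = random.Random(2)
draws = metropolis(50000, 2.4, rng)
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
# The sample mean and variance should be near 0 and 1.
```

Caching matters little when the target is this cheap, but when only a subset of “fast” variables changes between evaluations, the same trick of reusing stored computations is where the efficiency gains come from.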

P.S. The talk was great; slides are here.

]]>and here’s the abstract:

We find that income matters more in “red America” than in “blue America.” In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference. In addition to finding this pattern and studying its changes over time, we use the concepts of “typicality” and “availability” from cognitive psychology to explain how these patterns can be commonly misunderstood. Our results can be viewed either as a debunking of the journalistic image of rich “latte” Democrats and poor “Nascar” Republicans, or as support for the journalistic images of political and cultural differences between red and blue states — differences which are not explained by differences in individuals’ incomes. We have also found similar patterns in election polls from Mexico.

Key methods used in this research are: (1) plots of repeated cross-sectional analyses, (2) varying-intercept, varying-slope multilevel models, and (3) a graph that simultaneously shows within-group and between-group patterns in a multilevel model. These statistical tools help us understand patterns of variation within and between states in a way that would not be possible from classical regressions or by looking at tables of coefficient estimates.
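The varying-intercept, varying-slope model in item (2) has, in generic notation (mine, not a quote from the paper), the form:

```latex
\Pr(y_i = 1) \;=\; \operatorname{logit}^{-1}\!\bigl(\alpha_{s[i]} + \beta_{s[i]}\, x_i\bigr),
\qquad
\begin{pmatrix} \alpha_s \\ \beta_s \end{pmatrix}
\sim \operatorname{N}\!\left(
\begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix},\, \Sigma \right),
```

where y_i indicates a Republican vote, x_i is respondent i’s income, and s[i] is the respondent’s state; the state-varying slopes β_s are what allow income to matter more in “red America” than in “blue America.”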

Maybe someone will ask a question about rent-seeking. (See the comments here.)

And, for those of you who have bothered to read this far, here’s a brand-new graph just for you:

MS, OH, and CT represent poor, middle-income, and rich states, respectively, and the red, blue, and gray lines on each plot represent frequent church attenders, occasional church attenders, and nonattenders. We’re still trying to make sense of it all.

]]>The New Yorker

Humor on the slopes at Beaver Creek

Featuring Dennis Miller

Need a lift this winter? Join the laughter in Beaver Creek, Colorado. On the last weekend in February, The New Yorker Promotion Department’s Humor on the Slopes event fills the famed destination with three days of highly elevated comedy. Resort to laughter with performances by comedians including Dennis Miller, appearances by New Yorker cartoonists, a comedy-film sneak preview, and much more.

Yeah, yeah, I know, that’s how they can afford to pay Ian Frazier and the rest of the gang. But still . . .

On the other hand, Harold Ross was from Aspen so maybe it all makes sense.

P.S. Yes, “Dennis Miller” is in boldface in the original.

P.P.S. Typo fixed.

]]>I am creating a logistic model on 230 cases (4 categorical explanatory variables; about 25% of the cases are 1s, and 75% are 0s in the dependent variable). And I get accuracy of 65%. As a further validation, I bootstrapped the 230-case sample 1000 times (with replacement), and ran the obtained model through those 1000 samples, getting accuracies in the range of 57% to 68%. Is that an acceptable validation method? Or is bootstrapping “without” replacement with fewer cases better? Or is this kind of validation in general wrong? (Problem is that I have no test sample.)

I’m a little confused here. How can you get an accuracy of 65% when simply predicting 0 all the time gives an accuracy of 75%! This doesn’t sound like such a great model…
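To make the baseline concrete, here is a minimal sketch using the class balance from the question (57 ones and 173 zeros, matching the stated 25/75 split):

```python
# Class balance from the question: about 25% ones and 75% zeros, n = 230.
y = [1] * 57 + [0] * 173

# The trivial classifier that always predicts 0:
baseline_acc = y.count(0) / len(y)
print(round(baseline_acc, 3))  # 0.752
```

Any model worth keeping should beat this number; better still, compare models on a proper score such as log-loss, which rewards calibrated probabilities rather than raw accuracy.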

]]>I was wondering: who are the consumers of the long-tail items? I’d conjecture that the people who buy books in the long tail are, on average, buyers of many books. Similarly, I’d conjecture that the rarefied few who read our blog read many other blogs as well. In contrast, the average buyer of a bestseller such as The Shangri-La Diet might not be buying so many books, and, similarly, the average reader of BoingBoing might not be reading so many blogs.

Or maybe I’m wrong on this, I don’t know. I’m picturing a scatterplot, with one dot per book (or blog), the x-axis showing the number of buyers (or readers), and the y-axis showing the average number of books bought (or blogs read) per week by people who bought that book (or read that blog). Or maybe there’s a better way of looking at this.

The question is: is the “long tail” being driven by a “fat head” of mega-consumers?
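The scatterplot described above is easy to sketch from purchase records; here is a toy version with entirely made-up data:

```python
from collections import defaultdict

# Toy purchase data (made up): reader -> set of books bought.
purchases = {
    "r1": {"bestseller"},
    "r2": {"bestseller"},
    "r3": {"bestseller", "niche1"},
    "r4": {"niche1", "niche2", "niche3"},
    "r5": {"niche2", "niche3"},
}

# For each book: x = number of buyers,
#                y = mean number of books bought by those buyers.
basket_sizes = defaultdict(list)
for reader, books in purchases.items():
    for book in books:
        basket_sizes[book].append(len(books))

points = {book: (len(sizes), sum(sizes) / len(sizes))
          for book, sizes in basket_sizes.items()}
print(points)
```

In this toy data the conjecture holds: the bestseller has more buyers but a smaller average basket (3 buyers, 4/3 books each) than the niche titles (2 buyers, 2.5 books each). Whether real data looks like this is exactly the open question.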

]]>I guess that one can upload data, access data that others have posted, and perform some simple types of analysis. It might not sound like much, but having a database of datasets will remove the need for people to provide summaries of them: anyone interested in a problem can compute the summaries for himself. This will make data analysis much more approachable than before. It could also become a competitor to existing spreadsheet and statistical software, and a platform for deploying recent research: it is often frustrating for a researcher in statistical methodology how difficult it is to actually enable users to benefit from the most recent advances in the field.

]]>One of the most frequently asked questions in statistical practice, and indeed in general quantitative investigations, is “What is the size of the data?” A common wisdom underlying this question is that the larger the size, the more trustworthy are the results. Although this common wisdom serves well in many practical situations, sometimes it can be devastatingly deceptive. This talk will report two such situations: a historical epidemic study (McKendrick, 1926) and the most recent debate over the validity of multiple imputation inference for handling incomplete data (Meng and Romero, 2003). McKendrick’s mysterious and ingenious analysis of an epidemic of cholera in an Indian village provides an excellent example of how an apparently large sample study (e.g., n=223), under a naive but common approach, turned out to be a much smaller one (e.g., n<40) because of hidden data contamination. The debate on multiple imputation reveals the importance of the self-efficiency assumption (Meng, 1994) in the context of incomplete-data analysis. This assumption excludes estimation procedures that can produce more efficient results with less data than with more data. Such procedures may sound paradoxical, but they indeed exist even in common practice. For example, the least-squares regression estimator may not be self-efficient when the variances of the observations are not constant. The moral of this talk is that in order for the common wisdom “the larger the better” to be trusted, we not only need to assume that the data analyst knows what s/he is doing (i.e., an approximately correct analysis), but more importantly that s/he is performing an efficient, or at least self-efficient, analysis.

This reminds me of the blessing of dimensionality, in particular Scott de Marchi’s comments and my reply here. I’m also reminded of the time at Berkeley when I was teaching statistical consulting, and someone came in with an example with 21 cases and 16 predictors. The students in the class all thought this was a big joke, but I pointed out that if they had only 1 predictor, it wouldn’t seem so bad. And having more information should be better. But, as Xiao-Li points out (and I’m interested to hear more in his talk), it depends on what model you’re using.

I’m also reminded of some discussions about model choice. When considering the simpler or the more complicated model, I’m with Radford that the complicated model is better. But sometimes, in reality, the simple model actually fits better. Then the problem, I think, is with the prior distribution (or, equivalently, with estimation methods such as least squares that correspond to unrealistic and unbelievable prior distributions that do insufficient shrinkage).
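Xiao-Li’s point that more data can give worse answers under an inefficient procedure is easy to simulate. In this sketch (all numbers made up), the unweighted mean of precise (sd = 1) plus noisy (sd = 10) observations loses badly to the mean of the precise half alone:

```python
import random

# Estimate mu from two groups: m precise observations (sd = 1) and m noisy
# ones (sd = 10). The unweighted mean of all 2m observations, an inefficient
# procedure here, has larger mean squared error than using the precise half.
random.seed(0)
mu, m, reps = 5.0, 50, 2000
err_all = err_precise = 0.0
for _ in range(reps):
    precise = [random.gauss(mu, 1) for _ in range(m)]
    noisy = [random.gauss(mu, 10) for _ in range(m)]
    est_all = sum(precise + noisy) / (2 * m)  # uses "more data"
    est_precise = sum(precise) / m            # uses less data
    err_all += (est_all - mu) ** 2
    err_precise += (est_precise - mu) ** 2

print(err_all / reps, err_precise / reps)  # first is much larger
```

The theoretical variances are (1 + 100)/(4m) ≈ 0.505 versus 1/m = 0.02, a factor of about 25; a properly weighted (i.e., efficient) analysis would, of course, do at least as well with all the data.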

]]>Given a first and last name, it estimates the number of people in the US with the same name. They take the data from the 1990 Census and make an assumption that the first and last name are uncorrelated. There is a brief section on accuracy here. It might be a bit silly, but it at least provides an easy way to look up Census name frequencies (assuming their scripts work correctly). >From a research perspective, if such a website proves popular, perhaps one could use the same basic idea and produce better estimates by including first x last name correlation, and maybe add the functionality to collect user data like basic demographics, etc. to use with “how many x’s you know” surveys.

Wow, the “>From” in his email really takes me back . . .

Anyway, for first names, I prefer the Baby Name Voyager, which has time series data and cool pink-and-blue graphics, but it is convenient to have the last names too. By assuming independence, I think this will overestimate the people named “John Smith” and underestimate the people named “Kevin O’Donnell.” (I once looked up John Smith in the white pages and found that, indeed, it’s less common than you’d expect from independence. Which makes sense, since if you’re named Smith, you’ll probably avoid the obvious “John.” Unless it’s a family name, or unless you have a sense of humor, I suppose.)

But Matt comments:

Also, I think this might be a good tool for teaching undergrads. In my class we just covered the basic rules of probability and I tried to get across the idea of independence of events. A name like Jose Cruz provides a good example of things that are not independent.

I’m down with that. And it could be a cool class project to do some checking of phone directories. The violation of independence is reminiscent of the dentists named Dennis.
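For the class-project angle, the independence calculation itself is one line; here is a sketch with hypothetical frequencies (not actual Census figures):

```python
# Hypothetical name frequencies (NOT actual Census figures).
US_POP = 300_000_000
p_first = {"John": 0.0327, "Kevin": 0.0067}
p_last = {"Smith": 0.0101, "O'Donnell": 0.0004}

def estimate_count(first, last):
    """Estimated number of bearers, assuming first and last names are
    independent; the assumption the post says fails for John Smith."""
    return US_POP * p_first[first] * p_last[last]

print(round(estimate_count("John", "Smith")))
```

Checking a phone directory against such estimates would show the direction of the dependence: fewer John Smiths and more Kevin O’Donnells than independence predicts, if the conjecture above is right.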

]]>I have a question regarding to difference between ANOVA and Multilevel Models. When do you shift from ANOVA to a Multilevel Model? For example, I have a data structure as follow (data set incomplete):

Number of species   Habitat type
12   A
24   A
12   A
32   A
22   A
21   A
21   A
12   B
32   B
32   B
23   B
21   B
22   B
12   B
32   B
12   C
34   C
43   C
34   C
23   C
22   C

The data for each habitat type was taken from different fragments or patches with different areas (in hectares). That is (for the first two rows), 12 species were detected in a 5 ha patch of habitat A, and 24 species were detected in a 30 ha patch of habitat A.

Habitat A has 7 replicates, habitat B has 8 replicates, and habitat C has 6 replicates.

I want to know if the number of species differs by habitats.

Is the data structure appropriate for a multilevel model, or do I need to do the analysis with a GLM using a log link and a Poisson distribution? Do I need to consider the size of the patch (maybe as an offset variable)? Also, is there a difference between a random effect (as in ANOVA) and a random term (as in a multilevel model)?

First off, Anova and multilevel modeling are closely connected: both are ways of using a linear model to structure data, partitioning effects into batches. That is, each row of an Anova table corresponds to a batch of linear predictors which, in the corresponding multilevel model, would be modeled exchangeably. See here and here.

Second, what’s up with the number of species? It’s bizarre that these are all formed from 1’s, 2’s, 3’s, and 4’s. Why are there never, for example, 18 species in a patch?

Third, a natural model to fit would be an overdispersed Poisson regression with log link, using log (patch area) as an offset, and using a multilevel model with habitat type as the grouping. But, with only 3 groupings, you’ll actually get similar results by just setting group A as the baseline and including indicators for B and C in your overdispersed Poisson regression model.
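A sketch of the non-multilevel version of this advice, using the counts from the question (patch areas beyond the two given in the email are made up; this fits a plain Poisson, since overdispersion would inflate the standard errors but not change these point estimates):

```python
import numpy as np

# Species counts from the question; areas are hypothetical except the first
# two (5 ha and 30 ha), which appear in the original email.
species = np.array([12, 24, 12, 32, 22, 21, 21,       # habitat A
                    12, 32, 32, 23, 21, 22, 12, 32,   # habitat B
                    12, 34, 43, 34, 23, 22])          # habitat C
area = np.array([5, 30, 12, 20, 8, 15, 10,
                 18, 25, 22, 9, 14, 11, 7, 28,
                 16, 30, 35, 24, 13, 10], dtype=float)
habitat = np.array(list("AAAAAAA" + "BBBBBBBB" + "CCCCCC"))

# Design matrix: intercept (habitat A is the baseline) plus B, C indicators.
X = np.column_stack([np.ones(len(species)),
                     habitat == "B",
                     habitat == "C"]).astype(float)
offset = np.log(area)  # log(patch area) enters with coefficient fixed at 1

# Poisson regression with log link, fit by iteratively reweighted least
# squares (Newton's method for the GLM likelihood).
beta = np.zeros(3)
for _ in range(50):
    mu = np.exp(X @ beta + offset)
    z = X @ beta + (species - mu) / mu  # working response (offset removed)
    W = mu                              # IRLS weights for Poisson/log link
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

# exp(beta): species per hectare in A, and the B/A and C/A rate ratios.
print(np.exp(beta))
```

With only indicator predictors, the fitted rate for each habitat is simply its total species count divided by its total area (habitat A: 144/100 = 1.44 per hectare under these made-up areas), which is a useful sanity check on the fit.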

P.S. I’ll answer any question from Costa Rica because we spent our honeymoon there and had delicious platanos and arroz con pollo (most memorably in a lunch place where we noticed a tarantula crawling along on the floor).

]]>But . . . although I think there’s truth to all of the above explanations, I think some insight can be gained by looking at this another way. Lots of research shows that people are likely to take the default option (see here and here for some thoughts on the topic). The clearest examples are pension plans and organ donations, both of which show lots of variation and also show people’s decisions strongly tracking the default options.

For example, consider organ donation: over 99% of Austrians and only 12% of Germans consent to donate their organs after death. Are Austrians so much nicer than Germans? Maybe so, but a clue is that Austria has a “presumed consent” rule (the default is to donate) and Germany has an “explicit consent” rule (the default is to not donate). Johnson and Goldstein find huge effects of the default in organ donations, and others have found such default effects elsewhere.

**Implicit defaults?**

My hypothesis, then, is that the groups that give more to charity, and that give more blood, have defaults that more strongly favor this giving. Such defaults are generally implicit (excepting situations such as religions that require tithing), but to the extent that the U.S. has different “subcultures,” they could be real. We actually might be able to learn more about this with our new GSS questions, where we ask people how many Democrats and Republicans they know (in addition to asking their own political preferences).

Does this explanation add anything, or am I just pushing things back from “why do people vary in how much they give” to “why is there variation in defaults”? I think something is gained, actually, partly because, to the extent the default story is true, one could perhaps increase giving by working on the defaults, rather than trying directly to make people nicer. Just as, for organ donation, it would probably be more effective to change the default rather than to try to convince people individually, based on current defaults.

]]>On average, religious people are far more generous than secularists with their time and money. This is not just because of giving to churches—religious people are more generous than secularists towards explicitly non-religious charities as well. They are also more generous in informal ways, such as giving money to family members, and behaving honestly.

The nonworking poor—those on public assistance instead of earning low wages—give at lower levels than any other group. Meanwhile, the working poor in America give a larger percentage of their incomes to charity than any other income group, including the middle class and rich.

A religious person is 57% more likely than a secularist to help a homeless person.

Conservative households in America donate 30% more money to charity each year than liberal households.

If liberals gave blood like conservatives do, the blood supply in the U.S. would jump by about 45%.

I have a few quick thoughts:

1. These findings are interesting partly because they don’t fit into any simple story: conservatives are more generous, and upper-income people are more conservative [typo fixed; thanks, Dan], but upper-income people give less than lower-income people. Such a pattern is certainly possible (in statistical terms, corr(X,Y)>0, corr(Y,Z)>0, but corr(X,Z)<0), but it’s interesting.

2. Since conservatives are (on average) richer than liberals, I’d like to see the comparison of conservative and liberal donations made as a proportion of income rather than in total dollars.

3. I wonder how the blood donation thing was calculated. Liberals are only 25% of the population, so it’s hard to imagine that increasing their blood donations could increase the total blood supply by 45%.

4. The religious angle is interesting too. I’d like to look at how that interacts with religion and ideology.

5. It would also be interesting to see giving as a function of total assets. Income can fluctuate, and you might expect (or hope) that people with more assets would give more.

We’re looking forward to getting into these data and making some plots. (Boris suggested the secret weapon.)
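Two of these points can be checked numerically (all data below are made up for illustration):

```python
import random

# Point 1: corr(X,Y) > 0 and corr(Y,Z) > 0 do not force corr(X,Z) > 0.
# With independent A, B, C, set X = A+B, Y = B+C, Z = C-A; in expectation
# corr(X,Y) = corr(Y,Z) = +0.5 while corr(X,Z) = -0.5.
random.seed(0)
n = 20000
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]
x = [ai + bi for ai, bi in zip(a, b)]
y = [bi + ci for bi, ci in zip(b, c)]
z = [ci - ai for ci, ai in zip(c, a)]

def corr(u, v):
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    su = sum((ui - mu) ** 2 for ui in u) ** 0.5
    sv = sum((vi - mv) ** 2 for vi in v) ** 0.5
    return cov / (su * sv)

print(corr(x, y), corr(y, z), corr(x, z))  # roughly +0.5, +0.5, -0.5

# Point 3, back-of-envelope: if liberals are 25% of the population and
# everyone else already donates at the conservative rate, then raising
# liberals from zero donations to that same rate adds at most
# 0.25/0.75 = 33% to the supply, short of the claimed 45%.
p_lib = 0.25
max_jump = p_lib / (1 - p_lib)  # the donation rate cancels from the ratio
print(round(max_jump, 3))  # 0.333
```

So the 45% figure presumably assumes non-liberals currently donate well below the conservative rate, which seems worth checking in the original calculation.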

P.S. Bruce McCullough points out Jim Lindgren’s comments here on the study, questioning Brooks’s reliance on some of his survey data.

P.P.S. Also see here for more of my thoughts.

]]>