Comments on: You’ve got data on 35 countries, but it’s really just N=3 groups.

By: hay

hay — Wed, 03 Oct 2018 20:07:04 +0000

“But I don’t know of any way of solving the problem of what might be called spatial or multi-dimensional similarity.”

This is a curious statement, because the problem described of N=35 countries vs. N=3 countries in the example is exactly one of spatial autocorrelation. There are multiple ways to deal with that, and I guess it is theoretically possible that you do not have enough variation between groups to get a precise estimate of whatever quantity you are interested in. Nevertheless, as McElreath noted, the spatial distances can be defined on any variety of differences/similarities, including multidimensional similarity.

I am not sure the analogous time series example is exactly analogous, as what is described there is something of a temporal aggregation problem. There is of course a similar modifiable areal unit problem, but this should probably be thought of as a distinct problem.
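
As a rough illustration of the N=35 versus N=3 point, here is a minimal sketch of the effective sample size for a simple average when countries in the same group are correlated. The group sizes (12, 12, 11) and the single shared within-group correlation `rho` are assumptions made up for illustration, not anything taken from the original data.

```python
def effective_sample_size(cluster_sizes, rho):
    """Effective N for an equally weighted mean when observations in the
    same cluster share correlation rho and clusters are independent."""
    n = sum(cluster_sizes)
    # Var(mean) is proportional to sum_k m_k * (1 + (m_k - 1) * rho) / n^2,
    # so the effective N is n^2 divided by that inflated sum.
    inflated = sum(m * (1 + (m - 1) * rho) for m in cluster_sizes)
    return n * n / inflated


sizes = [12, 12, 11]  # hypothetical split of 35 countries into 3 groups
for rho in (0.0, 0.5, 0.9, 1.0):
    print(rho, round(effective_sample_size(sizes, rho), 2))
```

With `rho = 0` the effective N is the full 35, and as `rho` approaches 1 it falls to about 3, which is the sense in which the data are "really just N=3 groups".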

By: Richard

Richard — Mon, 01 Oct 2018 02:30:40 +0000

Perhaps not exactly what Jon is thinking about, but Simpson’s Paradox (https://en.wikipedia.org/wiki/Simpson%27s_paradox) is a good demonstration of the dangers of breaking groups apart for analysis.
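
A toy sketch of the reversal, using made-up numbers for three hypothetical groups: within each group the slope is negative, but pooling all nine points gives a positive slope because the between-group trend dominates.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: each group trends downward, but groups sit higher as x grows.
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

within = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled = slope(pooled_x, pooled_y)

print(within)  # slope inside each group
print(pooled)  # slope when the groups are pooled
```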

By: nostalgebraist

nostalgebraist — Thu, 27 Sep 2018 23:14:04 +0000

In reply to Bruce Bradbury.

This sounds right to me. This isn’t a problem of drawing overly confident conclusions about the association in the sample, due to having a “misleadingly” high N; if all you care about is the association present in the sample, that confidence is warranted. But of course, you don’t just care about the sample; you’re studying it to understand a larger population. The real objection here is just that the sample is not representative of the populations (“all existing countries” and/or “all possible countries”) we really want to learn about from a study like this.

Likewise, with those non-causal associations between US states in different regions — the associations are real, if perhaps uninteresting, so it isn’t a problem that we can conclude them with more confidence by considering more states per region. If they’re dominated by inter-region trends, then it would be inadvisable to apply them within a region (there could be a Simpson’s paradox situation). But that’s just a matter of knowing what your numbers mean, and what they don’t mean.

By: Willem

Willem — Thu, 27 Sep 2018 10:48:54 +0000

In reply to Richard McElreath.

It’s not spam if it’s relevant…

His book: http://xcelab.net/rm/statistical-rethinking/

The code: http://xcelab.net/rmpubs/rethinking/code.txt

By: Richard McElreath

Richard McElreath — Thu, 27 Sep 2018 07:08:36 +0000

In reply to gwern. My reaction as well. Anthro and evol bio both obsessed with such problems. For handling distance, most common technique I see is some covariance matrix defined by scaled distances (geographic, linguistic, phylogenetic). The Oceanic islands GP model in Chapter 13 of my textbook is a toy example, where distances are geographic.
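
To make "a covariance matrix defined by scaled distances" concrete, here is a minimal sketch using a squared-exponential kernel, which is one common choice. The parameter names (`eta_sq`, `rho_sq`, `sigma_sq`) and the distance values are assumptions for illustration; the same function applies whether the distances are geographic, linguistic, or phylogenetic.

```python
import math


def distance_covariance(dist, eta_sq=1.0, rho_sq=1.0, sigma_sq=0.01):
    """Covariance K[i][j] = eta_sq * exp(-rho_sq * d_ij^2), plus sigma_sq
    on the diagonal. Covariance decays smoothly as distance grows."""
    n = len(dist)
    return [
        [
            eta_sq * math.exp(-rho_sq * dist[i][j] ** 2)
            + (sigma_sq if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Made-up pairwise distances among three units: 0 and 1 are close, 2 is far.
dist = [
    [0.0, 0.5, 3.0],
    [0.5, 0.0, 2.8],
    [3.0, 2.8, 0.0],
]
K = distance_covariance(dist)
for row in K:
    print([round(v, 4) for v in row])
```

In a full model, `K` would be the covariance of unit-level intercepts, with `eta_sq` and `rho_sq` estimated from the data rather than fixed as they are here.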

By: Martha (Smith)

Martha (Smith) — Wed, 26 Sep 2018 18:27:05 +0000

In reply to Andrew. +1

By: Martha (Smith)

Martha (Smith) — Wed, 26 Sep 2018 18:23:38 +0000

In reply to Miguel Madeira. This was my impression as well.

By: Daniel Lakeland

Daniel Lakeland — Wed, 26 Sep 2018 13:28:42 +0000

In reply to yyw. I agree that when it comes to doing analyses it can be useful to build up to more complexity. But I think it's always a good idea to flesh out the theory you think best describes what's going on before proceeding to analyses. It's important for perspective and to help you identify how various sources of data might help; often there really is more than one or even more than a few relevant sources of data. Without some kind of broad-perspective model you won't be capable of making good use of the available data, and/or identifying the best data collection effort.

By: Miguel Madeira

Miguel Madeira — Wed, 26 Sep 2018 13:00:51 +0000

In reply to Jonathan. My impression is that he was not using the word "Confederacy" to refer to the specific historic event of the secession and the creation of the CSA, but as a shorthand for describing a region of the USA (in the same way he used "New England + West Coast" or "flyover country").

By: Bruce Bradbury

Bruce Bradbury — Wed, 26 Sep 2018 11:32:32 +0000

(For some reason I can’t see the other 4 comments)

I think it is more informative to think of these sorts of problems as problems of unobserved confounder variables (here ‘culture’) rather than as small N. This encourages us to think about exactly what cultural variables might be important, how they might work and whether we can measure them. Thinking about these theoretical issues is likely to be more useful than focusing on questions of statistical inference.

I suspect Baron might reply to this statement by saying that thinking about unobserved confounders is about causality – and he is just interested in associations. My response is that if we are interested in whether the observed association will extend to some sort of broader population then these confounders will influence this inference if they are different in the broader population than in the observed sample. For example, when we include East Asia in the sample.

By: yyw

yyw — Wed, 26 Sep 2018 03:42:03 +0000

In reply to Daniel Lakeland. Probably start small when facing such a highly complex question like this. Instead of an abstract inequality index of some sort, take a few policies implemented in recent decades and do within-country longitudinal analysis (not exactly a small problem). Then maybe go from there and expand to similar countries with somewhat different policies.

By: yyw

yyw — Wed, 26 Sep 2018 03:32:35 +0000

If we ignore all the modeling inadequacies and causation/correlation issues, this article can easily be renamed to “Societal inequalities narrow gender gaps in reading”. A statement like “inequalities are detrimental to the performance of girls relative to boys in the three topics math, reading, and science” can easily be rewritten to something like “inequalities are beneficial to the performance of boys relative to girls in the three topics math, reading, and science”. While it seems ridiculous, the original statement was pretty terrible too considering how much girls outperform boys in reading.

In any case, a zero sum outcome measure like the ratio of female/male high performers is not nearly as relevant an outcome measure as say the percentage of girls/boys that are high performers.

By: Guive

Guive — Tue, 25 Sep 2018 21:54:32 +0000

Is there a name for this problem?

By: Nat

Nat — Tue, 25 Sep 2018 21:37:21 +0000

We can also think about this problem from a design of experiment perspective. If we could control the design of the experiment and collection of data then we would like to have our points spread out across the design space. For example, nations with a wide range of “societal inequalities” or “genetic diversity”. However, it sounds like Baron is describing a situation where we are restricted to using existing data or cannot control the design of the experiment. The data points are clustered together into groups rather than spread out across the design space. We do not have much information about how the dependent variable varies for values of the independent variables between the clusters of data points. Under strong assumptions such as assuming things are linear across the design space (linear regression) I think you might still fit a model with low uncertainty. However, under weaker assumptions I think you might fit a model with high uncertainty (e.g., Gaussian process regression).

There is also the issue of spatial and temporal scales. When we are binning data spatially or temporally we are free to use very small or large bins. The size of the bins seems related to the amount of uncertainty we have in each observation. For example, if we estimate median household income at the state level then we should have much less uncertainty than estimating the value at the county or zip code level. Similarly, household spending each year is less variable than monthly or daily spending. We can consider a trade-off between lots of data points each with high uncertainty or a few data points each with low uncertainty. I think if we can incorporate this uncertainty in each observation into our modeling process then it might in some sense level the playing field.
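
A small sketch of the "level the playing field" arithmetic, assuming independent measurement noise and known variances (both strong assumptions): twelve noisy observations and three precise aggregates of four observations each give the same standard error for the overall mean once each observation is weighted by its precision. The numbers are made up.

```python
import math


def pooled_mean_and_se(values, variances):
    """Inverse-variance weighted mean and its standard error,
    assuming independent observations with known variances."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


sigma_sq = 4.0  # hypothetical variance of one fine-grained observation

# Twelve fine-grained observations, each with variance sigma_sq ...
_, se_fine = pooled_mean_and_se([10.0] * 12, [sigma_sq] * 12)
# ... versus three coarse bins, each averaging four of them.
_, se_coarse = pooled_mean_and_se([10.0] * 3, [sigma_sq / 4] * 3)

print(se_fine, se_coarse)
```

The equality breaks down when observations within a bin are correlated, which is the situation the post is worried about.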

By: Andrew

Andrew — Tue, 25 Sep 2018 19:50:11 +0000

In reply to gwern. Gwern: Yup. A multilevel regression model is still a regression model, and, like any regression model, it can be improved if there is available external information that has not been included in the predictors yet.

By: Andrew

Andrew — Tue, 25 Sep 2018 19:49:01 +0000

In reply to Jonathan. Jonathan: This Michael Porter??

By: Kyle C

Kyle C — Tue, 25 Sep 2018 18:40:33 +0000

In reply to Daniel Lakeland. Nice.

By: gwern

gwern — Tue, 25 Sep 2018 16:53:23 +0000

https://en.wikipedia.org/wiki/Galton%27s_problem strikes again.

Using a MLM strikes me as not being as efficient as possible: we know these similarities decay with distance, whether temporal, physical, or genetic, and we can even construct the phylogenetic trees. Just lumping the countries together in a cluster ignores these differences which can be large or small, and provide purchase for regression on the residuals from what would be expected from their autocorrelation. (Iceland should not be considered as similar to Sweden as, say, Finland.) The autocorrelation should be modeled as directly as possible.

By: Nikolai Vetr

Nikolai Vetr — Tue, 25 Sep 2018 16:15:11 +0000

This is pretty directly analogous to the case of phylogenetic pseudoreplication (Felsenstein 1985), where independence is incorrectly assumed between observations on separate tips of a tree.

I don’t think there are any good statistical models for how nations form (?), but another way to maybe accommodate spatial autocorrelation could be through Gaussian process regression, with covariance matrix informed by pairwise geographic distances (through e.g. waypoints). Statistical Rethinking 13.4.1 has a nice walkthrough of how to fit this sort of model in Stan.

A Chinese Restaurant model could also be used to average over possible partitions of the countries, if one is uncertain about clustering and unwilling to assume that relatedness between countries is influenced much by their geographic closeness.
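
For readers who have not met it, here is a minimal sketch of drawing one random partition from a Chinese restaurant process prior. The concentration `alpha`, the seed, and the use of 35 items are assumptions for illustration; a real analysis would combine this prior with a likelihood and average over many sampled partitions rather than draw one.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Assign items to clusters one at a time: item i joins an existing
    cluster with probability proportional to its current size, or opens
    a new cluster with probability proportional to alpha."""
    sizes = []   # sizes[k] = number of items already in cluster k
    labels = []  # labels[i] = cluster assigned to item i
    for _ in range(n_items):
        weights = sizes + [alpha]  # last slot stands for "new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


rng = random.Random(0)
labels = sample_crp_partition(35, alpha=1.0, rng=rng)
print(labels)
print("number of clusters:", max(labels) + 1)
```

Larger `alpha` favors more, smaller clusters; smaller `alpha` favors a few large ones.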

By: Daniel Lakeland

Daniel Lakeland — Tue, 25 Sep 2018 15:31:40 +0000

For all the discussions we’ve had about foundational Bayes vs Frequentist ideas etc, the real thing that gets me fired up about using Bayesian analysis is the fact that I can carry out an analysis regardless of what my model is like (computation issues aside). So in an analysis of social inequality and gender gaps in math or the like, I’d strongly recommend to stop thinking about the “statistics” (i.e., traditional questions of power, choice of test, assumptions of normality, etc.) and start thinking about the physical / mechanistic model of the phenomenon. Start with a textual description of what you think is going on causally:

“Within each country, societal issues related to gender roles result in varying degrees of economic distinctions between genders, this leads to differences in expectation of life-trajectories, and that leads to teachers having different expectations for child achievement. The result is that teachers teach children different material or spend different amounts of time explaining different things to boys vs girls. In addition there may be some inherent differences in the mean or variance of inherent talent among the populations of boys and girls, and a difference in what each child perceives as valuable and worth spending time on or what is interesting. These feedback effects develop through time as children who achieve at something tend to do more of that thing and less of another thing. The net result is through time as children age there is a widening gender gap in achievement among various topics”

Or whatever, that’s just some stylized idea of what people might think. But suppose it is what you think…Now start encoding that into a mathematical model.

Mathematically, we have several different academic/school related topics, perhaps math, language, sports, music, etc. We have in each country some attitude about whether each topic is “more male” or “more female”, we have associated effort and encouragement by society for each gender, we have children’s perceived gender role and level of interest, and so on. These are the parameters we use to describe the process.

Next we need the process description:

The rate of improvement in skills related to topic X for each child is functionally related to the inputs that go into skill development, including encouragement, individual child interest, time spent by the child, availability of instruction in the topic… And country or societal level parameters determine some of the encouragement, and some of the availability…. and society level parameters are similar within groups of countries…. and then across the world different groups of countries have some similarities as well…

In the end you’ll have a large model for worldwide educational variation across multiple topics in which there are thousands of parameters you are uncertain about. This is *the reality* of the problem. Now, because of these thousands of parameters, you’ll want to look for sources of data which can inform the quantities of interest: surveys of children’s interest, datasets on teacher populations: age, gender, subject they teach.. Data on spending in each country, time-series data tracking individual children’s achievement, time series data across different eras… whatever, each source of data is something you can potentially use to constrain the parameters within a given country, and thereby also constrain parameters within neighboring/similar countries, and thereby constrain parameters within continents… etc etc. But data won’t be uniformly available in all locations for all topics. So you’re going to have to work to provide reasonably well thought out priors for your parameters.

Next you’ll say: gee that’s all well and good, but I don’t have any of that information right now, and I do have this one great dataset with 18 data points in it across 3 countries, how can I make progress so that I can get grants and tenure? That’s a huge hard problem you’ve just described, I’d much rather just grab some dataset and calculate some p values…

And now we know why so little progress is made: we’ve *institutionalized* non-science as if it were in fact the pinnacle of scientific achievement (knowledge from pretending everything comes out of random number generators), and we’ve reified the idea that, without really thinking very much about how things work, we can just grab some small datasets, pretend that the data comes out of random number generators, and check to see if we can mathematically detect differences between RNG A and RNG B.

By: Jonathan

Jonathan — Tue, 25 Sep 2018 15:08:42 +0000

Just to note it’s an absurdity to make such a statement about Evangelical Protestantism and the Confederacy or Jim Crow, etc. It correlates but in the same way that Catholic correlates with MA: there are a lot of Irish in MA. Denominational differences between parts of the country exist. Some of these follow historical migration patterns and others reflect the growth of denominations within a region. I feel ridiculous having to mention this.

In terms of model division, I saw an op-ed by Michael Porter yesterday that cited a ‘rigorous’ social index. Is such a thing possible? I’d say no but when he cherry picks murder rate, then I know that’s a bullshit op-ed. People treat these issues as though they’re epidemiology, as though ‘determining’ penetration rates in specific populations of virulent diseases with relatively known infection rates – depending on exposure modeling, etc. – is the same thing as taking something buried way below the surface, like some measure of genetic diversity, and applying that to a population. You can see a rough connection: some groups are perhaps more prone to certain infections given certain other factors, like the way it appears HIV spreads in Africa depending on rates of already existing infections (meaning some of the research shows a form of opportunism). But in general? Treading on Wansink territory.