Our longtime collaborator Matt Salganik sent me a copy of his new textbook, “Bit by Bit: Social Research in the Digital Age.”

I really like the division into Observing Behavior, Asking Questions, Running Experiments, and Mass Collaboration (I’d remove the word “Creating” from the title of that section). It seemed awkward for Ethics to be in its own section rather than being sprinkled throughout the book, but in any case it’s a huge plus to have any discussion of ethics at all. I’ve written a lot about ethics but very little of this has made its way into my textbooks so I appreciate that Matt did this.

Also I suggested three places where the book could be improved:

1. On page xiv, Matt writes, “I’m not going to be critical for the sake of being critical.” This seems like a straw man. Just about nobody is “critical for the sake of being critical.” For example, if I criticize junk science such as power pose, I do so because I’m concerned about waste of resources, about bad incentives (positive press and top jobs for junk science motivates students to aim for that sort of thing themselves), I’m concerned because the underlying topic is important and it’s being trivialized, I’m concerned because I’m interested in learning about human interactions, and pointing out mistakes is one way we learn, and criticism is also helpful in revealing underlying principles of research methods: when we learn how things can seem so right and go so wrong, that can help us move forward. Matt writes that he’s “going to be critical so that [he] can help you create better research.” But that’s the motivation of just about *every* critic. I have no problem with whatever balance Matt happens to choose between positive and negative examples; I just think he may be misunderstanding the reasons why people criticize mistakes in social research.

2. On pages 136 and 139, Matt refers to non-probability sampling. Actually, just about every real survey is a non-probability sample. For a probability sample, it is necessary that everyone in the population has a nonzero probability of being in the sample, and that these probabilities are known. Real polls have response rates under 10%, and there’s no way of knowing or even really defining what is the response probability for each person in the sample. Sometimes people say “probability sample” when they mean “random digit dialing (RDD) sample”, but an RDD sample is not actually a probability sample because of nonresponse.

3. In the ethics section, I’d like a discussion of the idea that it can be an ethics violation to do low-quality research; see for example here, here, and here. In particular, high-quality measurement (which Matt discusses elsewhere in his book) is crucial. A researcher can be a wonderful, well-intentioned person, follow all ethical rules, IRB and otherwise—but if he or she takes crappy measurements, then the results will be crap too. Couple that with standard statistical practices (p-values etc.) and the result is junk science. Which in my view is unethical. To do a study and *not* consider data quality, on the vague hope that something interesting will come out and you can publish it, is unethical in that it is an avoidable pollution of scientific discourse.

Anyway, I think it will make an excellent textbook. I mentioned 3 little things that I think could be improved, but I could list 300 things in it that I love. It’s a great contribution.

“…just about every real survey is a non-probability sample.” Yes and no. When considered jointly with whatever your assumption is about nonresponse, almost every real survey is a probability sample. Under MCAR, it *is* a probability sample. Under some model of non-response it’s a random sample with probabilities of inclusion that depend on the probability of response. I say this not just to be critical (ahem) but because the view that all modeling is a simultaneous test of some set of things that are math and some other set of things you call assumptions and some third set of things you call models is, I think, your best defense of Bayesianism; at least it’s the defense that convinces *me*. Putting the math stuff to one side and saying “well at least we don’t have to test that part — that’s just theorems” is pretty artificial in just about any real study, right?
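To make the MCAR point concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the population size, the 50% prevalence, and the response probabilities are made up for illustration and do not come from any real survey. It compares respondents under MCAR (everyone responds with the same probability) with respondents whose chance of responding depends on the very thing being measured.

```python
import random


def simulate_survey(n_population=100_000, seed=0):
    """Compare respondent means under two hypothetical response models."""
    rng = random.Random(seed)

    # Hypothetical population: a yes/no outcome with true prevalence near 0.5.
    population = [rng.random() < 0.5 for _ in range(n_population)]
    true_mean = sum(population) / n_population

    # MCAR: every person responds with the same probability (assumed 0.10).
    mcar = [y for y in population if rng.random() < 0.10]

    # Not MCAR: response probability depends on the outcome itself
    # (assumed 0.15 for "yes" and 0.05 for "no").
    dependent = [y for y in population if rng.random() < (0.15 if y else 0.05)]

    return true_mean, sum(mcar) / len(mcar), sum(dependent) / len(dependent)
```

Under these assumptions the MCAR respondents track the population mean closely, while the outcome-dependent respondents land near 0.75 instead of 0.5. Which of the two situations a real survey resembles is exactly the assumption being debated here.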

I think the point here is that a “probability sample” is usually used to refer to one in which the probability of choosing some unit is *known* a-priori (by construction). So for example where people enter a lottery and then all their names are put in a computer, a hardware random number is drawn for a seed, and then a sample is drawn from the list of names using a PRNG is a probability sample because the probabilities are known by construction through the design. Random Digit Dialing is nothing like that.

I agree with the “usually,” but my point is that if you want to be really punctilious about it, it’s not clear that even in the archetypal examples there are anything like a priori probabilities. All there ever are are probabilities under some set of assumptions. Assumptions about the PRNG for example. And when you say Random Digit Dialing is *nothing* like that, well, it’s *something* like that, just not *exactly* like that. How close it needs to be depends on the assumptions, as I said above. If there is no distinction in what you’re measuring based on the number of phones people can be reached at, then oversampling (in expectation) those people doesn’t matter. You’ll never sample people without phones. Does this matter? It did in 1936 in the Landon/Roosevelt election, but whether it does or not for some survey today is an assumption of the model.

Stuff like this only comes from those who don’t actually collect and process data. If you take a bunch of heart rate readings from people that are above 10^5 beats per minute, what would you think?

What I meant were a priori point probabilities, of course…. A priori… y’know, before any data, and without any uncertainty. I don’t see what your example has to do with my point. There are certainly impossible probabilities, and even distributions of possible probabilities. What there *aren’t,* as I said, are a priori probabilities. There are things which for any particular purpose might be treated as a known probability, but it’s still just an assumption.

The behavior of a particular PRNG can be verified *before you design the mechanism of sampling*, so in that sense it is *a priori* known to actually obey the “assumption”. Once you move an assumption from a real assumption to a verifiable and pre-verified fact… it’s not quite the same kind of thing.

I started thinking J(ao) would not consider this to be truly “a priori”, since you used data regarding the testing of the algorithm plus whatever background info that was used to come up with it. This sounds silly so I am probably wrong.

Maybe I do not understand. Does this exist? Some expectations about the universe are built into your genes. True a priori probability seems like something outside the realm of human, or any other lifeform’s, experience.

PRNGs, after thorough testing (such as the Dieharder suite of tests), give as close to a-priori known probability sampling as you’re going to get. And it’s close enough. So, yes, it is possible to have a-priori knowledge of the probability of inclusion in a sample. For example, use a PRNG to select a subset of all American Community Survey responses. The probability that a given response is included (if you’re Bayesian, this is conditional only on the knowledge of the PRNG algorithm used) is known before you press the button.
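A minimal sketch of that kind of design, using Python's standard `random` module (a Mersenne Twister PRNG). The frame, the sample size, and the function names are hypothetical and chosen only for illustration. For a simple random sample of n units from a frame of N, the inclusion probability of every unit is n/N, and that number is fixed by the design before anything is drawn.

```python
import random


def draw_sample(frame, n, seed):
    """Draw a simple random sample of n units without replacement."""
    rng = random.Random(seed)  # the seed could come from a hardware source
    return rng.sample(frame, n)


def inclusion_probability(frame_size, n):
    """Design-based inclusion probability for simple random sampling."""
    return n / frame_size


def empirical_inclusion(frame, n, unit, n_draws=20_000, seed=0):
    """Check the design probability by repeating the draw many times."""
    rng = random.Random(seed)
    hits = sum(unit in rng.sample(frame, n) for _ in range(n_draws))
    return hits / n_draws
```

With a frame of 50 records and n = 5, `inclusion_probability(50, 5)` is 0.1, and `empirical_inclusion(list(range(50)), 5, unit=7)` should come out close to that. The empirical check is only a sanity test of the PRNG and the code; it does not replace the kind of testing described above.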

>but an RDD sample is not actually a probability sample because of nonresponse

going deeper than that:

Nonresponse is a big issue, but the number of phones on which a person can be reached ranges from 0 to infinity and has nontrivial numbers of people at both 0 and out in the 4, 5 or 6 range, so “probability of selection” is unknown with RDD.
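Here is a rough simulation sketch of the phone-count problem, ignoring nonresponse entirely. All of the numbers are made up: the distribution of phones per person and the way the outcome rises with phone count are assumptions for illustration, not estimates of anything. Dialing a random number selects a person with probability proportional to the number of phones they have, and never selects anyone with zero phones.

```python
import random


def rdd_demo(n_people=50_000, n_calls=20_000, seed=1):
    """Simulate dialing random numbers when people have 0 to 3 phones."""
    rng = random.Random(seed)

    # Hypothetical population: phone counts, and a yes/no outcome whose
    # probability rises with the number of phones (assumed relationship).
    phones = [rng.choice([0, 1, 1, 2, 3]) for _ in range(n_people)]
    outcome = [1 if rng.random() < 0.2 + 0.2 * k else 0 for k in phones]
    true_mean = sum(outcome) / n_people

    # One entry per phone number, so a person with k phones appears k times.
    numbers = [i for i, k in enumerate(phones) for _ in range(k)]
    dialed = [rng.choice(numbers) for _ in range(n_calls)]

    # Unweighted estimate: over-represents people with many phones.
    raw = sum(outcome[i] for i in dialed) / n_calls

    # Weighting by 1 / (number of phones) corrects the oversampling,
    # but only if the phone counts are known, and it cannot recover
    # the people with zero phones.
    weights = [1 / phones[i] for i in dialed]
    weighted = sum(w * outcome[i] for w, i in zip(weights, dialed)) / sum(weights)

    return true_mean, raw, weighted
```

Under these assumptions the unweighted estimate is furthest from the truth, the weighted estimate is closer, and neither matches the full-population value because the zero-phone group is never reached. If the outcome did not depend on phone count, all three would agree, which is the "it depends on the assumptions" point made earlier in the thread.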

Well, it depends a lot what you mean by “just about every real survey.” Also it depends a lot who you are teaching in the class and what they are expected to do with the knowledge. In my beginning methods class we talk a lot about how we would sample students at our college. We think about the limitations of standing at one of the campus gates and asking everyone or every x person over a few hours as a method and talk about how at one gate you will get subway riders, another car drivers, and another people on the bus (and are you really going to go up to a couple walking together holding hands and say just to person 2 would you be willing to do my survey). Also that you’ll be tempted to not do it for all of the hours and days people come, e.g. you might miss night students or weekend students, or people who only come on Thursdays and that could lead to all kinds of bias in terms of age/gender/major/income not to mention online students. We also talk about chain referral samples and samples where you leave a stack of surveys somewhere and ask people to fill them in. This kind of thing is what is meant by non probability sample in the social research methods textbook world. Those are probably the majority of real world surveys that are done by people.

In the probability sample category yes we know it all goes wrong in practice but at least there is an attempt to think about probability and randomness in practice.

Then we might consider what if we randomly select class sections and then students in there, and then in that case, what about no-shows and people who have dropped. Or we might do an email survey but what about people who never check their college email? And we’ll likely only get 20% based on past experience. Is it better to do a massive email blast or to, yes, randomly select a list of maybe 250 and really work hard to try to get them to respond so that possibly we get 75%. (And by the way what are the odds we could get access to the needed sampling frame.) All things that are worth discussing and experimenting with (For teaching I like SAMP since it lets us look at results with non response factored in … and the convenience sample has no non response really because the concept is not even meaningful.)
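One way to frame the 20% versus 75% comparison, offered as a sketch and not as the only way to think about it: for a yes/no question, if we assume nothing at all about the nonrespondents, the proportion among everyone we tried to contact can only be pinned down to a range, and the width of that range is one minus the response rate. The function name and the 0.6 observed proportion below are hypothetical, and sampling error is ignored.

```python
def worst_case_bounds(p_respondents, response_rate):
    """Bounds on the proportion among everyone contacted, for a yes/no item.

    The lower bound treats every nonrespondent as "no",
    the upper bound treats every nonrespondent as "yes".
    """
    low = response_rate * p_respondents
    high = low + (1 - response_rate)
    return low, high
```

With 60% of respondents saying yes, `worst_case_bounds(0.6, 0.20)` gives roughly (0.12, 0.92), while `worst_case_bounds(0.6, 0.75)` gives roughly (0.45, 0.70). The smaller, harder-worked sample has more sampling error, so this is only one side of the tradeoff, but it shows why the response rate matters even when the email blast returns more completed surveys.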

So in a methods text book it is indeed helpful to distinguish between probability sample and non probability sample, at least from my perspective.

RDD is expensive and most people, including most college faculty, do not have access to it nor is it appropriate for what they are doing.

Elin:

My point was not that students should be doing RDD sampling. I only brought up RDD sampling to point out that even that sampling method, which seems like pure probability sampling, isn’t. I agree that probability sampling can be part of a sampling design, and it’s something worth teaching, but in a practical book about data collection I think it’s important to emphasize that real-world surveys are not probability samples, because the probability of inclusion in the sample is generally not known.

Well actually I would say a lot of times we do know a fair amount, e.g. we have the names of enrolled students on a specific date. “Most” surveys are not of the general public and so a practical guide shouldn’t focus that much on how to do the GSS or even the overnight polls, it should focus on how to do research on a group you are interested in, such as all the patients at a given hospital. Anyway, my point is that there are “non probability samples” which have nothing at all related to randomness in their design and “probability-ish” samples where some effort is made, and the latter is better because then at least you have some possibility to think about how to adjust your estimates. The problem with saying that all samples are bad is that students hear that and don’t hear “but some are more bad than others.” Then they end up thinking since they are all bad I might as well do what’s easiest.

Elin:

1. I agree that a lot of times we do know a fair amount. In just about every sampling problem I’ve heard of, the people doing the sampling know a lot about the population. And I have seen some real-world samples that are true probability samples: these are samples not of people but of records. For example we have a few thousand files of legal cases and we sample 100 of them at random for audits. Nonresponse and missing data are not a problem here. So I overstated it when I wrote, “just about every real survey is a non-probability sample.” I should say, “just about every real survey of people is a non-probability sample.” There’s lots of sampling that’s not of people (although that’s not the topic of Salganik’s book).

2. You write, “The problem with saying that all samples are bad . . .” But I never said all samples are bad. That’s one of the problems with textbooks: they can leave students with the impression that a non-probability sample is “bad.” Non-probability samples can be just fine; we use them all the time.

Hi Andrew,

I say all samples are bad but they are bad in different ways and to different degrees all the time. But yes, I agree that some books might leave students with the impression that we just shouldn’t do what they call non probability sampling at all. I think that’s not what you see so much in methods books (at least in social research) because most people are not probability sampling at all. Lots of focus groups are really helpful at understanding things. Lots of times someone does a study inside a single school and it helps understand things at other schools.

It all does depend on who the audience is too. The first time through a topic you’re definitely not going to get all the complexity; it wouldn’t be appropriate. To me part of the problem is that often students don’t get the second or third pass through the material.