Kevin Carlson writes:

Though my graduate education is in mathematics, I teach elementary statistics to lower-division undergraduates.

The traditional elementary statistics curriculum culminates in confidence intervals and hypothesis tests. Most students can learn to perform these tests, but few understand them. It seems to me that there’s a great opportunity to reform the elementary curriculum along Bayesian lines, but I also see no texts that attempt to bring Bayesian techniques below the prerequisite level of calculus and linear algebra. Do you think it’s currently possible to teach elementary stats in a Bayesian way? If not now, what might need to happen before this became possible?

My reply:

I do think there’s a better way to teach introductory statistics but I’m not quite there yet. I think we’d want to do it using simulation, but inference is a sticking point.

To start with, let’s consider three levels of intro stat:

1. The most basic, “stats for poets” class that provides an overview but few skills and no derivations. Currently this seems to usually be taught as a baby version of a theoretical statistics class, and that doesn’t make sense. Instead I’m thinking of a course where each week is a different application area (economics, psychology, political science, medicine, sports, etc.) and then the concepts get introduced in the context of applications. Methods would focus on graphics and simulation.

2. The statistics course that would be taken by students in social science or biology. Details would depend on the subject area, but key methods would be comparisons/regression/anova, simple design of experiments and bias adjustment, and, again, simulation and graphics. The challenge here is that we’d want some inference (estimates and standard errors, and, at the theoretical level, discussions of bias and variance) but this all relies on concepts such as expectation, variance, and some version of Bayesian inference, and all of these can only be taught at a shallow level.

3. A statistics class with mathematical derivations. For this you should be able to teach the material any way you want, but in practice these classes have a pretty shallow mathematical level and give pseudo-proofs of the key results. I don’t think there’s any way to teach statistics rigorously in one semester from scratch. You really need that one semester on probability theory first.

Option #2 above is closest to what I teach, and it’s what Jennifer and Aki and I do in our forthcoming book, Regression and Other Stories. We do lots of computing, and we keep the math to a minimum. Bayes is presented as a way of propagating error in predictions, and a way to include prior information in an analysis. We don’t do any integrals.

I’m not yet sure how to do the intro stat course. Regression and Other Stories starts from scratch, but the students who take that class have already taken introductory statistics somewhere else.

For that first course, I think we need to teach the methods and the concepts, without pretending to have the derivations. Students who want the derivations can go back and learn probability theory and theoretical statistics.

In my spare time, I plan to follow this effort because I am fascinated by the prospect of designing an elementary statistics curriculum. Admittedly, I am only a novice student, but a few epiphanies have suggested some fruitful avenues to explore.

Can elementary Bayesian statistics be taught by considering only two possible states of the world? For instance, take the canonical example: 3 out of 100 people have cancer, all 100 are given a test that is 90% accurate, and you ask for the probability that you have cancer given a positive result.

You could get a lot of mileage out of this model by applying it to different areas (jury verdicts, stock value, is someone being truthful, …). Variables that are actually continuous (a stock price) can be usefully modeled with only two states: high value and low value. After the concept is digested, you might even be able to give some formulas (without derivations) for slightly more complicated priors (three states? a normal prior?).
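The two-state calculation above fits in a few lines of code. This sketch assumes the stated 90% accuracy applies in both directions, i.e. sensitivity and specificity are both 0.9:

```python
# Two-state Bayes: 3% base rate, test assumed 90% accurate in both directions.
p_disease = 0.03
sens = 0.90  # P(positive | disease)
spec = 0.90  # P(negative | no disease)

# Total probability of a positive result, then Bayes' theorem.
p_pos = sens * p_disease + (1 - spec) * (1 - p_disease)
p_disease_given_pos = sens * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # prints 0.218
```

Even with a positive result from a "90% accurate" test, the posterior probability is only about 22%, which is exactly the intuition the example is meant to build.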

This is actually not particularly Bayesian… It’s a Frequentist calculation using Bayes theorem and a known “base rate”… The essence of Bayesian statistics is probability distributions over parameters.

This raises a question for me, maybe a dumb one–I’ve never understood the relationship between Bayes and probability distributions. PD’s aren’t implied by Bayes theorem, and if probability is a matter of belief, what do you do with someone who doesn’t believe in PD’s? In contrast, PD’s are practically required by the frequentist idea that probability is the expected long-run frequency of an event. I have some vague notion that modern Bayesian statistics is actually a long way from Bayes theorem, but I’m not sure how it got there, or why the use of PD’s doesn’t render Bayesian statistics “a Bayesian calculation using frequentist theory.”

Michael:

I don’t understand your question. Bayes is entirely about probability distributions. Bayes’ theorem is a statement about probabilities! In contrast, Bayes’ theorem and Bayesian inference say nothing about “belief.” Belief is one model for probability, but only one model.

Andrew, I think my question arises from my conflating the concepts of frequency distributions and probability distributions. My sense is that, among frequentists, probability distributions are often discussed as if they are frequency distributions computed for an entire population, or conversely, frequency distributions are often discussed as if they are probability distributions for finite samples. At least, that’s how I was taught. To bring things back to the topic of your post: it’s a simplifying analogy that doesn’t strictly (or mathematically) hold but allows a teacher to move from stem-and-leaf plots to bell curves by relying on the student’s intuition. I moved on a long time ago to a deeper understanding, but my weak attempts to grasp Bayesian statistics over the years apparently unearthed this misconception without my recognizing it as such.

I see Michael Nelson has a lot of questions and maybe some confusions that could be clarified. I’ll try to help, but please ask if you have further questions.

First, let’s get a few brief things cleared up.

What is probability? It’s a kind of math you can do. At this level, it’s purely formal.

What does probability mean? Now we’re asking about how it can be used to match purely formal quantities to real world quantities.

Probability is a kind of math you can do to manipulate quantities that can represent multiple kinds of things, so there *isn’t a unique meaning for probability*. For example, probability can represent how often an infinite sequence of numbers will give you a certain result if you take the sequence in batches of a certain size and calculate the result from each batch (say, an average). But probability is also the rules for weighting how reasonable it is to conclude that a certain thing might be true, if you require that your weighting scheme has certain properties (this is Cox’s theorem). You can also motivate probability as pure measure theory on finite measures: probability can be used, for instance, to calculate the weight of metal bars as a fraction of the weight of the full initial uncut bar when you subset the bars in certain ways (or, if you like, to calculate properties of chemical reactions in terms of what fraction of atoms are participating in a certain kind of chemical in solution). So it’s not a single thing.

The relationship between Bayesian Statistics and Probability Distributions is that Bayesian Statistics calculates a kind of weight of evidence for a certain subset of parameter space using the calculus of probability.

Contrast that to Frequentist probability theory in which probability is *only* used to express a model for how often observable outcomes will occur when we repeat our observations. Frequentist inference *refuses* to assign probability to quantities that are not observable in repetition.

Bayes theorem is just a mathematical theorem about any mathematical system that obeys the probability axioms. As such, it’s a fact about the numbers, and so it holds when it comes to numbers used in *either one* of the models for probability. What tells you whether a calculation is frequentist or Bayesian is whether or not probability is assigned to things that aren’t repeatable random samples. It has nothing to do with whether or not you used Bayes Theorem to do the calculation really.

For example, if we choose a random patient and give them a test that has a random outcome, then we can use Bayes Theorem to calculate *how often* this process of choosing a random patient and giving a test will result in a patient who has the disease and also tests positive…. The fact that this is a *how often* question is what makes it a Frequentist calculation.

On the other hand, if we choose a *particular* person and fix them, and then give them a test and see it shows positive, we aren’t interested in how often *other* patients chosen at random would have the disease… we’re interested in the “unobservable” fact of whether or not *this patient* has the disease. So we can use probability to provide a kind of weight of credence to lend to the idea that the parameter (the true state of the disease) is equal to 1 vs equal to 0.

Doing this calculation, we can include some prior information we have about whether the person has the disease, for example some information from the verbal history of the patient: “I was bitten by a tick last week and then I got a rash”. This lets us assign credence to the idea that “the unknown value of the disease variable is 1”. There are a number of ways to assign this credence; *sometimes* we use *how often has this sort of thing happened in the past*, but that’s not the only method of coming up with information.

So, the quick version:

Frequentist calculations: are entirely about *how often OBSERVABLE things occur in repetition*. Probability is never assigned to unobservable quantities.

Bayesian calculations: are about how much credence we should lend to an idea that we have about an unobservable thing conditional on what relevant information we’ve observed or assumed. To the extent that future repetitions haven’t been observed yet, they too can be assigned Bayesian probability.

Also, the reason I say that the typical “disease / test” example is Frequentist is because it uses a “known / observed base rate” in the calculation of *how often* a “random” patient given the test would turn out to have or not have the disease. All the quantities assigned probability are essentially observables, since frequency in the population is “observable” given a sufficiently large number of previous cases, and we don’t alter this assignment based on any kind of information about the patient, like for example the tick bite verbal history.

However…the calculation can also be interpreted in a Bayesian way. Gerd Gigerenzer wrote a book about this (“Calculated Risks”) where he uses this kind of calculation as a substitute for taking a prior (the rate of 3 in 100) and the likelihood (P(positive|disease)=0.9, P(negative|no disease)=0.9) to calculate the posterior probability P(disease|positive). Gigerenzer calls this method “natural frequencies”, but his interpretation is Bayesian.

It’s just that the calculation is a bit more straightforward to do than plugging into Bayes’ theorem the way we were taught it. Yes, you can interpret it as a calculation of the rate at which people with a positive test turn out to have the disease, but you can also interpret it (as does Gigerenzer) as the probability that an individual has the disease, given that that individual’s test was positive, which is definitely Bayesian.
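The natural-frequencies framing amounts to simple counting. This is an illustrative sketch using the same numbers as above (3-in-100 base rate, 90% accuracy assumed in both directions), scaled to an imagined 1000 people:

```python
# Natural frequencies: imagine 1000 people with a 3% base rate and a
# test that is 90% accurate for both the diseased and the healthy.
n = 1000
diseased = n * 0.03            # 30 people have the disease
healthy = n - diseased         # 970 do not
true_pos = diseased * 0.90     # 27 test positive and have the disease
false_pos = healthy * 0.10     # 97 test positive without it
p = true_pos / (true_pos + false_pos)
print(f"{true_pos:.0f} of {true_pos + false_pos:.0f} positives are diseased: {p:.3f}")
# prints: 27 of 124 positives are diseased: 0.218
```

The arithmetic is identical to plugging into Bayes’ theorem; the counting version is just easier for students to follow, which is Gigerenzer’s point.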

I have used this approach many times teaching statistics-naive honors college students at two universities the basic ideas of Bayesian decision theory. It was taught as a seminar, so not a lot of students each time; I would guess that over the nine times I taught the course about 200 students took it (and at the University of Vermont the course has been taken over and taught several times by another faculty member after I stopped formal course teaching). I know of several students who became so interested in what they learned in that course that they went on to careers in statistics.

Let me see if I’m understanding correctly. Suppose that a patient is given a blood test for an autoimmune disease, and the result is his white blood cell count. A frequentist cares about determining the probability that a patient with a particular test result is in the population of people with the disease. The probability distribution in this case is defined by the spread and location parameters of the sick population’s white blood cell count. (Or maybe she will ask whether the patient is in the population without that particular autoimmune disease, depending on the hypothesis being tested.) The frequentist (who is hopefully also a doctor) uses known parameter values of the mean and variance of the population of interest to determine the probability that a patient with this count is from that population. Better yet, let’s say the frequentist has a regression model that includes the test results along with other factors (e.g., medical history, reported symptoms, recent outbreaks) as predictors with predetermined weights, in which case the frequentist cares about the probability distribution of the predicted score of those in the population of interest, still defined via spread and location.

If I’m interpreting your comment correctly, the Bayesian cares about determining the probability that this patient has the autoimmune disease, which she determines using two kinds of information: prior information (e.g., medical history, reported symptoms, recent outbreaks) and the new test results. Each piece of prior information has a weight associated with it (determined empirically or theoretically or however) and their collective weights are entered into a model for the probability that the patient has the disease. This prior probability distribution is for that probability and is defined by the spread and location parameters for the probability. The new information from the test results receive their own weight (again, determined however) and are added to the model, which then reports a new estimate of the probability that the patient has the disease.

*IF* this interpretation is correct, then it seems the two main differences are 1) the probability distribution of interest–the score of sick people for frequentists and the probability that the patient is sick for Bayesians, and 2) whether we stop to compute a result using only the prior information before including the new information–frequentists don’t and Bayesians do. (I realize that there is some sophisticated math that happens when a Bayesian modifies priors, and it’s not the same as including a new predictor in a regression model.) You mentioned that the big difference is that Bayesians assign probabilities to specific events instead of long-run occurrences, but my interpretation is that this is the same as saying that the distribution is for the probability instead of a score (difference #2). Is all this correct?

> A frequentist cares about determining the probability that a patient with a particular test result is in the population of people with the disease

nope, that’s a Bayesian concept. The Frequency concept is “how often would a random person who has the disease have this WBC count?” The Frequency concept refuses to assign probability to the particular person… it’s all about the repetitive process of selecting a random person from the pool of diseased people and doing the WBC count…

If you are assigning probability to an individual rather than a process, you are doing Bayesian analysis

I guess in some sense, both approaches care about populations and individuals, but the Frequentist implicitly defines the population as “whatever stays the same when I repeat the measurement process” whereas the Bayesian explicitly defines the relationship between population and measurements.

I suspect when people are wary of Bayes, they are worried about the need to formulate an explicit generative model, because this model can of course be wrong in which case the analysis is useless (this worry is often framed in terms of “priors”, but I think people often mean “models”). Particularly given how Frequentist methods are often taught, people are led to believe that the implicit definition frees them from this kind of error, but really it just shoves the error into an assumption that is typically unstated or forgotten.

> The Frequency concept is “how often would a random person who has the disease have this WBC count?”

What about “how often would a random person who has this WBC count have the disease?”

Is it a Frequency (..tist?) concept? Is it a Bayesian concept? Both? Neither?

gec, I think those are good points.

Carlos: that concept, and the concept of a population with a known base rate of disease are also frequentist. The key is defining some random selection process.

This is why I say that using the “known base rate” makes the Bayes theorem calculation Frequentist. At that point there are just frequencies under repetition involved, and the question is how often a particular combined event would occur. Bayes theorem describes the math just fine, but without a probability distribution over an unobservable it isn’t particularly Bayesian; at least it doesn’t use the main technique of Bayesian stats.

Daniel: when “a frequentist cares about determining the probability that a patient with a particular test result is in the population of people with the disease” what he cares about is “how often would a random person who has this WBC count have the disease”. It cannot be at the same time a Bayesian (not Frequentist) concept and a Frequentist (not particularly Bayesian) concept.

Everything becomes so much less confusing if we use “frequency” when we mean frequency and “probability” when we mean how “probable” something is… but since “probability” is such a loaded word, let’s give it to no-one… I’ll use “frequency” and “credence”…

So there are two conceptions of the problem:

A) “a Frequentist cares about determining the frequency with which a randomly selected patient from a particular population who has a certain WBC is in the sub-population of people with the disease”

Is pretty unambiguous I think.

B) “a Bayesian cares about determining how much credence to give to the idea that a particularly given person who has a particular medical history and a particular WBC count does or does not have a disease”

Note that whether the actual person in front of you has the disease or not does not affect the answer in (A)… since the answer in (A) is about a process of selection and how often that would result in a certain event.

Whereas as information comes in from additional tests etc, the answer in (B) converges either to yes, or no… while the person’s disease status does not change… showing that the credence is a function of the available information.
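The convergence in (B) can be shown with a small sketch: one fixed patient, a hypothetical sequence of independent test results (each test assumed 90% accurate), and the credence updated after each result. The sequence of results below is made up for illustration:

```python
# Credence for one fixed patient as independent test results arrive.
# The patient's true disease status never changes; only the information does.
acc = 0.90        # assumed accuracy, both directions
posterior = 0.03  # prior credence taken from the base rate

for result_positive in [True, True, False, True, True]:  # hypothetical results
    like_d = acc if result_positive else 1 - acc        # P(result | diseased)
    like_h = (1 - acc) if result_positive else acc      # P(result | healthy)
    posterior = like_d * posterior / (
        like_d * posterior + like_h * (1 - posterior))
    print(round(posterior, 3))
# prints 0.218, 0.715, 0.218, 0.715, 0.958
```

A single discordant result pulls the credence back down, but as consistent evidence accumulates the credence heads toward 1 or 0, exactly as described above.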

By “a patient” I meant as opposed to “this patient,” but I definitely should have said “a random patient” or at least “any patient.” If my summary of the Bayesian side is accurate, though, then I’m happy.

What are the necessary prerequisites for an elementary statistics course? There is this aura around a math background. But that is not adequate I gather.

The focus should be on conceptual understanding. Did students of stat 101 get the concepts? Usually not. They can solve textbook exercises but have difficulty handling problems that require applications building on the concepts they were taught. We tackle this using methods based on conceptual understanding in both the pedagogical process and in formative assessments supporting that process.

see for example https://iase-web.org/documents/papers/icots8/ICOTS8_1C3_ETKIND.pdf?1402524969

Ron, You are correct. Do you have any introductory videos of your teaching methods?

Sameera – you can get a glimpse of what MERLO sessions look like in https://goo.gl/XENVPn

MERLO stands for meaning equivalence reusable learning object.

A MERLO item has 5 alternative representations, some with meaning equivalence, some with surface similarity. Learners are asked to mark down those with meaning equivalence.

The pedagogical approach is to use MERLO items in formative assessment sessions, such as the one in the above video.

How to use this in stats education is discussed in chapter 6 of https://www.amazon.com/Information-Quality-Potential-Analytics-Knowledge-ebook/dp/B01MEERM38/ref=sr_1_3?qid=1573050181&refinements=p_27%3ARon+S.+Kenett&s=books&sr=1-3&text=Ron+S.+Kenett

See also https://iase-web.org/documents/papers/icots8/ICOTS8_1C3_ETKIND.pdf?1402524969

Thank you for the references.

one more https://www.wiley.com/legacy/wileychi/kenett/supp/presentation/Kenett_SFdS_28_5_2016.pdf?type=SupplementaryMaterial

Thanks, these slides seem to provide a reasonable sense of your teaching ideas in less than 15 minutes – a nice morning bite. Many useful ideas and experiences are outlined.

My challenge in teaching is that (for busy science professionals) I only get an hour, maybe up to 6 episodes over the year, so little time for group discussion or assessment of their actual grasp of concepts :-(

But I do get to see what a subset subsequently do in their work :-)

So I go long on concepts using metaphors and simulation as here https://statmodeling.stat.columbia.edu/2019/10/15/the-virtue-of-fake-universes-a-purposeful-and-safe-way-to-explain-empirical-inference/

And with the hope that those who really want to learn will persist in running and modifying the code that is available.

Thank you so much. I look forward to reviewing them.

This is good stuff! I am about 50% of the way to creating such a course. This semester’s version is here. Key thing I need is a single textbook. Fortunately, Modern Dive provides a good starting point.

> The most basic, “stats for poets” class that provides an overview but few skills and no derivations.

If you offer a course like this, you are serving your students very poorly. I have plenty of poets in my class. They can learn R, R Markdown and other skills for doing data analysis.

> this all relies on concepts such as expectation, variance, and some version of Bayesian inference, and all of these can only be taught at a shallow level.

Really? I am not so sure. Does math == depth? I doubt it. How often do your students, 6 months out, remember the math you showed them? Roughly, “Never,” I suspect.

Part of the problem is that by the time they’re delivered to you they’ve been marinating in statistical misunderstandings for a decade or more. In my world, law, most judges are so self-assured of their statistical fluency that they will reject out of hand e.g. an objection to an e-discovery vendor’s claim of providing “search results of which you can be 95% confident.” So far this year there have been 154 published opinions or orders containing “statistically significant” and all get it wrong, sometimes hilariously so, with one court declaring it the law of the land that standard deviation and standard error are one and the same. Thus I’m quickly becoming one of those cranks who think probability ought to be taught (at a very basic level) shortly after arithmetic.

P values are an attractive nuisance.

Interestingly perhaps I named my blue water sailboat “Attractive Nuisance” and am embarrassed to recall that back then I was in the habit of saying things about p-values at depositions, seminars and in courtrooms that I now recognize to be appallingly wrong. P-values are indeed disguised traps for the unwary.

Thanatos,

What a curious situation.

“The terms `standard error’ and `standard deviation’ are often confused. The former concept, standard error, concerns the reliability of Howell’s statistics … In requiring that Howell instead calculate the standard deviation, the majority perpetuates an error … Thus, the majority requires Howell to produce evidence that is not at all relevant to probing the reliability of his statistics.” – From the dissent in which Sander Greenland and Doug Altman get props. https://scholar.google.com/scholar_case?case=5084958452120810922

The dissent is clearly far more coherent than the majority opinion. The question of nonresponse is definitely valid, though; both the majority and the dissent seem to be confused about that. Enormous sample sizes can help with random sampling error, but not with systematic nonresponse error.
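That distinction can be shown with a toy simulation. The response mechanism below is entirely hypothetical (people with higher values of the measured quantity respond less often); the point is only that the bias it induces does not shrink as the sample grows:

```python
import random

# Population mean is 0, but people with higher values are (hypothetically)
# less likely to respond. Sampling error shrinks with n; the nonresponse
# bias does not.
random.seed(42)

def survey(n):
    """Mean of n responses collected under the assumed nonresponse mechanism."""
    responses = []
    while len(responses) < n:
        x = random.gauss(0, 1)
        p_respond = 0.8 if x < 0 else 0.4  # assumed, for illustration only
        if random.random() < p_respond:
            responses.append(x)
    return sum(responses) / n

# Both surveys are biased low (around -0.27); the huge one is merely
# more precisely wrong.
print(round(survey(100), 2), round(survey(100_000), 2))
```

Under this mechanism the respondent mean converges to about -0.27 rather than the true 0, no matter how large the sample.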

the majority just waves its arms… the dissent is a little too credulous…

Then throw in sampling error for good measure, confuse the hell out of ‘um.

I begin my very, very short (9 hours!) statistics course for beginner bioscience research students with a couple of hours workshopping issues of sampling and scope of inference from sample and statistical model to real world properties. The most important ways to stuff up scientific inferences with statistical nonsense involve failures in those areas, so the time spent is important even if the students think that they should be taught primarily about t-tests, ANOVAs, and confidence intervals.

I took both your “Applied Regression and Multilevel Models” and Don Green’s “Experimental Research,” both seemingly oriented toward advanced undergrads/MA students or first-year social science Ph.D students. Both classes were really important for me in (1) developing some statistical intuition and (2) just learning how to do important things that social scientists should be able to do. I took your class in the fall and Don’s class in the spring.

I think it would have been better if I had taken them in the reverse order. I think that a one-year social science stats sequence with a class like Don’s in the first semester and a class like yours in the second would be valuable for lots of undergraduates and social science grad students. This is because (you may disagree) many of the inferential concepts necessary for using statistics in an observational context are much more easily understood and introduced in an experimental context.

It seems crazy to me that so many undergrad econometrics classes jump immediately into the deep end of causal inference, often using assumptions that only a randomized experiment could realistically meet, before students have built the necessary intuitions in the experimental setting that’s being used as an implicit benchmark.

Randomization inference in the experimental context is a really intuitive introduction to the concept of simulation, and experiments (again, you may disagree) are a good mental model to have in mind when considering non-experimental data. I think starting with experiments and then moving on to a simulation-heavy class like “Applied Regression and Multilevel Models” would make a big difference for a lot of students, particularly those who (a) are math-averse or (b) think they understand statistics but don’t (I was one of these and maybe still am).

“It seems crazy to me that so many undergrad econometrics classes jump immediately into the deep end of causal inference…”

Hear, hear! Couldn’t agree more. But there’s an alternative to your suggestion (start with experiments, then simulation), namely start with prediction and forecasting. Build a healthy respect for data, embrace uncertainty, learn not to automatically interpret things causally, etc., and then (eventually) get to causal inference.

Would work with econ because prediction/forecasting is a big part of applied econ. But perhaps not suited to many other disciplines. Not biology or psychology, I think; maybe poli sci, not sure … but I am out of my comfort zone here.

This reminds me of when I took a course in philosophy of science during my PhD program. The course was at the Department of Philosophy and I got the sense I was the only one from a natural science background (ecology).

I particularly remember a practical exercise using a shoe box with about 4 knitting needles stuck through it. When you turned the box around and over, you could hear some metallic sounds from the inside, like something was threaded over the knitting needles inside the box. But it was difficult to determine just how they were arranged. Were they round? How many? Threaded in what way over which needles? You were allowed to manipulate the box and knitting needles as you liked, pulling out one after the other and observing the resulting phenomena. Of course, you weren’t allowed to open the lid and peek inside. And you only got one box, so you’d have to use the needles efficiently. The object of the exercise was to determine the configuration of the inside of the box, by experimentation and by practicing the hypothetico-deductive method. You made an initial basic observation, then formulated a hypothesis, then performed an experiment – turning the box over this or that way, or pulling out a particular needle a given distance – and listened to what happened inside. But of course it was also a metaphor for how we learn things about the real world. We observe most things indirectly, not being able to observe the true state or mechanism, having to use a mental representation of the world which you interpret the results through.

And it actually worked. Well at least I ended up rather confident of the insides of the box. I didn’t peek inside. I felt that would have been cheating in some profound way.

I remember the professor remarking on my report that it showed I had earlier experience with the process, coming from a science background. I was a bit stunned at that, since this was the first time I had come across a problem where the hypothetico-deductive method could be applied in a clean way. All my real-world problems in ecology were much messier: observing only proxies of the thing of interest, getting messy data that can’t be interpreted in a clear-cut way, multiple working hypotheses, no obvious path of sequential deductions (or strong induction). The shoe box was a dream in comparison, and I enjoyed it greatly. But I’ve never come across it since.

Anyway, I highly recommend it for an introductory philosophy of science or statistics class, as long as you take the time to explain how “real science” seldom is so simple. I assume teachers could get blueprints from the Dept. of Philosophy at Uppsala University, Sweden.

> not being able to observe the true state or mechanism, having to use a mental representation of the world which you interpret the results through.

I do think that is a key insight/distinction that needs to be brought out early in one’s learning about statistics.

Though instead of a physical (closed/black box) artifact I use “a shadow metaphor, then a review of analytical chemistry and move to the concept of abstract fake universes (AFUs) … ones you can conveniently define using probability models where it is easy to discern what would repeatedly happen – given an exactly set truth”. https://statmodeling.stat.columbia.edu/2019/10/15/the-virtue-of-fake-universes-a-purposeful-and-safe-way-to-explain-empirical-inference/

This automates the experimenting process, and if the students can use R they can make their own boxes to explore. Here you can start to become realistic, adding such things as systematic error, selection bias, confounding, etc.
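As a minimal sketch of such a fake universe (written in Python rather than R, with made-up numbers): set a known truth, add sampling noise plus a systematic measurement error, and watch what repeated studies report.

```python
import random

# A tiny "abstract fake universe": the truth is set exactly, but every
# measurement carries noise and a systematic bias. We then observe what
# many repeated studies would report. All numbers are illustrative.
random.seed(7)
truth, bias, n = 10.0, 1.5, 50

estimates = []
for _ in range(1000):               # 1000 repeated studies
    data = [random.gauss(truth + bias, 3) for _ in range(n)]
    estimates.append(sum(data) / n)

avg = sum(estimates) / len(estimates)
print(round(avg, 1))  # centered near truth + bias (11.5), not the truth (10.0)
```

Because we set the truth ourselves, students can see directly that averaging over more and more studies converges on the biased value, not the true one.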

Here’s a link to the home page for a “Continuing Education” course I taught for about ten years (and have since passed on to someone else): https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

It’s a 12 hour course (3 hours for each of 4 days), but I think you can see how the idea could be expanded to the outline for a semester course. I do not do any proofs or derivations. But I do spend time discussing definitions and terminology, and pointing out frequent confusions and false assumptions.

I introduce hypothesis tests (on the third day — after spending the first two days discussing things like uncertainty, probability, randomness, biased sampling, and problematic choice of measures) as follows:

Type of Situation where a Hypothesis Test is used:

• We suspect a certain pattern in a certain situation.

• But we realize that natural variability or imperfect measurement might produce an apparent pattern that isn’t really there.

Basic Elements of Most Frequentist Hypothesis Tests:

Most commonly-used (“parametric”), frequentist hypothesis tests involve the following four elements:

i. Model assumptions

ii. Null and alternative hypotheses

iii. A test statistic

This is something that

a. Is calculated by a rule from a sample;

b. Is a measure of the strength of the pattern we are studying; and

c. Has the property that, if the null hypothesis is true, extreme values of the test statistic are rare, and hence cast doubt on the null hypothesis.

iv. A mathematical theorem saying,

“If the model assumptions and the null hypothesis are both true, then the sampling distribution of the test statistic has a certain particular form.”

Note:

• The sampling distribution is the probability distribution of the test statistic, when considering all possible suitably random samples of the same size. (More later.)

• The exact details of these four elements will depend on the particular hypothesis test.

• In particular, the form of the sampling distribution will depend on the hypothesis test.
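These elements lend themselves to simulation: instead of invoking the theorem in element iv, a class can approximate the sampling distribution of a test statistic under the null by brute force. A hedged Python sketch (the sample sizes and the observed value are made up for illustration):

```python
import random

rng = random.Random(42)

def one_null_sample(n):
    """Difference in sample means when the null hypothesis is true,
    i.e. both groups are drawn from the same distribution."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(b) / n - sum(a) / n

# The simulated sampling distribution of the test statistic (element iv,
# obtained by simulation rather than by a mathematical theorem).
null_stats = sorted(one_null_sample(30) for _ in range(5000))

# Element iii.c: extreme values are rare under the null.  The simulated
# 97.5th percentile plays the role of the theorem's critical value.
critical = null_stats[int(0.975 * len(null_stats))]
observed = 0.9  # a hypothetical observed difference in means
print(observed > critical)  # True: a difference this large casts doubt on the null
```

The same loop, rerun with different models in `one_null_sample`, also makes the earlier note concrete: the form of the sampling distribution depends on the particular test and its model assumptions.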

Martha,

Why not launch it on Coursera?

I like Gottfried E. Noether’s approach of teaching nonparametric statistics as the basis for introductory statistics, instead of relegating it for some reason to the more ‘advanced’ sections. This approach can be ‘modernized’ with the many rank-based methods, and the nonparametric but non-rank-based methods, that have been created since Noether’s time.

Justin

http://www.statisticool.com

I disagree entirely. Non-parametrics have extremely limited utility in the real world of applied modeling. Students are better off concentrating on learning probability really well (and/or linear algebra, dynamical systems, etc., etc.)

That’s interesting, but statistics comprises a very large number of fields. In survey statistics, for example, I use nonparametric statistics all the time. In fact, a few Nobel prize winners’ recent papers used some nonparametric tests.

Cheers,

Justin

This does not justify the use of nonparametric tests. If anything, it questions the statistical competence of these Nobel prize winners.

I agree with Chris when he says the utility of these methods is extremely limited. In fact, I know of no situation where a nonparametric test is better suited than a parametric alternative. Think about it: most of the nonparametric tests used today were invented in the 1940s and 1950s, a time when not much computational power was available to the ordinary researcher. Thus, tests requiring no more effort than sorting and counting seemed very attractive. Today, however, more sophisticated methods exist that make nonparametric tests completely needless.

I wish we introduced decision theory early.

Bayesian thinking lets you escape the straitjacket of the yes/no mindset that comes with p-values and NHST, but it’s somewhat unsatisfactory if there’s a real-world yes/no decision at the end of it all and the course leaves you hanging about how to make that decision.

This is my beef with stat courses whether Bayesian oriented or traditional. Ironically, even as the courses progress to more advanced ones they don’t seem to add much more material on the decision theory side.

It’s somewhat of a no man’s land that no course wants to tread on.

Hear hear! Decision theory should be nearly the first thing discussed in a stats class… as motivation for why we do stats in the first place. Then, as we learn more, we should always come back to how to utilize the new stuff in decisions.

In many ways inference is just making decisions about approximations.
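For what it’s worth, the decision step these comments are asking for fits on one slide: a posterior probability plus a loss function yields a yes/no decision by minimizing expected loss. A hedged sketch with entirely hypothetical loss numbers:

```python
def expected_loss(action, p_disease, losses):
    """Expected loss of an action, given P(disease) and a loss table."""
    return (p_disease * losses[(action, "disease")]
            + (1 - p_disease) * losses[(action, "healthy")])

# Hypothetical losses: treating a healthy patient costs 1 unit;
# failing to treat a sick patient costs 20 units.
losses = {
    ("treat", "disease"): 0, ("treat", "healthy"): 1,
    ("wait", "disease"): 20, ("wait", "healthy"): 0,
}

p = 0.10  # posterior probability of disease
best = min(["treat", "wait"], key=lambda a: expected_loss(a, p, losses))
print(best)  # "treat": optimal even at a 10% posterior, given these losses
```

The point made by the pre-med anecdote below is visible in the numbers: the decision depends on the patient’s loss function as much as on the probability, and changing the loss table flips the answer without any change to the statistics.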

Indeed, and that is why I designed my honors college course (mentioned above) around decision theory rather than around “statistics”. It was a seminar course with typically around 20 students or so, taught in the honors college to Freshmen (at Texas) or Sophomores (at the University of Vermont…their honors program starts sophomore year). I had all sorts of students in the course, quite a few pre-med (some of whom came back years later to thank me, as medical decisions definitely need to consider both the probabilities as well as the patient’s loss function), some math oriented including several students who went on to stats careers, but quite a few liberal arts majors and even one dance major.

Statistics was where the calculus I struggled with as an undergrad suddenly made sense – perhaps because I am a visual thinker. That’s for the applied side. But the real a-ha moments came when I (much later) read ‘The Lady Tasting Tea’, an excellent journey through the history and types of statistical analysis. For me, seeing how statistics was developed by people as a tool gave me greater insight into what the field does and how it works.

I can easily imagine a course which begins with the problem or opportunity faced by a Pearson, for example, and then how he got to where he did, and then applying those ideas in practical form.

A great joy of a grad level statistics course was being set free with big data sets (census and land use, in urban planning) and simply exploring what one can make of it. That aspect of ‘play’ is both useful and unexpected, and may lead our poets to work that connects with them!

Do you have a date for when the new book is due out?

Rather than an intro stats course, I think it would be far more useful for most students to learn how to read a paper and to recognize potential problems, rather than to learn how to perform some tests or do a simple linear regression or whatever. This would in turn help them to avoid mistakes when doing their own research, and prompt them to think. Maybe stats wouldn’t be viewed as the required blessing oracle for one’s research. The course would be more analogous to reading this blog. The focus would be on thinking. The course would mostly be taught with lots of plots.

The medical residents (and docs) that I help have had stat 101 courses. They’ve had some brief biostat lectures in MS1 year. They don’t get it, nor should they, because I’m sure the main concern was the couple of questions on the USMLE Step 2 exam rather than learning (they have lots of other stuff to learn!). It would be far more relevant for them (and I am sure for a variety of other disciplines) never to take a Stat 101 course, but rather to take a course on how to think about research and how to read a paper.

jd:

To extend your thoughts a bit, I think a course that took one or a few applied problems and worked them from start to finish would bring up most of the thorniest issues, such as unrealistic modeling of error, assumptions of causality without any argument for them, and unmeasured variables. The problems could not be of the sort typically presented in a textbook; they would need to be unstructured, so that structuring the problem would be a primary focus of the course and the statistical methods a way to model reality relative to the problem at hand.

1. Formulating the question

2. Identifying sources of data that will be useful

3. Data collection/acquisition

4. Models of the problem

5. Statistical methods that would be useful for shedding light on these models

I agree that would be a useful course. I actually wish I had taken a course like that in my program. I think that would actually be a good first course for both stat majors and non-majors. It would have been helpful for me (and I think for many others) to get a good view of the forest before studying each tree. And, some people never need to study the trees, only view the forest.

I do think it is helpful to view mistakes though, like the examples on this blog, and the ideas that go along with those examples. So I would include that in a course as well.

I teach a course like this to sophomore biology majors (mostly pre-med). We designed the course based partly on comments like jd’s. Upper-level undergrads often said they wished they’d had a course in data analysis and writing before they had to do it all in senior-level classes or in their required senior biostats course. We work in R (they don’t know Excel or any other program anyway, so there is no “easy” default for them) and focus a lot on thinking about questions, gathering data, plotting, rethinking their questions, etc.

The students in the course generally hate it. They’re 19 and it’s very hard for them to see the relevance of having quantitative skills. In general, the course fits some of the advice that Andrew gives on this blog – there is no single course that can cover it all or fix any deeply held misunderstandings. At least not for a majority of students. The main weakness in our approach is that they don’t have repeated consistency in later classes. We can teach them best principles in the intro course, but then they learn all kinds of things from professors in later courses that don’t match up, and they become confused and frustrated. In a perfect world, this course would be a first step and everything later would build off of it.

You mean an emphasis on critical thinking. In medical education this would be a challenge to introduce, judging by Gerd Gigerenzer’s efforts to bring critical thinking into the continuing medical education space.

I’m enamored with the idea of beginning an intro course with the very broad idea of models. People are deeply familiar with models in their daily lives–maps, blueprints, essay outlines, the periodic table of elements, weather forecasts, the little plastic planes you glue together, ships in bottles…. This leads into formal ideas like constructs, parameters, distributions, etc.

You could also bring in the notion that we all act as amateur scientists all the time. We apply implicit models of events and behavior, like when we plan what we’ll wear tomorrow, or evaluate why an acquaintance is acting strangely, or decide whether a girl likes you or is just being nice. These models are devised by reflecting on the past, tested by observing and asking questions, and revised based on those results. This could lead to introducing the scientific method generally and Bayesian priors in particular.
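The everyday “amateur scientist” updating described above can be shown as a one-line Bayes-rule calculation. A hedged sketch, with entirely made-up numbers for a mundane question:

```python
def bayes_update(prior, p_data_if_true, p_data_if_false):
    """Posterior P(hypothesis | data) from a prior and two likelihoods."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1 - prior) * p_data_if_false)

# Everyday implicit modeling: will it rain tomorrow?  Start from a base
# rate, then observe dark clouds tonight (all numbers hypothetical).
prior = 0.20                 # it rains on about 1 day in 5
posterior = bayes_update(prior,
                         p_data_if_true=0.80,   # clouds, given rain
                         p_data_if_false=0.30)  # clouds, given no rain
print(round(posterior, 2))  # 0.4: the clouds double our belief in rain
```

The same three numbers — prior, likelihood under the hypothesis, likelihood under the alternative — reappear unchanged in every formal Bayesian analysis, which is what makes this a natural bridge from daily life to priors.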

I love the focus on models – it (a) helps get students away from the idea that models are always simple “no difference” null hypotheses and puts the focus on the parameters rather than binary decisions, and (b) is a natural way to introduce likelihood.

Here’s my question: why aren’t likelihood and maximum likelihood in all intro stats classes, somewhere near the front? I, and I think many nonstatistician scientists, are surprised when we eventually learn that most of the standard cookbook tests and estimators can be derived from ML, and that the same method is vastly more flexible than the cookbook tests of STATS101. It’s not even that difficult: it can be introduced conceptually and done in lab with a hill-climbing search algorithm, and derivative calculus for finding analytic ML estimators can be described conceptually as well. Then students have some idea of where these formulas come from, and also what to do when data don’t fit in one of the cookbook boxes.

Seems like it would make sense to introduce the big unifying idea early on! I saw George Cobb make this argument in a 2015 paper, but it still doesn’t seem to be catching on in intro stats.
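The hill-climbing lab suggested above fits in a few lines. A sketch under a binomial coin-flip model (the crude search is for classroom intuition, not a recommended optimizer):

```python
import math

def log_likelihood(p, successes, n):
    """Binomial log-likelihood in p (the constant binomial term is dropped,
    since it does not affect where the maximum is)."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def hill_climb_mle(successes, n, start=0.5, step=0.1, tol=1e-6):
    """Crude hill-climbing search for the maximum-likelihood estimate of p:
    step left or right while the likelihood improves, shrinking the step
    whenever neither neighbor is uphill."""
    p = start
    while step > tol:
        best = p
        for candidate in (p - step, p + step):
            if 0 < candidate < 1 and \
               log_likelihood(candidate, successes, n) > log_likelihood(best, successes, n):
                best = candidate
        if best == p:
            step /= 2  # no uphill move: refine the search
        else:
            p = best
    return p

# 7 heads in 10 flips: the analytic MLE is 7/10, and the search agrees.
print(round(hill_climb_mle(7, 10), 4))  # → 0.7
```

Students can then check the hill-climbed answer against the calculus answer (set the derivative of the log-likelihood to zero), seeing that the formula and the search are two routes to the same place.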

Possibly the reason is the same that the standard experimental design course is built around the ANOVA–these are courses that are taught by either grad students or postdocs, who lack the experience and insight (and often the autonomy) to make such changes, or by profs who have a disincentive to invest extra time in an intro course for which little is expected and doing more than what’s expected is not rewarded.

Thanks for your thoughts!

I found the George Cobb quote:

“Our junior-senior course in ‘mathematical’ statistics sits serenely atop its mountain of prerequisites, unperturbed by first and second year students. Tradition takes it for granted that they must climb through the required courses in calculus and probability before ascending to maximum likelihood and the richness of profound ideas. Of course we want students to learn calculus and probability, but it would be *nice* if we could join all the other sciences in teaching the fundamental concepts of our subject to first year students.”

(emphasis original)

— p. 276 of: Cobb, George (2015). “Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up.” _The American Statistician_. 69:4, 266-282. DOI: 10.1080/00031305.2015.1093029


I have been having an interesting re-learning experience this fall. My son is at a major state university. He is an environmental science student taking Intro Stats right now. I have a PhD in Industrial Engineering.

After completely bombing his first test, he came to me for tutoring. I was shocked that all his lecture slides were only equations: not a single plot of data to illustrate the concepts. I couldn’t believe that this level of instruction is considered acceptable. It was easy to find good lectures online and use this material to tutor my son and his friends, but I still can’t get over the lack of recognition that people have a variety of learning styles, and ‘statistics by proof’ does not serve all learners.

I would start a course with a text like Phillip Good’s Resampling Methods. Lots of examples, and then build the math from the data and the problems. That would provide a reason for the math, and a background in data handling as the course progresses.
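In that spirit, the core resampling move is small enough to show up front. A hedged sketch of a percentile bootstrap interval for a mean (the data are made up, not taken from Good’s book):

```python
import random

def bootstrap_ci(data, reps=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean: resample the data with
    replacement, recompute the mean each time, and read off the middle
    100*(1 - alpha) percent of the resampled means."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(reps))
    lo = means[int(reps * alpha / 2)]
    hi = means[int(reps * (1 - alpha / 2))]
    return lo, hi

data = [3.1, 2.4, 4.0, 3.3, 2.9, 3.8, 3.5, 2.7, 3.0, 3.6]  # illustrative
lo, hi = bootstrap_ci(data)
print(round(lo, 2), round(hi, 2))  # an interval bracketing the sample mean, 3.23
```

The math (standard errors, sampling distributions) can then be built on top of something the students have already watched happen, which is the pedagogical order the comment above is advocating.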

Chris said,

“I still can’t get over the lack of recognition that people have a variety of learning styles and ‘statistics by proof’ does not serve all learners.”

This is not just a matter of a variety of learning styles — it’s that “statistics by proof” is not real statistics. It’s (I assume) just giving proofs, with no discussion of when they are appropriate to apply, and no discussion (including examples) of what the concepts are (and are not), of model assumptions, etc.

I’m a mathematician, but for non-math students I don’t give proofs; I focus on what a relevant theorem does and does not say in various contexts. Even when teaching math students, I emphasize the applications and give proofs only when they seem to make an important point; instead, I emphasize conceptual understanding and discussion of what the hypotheses and conclusions of a theorem say or don’t say in applications. Seeing the proof of a theorem is (usually) a waste of time, compared to learning NOT to “apply” the theorem in a garbage-in, garbage-out way.

Possibly the most important audience for introductory stats courses is engineering students. They typically have decent training in basic mathematics. The typical stats course for engineers is a mix of basic probability and hypothesis testing, with a narrative grounded in the notion that the world is dedicated to quality control in the manufacture of widgets (a context where the hypothesis-testing paradigm does fine). Now many engineering students want to learn machine learning, the undefined “big data,” and the like, and will take this classical perspective to contexts where it may or may not work well, with little or no background in inference and statistical thinking. The basic stats courses for engineers need rethinking too. Some have tried, but my sense is that this often turns into a lot of rote exercises mechanically processing data, whose provenance and quality go unexamined, using R or Python (what is useful for the student is learning to use R or Python).

Yes, all of this makes good sense. I TAed this class several times in grad school and was generally unimpressed with the structure. It wasn’t ambitious enough. Most of the students had already taken a Matlab programming course, but there was virtually no data analysis. Almost all the course was devoted to by-hand calculations with toy probability models… Poisson and binomial and normal distribution calculations… still using the tables in the back of the book! They did almost no graphing.

Daniel said, “Almost all the course was devoted to by-hand calculations with toy probability models… Poisson and binomial and normal distribution calculations… still using the tables in the back of the book! They did almost no graphing.”

So sad! Applets can be so much more helpful in instilling conceptual understanding, which is what is important in real application.

I guess you could get much of (1) out of posts on this blog. And part of (3) from Dan Simpson’s posts…

Agree it makes sense to differentiate intro classes for people who will continue in this field from intro classes that are part of the general knowledge we need as informed citizens/professionals.

One example in a science field that is like Option #1 is the “Physics for Future Presidents” course at Berkeley, which, instead of going through the standard toy problems of intro physics (force diagrams, inclined planes, and so on), focuses on the physics concepts underlying various topics of interest (nuclear power, earthquakes, climate change, and so forth), with few to no equations.