A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.
As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect in that military-bureaucratic way), also this list of specific objectives of the course:
1. Understanding the notion of randomness and the role of variability and sampling in making inference.
2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.
3. Employ models using discrete or continuous random variables to answer basic probability questions.
4. Be able to draw appropriate conclusions from confidence intervals.
5. Construct hypothesis tests and draw appropriate conclusions from p-values.
6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.
7. Critically evaluate statistical arguments in print media and scientific journals.
This is all ok except for items 4 and 5, I suppose.
Also, at the end, a list of rules, beginning with:
a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.
b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .
e. Jackets are not permitted in the classroom . . .
g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .
and ending with this:
j. Rules common to blackboards, written work, and examinations:
1) Draw and label figures or graphs when appropriate.
2) Report numerical answers using the appropriate number of significant digits and units of measure.
Now those are some rules I can get behind. They should be part of every statistics honor code.
On point j2 (and part II of that), I was interested in views about differences between disciplines.
I started undergraduate in the physical sciences, where there was much stress on units of measurement. Getting those right was often useful to make sure you were using the right equations, so there was immediate value in it (beyond just not losing marks on assignments). In grad school I moved into psychology. While measurement is a big topic in some parts of psychology (and there is much work on it, e.g., the Foundations of Measurement trilogy), reporting units seems to be less stressed. It seems all the more important in psychology, where sometimes the measure is some combination of several rating scale responses. Is my view of this discipline difference something other people also see?
It’s quite funny to read about, say, government debt being 110% of GDP. The first is, of course, just a sum of money and the second is money per year.
Ah…the Corps has. When I took that course some 30 years ago, there was no thought of taking coffee into the classroom. Other than that, the list of objectives looks rather familiar. That’s a core course; I would imagine the curriculum evolves more slowly than the electives.
“This is all ok except for items 4 and 5, I suppose.”
You think drawing inappropriate conclusions would be better? ;-)
Why is #4 bad? Not being sarcastic here. What’s wrong with confidence intervals?
Hoekstra, Rink, Richard D. Morey, Jeffrey N. Rouder, and Eric-Jan Wagenmakers. “Robust Misinterpretation of Confidence Intervals.” Psychonomic Bulletin & Review, January 14, 2014, 1–8. doi:10.3758/s13423-013-0572-3.
http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf
“Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual. Nevertheless, little is known about how researchers interpret CIs. In this study, 120 researchers and 442 students—all in the field of psychology—were asked to assess the truth value of six particular statements involving different interpretations of a CI. Although all six statements were false, both researchers and students endorsed, on average, more than three statements, indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers’ performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever. Our findings suggest that many researchers do not know the correct interpretation of a CI. The misunderstandings surrounding p-values and CIs are particularly unfortunate because they constitute the main tools by which psychologists draw conclusions from data.”
Teach them how to get it right then…
The thing is, that CI does not mean what people *want* it to mean. It’s the same problem as with the p-value/NHST but less dangerous because it is used less strictly as a decision tool and gives a better *idea* about uncertainty.
Daniel:
Indeed. Recall Larry Wasserman’s statement:
To which I replied:
The point is, Larry wanted the confidence interval to be something it isn’t.
Isn’t this just a criticism of model quality?
A bad model will give badly performing CI’s, yes. But a bad model might / will also give you bad means, medians, sums whatever other property you may choose to extract from it.
Am I misunderstanding?
It is such a common error in science (it’s even inspired poetry, see below), but the one variation I dislike the most is treating the implications of continuity as relevant in science, where continuity is only a very convenient approximation.
“The word [model of a] butterfly is not a real butterfly. There is the word [model] and there is the butterfly. If you confuse these two items people have the right to laugh at you. Do not make so much of the word [model].”
http://www.youtube.com/watch?v=r2XkfBWSmcs
I agree with Daniel that a big part of the problem is that the CI and p-value are not what people “want”. So I often discuss those concepts starting out with “what we want” and ending with “what we get.” I think (at least hope) it helps more people get it right – or at least, realize that they don’t really understand.
Also, I’ve found (in a master’s course for high school teachers, who had already had a more-or-less standard first course in statistics) that doing a little Bayesian analysis helped them gain a better understanding of frequentist CI’s (and convinced some of them that they’d prefer the Bayesian approach.)
Martha, that’s interesting to hear, and while I won’t be able to reach any Bayesian statistics in our introduction to quantitative methods, I might be able to try to give a better idea about the true meaning of p-values and CIs. Need to be careful not to irritate the students too much.
@Daniel
So, if not CI’s nor p-values, is there another metric that captures what people *want* it to mean?
I.e., if we cannot rid people of their misinterpretations, can we produce a metric that is aligned with what people are expecting?
Or are people expecting answers we just cannot deliver?
Well, most often CIs and the like are interpreted as if they were BCIs but those of course have their own caveats and I guess they would be misinterpreted as well. They at least pretend to deliver something more akin to what we want, though. I guess it would be important to stress more strongly though what statistics just can’t deliver. Embrace uncertainty. ;)
Nothing as long as the problem is simple, has no nuisance parameters, no strong prior information, and useful sufficient statistics. Outside of that problem domain, it quickly leads to nonsense.
Then there’s the issue that even within that domain CI’s don’t have the coverage property advertised. For this to happen, the system has to have approximately stable frequency distributions. This is never checked by frequentist model checking, and rarely checked by independent experiments. It’s hardly ever true which is why as a rule of thumb 95% CI’s are lucky to have 30% coverage.
Then there’s the whole issue that CI’s don’t answer the question people want answered. They don’t want an interval that works on average over some non-existent future repetitions, they want a best interval for the one repetition (data set) that exists.
But hey, if we give frequentists another 70-year monopoly on the teaching of introductory stat, maybe they’ll somehow magically just ‘teach’ all this right and all the problems will go away.
Or we could recognize that it was an unfortunate historical accident that Fisher, Neyman, Pearson, checked their ideas on these simple kinds of problems which happen to give answers operationally equivalent to Bayes, and so seemed to work, and wrongly assumed they’d still give good results more generally. Once this unfortunate historical accident is recognized, then we can all just use the Bayesian answers, which work well even outside this narrow problem domain.
That reminded me of the quip that 95% of all statistics are made up. :)
If you are on a Crusade it’s OK to pillage a few villages on your way or maybe sack Constantinople once in a while. Otherwise, students need to know what is written in all those papers in all those journals printing p-values and CIs and such. Change does not come from the classroom, it’s a top-down approach.
“most of the instructors were just doing the job for two years or so between other assignments”
Reminds me of when I was an instructor at Rice in the early 70’s. One of the graduate students (taking one of my undergraduate courses) was in the Army and was there to get a master’s degree in math (Rice usually only admitted graduate students for a Ph.D.), after which he “would be” (his words) an instructor at West Point, where he had been an undergraduate.
And that reminds me of the student I had a few years ago in a graduate statistics course at Texas who was in the service (a captain, I think?) and was getting a Ph.D. in business/OR, after which he “would be” in charge of a VA hospital. The military is a different world than I’m used to.
But here’s a line of arguments I have not been able to refute if students bring it up.
From the cited paper (http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf), the following statement S1 was labeled correct (as I have also learned it):
S1: “If we were to repeat the experiment over and over, then 95 % of the time the confidence intervals contain the true mean.”
S2: “Then we can be 95% confident that we got one of the CIs containing the true mean”
S3: “Then we can be 95% confident that the true mean lies between 0.1 and 0.4”
Statement 3 was commented with: “mentions the boundaries of the CI (i.e., 0.1 and 0.4), whereas, as was stated above, a CI can be used to evaluate only the procedure and not a specific interval”.
I have not found a good reply to students arguing S1 -> S2 -> S3. Can anyone help me out here?
I know we should change teaching statistics to a Bayesian framework, but frequentist stats is so ubiquitous that they should also be able to interpret it correctly.
This is because for simple examples like that one the confidence interval and credible interval using a uniform prior are the same:
https://stats.stackexchange.com/questions/12567/examples-of-when-confidence-interval-and-credible-interval-coincide
Berry,
In many cases the confidence interval is just an alternative way of deriving the credible interval using a uniform prior:
https://stats.stackexchange.com/a/12571
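To make the coincidence concrete, here is a minimal sketch for the simplest case I can think of: a normal mean with known sigma. That setup, the data values, and both function names are my own assumptions for illustration, not taken from the linked answer. The code computes the usual z-interval and, separately, a central credible interval from a flat-prior posterior evaluated on a grid.

```python
import math
from statistics import NormalDist, fmean

def z_interval(data, sigma, level=0.95):
    """Frequentist interval for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    xbar = fmean(data)
    half = z * sigma / math.sqrt(len(data))
    return xbar - half, xbar + half

def flat_prior_credible_interval(data, sigma, level=0.95, grid_points=20001):
    """Central credible interval under a flat prior, via a grid posterior."""
    xbar = fmean(data)
    se = sigma / math.sqrt(len(data))
    start, stop = xbar - 8 * se, xbar + 8 * se  # mass outside is negligible
    step = (stop - start) / (grid_points - 1)
    grid = [start + i * step for i in range(grid_points)]
    # With a flat prior the posterior is proportional to the likelihood.
    loglik = [-sum((x - mu) ** 2 for x in data) / (2 * sigma**2) for mu in grid]
    peak = max(loglik)
    weights = [math.exp(v - peak) for v in loglik]  # subtract peak to avoid underflow
    total = sum(weights)
    tail = (1 - level) / 2
    lower = upper = None
    running = 0.0
    for mu, w in zip(grid, weights):
        running += w / total
        if lower is None and running >= tail:
            lower = mu
        if upper is None and running >= 1 - tail:
            upper = mu
            break
    return lower, upper

if __name__ == "__main__":
    # Hypothetical data, chosen only for illustration.
    data = [4.1, 5.3, 3.8, 6.0, 4.7, 5.5, 4.9, 5.1, 3.6, 5.8]
    print("confidence interval:", z_interval(data, sigma=1.0))
    print("credible interval:  ", flat_prior_credible_interval(data, sigma=1.0))
```

The two intervals should agree up to the grid resolution. That agreement is a property of this simple model, not of confidence intervals in general.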
Berry,
I avoid (and urge students to avoid) using the phrase “95% confident,” since it is so vague as to be open to a variety of interpretations. This seems to be what is happening in the line of “reasoning” you describe.
Do you suggest an alternative phrase?
Just talk about a 95% confidence interval. If they insist on a phrase starting with “we,” give them something more explanatory like “We have used a process which, for 95% of all suitable random samples, produces an interval containing the true value of the parameter.” Saying “we are 95% confident” is deceptive because it avoids the complexity that is inherent in the concept. It may also be deceptive because it invokes a feeling, which many may find more comfortable than the complexity needed to understand.
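A small simulation can make the “process” reading concrete. This is only a sketch under idealized assumptions of my own choosing (normal data, known sigma, independent samples, a hypothetical true mean of 0.25); it shows that the 95% describes the long-run hit rate of the procedure and says nothing about any one interval.

```python
import math
import random

def coverage(true_mean=0.25, sigma=1.0, n=30, reps=5000, seed=0):
    """Fraction of simulated 95% z-intervals that contain true_mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # known-sigma z-interval
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its mean.
        xbar = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    print(f"empirical coverage: {coverage():.3f}")
```

Each simulated interval either contains 0.25 or it does not; only the proportion across repetitions is close to 0.95, and only because the model assumptions hold by construction here.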
A good point with which I totally agree.
For me the main thing about “we’re 95% confident” is that it pretends to be an “interpretation” of the CI but it isn’t, because apart from exactly this use there is no definition of what “percentage confidence” means, which means that whenever somebody asks: “And what does being 95% confident mean?” – one can only reply with the exact and complex definition of confidence intervals, whereas the people who use the wording “we are 95% confident” apparently hope that this “explains” something more than the underlying formal definition.
S2 introduces the natural-language concept of “being X% confident that”. What does this really mean? (In contrast, S1 just uses the word “confidence” as an arbitrary technical label.) S3 goes further in applying this concept to concrete results of an actual observation (so there’s no clear sample space left). I assume you (or your students, if of a frequentist bent) would not be comfortable substituting “the probability is X% that…”. But if it’s not (Bayesian) probability, it’s hard to critique the argument further without hearing what this concept actually means.
There are many examples where the specific interval produced by an entirely correct 95% interval procedure can be seen to be absolutely certain to contain the correct value (or in other examples: provably cannot contain the correct value). So when we pin down what “95% confidence” means, it needs to be consistent with “and furthermore, we know for sure that the statement is false” or “and we know for sure that it is true”. So it’s fair to press a bit more on what is actually being said here, because it’s surely not obvious.
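One textbook illustration of this, sketched here with a 50% rather than a 95% procedure to keep it short (the setup and the name `simulate` are my own assumptions, not from the comment above): draw two observations uniformly from (theta - 0.5, theta + 0.5) and report the interval between them. The procedure covers theta half the time, yet whenever the two observations are more than 0.5 apart the interval must contain theta.

```python
import random

def simulate(theta=3.0, reps=20000, seed=1):
    """Return (overall coverage, coverage among intervals wider than 0.5)."""
    rng = random.Random(seed)
    covered = wide = wide_covered = 0
    for _ in range(reps):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        lo, hi = min(x1, x2), max(x1, x2)
        hit = lo < theta < hi  # theta lies between the two observations
        covered += hit
        if hi - lo > 0.5:
            # Both points are within 0.5 of theta, so a gap this
            # large forces them onto opposite sides of theta.
            wide += 1
            wide_covered += hit
    return covered / reps, wide_covered / wide

if __name__ == "__main__":
    overall, given_wide = simulate()
    print(f"overall coverage:        {overall:.3f}")
    print(f"coverage when width>0.5: {given_wide:.3f}")
```

So “50% confident” cannot be a statement about the realized interval: for the wide intervals we know for sure that theta is inside, and the narrow ones make up the difference by covering less often.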
I guess what the S1->S2 transition means is that instead of “let’s repeat experiment A N times, then for 0.95N of them the true value will be within the CI” one says “let’s go through life doing each experiment once or only a few times, then out of the accumulated N different non-repeated experiments, 0.95N will have the true value within the CI”.
I can’t distinguish this from a statement about probability. Perhaps the meat of it then is S2 -> S3, where the step is made to a statement not about the probability of correctness across our life experiences with many tests, but about a particular interval, with nothing random left. What interpretation other than Bayesian is left?
What if the interval is provably wrong, as can happen?
Items j.1 and j.2 have been a complaint of mine with co-workers pretty much since I was a postdoc. Widespread obliviousness to significant figures and to the importance of properly labeling x- and y-axes never ceases to amaze (dismay?) me. (My sense is that people working in academia or academic-oriented environments tend to be a bit better at it.) I’d like to believe that if you’ve been through grad school in the physical sciences, engineering, or the like – basically any field which requires that you report quantitative results in writing and/or create graphs on a regular basis – you would at least have the basics down, but my experience doesn’t support it. Anecdotally**, there are far more people who are highly skilled at the technical aspects of what they do than there are people who can accurately and effectively communicate the results of their work.
** Caveat: The plural of anecdote is “bullshit”.
“as a rule of thumb 95% CI’s are lucky to have 30% coverage.”
Well, if the model assumptions (including iid) don’t hold precisely, the “true parameter” is not defined, and therefore it is not well defined what “coverage” means. So one may say that 95% coverage is illusory, but quite certainly there is no other more correct “true coverage percentage”-value.