The effect is totally useless if it only manifests when you don’t know what is going on. Ranehill’s study has the superior methodology, given Cuddy’s treatment of power posing as a self-help method.

That said, I am interested in why they didn’t just replicate the method exactly.

You should do it at 1 minute, 2 minutes, 3 minutes, and 5-6 minutes if you really want to do a proper experiment.

2 minutes seems extremely arbitrary – and thus possibly cherrypicked. What if they had subjects hold poses for other lengths of time, and it didn’t work? That would suggest that the two-minute number was cherry-picked.

Going for 1 minute and 3 minutes gives you an idea of whether plus or minus 50% matters. 5-6 minutes would test whether holding it for much longer gave a larger effect, negated it, or whatever.

If I saw an effect, I’d probably go for 15 and 30 seconds as well, to see if there’s an effect for shorter time periods – whether there’s some minimum time.

Also, let’s face reality here: she’s upset because she’s trying to make money off of this, and it turns out there’s no good evidence that it is a real effect.

Thank you, Martha, for all of this (including the Yogi Berra quote). I look forward to learning more about methods of building and checking a model and ascertaining degrees of credibility.

I should also add that things change — so that a model that might in fact be a good one for new data collected “now” might not be a good model for prediction to a month from now (think, e.g., of hurricanes, the economy, cultural changes, …)

Also, I am not defending Cuddy — just trying to give some background on the general problem of prediction, as well as the standard (but confusing) use of the word.

Diana,

I gave the example of training and holdout data sets as just one example of how “prediction” is often used in statistics when there is no check with a future data set. The use is understandably misleading to those who are not familiar with it. When teaching statistics, I have tried to be careful to point out the difference between the technical and everyday use of “prediction,” as well as of other words that have both technical and everyday meanings. I’m afraid some people don’t give much thought to the possibility of confusing the two meanings, and consequently the two meanings easily become confounded in the learners’ minds.

Some background comments:

Prediction to the future is very difficult. (As Yogi Berra famously said: “It’s tough to make predictions, especially about the future.”) So we need to be aware of “degrees of credibility” of methods of trying to do so. Ideally, we would check a proposed model with future data, but that means waiting till the future. In many cases, we don’t have that luxury. (E.g., we can’t use tomorrow’s weather to give a prediction today for the day after tomorrow.) People have developed lots of methods (holdout/training sets are just one), but there are trade-offs in using one method rather than another. For example, building a model with a holdout set seems at first blush to be better than building a model without a holdout set to check it on. But if there is not much data, using a holdout set leaves too little data in the training set to build a good model. Statisticians have developed lots of methods of trying to build and check a model that might be the best one can do with the data at hand, but they all have their strengths and weaknesses.
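A minimal sketch of the training/holdout idea in Python may help. Everything here is made up for illustration – the simulated data, the split sizes, and the deliberately simple through-the-origin model:

```python
import random

random.seed(0)

# Simulated data for illustration: y = 2x + noise
data = [(x / 10, 2.0 * (x / 10) + random.gauss(0, 0.5)) for x in range(50)]

# Split into a training set and a holdout set;
# the holdout plays no role in fitting the model
random.shuffle(data)
train, holdout = data[:35], data[35:]

# "Train": least-squares slope for a line through the origin
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# "Predict" on the holdout set and measure the error
mse = sum((y - slope * x) ** 2 for x, y in holdout) / len(holdout)
print(round(slope, 2), round(mse, 2))
```

The point of the split is that `mse` is computed on data the fitted `slope` never saw, which is what lends the model some credibility for prediction in future cases – though, per the trade-off above, the 15 held-out points are 15 points not available for fitting.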

Martha,

Thank you very much for this explanation. This does not seem to apply, though, to Cuddy’s use of the word. I see little likelihood here of anything like a “training” and a “hold-out” set.

Moreover, the training and hold-out sets would have to work in the same temporal direction, wouldn’t they? If you’re trying to predict a future event B on the basis of A, wouldn’t you start with A and then analyze its possible relation to a subsequent B? You couldn’t, say, establish a relation between doctors’ facial expressions and *past* lawsuits and then use that very model to predict *future* lawsuits. There’s too much ambiguity of cause and effect.

I could be wrong about this–but it seems to me that Cuddy is conflating two studies, misinterpreting a study, or both. (To find out, I would have to track down the study in question; so far, it has eluded me.)

If I am right in thinking that she conflated two studies, then I go to the “Surgeons’ Tone of Voice” study and look at Ambady’s wording: “Controlling for content, ratings of higher dominance and lower concern/anxiety in their voice tones significantly identified surgeons with previous claims compared with those who had no claims.” (The association is between tone of voice and *previous* claims.)

Later on in the paper, she writes, “Logistic regressions were performed to examine the contribution of voice tone, beyond the content of speech, to predicting malpractice claims history.” The use of the word “history” here still suggests that she is referring to past claims, not future claims. Any prediction here is retrospective.

In the “Half a Minute” study, involving soundless clips of teachers, Ambady writes, “consensual judgments of college teachers’ molar nonverbal behavior based on very brief (under 30s) silent video clips significantly predicted global end-of-semester student evaluations of teachers.” Here the word “predict” is used to refer to the future.

So, in the one study (involving physicians), the word “predict” refers to a relation between the data and past claims; in the other (involving teachers), it refers to a relation between the data and future evaluations.

I find it unlikely that there is a study that “predicts” future malpractice suits, even in the way that you describe. It just seems too difficult to pull off, both legally and logistically. How many doctors would agree to this in the first place? “We are going to videotape you and have the videos judged, and then we’ll follow you over the next decade to see whether you get sued. Mind you, you’re just our training set.” I imagine the first lawsuit might come from one of the doctors.

Again, I could be flat-out wrong–but I suspect there is no Ambady study that predicts, on the basis of 30-second video clips, whether physicians will be sued in the future.

Diana,

In statistics, “predict” is sometimes used in a different way from the ordinary usage of “predict something that will occur in the future.” For example, in trying to develop a model that might be useful for prediction of future events, a data set may be divided into a “training” and a “hold-out” set. The training set is used to develop a model, then that model is tested on the hold-out. If the model gives good* predictions on the hold-out data, it has some credibility for prediction in future cases.

* “good” is often a fuzzy or subjective or conditional-on-context term here – e.g., weather predictions may be using the best available techniques, but still be inaccurate fairly often.

@ Steve

I don’t take everything Andrew says or does as gospel, but your criticisms of Andrew in this instance don’t hold water for me — there is too much that sounds like you are claiming you can read his mind and too much that sounds like you are trying to hold him to criteria that (from my perspective) seem idiosyncratic to you.

If it didn’t make a sound, then what brought you here? Smell?

I have not found that study. I wonder whether Cuddy may be conflating two separate studies by Ambady: one of surgeons’ tone of voice, and another of soundless video clips of college teachers. In any case, I am skeptical of her use of the word “predict.” It’s unlikely that the study found that judgments of physicians’ niceness (from soundless video clips) predict *future* lawsuit patterns. Rather, I suspect that the study in question, whichever it was, related the videos to the physicians’ *existing* history of lawsuits. (Why do I suspect this? Because the former study would be extremely difficult to pull off.)

When you misuse the word “predict” in this manner, you fuel the expectation that science will lead to magical findings. Cuddy’s statement has been quoted all over the place.

I blogged about this (briefly) here: https://dianasenechal.wordpress.com/2016/10/06/what-does-predict-mean-in-research/

Steve:

No, I never “reported the one comparison as p = .08.” I never computed any p-values at all! You are perhaps mixing up something you heard in my talk with something you read somewhere else.

Andrew:

Fair enough, but when we are talking about the difference between a Bayesian analysis and a frequentist analysis, we are talking about a very different approach to the analysis, whereas with the difference between a t-test and an F-test we are talking about the same approach. In the same way that, if an author doing a Bayesian analysis were using a pretty suspect prior when a much more defensible prior was available, one would expect a Bayesian critic to challenge the prior, here, when the authors used a t-test and a more sensitive analysis with an F-test was available, I would expect you as a critic to recommend the more sensitive analysis, with the error term based on the larger sample. If you had taken a Bayesian approach to your critique I would have had no problem, but since you took a frequentist approach I would expect you to do the more appropriate frequentist analysis. I still think not to do so was to botch the analysis. Further, to take the frequentist approach and then argue that you don’t think it matters whether the effect is statistically significant is misleading. You must have known, when you reported the one comparison as p = .08 within a frequentist approach, that many people would read that finding (rightly or wrongly) as saying that Cuddy, Norton, and Fiske did not find evidence for the claim they made and needed to update their interpretation because of that change in significance. The truth, however, is that the more appropriate frequentist analysis may well lead to the same interpretation if you are taking a frequentist/null hypothesis testing approach.
So, when Cuddy and Fiske say that the analysis error (the miscalculation you noted), when corrected, didn’t change the interpretation of their results, they may well have been just following the typical frequentist/null hypothesis testing approach, and any beef you have with their not updating their interpretation (which is a beef you certainly do seem to have) isn’t a beef with their failure to do statistics properly within their approach and their vague hypothesis, but really a beef with taking a frequentist approach. I don’t see their hypothesis as vague here at all; what I see is that they may well have been following a frequentist approach, and from that approach the correction of their miscalculation does not change the interpretation of their data. Said another way, I don’t think there is good evidence here that they are being vague and overly flexible with their analysis; instead, it appears they may well have just been following a frequentist approach and coming to the proper conclusions that that approach would suggest. Your overall claim that they aren’t adjusting their interpretation seems like quite a stretch. So, perhaps you can admit not only that the t-test you computed was not a very good way to approach the data, but also that your conclusion that they weren’t willing to adjust their interpretation based on this not-very-good reanalysis went too far and was a stretch.

Hi Eva and Anna,

Is it possible to get the data (and code) from your studies?

Steve:

Maybe next time I discuss such an example, I will add, “Not that I’m endorsing their analysis; I’m just saying that, conditional on that analysis being done, the calculations were in error.” Certainly no harm in making this clear.

Steve:

OK, now I get it. You are unhappy that, in my blog comment and in my talk, I didn’t ever point out that the t-test that you recomputed wasn’t the most appropriate analysis.

I’m happy to point that out now.

Also, you write, “one should be analyzing the data with an F-test with pooled variance.” I actually don’t think an F test is the most appropriate analysis, either. But if you do an F test in one of your papers, I won’t say you botched the analysis; I’ll just say that you did an analysis that I do not recommend. To get a sense of the sorts of analyses I prefer, you can take a look at my books on multilevel modeling and Bayesian statistics.

Andrew:

The botched analysis in a three-group design is comparing the individual means with a t-test instead of an F-test in which the denominator is the pooled variance across the three groups. As I pointed out in my first post above, Fisher ages ago wrote on just this situation and argued effectively that the F-test with pooled variance is the most sensitive analysis and does not inflate experiment-wise error. Cuddy, Norton, & Fiske should not have been comparing the individual means with a t-test, and they should have computed the t-test properly. In evaluating the analysis, Nick Brown and you, in my view, should have realized that the most appropriate way to analyze the data here was not to just recompute the t-test, but rather to recommend the analysis suggested by Fisher ages ago. In my view, it is botching the analysis to miss this pretty basic point. By recomputing the t-test without recognizing it really isn’t the best way to analyze the data, you aren’t correcting, and in fact are reiterating, the mistaken principle that a t-test is the way to do the comparisons in this situation. So, yes, in my view you botched the analysis here by not recognizing that recomputing the t-test isn’t the right thing to do. Instead, one should be analyzing the data with an F-test with pooled variance. Yes, you computed the t-test properly, but you missed (or at least didn’t point out, which is as bad in my book) that it was the wrong analysis to do. A proper analysis is not only doing the math right; it is, of course, doing the most appropriate analysis, and you never pointed out that the t-test that you recomputed wasn’t the most appropriate analysis.

Steve:

OK, now we’re getting somewhere. You write that I “botched the analysis.”

So let’s be clear: I did not “botch the analysis.” I didn’t analyze their data at all. All I did was recalculate a couple of t-statistics, and I just did that to check a recalculation that someone, I think it was Nick Brown, had already done.

If you can tell me what you think I actually botched, that would be a start. You first seemed to say that my error was in thinking that a t-test was “the most appropriate analysis,” but I never thought that (nor did I say it). Then you said that I “should acknowledge that the t-test is not the most appropriate analysis here.” I’m happy to acknowledge this, especially as I never made that claim! Then you said, “whether that comparison is significant or not would require additional analysis,” which is fine, I never claimed this one way or another. Then you criticized what you called “a clear implication” of mine, but again it was something I never actually said or wrote or thought.

So, no, I don’t see any botched analysis of mine. All I see is a calculation which you yourself said was correct. I never performed a reanalysis of these data, nor did I ever claim to.

As I said, this whole thing is kinda weird to me. Nick Brown pointed to an error in that paper, an error in which t statistics were reported as 5.03 and 11.34 but were actually something like 1.79 and 3.34. I did some calculations to confirm this (under some assumptions), and then in a talk I pointed this out. No botched analysis.

Andrew:

It is clear that you really really really don’t care about this comparison – in fact, you care so little about it that you can’t even be bothered to do the most appropriate analysis or even consider it. From where I sit it looks like Cuddy, Norton, and Fiske botched the analysis, but it looks like you botched the analysis too, and then made a big deal about them botching the analysis and not responding to your botched analysis. That might make Cuddy, Norton, and Fiske sloppy in their initial analysis, but to me it looks like you can’t even be bothered to get it right before you expect people to respond to your analysis. Yes, Cuddy and colleagues shouldn’t have botched the analysis, but neither should you have botched the analysis, and it seems a bit hypocritical to chastise other people for botching an analysis that you then botch as well, and then further criticize them for not revising their interpretation based on your botched analysis.

Steve:

I really really really don’t care if a particular comparison in the paper by Cuddy et al. is “statistically significant” or not. I’m guessing Cuddy et al. *did* care about this, but I really don’t. As I said in my talk, I see the sloppiness in their data analysis, and the fact that they don’t reassess their conclusions when people point out their errors, as related to the larger point that they have these vague flexible hypotheses and vague flexible data analysis strategies that allow them to claim success from just about any data.

You write, “Your clear implication was that because that contrast wasn’t significant they should change the way they talk about the study. You made a similar charge against Susan Fiske based on the same analysis when you were responding to her editorial in Perspectives on Psychological Science.” I’m sorry, but I never said that. I don’t think there’s anything special about statistical significance.

You’ll just have to go with what I say and what I write, not on “clear implications” that are coming from you, not me.

Hi Andrew,

I guess I am reacting to your talk at OSU, in which you criticized Cuddy and colleagues for not responding to evidence – particularly this piece of evidence that the comparison for the t-test was not significant – and noted that they still talk about their results as if the new analyses don’t change their interpretations. Your clear implication was that because that contrast wasn’t significant they should change the way they talk about the study. You made a similar charge against Susan Fiske based on the same analysis when you were responding to her editorial in Perspectives on Psychological Science. What I am trying to point out is that, contrary to your claims both in the OSU talk and in your response to Fiske, we do not know without further analyses whether the conclusions they made in the paper should change or not. You suggested they should change their conclusions. I think this suggestion is way premature until the more appropriate analyses are done. It might well work out that they are right that, when the analyses are done properly, there is no reason to change their interpretations.

Steve:

I really don’t know what you’re talking about. You write, “your statement that has been repeated many times that the one comparison is not significant,” but I never said anything about a comparison being significant. What I said was that they reported t statistics of 5.03 and 11.34, but the correct calculations give 1.79 and 3.34. I never said that I recommended this analysis, I just reported the numbers. You say I “quit making that charge and correct the record,” but there’s nothing for me to correct. This whole exchange is just weird. If you have a problem with miscalculations of t statistics, you should take it up with Cuddy, Norton, and Fiske—they’re the ones who lost control of their own data!

Yes, but Andrew, you should acknowledge that the t-test is not the most appropriate analysis here, no? And further, your statement, repeated many times, that the one comparison is not significant is potentially misleading. It may well be significant with the most appropriate analysis. At a minimum, I think you should quit making that charge and correct the record by stating that whether that comparison is significant or not would require additional analysis.

Steve:

Just to clarify, I would not say that I have ever performed a reanalysis of Cuddy’s work. All I did was recalculate a t-statistic, and that was just to check what Nick had sent me. Given all the researcher degrees of freedom in Cuddy’s paper, I think any reanalysis would have to start with the rawest of raw data and then consider all possible comparisons of interest. It would be a lot of work and, I think, not worth the effort.

As has been said many times, there is a tradeoff between effort in design and effort in analysis.

Andrew, you need to correct an error in your reanalysis of Amy Cuddy’s work. I first noticed the error in the talk you gave at Ohio State a couple of weeks ago. In that talk you noted how Cuddy and her colleagues (Michael Norton and Susan Fiske) had reported a miscalculated t-test. It is true they miscalculated the t-test (and their numbers are way off), but even though you correctly calculated the t-tests they report, those t-tests are not the most appropriate analysis. In this situation, with a three-group analysis of variance – which is the analysis they conducted – one should first demonstrate that the F-test for the overall analysis is significant, which they did do properly. To compare the means, however, one should not simply do a t-test between the means. As Fisher demonstrated about 80 years ago, the best test (unless there is some reason to suspect the variance differs in the three groups, and I can’t see any reason to suspect that in this case) is to compare the means with a follow-up F-test that uses the within-groups error term for all three groups in the denominator. As Fisher demonstrated, this procedure does not inflate experiment-wise error above alpha. Those F-tests, and not the t-tests that Cuddy and her colleagues computed and which you recalculated, are the best test of the hypothesis, and the t-test that you noted, with a value of t = 1.79 and p = .08, might well be significant with the proper testing procedure that Fisher developed so long ago. As you no doubt know, the pooled variance – because it is based on including another third of the participants in the estimate of the error term – will on average across tests be more sensitive, and this is part of the reason that Fisher recommended this procedure in this type of situation.
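A small numerical sketch of the point being argued here (with made-up data, not the Cuddy, Norton, & Fiske numbers): the pairwise t-test estimates its error variance from only the two groups being compared, while the Fisher-style follow-up pools the within-groups variance across all three, buying extra error degrees of freedom. With 1 numerator degree of freedom, the follow-up F is just the square of the pooled-error t, so either can be reported.

```python
import math

# Hypothetical data for three groups (made up purely for illustration;
# these are NOT the actual study's data)
g1 = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4]
g2 = [5.9, 6.3, 5.1, 6.8, 5.6, 6.1]
g3 = [4.6, 5.0, 4.3, 5.4, 4.8, 5.1]

def mean(xs):
    return sum(xs) / len(xs)

def ss(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def t_two_sample(a, b):
    # Ordinary two-sample t: variance pooled over the two groups only
    df = len(a) + len(b) - 2
    sp2 = (ss(a) + ss(b)) / df
    se = math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
    return (mean(b) - mean(a)) / se, df

def t_pooled_within(a, b, groups):
    # Follow-up contrast using MS_within pooled across ALL groups,
    # i.e. the error term from the overall ANOVA
    df = sum(len(g) for g in groups) - len(groups)
    ms_within = sum(ss(g) for g in groups) / df
    se = math.sqrt(ms_within * (1 / len(a) + 1 / len(b)))
    return (mean(b) - mean(a)) / se, df

t2, df2 = t_two_sample(g1, g2)
t3, df3 = t_pooled_within(g1, g2, [g1, g2, g3])
print(df2, df3)  # 10 15 -- the pooled error term has more degrees of freedom
```

Whether this procedure would actually move a p = .08 comparison across the .05 line depends, of course, on the real data.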

More pessimistically, if Cuddy is a psychological scientist and her “work” is taken as psychological science, then the domain is showing very significant fraying at its seams.

The groundwork is (whether psychology realizes it or not) being prepared for a rebirth of the ascendancy of behaviorism. Cuddy and her (many) ilk are spending great amounts of time and effort fertilizing the soil with their dung.

I disagree. I think that we most definitely need to call out bad scientists when they are doing bad science. These people often get tenured at the very top universities in the world on the merit of these bad studies. These same people may also create entire cottage industries around their initial results that we now know are bunk. So, yes, let’s take on the person as much as their science. There is a connection.

>”Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.”

Please supply a reference you trust describing how one/some/all of these supposed use-cases is actually implemented. I am certain that upon investigation it will turn out to be some method other than NHST (as I defined it here multiple times) or there is no actual evidence of success. Also, note the pitfall of using the “evidence” produced by the NHST procedure to prove the usefulness of NHST.

Cuddy and power poses also got mentioned yesterday in the student rag at my university: http://www.dailytexanonline.com/2016/02/04/challenging-cat-callers-preserves-female-agency

(Has “power pose” become a meme?)

Anon> ***The extent of the appropriate conclusion is limited to a statement about the statistical significance***

>the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions

> I don’t understand why/how you want to interpret them as success cases for NHST.

Andrew> I think there are applications for which the concept of false positive and false negative, sensitivity and specificity, ROC curves, etc., are relevant. There are actual classification problems. I just don’t think scientific theories are typically best modeled in that way.

Maybe this is really the crux of our difference, Anon.? You seem to be very theory-oriented, and you seem to be an insightful and creative thinker. Your discussion of male births in India was very interesting. I can understand your disdain at the lack of explanatory power in NHST.

My own statistical goals are very modest. I appreciate having NHST at my disposal, to tell me that something seems to be commonplace, on whatever metric. Alternatively, testing an appropriate statistical hypothesis may signal that something seems extremely unusual, relative to appropriate expectations. In that case, I understand that it’s up to me to do my own thinking from there, with “possibly why?” questions and orthogonal deductive reasoning.

Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.

You’ve definitely made your point, though, which I appreciate and acknowledge. Thank you again for the very helpful discussion :) I’ll close with the following quote:

“Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis.”

https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism

Anoneuoid:

Fisher in the 1930’s with agricultural uniformity trials.

Now, Peirce in 188? argued that randomization clearly justified inference from a random sample of a population.

More recently, Efron for gene expression studies.

Epidemiological studies are a different kettle of fish.

Keith, are you aware of anyone who has measured the distribution of p-values that result from samples taken from supposedly the same population under very controlled settings with randomization? I think that could provide a lower bound on the types of deviation from uniform we would expect.
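The fully idealized version of this is easy to simulate, for what it's worth. Here is a sketch (not real lab data): both samples are always drawn from the same normal population, and the test is a two-sample z-test with known variance, so the null distribution is exact and any deviation from uniform comes only from the random number generator:

```python
import math
import random

random.seed(42)

def two_sample_z_pvalue(a, b, sigma=1.0):
    # Two-sample z-test with KNOWN sigma, so the null distribution is exact
    se = sigma * math.sqrt(1 / len(a) + 1 / len(b))
    z = (sum(a) / len(a) - sum(b) / len(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Both samples always come from the SAME population (the null is true)
pvals = []
for _ in range(20000):
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    pvals.append(two_sample_z_pvalue(a, b))

# Under the null, p-values should be uniform on [0, 1]:
# about 5% should fall below 0.05
frac_05 = sum(p < 0.05 for p in pvals) / len(pvals)
print(round(frac_05, 3))  # should land close to 0.05
```

Any real-world measurement of the kind you describe would of course deviate from this idealization, which is presumably the point of wanting the empirical lower bound.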

Thanks for posting the example. Let’s start with this: “However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice.”

1) First, I went and got the paper[1] and found the authors did not perform this analysis, so this is not an example of something found in practice. Second, I did not claim anything wasn’t “seen in practice”, rather that “scenario does not correspond to any actual use case I can think of”. Indeed, the textbook acknowledges there is no actual use for this test: “That’s clearly significant, but don’t jump to other conclusions.”

Exactly! ***The extent of the appropriate conclusion is limited to a statement about the statistical significance.*** When properly interpreted, the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions; the only “use” is to proceed to commit any of a number of logical fallacies.

2) I also found that some of the information in this question appears to be incorrect. Specifically, the number of 550 births in 1993 appears made up and the baseline was from the same hospital ~10 years earlier, not “for the region”. So in terms of real life applicability, this analysis is also an example of GIGO.

3) Ignoring the above, they still do commit the affirming the consequent error we have been discussing. The problem is that reported/registered live births does not necessarily correspond to actual live births. So the actual sex ratio of births at this hospital could be exactly the same as the baseline, but male births are reported more often for some reason.

As described in the question, there is a preference for male births in India. Do the hospitals have any incentive to report a greater percentage of male births? Perhaps more patients will choose to go there out of superstition because they hear male births are more common at that hospital, so the high sex ratio amounts to advertising. Or perhaps there is a larger number of male vs female still-births, and still-births reflect badly on the hospital. Then these still-births may get inaccurately recorded as live births. Maybe there are more unwanted female babies that get abandoned at the hospital which would lead to undesirable paperwork for the staff, so instead they just drop them off at the orphanage or find a foster home off the books. Etc, etc. The number of plausible reasons for the mere presence of a deviation from the null hypothesis is essentially limitless.

4) With regards to: “In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.” The news coverage is based on the magnitude of the change, not mere statistical significance. Also, these are two areas where nothing has been figured out despite many years of research. Instead, wild speculation, fraud, and conspiracy theories have proliferated. So I don’t understand why/how you want to interpret them as success cases for NHST.

> some high-profile epidemiological research

If it is an epidemiological study, there is confounding and bias (or at least these can’t be ruled out), and this implies the distribution of p-values when the null is true is undefined, which makes this claim nonsense: “observed … as different as … would occur at random only about x times in 1000”.

Surprising how many, even those doing teaching and research, appear to be unaware of this; see http://www.stat.columbia.edu/~gelman/research/published/GelmanORourkeBiostatistics.pdf

Now, the book you are quoting does seem to be at least (implicitly) pointing that problem out.

Thank you A.! I sincerely appreciate you breaking this down for me :) Thanks also for clarifying your previous posts and pointing out that I’m thinking in hybrid NHST terms.

I didn’t want to get further into the textbook before getting all this conceptual groundwork sorted out. Thank you for your patience with me on that. Here is a worked example from Bock et al, 4e (big thanks to ABBYY OCR! ;). It appears shortly after the above-quoted statements about the null hypothesis:

(begin quote)

Step-by-step example: Testing a hypothesis

Advances in medical care such as prenatal ultrasound examination now make it possible to determine a child’s sex early in a pregnancy. There is a fear that in some cultures some parents may use this technology to select the sex of their children. A study from Punjab, India (E. E. Booth, M. Verma, and R. S. Beri, “Fetal Sex Determination in Infants in Punjab, India: Correlations and Implications,” BMJ 309 [12 November 1994]: 1259-1261), reports that, in 1993, in one hospital, 56.9% of the 550 live births that year were boys. It’s a medical fact that male babies are slightly more common than female babies. The study’s authors report a baseline for this region of 51.7% male live births.

Question: Is there evidence that the proportion of male births is different for this hospital?

Hypotheses:

The null hypothesis makes the claim of no difference from the baseline.

The parameter of interest, p, is the proportion of male births:

Hsub0 : p = 0.517

HsubA : p ≠ 0.517 The alternative hypothesis is two-sided.

Model : Think about the assumptions and check the appropriate conditions. (content skipped for brevity)

Mechanics : (content skipped for brevity)

Conclusion : The P-value of 0.0146 says that if the true proportion of male babies were still at 51.7%, then an observed proportion as different as 56.9% male babies would occur at random only about 15 times in 1000. With a P-value this small, I reject Hsub0. This is strong evidence that the proportion of boys is not equal to the baseline for the region. It appears that the proportion of boys may be larger.

State your conclusion in context : That’s clearly significant, but don’t jump to other conclusions. We can’t be sure how this deviation came about. For instance, we don’t know whether this hospital is typical, or whether the time period studied was selected at random. And we certainly can’t conclude that ultrasound played any role.

(end quote)
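The Mechanics step skipped above can be reconstructed from the numbers the text does quote. Here is a minimal sketch of the standard one-proportion z-test, using only the stated values (observed 56.9%, baseline 51.7%, n = 550); the function name is my own, and the last digit of the P-value differs slightly from the book’s 0.0146 because the book presumably worked from the raw count of boys rather than the rounded percentage:

```python
from math import sqrt, erfc

def two_sided_prop_test(p_hat, p0, n):
    """Two-sided P-value for a one-proportion z-test of H0: p = p0."""
    se = sqrt(p0 * (1 - p0) / n)   # standard error computed under the null
    z = (p_hat - p0) / se          # standardized test statistic
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Values quoted in the Bock et al. example: 56.9% boys among 550 births,
# regional baseline 51.7% male live births
p_value = two_sided_prop_test(0.569, 0.517, 550)
print(p_value)  # a small P-value, matching the ~0.0146 reported in the text
```

Note that the P-value is computed entirely under the null model; nothing in the mechanics refers to the alternative, which is part of what the later Fisher-vs-hybrid discussion in this thread turns on.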

Applying your H/P/Q logic, we would seem to have:

H : The proportion of male live births may be significantly atypical at one particular hospital

P : H is true

Q : The appropriate significance-test P-value will be very low, indicating that the proportion for this hospital is unlikely to reflect the population value for all live births in the region.

So the authors do infer from Q “strong evidence” that H is true. However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice. They also explicitly caution the student against inferring a broader “causal” H from Q.

My vague memory is that at least some high-profile epidemiological research has concluded in similar fashion : “What causal H could be responsible for Q? Here are some ideas : H1,H2,H3,etc”. In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.

]]>I actually didn’t listen to more than a few minutes because every time she said “I’m a scientist, so…” I felt distressed.

]]>+1

]]>Thanks for reminding me of this quote of Gerd’s

“Who is to blame for the null ritual?

Always someone else. A smart graduate student told me that he did not want problems with his thesis advisor. When he finally got his Ph.D. and a post-doc, his concern was to get a real job. Soon he was an assistant professor at a respected university, but he still felt he could not afford statistical thinking because he needed to publish quickly to get tenure. The editors required the ritual, he apologized, but after tenure, everything would be different and he would be a free man. Years later, he found himself tenured, but still in the same environment. And he had been asked to teach a statistics course, featuring the null ritual. He did. As long as the editors of the major journals punish statistical thinking, he concluded, nothing will change.”

>”With respect, it sounds like you may be moving the goalposts here.”

Sorry for the miscommunication. I originally asked for ‘an example of some form of “significance/hypothesis test”’, I later again asked: “Can you give some examples of these[example problems and what conclusions are drawn or asked of the student]”? These are meant to ask for the same thing, ie an actual example of the testing procedure being applied. Since it is not (easily) possible for me to check this book myself, I asked for additional info so that my initial request could be met. There is no moving the goalposts.

>”Bock et al’s logic directly follows Fisher’s…If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.”

1) The first issue with what you have written is only tangential to my point and I do not wish to focus further on it (since it has been a major distraction from the main problem), but the alternative hypothesis was not a part of Fisher’s logic. You appear to be working with a hybrid of Fisher and Neyman-Pearson. You can search “NHST hybrid” for much discussion on this, but for an introduction to this phenomenon see: Mindless statistics. Gerd Gigerenzer. The Journal of Socio-Economics 33 (2004) 587–606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf

Also, here is Fisher on the alternative hypothesis and type II error:

“It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly from errors in which it is “accepted wrongly” as the phrase does. The frequency of the first class, relative to the frequency with which the hypothesis is true, is calculable, and therefore controllable simply from the specification of the null hypothesis. The frequency of the second kind must depend not only on the frequency with which rival hypotheses are in fact true, but also greatly on how closely they resemble the null hypothesis. Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis, and would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures.”

Ronald Fisher. Journal of the Royal Statistical Society. Series B (Methodological). Vol. 17, No. 1 (1955), pp. 69-78. http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf

2) I linked above to the Meehl (1990) paper where this is explained in (very great) detail, I also gave a real life example of it in action with the NEJM paper. The problem arises from the many-to-one mapping of research hypotheses to a vague statistical alternative hypothesis, that is why I said this issue is “not actually a problem in the realm of statistics”.

Your example does not include any research hypothesis, only two statistical hypotheses. This scenario does not correspond to any actual use case I can think of. In practice it usually goes like this:

-A) If P [the research hypothesis H is true], then Q [the parameter (difference between means) will be greater than zero].

-B) Q [the null hypothesis that the parameter is equal to zero is unlikely, and the parameter was measured to be positive; therefore the parameter is likely to be greater than zero].

-C) Therefore, P [the research hypothesis H is true].

We can even forget that we have uncertainty about the value of the parameter (ie Omniscient Jones told us the value) and change step B to “Q [the parameter is greater than zero].” As we see, considering this limiting case where our uncertainty approaches zero does not fix the problem.

Note that if you observe (where ~ = “not”) ~Q [the parameter is not greater than zero], it is valid to deduce ~P [the research hypothesis H is not true] in this simplified description. In reality though, P is never so simple. The theory is never tested alone; there are also auxiliary considerations A (eg no malfunctioning equipment, etc). So in practice ~P = (~H and/or ~A). In other words, the data or some other assumption can be wrong instead of, or in addition to, the research hypothesis.

]]>>But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

With respect, it sounds like you may be moving the goalposts here. I provided a specific example, which you doubted I could provide. All three of my text’s authors are award-winning educators. It seems a bit presumptuous to suggest (particularly from safe anonymity) that they don’t actually understand what they’re teaching and are in fact misteaching statistics.

>Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”…

I’m not sure what your criticism is? The “claim” is a well-defined algebraic statement involving a parameter value, as you insisted earlier.

>..ie the student is taught to make the basic logical error of affirming the consequent.

There is a sidebar on the page I referenced above, right next to the paragraph I quoted. In large font is printed Fisher’s classic quote, ending with:

“Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

As far as I can tell, Bock et al’s logic directly follows Fisher’s. Now, are you claiming that this type of classic NHST inference is logically deficient? I’m having trouble finding authoritative discussion on the web in support of your thesis. Let’s please work out exactly what you mean. What do you think of the next paragraph, which I have formulated on my own following the reference below? I believe it summarizes NHST inference in the form “If P, then Q”:

If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.

Per reference below, “Affirming the consequent” follows the form:

If P, then Q.

Q.

Therefore, P.

Would you please demonstrate via simple example how an NHST deduction, following the lines of my conditional above, might fall prey to the fallacy?

Reference: https://en.wikipedia.org/wiki/Affirming_the_consequent
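(As a side note on the logical form itself, not on any particular NHST deduction: the invalidity of “affirming the consequent” can be checked mechanically. A small sketch of my own, not from any of the texts discussed, that enumerates the truth assignments and finds the one where both premises hold but the conclusion fails:)

```python
from itertools import product

def implies(p, q):
    """Material conditional: 'if p then q' is false only when p is true and q is false."""
    return (not p) or q

# Search for assignments where the premises (P -> Q) and Q are both true
# while the conclusion P is false -- i.e., counterexamples to the form.
counterexamples = [(p, q) for p, q in product([True, False], repeat=2)
                   if implies(p, q) and q and not p]
print(counterexamples)  # [(False, True)]
```

The single counterexample (P false, Q true) shows the bare form is invalid; whether a given NHST inference actually instantiates that form is the substantive question being argued in this thread.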

]]>I realized something really disturbing here. This stats teacher has learned that high school students find it “tempting” to test their actual research hypothesis. This is consistent with my personal experience (although after HS). I still remember being in their shoes and also finding it “tempting” to test my actual hypothesis rather than some other default hypothesis.

As early as high school, students are thinking scientifically but are being specifically told to stop doing so during class. This must be happening all across western civilization at earlier and earlier ages. I really didn’t believe my estimate of the degree of damage being caused by NHST could get any worse, but I had never considered that the age at which people start learning this stuff is also dropping.

]]>Not sure if these are official, but check out ppt 21 slide 5:

-There is a temptation to state your claim as the null hypothesis.

–However, you cannot prove a null hypothesis true.

-So, it makes more sense to use what you want to show as the alternative.

–This way, when you reject the null, you are left with what you want to show.

https://mhsapstats.wikispaces.com/BVD+Powerpoints+and+Chapter+Notes

Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”, ie the student is taught to make the basic logical error of affirming the consequent.

]]>I was unable to get access to the book. But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

]]>>..can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis

Bock et al, “Stats : Modeling the World” 4e (2015) is an excellent AP-level HS text. From Chapter 19, “Testing Hypotheses About Proportions”, p. 497:

“In statistical hypothesis testing, hypotheses are almost always about model parameters.. The null hypothesis specifies a particular parameter value to use in [the] model.. We write (Hsub0) : parameter = hypothesized value. (HsubA) contains the values of the parameter we consider plausible when we reject the null.”

]]>>”Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)”

The typical end user of statistics has no conception of a distribution and uses SEM error bars rather than SD because they are narrower. They will have to take the time to educate themselves; there is nothing we can do besides ignore their analyses.
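(For readers unfamiliar with the SEM-vs-SD point: the standard error of the mean is just the standard deviation divided by √n, so it shrinks as the sample grows while the SD does not, which is why SEM bars always look tighter. A minimal sketch with made-up data:)

```python
import math
import statistics

# Hypothetical measurements, for illustration only
data = [4.1, 5.0, 4.7, 5.3, 4.9, 5.2, 4.6, 5.1]

sd = statistics.stdev(data)      # sample SD: spread of individual observations
sem = sd / math.sqrt(len(data))  # standard error of the mean: sd / sqrt(n)

print(sd, sem)  # SEM is smaller by a factor of sqrt(8), about 2.83
```

The SD describes variability of the observations themselves; the SEM describes uncertainty in the estimated mean. Plotting the latter while implying the former is exactly the kind of misuse the comment above is complaining about.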

Still, I doubt you can find a recent statistics textbook that gives an example of some form of “significance/hypothesis test” without making the error I described above. I have seen Student make it, and Neyman make it, but interestingly I have never seen Fisher fall prey to it.

]]>Thank you for the interesting link.

>..So I don’t think your idea will work since this is not actually a problem in the realm of statistics.

Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)

]]>That wasn’t an exact replication. They mention in the link you posted:

The poses and procedure for collecting the saliva samples were identical to the original study. However, a facial emotion task was included in this study. Although the task did not change the amount of time between the pre- and post-saliva tests, it did increase the amount of time a pose was maintained from 2-3 minutes to 10-12 minutes.

]]>I blame Andrew for always drilling “God is in every leaf of every tree” into us helpless readers.

]]>>”We might imagine a better system, modeled on the construction industry, in which research papers would require sign-off by a licensed statistician.”

The problem with the usual approach is that there is a many-to-one mapping of research hypotheses to the statistical “alternative hypothesis”, so rejection of the null hypothesis can only lead to affirming-the-consequent errors.

This can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis (it is obvious to anyone not trained in NHST that it should always have been this way). Then, when the research hypothesis/theory has survived a strong test (due to the precision and accuracy of the prediction), we tend to believe it has something going for it. So I don’t think your idea will work since this is not actually a problem in the realm of statistics. See eg fig 2 here:

Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It. Paul E. Meehl. Psychological Inquiry. 1990, Vol. 1, No. 2, 108-141. http://www.psych.umn.edu/people/meehlp/WebNEW/PUBLICATIONS/147AppraisingAmending.pdf

]]>At any rate…

1. Yes, that is Amy Cuddy’s husband, and

2. He (wisely) deleted the thread.

]]>