Our story begins with this article by Sanjay Kaul and George Diamond:

The randomized controlled clinical trial is the gold standard scientific method for the evaluation of diagnostic and treatment interventions. Such trials are cited frequently as the authoritative foundation for evidence-based management policies. Nevertheless, they have a number of limitations that challenge the interpretation of the results. The strength of evidence is often judged by conventional tests that rely heavily on statistical significance. Less attention has been paid to the clinical significance or the practical importance of the treatment effects. One should be cautious that extremely large studies might be more likely to find a formally statistically significant difference for a trivial effect that is not really meaningfully different from the null. Trials often employ composite end points that, although they enable assessment of nonfatal events and improve trial efficiency and statistical precision, entail a number of shortcomings that can potentially undermine the scientific validity of the conclusions drawn from these trials. Finally, clinical trials often employ extensive subgroup analysis. However, lack of attention to proper methods can lead to chance findings that might misinform research and result in suboptimal practice. Accordingly, this review highlights these limitations using numerous examples of published clinical trials and describes ways to overcome these limitations, thereby improving the interpretability of research findings.

This reasonable article reminds me of a number of things that come up repeatedly on this blog and in my work, including the distinction between statistical and practical significance, the importance of interactions, and how much I hate acronyms.

They also recommend composite end points (see page 418 of the above-linked article), which is a point that Jennifer and I emphasize in chapter 4 of our book and which comes up *all the time, over and over* in my applied research and consulting. If I had to come up with one statistical tip that would be most useful to you–that is, good advice that’s easy to apply and which you might not already know–it would be to use transformations. Log, square-root, etc.–yes, all that, but more! I’m talking about transforming a continuous variable into several discrete variables (to model nonlinear patterns such as voting by age) and combining several discrete variables to make something continuous (those “total scores” that we all love). And *not* doing dumb transformations such as the use of a threshold to break up a perfectly useful continuous variable into something binary. I don’t care if the threshold is “clinically relevant” or whatever–just don’t do it. If you gotta discretize, for Christ’s sake break the variable into 3 categories.
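To make the two kinds of transformation concrete, here is a minimal Python sketch. The function names, the age cut points, and the survey data are all made up for illustration; nothing here comes from the book or the article.

```python
# Hypothetical illustration: names, cut points, and data are made up.

def age_indicators(age, cuts=(30, 45, 65)):
    """Turn a continuous age into one 0/1 indicator per age bracket."""
    edges = (float("-inf"),) + tuple(cuts) + (float("inf"),)
    # Bracket i covers edges[i] <= age < edges[i + 1].
    return [int(lo <= age < hi) for lo, hi in zip(edges, edges[1:])]

def total_score(items):
    """Combine several discrete item responses into one roughly continuous score."""
    return sum(items)

ages = [22, 30, 51, 70]
indicators = [age_indicators(a) for a in ages]  # one row of indicators per person

survey_items = [3, 4, 2, 5, 1]  # e.g. five questions, each scored 1-5
score = total_score(survey_items)

print(indicators, score)
```

The indicators can enter a regression as separate predictors, which lets the fitted pattern bend across age brackets, while the total score replaces five noisy discrete inputs with a single number.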

This all seems quite obvious but people don’t know about it. What gives? I have a theory, which goes like this. People are trained to run regressions “out of the box,” not touching their data at all. Why? For two reasons:

1. Touching your data before analysis seems like cheating. If you do your analysis blind (perhaps not even changing your variable names or converting them from ALL CAPS), then you can’t cheat.

2. In classical (non-Bayesian) statistics, linear transformations on the predictors have no effect on inferences for linear regression or generalized linear models. When you’re learning applied statistics from a classical perspective, transformations tend to get downplayed, and they are considered as little more than tricks to approximate a normal error term (and the error term, as we discuss in our book, is generally the least important part of a model).
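The invariance in point 2 can be checked numerically. The sketch below is only an illustration for the simplest case, a least-squares regression with one predictor; the data are made up, and `simple_ols` is a hypothetical helper written for this example.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (slope, standard error, t-statistic)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return b, se, b / se

# Made-up data: height in inches and some outcome.
inches = [60, 62, 65, 67, 70, 72, 74]
outcome = [3.1, 3.0, 3.9, 4.2, 4.1, 5.0, 5.3]

# Linear transformation of the predictor: inches -> centimeters, then shift.
cm = [2.54 * v - 170 for v in inches]

b_in, se_in, t_in = simple_ols(inches, outcome)
b_cm, se_cm, t_cm = simple_ols(cm, outcome)

print(b_in / b_cm)  # the slope is rescaled by the factor 2.54
print(t_in, t_cm)   # the t-statistic is the same, up to rounding
```

The slope and its standard error change units together, so the classical test of the slope cannot tell the two versions apart. With an informative prior on the coefficient, the scale of the predictor would matter.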

Once you take a Bayesian approach, however, and think of your coefficients as not being mathematical abstractions but actually having some meaning, you move naturally into model building and transformations.

P.S. On page 426, Kaul and Diamond recommend that, in subgroup analysis, researchers “perform adjustments for multiple comparisons.” I’m ok with that, as long as they include multilevel modeling as such an adjustment. (See here for our discussion of that point.)

P.P.S. Also don’t forget economist James Heckman’s argument, from a completely different direction, as to why randomized experiments should not be considered gold standard. I don’t know if I agree with Heckman’s sentiments (my full thoughts are here), but they’re definitely worth thinking about.

I think that a lot of people work with data as they get them because once you start interacting variables, it is like opening a Pandora's Box, and even if you restricted yourself to sensible prescriptions, you'd be a while before you ran out of things to check. Not to say that this should stop you from doing it…

Also, an aside on transformations in regressions, particularly GLMs – there are some nice results on consistency (up to scalar multiples) of regression coefficients under link violation (Li and Duan, Annals of Statistics, 1989) when the predictors are elliptically symmetrical and transformations to achieve elliptical symmetry of the predictor space (Cook and Nachtsheim, JASA, 1994). But then you are only talking about linear transforms I guess.

I don't have anything to add to your post, but after skimming the paper linked in the first PS (I didn't exactly understand everything, but I got the gist of it) in relation to multi-level modeling I have one question I was hoping you could explain (I'm only a sophomore so bear with my ignorance). I was helping a friend play around with some natural experiment data. We tried multi-level modeling and we thought we could pick up causality by abusing the fact that we can talk about the prediction effects outside the group level. Yet, when we tried that we got bizarre conclusions that didn't really make sense in context. I was wondering if you could explain why that may not work (or maybe it was just the data set for some reason, I don't really know much about the set).

Yeah, my question had pretty much nothing to do with your post, but I was curious since you brought up multi-level modeling.

Hello Prof. Gelman,

Are you saying "model building" will naturally lead to applying fruitful transformations that will lead to statistics that do more than only prove "a formally statistically significant difference for a trivial effect"?

(By "model building" you mean the scientist taking responsibility for an abstraction that goes beyond statistics, i.e. causality and value judgments about what is more than a trivial effect.)

I am having trouble translating your description into something I can understand, so I would appreciate your help if I made a hash of things with my little summary.

Nice point – "think of your coefficients [parameters] as not being mathematical abstractions but actually having some meaning."

I noticed, when sitting in on some of David Dunson's lectures, that he distinctively does.

Not so sure it's just a function of taking a Bayesian approach though.

K

p.s. I'll have to remember to read your 3 groups paper, spent a couple afternoons when I was a grad student hopelessly trying to prove that for non-linear but monotonic effects – throwing away the middle group was optimal in some sense. Definitely my pre-Bayes days though.

Gee, and I thought we were going to discuss literature…

But if we're trading tips, here are a couple that have served me well:

1. If you're using any kind of demographic data, include population density. It's reliable and easy to score on a zip+4. I usually split it into urban (above median income), urban (below), suburban and rural.

2. If you have sufficient data, run a CHAID or a CART (w. bagging) on the raw data before building your traditional model. This is an excellent variable selection tool to get you started, and if a variable or set of variables keeps showing up in your perturbed trees but keeps dropping out of your model, you know you need to try transformations and/or interaction terms on them.

Gary Simon, Mitzi's professor for an applied regression class at NYU's Stern Business School, is preaching the same gospel.

Mitzi's loving the class because it's organized around practical model building and model checking. The homework is to produce written reports on data sets with plots and analysis, as if written for a consulting client.

You'd love his homework commentary. For each assignment, in addition to providing detailed comments on the homework (must take forever), he has a page of checkboxes for common mistakes, one of which is failure to transform data appropriately.

The checkboxes are informative because they're telling you where analyses tend to go wrong.

Re statistical vs practical significance: Some years ago Al Blumstein suggested that we call the former statistical discernibility, which is more on target than the ambiguous term "significance."

Just so I'm clear: composite end points are making your outcome variable a function of several variables, aren't they? So if you're studying plant infestation rather than studying the number of aphids, you'd study aphids + whitefly. Or some other defined variable.

I've got to say, my instinct is that this makes absolutely no sense. What's the advantage with setting up this equation:

aphids + whitefly = a + b1 leaf_greeness + b2 sunlight + error

rather than working with these:

aphids = a + b1 leaf_greeness + b2 sunlight + error

whitefly = a + b1 leaf_greeness + b2 sunlight + error

and then doing the maths to combine the formulas and considering how the error terms correlate? It just strikes me that – even if a composite outcome is a good idea in itself – the best way of gaining understanding is to fit individual models, combine them, and try to figure out how they interact and what drives the combined error term, rather than assuming they have the same structure from the start. That is, unless you've some very strong theory backing up your model.
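For the special case where both outcomes are regressed on the same predictors by ordinary least squares, the maths of combining the two fits is short. A sketch, with notation introduced here for illustration: X is the shared design matrix (intercept, leaf_greeness, sunlight), y_A and y_W are the aphid and whitefly counts, and e_A and e_W are the two error terms.

```latex
\hat\beta_{A+W} = (X^\top X)^{-1} X^\top (y_A + y_W) = \hat\beta_A + \hat\beta_W

\operatorname{Var}(e_A + e_W) = \operatorname{Var}(e_A) + \operatorname{Var}(e_W) + 2\,\operatorname{Cov}(e_A, e_W)
```

So in this linear, same-predictors case the composite fit gives exactly the sum of the two separate sets of coefficients. What the separate fits add is the breakdown of the combined error into its two variances and their covariance. The identity does not carry over to nonlinear models or to equations with different predictors.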

"And not doing dumb transformations such as the use of a threshold to break up a perfectly useful continuous variable into something binary. I don't care if the threshold is "clinically relevant" or whatever–just don't do it."

I've never really been able to get my head around this. I understand that in a theoretical sense it's very bad, as you lose information if instead of x you use z = 1 if x > threshold, and z = 0 otherwise, because you can go from x to z but not backwards. But practically the way you'd set it up in a regression is:

(1a) y = a + b1 x + e; for a slope.

(2) y = a + b1 x + b2 z + e; if you want to expand to include a step.

(3) y = a + b1 x + b2 z + b3 xz + e; carrying on expanding to include varying slopes, and so on.

Now, put like that I can't see why you'd prefer (1a) over:

(1b) y = a + b2 z + e

They're both subsets of (2) after all. And looked at that way, leaving one out is as bad as leaving out the other. I can't see why x is more important than z. They're both potential misspecifications, and there must even be some sort of omitted variable bias result you can derive to tell you which is worse in which circumstances. I know theoretically it's throwing away information and sinful, but practically I don't see it.
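One way to see what is at stake between (1a) and (1b) is a small simulation. The sketch below assumes the true relation is linear in x, which is an assumption and not something the comparison above takes for granted. The sample size, coefficients, noise level, and threshold are all made up.

```python
import random

def r_squared(pred, y):
    """R^2 from a least-squares fit of y on a single predictor plus intercept."""
    n = len(y)
    mp, my = sum(pred) / n, sum(y) / n
    spp = sum((p - mp) ** 2 for p in pred)
    spy = sum((p - mp) * (v - my) for p, v in zip(pred, y))
    syy = sum((v - my) ** 2 for v in y)
    return spy ** 2 / (spp * syy)

random.seed(1)
n = 2000
x = [random.random() for _ in range(n)]                   # continuous predictor on [0, 1]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]   # assumed linear truth
z = [1 if xi > 0.5 else 0 for xi in x]                    # thresholded version of x

r2_x = r_squared(x, y)  # model (1a): y = a + b1 x + e
r2_z = r_squared(z, y)  # model (1b): y = a + b2 z + e
print(round(r2_x, 3), round(r2_z, 3))
```

Under these assumptions the thresholded predictor explains about three quarters of the variance that x does, because the squared correlation between a uniform x and its median split is 0.75. If the truth were a pure step at 0.5, the ranking would reverse, so the answer to "which is worse" depends on which shape is closer to the process being modelled.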

"They also recommend composite end points (see page 418 of the above-linked article), which is a point that Jennifer and I emphasize"

Sorry for not having read your book, but could you please elaborate on your composites recommendation? The vast majority of the biomedical literature I am familiar with uses composites of non-independent parameters. In most cases, that's the only way they can get the "statistical significance". But isn't this practice, basically, a form of cheating? Like in the paper you link to: death and stroke are sure as hell correlated, so isn't it the case that adding them up has the effect of inflating N?

alex: I think composite endpoint refers to a stopping criterion in a randomized trial. Trials have to last a finite, not too long, amount of time. Sometimes the amount of time is picked in advance (like 5 year survival rates). An alternative is to track people until something happens to them…

The idea with composite endpoints is to make the event that triggers "end of study" for that subject be something composite, such as "either the person dies, or they have two serious events leading to a hospitalization, or they have 4 less serious events needing outpatient doctor care, or 5 years pass."

Andrew, please correct me if I'm wrong. I think many people are not familiar with the term "composite endpoint," myself included.

Ziggie: The short answer is that all sorts of transformations matter more once you have informative prior distributions.

Ted: I don't know, but there are certainly lots of ways to get wrong answers, with or without multilevel modeling.

Manuelg: You can look at some of the examples in my book with Jennifer to get a sense of what I mean by model building. It's not so much about value judgments as about trying to model the underlying process.

Keith: See my 1996 paper with Bois and Jiang for a discussion of the connection between Bayesian inference and scientifically-based modeling.

Mark: I'm glad to know that someone reads my literature posts. Don't worry, there will be more.

Bob: That sounds good to me. When I've tried to structure my courses this way, the trouble is that it seems like the students never learn the basic techniques. I'd love to have a way of doing it all. And it's really too bad the university took away their statistics department. Really silly, considering that running a statistics department in NYC is pretty much a license to print money.

Mike: That would be fine with me but I think the coordination problem is too difficult to overcome.

Alex (1): You might be right. I don't know anything about aphids. In many examples I've worked on, it can be helpful to start with the combined score, see what things look like, then break up the analysis from there.

Alex (2): In the problems I've studied, the linear model seems much closer to reasonable than the step model. It's not about whether something is "sinful," it's about how best to use the information available. My impression is that people often take continuous variables and make them binary for no good reason at all. As I said, if you're going to discretize, I'd at least recommend 3 categories instead of 2. Of course if you really expect a step, that's another story, but usually it's not that, usually it's just a numerical measure that's cut off at some value.

DK: Our book is a mere $40. Truly good value, I'd say.

Dan: I'm not sure what the original article was referring to, but I was talking more generally about combining or breaking up input variables rather than simply performing an analysis of the variables straight out of the box.

Andrew, my experience with continuous -> binary variable conversion is similar to yours. When I was working in forensic engineering it was extremely popular to make up data collection forms in which check-boxes would be used to determine if an observed variable had met a code requirement or not. For example, number of nails per linear foot of perimeter along a wooden shear wall.

The difference between having 11 nails in 3 feet and 5 nails in 3 feet is extremely significant to the performance of a shear wall, but if 12 was the required number then a building with 11 would classify the same as a building with 5 (I'm making up these numbers on the spot for example purposes only).

I generally tried to encourage forms where this sort of thing was recorded in raw form. Better to analyze the number of nails per foot than the probability of meeting or exceeding a code requirement.
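In code, the difference between the two kinds of form looks something like the sketch below. It uses the same made-up numbers as above, and the function names are hypothetical.

```python
REQUIRED_NAILS = 12  # made-up requirement per 3 feet, matching the made-up numbers above

def meets_code(nails):
    """Check-box version: True if the count meets the requirement."""
    return nails >= REQUIRED_NAILS

def shortfall(nails):
    """Raw-count version: how many nails short of the requirement."""
    return max(0, REQUIRED_NAILS - nails)

for nails in (11, 5):
    print(nails, meets_code(nails), shortfall(nails))
# Both walls fail the check box, but the shortfalls are 1 and 7 nails.
```

The check box can always be recomputed from the raw count, but not the other way around, which is the argument for recording the count on the form.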

In that field plaintiff investigators were in the business of finding where things failed to meet some technical requirement, and defense investigators were in the business of backing-out the actual estimate of the effect on strength, reliability, etc. I did a calculation on the use of "box" vs "common" nails in shear walls (box nails are thinner shank and go in easier but are not as strong as "common").

I think I found that the best prediction of strength reduction from consistent use of the wrong nails was around 15% reduced strength. A similar result could occur with common nails if fewer of them were used than design requirements or if some of them missed the studs. There might be reason to believe that the bigger nails would "miss" more often (split the wood etc).

However, if a crew used box nails throughout, it was a clear 100% violation of the binary variable, very favorable for the plaintiff.

I would say that it's relatively rare for systems of interest to display fast transitions, and binary type behavior. Keeping things continuous and applying a data transformation if necessary would be preferable almost always.