Skip to content

What to do when your measured outcome doesn’t quite line up with what you’re interested in?

Matthew Poes writes:

I’m writing a research memo discussing the importance of precisely aligning the outcome measures to the intervention activities. I’m making the point that an evaluation of the outcomes for a given intervention may net null results for many reasons, one of which could simply be that you are looking in the wrong place. That the outcome you chose isn’t what the intervention actually influences. I’m thinking about this in a nuanced way (so not like you set sail for Hawaii and hit Australia, but more like you ended up off the coast of Nihue). An intervention may seek to improve parent child interaction, which is a broad domain with many potential sub-domains, but in looking at the specific activities, it may be that it only seeks to improve a particular type of parent child interaction. General measures may not pick this up, or they may, but may not be particularly sensitive to change in that kind of parent child interaction. So while there are many potential failed assumptions that lead to a null finding, certainly one could be a misaligned outcome measure.

My question is, is this a statistical concept, is there a statistical property to the specificity or alignment of the outcome to the intervention activities, or is this purely a conceptual issue with how experiments work. Related to that question, is it fair to say that poorly aligned outcome measures may also attenuate effect sizes even when positive effects are found? An example might be that your measure gets at 4 domains of parent child interaction: affection, responsiveness, encouragement, and teaching. Our interest may be to impact how parents interact with their child to improve a child’s social emotional development. Maybe we know that affection and encouragement are the major predictors of S.E. from this tool, but it’s not a direct measure. So this measure isn’t a perfect match to what we are trying to impact, with past research primarily showing that maternal sensitivity is what matters (but we aren’t measuring that in this fictitious study, its indirectly coming from related concepts). My thought is that in such a scenario, while we have superficially measured our outcome of interest (parent child interaction) we failed to drill down to what actually is taking place (and matters), maternal sensitivity. Instead we have a few items from a more general scale that may get at maternal sensitivity. As such, we may have attenuated effects because we aren’t measuring all the things that are improving from the intervention, as well as because only having a small number of items may not make for a very reliable or valid scale for our actual outcome of interest.

I don’t have a detailed response here, but I wanted to post this because it relates to our recent post on estimating interactions. Some people might have taken from that post the message that it’s too hard to estimate interactions so we shouldn’t bother. That’s not what I think, though. What I think is that we do need to estimate interactions in observational settings, because unmodeled interactions can pollute the estimates of effects that we care about. See this paper from 2003 for an example.

What, then, to do? I think we should be including lots of interactions, estimating them using strong priors that pool toward zero: this will give more reasonable estimates and will allow the included interactions to improve our estimates of main effects without adding noise.

The point of the recent post on estimated interactions having large uncertainties is that, when uncertainties are large and underlying parameters are small, that’s when Bayesian inference with informative priors can really help.

I asked Beth Tipton for her thoughts, and she wrote:

First, this is a “construct validity” issue, in the language of Shadish, Cook, and Campbell (2002). One method proposed to address this is the multi-trait mulimethod matrix.

Second, this strikes me more generally as a measurement issue and one that area has probably addressed to some extent. I know from my experience working in RCTs that this is something psychologists are generally worried about and try to address.

Poes expressed further concerns:

That is the direction we initially approached this, just as a conceptual issue related to the validity of the measure itself. This is a different issue, but I personally feel that we do not validate tools adequately, and even the best feasible validations don’t tell us what we often suggest they do. The example I used, parent child interaction, is really a pretty abstract concept. Most tools take the “I’ll know it when I see it” standpoint. They use what we term global ratings, where an observer is asked to score on an arbitrary scale where a mom is at with their child in an interaction scenario for the overall interaction across a set of prescribed domains. These domains are often not so well defined themselves, and so this leaves a lot of room for measurement error and bias. In graduate school I developed a micro-coding schema for moment level behavior coding. In this case, discrete behaviors were precisely defined and coded and overall PCI scores were based on the aggregate within a domain (or overall). A confirmatory Factor Analysis was used to create weighted aggregate scores, rather than a direct addition of the items by frequency. This was more predictive of long term development of a child academically and social emotionally than were global scoring schema’s, but the tool is also a beast to use and very difficult to gain high inter-rater reliability. It could not be used by practitioners. Even this, however, was only validated against other similar tools or predictive validity. My concern has always been, our methods of validation often rely on face validity (it looks right) which can be very biased, concurrent validity (that it matches other similar tools) which can have the same biases and inaccuracies in their definition and aim as the new tool, and predictive validity, which actually is often weak when measured, and even when not, precise alignment is needed which may narrow the true validation of the tool from how it is conceptually presented. That is, the actual evidence for validation may narrow what you can say the tool measures, but people don’t present it that way, they would still present it as a universal PCI tool (even if you only have evidence to support that it accurately predicts social emotional development). There is also the possibility that this is all nonsense since you cannot measure PCI the concept directly to compare.

I think I am stuck on this idea of representing in mathematical terms the effect of these misguided measurements. Call it the impact of poor construct validity, where the tool does not measure what it is supposed to measure, and that then weakens an intervention evaluations ability to measure true impacts. When an intervention seeks to improve what is effectively a latent construct that can only be measured by a set of manifest variables organized themselves into latent sub-scales. You end up having a kind of Venn diagram. The intervention itself is only addressing the area where the sets in each scale overlap. In this case, I’m suggesting there are two sets of Venn diagrams (since again, the intervention is addressed something latent or abstract, it can’t be measured directly, but there should be a “true” set of manifest variables that get at the latent construct). You have the Venn for the set of manifest variables and respective sub-scales that would make up the reality of the intervention, and then you have your actual measure. Where the overlap of each set overlaps is the only portion of your actual measure that matches the effect of the intervention. You take a fraction of a fraction and depict it as a true and accurate representation of the outcome of your intervention, but in fact, it was always being measured abstractly, and further, in this case, I’m suggesting it gets mis-measured. Since the outcome is composed of sets of subscales which each represent a latent variable, I’m suggesting that you end up effectively only measuring portions of your intervention’s outcome, and it may even be fairly unreliable now since it is possible and even likely that the tool itself was not assessed for validity or reliability at that sub-scale level.

He then continued:

I generally take issue with people taking their scales apart for other reasons. Many of the outcome measures we use have not been validated at the item or sub-scale level. Who says that sub-scale is a valid outcome measure? In this case what I find are a number of problems. One is that the most scales ask multiple questions that are related. If a latent construct is not well represented within an individual by one question, another will capture it. We then expect that on average across respondents that one or two items captures that aspect of the construct for the individual. When you deconstruct a scale into its sub-scales that is maintained, but many researchers go farther and create their own un-tested short forms. In this case, you lose that. You also don’t know if the new short form is valid or not. The other issue is that with the small samples and discrete nature of the responses (such as with small number likert items) I don’t find sub-scales to be normally distributed anymore. While there are statistics that may be ok for such things (and are common in education), that is not true in my field, where t-tests are as advanced as many get.

I guess that makes me curious what you think of this? In my stats courses in graduate school (the stats department at Purdue), I was taught that violations of the assumption of normality is a mixed bag. You shouldn’t do it, but in most cases the standard analysis we use is robust enough against this violation to cause much bias. As such, many people I know simply ignore this. I guess I’m a rule follower because I’ve always checked it and I have found that it’s often a very poor approximation of a normal distribution. The most common I see are 3 item likerts where nobody chooses the middle point, so it’s a U shape. Everyone chose 1 or 3. Is this a concern or am I making a big thing out of nothing?

My quick responses:

1. We typically don’t really care about violations of the normality assumption (unless we’re interested in predictions for individual cases). See here for a quick summary of most important assumps of linear regression.

Or this regarding assumps more generally.

2. Regarding multiple comparisons: Here, the key mistake is selection. If there are 100 reasonable analyses to do on your data, it’s wrong to select just one comparison and not adjust it for multiple comparisons. But it’s also wrong to select just one comparison and adjust it. The mistake is to select just one comparison, or just a few comparisons. The right thing to do [link fixed] is to keep all 100 comparisons and analyze them with a multilevel model.


  1. Jacob says:

    > See this paper from 2003 for an example

    Was there supposed to be a link here?

  2. Anoneuoid says:

    Comment issues again. I see this on the sidebar (not on the homepage but only from the replication is good page):

    However, I see no comments here…

  3. Anon says:

    > The right thing to do is to keep all 100 comparisons and analyze them with a multilevel model.

    The link in this sentence goes to a MI paper. That’s not right, is it?

Leave a Reply