Controversies in the theory of measurement in mathematical psychology

We begin with this email from Guenter Trendler:

On your blog you wrote:

The replication crisis in social psychology (and science more generally) will not be solved by better statistics or by preregistered replications. It can only be solved by better measurement.

Check this out:

Measurement Theory, Psychology and the Revolution That Cannot Happen (pdf here)

The background is over 100 years of the theory and practice of measurement in psychology, which began (as I understand it, but bear in mind that I’ve never studied the history of these ideas) with the challenge of measuring subjective states. We can measure the length or weight of an object with a ruler or a scale, but how do you measure how loud a sound is, or how angry someone is, or how much something hurts? Or, to make things even more difficult, how do you measure someone’s verbal ability, their extraversion, their level of depression, or where they stand on some other scale of attitude or behavior? All these concepts are, to varying degrees, “real” (in the sense of being observable (even if only indirectly), reproducible, and corresponding to some external conditions) but can’t be measured directly.

Much has been written about the challenges of indirect measurement in psychology, and many of the resulting ideas have come up again in other fields such as sociology, economics, and political science.

How do you measure “social class,” “race,” “economic growth,” or “political ideology,” for example? One must define as well as measure. Even something as simple as the price of some good or service in the marketplace will depend on how you define it.

All this is well understood within psychometrics, with stochastic models used to estimate—and, implicitly, to define—latent constructs of interest such as abilities, attitudes, and mental states.

But, outside of psychometrics, in certain areas of research psychology that make the Psychological Science / PNAS / TED talk / NPR circuit, the subtleties of measurement don’t seem so well understood.

There often seems to be the attitude that, to learn about the connection between latent characteristics A and B, any statistically significant correlation between observations x and y will do—as long as x can be considered in some way, however tenuous, to be a measurement of A, and y can be considered in some way to measure B. We’ve discussed lots and lots of such examples in this space, including fat arms, testosterone, power pose, life expectancy, and that study that labeled days 6-14 as the period of peak fertility. Many of the researchers in these studies didn’t seem to see the problem: they just (incorrectly) equated the measurements with the target of measurement and went from there.

It’s not wrong to use proxy measurements—in many cases, including much of my own work, all we have are proxy measurements!—but you should be aware of the challenge of going from measurement to what you’re trying to measure. If you want to criticize my political science work on the grounds that you can’t trust people’s responses to pollsters, fine. To defend my work, I’ll have to directly address the problems of measurement, and of course political scientists have been studying such issues for decades.

OK, that’s all background. On to Trendler’s papers. I can’t follow exactly what he’s saying. But it’s not just him, it’s the whole literature. I’m just not familiar enough with the terminology and concepts used in the psychological theory of measurement.

That said, I think Trendler might be on to something here. I say this because of what I see as the weakness in the opposition to his arguments, as I discuss next.

Trendler recently published a new paper, Conjoint measurement undone, in the journal Theory & Psychology:

According to classical measurement theory, fundamental measurement necessarily requires the operation of concatenation qua physical addition. Quantities which do not allow this operation are measurable only indirectly by means of derived measurement. Since only extensive quantities sustain the operation of physical addition, measurement in psychology has been considered problematic. In contrast, the theory of conjoint measurement, as developed in representational measurement theory, proposes that the operation of ordering is sufficient for establishing fundamental measurement. The validity of this view is questioned. The misconception about the advantages of conjoint measurement, it is argued, results from the failure to notice that magnitudes of derived quantities cannot be determined directly, i.e., without the help of associated quantitative indicators. This takes away the advantages conjoint measurement has over derived measurement, making it practically useless.

This appeared with two discussions, one by Joel Michell and one by Dave Krantz and Tom Wallsten.

Michell’s comment was pretty technical and I did not try to follow it all. The whole topic just seems so slippery to me. Indeed, even the Wikipedia article on conjoint measurement was hard for me to follow. The topic may well be of fundamental importance, and maybe sometime in the future someone will sit down and explain it to me.

In their comment, Krantz and Wallsten made a larger statement about the replication crisis in psychology and elsewhere. They write:

Replication is abetted by statistical thinking, but not closely tied to it. It was important in science long before the burgeoning of statistics in the late 19th and the 20th century. . . . Roentgen’s discovery of X-rays used only an induction coil, a vacuum tube, cardboard for shielding, and a photographic plate; following his report (January 1, 1896) it was replicated within a month in many European and American laboratories (Pais, 1986, pp. 37–39). Tversky and Kahneman (1971) used a brief questionnaire and an available pool of human respondents to discover that subjective binomial sampling distributions do not vary with stated sample size. One of us replicated this using 50 students (in a graduate statistics class) within weeks after receiving their draft manuscript and we have both since replicated it several times in classroom settings. The culture of replication depends on feasibility, habit of mind, and typical sizes of reported effects. . . .

Excellent point. Replication is fundamental and in many cases does not need to be tied to statistics at all.

Krantz and Wallsten also write:

In fact, both false alarms and low-power misses are statistically inevitable, rather than signs of pathology. Failure to accept this probabilistic viewpoint can contribute to a (false) feeling of crisis, and thence to unreasonable remedies. . . .

Also:

We are horrified by much of the statistical practice in psychology and other research. But so are many other critics. . . . Hardly anyone follows Trendler (2019; or Stevens, 1946) by asserting that development of interval-scale measurement is a prerequisite for statistical analysis. . . .

This does not fit together. Much of the statistical practice in psychology and other research is horrible? Check. Replication is important? Check. The feeling of crisis is “false”? Huh? A problem I see in Krantz and Wallsten’s comment is that they talk about replication in terms of increasing sample sizes, without noting that improved measurement—better data—can be a key step. They write of “the inevitable tradeoff among effect size, sample size, and probability of missing something worthwhile”—but with better measurement we can learn more, with the new input being effort (to take better measurements) rather than sample size. Krantz and Wallsten come close when they write that “valid replication often requires theoretical understanding of the phenomenon in question,” but they don’t take the next step: this theoretical understanding can facilitate, and also come from, deeper measurements. I know that these researchers understand this in their applied work, but in this discussion they don’t seem to make the connection.

In his response to the discussion, Trendler writes:

Unfortunately, the problem of measurability is not perceived as the primary cause of the failure to replicate, but what has been identified instead as the main issue is an inappropriate and dysfunctional use of established methods of statistical analysis. . . .

I agree with Trendler that improved statistical methods are not enough; we also need better measurement.

To summarize: I can’t evaluate the claims in the above discussion regarding conjoint measurement in psychology; it involves many technical details, and I keep getting tangled in them whenever I try to follow the arguments. For me to understand this debate, I think it would help if it were applied to problems in political science such as the measurement of issue attitudes, political ideology, and partisanship, as estimated from survey responses, elections, votes, and political decisions. It does seem that much of the controversial work in psychology I’ve seen has serious problems with measurement, a poor connection of measurement and experimental design to theory, and a general attitude that measurement doesn’t matter. So I do think that something needs to be done—and that “something” can’t just be increased sample size, exact replications of poorly conceived studies, and improved statistical analysis.

28 Comments

  1. gec says:

    While I heartily agree with the need for improved measurement, Trendler’s article itself contains a lot of stuff that seems baffling to me.

    1) The argument is made that for a latent construct to be measurable, a necessary condition is that things that are equal on that construct yield equal values on the measurement scale. It is then pointed out that variability means this will never be possible to guarantee–two individuals with the same ability might produce different scores on a test; conversely, two individuals with the same test score might have different latent ability levels. But then this is used to say that nothing in psychology is measurable? By that logic, nothing ever would be measurable, no? Even Ohm’s needles were never pointing in the “same” direction nor were they ever really in the “same” electric field. I don’t see why stochastic outcomes mean something is not measurable in a meaningful sense, though the argument seems to rest on exactly that assumption.

    2) The kind of variability in #1 that makes “sameness” impossible is random, but of course there is also systematic error, which can be reduced by experimental controls, as Trendler says. But why aren’t the careful experimental controls used by experimental psychologists since the days of Ohm and Helmholtz (who were early experimental psychologists in addition to physicists) sufficient to reduce systematic error to the point of getting good measurements? I agree this is a hard problem, but it is hardly insurmountable and has been dealt with for a long time.

    3) Trendler is pessimistic that we can design experiments that selectively influence particular latent psychological constructs (what he calls “Galilean” science) but I don’t see why. People like Townsend and Dzhafarov have been working on this and, thanks to their efforts, we have experimental and statistical techniques that can check conditions necessary for selective influence. And again I don’t see why this is specific to psychology, it is as much a problem in physics, chemistry, and biology.

    Finally, it seems like Trendler is talking about “psychology” and “psychological phenomena” as if they are only about characterizing stable traits of individuals through observational methods (e.g., personality inventories). This is, I think, why he doesn’t believe good experiments can be done, because it might be hard to shift those things around and hard to know if one had done so. But while that is a concern of some psychologists, it is hardly the whole field—psychophysics (the stuff that began with Helmholtz and Ohm) and cognitive psychology have taken problems of measurement quite seriously since their inception and it is no accident that these fields have shown the greatest theoretical progress and have far fewer reproducibility issues than other branches of psychology (or medicine for that matter—also a measurement nightmare).

    But psychophysics and cognitive psych are not about inferring traits or attitudes from correlational data, they are about building models of how stimulus attributes and context affect observable behavior via internal perceptual/cognitive mechanisms. I grant this is not the “sexy psychology” that often gets promoted in the science tabloids so I understand how it can get overlooked, but I don’t think it is fair to ignore either, particularly since they exemplify how taking measurement problems seriously leads to real progress.

    • Anoneuoid says:

      The argument is made that for a latent construct to be measurable, a necessary condition is that things that are equal on that construct yield equal values on the measurement scale

      I think once you start talking about measuring something in this way you are already lost. Everything you try to measure has its own unique problems.

    • Hans says:

      gec: I also find the Trendler article confusing and hard to follow but I’ll attempt an interpretation (in hope of being corrected!):

      First of all, I think you are right in thinking that Trendler is talking about traits, attitudes, and other psychological constructs that are usually “measured” using the traditional toolbox of psychometric techniques.

      Measurement (which is taken to mean quantification) depends on a whole bunch of lawful relationships. First, in psychology you don’t get very far unless you account for error, which means that you need a true score. You can get that either by simply assuming that it exists and working from there, as in classical test theory (which I imagine Trendler does not find acceptable), or you can have a causal link between the measurement and that which is measured.

      Then you need a relationship between qualitative states that maps to numbers. Stuff that can be ordered cannot necessarily be quantified. You can say that hotel B is better than hotel A because it has a nice restaurant and that hotel C is even better than hotel B because it also has a nice restaurant and a pool. You can compare and rank the hotels and people will understand what you mean but you can’t compare the differences between them. You can say which is better than which but not by how much. Constructs can be meaningful and clearly ordered without being quantitative (as Joel Michell keeps pointing out).

      Measurement in psychology (in “soft psychology” at least) can’t be based on concatenation. You can’t add an attitude of one strength to another attitude and compare the outcome with a third attitude of some other strength. So if the relationship between attitudes is quantitative (i.e., something that can be better represented with numbers rather than, say, letters) you have to find some way of comparing them and the differences between them.

      In theory conjoint measurement could be a solution. If you have a pair of variables that relate non-interactively to a third then you can compare “configurations” of the pair and you can compare the differences between them. So far, so good. If psychological phenomena like attitudes are indeed quantitative we could in theory measure them like that.

      The problem is that all of this relies on relations that are lawful. These need to be established empirically, and we just can’t manipulate the stuff we’re trying to measure to the point of doing this (which is what I think Trendler is saying). I think his big point is that causal theory and measurement can’t be separated and that the necessary causal theory can’t be tested.

      If this is a correct reading of Trendler this is as far as I can follow the argument. The only way I can make sense of the final bit is to imagine that he is saying that at least one of the variables would have to be a quantitative measurement for everything to work, but that can’t be correct.

      • gec says:

        Thanks, I appreciate the translation into terms I can actually understand (unlike “Galilean” or “Millean”, but that may be an ignorance of philosophy on my part).

        I certainly get the problem with concatenation, and this is why things like conjoint measurement and systems factorial technology are built around detecting different types of (what a statistician would call) interactions, but as you say these methods also rely on assumptions about how we can set things up (e.g., selective influence in SFT).

        But at least from how I read it, the core of Trendler’s argument is the objection about “sameness”, in that this is said to be a necessary condition for measurement, i.e., the same latent state gives the same answer on your measurement instrument. Putting aside the fact that it is possible to construct measurement theories that are stochastic (i.e., a latent state maps onto a distribution of outcomes), if this really were a requirement of measurement then literally nothing could ever be measured since as the Greek said, you can’t step in the same river twice.

        So, as with the final part, I have to assume Trendler means something other than what I’m reading, but I can’t pull it out from the article.

  2. Vithor says:

    I’ve been studying measurement theory for five years now and maybe I can help with an intuitive understanding of it.

    The basic idea is that quantitative things should present a basic structure to be identified as such. This basic structure is that of the real numbers (https://en.wikipedia.org/wiki/Construction_of_the_real_numbers). As it is not possible to perform mathematical operations on things themselves, we need a numerical representation of the property we want to assess. “Attributing numerical representations” is what we know as “measuring” something. Concatenation is the basic procedure used in physics to measure fundamental properties. The whole idea of measurement theory started with Hölder’s 1901 paper, where he proved that concatenation is just one general procedure for showing that some observable qualitative properties have the same basic structure as the real numbers. What Luce and Tukey did in 1964 with conjoint measurement was to prove that there are procedures other than concatenation that also allow us to conclude that an observed qualitative property has the same basic structure as the real numbers.

    Some misconceptions about measurement theory, and especially conjoint measurement, are also pretty widespread. The first is Trendler’s assertion that “[…] conjoint measurement, as developed in representational measurement theory, proposes that the operation of ordering is sufficient for establishing fundamental measurement”. This is imprecise because it is true ONLY for the n-component version of the additive conjoint measurement model. For the traditional additive conjoint measurement model, it is necessary to test other conditions (such as double cancellation). Another misconception is that people talk about conjoint measurement but only take into account additive conjoint measurement theory. There is n-component additive measurement, polynomial conjoint measurement, non-additive and subtractive conjoint measurement, and so on. No single measurement theory should be put on a pedestal, as measurement theories are just abstract descriptions of how to demonstrate that observed qualitative properties are, in fact, quantitative.
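To make the double-cancellation condition concrete, here is a minimal sketch (mine, not Vithor’s; the matrices are invented for illustration) of checking it on a table of ordered observations, along the lines of what packages such as ConjointChecks automate:

```python
from itertools import product

def double_cancellation_violations(M):
    # Double cancellation for additive conjoint measurement:
    # whenever M[a][y] >= M[b][x] and M[b][z] >= M[c][y],
    # additivity requires M[a][z] >= M[c][x].
    n_rows, n_cols = len(M), len(M[0])
    violations = []
    for a, b, c in product(range(n_rows), repeat=3):
        for x, y, z in product(range(n_cols), repeat=3):
            if M[a][y] >= M[b][x] and M[b][z] >= M[c][y]:
                if M[a][z] < M[c][x]:
                    violations.append(((a, b, c), (x, y, z)))
    return violations

# A genuinely additive table (entry = row effect + column effect)
# produces no violations:
additive = [[r + c for c in (0, 2, 5)] for r in (0, 1, 3)]
print(len(double_cancellation_violations(additive)))  # 0

# A non-additive table can violate the condition:
warped = [[0, 5, 0], [0, 0, 5], [10, 0, 1]]
print(len(double_cancellation_violations(warped)) > 0)  # True
```

If the condition fails anywhere, no additive numerical representation of the two factors exists, which is the sense in which such a test can "break down" rather than always returning numbers.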

    Another discussion involves the relation and differences between psychometrics and conjoint measurement theory. Some people argue that psychometric models, such as the Rasch model, are a probabilistic version of additive conjoint measurement, which is simply untrue. Also, some people say that, for instance, only Rasch and one-parameter logistic models allow interval measures to be attained, which is also untrue. In the measurement theory literature, IRT, multidimensional scaling, factor analysis, and so forth, are methods of scaling. Scaling methods are used to create appropriate numerical representations GIVEN that a particular measurement theory is assumed to be true. For instance, the Rasch model does take probabilistic errors into account to create the best (usually, maximum likelihood) numerical representation according to the additive conjoint measurement model. However, the Rasch model will always give the best numerical representation according to the additive conjoint measurement model, even if that measurement model is false or imprecise. The two-parameter logistic model, for instance, will always give the best numerical representation for a distributive rule of the compositional conjoint measurement model, even if that model is incorrect.
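A numerical illustration of the structural point (my sketch, not Vithor’s; the ability and difficulty values are arbitrary): in the Rasch model the log-odds of a correct response are additive in person ability and item difficulty, so the logit difference between two items is the same for every person, whereas the two-parameter logistic model’s discrimination parameter breaks that additivity:

```python
import math

def rasch_p(theta, b):
    # Rasch model: P(correct) = logistic(theta - b); the log-odds
    # are additive in person ability theta and item difficulty b.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def twopl_p(theta, a, b):
    # Two-parameter logistic: discrimination a rescales (theta - b),
    # so log-odds are no longer additive across items.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def logit(p):
    return math.log(p / (1 - p))

# Under Rasch, the logit difference between two items is the same
# for every person (the additive, conjoint-like structure):
d1 = logit(rasch_p(0.0, -1.0)) - logit(rasch_p(0.0, 1.0))
d2 = logit(rasch_p(2.0, -1.0)) - logit(rasch_p(2.0, 1.0))
assert abs(d1 - d2) < 1e-9  # both equal 2.0

# Under the 2PL with unequal discriminations, the difference
# depends on the person:
e1 = logit(twopl_p(0.0, 0.5, -1.0)) - logit(twopl_p(0.0, 2.0, 1.0))
e2 = logit(twopl_p(2.0, 0.5, -1.0)) - logit(twopl_p(2.0, 2.0, 1.0))
assert abs(e1 - e2) > 1e-6
```

Either model will still happily produce numbers from data; nothing in the fitting procedure itself certifies that the additive structure holds, which is Vithor’s point about scaling versus measurement.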

    I don’t want to go on much longer, as I said I was trying to give an intuitive understanding of the problem. The final point, and the only one on which I actually agree with Trendler, is that latent variables cannot be measured, at least not in the traditional sense that he, or anyone in measurement theory, uses. One can experimentally control the external contingencies of behavior and find very consistent results. However, inferring that this experimental setting affected some unobservable variable, and that this is why we saw differences in behavior, can be considered a long stretch. Behaviorists have been pointing this out for ages. This is not a problem that can be solved with measurement theory, psychometrics, or any clever mathematical or statistical tool, but only with experimental design and a lot of collaborative brain power.

    • Erik says:

      What do you think would be the best way to move forward in applied work (in social science)? Focus on conjoint measurement and use stuff like the ConjointChecks package in R? Or use Rasch/IRT even though it is imperfect? Or both in tandem, first ACM and then Rasch, for example?

      I heartily agree that poor measurement has to be a big reason that so much social science is wonky. Still, I’d be very happy if we don’t have to give up completely. I’d absolutely enjoy methods for testing whether I have a quantitative attribute at hand. It would also be great if such methods had a good chance of breaking down and not generating a numerical solution when the attribute is not quantitative.

      • Vithor says:

        My current view on the issue is basically that of Sijtsma (2012): https://journals.sagepub.com/doi/abs/10.1177/0959354312454353. In the abstract, he uses a phrase that, for me, sums up the problem pretty well: “Only the rigorous development of attribute theories can lead to meaningful measurement.” I would also add an appeal made by Townsend (2008): https://www.sciencedirect.com/science/article/abs/pii/S0022249608000436. TL;DR: Psychology courses should teach more mathematics. Advanced statistics is great, but if we think of math as the language of the quantitative realm, and we want to be a quantitative science, we cannot keep using natural language to propose theories that will then be tested with mathematical techniques.

        That answer was probably too abstract and long-run oriented. A more short-term “solution” would be to make greater use of experimental designs in creating psychological measurements. We can get a lot of inspiration from researchers in psychophysics who used ACM, for instance: https://www.sciencedirect.com/science/article/abs/pii/S002224961100040X. Part of the problem, in my humble opinion, is that psychometric procedures are taken as a synonym for psychological measurement, and many times are considered the only acceptable alternative. But, realistically, psychometrics is just one type of a considerably limited measurement approach.

        However, if we are to keep using only, or mostly, psychometrics for psychological measurement, I can think of at least two alternatives. First, make nonparametric IRT (kernel smoothing, optimal scores, Mokken scale analysis [MSA], whatnot) the default. I think MSA should be of special interest, as it tests some assumptions of IRT instead of just finding the best model that accords with those assumptions. If this is not feasible, the next alternative is to use explanatory item response models as the default; see the book by Paul de Boeck and Mark Wilson. They at least allow for better explaining the variance in the data, constraining what the latent variables should explain.

  3. zbicyclist says:

    “Hardly anyone follows Trendler (2019; or Stevens, 1946) by asserting that development of interval-scale measurement is a prerequisite for statistical analysis” (Krantz and Wallsten)

    But is that right? Surely ordinal statistical analyses are useful.

    • Hans says:

      Maybe this is yet another case of psychologists confusing statistics with theory?

      Psychological tests yield scores which are numerical. If these numerical scores are useful for summarizing or predicting things when treated as numbers, then of course there are plenty of situations where treating them as interval-scale measures for the purposes of statistical analysis can make sense. For example, most people would probably find it useful to identify good predictors of PISA reading scores even if they don’t believe that a unit of reading ability exists.

      Things change, however, when you start treating psychological constructs as real things that can cause other things. Using scores on personality inventories, IQ tests, or anxiety questionnaires as summaries is one thing. It is a completely different thing, however, to say, for example, that neuroticism mediates the effect of genes on depression, or that differences in average IQ levels explain differences in some life outcome between racial groups, or that state anxiety causes poor performance on some task. If such statements are true then neuroticism, IQ, and state anxiety are real attributes of a person, and either they are quantitative or not.

      I think it is quite obvious that if you believe psychological constructs are real attributes of a person that can cause (rather than merely predict) effects, then understanding what kind of attributes they are is necessary for theory development.

      This may be a bit (further) off topic but I think that psychology’s addiction to significance testing has a lot to do with the fact that many of psychology’s measures are more or less uninterpretable. You regress anxiety scores on something and find a slope of -0.35 (SE=0.1) which you have to interpret, but you have no idea what 0.35 anxiety score points are supposed to mean. Solution: Something was significantly negatively correlated with anxiety (p<0.001).

      • gec says:

        > Maybe this is yet another case of psychologists confusing statistics with theory?

        I think this is related to Vithor’s point, above,

        > This is not a problem that can be solved with measurement theory, psychometrics, or any clever mathematical or statistical tool, but only with experimental design and a lot of collaborative brain power.

        And this is really the crux of the issue—we need to be focused on constructing actual theories that link constructs to outcomes, it doesn’t come for free. In relation to zbike’s point below, we don’t need an “anxiety score”, we need a theory of what anxiety *is* such that it would cause someone to produce a particular (range of) outcomes on some instrument. Admittedly, this requires making a lot of hard choices (and potentially being wrong) along the way, but as we see in certain branches of psychophysics and cog psych, it pays off because you end up with a better understanding of the construct and how to learn about it.

  4. zbicyclist says:

    “This may be a bit (further) off topic but I think that psychology’s addiction to significance testing has a lot to do with the fact that many of psychology’s measures are more or less uninterpretable. You regress anxiety scores on something and find a slope of -0.35 (SE=0.1) which you have to interpret but you have no idea what 0.35 anxiety score points are supposed to mean. Solution: Something was significantly negatively correlated with anxiety (p<0.001)."

    Both true and on-topic.

    When I left academia and went into industry, I found to my great joy that I was spending my time with ACTUAL interval and ratio measurements: dollars, cases, price, advertising rating points, and so on. Yes, these have measurement issues (e.g. related to how sales are estimated from a sample), but nothing like the behavioral sciences.

    I can't get by with saying "if you raise the price 10%, sales will go down, and that effect will be statistically significant." That's trivially useless. I'm expected to provide a point and interval estimate for the specific amount sales will go down, in order to help guide decision making. But who knows what 0.35 anxiety points mean, other than "more anxiety"?

    • Steve says:

      “I found to my great joy that I was spending my time with ACTUAL interval and ratio measurements: dollars, cases, price, advertising rating points, and so on.”

      Don’t all of the measurement issues that plague psychological constructs apply to money as well — at least if we are talking about its “real” value? To adjust for inflation I need an index that assumes that one amount of money fetches the same amount of utility as another amount. But utility isn’t something that can be measured. I can only ever get ordinal data: person A likes apples more than oranges. That may not matter much when the basket of goods that I am using as my index hasn’t changed much, but as time goes by, I have to truly depend upon my untestable assumption that utility has an interval and ratio measure. Statements like “Rockefeller was richer than Bill Gates” or “China’s GDP is as large as the US’s” are as problematic and uninterpretable as the “0.35 anxiety score.”

      • I think this is why the base basket of goods should be something that is connected to biological or physical constraints / needs. So for example

        A certain number of square feet of climate controlled living space + a typical number of calories / day of a wide mixture of foods + the cost of transporting oneself for an hour a day to and from a work location + the cost of hiring someone to care for a young child for 8 hours + the cost of providing education at a middle grade level + the cost of caring for an elderly adult + the cost of medical care for an uncomplicated broken arm or leg + the cost of medical care for a respiratory illness such as a bad cold or mild flu.

        If you built such an index and converted it to an annualized cost assuming typical consumption levels, such as maybe 2 or 3 broken bones in a lifetime, three children per two adults, 5 years of pre-school care, 12 years of elementary school care, one respiratory illness per year or every other year, 1500 to 2000 calories of food per day, etc etc, you would have an index that transported value across literally hundreds of years with relative ease. It would apply mainly to “typical” people, not Andrew Carnegie or whatever, but it would have a lot of use, and in general be a better indicator of “poverty” than what we have now, or than inflation via the CPI.
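As a toy sketch of how such an index might work (all items, quantities, and prices here are invented for illustration, not proposed numbers): fix the physical quantities, price the basket in each year's nominal currency, and use the ratio of basket costs to transport value between years:

```python
# Fixed physical quantities per adult per year (invented numbers):
basket = {
    "calories (thousands)": 700,     # roughly 1900/day
    "housing (sq-ft-months)": 4800,  # 400 sq ft year-round
    "commute hours": 500,
    "childcare hours": 650,          # amortized share
}

def basket_cost(prices):
    # Annual cost of the fixed basket at the given unit prices.
    return sum(qty * prices[item] for item, qty in basket.items())

# Hypothetical unit prices in two years' nominal currency:
prices_1990 = {"calories (thousands)": 1.0, "housing (sq-ft-months)": 0.5,
               "commute hours": 2.0, "childcare hours": 3.0}
prices_2020 = {"calories (thousands)": 2.0, "housing (sq-ft-months)": 1.5,
               "commute hours": 5.0, "childcare hours": 12.0}

# The ratio of basket costs is the deflator that transports value:
deflator = basket_cost(prices_2020) / basket_cost(prices_1990)
print(round(deflator, 2))  # 3.12
```

The design choice is that the quantities, being tied to physical needs, stay fixed; only the prices vary, so no assumption about cardinal utility is needed.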

        • Steve says:

          This might be a workable suggestion if all we want is a basket that measures the price of basic nutritional needs. But even as a poverty measure that fails (though it might be helpful). As soon as we try to take into account other needs, including needs that are pretty basic like clothing and housing, quality considerations come into play as well as what we might call cultural predispositions. I need clothes, but the quality of clothes varies considerably over time. To the extent we can measure it, quality of clothing has gone down, but also tastes have changed. Yoga pants were unacceptable public attire not that long ago. Jeans were controversial public attire in the 70s. The space required to live depends in part on the cultural knowledge and other requirements on how to utilize space. People in Japan and Hong Kong are adept at turning small spaces into livable spaces, but Americans aren’t. Thus, a basket with only basic needs tied to biological needs (outside of nutritional needs — and even that is questionable) will require comparing goods and services that are not truly identical over time and thus require the fiction of utility as a quantity with full cardinality, which is just a fiction.

          • The quality issues and cultural norms are of course an issue, but I think they are second order. Compare the everyday purchases of someone in 1820 with everyday purchases in 1920 and 2020: the importance of musket balls, feed corn, raw milk, cotton yardage, maple sugar, and wagon wheels in 1820 vs. now…

            • Steve says:

              I think that we are in agreement as long as the time frame and geographic/cultural differences are not too large. It is not hard for me to concede that I am richer than a middle class person in the 18th century, at least from my perspective (that last bit is important). But it is very questionable for me to believe that I am twice (or 50%, or whatever %) richer than someone in the 18th century. Those are the claims that I am objecting to as unscientific.

      • Hans says:

        Money may not be a measure of utility, but it is hardly as uninterpretable as anxiety scores. Predicting, say, a 10% increase in profits is a prediction that “translates back into the real world”: you are, for example, predicting that the profits can be traded for more stuff than before.

        In that sense money is perhaps more like a score on a well-designed standardized test of academic achievement in some area. The test is not really a measure of anything, but it is a pretty good indicator of examinees’ chances of successfully completing tasks that were not on the actual test (though of course the difference between a summary and a measure can be very important for theory).

        Anxiety scores on the other hand are at best very weak predictors of life outcomes. Predicting a decrease of 0.35 score points has very few implications. The score is pretty useless for practical purposes and your options for further testing and developing the theory that predicted the decrease are limited.

        • Actually, no. If your profits increase by 10% in nominal dollars but the general cost of the stuff you want to buy increases by, say, 20% in nominal dollars, then you can only buy about 1.1/1.2 ≈ 0.92 times as much stuff as you could last year.

          Only ratios of dollars make sense. We are used to prices being very stable, so we often ignore this in the short term, but we do so at our peril.

          • Hans says:

            I only said “more stuff” because I was trying to avoid the whole issue of inflation and price changes. I know the claim is still not necessarily true but I don’t think that is relevant to the main point.

        • Steve says:

          I don’t think that all claims about money are uninterpretable. If you have more money than I do, you are probably richer any way we slice it, because we are probably living in more or less the same economy and culture. But when we start making comparisons over big stretches of time or big cultural and geographic divides (like comparing our GDP to China’s), we are making so many choices about what goes into the two baskets being compared that we are really free to construct the baskets any way we want. I own a smartphone today: should I compare that to owning a game console, mobile phone, personal computer, camera, and sound system in 1980, or to just one of those, or to something else that seems like a better comparison? I was just pointing out that many of the problems you point out for psychological constructs are there for money too, if we push the inferences we make too far.

  5. Oliver C. Schultheiss says:

    I must confess that I haven’t read Trendler’s paper, only Andrew’s summary and the discussion here. I can’t help but agree with the verdict that anything goes for measurement issues in some parts of psychology (though not all; I too would exempt cognitive psychology here). For the past 25 years I’ve been doing research on motivational dispositions, which usually fall into the realm of social and personality psychology. I can attest to a pervasive lack of rigour when it comes to measurement in this research domain. I sometimes refer to it as the “claiming game,” where, in addition to the face validity of the self-report items typically used, the mere claim that someone’s new inventory of, say, achievement motivation actually measures the thing it’s named for is sufficient to demonstrate validity. Such claims are usually bolstered by convergent and discriminant validity coefficients with other self-report-based measures, weaving one of those infamous nomological networks of which Paul Meehl remarked as far back as 1967 that researchers can forever build and extend them without once coming up with a real shred of evidence supporting the validity of their measure.

    The cure for much of this nonsense would be the adoption of more rigorous validity standards for presumed measures of an attribute. In a Psychological Review paper published in 2004, Borsboom and colleagues made the very common-sense statement that a measure is valid if and only if the thing it measures really exists and manipulations of that thing lead to corresponding variations in the measure. Think heat and the thermometer here. Almost 50 years earlier, in 1958, David McClelland similarly made the criterion that a measure be sensitive to variations in the thing it measures the cornerstone of his validation approach for measures of motivation. As you will note, I am not even talking about some of the finer points brought up in this discussion, which concern the intricacies of measurement itself: the mapping of real-world phenomena onto numerical scales and all the problems this may pose. All of these considerations will be in vain if the bare requirement that a measure must be linked to a real thing and its variations is not met. But that’s the state of much of motivation science and the broader field of personality psychology, which frequently come up with measures first and then try to figure out what the underlying causes generating variations in these measures might be.

    Castles of sand.

  6. Mort says:

    In their response Krantz and Wallsten point to some successful applications of additive conjoint measurement, but in his rejoinder Trendler scoffs at such old, one-off examples. Has there ever been a successful replication of a conjoint measurement model?

  7. Stanislav says:

    We don’t need measurement in Trendler’s sense to do interesting work; we just need reliable classification of observations into cases. Reliability is on a sliding scale, so we can assess usefulness for our purposes. With classification we can compute probabilities, which are dimensionless, and therefore not measures per Trendler.

    Latent variables from psych instruments can be seen in this light, and then their usefulness is in prediction of other classified events. This is where the criticism of measurement hits home, because it is sloppy to refer to, e.g., a reading test as such. That is, “Tatiana scored low on her reading test” is misleading because the name “reading test” is going to be interpreted by the general public via their own classifications of what “reading” means, not via what the test results actually predict. See Academically Adrift for a book-length mistake of this kind. This puts the burden on researchers not to use ordinary language when it might trigger native classifiers, but to spell out the predictive ability of the instrument explicitly, including how exactly classifications are made.

    I think the increasing use of arbitrary (made up) language is a sign of a maturing science. In the beginning, classifiers are tied to what ordinary people see: a “red-winged blackbird” instead of “Agelaius phoeniceus”. The reason is that as classifiers become more reliable they diverge from common experience (because of controlled conditions, declared assumptions, etc.). So we end up with things like “quarks” and “volts” where before we had “earth,” “wind,” “fire,” and “water.”

  8. Davis-Stober says:

    Interesting discussion.

    I tend to view conjoint measurement as a very useful theoretical construct, rather than as a specific applied measurement tool/procedure. One of the original papers on conjoint measurement, the classic Duncan Luce and John Tukey (yes, the statistician) paper from 1964 (Journal of Mathematical Psychology, issue 1), provides a complete axiomatization for such a conjoint measurement scale to exist.

    One could ask the question: how can I represent voters’ preferences among political candidates? To keep it simple, suppose we were interested in just two attributes: 1) extremeness of political views and 2) perceived electability. One could argue that these two attributes “trade off” with one another when voters evaluate candidates. Now we are faced with an interesting question. How should we model this evaluation? We could approach the problem in an applied statistical manner: we could run a logistic regression on voters’ stated preferences (or something similar), include these attributes as parameters in the model, and see how the model fits. A perfectly fine thing to do, and it can be used to evaluate many different questions.

    On the other hand, we could evaluate in a rigorous fashion whether voters’ stated preferences over the candidates satisfy the set of axioms both necessary and sufficient for a conjoint measurement representation. This would also require a degree of modeling and statistical testing, but suppose we did so and we found that, yes, voters’ stated preferences over the candidates DID admit a conjoint measurement scale. This would lead us to conclude that voters’ preferences admit a utility function in which there exist separable (and additive) functions of “extremeness of political views” and “electability”. We would conclude that there exist functions, f and g, such that:

    Utility of candidate = f(“extremeness”) + g(“electability”)

    I think it’s a bit funny to ask whether or not this is useful in a practical sense. A typical logistic regression (or similar) model addresses a particular type of question. The conjoint measurement question addresses a different one. I would argue that the conjoint perspective allows for a strong springboard to developing and refining a theory of candidate preference, as it directly assesses what is representable. It is also highly psychological, in that we now have an “as if” model of how voters evaluate candidates using these two attributes. This is analogous to the random utility condition, which the multinomial logit model must also satisfy.
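    A toy Python sketch of this additive representation: the f and g values below are invented for illustration, and the check implements double cancellation, one of the necessary conditions for an additive conjoint representation in the Luce and Tukey framework (it is not a substitute for the full axiomatization).

```python
from itertools import product

# Hypothetical component utilities (invented for illustration).
f = {"low": 2.0, "mid": 1.0, "high": -1.5}   # extremeness of political views
g = {"weak": 0.0, "ok": 1.2, "strong": 2.5}  # perceived electability

def utility(extremeness, electability):
    """Additive 'as if' model: Utility = f(extremeness) + g(electability)."""
    return f[extremeness] + g[electability]

def double_cancellation_holds(u, A, B):
    """Double cancellation, a necessary axiom for additive conjoint
    measurement: if (a1,b2) >= (a2,b1) and (a2,b3) >= (a3,b2),
    then (a1,b3) >= (a3,b1) must hold as well."""
    for a1, a2, a3 in product(A, repeat=3):
        for b1, b2, b3 in product(B, repeat=3):
            if (u(a1, b2) >= u(a2, b1) and u(a2, b3) >= u(a3, b2)
                    and u(a1, b3) < u(a3, b1)):
                return False
    return True

# Any utility of the form f + g satisfies the axiom by construction:
# adding the two antecedent inequalities and cancelling f(a2) + g(b2)
# yields f(a1) + g(b3) >= f(a3) + g(b1).
assert double_cancellation_holds(utility, list(f), list(g))
```

    Observed preference data that violate such a condition cannot admit any additive representation, which is what makes the axiomatic route a direct assessment of representability rather than a goodness-of-fit exercise.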

  10. Paul Barrett says:

    To provide a richer background to Gunter’s work, the essence of the measurement issues, and many more references that speak to various points raised by commentators above, it might be useful (if interested) to download or read online my recent article (and response to reviewers) on the whole ‘measurement’ issue embodied in the test review standards of the International Test Commission, used by Veritas and various international agencies for accrediting psychological assessments. My article also addresses the legal aspect of an ‘expert’ offering testimony in court regarding the evidential status of a ‘measurement’ of a psychological attribute. This is for real, having been the focus of expert opinion offered in a NZ judicial review of the statutory definition of a diagnosis of learning disability, and in the US courts by the first author of this article:
    Beaujean, A.A., Benson, N.E., McGill, R.J., & Dombrowski, S.C. (2018). A misuse of IQ scores: Using the dual discrepancy/consistency model for identifying specific learning disabilities. Journal of Intelligence (http://www.mdpi.com/2079-3200/6/3/36), 6, 36, 1-25. [Open-access]

    My article can be found at:
    Barrett, P.T. (2018). The EFPA test-review model: When good intentions meet a methodological thought disorder. Behavioural Sciences (https://www.mdpi.com/2076-328X/8/1/5), 8,1, 5, 1-22.[open-access]

    As to the status of “latent variable modeling and latent constructs”, I suggest a close reading of Part 2 of Michael Maraun’s online book:
    Maraun, M.D. (2007). Myths and Confusions. http://www.sfu.ca/~maraun/myths-and-confusions.html [open-access]

    and the recent article:
    Hanfstingl, B. (2019). Should we say goodbye to latent constructs to overcome replication crisis or should we take into account epistemological considerations?. Frontiers in Psychology: Quantitative Psychology and Measurement (https://doi.org/10.3389/fpsyg.2019.01949), 10, 1949, 1-8.[open-access]
