Beaujean, A.A., Benson, N.E., McGill, R.J., & Dombrowski, S.C. (2018). A misuse of IQ scores: Using the dual discrepancy/consistency model for identifying specific learning disabilities. Journal of Intelligence (http://www.mdpi.com/2079-3200/6/3/36), 6(3), 36, 1–25. [open-access]

My article can be found at:

Barrett, P.T. (2018). The EFPA test-review model: When good intentions meet a methodological thought disorder. Behavioral Sciences (https://www.mdpi.com/2076-328X/8/1/5), 8(1), 5, 1–22. [open-access]

As to the status of “latent variable modeling and latent constructs”, I suggest a close reading of Part 2 of Michael Maraun’s online book:

Maraun, M.D. (2007). Myths and Confusions. http://www.sfu.ca/~maraun/myths-and-confusions.html [open-access]

and the recent article:

Hanfstingl, B. (2019). Should we say goodbye to latent constructs to overcome replication crisis or should we take into account epistemological considerations? Frontiers in Psychology: Quantitative Psychology and Measurement (https://doi.org/10.3389/fpsyg.2019.01949), 10, 1949, 1–8. [open-access]

This answer was probably too abstract and long-run oriented. A more short-term “solution” would be to increase the use of experimental designs when creating psychological measurements. We can draw a lot of inspiration from researchers in psychophysics who have used additive conjoint measurement (ACM), for instance: https://www.sciencedirect.com/science/article/abs/pii/S002224961100040X. Part of the problem, in my humble opinion, is that psychometric procedures are taken as a synonym for psychological measurement, and many times are considered to be the only acceptable alternative. But, realistically, psychometrics is just one type of measurement approach, and a considerably limited one.

However, if we are to keep using only, or mostly, psychometrics for psychological measurement, I can think of at least two alternatives. First, make nonparametric IRT (kernel smoothing, optimal scores, Mokken scale analysis [MSA], and so on) the default. MSA should be of special interest, as it tests some assumptions of IRT instead of just finding the best model that accords with those assumptions. If this is not feasible, the next alternative is to use explanatory item response models as the default; see the book by Paul De Boeck and Mark Wilson. These at least allow for better explaining the variance in the data, restricting what the latent variables should explain.
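To give a flavor of the kind of assumption checking MSA does, here is a minimal sketch (function name and data hypothetical) of Loevinger's scalability coefficient H, the core statistic of Mokken scale analysis, which compares observed inter-item covariances to the maximum attainable given the item marginals:

```python
import numpy as np

def mokken_h(data):
    """Loevinger's scalability coefficient H for binary item data.

    H compares observed inter-item covariances to the maximum
    covariances attainable given the item marginals; H = 1 means a
    perfect Guttman scale, H near 0 means no scalability.
    """
    data = np.asarray(data, dtype=float)
    p = data.mean(axis=0)          # proportion passing each item
    n_items = data.shape[1]
    cov_obs, cov_max = 0.0, 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            cov_obs += np.mean(data[:, i] * data[:, j]) - p[i] * p[j]
            cov_max += min(p[i], p[j]) - p[i] * p[j]
    return cov_obs / cov_max

# A perfect Guttman pattern: everyone who passes a harder item also
# passes every easier one, so H = 1.
guttman = [[0, 0, 0],
           [1, 0, 0],
           [1, 1, 0],
           [1, 1, 1]]
print(mokken_h(guttman))  # 1.0
```

In practice one would use a dedicated implementation (e.g., the R `mokken` package), which also gives standard errors and item-level coefficients; the point here is only that H is a *test* of scalability, not a model fitted regardless of the data.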

I tend to view conjoint measurement as a very useful theoretical construct, rather than as a specific applied measurement tool/procedure. One of the original papers on conjoint measurement, the classic Duncan Luce and John Tukey (yes, the statistician) paper from 1964 (Journal of Mathematical Psychology, volume 1), provides a complete axiomatization for such a conjoint measurement scale to exist.

One could ask the question: how can I represent voters’ preferences among political candidates? To keep it simple, suppose we were interested in just two attributes: 1) extremeness of political views and 2) perceived electability. One could argue that these two attributes “trade off” against one another when voters evaluate candidates. Now we are faced with an interesting question: how should we model this evaluation? We could approach the problem in an applied statistical manner – we could run a logistic regression on voters’ stated preferences (or something similar), include these attributes as parameters in the model, and see how the model fits. That is a perfectly fine thing to do and can be used to evaluate many different questions.

On the other hand, we could evaluate in a rigorous fashion whether voters’ stated preferences over the candidates satisfy the set of axioms both necessary and sufficient for a conjoint measurement representation. This would also require a degree of modeling and statistical testing, but suppose we did so and found that, yes, voters’ stated preferences over the candidates DID admit a conjoint measurement scale. This would lead us to conclude that voters’ preferences admit a utility function in which there exist separable (and additive) functions of “extremeness of political views” and “electability”. That is, there exist functions, f and g, such that:

Utility of candidate = f(“extremeness”) + g(“electability”)

I think it’s a bit funny to ask whether or not this is useful in a practical sense. A typical logistic regression (or similar) model addresses a particular type of question. The conjoint measurement question addresses a different one. I would argue that the conjoint perspective provides a strong springboard for developing and refining a theory of candidate preference, as it directly assesses what is representable. It is also highly psychological, in that we now have an “as if” model of how voters evaluate candidates using these two attributes. This is analogous to the random utility condition, which the multinomial logit model must also satisfy.
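For concreteness, here is a minimal sketch (attribute levels and values hypothetical) of the double-cancellation check on a 3×3 grid of candidate evaluations – one of the testable axioms of an additive conjoint representation:

```python
import numpy as np

def double_cancellation_holds(m):
    """Check double cancellation on a 3x3 value matrix m, where
    m[i][j] is the evaluation of the candidate with extremeness
    level i and electability level j.

    Double cancellation: if (0,1) >= (1,0) and (1,2) >= (2,1),
    then (0,2) >= (2,0) must hold (and symmetrically with the
    inequalities reversed).
    """
    m = np.asarray(m, dtype=float)
    if m[0, 1] >= m[1, 0] and m[1, 2] >= m[2, 1]:
        if not m[0, 2] >= m[2, 0]:
            return False
    if m[1, 0] >= m[0, 1] and m[2, 1] >= m[1, 2]:
        if not m[2, 0] >= m[0, 2]:
            return False
    return True

# Any additive structure f(i) + g(j) satisfies the condition...
f, g = [0.0, 1.0, 3.0], [0.0, 2.0, 5.0]
additive = [[fi + gj for gj in g] for fi in f]
print(double_cancellation_holds(additive))   # True

# ...while this hand-built matrix violates it.
violating = [[0, 5, 0],
             [0, 0, 5],
             [9, 0, 0]]
print(double_cancellation_holds(violating))  # False
```

In a real application one would check this (and the other cancellation conditions) over all 3×3 submatrices of the design, with some error theory layered on top for stochastic choice data.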

Latent variables from psych instruments can be seen in this light, and then their usefulness lies in predicting other classified events. This is where the criticism of measurement hits home, because it is sloppy to refer to, e.g., a reading test as such. That is, “Tatiana scored low on her reading test” is misleading because the name “reading test” will be interpreted by the general public via their own classifications of what “reading” means, not by what the test results actually predict. See Academically Adrift for a book-length mistake of this kind. This puts the burden on researchers not to use ordinary language when it might trigger native classifiers, but to spell out the predictive ability of the instrument explicitly, including how exactly classifications are made.

I think the increasing use of arbitrary (made-up) language is a sign of a maturing science. In the beginning, classifiers are tied to what ordinary people see – a “red-winged blackbird” instead of “Agelaius phoeniceus”. The reason is that as classifiers become more reliable they diverge from common experience (because of controlled conditions, declared assumptions, etc.). So we end up with things like “quarks” and “volts” where before we had “earth”, “wind”, “fire”, and “water”.

I heartily agree that poor measurement has to be a big reason that so much social science is wonky. Still, I’d be very happy if we didn’t have to give up completely. I’d absolutely welcome methods for testing whether I have a quantitative attribute at hand. It would also be great if they had a good chance of breaking down and not generating a numerical solution when the attribute is not quantitative.

Only ratios of dollars make sense. We are used to prices being very stable, so we often ignore this in the short term, but at our peril.

In that sense, money is perhaps more like a score on a well-designed standardized test of academic achievement in some area. The test is not really a measure of anything, but it is a pretty good indicator of examinees’ chances of successfully completing tasks that were not on the actual test (though of course the difference between a summary and a measure can be very important for theory).

Anxiety scores, on the other hand, are at best very weak predictors of life outcomes. Predicting a decrease of 0.35 score points has very few implications. The score is pretty useless for practical purposes, and your options for further testing and developing the theory that predicted the decrease are limited.

I think this is related to Vithor’s point, above:

> This is not a problem that can be solved with measurement theory, psychometrics, or any clever mathematical or statistical tool, but only with experimental design and a lot of collaborative brain power.

And this is really the crux of the issue: we need to be focused on constructing actual theories that link constructs to outcomes; it doesn’t come for free. In relation to zbike’s point below, we don’t need an “anxiety score”, we need a theory of what anxiety *is* such that it would cause someone to produce a particular (range of) outcomes on some instrument. Admittedly, this requires making a lot of hard choices (and potentially being wrong) along the way, but as we see in certain branches of psychophysics and cognitive psychology, it pays off because you end up with a better understanding of the construct and how to learn about it.

A certain number of square feet of climate-controlled living space + a typical number of calories per day from a wide mixture of foods + the cost of transporting oneself for an hour a day to and from a work location + the cost of hiring someone to care for a young child for 8 hours + the cost of providing education at a middle-grade level + the cost of caring for an elderly adult + the cost of medical care for an uncomplicated broken arm or leg + the cost of medical care for a respiratory illness such as a bad cold or mild flu.

If you built such an index and converted it to an annualized cost assuming typical consumption levels – say, 2 or 3 broken bones in a lifetime, three children per two adults, 5 years of pre-school care, 12 years of elementary school care, one respiratory illness per year or every other year, 1500 to 2000 calories of food per day, etc. – you would have an index that transported value across literally hundreds of years with relative ease. It would apply mainly to “typical” people, not Andrew Carnegie or whoever, but it would have a lot of use and, in general, be a better indicator of “poverty” than what we have now, or than inflation via the CPI.
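To make the arithmetic of such an index concrete, here is a minimal sketch; every price, quantity, and lifetime assumption below is hypothetical, chosen only to show how the annualized basket cost would be assembled:

```python
# Hypothetical annual basket for a "typical-person" cost index.
# Each entry: (annual quantity, unit price in local currency).
basket = {
    "housing_sqft_months":  (800 * 12,        1.50),    # rent per sqft-month
    "food_kcal":            (1750 * 365,      0.002),   # price per kcal
    "commute_hours":        (250,             8.00),    # cost per commuting hour
    "childcare_hours":      (5 / 30 * 8 * 250, 6.00),   # 5 yrs spread over a 30-yr window
    "schooling_years":      (12 / 30,         4000.0),  # 12 yrs, likewise annualized
    "broken_bone_episodes": (2.5 / 75,        3000.0),  # ~2.5 per 75-yr lifetime
    "respiratory_illness":  (0.75,            150.0),   # ~3 episodes per 4 years
}

annual_cost = sum(qty * price for qty, price in basket.values())
print(round(annual_cost, 2))  # 21490.0
```

Tracking the local prices of these fixed quantities over time (or across centuries) is then the whole index; no utility-scale assumptions enter beyond the choice of the basket itself.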

The cure for much of this nonsense would be the adoption of more rigorous validity standards for presumed measures of an attribute. In a Psychological Review paper published in 2004, Borsboom and colleagues made the very common-sense statement that a measure is valid if and only if the thing it measures really exists and manipulations of that thing lead to corresponding variations in the measure. Think heat and the thermometer here. Almost 50 years earlier, in 1958, David McClelland similarly made the criterion that a measure be sensitive to variations in the thing it measures the cornerstone of his validation approach for measures of motivation. As you will note, I am not even talking about some of the finer points brought up in this discussion, which are about the intricacies of measurement itself, the mapping of real-world phenomena onto numerical scales and all the problems this may pose. All of these considerations will be in vain if the bare requirement that a measure be linked to a real thing and its variations is not met. But that’s the state of much of motivation science and the broader field of personality psychology, which frequently come up with measures first and then try to figure out what the underlying causes generating variations in these measures might be.

Castles of sand.

I certainly get the problem with concatenation, and this is why things like conjoint measurement and systems factorial technology are built around detecting different types of (what a statistician would call) interactions; but, as you say, these methods also rely on assumptions about how we can set things up (e.g., selective influence in SFT).

But at least as I read it, the core of Trendler’s argument is the objection about “sameness” – that this is said to be a necessary condition for measurement, i.e., the same latent state gives the same answer on your measurement instrument. Putting aside the fact that it is possible to construct measurement theories that are stochastic (i.e., a latent state maps onto a distribution of outcomes), if this really were a requirement of measurement then literally nothing could ever be measured, since, as Heraclitus said, you can’t step in the same river twice.

So, as with the final part, I have to assume Trendler means something other than what I’m reading, but I can’t pull it out of the article.

Don’t all of the measurement issues that plague psychological constructs apply to money as well – at least if we are talking about its “real” value? To adjust for inflation I need an index that assumes one amount of money fetches the same amount of utility as another. But utility isn’t something that can be measured; I can only ever get ordinal data. Person A likes apples more than oranges. That may not matter much when the basket of goods I am using as my index hasn’t changed much, but as time goes by, I have to depend entirely on my untestable assumption that utility has an interval or ratio scale. Statements like “Rockefeller was richer than Bill Gates” or “China’s GDP is as large as the US’s” are as problematic and uninterpretable as the “0.35 anxiety score”.

Both true and on-topic.

When I left academia and went into industry, I found to my great joy that I was spending my time with ACTUAL interval and ratio measurements: dollars, cases, price, advertising rating points, and so on. Yes, these have measurement issues (e.g. related to how sales are estimated from a sample), but nothing like the behavioral sciences.

I can't get by with saying "if you raise the price 10%, sales will go down, and that effect will be statistically significant." That's trivially useless. I'm expected to provide a point and interval estimate for the specific amount sales will go down, in order to help guide decision making. But who knows what 0.35 anxiety points mean, other than "more anxiety"?

Psychological tests yield scores which are numerical. If these numerical scores are useful for summarizing or predicting things when treated as numbers, then of course there are plenty of situations where treating them as interval-scale measures for the purposes of statistical analysis can make sense. For example, most people would probably think it useful to find good predictors of PISA reading scores even if they don’t believe that a unit of reading ability exists.

Things change, however, when you start treating psychological constructs as real things that can cause other things. Using scores on personality inventories, IQ tests or anxiety questionnaires as summaries is one thing. It is a completely different thing, however, to say, for example, that neuroticism mediates the effect of genes on depression, or that differences in average IQ levels explain differences in some life outcome between racial groups, or that state anxiety causes poor performance on some task. If such statements are true, then neuroticism, IQ and state anxiety are real attributes of a person, and either they are quantitative or not.

I think it is quite obvious that if you believe psychological constructs are real attributes of a person that can cause (rather than merely predict) effects, then understanding what kind of attributes they are is necessary for theory development.

This may be a bit (further) off topic, but I think that psychology’s addiction to significance testing has a lot to do with the fact that many of psychology’s measures are more or less uninterpretable. You regress anxiety scores on something and find a slope of -0.35 (SE = 0.1), which you have to interpret, but you have no idea what 0.35 anxiety score points are supposed to mean. Solution: something was significantly negatively correlated with anxiety (p < 0.001).
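The last step is easy to reproduce: a slope of -0.35 with SE = 0.1 is a z of -3.5, which clears the p < 0.001 bar even though the units remain uninterpretable. A minimal sketch of the Wald z-test being invoked:

```python
from math import erf, sqrt

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal (Wald) z-test."""
    z = estimate / se
    # Standard normal CDF evaluated at -|z|, via the error function.
    tail = 0.5 * (1 + erf(-abs(z) / sqrt(2)))
    return 2 * tail

p = two_sided_p(-0.35, 0.1)
print(p < 0.001)  # True: "significant", but -0.35 of *what* is still unclear
```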

But is that right? Surely ordinal statistical analyses are useful.

The basic idea is that quantitative things should present a basic structure in order to be identified as such. This basic structure is that of the real numbers (https://en.wikipedia.org/wiki/Construction_of_the_real_numbers). As it is not possible to perform mathematical operations on things themselves, we need a numerical representation that represents the property we want to assess. “Attributing numerical representations” is what we know as measuring something. Concatenation is the basic procedure used in physics to measure fundamental properties. The whole idea of measurement theory started with Hölder’s 1901 paper, where he proved that concatenation is but one general procedure for showing that some observable qualitative properties have the same basic structure as the real numbers. What Luce and Tukey did in 1964 with conjoint measurement was to prove that there are procedures other than concatenation that also allow us to conclude that an observed qualitative property has the same basic structure as the real numbers.

Some misconceptions about measurement theory, and especially conjoint measurement, are also pretty widespread. The first is Trendler’s assertion that “[…] conjoint measurement, as developed in representational measurement theory, proposes that the operation of ordering is sufficient for establishing fundamental measurement”. This is imprecise because it is true ONLY for the n-component version of the additive conjoint measurement model. For the traditional additive conjoint measurement model, it is necessary to test other conditions (such as double cancellation). Another misconception is that people talk about conjoint measurement but take into account only additive conjoint measurement theory. There is n-component additive measurement, polynomial conjoint measurement, non-additive and subtractive conjoint measurement, and so on. No single measurement theory is to be put on a pedestal, as measurement theories are just abstract descriptions of how to demonstrate that observed qualitative properties are, in fact, quantitative.

Another discussion involves the relation and differences between psychometrics and conjoint measurement theory. Some people argue that psychometric models, such as the Rasch model, are a probabilistic version of additive conjoint measurement, which is simply untrue. Also, some people say that, for instance, only Rasch and one-parameter logistic models allow interval measures to be attained, which is also untrue. In the measurement theory literature, IRT, multidimensional scaling, factor analysis, and so forth are methods of scaling. Scaling methods are used to create appropriate numerical representations GIVEN that a particular measurement theory is assumed to be true. For instance, the Rasch model indeed takes probabilistic error into account to create the best (usually, maximum likelihood) numerical representation according to the additive conjoint measurement model. However, the Rasch model will always give the best numerical representation according to the additive conjoint measurement model, despite the possibility that the measurement model is false or imprecise. The two-parameter logistic model, for instance, will always give the best numerical representation for a distributive rule of the compositional conjoint measurement model, even if the model is incorrect.
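Whatever one makes of the conjoint-measurement claims, the structural contrast between the two models is easy to display on the logit scale: the Rasch model is additive in person ability and item easiness, while the two-parameter logistic model multiplies the difference by a discrimination parameter. A minimal sketch (parameter values hypothetical):

```python
from math import exp, log

def rasch_p(theta, b):
    """Rasch: logit P = theta - b (additive in theta and -b)."""
    return 1 / (1 + exp(-(theta - b)))

def twopl_p(theta, a, b):
    """2PL: logit P = a * (theta - b); discrimination a breaks additivity."""
    return 1 / (1 + exp(-a * (theta - b)))

def logit(p):
    return log(p / (1 - p))

theta, b = 1.2, 0.5
print(round(logit(rasch_p(theta, b)), 10))       # 0.7 = theta - b
print(round(logit(twopl_p(theta, 2.0, b)), 10))  # 1.4 = a * (theta - b)
```

Both models will happily return estimates for any data set; which representational structure (if any) those estimates instantiate is the separate question the comment above is pressing.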

I don’t want to extend myself much more, as I said I was trying to give an intuitive understanding of the problem. The final thing – and the only point on which I actually agree with Trendler – is that latent variables cannot be measured, at least not in the traditional sense that he, or anyone in measurement theory, uses. One can experimentally control the external contingencies of behavior and find very consistent results. However, inferring that this experimental setting affected some unobservable variable, and that this is why we saw differences in behavior, can be considered a long stretch. Behaviorists have been pointing this out for ages now. This is not a problem that can be solved with measurement theory, psychometrics, or any clever mathematical or statistical tool, but only with experimental design and a lot of collaborative brain power.

First of all, I think you are right in thinking that Trendler is talking about traits, attitudes and other psychological constructs that are usually “measured” using the traditional toolbox of psychometric techniques.

Measurement (which is taken to mean quantification) depends on a whole bunch of lawful relationships. First, in psychology you don’t get very far unless you account for error, which means that you need a true score. You can get that either by simply assuming that it exists and working from there, as in classical test theory (which I imagine Trendler does not find acceptable), or by having a causal link between the measurement and that which is measured.

Then you need a relationship between qualitative states that maps to numbers. Stuff that can be ordered cannot necessarily be quantified. You can say that hotel B is better than hotel A because it has a nice restaurant, and that hotel C is even better than hotel B because it also has a nice restaurant and a pool. You can compare and rank the hotels and people will understand what you mean, but you can’t compare the differences between them. You can say which is better than which, but not by how much. Constructs can be meaningful and clearly ordered without being quantitative (as Joel Michell keeps pointing out).
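The point that order does not fix differences can be made concrete: any strictly increasing transform preserves an ordinal ranking, yet it can reverse which *difference* is larger. A minimal sketch with hypothetical hotel scores:

```python
from math import exp

# Hypothetical "quality" scores for hotels A, B, C (an ordinal claim only).
a, b, c = 1.0, 3.0, 4.0

# Any strictly increasing transform preserves the order...
ta, tb, tc = exp(a), exp(b), exp(c)
assert ta < tb < tc  # same ranking: A < B < C

# ...but not the comparison of differences.
print(b - a > c - b)      # True:  on the raw scale, B-A looks bigger
print(tb - ta > tc - tb)  # False: after the transform, C-B looks bigger
```

So if only the ordering is empirically warranted, statements about "how much better" are artifacts of an arbitrary numerical assignment.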

Measurement in psychology (at least in “soft psychology”) can’t be based on concatenation. You can’t add an attitude of one strength to another attitude and compare the outcome with a third attitude of some other strength. So if the relationship between attitudes is quantitative (i.e., something that can be better represented with numbers rather than, say, letters), you have to find some way of comparing them and the differences between them.

In theory, conjoint measurement could be a solution. If you have a pair of variables that relate non-interactively to a third, then you can compare “configurations” of the pair and you can compare the differences between them. So far, so good. If psychological phenomena like attitudes are indeed quantitative, we could in theory measure them like that.

The problem is that all of this relies on relations that are lawful. These need to be established empirically, and we just can’t manipulate the stuff we’re trying to measure to the point of doing this (is what I think Trendler is saying). I think his big point is that causal theory and measurement can’t be separated, and that the necessary causal theory can’t be tested.

If this is a correct reading of Trendler, this is as far as I can follow the argument. The only way I can make sense of the final bit is to imagine that he is saying that at least one of the variables would have to be a quantitative measurement for everything to work, but that can’t be correct.

> The argument is made that for a latent construct to be measurable, a necessary condition is that things that are equal on that construct yield equal values on the measurement scale.

I think once you start talking about measuring something in this way you are already lost. Everything you try to measure has its own unique problems.

1) The argument is made that for a latent construct to be measurable, a necessary condition is that things that are equal on that construct yield equal values on the measurement scale. It is then pointed out that variability means this can never be guaranteed – two individuals with the same ability might produce different scores on a test; conversely, two individuals with the same test score might have different latent ability levels. But then this is used to say that nothing in psychology is measurable? By that logic, nothing would ever be measurable, no? Even Ohm’s needles were never pointing in the “same” direction, nor were they ever really in the “same” electric field. I don’t see why stochastic outcomes mean something is not measurable in a meaningful sense, though the argument seems to rest on exactly that assumption.

2) The kind of variability in #1 that makes “sameness” impossible is random, but of course there is also systematic error, which can be reduced by experimental controls, as Trendler says. But why aren’t the careful experimental controls used by experimental psychologists since the days of Ohm and Helmholtz (who were early experimental psychologists in addition to physicists) sufficient to reduce systematic error to the point of getting good measurements? I agree this is a hard problem, but it is hardly insurmountable and has been dealt with for a long time.

3) Trendler is pessimistic that we can design experiments that selectively influence particular latent psychological constructs (what he calls “Galilean” science), but I don’t see why. People like Townsend and Dzhafarov have been working on this and, thanks to their efforts, we have experimental and statistical techniques that can check the conditions necessary for selective influence. And again, I don’t see why this is specific to psychology; it is as much a problem in physics, chemistry, and biology.

Finally, it seems like Trendler is talking about “psychology” and “psychological phenomena” as if they were only about characterizing stable traits of individuals through observational methods (e.g., personality inventories). This is, I think, why he doesn’t believe good experiments can be done: it might be hard to shift those things around and hard to know if one had done so. But while that is a concern of some psychologists, it is hardly the whole field – psychophysics (the stuff that began with Helmholtz and Ohm) and cognitive psychology have taken problems of measurement quite seriously since their inception, and it is no accident that these fields have shown the greatest theoretical progress and have far fewer reproducibility issues than other branches of psychology (or medicine, for that matter – also a measurement nightmare).

But psychophysics and cognitive psych are not about inferring traits or attitudes from correlational data; they are about building models of how stimulus attributes and context affect observable behavior via internal perceptual/cognitive mechanisms. I grant this is not the “sexy psychology” that often gets promoted in the science tabloids, so I understand how it can get overlooked, but I don’t think it is fair to ignore it either, particularly since it exemplifies how taking measurement problems seriously leads to real progress.
