Inauthentic leadership? Development and validation of methods-based criticism

Thomas Basbøll writes:

I need some help with a critique of a paper that is part of the apparently growing retraction scandal in leadership studies. Here’s Retraction Watch.

The paper I want to look at is here: “Authentic Leadership: Development and Validation of a Theory-Based Measure” By F. O. Walumbwa, B. J. Avolio, W. L. Gardner, T. S. Wernsing, S. J. Peterson Journal of Management 2007.

I have a lot of issues with this paper that are right on the surface, and the one thing (again on the surface) that seems to justify its existence is the possibility that the quantitative stuff is, well, true. But that’s exactly what’s been questioned. And in pretty harsh terms. The critics are saying its defects are entirely too obvious to anyone who knows anything about structural equation modeling. The implication is that in leadership studies no one (not the authors, not the reviewers, not the editors, not the readers) actually understands SEM. It’s just a way of presenting their ideas about leadership as science: “research”, “evidence-based”, etc.

Hey, I can relate to that: I don’t understand structural equation modeling either! I keep meaning to sit down and figure it all out sometime. Meanwhile, it remains popular in much of social science, and my go-to way of understanding anything framed as a structural equation model is to think about it some other way.

For example, there was the recent discussion of the claim that subliminal smiley-faces have big effects on political attitudes. It turns out there was no strong evidence for such a claim, but there was some indirect argument based on structural equation models.

Anyway, Basbøll continues:

I’m way out of my depth on the technical issues, however. There’s some discussion of the statistical issues with the paper here.

There’s also a question about data fabrication, but I want to leave that on the side. I’m hoping there’s someone among your readers who might have some pretty quick way to see if the critics are right that the structural equation modeling they use is bogus.

The paper has been widely cited, and has won an award—for being the most cited paper in 2013.

The editors are not saying very much about the criticism.

Hey, I published a paper recently in Journal of Management. So maybe I could ask my editors there what they think of all this.

Basbøll continues:

In addition to doing a close reading of the argument (which is weird to me, like I say), I also want to track down all the people that have been citing it, to see whether the statistical analysis actually matters. I suspect it’s just taken as “reviewed therefore true”. If the critics are right, that would make the use of statistics here a great example of cargo-cult science, completely detached from reality.

You’ve talked about measurement recently, so I should say that I don’t think the thing they’re trying to measure can be measured, and is best talked about in other ways, but if their analysis itself is bad, however they got the data (whether by mis-measurement or by outright fabrication), then that point is somewhat moot.

What do all of you think? I’m prepared to agree with Basbøll on this because I agree with him on other things.

This sort of reasoning is sometimes called a “prior,” but I’d prefer to think of it as a model in which the quality of the article is an unknown predictive quantity and “Basbøll doesn’t like it” is the value of a predictor.

In any case, I have neither the energy nor the interest to actually read the damn article. But if any of you have any thoughts on it, feel free to chime in.


  1. Kyle C says:

    Professor Gelman, I hope the post scheduled for Tuesday, “There are many studies showing …,” is still in the queue. I was looking forward to it.

  2. Michael says:

    (Full disclosure: I am on the faculty with one of the co-authors of the paper in question.) One of the reasons that the paper is so highly cited is because it introduces a new measure. Anyone using the measure in future studies will cite it.

    Despite Basbøll’s skepticism that these things can be measured, this is the state of the art in organizational behavior and much of psychology. The issue is that most research in these areas studies what psychologists sometimes refer to as “hypothetical constructs,” meaning we think they exist but cannot measure them directly. Usually this is because they only exist inside people’s heads, so we have to ask them to self-report their perceptions. Not ideal, but until neuroscience develops a way to tap directly into people’s heads, it’s all we have.

    From a measure development perspective, if they did all the things the paper reports, then they did a really good job of it. It does look like there are some problems with the SEM reporting, though. As a reviewer for management journals, I have seen a number of papers where SEM numbers don’t line up. I think some researchers use SEM (and other statistical analyses beyond OLS regression and ANOVA) without really understanding it. The software spits out numbers even when the model hasn’t been set up correctly.

    • Rahul says:

      Are these “hypothetical constructs” falsifiable?

      • Michael says:

        I’m not sure what it means to say a construct is falsifiable. Theories should be falsifiable (Popper), because they make statements. The only statement about a construct, I guess, is whether it exists or not. If that is the case, I would say that authentic leadership is falsifiable: it may not exist, and this measure may be tapping into something else.

        This is where work on construct validity is so important. People need to ensure that they have operationalized the construct in a way that truly reflects the construct. One of the key ways to do this is to show its convergent and discriminant validity in relation to other constructs that most people accept. This places a new construct into a nomological network of related concepts in some theoretical space.

        • Rahul says:

          e.g. Is there any theory based on “authentic leadership” that can be subject to a falsifiability test? Do you have any examples?

        • James says:

          I would like to add to your statement ‘it truly reflects the construct’ something that I am sure you understand and simply assumed: what is of foundational importance is that the construct explicitly represents, and can be directly tied to, an aspect of reality that has causal relationships with other aspects of reality, including the responses to the assessment items.

          • Rahul says:

            How can a construct be operationalized to truly reflect the construct? Isn’t that recursive?

            • James says:

              Yes. I suspect it was a misstatement, which is why I highlighted the importance of theory modeling reality.

            • Erikson Kaszubowski says:

              Since a construct is a theoretically motivated abstract object, I’d say that it can’t be falsified in itself. In classical test theory (e.g.: ), construct verisimilitude is assessed by testing derived hypotheses that: a) operationalize it (construct validity); and b) compare it to some known measure that should be related to the construct (criterion validity).

              (a) involves coming up with a set of items that should ‘truly reflect’ the underlying construct. In practice, this usually means coming up with questions that should be related to the construct. Those items might be ‘falsified’ as a good measure of the underlying latent variable with factor analysis, if they don’t load in each factor as expected. This might lead to a new set of items or a reformulation of the construct (reducing it from two-factor to one factor, etc.). Psychometric studies are usually heavy with degrees of freedom in how to do it (e.g.: if a set of items that should load in two factors has a better solution with three, do we refine the items or redefine the model?)
              In the cited article, fit measures aren’t that great, as pointed out by James, but one solution has ‘better fit’ than others. So it’s OK to use it? Or should we look for a better one? More degrees of freedom.

              (b) is done by computing correlation of a given measure to measures of other phenomena that theory predicts as related to the construct. This might make sense in a stress scale, for example: people with chronic stress usually have higher blood pressure, so we could expect that a stress scale increases monotonically with blood pressure. Choosing a good criterion is tricky and contextual. How should we select criterion variables for ‘authentic leadership’? Should we use measures of other accepted constructs? Testing the construct ends up resting more on auxiliary hypotheses than on the original theory.

              I think this is what Michael meant by ‘operationalizations that truly reflect the construct’. But Rahul’s summary of the process is quite right, at least in the scientific literature I’m used to reading. ‘Ritual of mindless statistics’, as Gigerenzer puts it, in most cases.

              • James says:

                You are highlighting one of the biggest problems with the accepted methods for construct validity in psychology. The disconnect from reality during the development of the construct itself allows for the introduction into the scientific discourse of extremely ‘squishy’ concepts that have predictive validities of actual behavior at a very modest level, often accounting for 15% of the variation of the outcome, which is then argued to be good support for the construct.

              • I think James and Rahul are right about “squishiness” and “fuzziness” at work here. The authors claim to have developed a “reliable and valid instrument for examining the level of authentic leadership exhibited by managers” (p. 121) but they’ve got an 85% fudge factor at best. Actually, I’m not sure the degree of “squishiness” is even measurable here. I suspect that there’s an entirely arbitrary relationship between their instrument and the actual on-the-ground authenticity of any given manager. That is, their “construct” does not “exist” in reality.

              • Erikson Kaszubowski says:

                I took the time to read the paper more carefully. As Michael stated, the authors followed standard practices in the field. In fact, they did a lot more than I’m used to in measure development! But in Psychometrics, ‘reliable and valid’ is usually a shorthand for ‘good factor loadings (under flexible criteria), high Cronbach’s alpha (or other measure of internal consistency) and p < 0.05 correlations between the measure and related variables’.

                There are a couple of oddities in the article, though:
                1) In the first study, the number of DF is off:
                The covariance matrix for a 16-item scale has 136 (16*(16+1)/2) distinct values.
                1.1) For a single factor, independent errors model (Model 1), there are 16 variances plus 16 loadings to be estimated; so (without additional constraints) 136 – 32 = 104 DF. But the reported DF for M1 is 102;
                The same difference of 2 DF is present in the other models. I imagine that the authors might have included correlations between error variances for some arcane reason, but it's not stated anywhere in the paper.

                2) The RMSEA for the proposed model (Model 3) with the US sample is wrong:
                RMSEA = sqrt((X^2 – df)/((N-1)*df))
                sqrt((234.7-98)/((224-1)*98)) =~ 0.08
                which is considerably worse than the stated value (<=0.05 is usually used as a threshold for good fit).
                The same happens with the RMSEA for Models 1 and 2 with the Chinese sample.

                3) What really puzzles me is how they got a nested, more restricted model (Model 3) to have a better fit than a more flexible model (Model 2). As far as I understand it, the more restricted of two nested models in SEM can never have a lower X^2; in the best-case scenario we would expect the difference in X^2 to be non-significant, not significant in favor of the restricted model!
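
The arithmetic in points 1 and 2 above can be rechecked in a few lines. What follows is a minimal Python sketch, not the authors' procedure. The function names are made up for illustration, the inputs (16 items, X^2 = 234.7, df = 98, N = 224) are the figures quoted in this comment, and the RMSEA formula is the conventional one with N - 1 in the denominator (some software uses N instead, which barely changes the result at this sample size).

```python
import math


def distinct_covariance_elements(n_items: int) -> int:
    """Number of distinct variances and covariances among n_items observed variables."""
    return n_items * (n_items + 1) // 2


def one_factor_df(n_items: int) -> int:
    """Degrees of freedom for a single-factor model with independent errors.

    Assumption of this sketch: one loading and one error variance are
    estimated per item, and the factor variance is fixed for identification.
    """
    free_parameters = 2 * n_items
    return distinct_covariance_elements(n_items) - free_parameters


def rmsea(chi_square: float, df: int, n: int) -> float:
    """RMSEA = sqrt(max(chi2 - df, 0) / ((N - 1) * df))."""
    return math.sqrt(max(chi_square - df, 0.0) / ((n - 1) * df))


print(distinct_covariance_elements(16))   # distinct covariance elements for 16 items
print(one_factor_df(16))                  # df for the one-factor model
print(round(rmsea(234.7, 98, 224), 3))    # RMSEA for Model 3, US sample
```

Run as written, this should reproduce the 136 and 104 from point 1 and an RMSEA of roughly 0.08 for point 2.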

                Well, those issues were already stated in PubPeer, so I don't know if I'm adding anything to the discussion.

                Is this a case for retraction? I really don't know; maybe the authors are able to justify those oddities in the paper by stating more clearly what they have done to compute all those pretty numbers? If there's reason for retracting this article beyond its (apparently non-intentional) statistical errors, then most research in measure development should suffer similar fate for supporting awkward constructs with poor foundations in statistical methods.

      • Elin says:

        I think if the factor analysis did not support them or did not support them in repeated testing with different populations etc, then you could say they do not exist.

        • Rahul says:

          Are there constructs in leadership research that were discarded as false because factor analysis did not support them? Do you know any papers of this sort?

          • That is a great question. I’ll look for it in leadership research, but I do know of other management fields where there is a conspicuous absence of discarded ideas. Sure, some fall out of fashion. But they are rarely explicitly discarded because a construct was not supported by the data. It’s a good question to ask of any science, in my opinion: how good are you at discarding bad constructs?

            • Rahul says:

              The more fuzzy an idea, the easier it is to formulate it & the harder it is to falsify it. That paper is full of fuzzy terms & notation.

              • When science is best it uses measurement to combat fuzziness. I.e., it tries to pin down a notion around values that can be unambiguously determined. In this case, the rhetoric of measurement is being used to compound fuzziness.

    • Thanks for this, Michael. That’s more or less the angle I’m working at the moment. The appeal of the paper (which can be seen in its citations) lies in providing a measure (even a measuring instrument, I think) for something that would otherwise not be quantifiable. As with all psychological measurement, I’m willing to stand corrected when something actually does seem to be measured, which is why the modelling has to be on the face of it plausible. Also, I agree with Rahul that it’s worth asking whether SEM yields falsifiable “constructs”. With ordinary scientific measurement, there’s the possibility of testing the new measure or technique of measurement.

      • James says:

        Isn’t this the evidence that the Chi^2 statistic is telling us — that the models do not adequately fit the data?

      • Elin says:

        It’s been a long time since I read any of this, but isn’t the fundamental theorem of factor analysis that there is a set of underlying constructs that are real, that the variables used in the analysis are indicators rather than direct measures, and that through the factor analysis these multiple indicators can be used to approximate the underlying constructs for sample members? When I was in grad school and we learned factor analysis, it was always said to be an art, not a science, and you want to differentiate confirmatory FA from exploratory FA. In this paper they are doing confirmatory. So I believe that if the factor loadings did not reveal any underlying structure, then that would tend not to confirm their existence. Of course you will never get zero correlations between the separate items, since some will arise by chance alone, and there is no equivalent to the arbitrary cutoff of a p value. There are just general guidelines that people use.

    • Rahul says:

      Because these issues persist as “state of the art” in these fields, I wonder whether these papers get read much by people outside of these fields?

      Maybe once you read papers like this one everyday they cease to bother you any more and that becomes the new normal. But to an outsider like me that 40 page paper did feel a lot like The Emperor’s New Clothes.

    • Andrew (not Gelman) says:

      In what world is fabricating almost all of your fit statistics (or at least accidentally getting almost everything wrong) evidence of a “really good job”? Surely you have some ethical responsibility to have a quiet word with your colleague about the need to get this paper retracted.

  3. James says:

    After a quick read of the paper, what pops out immediately is that none of the models fit the data according to the Chi^2 and yet delta_chi^2 is interpreted as if they did.

    While Michael is correct that this is considered ‘state of the art’, it is the sad state of affairs in much of psychology that testing a complex measurement model on a couple hundred respondents, and finding that it does not fit the data, is considered ‘a really good job’.
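
To make the point about exact fit concrete, here is a minimal Python sketch of the chi-square test of model fit. It uses the Wilson-Hilferty normal approximation to the chi-square upper tail, chosen here only so the example needs nothing beyond the standard library; an exact p-value would come from the chi-square distribution function in a statistics package. The inputs are the Model 3, US-sample figures quoted elsewhere in this thread (chi-square 234.7 on 98 df), and the function name is hypothetical.

```python
import math


def chi_square_p_value_approx(chi_square: float, df: int) -> float:
    """Approximate upper-tail p-value via the Wilson-Hilferty cube-root transformation."""
    variance_term = 2.0 / (9.0 * df)
    z = ((chi_square / df) ** (1.0 / 3.0) - (1.0 - variance_term)) / math.sqrt(variance_term)
    # Upper-tail probability of a standard normal evaluated at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p = chi_square_p_value_approx(234.7, 98)
print(f"approximate p-value: {p:.2e}")
print("exact fit rejected at 0.05" if p < 0.05 else "exact fit not rejected at 0.05")
```

A small p-value here means the model-implied covariance matrix differs from the observed one by more than sampling error would explain, which is the sense in which a model "does not fit according to the Chi^2".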

    • Rahul says:

      I read the paper. Their approach is (a) to hype a vague, fuzzy term (“authentic leadership”), then (b) collect some noisy survey data, (c) make it impressive by adding large international datasets, (d) make it sound pseudo-scientific by adding a complex model and wowing people with some math, and (e) finally come up with some complex yet vague conclusions that are not exactly falsifiable and have close to zero predictive value.

      Garbage in, garbage out.

    • Thanks James. How easy do you think it would be to explain why that is a problem to a nonspecialist? Someone like me, for example. To you it “pops out” as an immediate problem, just as the argument itself seems superficially “bad” to me. (I agree with Rahul: it just seems immediately “fuzzy”.) If I could somehow explain both at the same time in a paper, without presuming that “none of the models fit the data according to the Chi^2 and yet delta_chi^2 is interpreted as if they did” is immediately understood as a sign of trouble, then I think I might have something important to say about social science in management/leadership studies.

      • James says:

        Hi Thomas,

        There is an ongoing debate about structural equation modeling and model fit. A 2007 issue of Personality and Individual Differences was devoted to this topic and provides a good overview of the arguments on both sides. At some point I was persuaded by the arguments about the importance of acknowledging that the only actual model fit statistic there is for structural equation modeling often goes unreported, and that even when it is reported it typically tells us the model does not fit the data, whereupon it is summarily dismissed and other metrics are used to argue for the appropriateness of the model, precisely as was done in the Authentic Leadership validation study paper.

        Also see here for a very strong argument for attention to chi_square: — and also a warning that a non-significant chi_square is not a guarantee of a proper model either and diagnostics are suggested as a way to deal with this.

        A brief summary of the two sides would be:

        1.) AGAINST strict adherence to Chi_Square: When comparing the saturated model to the theoretical model, sample sizes above 400 result in mostly significant chi_square, therefore, we should use something that allows more of our models to pass the fit test. These rejections are assumed to be due to a problem with the chi_square test and not with the theory or measurement. It is further assumed that the differences causing the chi_square to be significant are trivial.

        2.) FOR proper attention to Chi_Square: Increasing sample size does not result in a significant chi_square for a properly specified model. Things such as sampling error and reasonable amounts of measurement error, etc… will be accounted for in the error terms. There is nothing about the chi_square test that tells you whether the differences are trivial and thus it is important to determine what aspect of the model-data interface is in fact causing the significant difference.

        Three points in support of attention to chi_square excerpted at the SEMNET listserve by John Antonakis:

        a) It is simply untrue the chi-square will reject 99.9% of models.
        b) It is simply untrue that chi-square cannot distinguish good from bad models.
        c) It is not a problem to have large samples.

        A related article:

        Antonakis, J., & House, R. J. (2014). Instrumental leadership: Measurement and extension of transformational–transactional leadership theory. The Leadership Quarterly, 25(4), 746-771.

        And the alternative with which more people here are likely familiar is Directed Acyclic Graphs. Judea Pearl is quite a proponent of this method.

        • Thanks for this, James. I will definitely look into it. Do you have a position on whether (a) this is a misuse of SEM to support an unsupportable construct and/or (b) SEM can actually model anything at all? There seems to be an undercurrent of critique that goes after “the very idea” of structural equation modelling. So I’m trying to decide whether this paper uses a method badly or uses a bad method.

          • Elin says:

            Like anything, there are people who don’t agree with the assumptions and approaches of SEM and those that do. Just like there are Bayesians and Frequentists, or people who think that in a given situation you should use a Poisson and others who say to use a negative binomial. I disagree with making the assumptions (statistical and theoretical) required for SEM in most situations so you could say that I am critical of the “idea” of SEM, but that is just normal scholarly debate in the social sciences. Methods reflect and embody theory. That said, are there situations where it is wrong to use SEM even on its own terms, and where even people who like SEM would say it doesn’t make sense? Yes. Even as a non-SEM person I know that much from reading and hearing talks. Is this one of them? Not sure, it would really take a SEM person to delve into that.

            I don’t think SEM has moral standing as “good” or “bad”; it’s a reflection of one social science approach to the idea of causation and the challenge of measuring things that can’t be measured directly. Its advocates have their internal debates, but it also is well developed. The same goes for a lot of theories and methodological approaches.

            In the end, though, one has to return to the fundamental issue of whether this concept makes theoretical sense and is well specified and, if so, if it makes sense on its face to measure it with these items. I haven’t seen a great argument for that, but, then again, I haven’t looked.

            James, thanks for those citations, very interesting.

            • Thanks, Elin. Yes, I suppose it’s an ill method that doesn’t blow anyone any good. For the purpose of this critique, I’m going to count among the “methodological” issues (i.e., within the question of method) the question of whether SEM should have been used here. Sort of like heart surgery isn’t a bad method per se, but it’s a bad way to cure a common cold.

          • James says:

            I think it is the right method, but information about fit is being ignored, and this is a widespread practice. If SEM is in fact capable of helping us sift good from bad models when the fit information is attended to, then why should we not acknowledge what it is telling us? I’m not sure I understand the argument that SEM is not a good method.

            • James says:

              Allow me to rephrase. It is an appropriate method that is far better than the traditional methods where exploratory factor analysis was used to divine meaning.

              However, it may not be the ultimate BEST method as alternatives that allow us to better assess the causal relationship among constructs as well as the structural relationship among predictor and outcome variables may already be out there or in the process of development.

  4. James says:

    Consider this possibility as well. Let’s say someone believes that the metaphor ‘spraying weed killer’ is a good hypothetical construct for engaging in the ‘soft’ management skills, or rather for avoiding dealing with them. They can create a construct called ‘weed killing’, come up with a handful of items about how to undermine your direct reports that people respond to similarly, test them using SEM, find the model does not fit according to the chi^2, and then claim they have developed a new construct.

    It seems to me there has to be a better way to advance the field than this approach allows.

  5. Daniel says:

    “Hey, I can relate to that: I don’t understand structural equation modeling either! I keep meaning to sit down and figure it all out sometime. Meanwhile, it remains popular in much of social science, and my go-to way of understanding anything framed as a structural equation model is to think about it some other way.”

    I always thought SEMs were anything that couldn’t be represented by a traditional regression. In my experience (and maybe this is a function of the field I work in), most SEMs are just glorified dynamic discrete choice models. Perhaps that’s what you mean by “think about it in another way” (i.e., in that example you would think in terms of discrete choice models).

  6. Billy says:

    In the second study it seems a bit odd to model two distinct continuums for leadership styles rather than using some type of mixture model approach where the authors could test the number of distinct classes and model the continuum within each class (since there could be more than two classes although their lit review only identified two). The authors probably should have used a different estimator than ML since their response types were likely ordinal in nature and alternate estimators (e.g., Diagonally Weighted Least Squares/Asymptotic Distribution Free) provide more consistent estimates when the data are polytomous. I would think they would have tested for measurement invariance in the study with respondents from the US and China as well, but it didn’t seem to be the case given what was reported in the tables.

    However, the biggest difficulty researchers using SEM typically face is how they reconcile the space of other alternative models that fit their data as well as the model they propose. George Marcoulides gave a talk at the SEM SIG meeting at 2013’s AERA conference where he really tried to emphasize how much of a problem it is and how the number of alternate models can grow exponentially.

    • James says:

      I don’t disagree with your last paragraph, but it is important to keep in mind the methods SEM is attempting to improve upon, which suffer from the same and in most instances worse issues: Cronbach’s alpha, the summation of item values, and exploratory factor analysis.

      While theoretically there may be an infinite number of alternative models, practically I do not believe that is the case. And if it is, then it is likely that the construct is not unidimensional, and if it is not unidimensional, then is it a construct amenable to testing via the scientific method? Multidimensional constructs are certainly reasonable, but each must be made up of multiple unidimensional constructs, as was done in the Authentic Leadership paper. I do not challenge whether the individual unidimensional constructs are reasonable, because they seem to be, but I do question whether the measurement model as currently structured makes sense given what the chi-square test is telling us.

      Psychology is an area in which ‘measurement’ of constructs is conducted by combining in some fashion similar and disparate features of a theoretical concept representing some aspect of the real world. To do so, there must be enough distinctness, stability, and variation in the phenomenon to be represented numerically. Does the shared variance effectively measure this construct? I do not think using methods that allow an excessive amount of vagueness has been all that useful at anything other than providing very small amounts of evidence for very strong claims about extremely uncertain phenomena.

      • Are there any widely accepted, paradigmatic cases where SEM has been applied and a valid and reliable instrument has been developed? What construct in psychology could we point to as one that “authentic leadership” aspires to become as good as?

  7. janice says:

    The pubpeer discussion about this paper has been circulating in my own department in recent weeks. The full discussion is quite something to read because more and more errors in the paper appear to have been detected (beyond the initial problem with the degrees of freedom, nested models, and RMSEA statistics). I teach and am pretty familiar with SEM myself and am stunned that the blatant errors in the authentic leadership papers were not caught by reviewers or the action editor of what is typically thought to be a high-quality journal. Far more disturbing though is that these major errors remain uncorrected. If the pubpeer thread is to be believed then all of the authors of the article and the editors of the journal know about these problems and yet the paper continues to go uncorrected.
    Michael (a commentator in this thread) claims to be in the same department as one of the authors. I would urge him to have a word with his colleague and suggest that the paper be retracted or corrected.

  8. janice says:

    Just another quick note of clarification – some of the comments in this thread ask (reasonably) whether these errors are reason for retraction. Unfortunately the first author of this paper has around 30 other papers of his flagged on pubpeer and has already had at least 7 papers retracted and one expression of concern (+ numerous corrigendums). Although the reasons for retraction are not fully given for the retracted papers none appear to have errors as severe as these. Perhaps most importantly, this paper claims that the higher-order model has better fit than the most plausible alternative model which turns out to be an obvious mathematical impossibility given the nesting issue. The whole argument for authentic leadership as a construct (and certainly the central claimed contribution of this paper) pretty much rests on this false claim. A retraction seems appropriate.
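
The nesting argument can be written as a simple consistency check. This is a hedged Python sketch resting on one standard fact: when a restricted model is nested in a more general one and both are fit to the same data with the same estimator, the restricted model has more degrees of freedom and its minimized chi-square cannot be smaller. The function name and the numbers are hypothetical and are not taken from the paper. The check concerns the raw chi-square only; parsimony-adjusted indices are a separate matter.

```python
def nested_fit_is_consistent(chi2_restricted: float, df_restricted: int,
                             chi2_general: float, df_general: int) -> bool:
    """Return True if the reported statistics are possible for nested models.

    The restricted model is the general model plus extra constraints, so it
    must have more degrees of freedom and a chi-square at least as large.
    """
    return df_restricted > df_general and chi2_restricted >= chi2_general


# Hypothetical numbers for illustration only; they are not the paper's values.
print(nested_fit_is_consistent(250.0, 100, 240.0, 98))  # restricted model fits worse
print(nested_fit_is_consistent(230.0, 100, 240.0, 98))  # restricted model "fits better"
```

The first call describes a pattern that can occur; the second describes the kind of pattern the thread calls a mathematical impossibility, so the function returns False for it.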

  9. Andrew (not Gelman) says:

    I think it would be great to hear the thoughts of the action editor of your JOM paper on this issue, as you suggest in your original post. From the acknowledgments section of your paper it would appear that Fred Oswald may have been the action editor. Fred is great and I cannot imagine that he would not act on this stuff if he was made aware of it. Of course, the pubpeer comments suggest that the JOM editor has been aware of the problems with the paper for more than 2 years. Fred is only an associate editor and may not be aware of this discussion. Of course, it looks like Russell Cropanzano was the action editor of the paper in question and I cannot fathom how some of the more obvious errors slipped by him either.

  10. Paul says:

    They intend to do nothing. The authors have been invited to correct the papers if they want to. That offer has been standing for years.

    To give context: JOM has been making a huge deal of their recent rise in the impact ratings. They are currently ranked #1 or #2 depending on what category you put them in. At nearly 900 citations, this paper contributed more than a little to that. As noted in the pubpeer comments, they gave this paper an award for impact even after being alerted to the problems with it.

  11. mark says:

    Say what? They have been “invited” to correct their apparently very serious errors? Please tell me that you are joking or at least speculating wildly.
