It should be ok to just publish the data.

Gur Huberman asked for my reaction to a recent manuscript, Are CEOs Different? Characteristics of Top Managers, by Steven Kaplan and Morten Sorensen. The paper begins:

We use a dataset of over 2,600 executive assessments to study thirty individual characteristics of candidates for top executive positions – CEO, CFO, COO and others. We classify the thirty candidate characteristics with four primary factors: general ability, execution vs. interpersonal, charisma vs. analytic, and strategic vs. managerial. CEO candidates tend to score higher on these factors; CFO candidates score lower. Conditional on being a candidate, executives with greater interpersonal skills are more likely to be hired, suggesting that such skills are important in the selection process. Scores on the four factors also predict future career progression. Non-CEO candidates who score higher on the four factors are subsequently more likely to become CEOs. The patterns are qualitatively similar for public, private equity and venture capital owned companies. We do not find economically large differences in the four factors for men and women. Women, however, are subsequently less likely to become CEOs, holding the four factors constant.
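
For readers who want a concrete picture of the kind of dimension reduction the abstract describes, here is a toy sketch: it fits something like an exploratory factor analysis to simulated assessment data. Only the counts (2,600 candidates, thirty items, four factors) come from the abstract; the simulated data and the use of scikit-learn's FactorAnalysis are my own illustration, not necessarily the authors' procedure.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n_candidates, n_items, n_factors = 2600, 30, 4  # counts from the abstract

    # Simulate ratings in which a few latent traits drive many observed items.
    latent = rng.normal(size=(n_candidates, n_factors))
    loadings = rng.normal(size=(n_factors, n_items))
    ratings = latent @ loadings + rng.normal(scale=0.5, size=(n_candidates, n_items))

    fa = FactorAnalysis(n_components=n_factors, rotation="varimax")
    scores = fa.fit_transform(ratings)  # one row of four factor scores per candidate
    print(fa.components_.shape)  # (4, 30): how each item loads on each factor
    print(scores.shape)          # (2600, 4)

The interesting questions (which items load where, and whether the factors would replicate) are exactly the ones that depend on the real data.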

I really don’t know what to do with this sort of thing. On one hand, the selection processes for business managers are worth studying, and these could be valuable data. On the other hand, the whole study looks like a mess: there’s no real sense of seeing all the data at once; rather, it looks like just a bunch of comparisons grabbed from the noise. So I have no real reason to take any of these empirical patterns seriously in the sense of thinking they would generalize beyond this particular dataset. But you have to start somewhere.

It was hard for me to bring myself to read a lot of the article; the whole thing just seemed kinda boring to me. If it were about sports, I’d be interested. But my decision to set the paper aside because it’s boring . . . that brings up a real bias in the dissemination of these sorts of reports. If a paper makes a strong claim, however ridiculous (Managers who wear red are more likely to perform better in years ending in 9!), or some claim with political content (Group X does, or does not, discriminate against group Y), then it’s more likely to get attention and also more likely to attract criticism. It’s just more likely to be talked about.

But, again, my take-home point is I don’t have a good way of thinking about this sort of paper, in which a somewhat interesting dataset is taken and then some regressions and comparisons are made. What I really think, I suppose, is that the academic communication system should be changed so it becomes OK to just publish interesting data, without having to clothe it in regressions and statistical significance. Not that regression is a bad method, it’s just that in this case I suspect the main contribution is putting together the dataset, and there’s no need for these data to be tied to some particular set of analyses.

14 thoughts on “It should be ok to just publish the data.”

  1. Unlike you, I find the topic somewhat interesting – perhaps more than sports data (unless you mean my golf scores, because I like big numbers). What I find boring – and beyond annoying – are two things: the lack of publicly available data and the interminable writing style that puts all the tables and figures in the back of the document. Both make it virtually unreadable. I actually don’t care much about their attempts to build personality models. In fact, I believe the whole field of personality modeling is quite flawed. I am particularly skeptical of dichotomous traits (e.g., charisma vs. analytic). But the data itself I would find of interest. And, if it is interesting enough I might even be interested in what they have done with that data.

    It seems to me that it should work in reverse – the data should be published (with credit) and this might motivate interest in the study. Instead, we have a study that is supposed to motivate interest in itself, and then perhaps we want to see the data – but, of course, that will not be permitted. This is a great example of how screwed up the incentives are in academia. Credit is given for the wrong things and the most valuable things do not receive credit.

    • It would be great if these sorts of datasets were publicly released, but it is impractical for a number of reasons. Firstly, it is probably illegal for the company that collected this data to release it publicly. There are normally specific legal agreements in place between the assessment firm and their clients governing data use. Outside of those agreements, personality data is generally considered highly private, much like medical or genetic data, and most psychologists are going to be reluctant to release this data in its raw form.

      Secondly, this data has a lot of commercial value. Large personality datasets are extremely valuable. Major companies routinely spend 6-7 figures collecting data when developing a new assessment. Real data on managers and executives is particularly valuable because it can be used to compare the quality of a company’s hires against their peers.

      As a tip, if you want access to these sorts of datasets, go to a conference where assessment/consulting firms are presenting posters and discuss collaboration in person. Companies are often willing to share data if you are a well-established researcher.

      • Your points about legality and value are fair – of course, they are based on real concerns and I wouldn’t want to ignore those. But the reality is that both are severely overstated – much like “national security” is/was used to redact many, many documents. Health care is a good example: HIPAA is based on valid concerns about privacy, but it has become a de facto excuse for not releasing any data from clinical trials. It is a convenient excuse for poor academic research incentives to thrive. Similarly for the commercial value in many data sets. Sure, some of it is the livelihood of the firms that collect it. But I’ve seen claims of commercial value for five-year-old cell phone usage records of college students. Surely, the personality data is of most value when tied to the individuals. You could argue that making such anonymized data readily available might increase the commercial value by making more people aware of how such data can be used.

        I can agree with your points as caveats. But I definitely do not like the too-easily-adopted knee-jerk reaction that protects poor research practice and dysfunctional incentives. As with all things, we need a balance – and the current balance needs adjusting (more than incremental adjustment, in my opinion).

        • I agree — but also think it is important to protect individual privacy in cases where data collected on an individual would be sufficient to identify that individual.

    • I’m waiting for the study that declares:

      We use a dataset of over 2,600 executive assessments to study thirty individual characteristics of candidates for top executive positions . . . and unfortunately didn’t find any predictive correlates of success.

      • Ah, but then maybe they didn’t collect the right data — for example, maybe they didn’t include information on whether the applicant is a friend (or even a friend of a friend) of the boss.

  2. This makes sense to me, although from my own dataset assembly experience, a lot of the work isn’t really done until you’ve tried to apply the dataset to questions of interest and found out all the complicating factors that emerge. I don’t think that undermines your core point, but I think it’s worth considering what forms of analysis might serve as an effective test drive of the data and what can be left out. Maybe just descriptive stats but not models?

    • I think description is heavily underrated and a really good and smart set of descriptive statistics for such a dataset should be publishable on its own.
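
      For instance, even a toy summary like the following (simulated data, made-up column names, not the paper's actual variables) tells readers a lot before any model enters the picture:

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(1)
        # Hypothetical assessment scores; the real dataset would go here.
        df = pd.DataFrame({
            "role": rng.choice(["CEO", "CFO", "COO"], size=500),
            "interpersonal": rng.normal(50, 10, size=500),
            "analytic": rng.normal(50, 10, size=500),
        })

        # Per-role counts, means, and spreads: no model, just description.
        print(df.groupby("role")[["interpersonal", "analytic"]]
                .agg(["count", "mean", "std"]).round(1))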

    • Greg:

      > a lot of the work isn’t really done until you’ve tried to apply the dataset to questions of interest
      That does suggest that the data quality management is simply inadequate, at least for a controlled study.

  3. The journal Scientific Data, by Nature, publishes just datasets. From the website:

    “Scientific Data is a peer-reviewed, open-access journal for descriptions of scientifically valuable datasets, and research that advances the sharing and reuse of scientific data. We aim to promote wider data sharing and reuse, and to credit those that share.”

    https://www.nature.com/sdata/publish/for-authors#aims-scope

    Here are the databases available for Social Sciences:

    https://www.nature.com/search?journal=sdata&subject=social-sciences

  4. What Greg Sanders said.
    Or more generally, ‘just the dataset’ doesn’t give you the operationalization, i.e., the description of how the data connects to the constructs/theory/ideas behind the measurement in the first place. Some of that can be provided in an extensive variable manual, but even there one usually does not communicate why a particular definition was chosen (and what that implies), just the definition.
    Further, you also lose all the data treatment strategies developed in a field: like how single-item analyses on x1 are discouraged because of high variation and bias, but if you combine x1, …, x10 the score has favourable properties (see the toy sketch below). Or how you absolutely should not form a ratio of x1 and x2 because 10 years ago several papers showed that it doesn’t work.
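
    A small simulation makes the single-item vs. composite point concrete (the noise level and item count here are made up purely for illustration):

      import numpy as np

      rng = np.random.default_rng(2)
      n = 10_000
      trait = rng.normal(size=n)  # the latent construct
      # Ten noisy items, x1..x10, each measuring the same trait.
      items = trait[:, None] + rng.normal(scale=2.0, size=(n, 10))

      single = items[:, 0]            # analysis on x1 alone
      composite = items.mean(axis=1)  # combine x1, ..., x10

      print(np.corrcoef(trait, single)[0, 1])     # roughly 0.45: one item is noisy
      print(np.corrcoef(trait, composite)[0, 1])  # roughly 0.85: averaging reduces noise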

  5. This general idea has been advocated for some time – but perhaps it has not been fleshed out enough.

    You certainly do not just want a data dump, but rather enough about the background motivations and study conduct, as well as the aspirations (the idealized analyses and what they might show) that the investigators have/had in mind.

    This might be taken as a guiding principle for what is desirable:
    “that each should understand pretty minutely what it is that each one of the other’s work consists in”

    Full quote: “But what I mean by a ‘science’ (…) is the life devoted to the pursuit of truth according to the best known methods on the part of a group of men who understand one another’s ideas and works as no outsider can. It is not what they have already found out which makes their business a science; it is that they are pursuing a branch of truth according, I will not say, to the best methods, but according to the best methods that are known at the time. I do not call the solitary studies of a single man a science. It is only when a group of men, more or less in intercommunication, are aiding and stimulating one another by their understanding of a particular group of studies as outsiders cannot understand them, that I call their life a science. It is not necessary that they should all be at work upon the same problem, or that all should be fully acquainted with all that it is needful for another of them to know; but their studies must be so closely allied that any one of them could take up the problem of any other after some months of special preparation and that each should understand pretty minutely what it is that each one of the other’s work consists in; so that any two of them meeting together shall be thoroughly conversant with each other’s ideas and the language he talks and should feel each other to be brethren” (Peirce: MS 1334, pp. 11-14, 1905).
