Reanalysis of that Nobel prizewinning study of patents and innovation (with R and Stan code)

A few days ago we discussed a paper from 2005, Competition and Innovation: An Inverted-U Relationship, two of whose authors recently won the Nobel prize in economics.

I had some concerns about the analysis, which I can express with reference to the above figure from that paper:

1. The paper’s all about an inverted U relationship, but this is driven by fitting a quadratic curve rather than, say, a curve with diminishing returns.

2. The line does not seem to go through the data points. In particular, the curve seems to be too low at the rightmost part of the graph–an artifact of fitting a quadratic, perhaps?

3. The y-axis is some weighted count of patents but it’s being used as a measure of the more abstract concept of “innovation.”

4. The x-axis is an average of profit margins but it’s being used as labeled a measure of the more abstract concept of “competition.”

5. The model predicts patents from profit margin in the same year, but to the extent the model is appropriate I think you’d expect a lag.

6. They use Poisson regression, but the data are not counts, also if you don’t somehow correct for overdispersion your standard errors will be too low.

The plan

Here are the ways I’m gonna adjust for the above issues in my reanalysis:

1. I’ll fit a quadratic curve to replicate what they did in the paper, and I’ll also fit another family of curves (the “hinge”) that allows for nonlinearity but without enforcing non-monotonicity.

2. One problem with the above graph is that it excludes 20% of the data (see the figure caption). I’ll plot the fitted curves showing all the data. Another possible reason for the problematic fit is that in the paper they say they adjust for industry and year effects, and so I’ll make a plot showing the fitted curve and the data broken down by industry. I’ll do the adjustments in two ways: “fixed effects” as in the published paper (using various R packages), and multilevel modeling (using Stan).

3, 4. I’ll relabel the axes of the graph to more accurately capture the authors’ measures of competitiveness and innovation, but I’ll swallow the concern that the analysis only uses a subset of the data that were available at the time (from the paper, “Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

5. This lag thing is an issue, but I have enough concerns with items 3 and 4 that it’s hard for me to take the paper’s theoretical and causal arguments seriously. But, sure, you could replicate my analyses at different lags.

6. I’ll compare four sorts of models: Poisson, quasipoisson, negative binomial, and normal regression on log(y+1). Quasipoisson and negative binomial are two different ways to correct Poisson regression for overdispersion. Modeling on log(y+1) is a completely different way to model data that are mostly positive but have some zeroes, and I think it actually makes the most sense for this example, as the data are not actually counts. When fitting Poisson and negative binomial regressions, I’ll round the data to the nearest integer, which turns out to essentially not affect the results.

The above is not a preregistration plan; I wrote it in the middle of the project after doing some of the analyses already.

The data

Bradford in comments links to a post from 2014 by Leif Nelson and Uri Simonsohn that includes a file linking to the website of Nicholas Bloom, one of the authors of the paper. Navigating the site takes me to this page with a link to a zip file with the data, which I then downloaded.

Good job by Bloom to post the data from a paper published in 2005. I’m not so good with data availability myself.

The data file has 354 records, including data from 1973-1994 and from 17 industries, and for each of these records it has the weighted patent count (“patcw”), the measure of profit margin (“Lc”), and a bunch of variables that I didn’t try to figure out, because these are all I need to replicate the main analysis. (The file does not seem to include a code book, but I’m not complaining, as they’re already way ahead of me by making the data available at all.)

Initial data analysis: quadratic regression using Poisson, quasipoisson, and negative binomial models

I get going by plotting the data and fitting some quadratic models:

Compared to the graph at the top of this post, the above plot shows more data (I’m not trimming the upper and lower 10% of observations), and the quadratic curves are much flatter than were shown in that paper:

(1) green curve: poisson fit to rounded data (using glm() in R)
(2) red curve: quasipoisson fit to rounded data (using glm())
(3) blue curve: quasipoisson fit to raw data (using glm())
(4) pink curve: negative binomial fit to rounded data (using glm.nb())
(5) orange curve: negative binomial fit to rounded data (using stan_glm())

The first four curves are so similar that the lines overlap and you can’t see the green and red curves: They’re there, just overwritten. For each model I display the point estimate of the curve (best fit for models 1-4, posterior median for model 5). In the models above, quasipoisson uses the Poisson fit and then adjusts standard errors to account for overdispersion; negative binomial is a probability distribution that accounts for overdispersion in a different way.

We went from 1 to 2 to see if switching to quasipoisson made a difference (it did; the standard error was much bigger after we allowed for overdispersion); we went from 2 to 3 to see if rounding made a difference (it did’t for these data); we went from 3 to 4 to see if switching to negative binomial made a difference (it didn’t do much, but the standard error increased slightly); and we went from 4 to 5 to see if switching to full Bayes made a difference (it looked a lot different, actually).

I’ll put the code at the end of this post so as not to distract from the story. Here are the relevant pieces of console output:

(1) poisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    25.85 
Lc          164.76    54.90 
I(Lc^2)     -88.39    29.14

(2) quasipoisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    84.64 
Lc          164.76   179.78 
I(Lc^2)     -88.39    95.44

(3) quasipoisson fit to raw data
            coef.est coef.se
(Intercept) -75.01    84.01 
Lc          165.13   178.43 
I(Lc^2)     -88.55    94.72 

(4) negative binomial fit to rounded data
            Estimate Std. Error
(Intercept)   -80.80      89.49
Lc            177.20     190.10
I(Lc^2)       -94.85     100.92

(5) negative binomial fit to rounded data (Bayesian)
            Median MAD_SD
(Intercept)  -7.1   33.2 
Lc           21.1   69.5 
I(Lc^2)     -12.2   37.3

In a paper I’d prefer to display these uncertainties graphically; here I’m giving the console output to give a sense of how things might go in our usual workflow of fitting models and looking at them. Here you can see that the quadratic terms all have large standard errors–except for the very first model, the Poisson, but its standard error is too low as it does not account for overdispersion.

The only thing that really puzzles me here is what’s going on with model 5. Could it be that the Bayesian model uses the posterior median rather than the optimum? I’ll check by re-fitting it, running stan_glm on “optimizing” setting:

(5_opt) negative binomial fit to rounded data (Bayesian posterior mode)
            Median MAD_SD
(Intercept)  -5.3   36.8 
Lc           17.1   76.7 
I(Lc^2)     -10.9   39.4 

That’s not much different from the posterior median. So I guess the difference between the negative binomial regressions fit by glm.nb() and stan_glm() arise from differences in the fitting algorithms, or maybe something I missed in the coding. If I wanted to proceed further down this track I would have to investigate this a bit further, reading up on what glm.nb is actually doing in R, and in Stan programming the negative binomial model myself from scratch and also comparing to brms.

For now I’ll just move on. My guess is that the difference in fits has something to do with how seriously the model takes the small number of influential data points near the top of the graph, but I’ll set that aside for now because we’ll be fitting some more models.

Quadratic regression adjusting for industry and year effects

The next step is to adjust for industry and year effects, which I’ll do in a few ways. I’ll aid in the interpretation of these by plotting the data and fitted curve separately for each industry. In the data the industry codes took on 17 different values ranging from 22 to 49, which according to the paper correspond to “two-digit SIC codes.” So I googled “two-digit SIC codes,” which are listed in various places online. I could not find an official link, but various unofficial sources seemed to agree; here’s one such list.

For each plot, black dots show the data for that industry, with a large dot showing the data from the first year of the data and thin black lines showing the time sequence. The colored curves are the fitted quadratics for each industry in an average year, as explained below.

It’s kind of weird that Furniture has so many patents, and that the number of patents for machinery and computers started out high and then dropped, and that electric and gas services had no patents for the first few years and then suddenly had a lot . . . this is just what’s in the data. I checked a few things to make sure I didn’t garble the categories but it’s possible that I’m missing something.

Here are the models I fit:

(3a) blue curve: quasipoisson fit to rounded data, including factors for industry and year (using glm())
(4a) pink curve: negative binomial fit to rounded data, including factors for industry and year (using glm.nb())
(5a) orange curve: negative binomial fit to rounded data, including factors for industry and year (using stan_glm())
(6a) purple curve: multilevel negative binomial fit to rounded data, with varying intercepts for industry and year (using stan_glmer())

Models (3a), (4a), and (5a) include unmodeled coefficients for industry and year (“fixed effects,” in economics jargon); model (6a) considers these coefficients as latent variables and estimates their distributions (this is what economists call “random effects”).

From the above graph you can see that, after adjusting for industry and year effects, the quadratic curves are much stronger. Most of this comes from the industry effects; there’s not much evidence for unexplained variation at the year level. Here’s the relevant console output from model (6a), which conveniently estimates the scale of each batch of varying intercepts:

stan_glmer
 family:       neg_binomial_2 [log]
 formula:      patcw_rounded ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -68.3   27.1 
Lc          146.6   57.5 
I(Lc^2)     -77.9   30.9 

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 5.6    1.0   

Error terms:
 Groups Name        Std.Dev.
 year   (Intercept) 0.11    
 sic2   (Intercept) 2.11    
Num. levels: year 22, sic2 17

In this fitted model (6a), and also in (4a), and (5a), not shown here, the estimated coefficient for the quadratic term is a bit more than 2 standard errors away from zero, that is, statistically significant at the conventional level. The estimated quadratic term in (3a), the quasipoisson regression, is a bit more than 4 standard errors from zero; I guess this difference is attributable to the different ways that the quasipoisson and negative binomial effectively weight the extreme values in the data.

Quadratic regressions on log(y+1)

When modeling count data we usually start with the negative binomial model with log link. And here I wanted to connect to whatever version of Poisson regression was used in that published paper.

But, as noted above, these data aren’t counts–they’re not even integers, and I think it makes sense to just directly model them on the log scale. Some of the observations have zero values, though, and so I’ll follow the standard practice of modeling nonnegative data y by fitting regressions to log(y+1). My general recommendation along these lines is to model log(y+A), where A is some constant that corresponds to a baseline level of the data. In this case, the 354 data points include 46 zeros, then another 60 values between 0 and 1, then various values (none of them are integers!) ranging as high as 44.7. For the purpose of studying patent counts as measures of innovation, it seems reasonable to add 1 to these data, which blurs the lowest values (there is no big distinction between 0 and the lowest nonzero observation, 0.037) while preserving the distinction between the higher levels. This seems about right: to the extent these data will be supplying a signal on innovation, we’ll want to mostly be learning from data on the high end.

Here’s what happens when we fit the quadratic regression, not adjusting for industry and year effects, to the log(y+1) transformed data. Now we can just use normal errors and we don’t have to worry about Poisson or negative binomials, so there’s just one curve:

You can see the zeros at the very bottom of the graph. The model of normally-distributed errors:

(7) green curve: normal regression fit to log(y+1) (using lm())

is not perfect, but I think it’s close enough, and we no longer have to worry about exactly how we’re modeling extremely high values. On the log scale we still see a fitted quadratic curve. The fit is noisy–the coefficient for the quadratic term has a standard error that is much higher than the estimate itself–but, again, let’s move on to the regressions that adjusts for industry and year effects.

As before, we show the data and fit (this time, with both on the log(y+1) scale) broken out by industry:

The curves come from two fitted models:

(8) orange curve: normal regression fit to log(y+1), including factors for industry and year (using stan_glm())
(9) purple curve: multilevel regression fit to log(y+1), with varying intercepts for industry and year (using stan_glmer())

They’re pretty similar; I assume that the small differences between the two fits arise from the fact that the least-squares model (8) does more adjustment for industries. The estimated coefficients for the quadratic terms are 3 or 4 standard errors from zero. Here’s the output from model (9):

 family:       gaussian [identity]
 formula:      log_patcw ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -86.8   24.6 
Lc          186.6   52.5 
I(Lc^2)     -98.6   27.9 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

Error terms:
 Groups   Name        Std.Dev.
 year     (Intercept) 0.09    
 sic2     (Intercept) 1.04    
 Residual             0.47    
Num. levels: year 22, sic2 17

Hinge regression on log(y+1)

As we have discussed already, the big problem with quadratic regression is that it enforces non-monotonicity. I’d like to fit a model that allows nonlinearity but without a declining slope automatically turning into a negative slope.

My first thought was to fit a model with an asymptote, something like this: b_0 + b_1*(1 – exp(x/b_3)), as an alternative. But this curve has the opposite problem: it is restricted to be monotonic. What I really want is a model that can have diminishing returns or can asymptote or can have that inverted U shape, with the question answered by the data.

We have such a model–the hinge function:


This is a curve that smoothly connects two straight lines which can have arbitrary slopes. The parameters of the curve are:
x0: the value of x where the two lines would intersect
a: the value of y where the two lines would intersect
b0: the slope of the line on the left side of the hinge
b1: the slope of the line on the right side of the hinge
delta: the scale of the continuous curve connecting the two lines

For our purposes, we are most interested in b1: is there evidence that this slope is negative?

We program the hinge model in Stan–the code’s right there at the linked post–also including a multilevel model with varying intercepts for industry and year, as above. Here’s the whole program:

functions {
  vector logistic_hinge(vector x, real x0, real a, real b0, real b1, real delta) { 
    vector[size(x)] xdiff = x - x0;
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
  }
}
data {
  real<lower=0> delta;
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real x0;
  real a, b0, b1;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  x0 ~ normal(1, 1);
  a ~ normal(0, 100);
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(logistic_hinge(x, x0, a, b0, b1, delta) + a1[group1] + a2[group2], sigma);
} 

When fitting the model, we specify the scale parameter delta, setting it to 0.05, which allows for a gentle curve within the range of the data (as you can see from the plots above, x ranges from about 0.85 to 1.0). There’s some arbitrariness here, but it’s just too hard to fit this parameter directly from the data. In effect this curvature is hard-coded into the quadratic regressions fit earlier.

We assign weak priors to the other parameters in the data, effectively excluding extreme values for the slopes of the curves. The only one of these priors that might be confusing is the prior for x0, the x-position of the hinge. We’re soft-bounding it at the high and low ends just so that the fit won’t get lost in extreme values: once x0 is far outside the range of the data, the curve effectively becomes a straight line and the location of the hinge becomes non-identified.

So, yeah, the hinge is a bit more work to fit compared to the quadratic, but that’s the price we have to pay to fit a more flexible model to this small dataset. And in this case I think the flexible model is absolutely necessary given the goal of seeing whether the data indicate that inverted U shape.

Here’s the result:

In each plot, the blue curve represents the posterior median of the parameters and the red curves correspond to 20 draws from the posterior distribution.

Here’s the summary of inferences:

 variable   mean median    sd   mad      q5    q95 rhat ess_bulk ess_tail
   lp__    71.69  71.85  6.65  6.87   60.81  82.19 1.00      838     1428
   x0       0.94   0.93  0.07  0.09    0.82   1.05 1.04      132     1204
   a        4.03   4.03  0.72  0.69    2.84   5.23 1.01     1178     1549
   b0      52.71  36.93 42.67 24.45   15.29 146.35 1.03      251     1288
   b1     -50.66 -31.22 46.50 22.85 -149.89 -11.44 1.03      146      883
   sigma    0.47   0.47  0.02  0.02    0.44   0.50 1.00     4279     2726
   sigma1   1.11   1.08  0.21  0.20    0.81   1.50 1.01      880     1217
   sigma2   0.08   0.08  0.04  0.04    0.02   0.16 1.00      999     1301
   a1[1]   -0.25  -0.25  0.30  0.30   -0.74   0.26 1.00      486      736
   a1[2]    0.23   0.22  0.30  0.30   -0.26   0.75 1.00      491     1036
...

The key parameter is b1, the slope on the right side of the curve. The posterior mean is -50.66 with standard error 46.50–so, apparently, not statistically significant–but if you look at the posterior quantiles, you’ll see that the distribution is very skewed. Indeed, it turns out that all 4000 of the posterior simulation draws of b1 are negative here.

So, after all this modeling, it seems that the data do clearly indicate an inverted U pattern!

That said, I haven’t explored the hinge model too carefully. For example, when I re-fit it with some other choices of the hinge scale parameter delta, it sometimes doesn’t mix well. I think the above graph with delta=0.5 fits these points about as well as is possible, so I’m ok with it as a data summary, but I’m not saying that my fitting procedure is fully computationally robust. If I were to be working more on the computation for this particular problem, I’d start by removing the year effects, as they can be adding stress to the computation without affecting the fit in any meaningful way.

Before getting to the interpretation of these results when it comes to competition and innovation, I want to tie up a couple of loose ends.

Quadratic regression on log(y+1), programmed in Stan

In fitting the above multilevel hinge regression, I moved from the pre-programmed Stan code of stan_glmer() to a custom Stan program. Just to check that nothing funny is going on here, I’ll go back and program the multilevel quadratic regression directly in Stan. It’s easy enough to write the program; I just replace the hinge by a quadratic function and get rid of the priors, and, indeed, we get essentially the same result as when fitting using stan_glmer(). I won’t bother showing the new graph here. No surprise, it’s just good to check.

The other thing we can do with the Bayesian fit is see whether the peak of the curve falls within the range of the data. The curve b0 + b1*x + b2*x^2 has its peak at x = -b1/(2*b2) (just take the derivative, set to 0, and solve for x), so we can compute the posterior distribution of this value from our simulations. It turns out that only 3 of the 4000 simulated curves has a peak outside the range of the data, so fair enough that, according to the fitted quadratic model (which I don’t like), the inverse U pattern is occurring within those bounds.

Adjusting for the group-level mean of the predictor

In general when fitting a multilevel model, it’s a good idea to adjust for the group-level mean of the predictor–in this case, the average value of x within each industry. Otherwise you have to worry about correlations between x and the varying intercepts. In this case, adding this group-level predictor doesn’t change much. I won’t share the result here just because I’ve already done a lot of work to write this up and there are enough other concerns with the analysis, but, yeah, it’s a good idea to include this predictor too.

Summary

OK, so what have we learned?

• The graph from the original paper (reproduced at the top of the above post) did not show a good fit because it didn’t display the adjustments for industry. We can see the pattern a lot clearer using separate plots for each industry.

• I think it makes more sense all around to model these data on the log(y+1) scale rather than using Poisson regression or any of its variants.

• The patterns within and between industries aren’t so clear to me. I don’t get why furniture has so many patents, why the number of patents for machinery and computers dropped, why the number of patents for electric and gas services shot up, and so forth. I’m not saying these numbers are wrong, and I might have made some mistake in coding; I’m just saying I’m baffled.

• I think the hinge model is much more appropriate than the quadratic for the propose of seeing the extent to which the data support an inverted U pattern. It wasn’t hard at all to program, debug, and fit the hinge model, and in this case it does support the inverted U. So in that sense the published paper was correct in its statistical conclusion, even if I think they got kinda lucky in finding the pattern with their quadratic curves.

• I still don’t buy the claim in the paper that they “find strong evidence of an inverted U relationship” between “product market competition and innovation.” Again, my main problem here is their measure of innovation using average weighted patent counts, an issue that further bothers me given the odd data patterns seen in the graphs for some of the industries, also with the selection in the data (“Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

• Also the model predicts y from x in the same year, and we’d expect a lag. I didn’t bother addressing this because of my other concerns about the data.

I guess that later work followed up with more comprehensive data sources, but I remain concern that any inverse U pattern, even if supported by this particular dataset, could depend strongly on various artifacts of the measures they are using. There aren’t really a lot of data points at high values of x here, so this downward slope is defined by how these patents are being counted in just a couple of industries.

All this might sound like picky criticism, but it’s the authors of the paper (and the Nobel prize committee), not me, who are making these broad claims about competition and innovation, and concerns about measurement seem really important here.

That said, it’s cool to see that a careful statistical analysis does find the inverted U. It does seem like a real pattern in the data, so the only question is the relevance of these data to the larger economics question.

tl;dr

In response to one of the comments below, I wrote that I don’t at all trust the idea of using time variation of patents within an industry to measure time variation of innovation. I get that the authors were doing their best using available data; still, I don’t think their data and analysis provide evidence for their substantive conclusions.

In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Looking at it that way, all my statistical analysis with the transformations and the quadratic curves and the hinge function was all a waste of time, as I’m just doing a fancy analysis of bad (or, at least, irrelevant) data. But . . . summarizing the statistical information still has value in itself. My analysis just took a few hours, a small fraction of the time the authors must have spent on preparing, analyzing, and interpreting their data. I think it’s important, when doing statistical analysis, to do the best we can do, in this case accounting for all these sources of variation and uncertainty.

Or, to put it another way, yes, it would have been fine to dismiss the published results entirely given the problems with the data. But some of the problems with the data became apparent only after I made those plots showing the time series for each industry. At that point, all the fitting may well have been a waste of time–but I only thought of making the plots because I was trying to understand the puzzling fit.

All the links connecting theory, measurement, data, analysis, and conclusions are important.

P.S. All these data and measurement problems just leap out at us, but there’s a whole world of people out there who just accept these conclusions–even some of the graphs from that 2005 paper–as truth. For example I came across this post by political journalist Matthew Yglesias that just straight up accepts the iffy empirical claims from that paper. It’s tricky–journalists are busy, and it’s natural to think that a much-cited paper written by two Nobel prize winners and published in a top journal has to be correct. I don’t really know what to say, except that this is one of the useful functions of social media, to allow us to push back against default narratives. Not that the default is necessarily wrong, just that it can be wrong, and it’s hard for journalists to escape the bubble and recognize this.

P.P.S. There was some concern about the arbitrariness of the log(y+1) transformation so I also fit the model to sqrt(y), which is a standard variance-stabilizing transformation for count data. The results looked essentially the same:

So the inverted U really seems to be there–but you have to assume the curve has the same shape for all industries. If you fit a separate curve for each industry, there’s no way you’d find the U, and you can see that in the scatterplots. The trouble is that the data are too sparse and variable to try to estimate a separate curve for each industry. Just look at electric and gas services, for example. Or passenger transit.

Data and code

Here’s the data file, abbgh_data.dta.

Here’s the Stan program, hinge_multilevel.stan:

functions {
  vector logistic_hinge(vector x, real x0, real a, real b0, real b1, real delta) { 
    vector[size(x)] xdiff = x - x0;
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
  }
}
data {
  real<lower=0> delta;
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real x0;
  real a, b0, b1;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  x0 ~ normal(1, 1);
  a ~ normal(0, 100);
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(logistic_hinge(x, x0, a, b0, b1, delta) + a1[group1] + a2[group2], sigma);
}

Here’s the Stan program, quadratic_multilevel.stan:

data {
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real b0, b1, b2;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  b2 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(b0 + b1*x + b2*x^2 + a1[group1] + a2[group2], sigma);
}

And here’s the R script:

library("haven")
library("arm")
library("rstanarm")
library("MASS")
library("cmdstanr")
hinge_multilevel <- cmdstan_model("hinge_multilevel.stan")
quadratic_multilevel <- cmdstan_model("quadratic_multilevel.stan")
set.seed(123)

abbgh <- read_dta("abbgh_data.dta")
abbgh$patcw_rounded <- round(abbgh$patcw)
industry_names <- c("Textiles", "Apparel", "Lumber", "Furniture",
"Leather", "Stone, glass, concrete", "Primary metal", "Fabricated medal", "Machinery and computers", "Electric, not computers", "Transport equipment", "Passenger transit", "Freight transport", "Postal service", "Transport services", "Communications", "Electric, gas services")

add_curve <- function(fit, multilevel=FALSE, logscale=FALSE, shift=0, ...) {
  if (multilevel) {
    beta_hat <- fixef(fit)[1:3]
  }
  else {
    beta_hat <- coef(fit)[1:3]
  }
  if (logscale)
    curve(shift + beta_hat[1] + beta_hat[2]*x + beta_hat[3]*x^2, ..., add=TRUE)
  else
    curve(exp(shift + beta_hat[1] + beta_hat[2]*x + beta_hat[3]*x^2), ..., add=TRUE)
}

pdf("invertedu_1.pdf", height=4, width=4)
par(mfrow=c(1,1), oma=c(0,0,2,0), mar=c(3,3,1,1), mgp=c(1.7,.5,0), tck=-.01)
plot(abbgh$Lc, abbgh$patcw, xlab='"Competitiveness"', ylab="Weighted Patents", xlim=c(0.85, 1), pch=20, cex=.5, bty="l")
fit_1 <- glm(patcw_rounded ~ Lc + I(Lc^2), family=poisson(link="log"), data=abbgh)
display(fit_1)
add_curve(fit_1, col="green")
fit_2 <- glm(patcw_rounded ~ Lc + I(Lc^2), family=quasipoisson(link="log"), data=abbgh)
display(fit_2)
add_curve(fit_2, col="red")
fit_3 <- glm(patcw ~ Lc + I(Lc^2), family=quasipoisson(link="log"), data=abbgh)
display(fit_3)
add_curve(fit_3, col="blue")
fit_4 <- glm.nb(patcw_rounded ~ Lc + I(Lc^2), data=abbgh)
summary(fit_4)
add_curve(fit_4, col="pink")
fit_5 <- stan_glm(patcw_rounded ~ Lc + I(Lc^2), family=neg_binomial_2(link="log"), data=abbgh, refresh=0)
print(fit_5)
add_curve(fit_5, col="orange")
mtext("Quadratic regressions on original scale", side=3, line=0, outer=TRUE, cex=.9)
dev.off()

# The orange curve looks different from the others.
# Check by re-running the Bayesian negative binomial regression on optimize setting
fit_5_opt <- stan_glm(patcw_rounded ~ Lc + I(Lc^2), family=neg_binomial_2(link="log"), data=abbgh, algorithm="optimizing")
print(fit_5_opt)

adjust <- function(fit, data=abbgh, multilevel=FALSE) {
  if (multilevel) {
    sic2_effects <- ranef(fit)$sic2[,1]
    year_effects <- ranef(fit)$year[,1]
  }
  else {
    sic2_effects <- c(0, coef(fit)[grep("sic2", names(coef(fit)))])
    year_effects <- c(0, coef(fit)[grep("year", names(coef(fit)))])
  }
  sic2_effects + mean(year_effects)
}

fit_3a <- glm(patcw ~ Lc + I(Lc^2) + factor(sic2) + factor(year), family=quasipoisson(link="log"), data=abbgh)
display(fit_3a)
fit_4a <- glm.nb(patcw_rounded ~ Lc + I(Lc^2) + factor(sic2) + factor(year), data=abbgh)
summary(fit_4a)
fit_5a <- stan_glm(patcw_rounded ~ Lc + I(Lc^2) + factor(sic2) + factor(year), family=neg_binomial_2(link="log"), data=abbgh, refresh=0)
print(fit_5a)
fit_6a <- stan_glmer(patcw_rounded ~ Lc + I(Lc^2) + (1 | sic2) + (1 |  year), family=neg_binomial_2(link="log"), data=abbgh, refresh=0)
print(fit_6a)

industries <- unique(abbgh$sic2)
n_industries <- length(industries)
avg_patcw <- rep(NA, n_industries)
avg_Lc <- rep(NA, n_industries)
count <- 0
for (j in industries) {
  count <- count + 1
  cond <- abbgh$sic2==j
  avg_patcw[count] <- mean(abbgh$patcw[cond])
  avg_Lc[count] <- mean(abbgh$Lc[cond])
}
industries_sort <- rev(order(avg_patcw))

plot_grid <- function(fits, multilevel, logscale, colors, n_rows=4, n_cols=5, title){
  par(mfrow=c(n_rows, n_cols), oma=c(0,0,3,0), mar=c(2.5,2,1,1), mgp=c(1.2,.2,0), tck=-.02)
  x <- abbgh$Lc
  y <- if (logscale) log(abbgh$patcw + 1) else abbgh$patcw
  y_label <- if (logscale) "log (Wt Patents + 1)" else "Weighted Patents"
  for (i in 1:n_industries) {
    j <- industries_sort[i]
    grid_left <- i%%n_cols == 1
    grid_bottom <- i > n_industries - n_cols
    plot(x, y, xlab = if (grid_bottom) '"Competitiveness"' else "", ylab = if (grid_left) y_label else "", xlim=c(0.85, 1), pch=20, cex=.2, bty="l", col="gray", main=industry_names[j], cex.main=.8, cex.lab=.8, cex.axis=.8)
    for (k in 1:length(fits)) {
      add_curve(fits[[k]], multilevel=multilevel[k], logscale=logscale,
                shift=adjust(fits[[k]],multilevel=multilevel[k])[j], col=colors[k])
    }
    cond <- abbgh$sic2==industries[j]
    points(x[cond], y[cond], pch=20, cex=.5)
    lines(x[cond], y[cond], lwd=.5)
    points(x[cond][1], y[cond][1], pch=20, cex=1.5)
  }
  mtext(title, side=3, line=1, outer=TRUE, cex=.8)
}

pdf("invertedu_2.pdf", height=5, width=7)
plot_grid(list(fit_3a, fit_4a, fit_5a, fit_6a), multilevel=c(FALSE, FALSE, FALSE, TRUE), logscale=FALSE, colors=c("blue","pink","orange","purple"), title="Quadratic regressions on original scale, adjusting for industry and year")
dev.off()

abbgh$log_patcw <- log(abbgh$patcw + 1)

pdf("invertedu_3.pdf", height=4, width=4)
par(mfrow=c(1,1), oma=c(0,0,2,0), mar=c(3,3,1,1), mgp=c(1.7,.5,0), tck=-.01)
plot(abbgh$Lc, abbgh$log_patcw, xlab='"Competitiveness"', ylab="log (Wt Patents + 1)", xlim=c(0.85, 1), pch=20, cex=.5, bty="l")
fit_7 <- lm(log_patcw ~ Lc + I(Lc^2), data=abbgh)
display(fit_7)
add_curve(fit_7, logscale="TRUE", col="green")
mtext("Quadratic regressions on log scale", side=3, line=0, outer=TRUE, cex=.9)
dev.off()

fit_8 <- lm(log_patcw ~ Lc + I(Lc^2) + factor(sic2) + factor(year), data=abbgh)
summary(fit_8)
fit_9 <- stan_glmer(log_patcw ~ Lc + I(Lc^2) + (1 | sic2) + (1 |  year), data=abbgh, refresh=0)
print(fit_9)
pdf("invertedu_4.pdf", height=5, width=7)
plot_grid(list(fit_8, fit_9), multilevel=c(FALSE, TRUE), logscale=TRUE, colors=c("orange","purple"),
          title="Quadratic regressions on log scale, adjusting for industry and year")
dev.off()

industries_in_order <- rep(NA, n_industries)
for (j in 1:n_industries) {
  industries_in_order[abbgh$sic2==industries[j]] <- j
}
abbgh$avg_Lc <- avg_Lc[industries_in_order]

years <- unique(abbgh$year)
n_years <- length(years)
years_in_order <- rep(NA, n_years)
for (j in 1:n_years) {
  years_in_order[abbgh$year==years[j]] <- j
}

fit_10 <- stan_glmer(log_patcw ~ Lc + I(Lc^2) + avg_Lc + (1 | sic2) + (1 |  year), data=abbgh, refresh=0)
print(fit_10)

logistic_hinge <- function(x, x0, a, b0, b1, delta) {
  return(a + b0*(x-x0) + (b1-b0)*delta * log(1 + exp((x-x0)/delta)))
}

plot_grid_hinge <- function(fit, n_rows=4, n_cols=5, title){
  par(mfrow=c(n_rows, n_cols), oma=c(0,0,3,0), mar=c(2.5,2,1,1), mgp=c(1.2,.2,0), tck=-.02)
  x <- abbgh$Lc
  y <- log(abbgh$patcw + 1)
  sims <- fit$draws(format="df")
  x0_hat <- median(sims$x0)
  a_hat <- median(sims$a)
  b0_hat <- median(sims$b0)
  b1_hat <- median(sims$b1)
  n_sims <- nrow(sims)
  for (i in 1:n_industries) {
    j <- industries_sort[i]
    grid_left <- i%%n_cols == 1
    grid_bottom <- i > n_industries - n_cols
    plot(x, y, xlab = if (grid_bottom) '"Competitiveness"' else "", ylab = if (grid_left) "log (Wt. Patents + 1)" else "", xlim=c(0.85, 1), pch=20, cex=.2, bty="l", col="gray", main=industry_names[j], cex.main=.8, cex.lab=.8, cex.axis=.8)
    a1_sims <- as.matrix(fit$draws("a1", format="df"))[,1:n_industries]
    a2_sims <- as.matrix(fit$draws("a2", format="df"))[,1:n_years]
    for (s in sample(n_sims, 20)) {
      curve(logistic_hinge(x, sims$x0[s], sims$a[s], sims$b0[s], sims$b1[s], delta_true) + a1_sims[s,j] + mean(a2_sims[s,]), col="red", lwd=0.5, add=TRUE)
    }
    a1_hat <- apply(a1_sims, 2, median)
    a2_hat <- apply(a1_sims, 2, median)
    curve(logistic_hinge(x, x0_hat, a_hat, b0_hat, b1_hat, delta_true) + a1_hat[j] + mean(a2_hat), col="blue", lwd=1.5, add=TRUE)
    cond <- abbgh$sic2==industries[j]
    points(x[cond], y[cond], pch=20, cex=.5)
    lines(x[cond], y[cond], lwd=.5)
    points(x[cond][1], y[cond][1], pch=20, cex=1.5)
  }
  mtext(title, side=3, line=1, outer=TRUE, cex=.8)
}

delta_true <- 0.05
stan_data <- list(delta=delta_true, N=nrow(abbgh), x=abbgh$Lc, y=log(abbgh$patcw + 1), J1=n_industries, J2=n_years, group1=industries_in_order, group2=years_in_order)
fit_11 <- hinge_multilevel$sample(stan_data, parallel_chains=4, refresh=0)
print(fit_11)
sims_11 <- fit_11$draws(format="df")
print(mean(sims_11$b1 < 0))

pdf("invertedu_5.pdf", height=5, width=7)
plot_grid_hinge(fit_11, title="Hinge regression with uncertainty on log scale, adjusting for industry and year")
dev.off()

plot_grid_quadratic <- function(fit, n_rows=4, n_cols=5, title){
  par(mfrow=c(n_rows, n_cols), oma=c(0,0,3,0), mar=c(2.5,2,1,1), mgp=c(1.2,.2,0), tck=-.02)
  x <- abbgh$Lc
  y <- log(abbgh$patcw + 1)
  sims <- fit$draws(format="df")
  b0_hat <- median(sims$b0)
  b1_hat <- median(sims$b1)
  b2_hat <- median(sims$b2)
  n_sims <- nrow(sims)
  for (i in 1:n_industries) {
    j <- industries_sort[i]
    grid_left <- i%%n_cols == 1
    grid_bottom <- i > n_industries - n_cols
    plot(x, y, xlab = if (grid_bottom) '"Competitiveness"' else "", ylab = if (grid_left) "log (Wt. Patents + 1)" else "", xlim=c(0.85, 1), pch=20, cex=.2, bty="l", col="gray", main=industry_names[j], cex.main=.8, cex.lab=.8, cex.axis=.8)
    a1_sims <- as.matrix(fit$draws("a1", format="df"))[,1:n_industries]
    a2_sims <- as.matrix(fit$draws("a2", format="df"))[,1:n_years]
    for (s in sample(n_sims, 20)) {
      curve(sims$b0[s] + sims$b1[s]*x + sims$b2[s]*x^2 + a1_sims[s,j] + mean(a2_sims[s,]), col="red", lwd=0.5, add=TRUE)
    }
    a1_hat <- apply(a1_sims, 2, median)
    a2_hat <- apply(a1_sims, 2, median)
    curve(b0_hat + b1_hat*x + b2_hat*x^2 + a1_hat[j] + mean(a2_hat), col="blue", lwd=1.5, add=TRUE)
    cond <- abbgh$sic2==industries[j]
    points(x[cond], y[cond], pch=20, cex=.5)
    lines(x[cond], y[cond], lwd=.5)
    points(x[cond][1], y[cond][1], pch=20, cex=1.5)
  }
  mtext(title, side=3, line=1, outer=TRUE, cex=.8)
}

fit_12 <- quadratic_multilevel$sample(stan_data, parallel_chains=4, refresh=0)
print(fit_12)
sims_12 <- fit_12$draws(format="df")
x_peak <- -sims_12$b1/(2*sims_12$b2)
print(mean(x_peak > min(abbgh$Lc) & x_peak < max(abbgh$Lc)))

pdf("invertedu_6.pdf", height=5, width=7)
plot_grid_quadratic(fit_12, title="Quadratic regression with uncertainty on log scale, adjusting for industry and year")
dev.off()

46 thoughts on “Reanalysis of that Nobel prizewinning study of patents and innovation (with R and Stan code)

  1. Cool analysis. Props to you for following up on the earlier post. Sorry if I missed it, but after doing this, do you have a better metric for innovation or competition than the original authors provide (patent counts / profit)?

    • Student:

      No, I don’t know about this except to say that I don’t at all trust the idea of using time variation of patents within an industry to measure time variation of innovation. I get that the authors were doing their best using available data; still, I don’t think their data and analysis provide evidence for their substantive conclusions.

      In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

  2. I’m not surprised that furniture has so many patents. I expect that industry (along with apparel and textiles) to rely on design patents. The other industries are likely weighted heavily to utility patents. I do wonder why apparel and textiles don’t have more design patents. It may be that tastes change fast enough that they never pay off.

    • But apparel and textiles may rely on copyright rather than patent? I’m sure there are som totally new inventions of kinds of raincoats or something, but I doubt there are that many.

      • “Design” patents are about look and feel. This is where Apple got Billions from Samsung because of rounded rectangles on windows or some bullshit. Putting an orange pocket on your green raincoat is probably design-patentable. That little polo character or alligator on the shirts was probably trademarked, so there’s also yet another “property law” possibility.

        I don’t know with something like the “Creeper hoodie” that kinda looks like a minecraft character and zips up over your face, it’s probably design patented, trademarked, and copyrighted.

        I’m not an expert, but I have read a little on some of this, so take it all with a grain of salt, but hopefully I wasn’t too far off and that might explain what Anon meant.

  3. I haven’t closely examined your thorough analysis but I did explore the data a bit. I share your concerns about selection problems, and especially the small number of industries (four of them to be exact) that dominate the patent activity. These four industries appear to have nothing in common except for the high level of patent activity – making me more concerned about patents as the relevant measure of innovation here. Also, the relationship between competitiveness and patents (using the authors’ measures) is quite different for these four industries and the other 17 low-patent activity industries. I see a modest U shape for the low-patent industries and a much stronger inverse relationship for the four high-patent activity industries.

    The economic story seems complex and murky to me here. The variables in the data set also suggest ample room for forking paths. Given the natures of innovative activity and competitiveness, I think you could find almost any pattern you want. Even though you eventually found empirical support for the U-shape, the idea that competitiveness is associated with increasing innovation up to a point, declining thereafter sounds too simplistic to me. I couldn’t tell from your analysis whether you looked at how the estimated hinge point varies across industries (I did see your check that it fell within the range of the data), but I would think that is worth exploring. Given that the nature of both innovation and competition varies across industries, where the hinge point occurs should relate to the economic story (I think).

    Color me skeptical.

    • I also wonder about the number of firms represented in their sample (I have not checked whether they report this or not). I did manage to find a listing of number of firms by two-digit SIC code and the total number of firms varies greatly. Code 31 has only 5 thousand firms while code 42 has 330 thousand firms (31 is among the lowest patent industries and 42 is among the highest – but the pattern is far from clear when I look at the other industries). In any case, I don’t know if they are using the total number of firms in each industry or some subset of these. I would think that the varying number of firms in the industries might be related to both innovative activity and competitiveness. Possibly another forking path?

      • I went back to the paper and they report using a total of 311 firms (across the 17 sectors) that are listed on the London Stock Exchange. Given that there are many thousands of firms in these sectors, it does make me wonder about the representativeness of the firms that they are using.

        • Dale:

          That paper came out in 2005, so in the meantime they or others might have replicated the analysis using all the firms. I don’t really know what’s been done since then. Given that there is interest in competition and innovation, I can only imagine that the field has advanced a lot in the past twenty years, so the 2005 paper should just be of historical interest. On the other hand, the way social science goes, I can well imagine that inverted U thing is just presented as a true empirical finding with reference to the old paper. I don’t know.

        • Here is a different question. I did a little searching and there is (not surprisingly) a considerable literature subsequent to the article in question. I looked at a few – published in leading journals – and quickly found that the inverted U-shape was not found in a study of US data – instead a negative relationship was found. Another study of data from the Philippines found a positive relationship. I didn’t look any further, as this was enough for my question. My question: how does a Bayesian treat the prior in such situations? In past discussions, my concerns about the prior related to the fact that much of the research in economics is similar to this: credible researchers finding a wide range of, often contradictory, results. I’d say that is more common than not. While I accept and believe that the Bayesian approach makes more sense than the frequentist approach, I am still wondering about the practical application in situations like this.

          I will say that these subsequent studies attempt to reconcile their findings with that of the original paper, but those attempts are not straightforward and would take a large effort to try to understand. My belief is that the U-shape, negative, or positive relationship is not settled, and I would like to know how a Bayesian analysis would proceed in such cases.

        • Dale:

          Indeed, there’s no reason to expect a constant pattern across industries over time, and, beyond that, any connection between patent counts and innovation will itself vary by industry and over time. So the whole thing is a mess no matter how you look at it.

  4. That said, it’s cool to see that a careful statistical analysis does find the inverted U. It does seem like a real pattern in the data, so the only question is the relevance of these data to the larger economics question.

    If you take the inverted U model (presumably +/- some distribution) and try to “predict” even these exact same points that were used to fit the curve, you will have essentially zero predictive skill. It would be simple to add a predicted vs actual scatter plot but we know what it would look like.

    If we get new data the predictive skill of this curve will be even worse.

  5. I am always very skeptical of these kinds of analyses when the dispersion of the data points is so great. For one thing, as Anoneuoid said, the curve has little predictive value. In fact, the real value one could add would be to explain why the highest values of “innovation” were so far above the fitted line, and how those could be replicated intentionally. In fact, nothing has shown that the data even have random errors around the fit line.

    For another, reporting the error of the fit coefficients has little value for data like this. Consider the situation near the left and right ends of the x axis. The density of points near them is very low. That means the error of the fit in those regions will be very large. Or equivalently, the values of those points contribute little to the overall fit. If there were a larger sample in those regions the fit might turn out to be very different.

    One sees this kind of effect is seen when lines of uncertainty (e.g., 2 SD) are plotted around a fit line for a variety of kinds of fit, including linear and LOESS. Those lines always spread apart near the ends, where there is less data.

    So, as I said, the math might be find based on the assumptions but the actual fit, and even more its practical value, see dubious.

    • Tom:

      Agreed. If the model fit the data better, and if the data connected better to the underlying constructs of interest, then these sorts of extrapolations could be reasonable. As it is, they’re finding some aggregate trends in some observational data and it’s not clear what to make of it all. I think the hierarchical hinge analysis on the log(x+1) scale is better than the alternatives, but ultimately it’s just a summary of a data pattern. As Dale notes above, at the very least you’d want to be understanding what’s happening in each industry rather than just throwing them all together like this.

  6. Ok, so once I’d dropped my kids off this morning, I had a broken toilet and a couple hours of procrastination I can afford to do before actually tackling it… But it’s not hours and hours. So I couldn’t go full Bayesian on this problem. Here’s what I did (link to my Mastodon thread, if you click the link you should get to the first post, there are as of now 6 posts in the thread with 4 graphs.)

    https://mastodon.sdf.org/@dlakelan/115413741221991207

    Now, the main thing I did differently from Andrew was I used Loess curves, and I fit a loess curve through the year data after subtracting the industry means. Andrews specification (1|year) + (1|industry) is equivalent to in essence subtracting the yearly averages. In my plot you can see that compared to the loess through time. The loess is much smoother than the per-year averages. Fitting per year categorically like that is an extremely flexible **overfitting** process. The smoothed loess is less overfit. It results in different end-results.

    Sorry I didn’t merge in the industry names in the small-multiples plot, did I mention that toilet?

    Seriously though, if this was a major contributor to the Nobel prize then it really says a lot about that prize… Maybe toilets are relevant.

    • Daniel:

      I put in additive industry and year effects to follow what the original authors did. But, yeah, it makes sense to fit year effects as a time series rather than independent terms. I doubt it will make much of a difference, though.

      Also, the overfitting to which you refer arises in the version where the year effects are estimated as unmodeled coefficients. The multilevel model should not overfit.

      As noted, none of this addresses the big problems with the analysis, involving the issues of data and measurement. But I do think this sort of discussion can be helpful, because it’s good to know what can be learned from the data, and it’s also good to be reminded that I can forget things (such as that years are in order).

      • If the different adjustments for year effects do not matter that much, then what causes the very different patterns in Daniel’s and your analysis? While you find somewhat inverted Us in all industries, Daniel only finds it in one industry. The others show mostly flat lines in his analysis. As far as I understand, you basically differ in your adjustments for year effects and the functions you fit. Daniel uses LOESS and you a hinge function. But the hinge function should be flexible enough to not show inverted you, if there is none; as you pointed out in your post. Which makes me think the year adjustment must cause the differences.

        As a sidenote, isn’t this year adjustment weird anyway? Basically subtracting the yearly average (or some smoothed version of it) of patents filed, averaged across industries, strikes me as odd, given that there are huge differences between industries.

        • I was wrong. The main difference seems to be that you assume the same functional form across industries and Daniel fits a separate regression to each industry. So I guess the inverted U is mainly driven by industry49 and without this also in your analysis there would be no inverted U

        • huan, looking at my industry specific graph, I think the main reason for the inverted U in the across-industry graph is the combinations of industry 33 and industry 22 and 35. Industry 33 has an upward trend starting lower left towards upper middle… 22 and 35 have a downward trend, starting in the middle declining towards lower right.

          Take those 3 out and the inverted U across industries disappears basically entirely, everything else is flat except the very weird industry 49 which is basically just bimodal rather than really any kind of inverted U

          The ends of the x spectrum have fewer points than the middle, and so they are also more affected by individual data points, meaning we should expect much more variation in the position and slope of the ends of the smoother curve. For example Industry 49 has an inverted U because it’s bimodal and the endpoints happen to have a few more of the lower-mode values

        • Daniel:

          If you’re gonna talk about industries, I recommend you use their names! 33 is Primary metal industries, 22 is Textile mill products, 35 is Industrial and commercial machinery and computer equipment, and 49 is Electric, gas, and sanitary services.

          I’m with the anonymous commenter who remarked that the most striking pattern in these data is that the average number weighted patents (“innovation”) is essentially unpredicted by average profit margin (“competitiveness”). If the point of gathering and analyzing the data is to learn something about the relation between competitiveness and innovation, the main takeaway is that, using these particular measures, X tells you virtually nothing about Y. So if your goal is to get more innovation and you want to interpret the data causally (as the authors do), the message is that competitiveness pretty much has nothing to do with it!

          I guess it’s time for another post on the topic . . .

        • Hi Andrew, yes ideally I should have merged in the industry names to the dataset… but, I had a toilet whose flush and fill mechanism needed to be fully replaced after a sudden emergency failure, and I did that instead of spending more time on data analysis.

          I’m with you, the real take-away here is that there’s virtually nothing going on.

      • To be clear, I had nothing really against your analysis, just wanted to mention that I took a look at it a different way and one of the ways it was different was to fit a continuous model for time variation. The other way was to use a very flexible (Loess) curve. Finally, I split up by industry and then Loess separately through each one rather than trying to see if one big inverted U fits all the data. To me, if the so called inverted U really did have some reality, it shouldn’t be just an “ecological fallacy” type thing, that some industries are at one end of the X scale and have low Y values, and others at the other end of the X scale have low Y values, and most industries fall in the middle with constant Y values… leading to an inverted U but no industry itself sees a large range of X values and an inverted U shape within industry…

        My graphs showed almost every industry was absolutely FLAT. A few were upwards or downwards trending… and the existence of those drove most of the inverted U.

        My guess is what happened here is someone did some very basic stats, saw an inverted u shape, retconned an explanation, made it mathy so it could get published… and that ensured a mild academic controversy for 30 years allowing a whole generation of economists to get paid to think about why this does or doesn’t happen and in what contexts it does or doesn’t or what mechanisms cause it or etc… all without anyone ever really figuring out with any assurance whether there really was anything happening empirically in the first place … I hate to sound so cynical, but my own feeling built up over the last 20 years of watching academia and empirical studies is that I’m probably not cynical enough.

    • So much effort here fitting various curves to this data. I do not get it. It is clear we need some other feature(s) to model it (to the extent thats worth doing at all).

      What is the point of fitting curves with no predictive skill? I would look at this and immediately decide to do something else, anything besides trying to fit a curve to it.

      • Anon:

        The answer to your question is in the tl;dr section of the above post. The short answer is that I think the most valuable part of my analysis was making the graph showing the separate time series for each industry. It just happened that I only thought of doing this after first thinking about fitting the model. Also the original paper was all about the model and I think it’s useful to address this on its own terms.

      • Anon
        This is your usual universal dismissal of an entire body of work. This time I share your conclusion, but for different reasons. I found the curve fitting exercise more than I wanted, but it does make it clear just how tenuous the empirical part of that paper is. I find the measurement issues far more serious, and I think the curve fitting helps point in that direction. One thing that neither curve fitting or measurement addresses, however, is the fairly classic form of the paper – a form that economists love. There is the empirical section but also a theoretical model that supports it. In language I often encounter on this blog, it might be called a “generative model” of the innovation process. It’s nice to have that and it is usually required by the top economics journals (I’ve often had papers rejected because I lacked that). But I found the theoretical model unconvincing. It required numerous simplifying assumptions in order to make the math tractable – ultimately to derived some propositions that show that a U-shaped relationship between innovation and competition could result from these assumptions.

        I found the assumptions neither convincing nor transparent. In the end, I see a somewhat contrived theoretical model and an even more contrived empirical estimation that might be consistent with each other. Yes, there is no predictive evaluation. Even if there was, I’m not sure it would convince me that this particular theoretical model or these particular fitted curves capture the essential features of the innovation/competition relationship. To be fair, it is a difficult problem to analyze – all the way from finding good measurements to building a reasonable theoretical model to conducting a convincing empirical analysis with predictive capability. I won’t claim the paper does not make a worthwhile contribution to understanding the process, and it is certainly a stronger effort than I am likely capable of making. But, as with many such economics papers, I am left wondering whether I have any better understanding of the process than when I started. And even less convinced it was worth the effort (coming back to my agreement with your conclusion).

        • The most distinctive part of the plot is the low (psuedo-)cardinality. Ie the large proportion of y-axis values at or near zero. An inverted U model does nothing to address this so will not be a good model.

  7. Someone ought to rewrite the plot code in ggplot2 and render them at a resolution that makes them legible. This kind of project might work nicely in tidymodels, although I’m not sure whether it supports all these models out of the box.

  8. Since this post is going to get a lot of views, I think I should try to close the barn door after the horse has bolted. In chronological order,

    (1) “The only thing that really puzzles me here is what’s going on with model 5”. It is mostly the default priors on the coefficients, which MASS::glm.nb does not do. If you NULL out prior_intercept, prior, and prior_aux, you get posterior medians that are similar to the (awful) MLEs in model 4.
    (2) “I wanted to connect to whatever version of Poisson regression was used in that published paper.” Several popular econometrics textbooks (linked at https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ ) recommend applying the exponential inverse link function to the linear predictor rather than applying the log function to the data and using the residuals to estimate standard errors. I think this is a good example of the difference between having a generative model for the data-generating process and having a moment condition.
    (3) “What I really want is a model that can have diminishing returns or can asymptote or can have that inverted U shape, with the question answered by the data.” Your hinge model that interpolates between two linear functions is similar to (but somewhat more restricted than) a Restricted Cubic Spline that Frank Harrell has popularized in R (albeit after the paper in question was written). So, you could do things in R of the form: patcw_rounded ~ rms::rcs(Lc) .
    (4) “We can see the pattern a lot clearer using separate plots for each industry.” Perhaps. I don’t think I can embed a plot in a comment, but everyone at home can do in R:

    abbgh$sic2_factor <- as.factor(abbgh$sic2)
    fit_splines <- stan_gamm4(patcw_rounded ~ s(Lc, by = sic2_factor), random = ~ (1 | year), family = neg_binomial_2, data = abbgh, adapt_delta = 0.98)
    plot_nonlinear(fit_splines)

    Essentially, only industry 42 strongly exhibits an inverted-U pattern, but the credible intervals on the spline functions are consistent with a lot of shapes. So, a lot of work is being done by the assumption that the nonlinear process is the same across industries. But PSISLOOCV does not favor models with s(Lc, by = sic2_factor) or even s(Lc), so these data really only permit using rather particular nonlinear forms.

  9. I really love these little stats case-study posts. Quick question, why did you connect the points in the industry panel plots? Was it just an aesthetic decision? I usually prefer to avoid connecting points unless there’s a reason to believe the data are connected by time and/or space.

    • 9P:

      I connected the dots to show the time order, and this indeed conveys valuable information, as in the plots for Machinery and computers and Electric and gas services.

      You write, “I usually prefer to avoid connecting points unless there’s a reason to believe the data are connected by time.” These data are connected by time!

  10. For math folk (also @daniel_lakeland, it reminds me of nonstandard analysis), I came across this which is fascinating me. It seems highly relevant to this practice of fitting polynomials:
    https://news.ycombinator.com/item?id=45823141

    In summary, say you have a theoretical model like (x – y)/(x + y). Currently the practice is to set equal to zero and solve it that way. However, that results in the line x = y. If instead we plot the function itself (rather than exact solutions), we get a much richer heatmap (or 3d plot) that shows what points could be approximate fits vs not fit at all. I don’t see why we have been throwing away like 99% of the information in our models (other than historical limits on computation/visualization).

    • Thanks I agree with some of the Hacker News commentary, and was turned off by the initial statements on the page, but the insight later in the page is really helpful. There is some connection to nonstandard analysis. In nonstandard analysis points have a “halo” where there’s locations that are infinitesimally far away from the exact location. So graphs can have this kind of nonstandard halo as well. the halos he shows are where the error is now appreciable but still small. So I can see some connection.

      I remember learning about Lagrangian mechanics and you can impose restrictions on the motion by including a constraint equation. You can think of these constraints as forces applied to enforce the constraint (like, a railroad track will apply forces to keep a railcar going along the track). If you loosen that constraint to something like a strong spring, you can get paths that go very “near” the perfect constraints but have fast oscillations or whatever. The plots he gives make me imagine all these mechanical trajectories that would look almost just exactly like the “exact” equation but when looked at carefully there’s some “error” that isn’t modeled. It’s a nice visualization.

      • Is it actually useful though? Eg, I’m imagining you have some theory about equilibrium between A and B, then you derive your equations for A, B and equate to get the first example he shows:

        y / (x^2+y^2) = (x+1) / (x^2+y^2)

        https://fuzzygraph.com/?equation=y+%2F+%28x%5E2%2By%5E2%29+%3D+%28x%2B1%29+%2F+%28x%5E2%2By%5E2%29&fuzzyLevel=32&colorMap=plasma&invertColor=true&xCenter=0&yCenter=0&yHeight=6&showAxes=true&minOverride=0&maxOverride=5

        Usually we would test the theory by seeing how data fit to the line. But instead it looks like a much more fruitful test is that z = abs(LHS – RHS) >> 0 in that specific region offset from zero. That is a more unique prediction, more likely to distinguish between various models. And of course that is probably not the ideal statistic to use. The process of solving the equation seems to be destroying information.

        • This is what’s done with 3D plots and level sets. if z=f(x,y), the slice z=0 gives all the zeros of f and you can also see where it is almost 0 on the 3D plots. This is not new.

        • Well let’s take a Bayesian viewpoint. So we have some equation y = f(x) but we know that equality isn’t right, real world things move the reality away from the model, so we can look at y-f(x) and say “how big is this error?” and specifically we could say something like “the reasonableness of different sized errors is like Normal(0,epsilon)”

          What that’s saying relative to the graphs is “the region where the color is such and such” is where the data should be. (ie. the color is plotting the size of the error)

          So literally we **are** doing this in Bayesian analysis, we’re saying that we “believe” a model which produces errors in some region (some color) and with “disbelieve” models where the colors are “too crazy”.

          Is that a helpful perspective on what Bayes is about? I mean I think it’s a nice intuition.

        • Agree with the other Anonymous, it’s not new, but it’s sometimes just good for someone to kind of “rediscover” a strategy for understanding and to publicize it… it’s like the XKCD comic about todays lucky 10000.

          https://xkcd.com/1053/

        • Sociologically: Its not new to point out NHST problems, there are thousands and thousands of papers about it. Yet it is standard to train people to do the wrong thing, with utter lack of urgency to fix it.

          Also I remember grants for doing replications not approved because they “lacked novelty”. So when something gets repeatedly called “not new” it makes it seem *more* interesting to me. Like its a sign something really dumb is going on.

          Anyway: I don’t think Bayes rule is addressing it, its an orthogonal issue. What is new to me, is this apparent destruction of information by simplifying a theoretical ~ to =. Eg, in that hypothetical equilibrium example, the theory does not claim exact equality. But for computational efficiency we assume equality, which changes the predictions of the model. I definitely want to stop (unknowingly) doing this right now if that’s the case.

          Or it is just another multiverse level and there are effectively infinite ways to express the model that will all have different “topologies”, so it doesn’t really mean anything.

          Thank you all for the feedback.

Leave a Reply

Your email address will not be published. Required fields are marked *