In the biggest advance in applied mathematics since the most recent theorem that Stephen Wolfram paid for . . .

Seth Green writes:

I thought you might enjoy this update from the Stata team:

. . . suppose we wish to know the effect on employment status of a job training program. Further suppose that motivation affects employment status and motivation affects participation. We do not observe motivation. We have an endogeneity problem.

Stata 14’s new eteffects eliminates the confounding effects of unobserved variables and allows us to test for endogeneity. In return, you must model both the treatment and the outcome.
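
[For reference, the machinery behind an estimator like this is roughly the textbook endogenous-treatment setup. A sketch, not Stata’s exact parameterization:

$$ y_i = x_i'\beta + \delta\, t_i + \varepsilon_i, \qquad t_i = \mathbf{1}\{ w_i'\gamma + \nu_i > 0 \}, $$

where the unobservables $(\varepsilon_i, \nu_i)$ are allowed to be correlated. “No endogeneity” is the null hypothesis $\operatorname{corr}(\varepsilon_i, \nu_i) = 0$, which is exactly the correlation the advertised test looks at.]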

Well ok then! Glad we can all retire!

Green continued:

I was shocked. I already emailed the support staff with a quote from Judea Pearl about how the correctness of the model is, even in principle, unverifiable. Whom do you think they hire to write these updates?

I replied:

To be fair, if you have 2 natural experiments you should be able to estimate 2 separate causal effects and then get what you want. The trouble is with any implication that this can be automatically done from observational data. “You must model,” sure, but a statistical model without some real-world identification won’t get you far!

To which Green responded:

I wish that were what they were claiming. In the example on the page, however, “eteffects” models “wages as being determined by job tenure and age” and “college attainment by age and the number of parents who attended college.” So the actual implementation is “independent conditional on observables.” The post then gives a test of “the correlation between the unobservables that affect treatment and outcome. If these correlations are zero, we have no endogeneity.” The test detects endogeneity, the model was correct because it was simulated data, and therefore endogeneity has been addressed (!).

The deeper I peer in the less meaning there is.

All I can say is, what an amazing accomplishment. Whoever came up with it is the most extraordinary collection of talent, of human knowledge, that has ever been gathered in the field of statistics, with the possible exception of when Stephen Wolfram dined alone.


  1. They hide it further down the page, but I think this is just a control function approach. In the example they give, the education of the parents is implicitly being used as an instrument for the college attendance of the individual. This is a silly instrument, but whatever.
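
     For anyone who hasn’t met the term: the simplest linear control-function recipe is to regress the endogenous regressor on the instrument plus the exogenous covariates, keep the residual, and add that residual to the outcome regression. Here is a toy sketch with simulated data echoing the variable names from the Stata example; it is the generic linear recipe, not eteffects’ actual estimator (which handles a binary treatment via potential outcomes):

        import numpy as np

        def ols(X, y):
            # Least-squares coefficients only -- no standard errors, just the recipe.
            return np.linalg.lstsq(X, y, rcond=None)[0]

        # Simulated data: wage is the outcome, college the possibly endogenous
        # treatment, age an exogenous control, pcollege the implicit instrument,
        # and ability the unobserved confounder.
        rng = np.random.default_rng(0)
        n = 1000
        age = rng.normal(40, 10, n)
        pcollege = rng.binomial(2, 0.3, n)
        ability = rng.normal(size=n)
        college = (0.5 * pcollege + ability + rng.normal(size=n) > 1).astype(float)
        wage = 10 + 0.05 * age + 2.0 * college + 3.0 * ability + rng.normal(size=n)
        const = np.ones(n)

        # Stage 1: regress the endogenous regressor on instrument + exogenous covariates.
        Z = np.column_stack([const, age, pcollege])
        college_resid = college - Z @ ols(Z, college)

        # Stage 2: add the stage-1 residual to the outcome regression. The residual
        # soaks up the part of college that is correlated with the unobservable.
        X = np.column_stack([const, age, college, college_resid])
        beta = ols(X, wage)
        print("effect of college:", beta[2], "residual coefficient:", beta[3])

     With a linear first stage this reproduces the usual 2SLS point estimate for the treatment coefficient, and the coefficient on the residual doubles as an informal endogeneity check: near zero means plain OLS would have been fine.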

  2. I chuckled at this post but I’m not sure if I understand it. Could someone explain to me Stephen Wolfram’s record of paying for theorems or being a fraud? It wouldn’t surprise me but I don’t think I know the background.

      • I support Andrew’s push for Stata-side interface development for StataStan.

        (Personally, I use Python, R, Matlab, Julia, and Stan.)

        But a lot of my clients are entrenched in Stata.

        Andrew, have there been amicable discussions between your Stan team and Stata?

        As an outsider, I suspect maybe the two groups just are not communicating with each other.

        • Mjt:

          We have been in communication with the Stata people about StataStan. My collaborator Sophia Rabe-Hesketh, who has been involved with some of the education modeling applications in Stan, has also done a lot with Stata. I haven’t talked directly with the Stata people myself. My impression is that they just want to do their own Bayesian package, even though it is slower and much less flexible than StataStan. Seems like a waste of effort on their part. All of Stan is open source, so there’s nothing stopping the Stata team (or anyone else) from copying our algorithms and even our modeling language. But for all concerned it would seem to make more sense for them to put their effort into the interface rather than reprogramming everything we’ve done (or using a substandard set of algorithms).

        • Andrew:

          These decisions to interface to an external product aren’t as simple as you make them sound:

          Choosing to interface to an external module just to implement one specific feature in an existing, complex product means you must figure out how to ensure that the module will be present on all user systems. You can make that the users’ headache, which is sort of what the Stan install process did by having complex install instructions and asking users to also have X, Y, etc. installed.

          Automatically packaging the external module you want to interface to is better, but not always easy, and it adds bloat unless the OS you are supporting has excellent package management capabilities.

          Then again, the external tool you are interfacing to must work and be supported on all the diverse architectures you’ve sold your main product to run on.

          Another risk is uncertainty about the future: you must factor in the chance that the external tool won’t be maintained for the life cycle of your product.

          Interface stability is another notorious gotcha. Unless you are lucky enough to get someone with Linus Torvalds’s obsessive insistence on not breaking external interfaces at any cost, you can factor in a truckload of bug fixes for each new incarnation of the external tool. If the external tool is “smaller” than your product, its developers very likely care more about adding cutting-edge features than about boring interface stability.

          All I’m saying is that these aren’t obvious decisions: Whether to interface or recode the functionality internally. Raw speed isn’t the only concern here.

        • +1. The dependencies and stability and portability are all huge risks. And the bloat is severe, as are the portability requirements: Stan requires a C++ compiler at runtime (an absolutely insane dependency for commercial software), with huge latencies, which we accept because we’re trying to solve hard problems. These can be mitigated somewhat by wrapping Stan in packages like rstanarm, where compilation can happen once and then binaries can be distributed.

          But intellectual property is the 800-pound gorilla here. Using Stan would introduce a dependency not only on Stan but also on Boost and Eigen. These are all large open-source projects, and while all three try very hard to keep their intellectual property clean, it’s an enormous liability.

          And there’s also security. Who knows what might have gotten into open-source code and what kind of security issues you might have, like writing directly to files, etc.

          Contrast this with the upside, which is faster, more scalable (not the same thing as speed), and more flexible Bayesian inference. If they don’t really care about Bayesian inference because it’s a small part of their market, these risks don’t seem worth it.

          In this case, I think it is an obvious decision on Stata’s part and that they did the right thing. It’s certainly what I’d have done if I’d been running their project.

          I’ve tried to explain all of this to Andrew. Unless you’re on the ground with a wrench in your hand maintaining the software, it’s hard to see the details. You just get on the plane and fly to your destination. I find this frustrating, because it indicates an underestimation of the complexity of code development, maintenance, and intellectual property oversight. We’ve been trying very hard not to break backward compatibility in Stan, but exactly the “little project” effect you mention leads to a lot of “can’t you just…” … “uh, no” discussions, where I play the bad cop.

        • On the other hand, Stan can work via the stanc compiler, i.e., the dependencies can essentially be an external program rather than something you link into your big program.

          Yes, you need a compiler, but Stata could certainly create some kind of installer package for a stanc thingy that included g++ somehow, right?

          I agree with you, though: you wouldn’t do it unless you thought that Bayesian inference was important to your core goals.

        • Exactly! Great points.

          Add to this that if you are providing paid support, you now essentially have to train your help-desk team in all the quirks and pitfalls of installing not only your own product but also each external tool you are going to call via an interface, and to stay on top of the changes that each new version of the tool brings. If the tool has a much more rapid release cycle than your software, you are just adding an unnecessarily frequent change-management cycle. (I sense that Stan releases new versions far more often than Stata?)

          This will very quickly become a support nightmare.

        • @Bob:

          I recommend reading the Linux Kernel Mailing List to get sensitized to these software maintenance pains.

          I think the “bad cop” role you play is crucial. Good software is not just good coding but a lot of good design decisions often taken by heuristics.

          And often, it’s not the features you added BUT the features you wisely decided to say NO to that define a great software package.

          Stata could pay the Stan team to create a kind of long-term-stable release line, thereby funding Stan development while getting something they could rely on in their software. Instead, they are paying an in-house team to develop something certainly less good. They will still have lots of development, debugging, testing, training, etc. problems with their in-house software.

          It’s pretty clear that, choosing between “Stan” and “do nothing,” the “do nothing” path is less complicated. But choosing between “Stan” and “develop our own in-house thing from scratch”? That’s not an obvious choice.

        • @Daniel

          How much “less good” is the crucial question. What percentage of their users care about that much less good, and how strongly? How many Stata users are using Bayesian inference at all?

          Also, how are you certain that the Stata internal implementation will not get better? Of course, Stan can race on ahead and add a gazillion features, but Stata doesn’t need to keep up. It only has to be as good as its users need or demand.

          Again, all I’m saying is that these are complex decisions. So I wouldn’t blame them for not doing what Andrew thinks they ought to be doing.

        • I don’t think the g++ solution would be all that simple. When developing C plugins for Stata, the Windows build uses the Borland compiler, while the gcc/g++ compilers are used on *nix-based systems. In future releases of Windows 10 this will cease to be a problem thanks to the integration of Ubuntu in the Windows environment (MS and Canonical made an announcement about this not too long ago), but for now it would continue to present a lot of issues that they may be waiting to tackle.

          Another difference to be cognizant of is how data are represented in the language. Although Stata relies primarily on a “data frame” type structure, it also includes matrices, of which row and/or column vectors are a subset. So moving the data from Stata’s internal representation into an object/format used by the Stan internals could also present additional challenges and constrain future development with regard to the representation of data. If there were a Java API for Stan, it might be a bit easier to use the Java API for Stata, or, if someone is able to, to use the C API (plugin interface) to build things that way (with the potential compiler complications).
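
          One concrete shape such a transport layer could take is to flatten the data-frame columns into named arrays (a sketch only, assuming CmdStan’s JSON data-file input; it is not anything Stata actually ships):

            import json

            # Hypothetical columnar extract from a Stata-style data frame,
            # reusing the variable names from the wage example above.
            columns = {
                "wage":   [11.2, 9.8, 14.1],
                "tenure": [2, 5, 3],
                "age":    [34, 41, 29],
            }

            # CmdStan-style JSON data file: scalars and flat arrays keyed by the
            # names the Stan program declares in its data block.
            stan_data = {"N": len(columns["wage"]), **columns}
            with open("wages.data.json", "w") as f:
                json.dump(stan_data, f)

          The harder part is doing that conversion efficiently and without constraining how Stata represents data internally.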

        • So would it be possible to modify the fundamental design to eliminate the runtime C++ need? Do BUGS / JAGS need compilers at runtime as well?

          What exactly is the core design problem that necessitates a compiler at runtime, a fairly unusual need for any software? Is it the autodiff? Are there other tools, scientific or otherwise, that have runtime compiler dependencies?

        • BUGS and JAGS are both interpreted, so they don’t need compilation. Stan translates a Stan program to a C++ class, which is then compiled and dynamically linked. Anything using the inline package in R does the same thing. Julia does on-the-fly compilation, but like Java, it’s encapsulated. Lots of scientific tools use compilation on the fly because it’s often easy to write a code generator. Torch is based on Lua, which also uses a just-in-time compiler. TensorFlow supplies both a Python interface and a C++ interface, the latter of which can be compiled.

          The autodiff is not the issue; it was just a basic design decision to avoid the overhead of an interpreted language. Perhaps not the best decision in retrospect! Stan started as just a C++ library. There’s no fundamental obstacle to running in an interpreted mode other than speed and the effort it’d take to move in that direction.
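
          To make “compiled and dynamically linked” concrete, here is a toy version of the same pattern, with plain C and gcc standing in for Stan’s generated C++ and whatever toolchain is installed (gcc being on the PATH is an assumption of the sketch):

            import ctypes, os, subprocess, tempfile, textwrap

            # Generate source code at run time, just as Stan generates a C++ class
            # from a Stan program.
            src = textwrap.dedent("""
                #include <math.h>
                double log1p_exp(double x) { return log1p(exp(x)); }
            """)
            workdir = tempfile.mkdtemp()
            c_file = os.path.join(workdir, "model.c")
            so_file = os.path.join(workdir, "model.so")
            with open(c_file, "w") as f:
                f.write(src)

            # The runtime compiler dependency lives entirely in this one call.
            subprocess.check_call(["gcc", "-shared", "-fPIC", "-O2", c_file,
                                   "-o", so_file, "-lm"])

            lib = ctypes.CDLL(so_file)            # the dynamic-linking step
            lib.log1p_exp.restype = ctypes.c_double
            lib.log1p_exp.argtypes = [ctypes.c_double]
            print(lib.log1p_exp(1.0))             # about 1.3133

          Shipping that one compiler invocation inside a commercial product is exactly the install, maintenance, and security headache being discussed above.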

          Any browser these days deals with compiling or interpreting JavaScript. Java requires run-time compilation, but it’s part of the Java virtual machine. Java just does an amazing job of packaging the compiler and making it portable.

          Yes, Stan works with clang++. But adding g++ or clang++ into Stata would be large (presumably much much larger than the rest of their distribution combined) and a huge install and maintenance nightmare as well as a manageable security risk (letting modules compile and link C++ code is dangerous!). Something like Javascript (ECMAscript) is popular because unlike R, Python, Perl, etc., it’s fully encapsulated and doesn’t make system calls itself, so it’s much less risky to adopt (my first professional programming job was integrating an ECMAscript compiler into the SpeechWorks speech recognition back end for semantics—I’ve always worked on parsers and languages).

          It’s not strictly necessary to keep up with underlying software packages if others aren’t installing and linking against them independently. But not keeping up undercuts the utility of using an external package and can make bug fixing difficult in a project like Stan that explicitly does not support older releases.

          @Billy — good point on data transport layers. That can be overcome with Stan’s design. The current Stata thing appears to be rather limited for exactly this reason, but as other people point out, they have room to grow (though they’ll be very limited in the complexity of problems they can solve if they stick to random-walk Metropolis).

  3. “To be fair, if you have 2 natural experiments you should be able to estimate 2 separate causal effects and then get what you want.”
    Why 2 natural experiments to estimate one effect? Wouldn’t one experiment (e.g. a lottery to enter the job training program with perfect compliance with the lottery results) do the trick? (At least for the effect among people interested in the training program). What’s the second experiment you have in mind?

    • Z:

      You want two experiments because in this setting there are two variables being considered as treatments: the job training program and motivation. So you’d want one experiment that manipulates participation in the training program and another that manipulates motivation. If you have just one experiment, you can’t separately manipulate the two factors, so you’ll only be able to estimate one causal effect.

      • I think in the example they just wanted the effect of the job training program and were worried that motivation was a confounder. Of course, it’s not important what they wanted to know in their example; I’m just explaining where my confusion about the two-experiments comment was coming from.

  4. > double robustness is out the window
    Oh no! As bad as the loss of financial security that’s incurred by losing two lottery tickets bought in two different lotteries!
