
How many Stan users are there?

This is an interesting sampling or measurement problem that came up in a Discourse thread started by Simon Maskell:

It seems we could look at a number of pre-existing data sources (eg discourse views and contributors, papers, StanCon attendance etc) to inform an inference of how many people use Stan (and/or use things that use Stan). We could also generate new data (eg via surveys etc). Do we know the answer and/or how best to work it out?

The cleanest way to do this would be to start with a list of the population of possible Stan users, then survey a random sample of them, ask if they use Stan, and extrapolate to the population. But we can’t do this because no such list exists. We could count Stan downloads, but downloads are not users: we assume that lots of the downloads are automatic, and people might download Stan and then use it only once, or not at all.
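To make the extrapolation step concrete, here is a minimal sketch of what the calculation would look like if such a list did exist. The function name, the normal-approximation interval, and all the numbers are assumptions made up for illustration; none of them come from real Stan data.

```python
import math

def estimate_users(population_size, sample_size, users_in_sample, z=1.96):
    """Extrapolate a simple random sample to the full list.

    Returns (estimate, lower, upper) for the number of users, using a
    normal approximation with a finite-population correction.
    """
    p_hat = users_in_sample / sample_size
    fpc = (population_size - sample_size) / (population_size - 1)
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size * fpc)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return population_size * p_hat, population_size * lower, population_size * upper

# Hypothetical inputs: a list of 100,000 possible users, 1,000 of them
# surveyed, 50 of whom say they use Stan.
estimate, lower, upper = estimate_users(100_000, 1_000, 50)
print(f"{estimate:.0f} users (approx. 95% interval {lower:.0f} to {upper:.0f})")
```

The arithmetic is the easy part. The hard part, as noted above, is that the list to sample from does not exist.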

Lauren Kennedy suggests doing a snowball or network sample using contributors to the Stan Forums as a starting point.

Snowball sampling could work. There could be other ideas too. Please offer your suggestions in comments.

Here are my thoughts:

1. A natural first step in any research project is to read the literature. There must be some estimates of the numbers of users of other programming languages such as Python, R, C++, Julia, Bugs, Stata, etc. I don’t know where these estimates come from, but looking at them would be a start.

2. If we’re gonna do a survey to estimate the number of Stan users, it perhaps makes sense to expand the project and simultaneously estimate the number of users of some other programming languages too, both for efficiency (with little more effort we can get information that will be of interest to others) and to get comparisons: comparing the uses of different languages in our survey and also comparing our estimates to estimates that have been obtained by others.

3. We should also think about how the survey could be done again in the future. If we have a good estimate of the number of users, we might want to repeat the procedure every year or two to get a sense of trends.

4. How many Stan users are there? What’s a “Stan user”? Does this include users of rstanarm and brms? What about people who only use Stan through Prophet—does that count? Do we want to count ever-users or current users? How often must you use Stan to count as a user? What if you took a class that used Stan? Etc.

The point of this last set of questions is not that we need a precise definition of Stan user, but rather that we should ask a battery of questions to get at mode and frequency of use. Also, we should consider how we might want to summarize and interpret the results: we should think about this before we conduct the survey (rather than doing the usual thing of gathering a bunch of data and then deciding what to do with it all).
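One way to plan the summary ahead of time is to write down several candidate definitions of “Stan user” and count respondents under each. The sketch below assumes a hypothetical questionnaire with three items (interface, ever used, uses in the past year); the field names, thresholds, and example responses are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Response:
    # Hypothetical survey items, not an agreed questionnaire.
    interface: str       # e.g. "rstan", "brms", "rstanarm", "prophet", "none"
    ever_used: bool      # has the respondent ever used Stan in any form?
    uses_per_year: int   # self-reported uses over the last 12 months

# Respondents with these answers do not write Stan programs themselves.
INDIRECT = {"brms", "rstanarm", "prophet", "none"}

# Each definition is a rule mapping a response to True/False.
DEFINITIONS = {
    "ever-user": lambda r: r.ever_used,
    "current user": lambda r: r.uses_per_year > 0,
    "regular user (monthly or more)": lambda r: r.uses_per_year >= 12,
    "writes Stan directly": lambda r: r.uses_per_year > 0 and r.interface not in INDIRECT,
}

def count_by_definition(responses):
    """Count how many responses qualify under each candidate definition."""
    return {name: sum(1 for r in responses if rule(r))
            for name, rule in DEFINITIONS.items()}

# Made-up responses to show how the counts diverge.
responses = [
    Response("rstan", True, 40),
    Response("brms", True, 3),
    Response("prophet", True, 0),
    Response("none", False, 0),
]
counts = count_by_definition(responses)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Reporting all of these counts side by side avoids having to settle on a single definition.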

17 Comments

  1. Alex says:

    I thought Stack Overflow might have something for you, but it doesn’t look like they asked about Stan https://insights.stackoverflow.com/survey/2019

  2. JA says:

    Telemetry is one obvious way to estimate, with the caveat that it would only measure usage of versions released after telemetry is enabled,
    and we would need to take steps to ensure privacy and account for bias introduced by user opt-in.

    • That would be one way to do it. But it’s not something we would even consider, even if we could get it by CRAN.

      Though we could ask people to do something like email us if they’re using Stan or ask them to fill out a web form. I think seeding that process and asking the people who fill out the form to ask others to do the same is the “snowball” idea from the original post.

    • Logit says:

      @JA, are you aware of any good and already-existing telemetry data sources that can report on specific software that’s installed on a large sample of PCs? I would think that telemetry data from workplace PCs (and Stan users) would be particularly hard to come by.

  3. Could automated downloads have a different signature than manual ones? (like for example they occur on a precise or repetitive schedule or the software gives a user agent string that’s helpful in the logs?) Maybe you could classify them and at least get a decent estimate to within a factor of 2 or something?
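A rough sketch of the classification idea, assuming a hypothetical download log with one row per download: (client id, user-agent string, timestamp in seconds). The list of user-agent hints and the regularity threshold are illustrative guesses, not tuned values, and real mirror logs may not expose any of these fields.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Substrings that suggest a script or CI job rather than a person (assumed list).
AUTOMATED_AGENT_HINTS = ("curl", "wget", "bot", "travis", "python-requests")

def looks_automated(user_agent, timestamps, min_events=4, max_cv=0.05):
    """Flag a client as automated by user agent or by a near-constant schedule."""
    agent = user_agent.lower()
    if any(hint in agent for hint in AUTOMATED_AGENT_HINTS):
        return True
    if len(timestamps) < min_events:
        return False  # too few downloads to judge regularity
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # every download at the same instant
    # Coefficient of variation of the gaps: close to 0 means a fixed schedule.
    return pstdev(gaps) / average_gap <= max_cv

def classify(log_rows):
    """Group rows by (client, user agent) and label each group."""
    by_client = defaultdict(list)
    for client_id, user_agent, timestamp in log_rows:
        by_client[(client_id, user_agent)].append(timestamp)
    return {key: looks_automated(key[1], stamps) for key, stamps in by_client.items()}

# Made-up log rows for illustration.
log_rows = [
    ("a", "Mozilla/5.0", 0), ("a", "Mozilla/5.0", 5_000), ("a", "Mozilla/5.0", 90_000),
    ("b", "Mozilla/5.0", 0), ("b", "Mozilla/5.0", 86_400),
    ("b", "Mozilla/5.0", 172_800), ("b", "Mozilla/5.0", 259_200),
    ("c", "Wget/1.20", 10),
]
labels = classify(log_rows)
manual_clients = sum(1 for automated in labels.values() if not automated)
```

Counting the groups labeled as manual gives a crude upper-level estimate of human downloaders, which is still not the same thing as users.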

  4. JP says:

    Don’t know why you need to do this, but assuming it is a valid question, and in light of the limitations mentioned above, you could do an automated literature search. Presumably, people using Stan would mention it in their academic or corporate research. Downloads and Google searches, which TIOBE and PYPL use to rank the most popular languages, are meaningless; no one I know in software engineering trusts these surveys.

    • Nobody’s forcing us to survey the Stan user community, but we have several motivations:

      • we’d like to know who uses Stan to do what in order to tailor development to our user needs
      • grant funding agencies and foundations like to know the scope of what they’re funding [being big is both a positive and a negative there]
      • some grants have deliverables that involve growing the community, which means we have to be able to measure it somehow if we want to demonstrate we succeeded

      We can measure source contributions, papers authored hitting search terms, Stan forums sign-ups, etc. Although we might assume these are proportional to the population, that proportion changes if the makeup of the population changes. For instance, relatively fewer academics means relatively fewer papers.

      We have over 3000 people registered on our Discourse forums and the traffic’s steadily increasing. There are also over 3000 papers on Google scholar if you search for

      ("mc-stan.org" OR "Stan development team")
      

      There have been millions of downloads, but what does that mean? We have very limited ways to measure downloads (like the RStudio mirror).

  5. Guillem says:

    It sounds a lot like a Bayesian evidence synthesis situation, i.e. build an estimate from different data sources.
    For instance, this problem reminds me of the estimation of HIV prevalence in the UK (cf. https://insights.ovid.com/crossref?an=00002030-201011270-00012).

  6. Slutsky says:

    This is an interesting problem, and I have thought a lot about a similar problem in the context of what I should recommend to students, i.e., should they learn R or Python, or why it may make sense to switch from, e.g., Stata to R or Python.

    In particular comparing usage between different languages is difficult because some languages are more universal programming languages, whereas others are more specialized to data wrangling and analysis (R) or estimation (Stan). On top of that, some languages are more difficult than others, which will affect, e.g., the number of questions on stackexchange, which would also be a nice measure of popularity.

    To my knowledge, the best source at the moment that describes the complexity of this task is this: http://r4stats.com/articles/popularity/

  7. Allen Riddell says:

    What about contracting with a market research/polling company which maintains a very large panel or otherwise knows how to reach people in science and industry?

  8. Sam says:

    In the purely academic world, if Stan or related packages were used they should generally be listed in the articles. For example, on PubMed there are 337 published articles mentioning “rstanarm” or “brms”, and 789 with those terms or “Stan”. PubMed doesn’t index most stats journals or non-medical journals either. It’s not clear how many unique authors those represent, but one could go down this route to try to find all the academics using these packages.
