Mailing List Degree-of-Difficulty Difficulty

The Difficulty with Difficult Questions

Andrew has commented during our Stan meetings that when a user sends an easy question to a mailing list, it gets answered right away, whereas difficult questions often languish with no answers.

These difficult questions usually come from power users with real issues, whereas the simple questions are often ill-formulated or already answered in the top-level doc. So we’re arguably devoting our energy to the wrong users by adopting this strategy.

Of course, this is related to Andrew’s suggestion that this whole blog be called “tl;dr” (i.e., too long, didn’t read).

An Example

On the Stan Users Group, we often get very complex models with a simple accompanying question such as “How can I make my model faster or mix better?” An example is this recent query, which involves a difficult multivariate multilevel model. Such questions require a lot of work on our part to answer. Model fitting is hard and often very problem specific. And it varies by platform — the tweaks you need to do for BUGS/JAGS are Gibbs sampling specific whereas those required for Stan are HMC specific.

Mitigating the Problem

The degree-of-difficulty difficulty can be mitigated somewhat by breaking questions down into simpler, digestible bits. Everyone likes a short question they can understand and answer. Everyone feels good after the exchange.

But breaking a problem down is often impossible for a user: if a user could isolate the problem, they could probably solve it. I often find myself struggling to express a problem to mailing lists such as the Boost Spirit or Rcpp list in terms other than “I tried this and it didn’t work.”

Rubber Ducky to the Rescue?

Sometimes the mere act of breaking a question down into digestible bits leads me to the answer and I can spare the mailing list. This is closely related to the “rubber ducky” phenomenon in debugging, namely that the mere act of explaining a problem clearly often leads to a solution.

Lexicon Worthy?

Maybe Andrew can come up with a better name for this phenomenon than “degree-of-difficulty difficulty” and drop it into his Handy Statistical Lexicon for posterity.

14 thoughts on “Mailing List Degree-of-Difficulty Difficulty”

  1. Bob:

    It’s interesting. Last year, when writing about people who respond on the R help list, I wrote: “sometimes they seem sooooo eager to give snappy answers to stupid questions, that I fear that they go out of their way to answer the easy questions and skip out on the toughies. . . . There’s some way in which correcting error seems so urgent, while more serious exploration can wait.” But now here we are, doing this ourselves!

    On the plus side, we haven’t become so bitter as to answer dumb questions with replies like: “Vague false accusations are unfair to the package maintainers and CRAN team. You were asked not to send HTML mail and to ask on the correct list: see the posting guide” or “So someone is using some very old documentation” or “of course, this is in the ‘if all else fails read the manual, but at least do so before posting’ category. See the posting guide (footer of this and every R-help message).”

  2. I have a bad habit of responding promptly to 25-word emails, but postponing responding to carefully argued 1,000 word personal emails until I can give them the thought they deserve, which usually turns out to be never.

  3. Sadly, what you describe is a fact of life for a lot of open-source software. I run into difficult functionality questions & program bugs on average about once per month that aren’t resolvable via Stack Exchange, etc. I decided to pay up for commercial platforms (MATLAB, MSDN, etc.) in large part to get professional-level support. It’s difficult enough to maintain high standards for correctness & reliability in my own code. I can’t risk excessive project delays or failure due to software platform issues.

    I think open-source entrepreneurs like RedHat & Ubuntu provide a huge safety net of professional support that people can turn to when needed. Stan would surely benefit from this at the appropriate time in its development.

    • I know the situation you describe but going commercial hasn’t helped me much really. A lot of commercial support exposes you to relatively raw front-end staff who aren’t a great help with tough questions. Even if it’s a bug, waiting for it to be assigned and coded is often a long wait. Open source is often way faster at support or bug resolution. Case in point: Check the bug status delay on Red Hat versus a git download from the core Linux kernel.

      What seems to correlate well with reliability & help is user base size. My lesson over the years has been that, for critical applications, I should adopt only software that (a) has been around for a while and (b) has a large / active community, and (c) never install the latest versions.

      • I’m similarly inclined, though of course, all of (a), (b), and (c) are relative. How long, how large, and how recent a version?

        For some reason, most statisticians seem very eager to download the very latest piece of software. We were rather surprised just how many people installed the new Mac OS X (Mavericks) on the first day it was available, which not surprisingly caused Stan installs to fail and took us a bit of time to sort out. Bigger projects can be a bit more proactive.

        It also surprises me that the RTools distribution is always using the latest alpha or beta version of the g++ compiler, whereas R itself is astonishingly conservative in what it allows.

        • One reason there’s this image of “Open Source = unsupported / unreliable” is I think the very large number of Open Source Projects per se. A novice often ends up too quickly falling for an orphan project, or a fork or a little-used package & gets himself burnt that way.

          e.g. For plotting tools, there’s gnuplot, xmgrace, matplotlib & a few other large-userbase tools & literally hundreds of wannabes or niche or orphan competitors. The latter may possibly leave one stranded.

    • One of the beauties of open source software is that you can open the hood yourself to try to diagnose problems. And then submit solutions in the form of patches. Of course, that only works for programmers familiar with the language that the system’s coded in, and works much better for projects with reasonable coding style.

      We licensed Stan under the BSD in part to make it easy to start a company like RedHat or Revolution or Lucid Imagination around it. I find companies like Revolution and RedHat really sketchy in the way they release “enterprise” versions of open-source licenses that skirt the intent of the GPL; GPL is a “copyleft” license requiring any derivative packages that are distributed to be open source. See, for instance, these comments from Ross Ihaka and U. Auckland. And not to start a license pissing contest, but people and institutions (they’re people, too, right?) seem to misunderstand the GPL: there are no clauses about not using it for commercial gain, just a restriction against redistributing it or derivative products that aren’t open source. There are also no clauses about using it internally to create, say, a web service that isn’t open source; it’s all in the redistribution. The BSD license used for Stan is a much more liberal license. (Do I really need to say I’m not a lawyer? I hope that’s obvious.)

  4. I’m one of these bad guys that Bob refers to. I think that it would help a lot to have a short introduction to Stan, a kind of starter kit, perhaps specific to particular areas. Such things exist for R but (as far as I know) not for Stan.

    For example, I mostly fit linear mixed models (hierarchical models) and I would not mind having a short manual that focuses just on that topic. I have such a manual (optimistically titled “lecture notes”) for my students and use it to teach Bayesian data analysis. I am motivated to write one specifically for Stan because I keep running into trouble with Stan, but it would help greatly to have the source .tex code of the manual, so that I can simply quote from the relevant parts. There are a lot of people out there like me who live in a pretty narrowly defined universe, who need to walk down a very specific path to get their work done. Communicating with such people needs something less general than the Stan manual, in my opinion.

    • A bunch of the Stan dev team has been tossing around the idea for a year or so of writing an applied Bayesian data analysis book using Stan.
      But that’s not going to happen for a while.

      The Stan manual is a multi-headed beast: install and getting started instructions for CmdStan, a basic programmer’s guide (how to write models, tips on programming), a reference manual detailing the guts of how the language works, and a reference manual to the built-in functions, as well as a very short intro to MCMC and Bayesian stats. We’re about to rip CmdStan out into its own package and may then split the rest of the manual into two parts.

      How do people feel about The BUGS Book? I’d want to make the Stan book itself open source like the manuals, but provide a printing option via something like Lightning Source through Amazon and other print-on-demand outlets.

      Stan’s open source and the manual’s licensed under CC BY 3, so have at it. The manual’s part of the source download or you can just get it directly from GitHub.

      • I’m a novice to Bayesian data analysis & I found it hard to pick up from the Stan Manual. I’m too cheap to buy Andrew’s book ( :) ) so I’m really hoping someone will put together a document / tutorial that uses Stan but is more helpful than the Stan Manual.

      • Bob, as Rahul pointed out, the size of the user-base is critical for open-source projects. One of the single most effective ways you can get more people interested in & excited by Stan is: create a series of helpful training videos, with complete & accurate transcripts. I just did a Google search for “stan statistics tutorial video” & got no relevant hits on the first results page.

        Be sure to produce high-quality videos. Narrators should have at least above-average, well-hydrated, nicely-cadenced speaking voices. Check out lynda.com & pluralsight. Lynda’s site sets the current gold standard IMHO. pluralsight is much more programmer-oriented, but they botched their back-end system design, unfortunately. Content searches on their site are completely ineffective. Users must be able to search on terms & phrases & see exactly where in the video(s) their matches come up! (lynda does this extremely well)

        I still find the pluralsight site worth $30 / month, as they have over 50 SQL Server courses.

  5. Perhaps this phenomenon can also help explain the success of Facebook and the decline of personal email. Facebook makes short one or two sentence communication a socially acceptable way of communicating with a group of friends, whereas sending the same message, “Wow! Coffee Angel has the best coffee ever!!!” to an old friend you haven’t seen in two years would seem quite lazy.
