Here’s what I think about Occam’s Razor.

To start, I’d take a look at what we do here. I’d start by fitting the model using stan_glmer in rstanarm in R (or glmer in lme4 in R, if you prefer) with varying intercepts for all batches of main effects and interactions of interest. No F-tests, no pairwise comparisons; just estimate everything.
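The partial-pooling idea behind those varying intercepts can be sketched outside of Stan too. Here is a minimal empirical-Bayes version in Python for a hypothetical 6×6 two-replicate design (all data, effect sizes, and variance numbers are invented for illustration; it is a sketch of the shrinkage idea, not of rstanarm itself):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 6x6 oil-by-seal experiment, 2 replicates per cell.
n_oil, n_seal, n_rep = 6, 6, 2
true_effects = rng.normal(0.0, 1.0, size=(n_oil, n_seal))   # unknown cell effects
noise_sd = 2.0
data = true_effects[..., None] + rng.normal(0.0, noise_sd, size=(n_oil, n_seal, n_rep))

cell_means = data.mean(axis=2)     # "no pooling": 36 separate estimates
grand_mean = cell_means.mean()     # "complete pooling": one estimate

# Partial pooling: shrink each cell mean toward the grand mean by a factor
# set by the ratio of between-cell to within-cell variance.
sigma2_within = noise_sd**2 / n_rep
tau2_between = max(cell_means.var() - sigma2_within, 1e-6)
shrink = tau2_between / (tau2_between + sigma2_within)
partial_pool = grand_mean + shrink * (cell_means - grand_mean)

# The partially pooled estimates are typically closer to the truth on average.
err_no_pool = np.mean((cell_means - true_effects) ** 2)
err_partial = np.mean((partial_pool - true_effects) ** 2)
print(err_no_pool, err_partial)
```

The multilevel model estimates the shrinkage amount from the data instead of plugging it in, which is exactly what protects you from taking the noisiest of 36 comparisons at face value.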

I work in an industry where 2-factorial ANOVA experiments are common. For example, we’re testing 6 oils with 6 seals for a total of 36 ‘treatments’ in 2 blocks (so 2 replications, one in each block). In a typical frequentist scenario, we’d fit a linear model with fixed effects for Oil, Seal, and Block plus an Oil:Seal interaction (as we believe oils behave differently with different seals). The typical procedure in a frequentist scenario is an F-test followed by Tukey pairwise comparisons.

I’m having a hard time putting together the building blocks of this model in a Bayesian context. How do I prevent the multiple comparisons problem here?

1. According to the post, as long as I have somewhat informative priors, we should avoid that problem. That is in itself a daunting task: neither I nor the subject-matter experts can say what we believe about each interaction coefficient. From the paper I referenced, the solution is then to build a multilevel model.

2. Do we then treat both factors, oil and seal, as random effects? If we’re interested in making statements only about oils, do we then treat Oil as a random effect and Seal and Block as fixed effects? I’m finding it hard to build an intuition for this model and to justify the choice of priors and/or random/fixed effects.

Hoping this gets some attention and discussion!

Justin

No, R (and Stan) parameterize the normal in terms of location and scale.

Shouldn’t tau be tau^2?

Perhaps in principle we can write a bigger model that all the specifications fit into, but in practice it quickly becomes impossible. Even multilevel models, to handle different subgroups of the data, can be computationally prohibitive if we already have a highly parameterised model. So it’s good to have people thinking about how to perform inference in the presence of methodological ambiguity.

Common LISP is *not* a pure functional language. It’s got all kinds of looping and iterative constructs, setf to set the value of just about anything, lots of printing and formatting stuff, and explicit file-IO mechanisms. But it’s cognitively very different from other languages. One big reason is the macro system, where the language itself is used to write new language constructs.

LISP’s biggest problem seems to be that it just requires more than a High School education and a “learn X in 24 hours” level of knowledge in order to hire someone to do even basic work on a LISP program. For example, NASA hired some great programmers who came up with some kind of fancy autonomous robot planning and execution software which was fantabulously effective at what it was supposed to do, won some award, and then was canned by JPL. One big issue was that no one cheap could work on it. The same thing happened to Yahoo Store (programmed in LISP, bought out by Yahoo, then re-built over a decade in C++, by which time Yahoo had pretty much crashed and burned but just didn’t know it yet).

http://www.flownet.com/gat/jpl-lisp.html

I really want Julia to take off, but I won’t be switching over until a very extensive version of ggplot2 is available ;-) and that gets right at your point about the extensive user community.

Great points.

I’ve found the size, activity, and expertise of the user community is a HUGE factor in how useful a language or package turns out to be.

“…but I’ve never run across a practical problem where they made a lot of sense.”

Lisp was used in AutoCAD and Emacs. Dunno if it meets the approval of functional purity.

I also think it comes down to “horses for courses” as the British say. There are better matrix and math libraries in FORTRAN than in Common Lisp. And better internet libraries for things like unicode and sockets and threading in Java. And C++ has both everything and the kitchen sink in the language *and* in the libraries. It’s enough to make a programming language theorist like me cry.

I honestly don’t know what the functional programming languages are good for. I love them dearly as theoretical constructs (I used to do programming language theory and a lot of typed and untyped lambda calculus), but I’ve never run across a practical problem where they made a lot of sense. Part of that’s just the lack of large communities of programmers in fields I’m interested in.

What I miss the most from the functional programming languages is lambda abstraction. It’s a complete hack in C++ with bind, functors, function pointers, etc., and a hack in Java, too (like functors in C++, but maybe they’re anonymous). You get some of their benefit with continuations in languages like Python and R, and even in C++ you can code things in a continuation-passing style (sort of built into my brain after all that Prolog tail recursion).

Guaranteed, though, no matter what language you choose for a project, a chorus of naysayers will tell you that you should’ve chosen another one. My experience at JuliaCon was everyone telling me we should’ve coded Stan in Julia!

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998

Having done a bunch of Prolog programming back 15 years ago, I honestly think that Stan feels like Prolog at a certain level. Prolog is a method for specifying depth-first searches through discrete spaces; Stan is a method for specifying Hamiltonian searches through continuous parameter spaces. If you don’t acknowledge the search mechanism, you can be doomed in either one.

A Prolog program basically says “find me values of variables that make certain statements true,” and Stan basically says “find me values of variables that make certain statements plausible.”

Does the user have ANY idea what any of it means? When I ask my wife, she gets really frustrated, because she’s basically just doing what someone told her to do. In the end, after making 10 or so ad-hoc choices, are you discovering anything, or just finding data to justify your preconceived notions, or maybe just arriving at a puzzle solution to get past the grant/publication gatekeepers?

Typical bench Biologists really don’t know much more than a t-test and a chi-squared, in my experience. There are more specialist people who call themselves Bioinformaticists, and of course there are Biostatisticians. Neither of those groups would typically have the skills to, say, do sterile tissue culture or debug PCR primer problems, but their level of statistical sophistication is much higher. So it’s a matter of specialization. Bioinformatics/biostatistics seems to be predominantly about fancy methods for adjusting p-values, though. The whole framework is about “there is an effect” vs. “there isn’t an effect”. I’ve even had biologists *explicitly* tell me “I don’t really care whether it’s big or small, just whether adding hormone X produces a difference” or the like.

But, fair enough. I’m being perhaps too glib.

When I was at Carnegie Mellon and Edinburgh, I used to work on logic programming (and some functional programming). It used to drive me crazy when people referred to Prolog as a logical programming language. It was basically depth-first search with backtracking (unless you used cut, which was necessary for efficient non-tail recursion), and if you weren’t aware of this and didn’t code to this, your Prolog programs were doomed. (O’Keefe’s *Craft of Prolog* remains a great programming book, as does Norvig’s book on Lisp.)

And that’s a great use of generics in the statement about “Biologists”! There are lots of biologists who know quite a bit more about stats than that, and many more biostatisticians. The causes for this also involve supervisor and editor pressure to report things in terms of p-values. Mitzi used to work in this area and the editors for a paper she was working on for the ModEncode project for *Science* insisted that they provide p-values for their exploratory data analysis (which involved a clustering model nobody believed was a good model of anything, but useful for exploratory data analysis).

Sure. If you make the number of papers the sole metric, sure, you get truckloads of crap.

But no one’s contesting that. I hope.

What if the final metric of performance is how many papers you produce, and, due to confusion, the limiting factor on publishing papers has become how many “statistically significant” results you can generate?

I think this is what is going on: the way of assessing progress used by many researchers these days is itself fatally flawed. That is exactly why NHST is such a destructive force: it allows you to *think* you are learning something when you are just producing massive amounts of garbage. I am far from the first to come to this conclusion. It was pretty much Lakatos’ position in the 1970s:

“one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage…this intellectual pollution which may destroy our cultural environment even earlier than industrial and traffic pollution destroys our physical environment.”

Lakatos, I. (1978). Falsification and the methodology of scientific research programmes. In J. Worrall & G. Currie (Eds.), The methodology of scientific research programmes: Imre Lakatos’ philosophical papers (Vol. 1). Cambridge, England: Cambridge University Press. http://strangebeautiful.com/other-texts/lakatos-meth-sci-research-phil-papers-1.pdf (pages 88–89)

You could be right. I don’t know.

My point is that I see a lot of arguments claiming superiority of an approach based on the merits of its logical structure and what information it uses, rather than arguments based on its actual performance on outcomes that matter.

My point is that the challenger method must show that the improvement (if any) in the final metric of performance you get by removing shortcomings is worth the additional effort.

True, many areas use simplistic, ad hoc predictive methods that leave information on the table, or have alternatives with richer structure or logical foundations.

But unless the alternative can demonstrate that all its ingenuity actually translates into better performance, or indeed performance so much better that it is worth the additional modelling effort and the cognitive switching cost, I think practitioners are entirely justified in sticking to their same old, flawed, boring methods.

In practice, this step is sometimes done by Biologists looking at these lists and just selecting the stuff they think looks good (i.e., stuff involved in pathways they can imagine being “real”).

It’s making the core concept driving your whole process into a dichotomization that bothers me. There’s a lot more information there.

Do you have evidence that a nuanced approach gives better results?

If you’re talking about time-series regression with discontinuity, you might find my recent post on placing priors on functions interesting:

http://models.street-artists.org/2016/08/23/on-incorporating-assertions-in-bayesian-models/

You could, for example, compute some summary statistic of the function behavior in the vicinity of the change-point, and assert something in the model about its plausibility. For example, abs(f(somewhatafter)-f(atchange)) ~ gamma(a,b) to assert that you think there should be on average a nonzero slope up or down in this region, where you choose a,b so as to constrain the slopes you’re entertaining.

You could do a similar thing for the endpoints of the time interval. Calculate a mean value pre-policy, and calculate a mean value in the asymptotic far-post-policy time period, and assert something about the plausibility of the change in asymptotic behavior.
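The kind of assertion described above can be sketched as an extra log-density term. Here is a minimal numpy version (the time series, the window, and the gamma parameters are all invented for illustration, not taken from the linked post):

```python
import numpy as np
from math import lgamma, log

def gamma_logpdf(x, a, b):
    """Log density of a gamma(shape=a, rate=b) distribution."""
    return a * log(b) + (a - 1) * log(x) - b * x - lgamma(a)

# Hypothetical fitted function values on a time grid, with a policy change at t=50.
t = np.arange(100)
f = 0.1 * t + 2.0 * (t >= 50)        # stand-in for a model's latent time series
t_change, t_after = 50, 60

# Summary statistic: absolute change in f shortly after the change-point.
delta = abs(f[t_after] - f[t_change])

# Soft constraint: assert this change is on the order of a few units.
# gamma(a=4, b=2) has mean 2 and puts little mass near zero or above ~6.
extra_log_density = gamma_logpdf(delta, a=4.0, b=2.0)
print(delta, extra_log_density)
```

In a Stan model this would correspond to an extra `target +=` statement involving a function of the latent series, which is how such assertions about function behavior get folded into the posterior.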

Often this kind of very flexible model can be a substitute for a discrete set of plausible simple models. I mention this, because I recently tried to generalize the state-space vote-share model that was posted a couple weeks back using a Gaussian process, and found it computationally infeasible for 400+ days of data. Looking for a way to specify time-series function behaviors other than Gaussian processes led me to the suggested idea.

I’m using hypothesis tests and p-values to go with the flow, because it’s not my project.

Unfortunately for Biologists, if they understand anything about statistics it’s what you learn in “Statistics for Biologists 101,” which is more or less a t-test and a chi-squared test. So their concept is “there are true targets and there are false targets, and I want to minimize the number of false targets while keeping most of the true targets so that I waste as little money as possible”; they describe this to a classically trained statistician and wind up with FDR procedures.

I think this makes them feel like they’re scientists because they’re “discovering” things, as opposed to engineers who “minimize costs.” But the truth is, *at this stage in the process* they’re looking to trade off costs against the number of useful results. If they formulated their problem that way, they’d wind up with a more consistent and logical framework.
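A toy sketch of that cost-based formulation in Python (the posterior probabilities, payoffs, and follow-up cost are all invented numbers): instead of thresholding p-values, pursue every target whose expected payoff exceeds its follow-up cost.

```python
# Hypothetical screening problem: each candidate target has a posterior
# probability of being real and a payoff if it turns out to be real.
candidates = [
    {"name": "gene_A", "p_real": 0.90, "payoff": 100.0},
    {"name": "gene_B", "p_real": 0.40, "payoff": 100.0},
    {"name": "gene_C", "p_real": 0.20, "payoff": 500.0},
    {"name": "gene_D", "p_real": 0.05, "payoff": 100.0},
]
followup_cost = 50.0

# Decision rule: pursue a target iff its expected payoff exceeds the cost.
pursued = [c["name"] for c in candidates
           if c["p_real"] * c["payoff"] > followup_cost]
print(pursued)  # gene_C makes the cut despite its low probability
```

Note how a low-probability, high-payoff target survives while a moderate-probability one does not, something no pure false-discovery criterion can express.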

Which is to say, aren’t you basically reinventing an ad-hoc version of a Bayesian analysis? The core concept of a Bayesian analysis is basically this: “Here’s a bunch of different possibilities I’m willing to entertain, and some data; which of these possibilities should I entertain after I’ve seen the data?”

The ABC version of how Bayesian analysis works is “pick values from the prior distribution, compute the consequences of the model, and weight these consequences by the likelihood of the error you see between the computed consequences and the data,” which is to say that ABC looks a lot like “try this out and see how it works, try that out and see how it works, try this other thing out and see how it works…”
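That “try it out and see how it works” loop is easy to make concrete. Here is a minimal ABC rejection sampler in Python for a made-up example (estimating the mean of noisy measurements); it is a sketch of the idea, not any particular package’s API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observed data: 50 noisy measurements of an unknown quantity.
true_mu = 2.0
data = rng.normal(true_mu, 1.0, size=50)
obs_mean = data.mean()

# ABC rejection: draw from the prior, simulate, keep draws whose simulated
# summary statistic lands close to the observed one.
accepted = []
for _ in range(20000):
    mu = rng.uniform(-10.0, 10.0)            # draw from a broad prior
    sim = rng.normal(mu, 1.0, size=50)       # compute the model's consequences
    if abs(sim.mean() - obs_mean) < 0.2:     # "see how well it works"
        accepted.append(mu)

posterior_mean = np.mean(accepted)
print(len(accepted), posterior_mean)
```

The accepted draws approximate the posterior: most prior draws are rejected, and the survivors cluster around values of mu consistent with the data.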

So, you have some kind of time-series of health outcomes, and a policy change-point, you have various ideas about how to describe the data itself (reasonable bandwidths to get summary statistics or whatever), and various ideas about how the summary statistics should change post-policy-change (broken-stick linear models, exponential decay to a new stable state, an initial confusion causing worse outcomes followed by a decay to a new stable state where things are better, nothing happens at all post change…)

Take your existing set of analyses as a kind of model-search, incorporate all the model-search ideas into one Bayesian model, put broad priors on it all, run the Bayesian machinery, and get a posterior distribution over what seems to be true.

normal(3e8, 3e7) m/s is probably good (the actual value is technically defined to be 2.99792458e8)

but exponential(1/3e8) is just fine for lots of purposes. Imagine you have a measurement system capable of measuring to 10% accuracy. After one measurement your posterior is going to be ±0.3e8, so the fact that the exponential prior includes values from 3 m/s out to 9e8 m/s that would still be considered within a high-probability region is irrelevant.
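A quick numerical check of that claim, using a grid approximation in Python (the single 10%-accuracy measurement is invented): the posterior means under the informative normal prior and the vague exponential prior come out within a few percent of each other.

```python
import numpy as np

# Grid over plausible values of the speed of light (m/s).
c = np.linspace(1e6, 1.5e9, 200_000)

# One hypothetical measurement with ~10% accuracy.
meas, sd = 3.0e8, 3.0e7
log_lik = -0.5 * ((c - meas) / sd) ** 2

# Two priors: an informative normal vs a vague exponential.
log_prior_normal = -0.5 * ((c - 3.0e8) / 3.0e7) ** 2
log_prior_expon = -c / 3.0e8

def posterior_mean(log_prior):
    # Normalize on the grid (subtracting the max avoids underflow).
    log_post = log_prior + log_lik
    w = np.exp(log_post - log_post.max())
    return (c * w).sum() / w.sum()

m_normal = posterior_mean(log_prior_normal)
m_expon = posterior_mean(log_prior_expon)
print(m_normal, m_expon)  # both close to 3e8
```

The exponential prior drags the posterior mean down only slightly (by roughly sd²/3e8, a ~1% shift), so for this measurement precision the choice between the two priors barely matters.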

The typical problem with experts in a field without high precision information is that their priors are TOO NARROW, and so if you ask several experts you’ll get a bunch of tight intervals that don’t even overlap. The solution is to use a prior that encompasses everything they all say.

What’s the a priori frequency of failures of space-shuttle launches? The NASA managers said 1e-7, the rocket booster engineers said 0.1. The solution? beta(.5,5) or something thereabouts, or maybe uniform(0,.5) or exponential(1/.05) truncated to 1, or normal(0.05,.1) truncated to [0,1]… anything that includes all the experts’ favorite regions of the parameter space is reasonable.

My recent set of posts and commentary about Cox’s theorem makes it clear: the “correctness” of a Bayesian probability model is a question of whether the state of information it’s conditional on is well summarized by the choice of distributions, not whether there’s an objective fact about the world that you’re trying to match.

This is strange to me. This comment seems to say that statistical significance means finding strong evidence (it doesn’t), but it also recognizes that statistical significance is just something that inevitably happens when you look at some data.

My personal preferred practice when faced with what you call “whimsical” decisions is to code up a loop and try every combination. But then, having obtained all this data, there doesn’t appear to be any widely recognised statistical way to analyse and report it. Model averaging??
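That loop might look something like this in Python (the dataset and the particular analysis decisions are invented for illustration): enumerate every combination of choices, run the same analysis, and keep the whole distribution of estimates rather than any single one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Invented dataset: outcome y vs predictor x, with a couple of gross outliers.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
y[:2] += 15.0

# The "whimsical" analysis decisions, each with a few defensible options.
choices = {
    "trim_outliers": [True, False],
    "log_abs_x": [True, False],
    "subset": ["all", "first_half"],
}

estimates = {}
for combo in itertools.product(*choices.values()):
    trim, log_x, subset = combo
    xs, ys = x.copy(), y.copy()
    if subset == "first_half":
        xs, ys = xs[:100], ys[:100]
    if trim:
        keep = np.abs(ys - ys.mean()) < 3 * ys.std()
        xs, ys = xs[keep], ys[keep]
    if log_x:
        xs = np.log(np.abs(xs) + 1e-6)
    slope = np.polyfit(xs, ys, 1)[0]          # the estimate we care about
    estimates[combo] = slope

# Report the full multiverse of results, not one cherry-picked number.
print(len(estimates), min(estimates.values()), max(estimates.values()))
```

Reporting the full spread of estimates across combinations is essentially the multiverse analysis mentioned in the reply below, just done by brute force.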

Long answer, I’d fit a multilevel model. Short answer, I’d take what you did already and call it a multiverse analysis as in this paper (which I guess I should blog sometime).

If you want a single p-value, I recommend taking the mean of the p-values from all the different analyses you did.

But my clients want to write this up as a negative result, and so they should. So, with the intention of showing we had been thorough, not with any intention of p-hacking, and because we really didn’t have strong beliefs about various analytical decisions we had made, we redid the analysis with a range of “garden of forking paths”-style tweaks to the sample selection, the bandwidths, the covariates, etc. Now we have a couple of hundred estimates on the same dataset, and they are very noisy, with different signs etc. Inevitably a couple of effects have significant p-values.

What would be the proper way to approach this? It doesn’t seem to fit into a hierarchical framework.

Daniel: thanks for the link to your experiment that’s nicely matched to just this question.

We (I’m a maths guy, pretending to be a statistician, advising bioinformaticians how to help scientists) tend to use FDRs in Andrew’s screening sense, where we threshold at a fixed FDR and then rank on (if anything) effect size. But more commonly the hits are then taken as the starting point for the scientist building a narrative and performing more focused experiments.

I got my PhD from OSU, and we actually overlapped in 1997 when you visited for your talk. We had dinner together (with Carl Pollard). I must say I am surprised that even OSU had that reaction to you and Bird and Sproat. But it looks like the so-called non-linguists won in the end there, because the current profs in computational linguistics at OSU would by the 1997 standards also not be considered linguists.

About the NSF rejections: it seems that reviewers routinely use the review process to control the direction the field is going in (the same happens in Europe). The program officer/funding agency often also has entrenched interests and/or political concerns in mind when they make their decisions. Science is more about politics and control; the science happens in spite of this whole expensive machinery for funding and jobs.

It is probably for the best that you ended up where you are and not in linguistics.

All of the academic fields I’ve been exposed to (linguistics, cognitive psych, computer science, and statistics) have a certainty bias along with a sweep-the-dirt-under-the-rug bias, so that people don’t hedge claims or list shortcomings for fear of papers being rejected.

I know what you mean. I think the problem here is that most math curricula were designed in a pre-computer-algebra age and are relics of a time, not so long ago, when, if you couldn’t do an integral or a PDE, you couldn’t just fire up Wolfram Alpha or Mathematica or Maxima and ask it to do it for you. The best you could do was run over to the library’s Reference Section and pore over a thick, dusty handbook of integrals or tabulated PDE solutions and boundary conditions.

But you are right, a lot of math tools being taught to undergrads are archaic. The Math Departments have been too slow to change.

Sorry, maybe I wasn’t clear about my point: I didn’t mean that the Math class would make them literate about p-values, t-tests, the intricacies of the pitfalls, etc.

My simple point is that for @digithead it will be really hard to get these concepts across to a student body that has, as you put it, “elected out of being numerate”.

Ergo, if you make them take Calculus or similar foundational courses, then an “Advanced Quant Methods” course will start making a lot more sense. It makes “digithead”’s life easier, and the students learn more too.

Of course, there may be some students who are simply not capable of or motivated to deal with Calculus 101, but then I wonder whether we should really be asking them to work on Soc Sci problems that actually need “Advanced Quant Methods”.

The other stream is one introductory-level course in stats where I deliver the basic ideas informally, mostly using simulation (Sanjoy Mahajan inspired me a lot over the last year after Andrew blogged on his books, and strengthened my belief that this was the right way). This is for the majority of the non-numerate grad student population and leaves them capable (if they do well in the exam) of correctly carrying out t-tests and fitting hierarchical linear models. I don’t teach Bayes at all in this stream because the goal is to get them to at least do the bare minimum correctly with frequentist statistics, and (importantly) because their own future PhD advisors will probably know nothing of Bayes and will likely react with puzzlement at best when presented with a Bayesian analysis.

BTW, in the stats-lite course I skip ANOVA entirely, but students keep clamoring for it, because even now, in 2016, they mostly see ANOVAs in papers. I stubbornly refuse to teach ANOVA.
