How to model a non-monotonic relation?

Gur Huberman was discussing with me some examples of medical studies with non-monotonic effects, and he asks, What’s a clinician to do?

I have no idea what clinicians should do, but I do think that non-monotonicity can be very difficult to estimate. There are lots of examples showing this; see, for instance, this ridiculous study from a few years back.

Non-monotonicity is nonlinearity, and as such it can be thought of as a sort of interaction, and you need 16 times the sample size to estimate an interaction as you need to estimate a main effect. So, yeah, straight-up empirical approaches such as fitting splines (or, God forbid, high-degree polynomials) won't in general do the job: such models can show you a range of curves that are consistent with the data, but they won't typically give you estimates that you'd want to plug into decisions.

My current thought about modeling a function g(x) that goes up and then down (or down and then up) is not to fit a quadratic curve or piecewise linear curve or whatever, but rather to express it as g(x) = g1(x) + g2(x), where g1 is strictly increasing with an asymptotic threshold (i.e., a function that kinda looks like 1 – exp(-a*x)), g2 is strictly decreasing from an asymptote (i.e., a function that kinda looks like a mirror-reflected version of g1), and then you’ll get non-monotonicity in a natural way. The idea is that g1 and g2 represent different processes so it makes sense to have separate models for them.
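For concreteness, here's a minimal sketch of the kind of shape I have in mind (nothing fitted; the functional forms just follow the 1 – exp(-a*x) idea above, and all the parameter values are made up):

```r
# g1 increases toward an asymptote; g2 decreases toward an asymptote.
# The sum goes up and then down when a1*b1 > a2*b2 (it rises at first)
# and b1 > b2 (the increasing process saturates before the decreasing one).
g1 <- function(x, a1 = 1.0, b1 = 0.50) a1 * (1 - exp(-b1 * x))
g2 <- function(x, a2 = 0.8, b2 = 0.05) -a2 * (1 - exp(-b2 * x))
g  <- function(x) g1(x) + g2(x)
curve(g, from = 0, to = 60, xlab = "x", ylab = "g(x)")   # up, then down
```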

For example, in sports, ability increases with age and then decreases. This is called the “aging curve”; many years ago Bill James estimated that baseball players on average perform best around age 27. The point is that the increase and decrease are different phenomena: players increase in ability when young as they get stronger and learn how to play the game; they decrease in ability when older as they get slower and have more difficulty recovering from injuries. The simplest idea would be to have one curve for x < 27 and another curve for x > 27, but I find it more appealing to model it as g1 + g2.

Not that I’ve actually done this; it just makes sense to me as a natural way of modeling.

23 thoughts on “How to model a non-monotonic relation?”

  1. Andrew, I’m confused: why isn’t your proposed solution just akin to the polynomial approach with a different basis? Not precisely, but conceptually. It feels like putting a semi-structural interpretation on a similar approach.

    • Z:

      First, the basis matters. Polynomials do horrible things: they blow up instead of asymptoting at the extremes. Polynomials are also symmetric, and there’s no reason to expect the increasing and decreasing portions of the function to be symmetric.

      Second, I like the idea of an additive decomposition. Of course that’s just a start; there can be cases where a multiplicative decomposition or something more complicated makes sense. I think there’s value in modeling the different processes. At first it might seem silly to model g by writing g = g1 + g2, but if we have different stories for g1 and g2, this can suggest different models.

      Another way of putting this is that my proposed reparameterization is an attempt to reformulate the problem in a way that allows the introduction of prior information in the form of families of increasing curves. It’s hard to estimate a non-monotonic function, so I think priors are very important here. Try to do it without priors and the results can be very arbitrary, with researchers taking data and drawing whatever conclusions they want from their squiggly graphs.

      • I’ll just amplify the importance of constructing models with meaningful parameters. As Andrew says, this is critical for placing priors, since we need to use domain knowledge for that. But also as Andrew says, meaningful parameters help with interpretation and prediction too.

        Meaningful parameters help with interpretation because they correspond directly to what we want to learn about. We don’t care how much player ability changes with age cubed—age cubed doesn’t mean anything. We care about when ability stops increasing and starts decreasing and by how much. If we can write functions g1 and g2 with parameters that can be directly interpreted as “change point”, “rate of increase”, and “rate of decrease”, then the stage for inference is set.

        Moreover, using models with meaningful parameters makes it easier to manage the model’s behavior when extrapolating, since we can use domain knowledge to construct functions which make sense. If we know that a polynomial does not reasonably describe the relationship (as Andrew says, polynomials often imply explosive behavior at the extremes), then we should pick functions that do a better job. Picking such functions goes hand-in-hand with ensuring that the parameters of those functions are meaningful.
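        As a hypothetical illustration (my own notation, not anything from the post): if g1(x) = a1*(1 - exp(-b1*x)) and g2(x) = -a2*(1 - exp(-b2*x)), then the “change point” isn’t a raw parameter, but it falls straight out of the meaningful ones:

        ```r
        # Peak is where g'(x) = 0, i.e. a1*b1*exp(-b1*x) = a2*b2*exp(-b2*x).
        # This needs b1 > b2 and a1*b1 > a2*b2 for an up-then-down curve.
        peak_x <- function(a1, b1, a2, b2) log((a1 * b1) / (a2 * b2)) / (b1 - b2)
        peak_x(a1 = 1.0, b1 = 0.50, a2 = 0.8, b2 = 0.05)   # about 5.6 in these units
        ```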

        • I’m going to go with an intermediate version. Typically, most real-world issues have bounded domains. The age of a baseball player is certainly a number between 0 and 200 years, and most likely between 15 and 50, so the asymptotic behavior above 100 is irrelevant. On closed domains, polynomials are a complete basis, so they can approximate any continuous function; it’s perfectly fine to use, for example, Chebyshev polynomials for these problems. What’s needed is a reasonable prior on the behavior of the functions. The right way to get that is usually to use a functional that maps paths to the real line: treat that number as a log-prior term and incorporate it into your prior.

          Sometimes it’s easier to use purpose-built bases such as the ones Andrew mentions, but other times it’s fine to use a complete basis and simply restrict the probability to curves that “do the right thing” through this kind of method.
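          A rough sketch of the kind of path-based log-prior I mean (hypothetical code; the names, target curve, and tolerance are all made up):

          ```r
          # A log-prior on Chebyshev coefficients, defined by how far the implied
          # curve strays from a target path, to within a tolerance.
          cheb_design <- function(x, K) outer(acos(x), 0:(K - 1), function(t, k) cos(k * t))

          log_prior_path <- function(coefs, target_fun, tol,
                                     xgrid = seq(-1, 1, length.out = 101)) {
            f <- drop(cheb_design(xgrid, length(coefs)) %*% coefs)
            -sum((f - target_fun(xgrid))^2) / (2 * tol^2)   # Gaussian penalty on the path
          }

          # e.g., favor curves near a hump-shaped target, to within tol = 0.2:
          target <- function(x) 1 - x^2
          log_prior_path(rnorm(10, 0, 0.3), target, tol = 0.2)
          ```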

  2. I’m grateful for this post, since it addresses one of my main concerns: learning from statistical relationships to better understand individual cases. If you have a specified g1 and g2 that correspond to potentially observable characteristics, you’re in a much better place to predict where an individual case is likely to fall relative to the model (its residual). If you just have a fitted polynomial, you just have the polynomial and its error band. Does that sound right?

  3. I’ll be succinct, at the risk of sounding pedantic. Despite the specialized context, your colleague is asking two questions we all ask all the time. “How do I measure this noisy effect with more precision?” and “How do I measure this complex process that I don’t understand?” The answer to the first question is to increase the resources that are brought to bear on the problem, like better measures, clever designs, larger samples, etc. The answer to the second question is to do more basic research on how to model causes of effects, and design measures for those causes.

    Okay, he’s also asking a third question we all ask all the time: “Are there any reasonable shortcuts?” No shame in that. The good news is, his lack of power is not simply due to a small effect. That means there are (as you and others point out) potential analytical solutions that don’t necessarily apply in the general case of low power. I’m particularly intrigued by the idea that measuring/contrasting increasing vs. decreasing != measuring/contrasting high vs. low. Not sure what to do with the idea, but I’m intrigued by it.

  4. Okay, bear with me. Suppose you had a variable/measure that increases nonlinearly with a linearly-increasing construct, but decreases linearly with the same construct linearly decreasing. For example, you’re interested in a correlation r that ranges from 0 to 1 between the given constructs. Anytime the correlation increases by quantity A, your measure Z changes by a factor of A/r. Anytime the correlation decreases by A, Z changes by a factor of Ar.

    Something very similar can–and has–been done, but not for direction, just for magnitude (to my knowledge). It just depends on constructing a variable that has a sampling distribution with particular characteristics.

  5. I don’t know much about it except that it is immensely successful in constructing surrogate models for optimization, but there is a method called kriging (spelling?) that I think has a relatively sound statistical basis. I’m not an expert though, I just use the models.

      • Sorry, got my comment cut off. The rest was supposed to be:

        kriging and GPs are indeed often described as being “nonparametric” in the sense of being very flexible, and so they fall into the family of “straight-up empirical methods” described above: not enough data.

  6. Andrew, I’m very curious about your take on GAMs, such as those fit by the `mgcv` package (or the work by Simon Wood in general, I guess). Not calling you out here; I’m genuinely interested in how you view this work and when it’s an appropriate tool to use. You may dodge this, as needed.

    I get that tossing high-order polynomials into a GLM is “no bueno”, but aren’t penalized splines supposed to help avoid those problems by not behaving so wildly? That’s setting aside the issue of extrapolation with splines, which will get no argument from me.
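    For concreteness, the sort of penalized-spline fit I mean (a made-up sketch with simulated data, nothing from the post):

    ```r
    library(mgcv)

    # Fake aging-curve-ish data, just to show the descriptive smooth.
    set.seed(1)
    age <- runif(300, 18, 40)
    ability <- 2 * (1 - exp(-0.4 * (age - 18))) - 0.08 * (age - 18) + rnorm(300, 0, 0.3)

    fit <- gam(ability ~ s(age, k = 10), method = "REML")  # penalized thin-plate spline
    plot(fit, shade = TRUE)                                # smooth, but purely descriptive
    ```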

    • Presumably part of the issue here is that whilst a GAM fit gives you an empirical/descriptive “squiggle”, it doesn’t generally give you an easily interpretable parameter estimate that can be used to make decisions (this was my reading of Andrew’s post anyway; very happy to be corrected by someone more knowledgeable). Feels like part of the distinction between descriptive inference and “causal”/process-based inference.

      • Olip:

        Yes, I think GAMs and GPs and splines and lowess can be useful in getting a sense of what can be learned from the data alone. Typically the result will be noisy and I’ll want to use prior information (including assumptions about functional forms, not just prior distributions on parameters) in order to get something I can use for decision making.

        • Ah, I’m following the critique now. I also unfairly skimmed over the “you need a bunch more data” piece (because that’s typically never relevant to my work), but when you put it all together I see the point.

          Stats is hard, and decision analytics can be even harder!

  7. Okay, last idea, I promise. It seems to me that certain processes may be more variable (or, perhaps, are less reliably measured) when they are increasing than when they are decreasing. For a completely hypothetical example, maybe people give more consistent ratings of pain when the pain is increasing than when it’s decreasing. Under that very particular, very strong assumption, you could model non-monotonicity via a latent growth model in which variance (or covariance) predicts sign of slope.

    The nice thing about this approach is that the correlation between variability and direction of change doesn’t have to directly involve the main outcome measure. You might have a second measure that’s completely redundant with the main measure in terms of expected value but not variance, or it might be an insufficient statistic where the main statistic is sufficient. Or the change in variance might be methodological in nature: you measure the outcome three ways (e.g., self-report, behavioral observation, biological measure), and what you covary with the slope’s sign are the different functional forms of the different measures. Again, all speculative, but potentially useful across a wide range of intervention studies.

  8. I made a GitHub Gist to give an example of the path-integration functional prior idea. I owe a lot of credit to Joseph Wilson for this idea; though it’s something I had done before, even on my blog, he definitely encouraged me to look into the general case.

    https://gist.github.com/dlakelan/6204895ecce9b91040745ed5662daff1

    Download Julia here: https://julialang.org/downloads/

    Download the file from the Gist. Run julia and include("PriorOnFuncs.jl").

    It will produce a plot of 30 draws from this prior.

    This uses a generic 10 term Chebyshev polynomial to approximate a function, but the prior for the function is that it follows a specified path to within a specified tolerance.

    There’s nothing wrong with high-dimensional polynomials on closed intervals, as long as you tell the polynomial what you think it should be doing. There are many ways to tell it what to do; this is just one example.

    • This is cool stuff, thanks for sharing!

      That said, I disagree with the statement that “There’s nothing wrong with high dimensional polynomials on closed intervals”. As I noted in my comment above, amplifying what I understand to be Andrew’s point, I think there is something “wrong”, which is that they obscure the mapping between the behavior of the function and hypothesized theoretical constructs. It’s just that there’s something “wrong” with every approach, and whether any particular “wrong” matters depends on your goals and givens.

      Building a model with meaningful parameters also has plenty of possible “wrongs”. The main thing is that the model may be wrong, in the sense that it fails to make a good approximation to the processes that generated your data. Of course, you can be wrong with any approach (e.g., maybe the real function doesn’t actually go up then down), but the more assumptions you make, the more likely one of them is to fail. That said, working with a model with meaningful parameters can make it easier to diagnose problems with the model.

      The second thing wrong with models with meaningful parameters is that building one requires domain knowledge. A lot of the time, we don’t have that knowledge or can’t express it formally. In such a situation, something like your approach would be a great middle-ground, as you say in your comment above.

      More broadly, I guess all I’m saying is that modeling is hard and requires making tradeoffs depending on your goals and what you have to work with. Personally, I’m always more comfortable working with models with meaningful parameters because I’m usually working with reasonably clean data from well-controlled experiments and my goal is inference about possible causal mechanisms. That said, if I were in a situation where a high-quality functional description was the goal or where high-quality prediction was the goal, I would probably rely more on the kinds of “non-committal” approaches like what you describe.

      • It is possible to build on top of this approach to make things more mechanistic. While using the high-dimensional polynomial as the infinitely flexible basis, one can fashion parameters that describe the functional behavior that constrains the polynomials. For example, a Lagrangian in Lagrangian mechanics “picks out” a particular path because it behaves according to a generalization of Newton’s equations. You can do something similar for other processes. Let’s say you’re trying to describe something like the immune response of different age groups. Define the immune response of 20-year-olds as the reference level. Then you’re looking for a dimensionless ratio between a given age group and the response of an average 20-year-old. You can draw any curve you like using Chebyshev polynomials, but the curve must go through (20, 1) by definition (a rough sketch of this is below). Also, you can put a prior on the mean rate of change of behavior, and perhaps that’s a function of age and of health conditions like diabetes or cancer onset, which are well understood to have certain rates in the population; this informs the path-functional that assigns probability (density) to the path. Thinking in terms of dynamics and differential (or at least difference) equations could be advantageous and provide you with a mechanistic and interpretable model.

        The difference between this approach and “throw 10 terms into a polynomial regression and minimize the sum of squared errors” is quite large, of course. My point is really just that the problems of OLS on high-dimensional polynomials are problems of that *entire system of modeling*, not of polynomials as function representations per se.
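        As a rough, hypothetical sketch of the “must go through (20, 1)” constraint (made-up code, not the Gist):

        ```r
        # A Chebyshev-basis curve forced through (20, 1) by construction,
        # as in the immune-response example above.
        cheb_curve <- function(age, coefs, lo = 0, hi = 100) {
          z <- 2 * (age - lo) / (hi - lo) - 1                  # map age to [-1, 1]
          drop(outer(acos(z), seq_along(coefs) - 1, function(t, k) cos(k * t)) %*% coefs)
        }
        ratio <- function(age, coefs) 1 + cheb_curve(age, coefs) - cheb_curve(20, coefs)
        ratio(20, rnorm(10))   # exactly 1, whatever the coefficients; priors act on the rest
        ```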

  9. An interesting real-world example of this I’ve seen is cooling/heating degree days. For modeling electricity demand/consumption this is a very common approach. It builds on the assumption that there is some middle point between heating and cooling, and as you move further away from that balance point the system uses more electricity to respond. You can almost see what people set their thermostat to if you optimize the balance point per household.
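    A hypothetical sketch of the usual degree-day setup (simulated data; the balance point and all coefficients are made up):

    ```r
    # Simulated daily data; 65 F is an assumed balance-point temperature.
    set.seed(1)
    temp <- runif(365, 20, 95)                      # outdoor temperature (deg F)
    balance <- 65
    kwh <- 10 + 0.6 * pmax(balance - temp, 0) + 0.9 * pmax(temp - balance, 0) + rnorm(365, 0, 2)

    hdd <- pmax(balance - temp, 0)                  # heating degrees: cold side of the V
    cdd <- pmax(temp - balance, 0)                  # cooling degrees: warm side of the V
    fit <- lm(kwh ~ hdd + cdd)                      # V-shaped (non-monotonic) response in temp
    # To "see the thermostat," profile over a grid of candidate balance points
    # and keep the one minimizing residual error for each household.
    ```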

  10. I can’t help but look at the question from the fundamentals of computer science. A (binary) relation can be modeled by a pair of multi-valued functions. Here, I’m assuming that discrete values won’t work because we’re in a measurable world of probabilities, so perhaps we should model a relation R between X and Y by a pair of functions f_R : X -> P(Y) and g_R : Y -> P(X) which construct probability distributions over the variables.

    The punchline would simply be that the non-monotonicity requirement can generalize to non-monotonicity of the means of the probability distributions.

  11. Quoting the post: “My current thought about modeling a function g(x) that goes up and then down (or down and then up) is not to fit a quadratic curve or piecewise linear curve or whatever, but rather to express it as g(x) = g1(x) + g2(x), where g1 is strictly increasing with an asymptotic threshold (i.e., a function that kinda looks like 1 – exp(-a*x)), g2 is strictly decreasing from an asymptote (i.e., a function that kinda looks like a mirror-reflected version of g1), and then you’ll get non-monotonicity in a natural way. The idea is that g1 and g2 represent different processes so it makes sense to have separate models for them.”

    This looks like 3-6 parameters: asymptotes 1 and 2, rates 1 and 2, and centers 1 and 2, where some parameters could possibly be pinned to meaningful constants or shared between the two functions. Seems like there’s the potential for identifiability problems, though: is the dose response leveling off because the negative function is ramping up or because the positive function is approaching its asymptote? This can be ameliorated by principled prior modeling; therein lie the arts and sciences.
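    As a hypothetical sketch of what that prior modeling might look like in brms (the 4-parameter version with the centers dropped; the data frame `dat`, the variables, and all prior values are made up):

    ```r
    library(brms)

    # Fake data with an up-then-down dose response over a handful of x values.
    set.seed(2)
    dat <- data.frame(x = rep(c(0, 1, 2, 5, 10, 20, 40), each = 20))
    dat$y <- with(dat, 1.2 * (1 - exp(-0.5 * x)) - 0.9 * (1 - exp(-0.05 * x)) +
                       rnorm(nrow(dat), 0, 0.2))

    # The priors do the separating: b1 is a fast "onset" rate, b2 a slow "decline" rate,
    # and both asymptotes are constrained to be positive.
    fit <- brm(
      bf(y ~ a1 * (1 - exp(-b1 * x)) - a2 * (1 - exp(-b2 * x)),
         a1 + a2 + b1 + b2 ~ 1, nl = TRUE),
      data = dat, family = gaussian(),
      prior = c(
        prior(normal(1, 0.5),     nlpar = "a1", lb = 0),   # size of the increase
        prior(normal(1, 0.5),     nlpar = "a2", lb = 0),   # size of the decrease
        prior(normal(0.5, 0.25),  nlpar = "b1", lb = 0),   # fast rate
        prior(normal(0.05, 0.05), nlpar = "b2", lb = 0)    # slow rate
      )
    )
    ```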

  12. I’ve thought of this before in the context of dose-response modeling. However, let’s say we take our g(x) := g1(x) + g2(x) to be something like the functions we often use in dose-response modeling, e.g. g1(x) = Emax_1 * x^h1 / (x^h1 + ED50_1^h1), which starts at 0 at x = 0, reaches 0.5*Emax_1 at x = ED50_1, and asymptotes at Emax_1; the parameter h1 determines the steepness of the curve. Let’s similarly use g2(x) = Emax_2 * x^h2 / (x^h2 + ED50_2^h2). Even with reasonably vague or even semi-informative priors, and even if we set a joint prior that ensures that ED50_1 < ED50_2 and lies above the lowest point x at which we have data, and that Emax_1 and Emax_2 have opposite signs, I believe we end up having a lot of trouble identifying the model parameters, especially if we only get outcome values for a small number of values of x (e.g. in a dose-finding clinical trial, it would be very rare to test more than 7 doses plus a placebo [x=0], and very often fewer than that). I suspect this is tricky even if we put a pretty strong prior on h1 and h2, or even fix them to 1. Or does someone have experience getting this to work?
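    For reference, a minimal sketch of the curve I’m describing (made-up numbers, nothing fitted):

    ```r
    # Two-Emax model: rises then falls when Emax_1 > 0, Emax_2 < 0, and ED50_1 << ED50_2.
    g <- function(x, Emax1, ED50_1, h1, Emax2, ED50_2, h2) {
      Emax1 * x^h1 / (x^h1 + ED50_1^h1) + Emax2 * x^h2 / (x^h2 + ED50_2^h2)
    }
    curve(g(x, Emax1 = 1, ED50_1 = 5, h1 = 1, Emax2 = -0.8, ED50_2 = 50, h2 = 1),
          from = 0, to = 200, xlab = "dose", ylab = "response")
    # With only ~4-8 dose levels, very different (ED50_1, ED50_2, h1, h2) combinations
    # can trace nearly the same curve over the observed range, hence the identifiability worry.
    ```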
