Bothered by non-monotonicity? Here’s ONE QUICK TRICK to make you happy.

We’re often modeling non-monotonic functions. For example, performance at just about any task increases with age (babies can’t do much!) and then eventually decreases (dead people can’t do much either!). Here’s an example from a few years ago:

A function g(x) that increases and then decreases can be modeled by a quadratic, or some more general functional form that allows different curvatures on the left and right sides of the peak, or some constrained nonparametric family.

Here’s another approach: an additive model, of the form g(x) = g1(x) + g2(x), where g1 is strictly increasing and g2 is strictly decreasing. This sort of model gets us away from the restrictions of the quadratic family (it’s trivial for g1 and g2 to have different curvatures), but it is also a conceptual step forward in that it implies two different models, one for the process that causes the increase and one for the process that causes the decrease. This makes sense, in that typically the increasing and decreasing processes are completely different.
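A minimal Python sketch of the additive form, assuming a saturating exponential for g1 and an accelerating exponential decline for g2. The shapes and every parameter value are hypothetical, picked only so that the sum rises and then falls:

```python
import math

# Hypothetical component shapes and parameter values, for illustration only.
def g1(x, a=10.0, k=0.15):
    """Strictly increasing, saturating component (e.g. growth or learning)."""
    return a * (1.0 - math.exp(-k * x))

def g2(x, c=0.02, r=0.06):
    """Strictly decreasing component (e.g. wear or senescence)."""
    return -c * (math.exp(r * x) - 1.0)

def g(x):
    """Additive model: increasing part plus decreasing part."""
    return g1(x) + g2(x)

ages = list(range(0, 101, 5))
values = [g(x) for x in ages]
peak_age = ages[values.index(max(values))]   # grid point where g is largest
```

The two components have their own curvature parameters (k and r), so the rise can be fast and the decline slow, or the reverse, without either constraining the other.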

This is an example of the Pinocchio principle.

P.S. The original title of this post was “Additive models for non-monotonic functions,” but that seemed like the most boring thing possible, so I decided to go clickbait.


  1. Dogen says:

    I’m confused. The way this is worded it seems to fit the Pinocchio Principle, but the way it’s worded doesn’t make sense to me.

    I mean, if you’re modeling something associated with age over a lifetime wouldn’t you naturally use multiple functions? What would make anyone think a single function would apply? So then how does Pinocchio come into play?

    Hmm, ok maybe you’re making an inside joke. That fits, and I can see it as pretty funny now.


  2. Jonathan (another one) says:

    My old boss always used to use a/x + bx + cx^2, which gives three regimes for initial, run-of-the-mill and extreme behavior of x.

    • Rahul says:

      Is there any advice for the selection of g1(x) & g2(x)?

      Just in this particular thread I see several functional forms: a/x + bx + cx^2

      exp(-mt) (1-exp(-kt))^b

      a_1*exp(-b_1*x) + a_2 + a_3*exp(b_2*x)

      and a few more……..

      Is there any way to know what to use? Is minimizing the fit error the only yardstick?

      • My best advice would be to think about the properties the function should have (this is a kind of prior). For example in the graph shown at the top there’s a rapid almost linear increase, a period where things are relatively constant with a slow decay, and then a roll-off at the end. At the very least, the functional form should be able to reproduce a reasonably wide variety of those types of curves.

  3. David P says:

    A (possible) real-life example:

    Rising incomes, holding longevity constant, might cause workers to finance longer retirements (a curve for retirement age falling but tending to level off).

    Rising longevity, holding incomes constant, will require workers to finance longer retirements (approximately constant as a fraction of lifetime, and consequently rising as lifetimes rise).

    If the curvatures cooperate, the net effect as both incomes and longevity rise can be a decline then a rise in average retirement age.

  4. Bill Harris says:

    I’m glad you brought this up. System dynamics models nonlinear functions often, and system dynamicists emphasize the need to disaggregate causes to get monotonic functions. See, or see chapter 14 of John Sterman’s /Business Dynamics/.

    A classic example is production measured in goods per person per day as a function of hours worked per day. Loosely speaking, working 10% more hours might be expected to increase production by 10%. Working 100% more per day (from 9 to 18 hours, say) is not likely to double one’s production per day. Indeed, production per day may have a peak somewhere between 8 and 18 hours. That’s not because production is a quadratic function of hours worked but because there are two effects: how long one works and how fatigued one gets.
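A rough sketch of the two-effect idea in Python, assuming output is hours worked times a fatigue factor that shrinks as the day gets longer. The Gaussian-shaped fatigue factor and the constants RATE and H0 are hypothetical. This version is multiplicative rather than additive, but on the log scale it is again the sum of an increasing term and a decreasing one:

```python
import math

# Hypothetical numbers: RATE is output per fresh hour, H0 sets how fast fatigue bites.
RATE = 1.0
H0 = 16.0

def fatigue_factor(hours):
    """Strictly decreasing in hours: fraction of fresh productivity retained."""
    return math.exp(-(hours / H0) ** 2)

def production_per_day(hours):
    """Monotone 'time worked' effect times monotone 'fatigue' effect."""
    return RATE * hours * fatigue_factor(hours)

grid = [h / 2 for h in range(0, 49)]            # 0, 0.5, ..., 24 hours
peak_hours = max(grid, key=production_per_day)  # grid point with the most output
```

Neither effect is quadratic in hours, and neither has a peak on its own. The peak in daily production comes only from combining them.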

  5. Richard McElreath says:

    The model I presented at StanCon Helsinki had an age-related function of this sort:

    S(t) = exp(-mt) (1-exp(-kt))^b

    This is multiplicative and strictly positive and doesn’t have any crazy Runge swings. It’s derived explicitly by specifying an increasing ODE and a decreasing ODE and then multiplying them and solving. I was driven to this after fighting with some splines and deciding that using a little biology would be a better idea.
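A small Python sketch of this curve, with hypothetical values for m, k and b. The peak location is not stated in the comment. It comes from setting the derivative of log S(t) to zero, which gives t* = log(1 + b*k/m) / k:

```python
import math

def skill(t, m, k, b):
    """S(t) = exp(-m*t) * (1 - exp(-k*t))**b, for t >= 0."""
    return math.exp(-m * t) * (1.0 - math.exp(-k * t)) ** b

def peak_time(m, k, b):
    """Age at which S(t) is largest, from d/dt log S(t) = 0."""
    return math.log1p(b * k / m) / k

# Hypothetical parameter values, for illustration only.
m, k, b = 0.02, 0.1, 2.0
t_star = peak_time(m, k, b)
```

Because the curve is a product of a decaying exponential and a saturating rise, it starts at zero, stays positive, and has a single peak for any positive m, k and b.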

  6. Anonymous says:

    Are you trying to model the average performance or individual performance? The shape of the average curve can be quite different than any individual one…

    There is a long history of people complaining about this regarding learning curves. Confusion caused by theorizing about the shape of average curves messed up the field for many decades, and probably still is doing so. I can’t paste a link on mobile for some reason, but check out Gallistel et al., 2004, “The learning curve: Implications of a quantitative analysis.”

    Actually I mention it in the comments in the linked blog post.
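A toy illustration in Python of how an average curve can differ from every individual curve. It assumes, purely for illustration, that each learner jumps abruptly from 0 to 1 at their own onset trial. The onset values are made up:

```python
# Hypothetical onset trials, one per learner; each learner jumps from 0 to 1 at onset.
onsets = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
trials = range(0, 70)

def individual_curve(trial, onset):
    """Step-like individual performance: nothing, then mastery."""
    return 1.0 if trial >= onset else 0.0

def average_curve(trial):
    """Mean performance across all learners at a given trial."""
    return sum(individual_curve(trial, o) for o in onsets) / len(onsets)

avg = [average_curve(t) for t in trials]
```

Every individual curve here is a step, yet the average climbs gradually. A smooth functional form fitted to the average would describe none of the individuals.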

  7. Anonymous says:

    Also, you really have to get someone to take a look at this site; it’s almost becoming too annoying to visit with the inconsistent links and comment counts, etc.

  8. Additive models of this sort are extensively used in demography to model the infant, accident, and senescent components of the age-specific hazard of death. Among the most popular of these parametric „competing risk“ models would be the Siler, which expresses the risk of death at age x as y(x) = a_1*exp(-b_1*x) + a_2 + a_3*exp(b_2*x), but it was the Gompertz-Makeham curve that started the additive age effects modelling trend in demography back in the 19th century. What followed were dozens of proposed additive „laws of mortality“ (there may have been some physics envy going on :)

    When done right these additive models offer a free decomposition into meaningful partial effects and have parameters which correspond to relevant domain-specific quantities (e.g. the rate of aging, the level of infant mortality etc.). Some people think these models may even be more than just useful, they may be true…

    • Jon Minton says:

      I was going to suggest something similar.

      A benefit of this approach seems to be that it allows biological and non-biological longevity to be compared more clearly. N Taleb is keen on highlighting the ‘Lindy Effect’, i.e. the observation that the hazard of ruin (‘death’) of many non-biological phenomena falls with age. This seems equivalent to setting the senescent component to zero in a three component model.
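A short Python sketch of the Siler hazard as written above, returning the three components separately so the decomposition is visible. The parameter values are hypothetical and not fitted to any population. Setting a_3 to zero gives the case with no senescent component:

```python
import math

def siler_components(x, a1, b1, a2, a3, b2):
    """Return the (infant, accident, senescent) parts of the hazard at age x."""
    infant = a1 * math.exp(-b1 * x)      # falls with age
    accident = a2                        # constant background risk
    senescent = a3 * math.exp(b2 * x)    # rises with age
    return infant, accident, senescent

def siler_hazard(x, a1, b1, a2, a3, b2):
    """y(x) = a_1*exp(-b_1*x) + a_2 + a_3*exp(b_2*x)."""
    return sum(siler_components(x, a1, b1, a2, a3, b2))

# Hypothetical parameter values, for illustration only.
params = dict(a1=0.05, b1=1.0, a2=0.0005, a3=0.00005, b2=0.09)
no_senescence = dict(params, a3=0.0)     # hazard only falls, then levels off at a2

ages = list(range(0, 101, 10))
bathtub = [siler_hazard(x, **params) for x in ages]
lindy_like = [siler_hazard(x, **no_senescence) for x in ages]
```

With all three components the hazard is high at birth, low in between, and high again at old age. With a_3 = 0 it never rises.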

  9. Mike Hunter says:

    “Here’s another approach: an additive model, of the form g(x) = g1(x) + g2(x), where g1 is strictly increasing and g2 is strictly decreasing.”

    If I understand the definition of a “strictly increasing (decreasing)” function correctly, wouldn’t these expressions for g1 and g2 require transforming raw (x) into “strict” sequences, thereby destroying the relationship between (x) and age (time)?

    • g(x) is strictly increasing if and only if whenever x2 is greater than x1 also g(x2) is greater than g(x1).

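      If g is differentiable, having dg/dx greater than 0 everywhere is enough to guarantee this (a strictly increasing function can still have derivative 0 at isolated points, as x^3 does at 0).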

      none of that requires adjusting x in any way.

      If you use greater than or equal to instead of greater than you get a non-decreasing (weakly increasing) function.

      • Mike Hunter says:

        Thank you for your comment. Please forgive my lack of clarity. So, in a temporal sequence of raw x with, for instance, x1=0.1, x2=0.2, x3=0.005, x4=0.01, x5=0.001…, how would g1(x) be operationalized?

        • The point here is just that g is some function of x and that whenever x increases so should g. You simply calculate g(x[i]) for all i.

          I think your point is something about how x[i] is not a monotone/increasing function of i, but that’s an entirely different issue.

          • Mike Hunter says:

            So, if I were to operationalize g(x[i]) for all i, it would be represented by an indicator function, correct?

            • Imagine x is age in years for many different people, and g is a prediction for their performance under some test. The i variable is just the index into the table of people.

              Now for every person i you get their age x[i] and you calculate your prediction g(x[i]) using whatever function you have decided to use for g… Suggested forms that follow Andrew’s advice are elsewhere in the thread.

              This has nothing to do with indicator functions, which are functions that take on either 0 or 1 as their only values. It also doesn’t require transforming the ages in any way; we aren’t concerned about having multiple people with the same age, for example. Also it doesn’t matter that as a function of i the predictions may fluctuate wildly due to changes in age from one table entry to another.

              The point is just that if you put different ages in, the g function should first increase for small ages, then flatten for middle ages, then drop off for large ages, as shown in the graph Andrew posted.
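The exchange above can be made concrete with a few lines of Python, using the x values from the question and a hypothetical strictly increasing g1 (the exponential form and the value of k are assumptions):

```python
import math

def g1(x, k=5.0):
    """A strictly increasing function of x (hypothetical choice)."""
    return 1.0 - math.exp(-k * x)

x = [0.1, 0.2, 0.005, 0.01, 0.001]   # raw values, in table order
pred = [g1(xi) for xi in x]          # no sorting or transforming of x needed

# Read in table order, pred goes up and down because x does.
# Read in order of x, pred only goes up, which is all "strictly increasing" means.
by_x = sorted(zip(x, pred))
```

The index i only says which row of the table is being evaluated. Monotonicity is a statement about g1 as a function of x, not about the sequence of rows.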

  10. Another useful trick is to model the derivative of your function directly using “smooth step functions” b*inverse_logit((x-x0)/a)

    These are symbolically integrable, bounded functions that are asymptotically constant for “large” x. Maxima is your friend here, but also be aware of the need to maintain numerical stability. The integral looks like:

    b*(a*log(1+exp(-(x-x0)/a)) + x0 + x )

    so using log1p_exp in Stan is the way to go for numerical stability.
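A Python sketch of the same idea, with a hand-written stable log1p_exp standing in for the Stan function of that name. The helper names are hypothetical. The antiderivative is written here as a*b*log(1 + exp((x-x0)/a)), which differs from the expression above only by an additive constant:

```python
import math

def inv_logit(u):
    """1 / (1 + exp(-u)), arranged so exp never overflows."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def log1p_exp(u):
    """log(1 + exp(u)), stable for large positive or negative u."""
    if u > 0:
        return u + math.log1p(math.exp(-u))
    return math.log1p(math.exp(u))

def smooth_step(x, x0, a, b):
    """Derivative being modeled: b * inv_logit((x - x0) / a)."""
    return b * inv_logit((x - x0) / a)

def smooth_step_integral(x, x0, a, b):
    """One antiderivative of smooth_step with respect to x."""
    return a * b * log1p_exp((x - x0) / a)
```

A naive log(1 + exp(u)) overflows once u passes roughly 710 in double precision, while the rearranged form returns about u, as it should.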

    For manipulating symbolic integrals Maxima is your friend

    There are also online maxima servers for quick calculations:

  11. Radford Neal says:

    My reading of this is that you think that if g1(x) is strictly increasing and g2(x) is strictly decreasing, then g(x) = g1(x)+g2(x) will increase to some point and then decrease thereafter.

    But this isn’t true. Writing g(x) in this form does not constrain its behaviour at all. You can see this from the fact that the derivative of g(x) is the sum of the derivatives of g1(x) and g2(x), which must be positive and negative, respectively, but are otherwise unconstrained. You can write any value for the derivative of g(x) as the sum of a positive number and a negative number.

    Maybe you meant to say that g1 and g2 are also concave? Then g would be concave as well, and hence have only one peak. But concavity would be too strong a constraint for most problems.
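A quick numerical illustration of this point in Python, using a hypothetical wiggly target g(x) = sin(3x). Adding and subtracting a steep enough line splits it into a strictly increasing part and a strictly decreasing part, yet the sum still has several peaks:

```python
import math

C = 4.0   # any constant larger than max |g'(x)| = 3 works

def g(x):
    return math.sin(3.0 * x)

def g1(x):
    return math.sin(3.0 * x) + C * x   # derivative 3*cos(3x) + 4 >= 1 > 0

def g2(x):
    return -C * x                      # derivative -4 < 0

xs = [i * 0.01 for i in range(1001)]   # grid on [0, 10]
total = [g1(x) + g2(x) for x in xs]

# Count interior grid points that are higher than both neighbours.
n_peaks = sum(1 for i in range(1, len(total) - 1)
              if total[i - 1] < total[i] > total[i + 1])
```

The same construction works for any differentiable g with a bounded derivative, so the decomposition by itself puts no constraint on the shape of the sum.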

  12. Andrew says:


    No, I don’t think that if g1(x) is strictly increasing and g2(x) is strictly decreasing, then g(x) = g1(x)+g2(x) will increase to some point and then decrease thereafter. What I said is that if g(x) increases and then decreases, it can be a good idea to model g(x) as the sum of two functions, one increasing and one decreasing. I’m not saying that any such sum will increase and then decrease.

  13. Jon Minton says:

    Will the parameters have a solution ‘ridge’ rather than converge to a point?

    • This depends strongly on your choice of parameterization, and the priors. A solution ridge occurs when there are multiple ways to get exactly the same thing: for example 5 = 1+4 or 5=2+3 or 5=3.5+1.5 etc. In this case instead of a single number we’re talking about multiple ways to get exactly the same function (at least at the points where it’s evaluated, which is always a discrete set in real computation).

      So for example, if you parameterize your function as g1(x) + g2(x) + a*x+b so that you have two nonlinear components, a linear component, and a constant, then if you accidentally include a constant in the parameters for g1, that constant and b will be interchangeable and you will wind up with a ridge. Given the priors it might not be an infinite ridge, but it will be a useless feature. Looking carefully at your parameterizations is a good idea.
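A small Python sketch of that ridge, with hypothetical forms for g1 and g2. The stray constant c inside g1 and the intercept b only ever enter through their sum, so every (c, b) pair adding up to 5 gives the same curve:

```python
import math

def model(x, c, s, a, b):
    """g1(x) + g2(x) + a*x + b, where g1 mistakenly carries its own constant c."""
    g1 = c + s * (1.0 - math.exp(-x))   # increasing part, with the stray constant
    g2 = -0.1 * math.exp(0.5 * x)       # decreasing part (fixed here for brevity)
    return g1 + g2 + a * x + b

xs = [0.0, 0.5, 1.0, 2.0, 4.0]
settings = [(1.0, 4.0), (2.0, 3.0), (3.5, 1.5)]   # (c, b) pairs with c + b = 5
curves = [[model(x, c, s=2.0, a=0.3, b=b) for x in xs] for c, b in settings]
```

The data cannot distinguish these settings, so only the prior separates them. Dropping c from g1, or fixing it, removes the ridge.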
