He’s a devops engineer and he wants to set some thresholds. Where should he go to figure out what to do?

Someone who wants to remain anonymous writes:

I’m not a statistician, but a devops engineer. So basically managing servers, managing automated systems, databases, that kind of stuff.

A lot of that comes down to monitoring systems, producing a lot of time series data. Think CPU usage, number of requests, how long the servers take to respond to requests, that kind of thing. We’re using a tool called Datadog to collect all this, aggregate it, run some functions on it (average, P90, …), make dashboards, etc. We can also alert on it, so if a server is high on CPU or low on RAM, we page someone who then has to investigate and fix it.

When making these alerts you have to set thresholds, like if more than 10% of requests are errors over 5 minutes, then the person gets paged. I’m mostly just guessing these by eyeballing the graphs, but I don’t know anything about statistics, and I feel this field probably has opinions on how to do this better!

So I’m wondering, can you recommend any beginner resources I can read to see some basics, and maybe get an idea what kind of stuff is possible with statistics? Maybe then I can try to approach this monitoring work in a bit more systematic way.

My reply: I’m not sure, but my guess is that the usual introductory statistics textbooks will be pretty useless here, as I don’t see the relevance of hypothesis testing, p-values, confidence intervals, and all the rest of those tools, nor do I think you’ll get much out of the usual sermons on the importance of random sampling and randomized experimentation. This sounds more like a quality control problem, so I guess I’d suggest you start with a basic textbook on quality control. I’m not sure what the statistical content is in those books.

32 thoughts on “He’s a devops engineer and he wants to set some thresholds. Where should he go to figure out what to do?”

  1. The mention of thresholds reminds me of an interactive article I saw just a day ago on Twitter about classification thresholds, precision, and recall.

    While not exactly relevant to his question, I do wonder if there’s a use in them exploring a similar trade-off (e.g., true positives vs. false positives) with their work – after all, devops work is mostly about monitoring system failures and trade-offs. (A small sketch of that trade-off is below.)

    They may also benefit from some basic probability (e.g., the binomial distribution) to try to quantify expected failure rates?

    Post I’d mentioned: https://mlu-explain.github.io/precision-recall/
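
    A minimal sketch of that threshold trade-off, assuming you have a history of alert windows labeled as real incidents or not (the data and names here are made up):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # made-up history: error rate observed in each 5-minute window,
    # and whether that window turned out to be a real incident
    error_rate   = np.array([0.01, 0.02, 0.15, 0.03, 0.30, 0.02, 0.12, 0.01])
    was_incident = np.array([0,    0,    1,    0,    1,    0,    0,    0])

    # treat the observed error rate as the alerting score and sweep the threshold
    precision, recall, thresholds = precision_recall_curve(was_incident, error_rate)
    for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
        print(f"alert if error rate > {t:.2f}: precision {p:.2f}, recall {r:.2f}")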

  2. Hi, I did a quick Google search and found this O’Reilly book for starters: Effective Monitoring and Alerting: For Web Operations, 1st Edition, by Slawek Ligus. The contents look like what you’re looking for. It’s on that popular bookseller beginning with A, cheap and in stock. Hope this helps…

  3. I thought this chapter was great food for thought on this topic: https://sre.google/workbook/alerting-on-slos/

    I used to have responsibility for a bunch of DevOps monitoring, primarily to maintain service availability, and I was nodding along at the thought process, though I have not done the fancier stuff in here.

    Aside from the statistics: experiment, iterate, don’t wake your people up if you don’t have to, and try to negotiate realistic goals. (A rough sketch of the burn-rate idea from that chapter follows below.)
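
    A rough sketch of the error-budget burn-rate idea from the chapter linked above, under assumed numbers (a 99.9% success SLO over 30 days; the windows and the burn-rate factor are illustrative, not prescriptive):

    # assumed SLO: 99.9% of requests succeed, measured over a rolling 30 days
    SLO_TARGET = 0.999
    ERROR_BUDGET = 1 - SLO_TARGET            # 0.1% of requests may fail

    def burn_rate(observed_error_rate: float) -> float:
        """How many times faster than budgeted the error budget is being spent."""
        return observed_error_rate / ERROR_BUDGET

    def should_page(err_1h: float, err_5m: float, factor: float = 14.4) -> bool:
        # require both a long and a short window to be burning fast, so a brief
        # blip does not page and an already-resolved incident stops paging;
        # 14.4 is the rate at which one hour consumes ~2% of a 30-day budget
        return burn_rate(err_1h) >= factor and burn_rate(err_5m) >= factor

    # example: 2% of requests failing over the last hour and the last 5 minutes
    print(should_page(err_1h=0.02, err_5m=0.02))   # True: burn rate 20 >= 14.4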

  4. A lot of that comes down to monitoring systems, producing a lot of time series data. Think CPU usage, number of requests, how long the servers take to respond to requests, that kind of thing. We’re using a tool called Datadog to collect all this, aggregate it, run some functions on it (average, P90, …), make dashboards, etc. We can also alert on it, so if a server is high on CPU or low on RAM, we page someone who then has to investigate and fix it.

    When making these alerts you have to set thresholds, like if more than 10% of requests are errors over 5 minutes, then the person gets paged. I’m mostly just guessing these by eyeballing the graphs, but I don’t know anything about statistics, and I feel this field probably has opinions on how to do this better!

    First of all, you need to know the cost of paging someone vs. the cost of letting it work itself out or waiting until the morning, etc.

    Second, as mentioned by Troy above, you have a classification/clustering problem. To do better than eyeballing, you will need a decent sized training dataset of events where the person really did need to get paged vs. not. After that it is pretty straightforward to train a model (there are many options that will be approximately equal in performance). Then you have it output a 0/1 or -1/0/1 based on whether it seems there is an issue deserving a page, or maybe does, or not.

    I would probably use xgboost, but perhaps there are better algorithms for this type of thing today.

    tl;dr: The main issue is getting the costs of missing a problem vs. sending someone in to solve a non-problem, along with getting a good training dataset. Picking an algorithm is the least important aspect of your issue. (A rough sketch of such a model is below.)
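
    A rough sketch of what that could look like, assuming you have already assembled such a labeled table of past metric windows (the features, labels, and numbers are made up; xgboost’s scikit-learn wrapper is used for brevity):

    import numpy as np
    import xgboost as xgb

    # made-up training data: one row per 5-minute window of metrics,
    # label = 1 if a human really did need to be paged for that window
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))       # e.g. standardized [error_rate, p90_latency, cpu]
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 1.5).astype(int)

    # encode the cost asymmetry: missing a real incident is worse than a spurious
    # page, so upweight the positive class
    model = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=3,
        scale_pos_weight=(y == 0).sum() / max((y == 1).sum(), 1),
    )
    model.fit(X, y)

    # page when the predicted probability crosses a threshold chosen from the
    # paging-cost vs. miss-cost trade-off discussed above
    new_window = np.array([[2.0, 1.0, 0.3]])
    print(model.predict_proba(new_window)[:, 1] > 0.5)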

    • Second, as mentioned by Troy above, you have a classification/clustering problem. To do better than eyeballing, you will need a decent sized training dataset of events where the person really did need to get paged vs. not. After that it is pretty straightforward to train a model (there are many options that will be approximately equal in performance). Then you have it output a 0/1 or -1/0/1 based on whether it seems there is an issue deserving a page, or maybe does, or not.

      I would probably use xgboost, but perhaps there are better algorithms for this type of thing today.

      I’m afraid I don’t see the value of this advice. Nonparametric learners are pattern recognizers, which poses a problem here because if you’re doing your job right, no two incidents will be the same. In particular, tree-based classifiers like xgboost are completely incapable of extrapolation. Their output for an arbitrarily large input will be exactly the same as their output for the largest value seen so far. What that means is that if you have a latency metric A that has never been the leading signal of an incident before, then A can go to 10e9 and the classifier will not see anything wrong, and the alert will be suppressed. (A tiny demo of this is below.)

      In this setting, you have a complete dataset, and you define what constitutes acceptable behavior. A predictive model can only approximate, not exceed, a definition. The best possible case is that it gets you a look ahead of a few timesteps. By making it the primary gatekeeper of an alerting system you lose a lot of the robustness that comes with your knowledge of acceptable behavior.
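
      A tiny demo of that extrapolation point, with made-up data where incidents were only ever driven by a second metric B, never by A:

      import numpy as np
      import xgboost as xgb

      rng = np.random.default_rng(0)
      A = rng.uniform(0, 100, size=500)          # latency metric that never mattered
      B = rng.uniform(0, 100, size=500)
      y = (B > 95).astype(int)                   # past incidents were driven by B only
      model = xgb.XGBClassifier(n_estimators=100, max_depth=3).fit(np.column_stack([A, B]), y)

      # every split threshold on A lies below max(A), so any A beyond the training
      # range lands in the same leaves as A = max(A): the two predictions are
      # identical, and both say "do not page", no matter how extreme A gets
      boundary = np.array([[A.max(), 10.0]])
      exotic   = np.array([[1e9,     10.0]])
      print(model.predict_proba(boundary)[:, 1], model.predict_proba(exotic)[:, 1])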

      • They just have multiple correlated time series. If you take the last n timepoints of each to use as features (probably doing some transformations like standardizing, etc) *and* have sufficient previous examples of failure then it will work fine as a classifier with little effort.

        As I said, that is the least important and least difficult aspect. Figuring out the costs and getting a good training dataset are the hard parts.

        If they know the bounds of acceptable behaviour, then why are they using a model and choosing a threshold at all?

        If A > 1e9 then send text. Those cases are simple.

        • probably doing some transformations like standardizing, etc

          Boosted tree classifiers are invariant under monotonic transformations of covariates (note, not true for the target). There is no point to doing standardization or log transformation on your input data with xgboost.

          If they know the bounds of acceptable behaviour, then why are they using a model and choosing a threshold at all

          From a quality control perspective, choosing a threshold is defining the bounds of acceptable behavior.

          From an anomaly detection standpoint:

          multiple correlated time series

          Not how software systems fail. A downstream latency failure will often have no signature at all (an exception being streaming applications, where some kind of backpressure can be applied). A part of the application downstream of the failure may also see no change in its latency, only in its throughput. So with an A -> B -> C pipeline, when you feed all the data into some nonparametric learner, if you have a bunch of incidents with elevated latency in B, no amount of latency in C will tell your classifier that you have an incident. Your classifier has to get good coverage of what the signature of incidents in each part of your application looks like, leading into:

          *and* have sufficient previous examples of failure then it will work fine as a classifier with little effort.

          The problem is that if you’re doing your job well, you don’t have a large number of previous examples of failure. Ideally, your number of incidents is small to begin with. Furthermore, if you’re making a code change to fix the root cause of each incident, then you’re shifting the distribution away from conditionally iid, and any learnable pattern for the next incident will be different from the pattern for the last. The only way a nonparametric classifier will be useful is if you’re just restarting the service each time it runs into a problem and hoping it lasts a while. This is like trying to predict economic recessions with a machine learning model. Tulip prices, housing prices, credit default swaps and collateralized debt obligations, big-tech IPOs, foreign exchange trading in Latin America—the next recession will not look like the last one.

          This problem is fundamental to nonparametric learners—behavior for out-of-sample or fat-tailed patterns is formally undefined. For tree-based classifiers, it reverts to exactly the nearest seen pattern, which means it will probably not alert on exotic signals. In my opinion, suppressing alerts on arbitrarily weird behavior is exactly the wrong behavior. A neural-network-based classifier depends on the activation function, but for your usual ReLUs/SELUs, you get extrapolation output that’s multilinear in each input, which means it’s as likely to give a -1 “don’t alert” on exotic signals as it is a 1 “alert”.

          As for the “send a text”:

          1. I don’t think exotic signals should necessarily be lower urgency than incidents you’ve seen before
          2. You have the same problem. How high does it need to be to send a text? How low before the texts become useless noise?

          Machine learning is great, but it’s just the wrong tool for this job. Machine learning learns the usual patterns; here we’re looking for the unusual.

        • There is no point to doing standardization or log transformation on your input data with xgboost.

          Sure there is, e.g. if you want to treat a 50% increase as the same regardless of the baseline value. The algorithm can then see an increase of 1 to 1.5 the same as 100 to 150.

        • Sure there is, e.g. if you want to treat a 50% increase as the same regardless of the baseline value. The algorithm can then see an increase of 1 to 1.5 the same as 100 to 150.

          This is just not true. Tree-based classifiers do not care about magnitudes. They choose a split point to partition covariate space into two bins that minimize a loss function on the target. The loss of each candidate split is completely invariant under a monotonic transformation of the covariate space. In practice, you can’t search candidate splits exhaustively, but even so, candidate proposal algorithms are also invariant under monotonic transformations. XGBoost and LightGBM propose candidate splits by a kind of grid search over percentiles. They use percentiles to build an approximately equal-weight histogram of a continuous covariate, then check the loss function at each bin boundary. This procedure is based only on ranks and counts—the only way a monotonic transformation can affect it is if you’re losing floating point precision and colliding unequal values, or if a feature value lands exactly in between two observed bins and so the package’s arbitrary choice of left/right/midpoint interpolation comes into play.

          Some packages may use a splitting strategy that depends on the variance of the bins, but those don’t perform as well in practice and are comparatively niche. Trees are fundamentally different from GLMs and neural networks—the intuition does not carry over.

        • You can go on Stack Exchange and still read people arguing, years ago, that one layer of a neural network is enough.

          That is what this discussion reminds me of.

        • To be less vague…

          For infinite time and resources you are correct. But you are wrong for real-world applications where there are finite resources.

        • No. You are incorrect IN REAL PRACTICE. This has nothing to do with universal approximation theorems.

          This is how a simple regression tree works.

          For a list L of (x, y) tuples:

          L.sort(key=lambda p: p[0])          # sort by the covariate: it is only ever compared, never used in arithmetic
          splits = []
          for ptile in range(1, 100):
              cut = round(ptile / 100 * len(L))   # index of the ptile-th percentile
              if cut == 0 or cut == len(L):
                  continue
              left, right = L[:cut], L[cut:]
              lefty = sum(p[1] for p in left) / len(left)
              righty = sum(p[1] for p in right) / len(right)
              loss = (sum((p[1] - lefty) ** 2 for p in left)
                      + sum((p[1] - righty) ** 2 for p in right))
              splits.append((ptile, loss))

          best_split = min(splits, key=lambda s: s[1])   # the tree splits at this percentile

          At no point does it multiply or add or do an arithmetic operation on the covariate. It ranks the covariates and checks partitions at split points.

          In case you need further elaboration, take a list of objects and rank them by a variable. Now take the same list, and rank them by the logarithm of that variable. The same ordering! The same percentiles!

          You just don’t know how this tool works.

        • https://imgur.com/a/ytcNXzp

          Even if I lower the boosting iterations to get a half-converged model, the resultant pattern is exactly the same between a log transform and arbitrary rescaling. Interestingly, this also holds true if I zoom in on bin cut points: it seems xgboost’s current (1.5.2) implementation always chooses left when a new datapoint falls in between fitted bins, so log transforms have no effect at all. I know their implementation has gone back and forth on this, though. Even with midpoint interpolation, there’s no way standardization would do anything!
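
          For anyone who wants to reproduce that check, a minimal version (made-up data; predictions are compared at the training points so bin-boundary interpolation doesn’t muddy the comparison):

          import numpy as np
          import xgboost as xgb

          rng = np.random.default_rng(0)
          X = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))    # positive, skewed feature
          y = np.sin(np.log(X.ravel())) + rng.normal(scale=0.1, size=1000)

          params = dict(n_estimators=100, max_depth=3, learning_rate=0.1)
          raw    = xgb.XGBRegressor(**params).fit(X, y)
          logged = xgb.XGBRegressor(**params).fit(np.log(X), y)

          # the numeric split points differ (they live in different spaces), but they
          # induce the same partitions of the training data, so the fitted predictions
          # should match up to floating point
          print(np.allclose(raw.predict(X), logged.predict(np.log(X))))   # True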

        • Say at night a feature has a typical value of 1, while during the day it is 100. A spike of 50% (to 1.5 or 150) is meaningful.

          Yes, you can have a datetime feature and it will figure it out eventually, but if you standardize, fewer splits need to be done, and it probably requires less hyperparameter tuning too. You are adding information by normalizing in this case.

          You are *exactly* like the “one layer is enough for anyone” people. It is funny.

        • Yes, you can have a datetime feature and it will figure it out eventually, but if you standardize, fewer splits need to be done, and it probably requires less hyperparameter tuning too.

          No. As I just pointed out, xgboost makes literally the same exact splits at EACH INDIVIDUAL boosting iteration, regardless of the transformation.

          If you apply a monotonic transform to any decision tree’s cut points and the input data, the output is exactly the same, because a decision tree never does any operations at all on the actual values of covariates beyond comparisons and sorting. Exactly the same number of splits. Consider:

          f(log([10, 100])) = 1
          f(log((100, 1000])) = 2
          f(log((1000, 10000])) = 3

          This can be represented as a regression tree in log (base 10) space with splits at log x = 2, 3. Or equivalently, you can split in unlogged space at 100, 1000. No more or fewer splits, exactly the same.

          Do you understand that log X < log(splitpoint) if and only if X < splitpoint? Because that is literally the only operation in the evaluation of a decision tree.

          You’re just repeating hollow aphorisms you’ve heard at one point or another about technologies you don’t understand. I love log transforms, I think the world is geometric. But they just don’t do anything for boosted trees.

        • I think I see.

          What I am talking about is normalizing the last n timepoints and having that as a feature. Then your target is 0/1, was there a problem or not.

          If you take the last n timepoints at night and normalize, you are doing a different transformation than on the last n timepoints during the day.

      • Maybe a clearer way of putting it is that you are talking about *column-wise* normalization (where each column is a different feature), while I have been talking about (partial) *row-wise* normalization (where each row is a different multi-dimensional datapoint). (A small sketch of the distinction is below.)
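
        A small sketch of that distinction, with a made-up matrix of sliding windows (each row is the last 4 timepoints of a metric):

        import numpy as np

        windows = np.array([
            [  1.0,   1.1,   0.9,   1.5],   # night-time window, 50% spike at the end
            [100.0, 110.0,  90.0, 150.0],   # day-time window, same relative spike
        ])

        # row-wise normalization rescales each window relative to itself, so the night
        # and day windows collapse to the same shape and "a 50% spike" is one pattern
        # regardless of baseline; column-wise scaling (one scale per column across all
        # rows) would keep the night rows tiny and the day rows huge
        row_norm = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
        print(row_norm)   # the two rows come out identical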

  5. Honestly, as someone who’s done similar work, I think DataDog’s anomaly detection feature is pretty great. If your main concern is “this number is now different than it was, is that a problem”, then the out-of-the-box feature fits pretty well. It’s one of the options when you’re setting up alerts.

    Otherwise, I think the useful thing here is to take a step back from the numbers and think about the problem itself. If you’re tracking some database shard that only sees high traffic during a failover, is it meaningful for RAM to be higher than the 90th percentile of past usage, or is it notable for RAM to be beyond a threshold of what’s available to the box? I’d guess the second — if the RAM usage is exceeding the historical norm, that means you’re doing a failover, in which case you don’t need your perfectly well-equipped server screaming at you too; if RAM is exceeding 80-90% usage, that means your app performance is degraded. Then it becomes a business question, not a statistics question, which means it’s a lot harder to answer definitively.

    • +1 that if you don’t know what to do, you can use Datadog’s built-in tools, which are tuned to the practicalities of people complaining if they’re alerted too much unnecessarily and people complaining if they’re not alerted when something goes wrong.

      On the statistics part, you can test whether something is an outlier, e.g. you can test whether the results for the last 10 minutes are three standard deviations from the mean of prior 10-minute periods. I’m surprised Andrew sold more formal statistics short here. You can also look at a percentile of the historical distribution to understand how many times a year an alert is going to fire. I find percentiles work best for picking a threshold because you know roughly how many (mainly false) positives will fire in a year (a small sketch of this is below). Finally, if you’re coding it yourself, just save everyone the trouble and hard-code it to never fire on Christmas.
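
      A small sketch of the percentile approach (the history here is made up):

      import numpy as np

      rng = np.random.default_rng(0)
      history = rng.beta(1, 200, size=365 * 24 * 6)   # a year of 10-minute error rates

      # pick the threshold as a high percentile of the historical distribution
      threshold = np.percentile(history, 99.9)

      # then you know roughly how often the alert fires on "normal" behavior:
      windows_per_year = 365 * 24 * 6                 # 52,560 ten-minute windows
      expected_false_alarms = (1 - 0.999) * windows_per_year
      print(threshold, expected_false_alarms)         # ~53 firings a year if nothing changes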

  6. If you’re trying to monitor user experience and ensure you’re meeting your standards for product performance, you don’t really need any kind of probabilistic model. Datadog actually has all or very close to all of the performance data—you just need to set what your standards are. How many users get a slow page load, etc.

    If you’re trying to detect anomalies, like product breakages, while avoiding alerts for noisiness, more important than the probabilities is the periodicity. Typically, you’ll have an expected pattern repeating every hour, day, week—you want to make sure you’re looking back over a period that captures the relevant periodicities, or your alert’s frame of reference for “normal” behavior will be oscillating. As for the probability, you can be really loose with it. In my experience, software systems tend to fail catastrophically and obviously. If you want to get fancy, take a probability lower bound for “ordinary behavior”, then derive bounds on deviation from the mean of your lookback window using Chernoff bounds or Chebyshev’s inequality (a sketch of the Chebyshev version is below). These inequalities hold under very general conditions and are very easy to calculate. Note that these are a kind of null hypothesis significance testing, but I think their ease of use and the catastrophic way software systems fail justify their use here.
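
    A minimal sketch of the Chebyshev version, with made-up numbers (the lookback window and the per-check alert probability are whatever you choose):

    import numpy as np

    def chebyshev_threshold(lookback: np.ndarray, alarm_prob: float = 0.001) -> float:
        """Chebyshev: P(|X - mean| >= k*sd) <= 1/k**2 for any distribution with finite
        variance, so k = sqrt(1/alarm_prob) bounds the false-alarm rate per check at
        alarm_prob, no matter what the metric's distribution looks like."""
        k = np.sqrt(1 / alarm_prob)
        return lookback.mean() + k * lookback.std()

    rng = np.random.default_rng(0)
    lookback = rng.normal(loc=200.0, scale=20.0, size=24 * 6)   # e.g. a day of p90 latencies
    print(chebyshev_threshold(lookback))   # mean + ~31.6 sd: only gross deviations alert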

    If you’re trying to tease out something more subtle, like a slow performance degradation over a long time, Datadog is an inappropriate tool beyond eyeballing graphs. Then you’ll have to get into real time-series modeling. There are no shortcuts there.

    • > Datadog actually has all or very close to all of the performance data

      I don’t know Datadog but my guess is also that boring solutions will dominate here. Rent bigger computers, or raise the alert thresholds, or figure out some way to have people awake at night.

      > get an idea what kind of stuff is possible with statistics?

      But! In terms of curiosity: I am no expert and don’t know shortcuts, and I suspect somebody is right that there aren’t any shortcuts for the general case, but in terms of learning what to learn:

      If you haven’t already, mess with Stan + multilevel modeling stuff: https://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html

      If you’re dealing with high-volume time series stuff there’s no way you don’t run into Kalman filter stuff. Find a copy of Bayesian Filtering and Smoothing (free PDF: https://users.aalto.fi/~ssarkka/#publications) — this book rules for understanding what is going on there.

      And then just in general find sections in Regression and Other Stories that sound interesting and poke around/try problems (free PDF: https://avehtari.github.io/ROS-Examples/)

      In the end the story will probably be grouping by factors and computing quantiles in SQL, but then you’ll have more context for the alternatives and maybe that helps you do the quantile thing better.

    • IMO, statistical process control is somewhat overkill for software engineering, and if you do end up using it you’ll have to come up with your own thresholds. These are systems designed for process engineering and manufacturing. For example:

      https://en.wikipedia.org/wiki/Control_chart#Choice_of_limits

      Shewhart set 3-sigma (3-standard deviation) limits on the following basis.

      The coarse result of Chebyshev’s inequality that, for any probability distribution, the probability of an outcome greater than k standard deviations from the mean is at most 1/k².
      The finer result of the Vysochanskii–Petunin inequality, that for any unimodal probability distribution, the probability of an outcome greater than k standard deviations from the mean is at most 4/(9k²).
      In the Normal distribution, a very common probability distribution, 99.7% of the observations occur within three standard deviations of the mean (see Normal distribution).

      In a web application, you might be getting 10000 requests every second, in which case a 99.7% event would be triggered all the time. To detect anomalous activity in a web app, you’ll want something more like 99.999…
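
      Rough arithmetic behind those numbers (treating each request as an independent check, which it isn’t quite, but it gives the order of magnitude):

      # 3-sigma limits: roughly 0.3% of in-control observations fall outside them
      requests_per_second = 10_000
      print(requests_per_second * 0.003)             # ~30 spurious triggers every second

      # for about one spurious trigger per day you need far more "nines"
      checks_per_day = requests_per_second * 86_400  # 864,000,000 checks per day
      print(1 / checks_per_day)                      # ~1.2e-9, i.e. a ~99.9999999% threshold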

  7. This is a control-chart monitoring problem. I would google various control charts and how their control limits are set. The type of control limit you set may be based on the particular thing you’re monitoring and the trade-offs between false detection and false non-detection. Or instead of googling, get in touch with a quality engineer! This is their bread and butter.
