Ben Hanowell writes:

I’ve worked for tech companies for four years now. Most have a key performance indicator that seeks to measure the rate at which an event occurs. In the simplest case, think of the event as a one-off deal, say an attempt by a buy-side real estate agent to close a deal on a house for their client. Suppose that we know the date the deal began, and the deal may have been closed and won, closed and lost, or still be open (therefore not closed), and we know the date that those events occurred.

In most companies, the way they measure this rate is by counting the number of deals that get closed within D days from the day the deal began (e.g., the day the client first reached out to the agent). Usually the closed-and-won outcome is the outcome of interest, but closed-and-lost rates are also calculated. The value of D varies, and most businesses look at the number for different values of D, e.g., 30, 60, 90, 180, 360. This metric is easy to interpret, but it has drawbacks. For one, the D-day close rate can only be calculated meaningfully from deals that began at least D days ago. For another, all deals that began at least D days ago and within the period of interest are counted equally in the denominator even though some of those deals spent far more person-time units at risk of getting closed than others.
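To pin down the arithmetic, here is a minimal sketch of the D-day close-rate calculation. The deal records, dates, and field layout are invented for illustration, not taken from any real system:

```python
from datetime import date

# Hypothetical deal records: (open_date, close_date or None, won flag).
deals = [
    (date(2023, 1, 5),  date(2023, 2, 1),  True),   # closed and won after 27 days
    (date(2023, 1, 10), date(2023, 4, 20), False),  # closed and lost after 100 days
    (date(2023, 2, 1),  None,              False),  # still open
]

def d_day_close_rate(deals, d, as_of):
    """Share of deals opened at least d days before `as_of`
    that were closed and won within d days of opening."""
    eligible = [(o, c, w) for o, c, w in deals if (as_of - o).days >= d]
    if not eligible:
        return float("nan")
    won_within_d = sum(
        1 for o, c, w in eligible
        if w and c is not None and (c - o).days <= d
    )
    return won_within_d / len(eligible)

# One of the three eligible deals was won within 30 days.
print(d_day_close_rate(deals, 30, as_of=date(2023, 12, 31)))
```

Note the eligibility filter: deals opened fewer than d days before the measurement date are excluded entirely, which is exactly the first drawback described above.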

Another metric companies often use is the time between the deal open and close dates, usually with separate numbers reported for closed-and-won versus closed-and-lost deals. This metric is problematic because it leaves out all those deals that aren’t closed, and so it will usually underestimate the expected lifetime of a deal. I usually strongly advocate against its use.
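A quick simulation illustrates the size of the underestimate, assuming (purely for illustration) that true deal lifetimes are exponential with a 100-day mean and that only deals closing within a 60-day observation window ever get a close date recorded:

```python
import random

random.seed(1)

TRUE_MEAN = 100.0   # assumed true expected deal lifetime, in days
WINDOW = 60.0       # observation window: deals closing later remain "open"

lifetimes = [random.expovariate(1 / TRUE_MEAN) for _ in range(100_000)]
closed = [t for t in lifetimes if t <= WINDOW]

# Averaging only over closed deals ignores the censored (still-open) ones.
naive_mean = sum(closed) / len(closed)
print(f"true mean: {TRUE_MEAN:.0f} days, naive mean over closed deals: {naive_mean:.0f} days")
```

Under these assumptions the naive average comes out far below the true 100-day mean, because every long-lived deal is silently dropped from the calculation.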

For these reasons, when I come to a team, I try to reach back into my formal demographic training, where I learned that the proper definition of a demographic rate is the number of events that occur in a population of interest out of the total number of person-time units lived by that population. From those close rates, we can infer the expected waiting time, either as the inverse (assuming the rate is constant), or maybe using some life-table method. Or we could get fancier and estimate the survivor function and hazard rate with some kind of event history analysis model. All of these methods require me to know the start date of observation, and (if we’re trying to measure the closed and won rate) the end date (which is either the close and won date, the closed and lost date, or the current date for those deals that remain open, which are basically lost to follow-up).
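As a toy illustration of the occurrence/exposure rate and its inverse, with invented numbers and a constant-rate assumption (the days-at-risk figures are meant to already stop at the close date, whether won or lost):

```python
# Occurrence/exposure rate: events divided by person-time at risk.
# Each record: (days observed at risk, closed_and_won flag). Toy data.
deals = [
    (27, True), (100, False), (45, True), (200, False), (60, True),
]

events = sum(1 for _, won in deals if won)
person_days = sum(days for days, _ in deals)

rate = events / person_days   # closes-and-wins per deal-day at risk
expected_wait = 1 / rate      # expected days to win, if the rate is constant
print(f"{events} events / {person_days} deal-days = {rate:.4f}; "
      f"expected wait ≈ {expected_wait:.0f} days")
```

The constant-rate inversion is the crudest option mentioned above; life-table or event-history methods relax it.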

But an issue usually crops up that reminds me why companies stick with the supposedly inferior D-day close rate metrics. It has to do with the process whereby deals get closed and lost. What happens is that, over time, many deals remain open when they should probably have been set to closed and lost. Meanwhile, deals that were opened more recently obviously would not have many cases where the deal has an extremely old age. As a result, these ancient deals contribute many person-time units to the denominator of the rate for cohorts in earlier years, causing the rate to look like it is increasing over time.

I know there is something I could do that is better than D-day close rates. I know the solution to the problem probably has something to do with recognizing that the closed and lost examples shouldn’t be treated as just lost to follow-up since their censoring is meaningful. I also know that the method I use to infer these rates needs to account for the fact that closed and lost deals should not contribute person-time units to the denominator because they are no longer at risk of being closed and won. I also know that one way to deal with this problem is to construct period-based rates rather than cohort-based rates.

I replied:

I assume I can blog your question and my reply. Or I could respond privately, but then I’d charge my consulting fee!

Hanowell responded:

The reason I contacted you about this is so that you could reply to it in the public sphere. It’s a big issue in corporate analytics that isn’t handled well by a lot of teams. If your reply is just, “I don’t have time to think much about this, what say you, followers?” I guess that’s okay, too.

So then I read the full message above. I have not tried to follow all the details but it seems like a pretty standard survival analysis problem. So I’ll just give my generic advice which is to model all the data—set up a generative (probabilistic) model for what happens when for each deal. In my experience, that works better than trying to model summary statistics. Make your graphs of what you want, but forget the D-day close rate and just go to the raw data. Then once you’ve fit your model, you can simulate replicated data and do posterior predictive checks on summaries of interest such as those D-day close rates.
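To make the posterior-predictive-check idea concrete, here is a minimal sketch. The “posterior draws” are hard-coded stand-ins for actual MCMC output, and an exponential lifetime model is assumed purely for illustration:

```python
import random

random.seed(2)

# Stand-ins for posterior draws of a daily close rate (not real MCMC output).
posterior_rate_draws = [0.018, 0.022, 0.020, 0.025]

def replicated_30day_rate(rate, n_deals=10_000):
    """Simulate deal lifetimes under the model, then compute the
    30-day close rate on the replicated data."""
    times = [random.expovariate(rate) for _ in range(n_deals)]
    return sum(1 for t in times if t <= 30) / n_deals

rep_rates = [replicated_30day_rate(r) for r in posterior_rate_draws]
print(rep_rates)   # compare this spread to the observed 30-day close rate
```

The point is the workflow: fit the model to raw deal-level data, then check it by asking whether replicated datasets reproduce the summaries (like D-day close rates) that the business already cares about.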

It seems like the main issue is imputing when deals are closed and lost, since these are not always recorded properly. It’s a kind of measurement error or censoring model.

Doing this well requires some knowledge about realistic deal lifetimes, which can probably be estimated accurately from deal closed and won data, assuming that is high quality.

I had the same thought. The measurement error that might need to be modeled here is the possibility that “deals lost” may not be consistently measured. If there is a standard practice that deals are marked “lost” after a specified time period (e.g. 30 days), then it seems like a fairly standard survival analysis. However, if the practice is sloppily applied (e.g., deals are marked “lost” depending on who looks at the record and when), then this measurement error becomes important to model. It reminds me of the way that Japanese banks used to record bad loans – they kept them on the books as assets long after American banks would have recorded them as bad. This made the banks look profitable when they were not (they were referred to as “zombie banks”). To make matters worse, it was up to the discretion of the bank when to mark a loan as “bad.”

I assume there are deals in the dataset which are not marked lost or won ever… and I also assume that if you won a deal you know it, so if it’s not marked after a long time, it was lost. The question is, “when was it lost?”.

So in your analysis it’s easy to simply say “the longest a deal can run is 30 days” and mark it lost, but in the real world deals could take longer than 30 days. Instead you look at the deals-won data, which I assume is quite complete because there’s a record of the transaction if it occurs, and you model the survival time from a deal being opened to its being closed as “won.” Then you go back to deals that were opened but never closed, impute probabilistically the “time to be won,” and mark each one lost at that point in time.
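A single imputation pass along these lines might look like the following sketch. The exponential model for time-to-win is an assumption, all the numbers are invented, and in a full MCMC the draw would be repeated at every sampler step rather than done once:

```python
import random

random.seed(0)

# Observed days-to-close among closed-and-won deals (toy data).
won_times = [12, 25, 8, 40, 19, 31, 15, 22]
mean_win = sum(won_times) / len(won_times)   # MLE for an exponential model

# Never-closed deals: ages in days since opening (toy data).
open_deal_ages = [90, 45, 200]

# Draw a hypothetical "time it would have been won" for each open deal
# and treat that draw as its (imputed) loss time.
imputed_loss_times = [random.expovariate(1 / mean_win) for _ in open_deal_ages]
print(imputed_loss_times)
```

Repeating this draw across MCMC iterations is what turns a single guess into the “probably fairly narrow window” for the loss time described below.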

Then, by MCMC, you get a probabilistic “time to loss” since you’ll impute a different loss time at each sample step in the MCMC. So, you don’t know at which point in time you lost the deal, but you have a probably fairly narrow window where it “should have closed if it was going to close” which is the point you’re calling “lost”.

This makes me think of another complication. Survival analysis in health usually makes sense since most diseases are progressive and the time to an event is consistent with something bad happening. Deals may not behave in this way, unless they are relatively homogeneous. If some deals are much more complicated than others (e.g., larger, more moving parts, etc.), then we should expect some deals to take longer to close than others. The fact that one deal takes longer than another, in that case, does not tell us much on its own. If we can model the various factors that go into a deal (including its complexity), then the “adjusted” length of time to closure may enable a survival analysis such as is typically done in the medical literature. But it seems to me that there may be a qualitative difference between medical survival analyses and business deals. For example, the Cox Proportional Hazards model may not be appropriate for the latter. [I’ve tried to get an intuitive feel for exactly what the proportional hazards model assumptions really mean, but I have not yet found a good explanation].

I’m not quite sure exactly what is considered proportional hazards, but I think the general idea of hazard modeling is best understood as a differential equation…

dp/dt = Q(covariates,t) * H(t) * (1-p)

p(0)=0

Basically, H(t) is some “baseline” rate of occurrence per unit time, Q is a function of the covariates (and possibly of time) that multiplies the baseline rate, and the (1-p) factor forces p to never go above 1.

If p(0)=0, then the general solution is

p(t) = 1 - exp(-integrate(Q(c,s)*H(s), s, 0, t))

The basic “proportional hazards” model is the one where Q(c) has no time dependence and can be pulled out of the integral, I think.
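As a numeric sanity check on that solution, one can integrate the ODE directly with forward Euler and compare against the closed form, using an arbitrary constant Q and a made-up baseline hazard:

```python
import math

# Check that p(t) = 1 - exp(-∫₀ᵗ Q·H(s) ds) solves
# dp/dt = Q·H(t)·(1 - p), p(0) = 0, in the proportional-hazards case
# (constant Q), with an illustrative baseline hazard H(t) = 0.1 + 0.02·t.
Q = 1.5
H = lambda s: 0.1 + 0.02 * s

def p_closed_form(t):
    integral = Q * (0.1 * t + 0.01 * t * t)   # ∫₀ᵗ Q·H(s) ds, done by hand
    return 1 - math.exp(-integral)

# Forward-Euler integration of the ODE for comparison.
p, t, dt = 0.0, 0.0, 1e-4
while t < 5.0:
    p += Q * H(t) * (1 - p) * dt
    t += dt

print(abs(p - p_closed_form(5.0)))   # discrepancy shrinks with the step size
```

With a time-varying Q(c,t) the only change is that the integral inside the exponential must be computed over Q(c,s)·H(s) instead of pulling Q out front.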

As Dale suggests, the standard is sloppily applied. I have looked into an extension of event history analysis called cure modeling, wherein you assume that for some unknown fraction of the population (to be estimated from data), the event will never occur. There are other types of cure models, but that is the one that is easiest to understand. Unfortunately (well… fortunately for those of us who like to invent stuff), cure modeling in the case of competing risks hasn’t been studied all that much.
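For concreteness, here is the log-likelihood of the simplest mixture cure model: a deal is “cured” (will never close) with probability pi, and otherwise has an exponential time-to-close. Both the parameterization and the numbers are illustrative:

```python
import math

def cure_loglik(pi, rate, closed_times, open_ages):
    """Mixture cure model log-likelihood.
    pi: probability a deal is cured (never closes).
    rate: exponential close rate for non-cured deals."""
    ll = 0.0
    for t in closed_times:
        # Observed close at time t: the deal was not cured, density f(t).
        ll += math.log((1 - pi) * rate * math.exp(-rate * t))
    for a in open_ages:
        # Still open at age a: either cured, or not cured and not yet closed.
        ll += math.log(pi + (1 - pi) * math.exp(-rate * a))
    return ll

print(cure_loglik(0.3, 0.05, closed_times=[10, 20, 35], open_ages=[100, 250]))
```

The key difference from an ordinary censored-survival likelihood is the pi term in the open-deal contribution: very old open deals no longer drag the estimated close rate down, because the model can attribute them to the never-closing fraction.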

As Dale suggests, the closed-lost case is sloppily recorded, and the rules for doing so are dynamic across time. I’ve been looking into cure models as a way to adjust for the probability that a deal will never get closed. Another possibility is seeing if there is some way to signal whether deals get closed in some big clean-up operation all at once, and then build a model that accounts for that competing event… and how it affects multiple deals at once.

What about replacing the three states with a continuous score, where closed+won=1, closed+lost=0, and open=p(eventually being won | time elapsed so far)?

What if you diagrammed out the stocks and flows of deals going through the system much as you might diagram a PBPK model? The flows between stocks (accumulations or compartments, if you prefer) are more heavily driven by human decisions than in the typical PBPK case, I suspect; texts such as /Business Dynamics/ (Sterman, 2000) might help show how that can be managed. Note that, in a way, a PBPK model can be seen as a system dynamics model with a rather constrained structure.

Then write the equations (including prior information), add the data, fit or calibrate the model to get good estimates for parameters, and do the posterior simulations as Andrew suggested to get the quantities of interest and other insights into what’s going on. Since these are basically ODE models, you should be able to do that in Stan (*); I’ve also done it using MCSim and Vensim.

You can, if it fits your needs, consider expanding the model boundaries to include the effects of the D-day results on other key parts of the organization and the effects of their actions back on the measures. For a simple example, lower-than-desired D-day results might lead to more hiring, which, after an on-boarding phase, might lead to improved deal closure rates. Of course, the on-boarding could lead to temporarily lower productivity, as existing staff divert some time to training new hires. If that feedback is not understood, it could lead to excess hiring, much as eating too fast, without regard for dietary limits or a feeling of satiation, can lead to eating far too much.

(*) Human and organizational decision processes are often nonlinear, and system dynamics simulators traditionally describe those functions through a set of x-y coordinates that are interpolated into some sort of piecewise linear function or perhaps some smoother splines. Vensim lets you specify the points by clicking on a graph, while MCSim lets you specify both the points and the interpolation method through the GSL in inline code. I don’t know the best, most natural way to approach that in Stan.

The advantage I see to the PBPK / system dynamics approach is that it’s more of a generative model as Andrew suggested, since you have the freedom to map the path of deals more realistically. If the closed-lost case is sloppy, can you disaggregate it to make a better, more useful representation of reality? Does that disaggregation help you?

My (generic) advice is similar: create functions for the separate metrics, simulate data under a variety of conditions (varying the proportion of not-closed-but-should-be-closed cases within plausible ranges, for example), and compare the ability of each approach to predict other iterations of data generated using the same parameters. It might turn out that the problem with one or more of the common methods is actually trivial for prediction purposes. The same can (and should) be done with alternative metrics, including those that are theoretically superior, to ensure that they are actually better, or rather, to determine the conditions under which they are better. Those who insist on using the old, flawed metrics may also find this kind of empirical evidence more convincing.