Flaws in stupid horrible algorithm revealed because it made numerical predictions

Kaiser Fung points to this news article by David Jackson and Gary Marx:

The Illinois Department of Children and Family Services is ending a high-profile program that used computer data mining to identify children at risk for serious injury or death after the agency’s top official called the technology unreliable. . . .

Two Florida firms — the nonprofit Eckerd Connects and its for-profit partner, Mindshare Technology — mined electronic DCFS files and assigned a score of 1 to 100 to children who were the subject of an abuse allegation to the agency hotline. The algorithms rated the children’s risk of being killed or severely injured during the next two years, according to DCFS public statements.

OK, this could work. But then:

More than 4,100 Illinois children were assigned a 90 percent or greater probability of death or injury . . . 369 youngsters, all under age 9, got a 100 percent chance of death or serious injury in the next two years . . . At the same time, high-profile child deaths kept cropping up with little warning from the predictive analytics software . . . The DCFS automated case-tracking system was riddled with data entry errors . . .

Illinois child care agencies told the Tribune they were alarmed by computer-generated alerts like the one that said: “Please note that the two youngest children, ages 1 year and 4 years have been assigned a 99% probability by the Eckerd Rapid Safety Feedback metrics of serious harm or death in the next two years.”

Check out this response:

“We all agree that we could have done a better job with that language. I admit it is confusing,” said Eckerd spokesman Douglas Tobin.

Ummm . . . “confusing”? That’s all you can say? How about, “We screwed up. Our numbers were entirely wrong.”

“We could have done a better job with that language.” Jesus. Do these people have no shame? They went to the George Orwell Newspeak school of communication?

The statistics

We can draw two useful lessons from this fiasco:

1. Predictive model checking is important, indeed vital. Get testable, actionable predictions from your data, and look at these predictions carefully and critically. This is the same thing that I do when I read a scientific paper: I look at its specific numbers. For example, when I saw a paper with a graph saying that (a) in a certain area in China the life expectancy was 91 years, and (b) a certain policy had an effect of -5 years, implying an expected life expectancy of 96 years in that area in the absence of the policy, I was suspicious. The paper had a lot of problems, but the good news is that it put its data right out there, so I and others were able to see that something had to be wrong.

Similarly, Eckerd really screwed up with its algorithm—but it’s good they gave numerical probabilities, as this supplies the loud siren that makes us realize that these numbers are bad.

What would have been worse is if Eckerd had just reported the predictions as “high risk,” “mid risk,” and “low risk,” or something like that. The stupid probability numbers were a feature, not a bug, as they made the problems apparent. (A minimal sketch of this kind of check appears just after this list.)

2. Data quality is important. The best algorithm in the world won’t work if you’re not serious about your data. And it seems that these people were much more serious about making serious money via no-bid contracts (see below) than about their data.
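
To make lesson 1 concrete, here is a minimal sketch of the kind of check that would have set off alarms immediately: if the scores really are probabilities, their sum is the number of deaths or serious injuries the model expects, and that number can be compared with what actually happens. The 4,100 and 369 come from the article; the function name, the other case counts, and the observed count are made up for illustration.

```python
import numpy as np

def check_implied_event_count(pred_probs, observed_events):
    """Compare the model's implied number of events with the number actually observed.

    pred_probs: predicted probabilities of death or serious injury, one per child,
                over the two-year prediction window
    observed_events: number of such events actually recorded in a comparable window
    """
    implied = pred_probs.sum()  # expected events if the probabilities were calibrated
    print(f"implied events:  {implied:,.0f}")
    print(f"observed events: {observed_events:,}")
    if implied > 3 * max(observed_events, 1):
        print("red flag: the model implies far more tragedies than actually occur")

# Score distribution in the spirit of the article: 4,100 children at 0.90 or above
# (369 of them at 1.00), plus a made-up 350,000 other screened cases at a low score.
# The observed count of 100 is also made up.
scores = np.concatenate([
    np.full(369, 1.00),
    np.full(4_100 - 369, 0.95),
    np.full(350_000, 0.02),
])
check_implied_event_count(scores, observed_events=100)
```

With thousands of children scored at 90 percent or more, the model is implicitly forecasting thousands of tragedies over two years, which nobody believes; that mismatch is exactly the loud siren mentioned above.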

The politics

From the news article:

The $366,000 Rapid Safety Feedback program was central to reforms promised by . . . [former Illinois Department of Children and Family Services Director] George Sheldon . . .

A May 2017 Tribune investigation found the arrangement with Eckerd was among a series of no-bid deals Sheldon gave to a circle of associates from his previous work in Florida as a child welfare official, lawyer and lobbyist. Sheldon left Illinois under a cloud a month later, and a July joint report by the Office of Executive Inspector General and the DCFS inspector general concluded that Sheldon and DCFS committed mismanagement by classifying the Eckerd/Mindshare arrangement as a grant, instead of as a no-bid contract. . . .

Eckerd Connects — which recently changed its name from Eckerd Kids — told the Tribune that variants of its Rapid Safety Feedback are used today by child welfare agencies in Ohio, Indiana, Maine, Louisiana, Tennessee, Connecticut and Oklahoma. . . .

Even before arriving in Illinois, Sheldon had professional ties to both Eckerd and Mindshare.

He is quoted on Mindshare’s website endorsing that company and its technology. And as head of Florida’s child welfare agency, he worked closely with Eckerd, which runs child welfare programs in Florida’s Hillsborough County under a $73 million state contract, using for-profit companies as subcontractors.

When Sheldon arrived in Illinois in 2015, he appointed Eckerd’s Chief External Relations Officer Jody Grutza to a $125,000 senior DCFS position. While Grutza did not supervise the Eckerd contract, Sheldon put her in charge of overseeing other deals with Sheldon’s Florida associates, including a Five Points Technology contract that paid $262,000 to Christopher Pantaleon, a longtime Sheldon aide with whom Sheldon owned Florida property, the Tribune revealed in a July report.

After a year in Illinois, Grutza returned to a top position with Eckerd in Florida.

Comments

  1. Is the algorithm public? Do we know what went into it? Could it be as simple as an OLS model with a dummy dependent variable, producing many values that exceed 1 (or 100), which they then manually squeezed into the 0-1 range?
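
     For what it’s worth, that failure mode is easy to reproduce: fit OLS to a 0/1 outcome whose probability saturates near 1, and the straight line overshoots at the top. Everything below is simulated for illustration, not a reconstruction of the actual model.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)

     # Made-up data: the outcome probability saturates near 1 for large x,
     # which is exactly where a straight-line fit overshoots.
     n = 10_000
     x = rng.uniform(0, 10, n)
     p_true = 1 / (1 + np.exp(-2 * (x - 2.5)))
     y = (rng.random(n) < p_true).astype(float)

     # "OLS with a dummy dependent variable," i.e. a linear probability model.
     X = np.column_stack([np.ones(n), x])
     beta, *_ = np.linalg.lstsq(X, y, rcond=None)
     fitted = X @ beta

     print("max fitted value:", round(fitted.max(), 2))        # comfortably above 1
     print("share of fitted values above 1:", round(np.mean(fitted > 1), 2))
     # Clipping (or rescaling) these to a 0-100 range would then pile cases up at 100.
     ```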

  2. I suspect they used a training data set which was enriched with children with bad outcomes. If you start with a 50:50 chance, a few predictors with a moderate impact on the odds-ratio will quickly push the prediction towards 1.

    • Quite likely, since it’s standard predictive analytic practice to enrich the data set. But only a bozo fails to correct for the oversampling.
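
      A minimal sketch of that correction, assuming a plain logistic model trained on a 50:50 enriched sample and a made-up population base rate; the function and numbers are illustrative, not anything from the Eckerd system:

      ```python
      import numpy as np

      def corrected_intercept(beta0_enriched, sample_rate, population_rate):
          """Prior correction for a logistic model fit on an oversampled training set.

          beta0_enriched:  intercept estimated on the enriched sample
          sample_rate:     fraction of positive cases in the training sample (e.g. 0.5)
          population_rate: fraction of positive cases in the real population (e.g. 0.001)
          """
          # Standard case-control / rare-events correction (King & Zeng 2001):
          # shift the intercept by the log of the relative sampling odds.
          offset = np.log((sample_rate / (1 - sample_rate)) *
                          ((1 - population_rate) / population_rate))
          return beta0_enriched - offset

      # Hypothetical numbers: trained on a 50:50 sample when the true base rate is 0.1%.
      # Skipping this step inflates every predicted odds by a factor of roughly 1,000.
      print(corrected_intercept(beta0_enriched=0.0, sample_rate=0.5, population_rate=0.001))
      # -> about -6.9, i.e. a corrected baseline probability of about 0.001 instead of 0.5
      ```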

  3. How much do you really expect for $365K? That’s an honest question. Both about the cost of building a statistical model, and the value of a political pay-off.

    If you really want to make a pay-off, just call it a Big Data Research Initiative and say the research revealed that the data set the previous head of the agency collected was “full of holes.” Then you wouldn’t distract people who are actually doing their job. And who might talk to the press.

    • How much do you really expect for $365K?

      I’m pretty sure if they sent me the data I could do better in less than an hour. Of course, it also sounds like DCFS expected to be able to fill in the data wrong and still get useful predictions.

      • Hired! Except we need you to include the cost of acquiring the data, documentation of your analysis, and integration of your output with the relevant departments’ systems.

        • It sounded like DCFS just sent some files over and got an auto-generated report back:

          Illinois child care agencies told the Tribune they were alarmed by computer-generated alerts like the one that said: “Please note that the two youngest children, ages 1 year and 4 years have been assigned a 99% probability by the Eckerd Rapid Safety Feedback metrics of serious harm or death in the next two years.”

          […]

          DCFS gives Eckerd a nightly “data dump” from the state’s automated case-tracking system, and the next morning Eckerd generates real-time scores flagging the most imperiled children.

          Front-line caseworkers should never get those raw scores, let alone make decisions based on them, Eckerd says; the data instead should be reviewed by DCFS supervisors who are trained and coached by Eckerd to decide which cases need immediate attention and how to tackle them.

          http://www.chicagotribune.com/news/watchdog/ct-dcfs-eckerd-met-20171206-story.html

        • Do you think that day 1, DCFS said “Here’s a csv with predictor variables and an outcome, filled out. Everyday, you will be sent a new csv in which the same predictors in the same format will be sent, but the outcome will be left blank. Please send us back your predictions.”?

          If you’ve ever had an experience like this, myself and thousands of others would like to know where.

        • Do you think that day 1, DCFS said “Here’s a csv with predictor variables and an outcome, filled out. Everyday, you will be sent a new csv in which the same predictors in the same format will be sent, but the outcome will be left blank. Please send us back your predictions.”?

          If you’ve ever had an experience like this, myself and thousands of others would like to know where.

          What makes you think that is what I think happened?

        • This post looks like nonsense to me, not sure what I was thinking.

          But anyway I do agree that 99% of the work for these projects is preparing the data in one way or another. I think that the performance described in the OP is so bad that it would only take a few features to beat it. Actually training the algo would only use up a few seconds to minutes of the hour; most of the time would be spent figuring out how to scrape the 3-5 features from the input.
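
          In that spirit, a toy comparison (every number below is made up): unless those 90-100% scores actually land on the children who are harmed, even the laziest baseline, predicting the overall base rate for everyone, beats them on a proper scoring rule such as the Brier score.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          # Hypothetical setup: 100,000 screened children, a true event rate of 0.1%,
          # and outcomes unrelated to which children got the high scores (consistent
          # with deaths "cropping up with little warning" from the software).
          n, base_rate = 100_000, 0.001
          outcomes = (rng.random(n) < base_rate).astype(float)

          # Vendor-style scores in the spirit of the article: a few thousand children
          # scored at 90-100%, everyone else low.
          vendor = np.full(n, 0.02)
          vendor[:4_000] = rng.uniform(0.90, 1.00, 4_000)

          # The laziest possible baseline: predict the base rate for everyone.
          baseline = np.full(n, base_rate)

          def brier(p):
              return np.mean((p - outcomes) ** 2)

          print("Brier score, vendor-style scores:", round(brier(vendor), 5))
          print("Brier score, constant base rate: ", round(brier(baseline), 5))  # much lower
          ```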

  4. “The DCFS automated case-tracking system was riddled with data entry errors . . .”

    Sounds like Garbage in, Garbage out. So even if the algorithm itself were basically sound, it wouldn’t have been able to generate good results…

    • It could happen that way, but in this case it pretty clearly didn’t. A “basically sound” algorithm with even the most rudimentary unit-testing should have revealed the fact that the data was providing WAY too many inflated values. So either they designed a good algorithm but failed software design 101 and didn’t build tests or they failed software design 101 AND they failed their machine learning courses.

      • Maybe they never had software design 101 nor any machine learning courses? Or only online courses from fly-by-night companies? Or other variations of “Good enough for gumment work”?

      • The question that rarely gets answered is: who gets the results right? The pool of those who get it right appears to be minuscule. What distinguishes members of such a pool?

    • Clearly the score produced by the system is not a probability (maybe not even a percentile, if only 3% of the cases got a score over 90; see the link below). It’s not so clear who introduced the notion of probability in the alerts. Eckerd is probably responsible to some extent; I guess it was part of the customization they provided, and even if it was subcontracted they should have checked the implementation.

      https://eckerd.org/response-chicago-tribune-story/
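
      A rough sketch of what such a rudimentary check could look like, with made-up inputs rather than Eckerd’s actual pipeline: treat the 1-100 scores first as probabilities and then as percentiles, and see whether either reading survives contact with the historical event rate.

      ```python
      import numpy as np

      def sanity_check_scores(scores, historical_event_rate):
          """Rudimentary checks that 1-100 risk scores aren't obviously broken.

          scores: risk scores for all screened cases (assumed to be on a 1-100 scale)
          historical_event_rate: fraction of screened cases historically ending in
                                 serious harm or death within the prediction window
          """
          assert scores.min() >= 1 and scores.max() <= 100, "scores outside the 1-100 scale"

          # Read as probabilities: the mean score (as a fraction) should be in the
          # same ballpark as the historical event rate.
          implied_rate = scores.mean() / 100
          print(f"implied event rate {implied_rate:.3%} vs historical {historical_event_rate:.3%}")

          # Read as percentiles: about 10% of cases should score above 90.
          print(f"share of cases scored above 90: {np.mean(scores > 90):.1%}")

          # Either way, a pile-up of 'certain death' scores is a red flag.
          print(f"cases scored exactly 100: {int(np.sum(scores == 100))}")

      # Hypothetical usage with made-up scores and a made-up historical rate:
      fake_scores = np.clip(np.random.default_rng(1).normal(20, 25, 50_000), 1, 100)
      sanity_check_scores(fake_scores, historical_event_rate=0.001)
      ```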

  5. Just attended this morning’s JSM overview lecture on Big Data use in criminology. The speaker, to her credit, pointed out many examples of predictive models based on Big Data (here I mean OCCAM data, such as surveillance data) that led to adverse outcomes, such as increasing arrests but not decreasing crime – issues similar to those in this Eckerd project, though maybe less severe. I asked whether criminologists have, or are talking about, a “first do no harm” principle, and whether people are measuring the harm of their models.

    The answer, if I may paraphrase, is
    a) there are a few people trying to measure the harm – one author did work on LAPD models and found problems (feedback loops if I recall, which is a big issue when using observational datasets)
    b) there is no organized effort to work on such a principle
    c) the authorities like what we are doing

    Link about OCCAM data
