“The Hitchhiker’s Guide to Responsible Machine Learning” and “Statistical Analysis Illustrated”

Przemysław Biecek writes:

I am working on Responsible Machine Learning methods. I recently wrote a short fusion of a comic book and a classic book: the comic serves to present the iterative process of building a predictive model, and the book is used to understand exploratory methods.

And Jeffrey Kottemann sends along this book, Statistical Analysis Illustrated, which could be useful as a supplementary text in a standard intro statistics course.

It’s always good to see new illustrated introductory statistics material. Enjoy!

5 thoughts on “The Hitchhiker’s Guide to Responsible Machine Learning” and “Statistical Analysis Illustrated”

  1. I flipped through Biecek’s book, which displays very nicely and seems well executed, with many insightful points being raised. Perhaps in the next edition they can connect being responsible with choosing the most interpretable model that has adequate accuracy (see From Exploring to Building Accurate Interpretable Machine Learning Models for Decision-Making, https://www.statcan.gc.ca/en/data-science/network/decision-making).

    As for Statistical Analysis Illustrated, a quick scan indicates the same worn-out curriculum, largely useless for most, that has persisted for 20 or 40 years. A metaphor would be providing a hand axe and a sharpening stone along with a manual for building a log cabin from scratch. Yes, that will be the largest audience for a stats book for an intro course.

    And I noticed this on page 52 “When our _sample statistic_ is outside of our 95% confidence interval, we reject the Null Hypothesis and call the result statistically significant.” Hope this is just a typo.

    • Keith:

      I agree that Statistical Analysis Illustrated is teaching standard stuff. I wouldn’t dream of teaching this stuff myself but, as I wrote above, I think it could be a useful learning aid for students who are required to learn that stuff.

    • Most people don’t interpret simple linear models correctly. They simply *think* they can interpret the coefficients without realizing the meaning is conditional on the model being correct. So if you add a new feature to your model, the magnitude and sign of your coefficient can change drastically.

      And in the comic the first example is predicting “the risk of death in case of infection. We need to know in what order they should be vaccinated”.

      Then they go on to look at CDC data corresponding only to deaths attributed to COVID. If you want risk of death, you need to look at all-cause mortality.

      So this is not capable of answering the question. It is mass confusion out there.
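
The point above about coefficients changing when a feature is added can be made concrete with a tiny made-up data set. This is only a minimal sketch in plain Python, not anything taken from the comic or either book: the data and the helper names `slope_one` and `slopes_two` are hypothetical, chosen so that the two features are strongly correlated.

```python
# Hypothetical toy data: x2 tracks x1 closely, and y is built as -1*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
x2 = [a + e for a, e in zip(x1, noise)]
y = [-1.0 * a + 3.0 * b for a, b in zip(x1, x2)]


def centered(v):
    """Subtract the mean, which is equivalent to fitting an intercept."""
    m = sum(v) / len(v)
    return [t - m for t in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slope_one(x, y):
    """Least-squares slope of y on a single feature x (with intercept)."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def slopes_two(xa, xb, y):
    """Least-squares slopes of y on two features (with intercept),
    solving the 2x2 normal equations by Cramer's rule."""
    a, b, yc = centered(xa), centered(xb), centered(y)
    saa, sbb, sab = dot(a, a), dot(b, b), dot(a, b)
    say, sby = dot(a, yc), dot(b, yc)
    det = saa * sbb - sab * sab
    return (say * sbb - sby * sab) / det, (sby * saa - say * sab) / det


b1_alone = slope_one(x1, y)
b1_full, b2_full = slopes_two(x1, x2, y)
print(f"x1 alone:  coefficient on x1 = {b1_alone:+.3f}")
print(f"x1 and x2: coefficient on x1 = {b1_full:+.3f}, on x2 = {b2_full:+.3f}")
```

In this toy setup the coefficient on `x1` is positive when `x1` is the only feature and negative once `x2` is included, even though the data never changed. Each coefficient only has meaning relative to the other terms in the model.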

    • > And I noticed this on page 52 “When our _sample statistic_ is outside of our 95% confidence interval, we reject the Null Hypothesis and call the result statistically significant.” Hope this is just a typo.

      I would add the emphasis elsewhere: “When our sample statistic is outside of our 95% _confidence interval_, we reject the Null Hypothesis and call the result statistically significant.”

      Those confidence intervals are unrelated to the data! They depend only on the null hypothesis:

      > Referring to Figure 5.1, suppose a population is 50% in favor of a new public health policy, and 1,000 surveyors survey the population using random sampling of sample size 100. All 1,000 surveyors hypothesize that the population is 50% in favor, and they use the appropriate 95% confidence interval spanning from 40% to 60%.
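
The 40% to 60% range in the quoted passage can be roughly reproduced from the hypothesized proportion and the sample size alone, which is the commenter's point that it does not depend on the data. The sketch below assumes the usual normal approximation with a multiplier of 1.96; the book may construct its interval differently. The simulated surveyors, the fixed seed, and the variable names are illustrative assumptions.

```python
import math
import random

p0 = 0.5   # hypothesized population proportion (from the quoted passage)
n = 100    # sample size per surveyor (from the quoted passage)
z = 1.96   # assumed: normal-approximation multiplier for 95%

# Interval around the hypothesized proportion; no sample data is used here.
margin = z * math.sqrt(p0 * (1 - p0) / n)
low, high = p0 - margin, p0 + margin
print(f"interval around the null: {low:.3f} to {high:.3f}")

# Simulate 1,000 surveyors, each drawing a random sample of size n
# from a population that really is 50% in favor.
rng = random.Random(0)  # fixed seed so the run is repeatable
surveyors = 1000
inside = 0
for _ in range(surveyors):
    favorable = sum(rng.random() < p0 for _ in range(n))
    if low <= favorable / n <= high:
        inside += 1

share_inside = inside / surveyors
print(f"share of sample proportions inside the interval: {share_inside:.3f}")
```

The share of simulated sample proportions landing inside the interval should come out near, though not exactly at, 95%, because sample proportions with n = 100 can only take values in steps of 1%.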

  2. In response to a previous commenter, who admits to only having given the book(let) a “quick scan,” I freely admit that this book(let) covers “standard stuff.” It simply tries to better explain a selection of topics that students have traditionally had a hard time grasping.

    As for the comment “And I noticed this on page 52 ‘When our _sample statistic_ is outside of our 95% confidence interval, we reject the Null Hypothesis and call the result statistically significant.’ Hope this is just a typo.”

    The careful reader will note on page 31:

    “Again, in summary, and for emphasis:
    1) We expect the 95% confidence interval around the population proportion to
    contain 95% of all sample proportions obtained by random sampling.
    2) We expect 95% of all the 95% confidence intervals based on random sample
    proportions to contain the population proportion.

    Because of these two facts, we will reach the same conclusion whether we (1) check
    if a sample proportion is outside the 95% interval surrounding a hypothesized
    population proportion, or (2) check if the hypothesized population proportion is
    outside the 95% interval surrounding a sample proportion. The analysis can be
    done either way.”
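
The equivalence described in that passage can be checked by brute force for a given sample size. The sketch below is only an illustration under stated assumptions: it uses the normal approximation, and it takes "the 95% interval surrounding a sample proportion" to be a Wald interval whose standard error is computed from the sample proportion. The function names are hypothetical, and the book may define its intervals differently. The two checks coincide exactly when both intervals have the same half-width; when the standard errors are computed from different proportions, they can differ for borderline counts.

```python
import math

Z = 1.96  # assumed: normal-approximation multiplier for 95%


def reject_via_null_interval(x, n, p0):
    """Check (1): is the sample proportion outside the interval around p0?"""
    margin = Z * math.sqrt(p0 * (1 - p0) / n)
    p_hat = x / n
    return p_hat < p0 - margin or p_hat > p0 + margin


def reject_via_sample_interval(x, n, p0):
    """Check (2): is p0 outside the (assumed Wald) interval around p_hat?"""
    p_hat = x / n
    margin = Z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p0 < p_hat - margin or p0 > p_hat + margin


def disagreements(n, p0):
    """Counts x in 0..n for which the two checks reach different conclusions."""
    return [
        x
        for x in range(n + 1)
        if reject_via_null_interval(x, n, p0) != reject_via_sample_interval(x, n, p0)
    ]


print("n=100, p0=0.5:", disagreements(100, 0.5))
print("n=100, p0=0.1:", disagreements(100, 0.1))
```

With n = 100 and a hypothesized 50%, as in the book's survey example, the two checks agree for every possible count. With a hypothesized 10%, a count of 5 out of 100 is one case where they differ under these assumptions, so "the same conclusion" is best read as holding approximately rather than for every sample.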
