Regression and Other Stories is available!

This will be, without a doubt, the most fun you’ll have ever had reading a statistics book. Also I think you’ll learn a few things reading it. I know that we learned a lot writing it.

Regression and Other Stories started out as the first half of Data Analysis Using Regression and Multilevel/Hierarchical Models, but then we added a lot more and we ended up rewriting and rearranging just about all of what we had before. So this is basically an entirely new book. Lots has happened since 2007, so there was much new to be said. Jennifer and Aki are great collaborators. And we put lots of effort into every example.

Here’s the Table of Contents.

The chapter titles in the book are descriptive. Here are more dramatic titles intended to evoke some of the surprise you should feel when working through this material:

• Part 1:
– Chapter 1: Prediction as a unifying theme in statistics and causal inference.
– Chapter 2: Data collection and visualization are important.
– Chapter 3: Here’s the math you actually need to know.
– Chapter 4: Time to unlearn what you thought you knew about statistics.
– Chapter 5: You don’t understand your model until you can simulate from it.

• Part 2:
– Chapter 6: Let’s think deeply about regression.
– Chapter 7: You can’t just do regression, you have to understand regression.
– Chapter 8: Least squares and all that.
– Chapter 9: Let’s be clear about our uncertainty and about our prior knowledge.
– Chapter 10: You don’t just fit models, you build models.
– Chapter 11: Can you convince me to trust your model?
– Chapter 12: Only fools work on the raw scale.

• Part 3:
– Chapter 13: Modeling probabilities.
– Chapter 14: Logistic regression pro tips.
– Chapter 15: Building models from the inside out.

• Part 4:
– Chapter 16: To understand the past, you must first know the future.
– Chapter 17: Enough about your data. Tell me about the population.

• Part 5:
– Chapter 18: How can flipping a coin help you estimate causal effects?
– Chapter 19: Using correlation and assumptions to infer causation.
– Chapter 20: Causal inference is just a kind of prediction.
– Chapter 21: More assumptions, more problems.

• Part 6:
– Chapter 22: Who’s got next?

• Appendixes:
– Appendix A: R quick start.
– Appendix B: These are our favorite workflow tips; what are yours?

Here’s the preface, which among other things gives some suggestions for how to use this book as a text for a course, and here’s the first chapter.

The book concludes with a list of 10 quick tips to improve your regression modeling. Here’s the chapter, and these are the tips:

– 1. Think about variation and replication.

– 2. Forget about statistical significance.

– 3. Graph the relevant and not the irrelevant.

– 4. Interpret regression coefficients as comparisons.

– 5. Understand statistical methods using fake-data simulation. (See the sketch just after this list.)

– 6. Fit many models.

– 7. Set up a computational workflow.

– 8. Use transformations.

– 9. Do causal inference in a targeted way, not as a byproduct of a large regression.

– 10. Learn methods through live examples.
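
To make tip 5 concrete, here is a minimal fake-data simulation in R. This is my toy sketch, not code from the book (the book’s own examples use stan_glm from the rstanarm package, but plain lm keeps this self-contained): simulate data from a regression with known coefficients, refit, and check that the estimates recover the truth.

set.seed(123)
n <- 100
a <- 2       # true intercept
b <- 0.5     # true slope
sigma <- 1   # true residual sd
x <- runif(n, 0, 10)
y <- a + b*x + rnorm(n, 0, sigma)   # simulate from the assumed model
fit <- lm(y ~ x)
summary(fit)                        # estimates should land near a = 2, b = 0.5

Repeat this across many simulated datasets and you can watch the sampling distribution of the estimates, which is the point of the tip: you understand a method by seeing it work on data where you know the right answer.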

And here’s the index.

You can order the book here. Enjoy.

P.S. I saved the best for last. All the data and code for the book are on this beautiful GitHub page that Aki put together. You can run and modify all the examples!

96 thoughts on “Regression and Other Stories is available!”

  1. I like e-books for easy reference, but I have gotten a number that are statistical/mathematical in nature and they are nearly worthless because of the way the equations are rendered: they are usually small and unreadable, and can be “opened” to a larger view, but then you can’t see the context. Is there any way to get a sample of what the equations look like in the Kindle edition? Otherwise I am tempted by the paperback; it is priced nicely.

    And congrats on getting this out.

    • I have the same concern. I’m hoping that the Kindle sample available on the 23rd will help me make a decision one way or the other. I’m very much looking forward to the book.

    • The Kindle edition is out. It’s a print replica, so the equations are better than in other e-formats I’ve seen. Several chapters are available as previews, so you can get a good sense of what it’s like.

  2. Thank you so much for this work! I’ve spent the past few weeks constantly checking the blog for an update about the book. Looking forward to learning.

  3. The paperback is extremely affordable for a stats text; very pleased to see that. I wish BDA and McElreath’s Statistical Rethinking could be had anywhere near as cheaply!

  4. At long last! And just in time for my August trip to the beach.

    Actually, you can’t order it yet — Amazon says it will be released on 23 July. But you can pre-order it, which I just did.

    By the way, this book will make a great Christmas or birthday present for _anyone_, regardless of whether that person knows statistics or has any interest in statistics. So be sure to buy several copies.

  5. Thank you for this book. I pre-ordered it many months ago. Finally, I can count down the days until it’s on my table!

    Andrew:
    When is the updated second half of Data Analysis Using Regression and Multilevel/Hierarchical Models expected?
    Thanks.

  6. Hi Andrew, Jennifer & Aki,

    Congratulations on publishing this book. The chapter titles sound so fun & juicy that this may well turn out to be the first book since Cohen & Cohen’s classic on “Applied correlation/regression analysis” that I will read from cover to cover. I already placed an order. If it shows up on the NYT non-fiction bestseller list, you’ll know you’ve hit a nerve. If it shows up on the NYT fiction bestseller list, though, you’ll know something went wrong in a big way…

    Cheers!
    Oliver

    • …let me qualify that “first book since” bit — it should have been “the first book on REGRESSION since…”! Because I have read two or three other books…

      –ocs

  7. Hi Andrew,
    By and large this looks like a really cool book. I particularly like the recommendation to simulate fake data, but I’ve already found other things to like.
    However, I still don’t agree with your negativity about testing, and it seems you don’t agree with yourself on that.
    Your earlier posting about regression discontinuities says this:
    “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”
    I’m fine with that, but this is just a significance test worded differently. Appendix B in the book says there are no true zeroes and we’re not interested in them. But the argument you’re making here is that a true-zero model is compatible with the data. And it’s a good argument, because you’re not claiming that the effect really is zero, but rather that the data cannot tell apart whatever goes on from zero. Which is good to know. But this could be acknowledged in your book as well!
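
    For concreteness, here is the arithmetic behind that equivalence, as a quick R check (the 2.4/2.4 numbers are from the quoted post; the rest is standard normal-approximation arithmetic):

    est <- 2.4; se <- 2.4
    est / se                  # z = 1: the estimate is one standard error from zero
    2 * pnorm(-abs(est/se))   # two-sided p-value, about 0.32
    est + c(-2, 2) * se       # rough 95% interval: (-2.4, 7.2), which includes zero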

      • I’m just talking about the linked Appendix B, 10 quick tips, because that’s all I know yet. B.2 says “Forget about statistical significance. Forget about p-values, and forget about whether your confidence intervals exclude zero.” The point you made in your discussion of the regression discontinuity thing was exactly that the confidence interval for the effect includes zero (“is consistent with noise”). So there are good reasons to be interested in whether that’s the case.

        • Are you saying Andrew wrote somewhere that:

          Confidence interval for the effect includes zero == Consistent with noise?

        • To me “consistent with noise” requires knowing how the data was collected, what would be a practical deviation from the prediction, etc. You need to know the order of magnitude you are dealing with when it comes to sources of systematic error, whether the magnitude of deviation has any practical or theoretical importance, etc.

          The test of whether it is consistent with zero is irrelevant.

        • What I’m claiming is that Andrew tested that (using wording that made it less obvious), in the specific example, and that I’m fine with the conclusion he got from that. Did you read it? Do you think there’s anything wrong with what he did there and how he interpreted it? (In that case maybe Andrew better defends himself…)

        • I’d need to see the context of the quote, but it does seem like he got lazy there. Maybe there are assumptions made that were not written down? It’s easy to fall back on NHST! He could answer you better than me about his intent, of course.

        • “Consistent with noise” means “we can’t prove using this test that a specific RNG wouldn’t tend to produce this kind of data”.

          If you have some data x, and you do a t-test that x has mean 0 and get p = 0.21, then this is “consistent with [our model of] noise” in the sense that the pure noise model might produce data of this type frequently enough. If you get p = 0.0002, then this is “not consistent with [our model of] noise” in the sense that the proposed noise would be extremely unlikely to produce such data.

          Most people go wrong when they jump to unwarranted conclusions: “The signal is not consistent with noise, therefore power pose is true”. or “The signal is not consistent with noise, therefore winning an election adds 5-10 years to your life” etc.

          It’s much more likely that “the signal is not consistent with our model of noise, but it’s consistent with any number of other models for uninteresting stuff… what we really need is to make fairly specific predictions, and then see those come more or less true…”
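
          To illustrate that distinction in R (my toy numbers; the exact p-values will vary with the seed):

          set.seed(1)
          x <- rnorm(30)              # pure noise: our noise model is N(0, 1)
          t.test(x, mu = 0)$p.value   # typically large: consistent with noise
          y <- rnorm(30, mean = 1)    # the same noise plus a real signal
          t.test(y, mu = 0)$p.value   # typically tiny: not consistent with noise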

        • Well, if the data are consistent with such an RNG, it’s pretty clear that they do not allow strong conclusions in the opposite direction, which is a good thing to know, no? And I take it that Andrew agrees, because this seems to be his argument in that posting.

          Of course there are lots of misinterpretations of tests, but this whole idea that because people don’t properly understand and misuse one approach they should learn another one that is surely at least as subtle (don’t get me started on interpretations of prior distributions, where they come from, and what many people make of them) is a mystery to me. If people want to jump to quick conclusions and not want to properly think through the foundations, they will misuse whatever approach to get them.

        • “Well, if the data are consistent with such an RNG, it’s pretty clear that they do not allow strong conclusions in the opposite direction, which is a good thing to know, no?”

          That just means you didn’t spend enough money, so the sample size is too low and/or the measurements are too imprecise. So when you use NHST, “paradoxically,” the better the study, the less evidence is provided for the research hypothesis.

          Meehl 1967 covers this well: http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/074theorytestingparadox.pdf

        • Or in some cases, you don’t have a very specific research hypothesis… if you sample the position of a chaotic oscillator you might be unable to reject the idea that it’s a Normal(0,1) random number generator… But if you have a model for its dynamics you might be able to predict every single data point to within 3 significant figures using an ODE…

        • Guys, you don’t need to teach me about tests. Andrew applied it to a specific situation where the researchers spent what they spent and did what they did, and there it makes sense. And there are many situations like this.

        • You have replied to postings that I wrote about it for some time now. It’s not that difficult to find really. Just scroll a few of his entries down.

        • Just as easy to post a link: https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/

          Yeah, like I’ve said on here before, it’s not useful to interpret the coefficients of arbitrary statistical models to begin with. He shows that by playing with the model specification later.

          You can also add other variables like wealth, age of spouse, preexisting health conditions, etc and that will change it too.

          So trying to interpret these arbitrary numbers is an even more egregious error to begin with. Then people do NHST on top of that to make it even more nonsensical.

        • It seems now you’re commenting on what the people did that Andrew cited; the test I was referring to was the one Andrew did to show that it’s compatible with an “all noise” H0. (As I mentioned, Andrew didn’t present it as a test, but what he wrote is equivalent to running a test.)

        • Yes, it seems he did some NHST there. I wouldn’t be looking at these tables of regression coefficients to begin with.

          I’d split the data into train, validation, and test sets by time and pick a model based on predictive skill and computational resources required. Then report the skill on the final test dataset that was only run once. Next you have to wait for future data to come in, or maybe see how well it performs for other countries.
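
          A minimal sketch of that kind of time-ordered split in R (the data frame and variables here are made up for illustration, not from the post being discussed):

          set.seed(42)
          d <- data.frame(year = 2000:2099, x1 = rnorm(100), x2 = rnorm(100))
          d$y <- 1 + 0.5 * d$x1 + rnorm(100)

          d <- d[order(d$year), ]   # sort by time before splitting
          train <- d[1:60, ]        # earliest 60%
          valid <- d[61:80, ]       # next 20%
          test  <- d[81:100, ]      # most recent 20%, touched exactly once

          rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
          fit1 <- lm(y ~ x1, data = train)
          fit2 <- lm(y ~ x1 + x2, data = train)
          rmse(valid$y, predict(fit1, newdata = valid))   # pick the model with
          rmse(valid$y, predict(fit2, newdata = valid))   # better validation skill,
                                                          # then report its rmse on test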

        • Christian:

          Huh? We do have a section in the book on regression discontinuity analysis, but we don’t ever say that Confidence interval for the effect includes zero == Consistent with noise. At least I don’t think we do!

        • I think his reply to Anoneouid is again a reference to the post from July 2nd where you wrote:

          “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”

        • Christian:

          In that post I was talking about uncertainty in the parameter estimate. I was not checking whether a confidence interval includes zero. I do think that null hypothesis significance testing can be valuable, especially in cases where you can’t reject the null hypothesis, but I don’t think it’s so useful to do this using p-values or by looking at the endpoints of confidence intervals.

        • “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”
          How do you find out that an effect of 2.4 with an SE of 2.4 means that it’s consistent with noise, other than by checking whether 0 is in the CI?

        • I mean, you are a pro, so you don’t need to compute the CI to actually know it, but in order to convince a first-year student that your numbers mean “it’s consistent with noise,” what can you do that isn’t equivalent to having zero in the CI?

        • Christian:

          It’s a good question, so I’ll have a go.

          Andrew said, “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”

          Implicit here is that we are talking about changes in lifespan of a politician.

          Say that instead of a standard error of 2.4 years, it was 2.4 months, and the effect size was also 2.4 months. (I know, implausible, but just say.) We probably wouldn’t say “consistent with noise”. We’d probably say instead something like “a fairly precisely estimated effect that is small or zero”.

          So an answer to your question:

          “in order to convince a first year student that your numbers mean “it’s consistent with noise”, what can you do that isn’t equivalent to having zero in the CI?”

          could be to say “Is the SE big or small in an absolute sense?”

        • Christian:

          We discuss hypothesis testing in chapter 4. We’re not entirely negative about it—we have a long example in which we perform a hypothesis test that is not rejected by the data, from which we draw the conclusion that the data are consistent with a simple random model. But I still think it’s a good idea to forget about statistical significance, forget about p-values, and forget about whether your confidence intervals exclude zero. You can learn from simple models without making go/no-go decisions based on p-values.

        • I’m not quite sure what you’re saying. Are you saying that the argument that I cited and the one in Chapter 4 do not really involve statistical significance, p-values, or checking whether a confidence interval includes zero? Or are you saying that people should forget these things despite your using them? If it’s the latter, why do you use them then?

          When you do things and interpret them I normally agree with you, and I also agree that there’s lots of misuse of hypothesis tests. But they have their place (as you seem to acknowledge) so your general negative statements against them read like propaganda, and that’s unnecessary.

        • Shouldn’t we be able to send differentiated messages rather than “forget, forget, forget”?

        • My observation is that the large majority of folks were taught the incorrect NHST paradigm where you ‘win’ and ‘there is an effect/something meaningful going on’ with p < 0.05. Making carefully guarded statements like ‘consistent with noise’ is much different, although it takes unpacking to avoid being confused or reinterpreted in the Bad NHST fashion. Easier to just sidestep the whole thing most of the time!

        • Do you seriously believe that the same people who misinterpret NHST all the time now would suddenly know what they’re doing and come to reliable conclusions if they all went Bayesian, or adopted whatever good idea is in Andrew’s book, in the same unreflected quick-and-dirty way in which they use NHST now?

  8. Hi Andrew,
    Congratulations on your book.

    Do you know if the publisher plans to sell a PDF version of the book? I like to follow the examples with an R session open right next to the book.

    Fred

  9. Hi Andrew,
    I talked to the publisher about using this in my class next fall, and they were concerned it would not actually be out until September. I am so glad they are hopefully getting it out this month. As of this morning, July 10, https://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/regression-and-other-stories?format=PB still says September. Does anyone know about this? I have asked the publisher, but they have yet to get back to me. I will ask again.

    Laura

  10. Andrew,

    Congratulations on getting another book out! Thanks for making the code and data available. As you say, the website Aki set up is nicely laid out, and that is very helpful.

    BUT, as I write this, there is a link underneath the box to your next post, “Ugly code is buggy code,” and that says it all. The code I have looked at from the book is ugly. For your analysis of the distribution of last names in Chapter 2 you do all sorts of strange things, but, worst of all, you include a detailed function called discrete.histogram. It produces bar charts, not histograms, and they are not pretty. Presumably there is some strange reason why the bars for girls are narrower than those for boys and why the bar for the letter ‘a’ for girls in 1900 looks as if it is part of the vertical axis at first glance. Why do you need your own complicated function?! Doubtless you do not want to use the tidyverse, because then you could produce everything in a few lines and it would look so good so easily. Students might get the false impression that graphics are simple. I’m sure you could do everything (almost) as well in base R, so why don’t you?

    I have only looked closely at one bit of code and probably there is some excellent stuff you are really proud of elsewhere. Maybe you could give stars to the best stuff like Stella Gibbons did in her novel “Cold Comfort Farm” (which is well worth reading, if you don’t know it already). Where should I look?

    You offer plenty of sound advice about graphics, and I like the quick rule in your final chapter that “any graph you show, be prepared to explain” (although I prefer the maxim “No graphic left undiscussed”).

    • Antony:

      I appreciate the frank comments! Regarding our analysis of the distribution of last names in Chapter 2:

      – We’re looking at first names, not last names.

      – I’m not sure what you mean by saying that we “do all sorts of strange things.” We did some data exploration! I’ve done a lot more with these data, most of which I’ve never published, but here I just wanted to include a few graphs to get a sense of what could be learned from some simple plots.

      – I think our bar charts are pretty! But I respect that your view is different. At different times, I’ve plotted these data in different ways. I think that once I tried it with slightly wider bars, and maybe that did look better. I’m not quite sure why I picked these particular versions of the graphs to include in the book; it was probably just because they were easy for me to find on my computer.

      – You write, “the bars for girls are narrower than those for boys.” Huh? We have no bar graphs for girls’ names in the book! Oh . . . I see, you might be referring to some of the code on our website. On that webpage we have all the code for making the graphs in the book, but we also included some markdown files. Some of the code in the markdown files does not correspond to the graphs in the book. It would be safer to just go to the .R files in the directories.

      – I’m proud of the whole book, including chapter 2 and including the graphs of names. We were very careful to make the book clear, and we were also careful to include code to reproduce what’s in the book. We did not put in the effort to go back and clean up all the markdown files. They’re there for convenience but they’re not the primary material.

    • I’m in the same boat. I pre-ordered the paperback version on Amazon weeks ago, which advertised an availability date of 7/23/20. When I checked my order today, the status was “temporarily out of stock”, with an undetermined fulfillment date. The customer service rep kept trying to get me to buy the Kindle version instead. There was a third-party seller on Amazon with an expected delivery date of 8/5. And Barnes & Noble seems to be in the same window (availability date of 7/31).

  11. I would like to register a complaint with the proprietor of this blog. Earlier this evening (or yesterday for some of you) I received the following email from Amazon:

    “Hello,
    We’re encountering a delay in shipping your order. We’ll make every effort to get the delayed item to you as soon as possible. If you still want this item, please confirm below. We apologize for the delay.

    Details
    Order #113-*******-*******
    Placed on Tuesday, May 19, 2020
    New estimated delivery date:
    We will email you as soon as we have a delivery date.

    I still want this item

    Cancel Order

    If we don’t hear from you by Saturday, August 22, 2020, we will cancel the item. Otherwise, we will send it to you when we have a delivery date and it’s ready to ship.

    Regression and Other… Regression and Other Stories (Analyt…
    Sold by Amazon.com Services LLC”

    • Behold the power of Andrew:

      Just received from Amazon:

      “Hello,

      We have an updated delivery estimate for your Amazon order.

      New estimated delivery date:
      Saturday, August 1, 2020
      Previous estimated delivery date:
      Tuesday, August 25, 2020 – Thursday, September 24, 2020”

  12. I spoke to the sales team at Cambridge University Press. They indicated that the initial run of print copies was 80 (!) books in the UK, so there were none set to be shipped to the US market. They anticipate availability in September. They recommended I consider the ebook version.

  13. I preordered way back, and just received an Amazon email that the “New estimated delivery date: Tuesday, August 25, 2020 – Thursday, September 24, 2020”

    Thinking about switching to the kindle edition, but slightly reluctant due to having seen the typesetting in other technical books on kindle.

    • It looks like I’ll be getting it this week! From Amazon:
      “New estimated delivery date: Saturday, August 1, 2020 – Sunday, August 2, 2020
      Previous estimated delivery date: Tuesday, August 25, 2020 – Thursday, September 24, 2020”

    • Anon:

      I do think they screwed up, but I guess anybody can screw up. In retrospect we should’ve checked earlier that their production line was working, but (a) we were focused on finishing the book (Aki and I were working nearly full time on that for a while); (b) we’d never had that sort of problem before with any of our books (Cambridge University Press included); and (c) coronavirus.

  14. I have just received my copy. A great book that I think will become a classic. Lots of important stuff explained that is missing from other books, and modern views expressed on hypothesis testing etc.

  15. I’ve been slowly digging into the paperback version. The price is great, but I wish the print quality and typesetting were better: the paper is quite thin, and more importantly, the ink that was used isn’t as crisp as I would like (certainly not as clear as in, e.g., the books published by CRC Press), which results in squinting :/

    (Also, I honestly think it would have been more readable if the page / text width weren’t as wide.)

    • (And the font size could be a teensy bit bigger, and the space between lines could be a tad larger — I compared the paperback to my hardcover version of McElreath’s book, and the text there seems bigger, or at least more readable and less squished together.)

  16. Just got my hands on a paperback copy of this book, and while I’m very happy with the content, I’ve noticed one slightly annoying thing, which is more to do with how the book was printed/bound by the publisher. The side notes (in the inner column closest to the spine) printed on the right-hand side of the book are not always fully visible. For example, see page 11 from the first chapter (https://statmodeling.stat.columbia.edu/wp-content/uploads/2020/07/raos_overview.pdf), where the side note text is cut off and difficult to read:

    “Example:
    Radon,
    smoking,
    and lung
    cancer”

    I’ve raised the issue with Cambridge, and hopefully this was just a one-off issue and not present in all copies.

  17. I got my copy today from Amazon UK (ordered on Wednesday 29, dispatched a couple of days later). I’ve noticed some minor layout issues [*] but it’s very nice overall (and it’s offered at an even nicer price!).

    Having access to a few chapters on kindle format was quite convenient and left me eager for more (but I guess I would have bought it later just the same).

    Congratulations.

    [*] The margin at the bottom seems a bit too narrow. The entry labels in the margins would have been better in the right margin on odd-numbered pages. The headings of chapters 19 and 20 are all caps (not small caps like the others).

  18. The book is stunning! I am adopting it for the PhD and advanced graduate classes I’ll be teaching in the Fall. Any plans to release accompanying teaching materials? (beyond what Aki has already shared on GitHub)

  19. Wonderful book. I preordered on Amazon, and it arrived August 1. Not much of a wait, and definitely worth waiting for.

    I downloaded the code from the web site. Very useful, and there are at least a few goodies that did not make it into the printed book. (Or I haven’t found them yet.)

    One error in the code caught me for a moment, though; others might have the same issue. In at least some of the .Rmd files there is this code:

    library("rprojroot")
    root <- has_dirname("ROS-Examples")$make_fix_file()

    However, the actual upstream directory is "ROS-Examples-master", and changing the above line accordingly makes things work.
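
    That is, for anyone who downloaded the repository as a zip from GitHub (which appends "-master" to the directory name), the one-line fix described above is:

    root <- has_dirname("ROS-Examples-master")$make_fix_file()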

    Actually I may owe thanks for the error, because it forced me to look at the code carefully. I had not been aware of the rprojroot package, and already have put it to good use elsewhere.

    Bill

  20. Regarding (s_1^2 + s_2^2)^(1/2): “The mathematics of this expression implies that it will typically be most effective to reduce the *larger* of these two quantities.”

    That’s because you picked a superlinear measure of “effectiveness”, right? If you picked abs(s_1) + abs(s_2) instead then it wouldn’t be any more effective. Why do we use the “sum of squares” here instead of some other “effectiveness”?
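
    A quick numeric illustration of the contrast (my numbers, taking s_1 = 3 and s_2 = 1):

    rss <- function(s1, s2) sqrt(s1^2 + s2^2)
    rss(3, 1)   # 3.16: baseline
    rss(2, 1)   # 2.24: a one-unit cut to the larger term buys about 0.93
    rss(3, 0)   # 3.00: even zeroing out the smaller term buys only about 0.16
    # Under abs(s1) + abs(s2), either one-unit cut buys exactly 1, so the
    # "reduce the larger term" advice does come from the squaring.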
