
Regression and Other Stories is available!

This will be, without a doubt, the most fun you’ll have ever had reading a statistics book. Also I think you’ll learn a few things reading it. I know that we learned a lot writing it.

Regression and Other Stories started out as the first half of Data Analysis Using Regression and Multilevel/Hierarchical Models, but then we added a lot more and we ended up rewriting and rearranging just about all of what we had before. So this is basically an entirely new book. Lots has happened since 2007, so there was much new to be said. Jennifer and Aki are great collaborators. And we put lots of effort into every example.

Here’s the Table of Contents.

The chapter titles in the book are descriptive. Here are more dramatic titles intended to evoke some of the surprise you should feel when working through this material:

• Part 1:
– Chapter 1: Prediction as a unifying theme in statistics and causal inference.
– Chapter 2: Data collection and visualization are important.
– Chapter 3: Here’s the math you actually need to know.
– Chapter 4: Time to unlearn what you thought you knew about statistics.
– Chapter 5: You don’t understand your model until you can simulate from it.

• Part 2:
– Chapter 6: Let’s think deeply about regression.
– Chapter 7: You can’t just do regression, you have to understand regression.
– Chapter 8: Least squares and all that.
– Chapter 9: Let’s be clear about our uncertainty and about our prior knowledge.
– Chapter 10: You don’t just fit models, you build models.
– Chapter 11: Can you convince me to trust your model?
– Chapter 12: Only fools work on the raw scale.

• Part 3:
– Chapter 13: Modeling probabilities.
– Chapter 14: Logistic regression pro tips.
– Chapter 15: Building models from the inside out.

• Part 4:
– Chapter 16: To understand the past, you must first know the future.
– Chapter 17: Enough about your data. Tell me about the population.

• Part 5:
– Chapter 18: How can flipping a coin help you estimate causal effects?
– Chapter 19: Using correlation and assumptions to infer causation.
– Chapter 20: Causal inference is just a kind of prediction.
– Chapter 21: More assumptions, more problems.

• Part 6:
– Chapter 22: Who’s got next?

• Appendixes:
– Appendix A: R quick start.
– Appendix B: These are our favorite workflow tips; what are yours?

Here’s the preface, which among other things gives some suggestions of how to use this book as a text for a course, and here’s the first chapter.

The book concludes with a list of 10 quick tips to improve your regression modeling. Here’s the chapter, and these are the tips:

– 1. Think about variation and replication.

– 2. Forget about statistical significance.

– 3. Graph the relevant and not the irrelevant.

– 4. Interpret regression coefficients as comparisons.

– 5. Understand statistical methods using fake-data simulation.

– 6. Fit many models.

– 7. Set up a computational workflow.

– 8. Use transformations.

– 9. Do causal inference in a targeted way, not as a byproduct of a large regression.

– 10. Learn methods through live examples.
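Tip 5 can be made concrete with a small sketch. The book itself works in R; the Python below is only an illustration, and the function names `simulate_fake_data` and `fit_line`, the parameter values, and the uniform range for x are all assumptions made for this example. The idea is to simulate data from a model with known parameters, fit the model, and check whether the fit recovers what you put in.

```python
import random

def simulate_fake_data(a, b, sigma, n, seed=0):
    """Draw n points from y = a + b*x + Normal(0, sigma) noise (hypothetical setup)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Known "true" parameters go in; the estimates should land close to them.
xs, ys = simulate_fake_data(a=1.0, b=2.0, sigma=1.0, n=1000)
a_hat, b_hat = fit_line(xs, ys)
print(f"true a=1.0, b=2.0; estimated a={a_hat:.2f}, b={b_hat:.2f}")
```

If the estimates were far from the true values, that would point to a problem in the fitting code or the model, which is the purpose of the exercise.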

And here’s the index.

You can order the book here. Enjoy.

P.S. I saved the best for last. All the data and code for the book are on this beautiful Github page that Aki put together. You can run and modify all the examples!

83 Comments

  1. Roy says:

    I like e-books for easy reference, but I have bought a number that are statistical or mathematical in nature, and they are nearly worthless because of the way the equations are done: they are usually small and unreadable, and they can be “opened” to a larger view, but then you can’t see the context. Is there any way to get a sample of what the equations look like in the Kindle edition? Otherwise I am tempted by the paperback; it is priced nicely.

    And congrats on getting this out.

    • Chris Prosser says:

      I have the same concern. I’m hoping that the Kindle sample available on the 23rd will help me make a decision one way or the other. I’m very much looking forward to the book.

    • Joe says:

      The Kindle edition is out. It’s a print replica, so the equations are better than in other e-formats I’ve seen. Several chapters are available as previews so you can get a good sense of what it’s like.

  2. Anonymous says:

    Thank you so much for this work! I’ve spent the past few weeks constantly checking the blog for an update about the book. Looking forward to learning.

  3. Zhou Fang says:

    The subtitle of the book is Analytical Methods for Social Research. How specific is the book to social sciences?

  4. anon e mouse says:

    The paperback is extremely affordable for a stats text, very pleased to see that. I wish BDA and McElreath’s Statistical Rethinking could be had anywhere nearly as cheaply!

  5. Bruce McCullough says:

    At long last! And just in time for my August trip to the beach.

    Actually, you can’t order it yet — Amazon says it will be released on 23 July. But you can pre-order it, which I just did.

    By the way, this book will make a great Christmas or birthday present for _anyone_, regardless of whether that person knows statistics or has any interest in statistics. So be sure to buy several copies.

  6. Paolo Inglese says:

    Looking forward to getting my copy on the 23rd!!

  7. D Kane says:

    Web version freely available? There was talk of this previously.

  8. RobMcD says:

    Looking forward to this, well done :-)

  9. Hernan Bruno says:

    Thank you, and your co-authors. I will order it during the Summer.

  10. Doug Davidson says:

    Thank you! Just in time.

  11. Jonathan says:

    i love ‘to understand the past, you must first know the future’.

  12. dl says:

    it’s between this and Curb Season 10 (release date 7/21)

  13. Nedson says:

    Thank you for this book. I pre-ordered it many months ago. Finally, I can count just days to have it on my table!

    Andrew:
    When is the latest second half of Data Analysis Using Regression and Multilevel/Hierarchical Models expected?
    Thanks.

  14. Oliver C. Schultheiss says:

    Hi Andrew, Jennifer & Aki,

    congratulations on publishing this book. The chapter titles sound so fun & juicy that this may well turn out to be the first book since Cohen & Cohen’s classic on “Applied correlation/regression analysis” that I will read from cover to cover. I already placed an order. If it shows up on the NYT non-fiction bestseller list, you’ll know you’ve hit a nerve. If it shows up on the NYT fiction bestseller list, though, you’ll know something went wrong in a big way…

    Cheers!
    Oliver

    • Oliver C. Schultheiss says:

      …let me qualify that “first book since” bit — it should have been “the first book on REGRESSION since…”! Because I have read two or three other books…

      –ocs

  15. Keith O'Rourke says:

    It is really nice to see simulation being emphasized as important in understanding.

  16. Stephen Olivier says:

    I’m so excited about this book! The original had a profound influence on my direction in academia

  17. Christian Hennig says:

    Hi Andrew,
    by and large this looks like a really cool book. I particularly like the recommendation to simulate fake-data, but I’ve already found other things to like.
    However I still don’t agree with your negativity about testing, and it seems you don’t agree with yourself on that.
    Your earlier posting about regression discontinuities says this:
    “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”
    I’m fine with that, but this is just a significance test worded differently. The Appendix B in the book says there are no true zeroes and we’re not interested in them. But the argument you’re making here is that a true zero model is compatible with the data. And it’s a good argument, because you’re not claiming that the effect really is zero, but rather that the data cannot tell apart whatever goes on from zero. Which is good to know. But this could be acknowledged in your book as well!

    • Andrew says:

      Christian:

      What particular part of the book are you disagreeing with?

      • Christian Hennig says:

        I’m just talking about the linked Appendix B, 10 quick tips, because that’s all I know yet. B.2 says “Forget about statistical significance. Forget about p-values, and forget about whether your confidence intervals exclude zero.” The point you made in your discussion of the regression discontinuity thing was exactly that the confidence interval for the effect includes zero (“is consistent with noise”). So there are good reasons to be interested in whether that’s the case.

        • Anoneuoid says:

          Are you saying andrew wrote somewhere that:

          Confidence interval for the effect includes zero == Consistent with noise?

          • Christian Hennig says:

            That’s pretty much what the text from regression discontinuities amounts to, isn’t it?

            • Anoneuoid says:

              To me “consistent with noise” requires knowing how the data was collected, what would be a practical deviation from the prediction, etc. You need to know the order of magnitude you are dealing with when it comes to sources of systematic error, whether the magnitude of deviation has any practical or theoretical importance, etc.

              The test of whether it is consistent with zero is irrelevant.

              • Christian Hennig says:

                What I’m claiming is that Andrew tested that (using wording that made it less obvious), in the specific example, and that I’m fine with the conclusion he got from that. Did you read it? Do you think there’s anything wrong with what he did there and how he interpreted it? (In that case maybe Andrew better defends himself…)

              • Anoneuoid says:

                I’d need to see the context of the quote, but it does seem like he got lazy there. Maybe there are assumptions made that were not written down? It’s easy to fall back on NHST! He could answer you better than me about his intent, of course.

              • “Consistent with noise” means “we can’t prove using this test that a specific RNG wouldn’t tend to produce this kind of data”.

                If you have some data x, and you do a t-test that x has mean 0 and get p = 0.21, then this is “consistent with [our model of] noise” in the sense that the pure noise model might produce data of this type frequently enough. If you get p = 0.0002, then this is “not consistent with [our model of] noise” in the sense that the proposed noise would be extremely unlikely to produce such data.

                Most people go wrong when they jump to unwarranted conclusions: “The signal is not consistent with noise, therefore power pose is true,” or “The signal is not consistent with noise, therefore winning an election adds 5-10 years to your life,” etc.

                It’s much more likely that “the signal is not consistent with our model of noise, but it’s consistent with any number of other models for uninteresting stuff… what we really need is to make fairly specific predictions, and then see those come more or less true…”
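The “specific RNG” reading of “consistent with noise” can be sketched directly by simulation. This is a rough illustration, not the commenter’s code: the function name `noise_model_p_value`, the Normal(0, 1) noise model with known scale, and the sample size of 25 are all assumptions chosen for the example (a t-test would estimate the scale from the data instead). The function counts how often pure-noise datasets produce a sample mean at least as far from zero as the observed one.

```python
import random

def noise_model_p_value(observed_mean, n, sigma=1.0, sims=10_000, seed=1):
    """Two-sided Monte Carlo p-value: the share of pure-noise datasets whose
    mean is at least as far from 0 as the observed mean (assumed noise model)."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(sims):
        sim_mean = sum(rng.gauss(0, sigma) for _ in range(n)) / n
        if abs(sim_mean) >= abs(observed_mean):
            extreme += 1
    return extreme / sims

# A small observed mean is something the noise model produces often;
# a large one is something it essentially never produces.
p_small_shift = noise_model_p_value(observed_mean=0.1, n=25)
p_large_shift = noise_model_p_value(observed_mean=1.0, n=25)
print(p_small_shift, p_large_shift)
```

A small value here says only that this particular noise model rarely generates such data; as the comment above stresses, it says nothing by itself about which alternative explanation is right.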

              • Christian Hennig says:

                Well, if the data are consistent with such an RNG, it’s pretty clear that they do not allow strong conclusions in the opposite direction, which is a good thing to know, no? And I take it that Andrew agrees, because this seems to be his argument in that posting.

                Of course there are lots of misinterpretations of tests, but this whole idea that because people don’t properly understand and misuse one approach they should learn another one that is surely at least as subtle (don’t get me started on interpretations of prior distributions, where they come from, and what many people make of them) is a mystery to me. If people want to jump to quick conclusions and not want to properly think through the foundations, they will misuse whatever approach to get them.

              • Anoneuoid says:

                “Well, if the data are consistent with such an RNG, it’s pretty clear that they do not allow strong conclusions in the opposite direction, which is a good thing to know, no?”

                That just means you didn’t spend enough money, so the sample size is too low and/or measurements are too imprecise. So when you use NHST, “paradoxically” the better the study the less evidence is provided for the research hypothesis.

                Meehl 1967 covers this well: http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/074theorytestingparadox.pdf

              • Or in some cases, you don’t have a very specific research hypothesis… if you sample the position of a chaotic oscillator you might be unable to reject the idea that it’s a Normal(0,1) random number generator… But if you have a model for its dynamics you might be able to predict every single data point to within 3 significant figures using an ODE…

              • Christian Hennig says:

                Guys, you don’t need to teach me about tests. Andrew applied it to a specific situation where the researchers spent what they spent and did what they did, and there it makes sense. And there are many situations like this.

              • Anoneuoid says:

                I’m not defending Andrew’s statement. Can you link to the context?

              • Christian Hennig says:

                You have replied to postings that I wrote about it for some time now. It’s not that difficult to find really. Just scroll a few of his entries down.

              • Anoneuoid says:

                Just easy to post a link: https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/

                Yeah, like I’ve said on here before, it’s not useful to interpret the coefficients of arbitrary statistical models to begin with. He shows that by playing with the model specification later.

                You can also add other variables like wealth, age of spouse, preexisting health conditions, etc and that will change it too.

                So trying to interpret these arbitrary numbers is an even more egregious error to begin with. Then people do NHST on top of that to make it even more nonsensical.

              • Christian Hennig says:

                It seems now you’re commenting on what the people did that Andrew cited; the test I was referring to was the one Andrew did to show that it’s compatible with an “all noise” H0. (As I mentioned, Andrew didn’t present it as a test, but what he wrote is equivalent to running a test.)

              • Christian Hennig says:

                PS: Not “all noise” of course, but the age effect claimed by the authors.

              • Anoneuoid says:

                Yes, it seems he did some NHST there. I wouldn’t be looking at these tables of regression coefficients to begin with.

                I’d split the data into train, validation, and test sets by time and pick a model based on predictive skill and computational resources required. Then report the skill on the final test dataset that was only run once. Next you have to wait for future data to come in, or maybe see how well it performs for other countries.
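A time-ordered split of the kind described above might look like the following. This is a minimal sketch under stated assumptions: the function name `split_by_time`, the 60/20/20 proportions, and the toy `(year, label)` records are hypothetical and are not taken from the comment or from the analysis under discussion.

```python
def split_by_time(records, train_frac=0.6, val_frac=0.2):
    """Sort (timestamp, value) records by time and cut them into
    train / validation / test blocks, oldest first."""
    ordered = sorted(records, key=lambda rec: rec[0])
    n_train = round(len(ordered) * train_frac)
    n_val = round(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # most recent block, evaluated only once
    return train, val, test

# Toy records, deliberately given out of order to show that sorting matters.
records = [(year, f"obs-{year}") for year in range(2019, 2009, -1)]
train, val, test = split_by_time(records)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps every validation and test observation later than the training data, so the reported skill reflects forecasting rather than interpolation.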

            • Andrew says:

              Christian:

              Huh? We do have a section in the book on regression discontinuity analysis, but we don’t ever say that Confidence interval for the effect includes zero == Consistent with noise. At least I don’t think we do!

              • Carlos Ungil says:

                I think his reply to Anoneuoid is again a reference to the post from July 2nd where you wrote:

                “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”

              • Christian Hennig says:

                Yes.

              • Andrew says:

                Christian:

                In that post I was talking about uncertainty in the parameter estimate. I was not checking whether a confidence interval includes zero. I do think that null hypothesis significance testing can be valuable, especially in cases where you can’t reject the null hypothesis, but I don’t think it’s so useful to do this using p-values or by looking at the endpoints of confidence intervals.

              • Christian Hennig says:

                “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”
                How do you find out that the effect 2.4 se 2.4 means that it’s consistent with noise other than checking whether 0 is in the CI?

              • Christian Hennig says:

                I mean, you are a pro, so you don’t need to compute the CI to actually know it, but in order to convince a first year student that your numbers mean “it’s consistent with noise”, what can you do that isn’t equivalent to having zero in the CI?

              • Christian:

                It’s a good question, so I’ll have a go.

                Andrew said, “The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.”

                Implicit here is that we are talking about changes in lifespan of a politician.

                Say that instead of a standard error of 2.4 years, it was 2.4 months, and the effect size was also 2.4 months. (I know, implausible, but just say.) We probably wouldn’t say “consistent with noise”. We’d probably say instead something like “a fairly precisely estimated effect that is small or zero”.

                So an answer to your question:

                “in order to convince a first year student that your numbers mean “it’s consistent with noise”, what can you do that isn’t equivalent to having zero in the CI?”

                could be to say “Is the SE big or small in an absolute sense?”
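The two readings being debated can be put side by side with a little arithmetic. The sketch below assumes a normal approximation for the estimate, which the thread does not state; the function name `summarize` and the conversion of 2.4 months to years are choices made for this illustration only.

```python
from statistics import NormalDist

def summarize(estimate, se, level=0.95):
    """z-score, two-sided p-value and interval under an assumed normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    z = estimate / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"z": z, "p": p_two_sided, "interval": (lower, upper), "width": upper - lower}

# Estimate and standard error both 2.4 years, as in the quoted sentence.
years = summarize(2.4, 2.4)
# The hypothetical from the comment above: both are 2.4 months, expressed in years.
months = summarize(2.4 / 12, 2.4 / 12)
print(years)
print(months)
```

Both cases have the same z-score of 1 and the same p-value of about 0.32, and both 95% intervals contain zero. What differs is the absolute width: roughly -2.3 to 7.1 years in the first case versus well under one year in the second, which is the distinction the “is the SE big or small in an absolute sense?” question points at.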

        • Andrew says:

          Christian:

          We discuss hypothesis testing in chapter 4. We’re not entirely negative about it: we have a long example in which we perform a hypothesis test where the null hypothesis is not rejected by the data, from which we draw the conclusion that the data are consistent with a simple random model. But I still think it’s a good idea to forget about statistical significance, forget about p-values, and forget about whether your confidence intervals exclude zero. You can learn from simple models without making go / no-go decisions based on p-values.

          • Christian Hennig says:

            I’m not quite sure what you’re saying. Are you saying that the argument that I cited and the one in Chapter 4 do not really involve statistical significance, p-values, or checking whether a confidence interval includes zero? Or are you saying that people should forget these things despite your using them? If it’s the latter, why do you use them then?

            When you do things and interpret them I normally agree with you, and I also agree that there’s lots of misuse of hypothesis tests. But they have their place (as you seem to acknowledge) so your general negative statements against them read like propaganda, and that’s unnecessary.

            • Christian Hennig says:

              Shouldn’t we be able to send differentiated messages rather than “forget, forget, forget”?

            • Chris Wilson says:

              My observation is that the large majority of folks were taught the incorrect NHST paradigm where you ‘win’ and ‘there is an effect/something meaningful going on’ with p < 0.05. Making carefully guarded statements like ‘consistent with noise’ is much different, although it takes unpacking to avoid being confused or reinterpreted in the Bad NHST fashion. Easier to just sidestep the whole thing most of the time!

              • Christian Hennig says:

                Do you seriously believe that the same people who misinterpret NHST all the time now would suddenly know what they’re doing and come to reliable conclusions if they all went Bayesian or adopted whatever good idea is in Andrew’s book in the same unreflected quick-and-dirty way in which they use NHST now?

  18. Andrew Hamilton says:

    Congratulations! I’ve been looking forward to getting this!

    One question: for those of us allergic to Amazon, can you confirm that we are able to order directly from Cambridge Press? It looks like they have a link but I’ve never gone through them before. Thanks!

    https://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/regression-and-other-stories?format=PB

  19. Leo Egidi says:

    This looks so cool!
    Congratulations Andrew

  20. Blake Shurtz says:

    ordering the kindle version b/c i’ll be reading it and working through the code side-by-side

  21. Fred says:

    Hi Andrew,
    congratulations on your book.

    Do you know if the publisher plans to sell a PDF version of the book? I like to follow the examples with an R session open right next to the book.

    Fred

  22. Winston says:

    Is there any plan for a PDF version? I’m very happy to pay and am fine with DRM version, I just really loathe the Kindle interface and like to have everything in iBooks.

  23. LauraRK says:

    Hi Andrew,
    I talked to the publisher about using this in my class next fall and they were concerned it would not actually be out until September. I am so glad they are hopefully getting it out this month. As of this morning, July 10, https://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/regression-and-other-stories?format=PB still says September. Does anyone know about this? I have asked the publisher, but they have yet to get back to me. I will ask again.

    Laura

  24. Antony says:

    Andrew,

    Congratulations on getting another book out! Thanks for making the code and data available. As you say, the website Aki set up is nicely laid out, and that is very helpful.

    BUT, as I write this, there is a link underneath the box to your next post “Ugly code is buggy code” and that says it all. The code I have looked at from the book is ugly. For your analysis of the distribution of last names in Chapter 2 you do all sorts of strange things, but, worst of all, you include a detailed function called discrete.histogram. It produces bar charts, not histograms, and they are not pretty. Presumably there is some strange reason why the bars for girls are narrower than those for boys and why the bar for the letter ‘a’ for girls in 1900 looks as if it is part of the vertical axis at first glance. Why do you need your own complicated function?! Doubtless you do not want to use the tidyverse, because then you could produce everything in a few lines and it would look so good so easily. Students might get the false impression that graphics are simple. I’m sure you could do everything (almost) as well in base R, so why don’t you?

    I have only looked closely at one bit of code and probably there is some excellent stuff you are really proud of elsewhere. Maybe you could give stars to the best stuff like Stella Gibbons did in her novel “Cold Comfort Farm” (which is well worth reading, if you don’t know it already). Where should I look?

    You offer plenty of sound advice about graphics and I like the quick rule in your final chapter that “any graph you show, be prepared to explain.” (although I prefer the maxim “No graphic left undiscussed.”)

    • Andrew says:

      Antony:

      I appreciate the frank comments! Regarding our analysis of the distribution of last names in Chapter 2:

      – We’re looking at first names, not last names.

      – I’m not sure what you mean by saying that we “do all sorts of strange things.” We did some data exploration! I’ve done a lot more with these data, most of which I’ve never published, but here I just wanted to include a few graphs to get a sense of what could be learned from some simple plots.

      – I think our bar charts are pretty! But I respect that your view is different. At different times, I’ve plotted these data in different ways. I think that once I tried it with slightly wider bars, and maybe that did look better. I’m not quite sure why I picked these particular versions of the graphs to include in the book; it was probably just because they were easy for me to find on my computer.

      – You write, “the bars for girls are narrower than those for boys.” Huh? We have no bar graphs for girls’ names in the book! Oh . . . I see, you might be referring to some of the code on our website. On that webpage we have all the code for making the graphs in the book, but then we also included some markdown files. Some of the code in the markdown files does not correspond to the graphs in the book. It would be safer to just go to the .R files in the directories.

      – I’m proud of the whole book, including chapter 2 and including the graphs of names. We were very careful to make the book clear, and we were also careful to include code to reproduce what’s in the book. We did not put in the effort to go back and clean up all the markdown files. They’re there for convenience but they’re not the primary material.

  25. Smithy says:

    Any news on when the printed versions will be released? I thought the publication date would be 23rd July (Amazon) but no luck I’m afraid.

    • Shane says:

      I’m in the same boat. I pre-ordered the paperback version on Amazon weeks ago, which advertised an availability date of 7/23/20. When I checked my order today, the status was “temporarily out of stock”, with an undetermined fulfillment date. The customer service rep kept trying to get me to buy the Kindle version, instead. There was a third party seller on Amazon with an expected delivery date of 8/5. And Barnes & Noble seems to be in the same window (availability date of 7/31).

  26. Thanatos Savehn says:

    I would like to register a complaint with the proprietor of this blog. Earlier this evening (or yesterday for some of you) I received the following email from Amazon:

    “Hello,
    We’re encountering a delay in shipping your order. We’ll make every effort to get the delayed item to you as soon as possible. If you still want this item, please confirm below. We apologize for the delay.

    Details
    Order #113-*******-*******
    Placed on Tuesday, May 19, 2020
    New estimated delivery date:
    We will email you as soon as we have a delivery date.

    I still want this item

    Cancel Order

    If we don’t hear from you by Saturday, August 22, 2020, we will cancel the item. Otherwise, we will send it to you when we have a delivery date and it’s ready to ship.

    Regression and Other… Regression and Other Stories (Analyt…
    Sold by Amazon.com Services LLC”

    • Thanatos Savehn says:

      Behold the power of Andrew:

      Just received from Amazon:

      “Hello,

      We have an updated delivery estimate for your Amazon order.

      New estimated delivery date:
      Saturday, August 1, 2020
      Previous estimated delivery date:
      Tuesday, August 25, 2020 – Thursday, September 24, 2020″

  27. Shane says:

    I spoke to the sales team at Cambridge University Press. They indicated that the initial run of print copies was 80 (!) books in the UK, so there were none set to be shipped to the US market. They anticipate availability in September. They recommended I consider the ebook version.

  28. Ying says:

    I preordered way back, and just received an Amazon email that the “New estimated delivery date: Tuesday, August 25, 2020 – Thursday, September 24, 2020”

    Thinking about switching to the kindle edition, but slightly reluctant due to having seen the typesetting in other technical books on kindle.

    • Ying says:

      It looks like I’ll be getting it this week! From Amazon:
      “New estimated delivery date: Saturday, August 1, 2020 – Sunday, August 2, 2020
      Previous estimated delivery date: Tuesday, August 25, 2020 – Thursday, September 24, 2020″

  29. Anon says:

    Andrew – maybe this snafu shows that you should not consider Cambridge University Press for future books?

    • Andrew says:

      Anon:

      I do think they screwed up, but I guess anybody can screw up. In retrospect we should’ve checked earlier that their production line was working, but (a) we were focused on finishing the book (Aki and I were working nearly full time on that for a while), (b) we’d never had that sort of problem before with any of our books (Cambridge University Press included), and (c) coronavirus.

  30. Wow. Looks like a great undertaking. The chapter headings are like WOW

  31. Blaise says:

    I have just received my copy. A great book, that I think will become a classic. Lots of important stuff explained that is missing from other books and modern views expressed on hypothesis testing etc.
