From comments to my recent 538 post, I’ve learned the following:
1. I don’t understand logarithms.
2. I really don’t understand logarithms.
3. I don’t know how to use Wikipedia.
4. I don’t know that “percent” means “divided by 100”.
5. I don’t know the difference between correlation and R-squared.
My first reaction is to respond in a snippy and sarcastic way, but when it comes to writing, the reader is (almost) always right: When someone misunderstands something I wrote, this tells me I was being unclear.
Points 1-5 above are all wrong, but they are all reasonable conclusions to be drawn from a casual reading of my post–and when writing a blog post, we certainly can’t demand more than a casual reading. On the other hand, someone familiar with my work (for example, a regular reader of my regular blog) would probably want more evidence before jumping to the conclusion that I don’t know about logarithms, correlation, R-squared, and so forth.
The real message for me is that, when communicating outside this blog and outside of technical venues (scientific journals and the like), I need to be redundant around possible points of confusion.
For example, in the above-linked post, I noted a nonlinear pattern in a graph. I could tell right away, just based on my background knowledge, that the logarithmic transformation, which was involved in some of the calculations, was not causing the nonlinearity. So I didn’t even bother mentioning it. This would be OK for general audiences who probably wouldn’t think about this issue anyway–the log transformation being essentially irrelevant here, and so I wouldn’t even bring it up. And it would be OK in class–if any student happened to focus on the log, I could redo the graph then and there, to demonstrate how it all worked. But on the blog, people can read it, think what they want, and jump to their own conclusions. If I don’t want to be misunderstood, I would need to put in extra sentences rounding the sharp corners, as it were, anticipating mistakes that the readers might make. It’s not clear it’s worth the effort–I’m doing this all for free, after all–but I think that’s what it takes.
Now for the details
In case people care . . .
1 and 2. The log transformation doesn’t have a big effect on the shape of that curve, because all the data fall in the range of a factor of 2 on income. This was an important enough point that I added a parenthetical note to the blog entry (“rounding the sharp corner”) and most of the later commenters seem to have gotten the point.
3. Yes, I looked at Wikipedia and several other sources, many of which were linked in the post. As I’d explained there, different sources gave different formulas, and it wasn’t clear exactly how these numbers were created. I had actually played around with the formula on the Wikipedia page but couldn’t find all the numbers, then I saw an official-looking document that had a much different formula. I’d thought the general point was clear in my blog entry, but I guess another sentence would’ve helped with this.
4. Percent means “divided by 100.” 0.86 = 86%, -0.10 = -10%, etc. Correlations can be anywhere between -100% and 100%. I guess if I really wanted to explain this, I could’ve said “the correlation is .86, or 86%,” although I have to admit that seems like overkill to me.
5. If the R-squared were 86%, then the correlation would be sqrt(.86) = .93. The numbers on that second graph are highly correlated, but no way do they have a correlation of .93. Decades of statistical analysis make this clear to me at a glance.
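For readers who want to check points 1, 2, and 5 for themselves, here is a minimal Python sketch. The income values, the toy data, and the function names are all made up for illustration; they are not the numbers from the original post. The first part measures how far log(x) strays from a straight line over a given range, and the second part checks that, in a regression with a single predictor, R-squared is the square of the correlation.

```python
import math

def max_gap_from_line(lo, hi, steps=10_000):
    """Largest gap between log(x) and the straight line joining its
    endpoints on [lo, hi], as a fraction of the total rise in log(x)."""
    rise = math.log(hi) - math.log(lo)
    worst = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        line = math.log(lo) + rise * (x - lo) / (hi - lo)
        worst = max(worst, abs(math.log(x) - line))
    return worst / rise

# Hypothetical income ranges: a factor of 2 versus a factor of 100.
narrow = max_gap_from_line(30_000, 60_000)
wide = max_gap_from_line(30_000, 3_000_000)

def pearson_r(xs, ys):
    """Sample correlation between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_squared(xs, ys):
    """R-squared of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Toy data, invented for this check only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.4, 3.6, 5.3, 5.8]
r = pearson_r(xs, ys)
r2 = r_squared(xs, ys)

# If R-squared were 0.86, the correlation would have to be sqrt(0.86).
implied_r = math.sqrt(0.86)

print(f"log vs. straight line, factor of 2:   {narrow:.3f}")
print(f"log vs. straight line, factor of 100: {wide:.3f}")
print(f"r^2 = {r ** 2:.4f}, R-squared = {r2:.4f}")
print(f"sqrt(0.86) = {implied_r:.2f}")
```

Over a factor-of-2 range the log curve stays within 10% of its total rise from a straight line, while over a factor of 100 the gap is more than 40%. That is why the transformation can't account for the nonlinearity in the graph.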
The funny thing about all the above mistaken criticisms is that they require knowledge; they’re not completely ignorant comments. For example, you have to know something about statistics and mathematics to know that you’re not supposed to write a correlation of .86 as 86%, or to have heard phrases such as “the regression explains…”, or to think about the log transformation.
It’s important to communicate to this middle range of people, who know enough rules to over-think things and get confused. For one thing, these people are already thinking about the problem, so the most difficult step has already been taken.
P.S. Yeah, sure, maybe I shouldn’t read the comments in the first place. But I’d like to improve my writing!
P.P.S. I came across another one, this time a commenter on Megan McArdle’s blog who helpfully added a link that he hadn’t noticed was already in my original post. As I’d explained there (and as he unfortunately also didn’t notice), that source presented much different numbers, with different rankings, from those used in the map that got the discussion started.
This last commenter made a mistake that I’ve been noticing a lot lately, which is to find a single piece of documentation on the web and assume it’s correct, without checking it against other numbers out there. I thought I’d made this point clear in my post, but perhaps in the future I should use bold font so people don’t miss the point. Or maybe I’m just too thin-skinned: with dozens of comments spread across three different blogs, I shouldn’t be so upset that one person made this mistake.