“99.60% for women and 99.58% for men, P < 0.05.”

Gur Huberman pointed me to this paper by Tamar Kricheli-Katz and Tali Regev, “How many cents on the dollar? Women and men in product markets.” It appeared in something called Science Advances, which seems to be some extension of the Science brand, i.e., it’s in the tabloids!

I’ll leave the critical analysis of this paper to the readers. Just one hint: Their information on bids and prices comes from an observational study and an experiment. The observational study comes from real transactions and N is essentially infinity, so the problem is not with p-values or statistical significance; but the garden of forking paths still comes into play, as there is still the selection of which among many possible comparisons to present, and the selection of which among many possible regressions to run. There are also lots of concerns about causal identification, given that they’re drawing conclusions about different numbers of bids and different average prices for products sold by men and women, while they also report that men and women are using different selling strategies. The experiment is N=116 people on Mechanical Turk, so there we have the usual concerns about the interpretation of small nonrepresentative samples.

The paper has many (inadvertently) funny lines; my favorite is this one:

[Screenshot from the paper: “Likewise, women had a slightly higher percentage of transactions for which positive feedback had been given in the year preceding the current transaction (99.60% for women and 99.58% for men, P < 0.05).”]

I do not, however, believe this sort of research is useless. To the extent that you’re interested in studying behavior in online auctions—and this is a real industry, it’s worth some study—it seems like a very sensible plan to gather a lot of data and look at differences between different groups. No need to just compare men and women; you could also compare sellers by age, by educational background, by goals in being on eBay, and so forth. It’s all good. And, for that matter, it seems reasonable to start by highlighting differences between the sexes—that might get some attention, and there’s nothing wrong with wanting a bit of attention for your research. It should be possible to present the relevant comparisons in the data in some sort of large grid rather than following the playbook of picking out statistically significant comparisons.

P.S. Some online hype here.

31 thoughts on ““99.60% for women and 99.58% for men, P < 0.05.””

  1. “Likewise, women had a slightly higher percentage of transactions for which positive feedback had been given in the year preceding the current transaction (99.60% for women and 99.58% for men, P < 0.05).”

    This is a nice find. Presumably this sentence would not be in the paper if p wasn’t less than 0.05.

    From table 1 there were n=148,017 female transactions, and 0.004*n=592.068. The exact number depends on rounding (e.g., it could be as high as 599, since 100*(1-599/148017)=99.5953% still rounds to 99.60%), but approximately 592 female transactions did not meet that feedback criterion. On the other hand, 0.0042*n=621.6714, or about 622 transactions (as low as 615 depending on rounding), would have been required to have the same percentage as the men (99.58%). So we are talking about a most likely difference of 30 transactions to get equality between women and men, while the uncertainty due to rounding error is half that size (16 transactions). And that is the difference needed for full equality, not just to push the p-value past the threshold!
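
    A quick back-of-the-envelope check of those counts in R (assuming, as above, that the table 1 percentages are proportions of transactions):

    > 148017 * (1 - 0.9960)   # female transactions without positive feedback
    [1] 592.068
    > 148017 * (1 - 0.9958)   # what that count would be at the men's rate
    [1] 621.6714
    > 148017 * (0.9960 - 0.9958)   # the gap: about 30 transactions
    [1] 29.6034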

    Sure, this isn’t the headline finding of the paper; the point is the nonsensical nature of NHST that it highlights. The magical 0.05 level (chosen arbitrarily to begin with!) turns this totally meaningless difference, the size of a rounding error, into a notable difference.

    • I don’t think that description is correct. There were 148017 auctions by women. In each case there is a variable “percent positive feedback” (calculated over the previous year for the seller). The mean is 99.60 (and the standard deviation 2.35). For male sellers, the mean of the “percent positive feedback” variable for the 483499 auctions is 99.58 (sd 2.41).

      • I was basing my interpretation off what they wrote: “percentage of transactions for which positive feedback had been given”.

        Maybe that is wrong, but in table S1 it says there were 631,516 total “Auction” *transactions*. From table 1, the sample sizes for female and male sum to 148,017+483,499=631,516. So these seem to be transactions, not people as you are thinking.

        I agree that this standard deviation doesn’t make much sense then. I didn’t notice that. Also, those p-values seem to be wrong if I am correct (or maybe I am misusing R’s prop.test function?):

        > # 2x2 table of counts: positive vs. non-positive feedback, women (row 1) and men (row 2)
        > dat=matrix(nrow=2,ncol=2)
        > dat[1,]=c(148017-592, 592)
        > dat[2,]=c(483499-2030, 2030)
        > ptest=prop.test(dat)
        > ptest$estimate
           prop 1    prop 2
        0.9960005 0.9958014
        > ptest$p.value
        [1] 0.3082836

        • > I was basing my interpretation off what they wrote: “percentage of transactions for which positive feedback had been given”

          Maybe you should have read the full sentence: “… in the year preceding the current transaction”

        • I don’t follow you. I did read that part; you can see it quoted in my original post… but I left it out of the quote in the next post because it seemed irrelevant to the point I was making there. How does considering that additional phrase make it seem they are not describing proportions?

        • You wrote “there were n=148,017 female transactions.” I wrote: “There were 148017 auctions by women.” I don’t understand where you see the problem with that (note that I never said that there were 148017 different women).

          Now, for each transaction they collect a number of variables relative to the auction, the seller, and the buyer. Including one variable that is the “percentage of (seller’s) transactions for which positive feedback had been given in the year preceding the current transaction.” I don’t see how this can be interpreted differently.

        • Yes, I think I understand now.

          You interpret it as (e.g., for females):
          “(average) percentage of (*female* seller’s) transactions for which positive feedback had been given in the year preceding the current transaction.”

          I interpreted it as:
          “percentage of transactions (by *female* sellers) for which positive feedback had been given in the year preceding the current transaction.”

          I considered your interpretation at first, but in the same table the sample size they give is by transaction (because it matches the one in table S1 that is explicitly labeled transactions). However, my interpretation does disagree with the fact that the column headers are mean and sd. On the other hand, the sample size (n) row is also listed under “mean” when it is clearly not an average, which is why I didn’t take the headers seriously.

          So those “percent positive feedback” values refer to averages over people while the sample size refers to transactions…How many male/female sellers were there?

          p.s.
          I still don’t see what that had to do with the “in the year preceding…” phrase.

        • I’m not sure if I made myself clear. I interpret it, for each transaction, as
          “the (exact) percentage of transactions (where the seller is the same as in the current transaction) for which positive feedback had been given in the year preceding the current transaction.”

          To give a detailed example (which maybe doesn’t work exactly as in the paper, but I hope it captures the spirit), imagine that there are five “female transactions” (i.e., transactions where the seller is a woman). The sellers are Alice, Barbara, Carol, Alice, Barbara. The values of the variable “percent positive feedback” are:
          Alice: 100% (for each transaction where she was the seller in the past year, she got positive feedback)
          Barbara: 98% (maybe out of 50 transactions she got negative feedback once)
          Carol: 99%
          Alice: 95% (the transaction above got negative feedback, lowering the percentage over the year preceding the current date)
          Barbara: 100% (the transaction with negative feedback has dropped now out of the calculation period)
          The table would report the mean (98.4) and standard deviation (2.1).
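
          A quick numerical check of that made-up example (using R's sample standard deviation):

          > pct <- c(100, 98, 99, 95, 100)   # the five hypothetical "percent positive feedback" values
          > mean(pct)
          [1] 98.4
          > sd(pct)
          [1] 2.073644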

          For me the “in the year preceding the current transaction” indicates clearly that you have one datapoint per transaction. I don’t think you were considering 148017 percentages and averaging them.

        • Carlos wrote: “I’m not sure if I made myself clear. I interpret it,…”

          Ok, I see now.

          1) What method should be used to compare two sets of data generated in this way? In your example Alice appears twice, so her data is being counted twice (plus one extra transaction the second time); it seems wrong to just plug the averages of these into a t-test…

          2) Don’t we need to know the number of men and women sellers to perform this analysis? That seems to be missing from the paper.

    • How did they get p less than 0.05? If there are about 600 transactions without positive feedback, the square root is about 25, so the difference is not even one standard deviation.
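
      As a rough sketch of that arithmetic (treating the count of transactions without positive feedback as having roughly square-root sampling noise):

      > sqrt(592)   # about 24, roughly the size of the ~30-transaction gap computed above
      [1] 24.33105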

  2. Is your criticism of that P-value line that it’s a forking paths thing? Or that it’s such a small difference that nobody cares? If the latter, I don’t think I’m on board – 2 cents on each of a billion transactions or whatever these auction sites get is still a lot.

    • Lauren:

      As I wrote in the above post, the observational study comes from real transactions and N is essentially infinity, so the problem is not with p-values or statistical significance; but the garden of forking paths still comes into play, as there is still the selection of which among many possible comparisons to present, and the selection of which among many possible regressions to run. There’s also variation: a positive difference of .02% (that is, .0002) in one dataset could easily be -.0002 in another.

      • Isn’t the “garden of forking paths” a criticism that you can apply to pretty much every study that uses p-values?

        Do we even have to go into the specifics of a NHST study to decide?

        • Rahul:

          As I’ve written before, I think the best approach is to look at all comparisons and interactions that might be of interest, rather than selecting based on p-values, which creates all sorts of problems.

      • Forking paths – yes, that’s a problem. But imagine for a second that this was a pre-registered single-test evaluation. In that circumstance, we would believe the p-value, so getting +.02% would be unlikely if the true value were -.02%.

        Would “there is variation” still be a valid concern? I don’t think so, in that experiment.

        The difference seems small in relation to other dollar-value figures we encounter in life, but I don’t think that’s relevant to the statistical certainty about whether the effect is positive or not. The effect size would be big enough relative to the noise given the sample size…

        • Lauren:

          Yes, “there is variation” would still be a relevant concern to me, whether or not the design was preregistered. Preregistration affects the p-value but it doesn’t affect the fact that this is one of many comparisons that might be of interest, nor does it affect that a difference in this particular dataset will not necessarily reappear in other data. I don’t see there being any statistical certainty about whether the effect is positive or not, because in a new scenario the difference could easily go in another direction. I’m comparing .0002 not to the standard error but to variation we might see in different settings.

        • Another way to put this is instability of the effect through time. When you’re sampling from a fixed set of items at a particular time, then yes you could say “yes there really probably was at time t=0 a 0.0002 fractional difference”. But things do not stay static in the world, and this is often (almost ALWAYS?) ignored in frequentist analyses with tests because the model is “if it were a random number generator” and random number generators do the same thing every time you call them.

          A better thought model is “during the period t=0 to t=t1, if you pretend that we had a random number generator with constant parameters, how big a difference between the parameters could we detect?” That puts the proper hypothetical thinking into your head. Whereas later in time, you could have literally ANYTHING else. The laws of physics don’t prevent all the men on eBay from simultaneously dying of a virus that infects only males…. I mean, sure it’s unlikely, but the fiction of a random number generator gives us a false sense of surety about the future, other places in the world, other legal or business circumstances, etc.

        • What evidence are you using to say that “in a new scenario the difference could easily go in another direction”? Is there any evidence for or against this claim in the study? What kind of evidence would qualify?

      • > A positive difference of .02% (that is, .0002) in one dataset could easily be -.0002 in another.

        If they had reported “80% for women and 60% for men, P < 0.05”, would you also say that the +20% difference in one dataset could easily be -20% in another?

        Also, they could have used the variable "percentage non-positive feedback" to describe their data. Would "0.40% for women and 0.42% for men, P<0.05" be equally funny?

        • Carlos:

          In answer to your questions:

          Of course there will be changes from one scenario to another. But a change of .0002 is much easier to envision than a change of .20. That’s a factor of 1000 difference, and a factor of a thousand is a lot!

          No, a comparison of .0040 to .0042 would not have been as funny. Part of the humor was that they made the comparison hard to see with all those 99’s.

        • I don’t see why the size of the difference is relevant in itself. For example, if the years in eBay (9.02 for women, 9.81 for men) were also reported as p<0.05, why should I expect them to change less "from one scenario to another" (whatever that means)? And I don’t think that this variable would be any less likely to change if it were given as months in eBay (108 for women, 118 for men), just because the difference is an order of magnitude larger.

          The variable "percent positive feedback” has a distribution which is very concentrated at 100%, with a small proportion of "bad" sellers. The simplest model I can think consistent with the reported mean and variance is that the positive feedback is 100% for good sellers and 85.8% for bad sellers. The rate of bad sellers is 2.82% for women, 2.97% for men. In "another scenario" these results could change, but also the reputation (275 for women, 260 for men) or the duration of the auctions (4.81 for women, 4.66 for men) could change. The p-value for those is lower, so the measures seem to be more precise. But they are not necessarily more stable: one can imagine for example seasonal effects affecting the duration of the auctions while the characteristics of the sellers remain stable. Of course there could be as well seasonal patterns on the sellers propensity to auction items. Omnia mutantur!

        • Carlos:

          The p-value has nothing to do with it. Nothing at all. What I’m saying is that 0.0002 is a tiny difference, much smaller than the variation I’d expect across different scenarios. 0.20 is a much bigger difference, a thousand times bigger. Scale matters.

        • Overdose death rates in the US increased from 0.0138% in 2013 to 0.0147% in 2014. That’s a tiny difference, but people are rightly concerned about the issue. Scale matters, but it’s not the only thing that matters. The key point in this discussion is, I think, “the variation that you would expect across different scenarios”. You might expect a large variance relative to the difference for the variable “percent positive feedback”, but it’s far from obvious. The magnitude of the change (relative to the corresponding differences) could be similar for the other variables.

          I don’t see, for example, why (if I ignore the p-values, as you urge me to do) the difference in “percent positive feedback” would be more likely to change its sign across different scenarios than the difference in “reputation”. One could expect those variables to be correlated (at least if men and women have a similar “transaction count”, due to how it is calculated). And one can argue that the “scale” of the differences is similar: reputation is 275 for women, 260 for men (5.8% higher for women) and the “percent non-positive feedback” is 0.40% for women and 0.42% for men (4.8% lower for women).
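
          For concreteness, those relative differences, using the table 1 figures quoted above:

          > (275 - 260) / 260      # reputation: about 5.8% higher for women
          [1] 0.05769231
          > (0.42 - 0.40) / 0.42   # non-positive feedback rate: about 4.8% lower for women
          [1] 0.04761905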

        • Carlos: there are two sources of variation of interest

          1) variation that can be explained by sample size effects

          2) variation that can’t be explained by sample size effects

          The p value tells you that the difference is big enough relative to your sample size that if it comes out of a random number generator, then it probably isn’t due to sampling variation alone.

          But we know this doesn’t come out of a random number generator, in fact we know it comes out of hundreds of thousands of human interactions. And we know human interactions are inherently variable, and we know something about the kind of variation. So it’s totally plausible to me at least that the entirety of this size of effect could be something like “during this time period 3 to 5 more women than men were having trouble with their marriage/automobile reliability/mortgage payments/health/insurance claims/unexpected house maintenance/whatever and therefore were less attentive to their EBay customer satisfaction whereas in the next 6 months it could easily be that 3 to 5 more men were having those types of trouble.”

          On the other hand, I don’t find it credible that 20% differences would exist in those kinds of factors, and that those 20% differences would reverse themselves in a different time period, so no we probably wouldn’t blow off the 20% differences.

          The point is, the p value ignores all the background information that we have, and treats the whole thing as a stable fixed random sampling process, which we know it isn’t.

    • The assumption behind the p value is that there are two populations of stuff out there, and you’ve got a random sample from each, and you’re trying to find out if the two populations have the same exact average and that every data point is known to effectively infinite precision. In other words, that you’re just calling two different random number generators and trying to see if they have the same parameters put in to the computer code.
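
      A tiny simulation sketch of that "two random number generators" picture (hypothetical; the exact p-value will differ from run to run):

      > set.seed(1)
      > # two "populations" generated with the *same* underlying rate
      > x <- rbinom(1, 148017, 0.0042)
      > y <- rbinom(1, 483499, 0.0042)
      > prop.test(c(x, y), c(148017, 483499))$p.value   # usually large; below 0.05 only about 5% of the time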

      As Joseph points out below, and Anoneuoid above, there are lots of other issues regarding this stuff that can contribute to the differences, in particular simple things like roundoff errors in data collection or reporting. Someone could have exported the data from Excel and re-imported it to Stata, and the roundoff in the csv file would be the entirety of the difference!

      Also, it’s not differences on the order of $0.02/dollar that we’re talking about, it’s $0.0002/dollar. And the assumption that you have a stable random number generator and that this $0.0002/dollar difference continues through time consistently is totally unwarranted. As Anoneuoid points out above, one person having an extra good week once in a year and getting 10 or 20 extra sales or whatever would account for the whole difference.

      Another point is that probably out there in this dataset is some really important real problem, but it’s a messy noisy problem, and because it’s noisy the p value isn’t small, and so we gloss over it as if it were zero but in fact maybe there’s a 4 or 5% consistent difference in earnings between men and women in some part of the economy, it’s just that the noise in the measurement is also 4 or 5% and is maybe seasonally varying, and bleblabla so we’re ignoring it because its p value is too big.

      So, instead we’re looking at “significant” effects that are of the order of what could easily be caused by pure user error in roundoff when exporting data, and we’re pretending it’s important “significant” stuff!

      • The difference is in proportions of transactions with positive feedback, not dollars.

        The entire sentence is, “Likewise, women had a slightly higher percentage of transactions for which positive feedback had been given in the year preceding the current transaction (99.60% for women and 99.58% for men, P < 0.05)."

        I don't see what the problem is here: they're not engaging in "asterisk statistics," they state the magnitude of the point estimate, note that it is only a "slight" difference, and then give the p-value in parentheses, which seems appropriate to me. They could have said "essentially identical" or something like that rather than "slight," I suppose. I haven't read the rest of the paper and it may well be seriously flawed, but this particular sentence fragment only appears laughable because Andrew has ripped it out of its context.

        • Chris:

          I literally ripped the quote out of context using that screencap but, as I wrote above, the paper has many (inadvertently) funny lines. There are a couple problems here. First is what I wrote at the end of my above post: It should be possible to present the relevant comparisons in the data in some sort of large grid rather than following the playbook of picking out statistically significant comparisons. Second is what I wrote in my above comment, that a positive difference of .02% (that is, .0002) in one dataset could easily be -.0002 in another. Of course it’s better for this .0002 difference to be characterized as “slight” rather than “large,” but I think the problem is in singling it out in the first place. I don’t think this is a good way to learn about the world, to stir a dataset and pull out statistically significant comparisons and then tell stories about them.

        • > I don’t think this is a good way to learn about the world, to stir a dataset and pull out statistically significant comparisons and then tell stories about them.

          But always a good way to get a story purportedly about the world that you can claim you _learned_ from the data!
          (And seemingly empirical science based – nothing here but us science(s))

  3. I found a lot of these sorts of things when I worked with credit bureau data. When numbers are in the millions, we are generally sure that differences are not due to sampling error. However, there has to be a point where measurement error is a larger source of contamination than sampling error, and p-values are not especially informative in those conditions.

  4. It’s an amusing sentence, but what would be a better way to phrase it? Should the decimal numbers just be rounded off or truncated, or is there a better way to present them? I think maybe I would rather have the authors give a tiny bit of unnecessary information and then allow the readers to judge its meaninglessness afterwards, than encourage authors to put less information in the paper.

    • Potatofish:

      It’s not about rewriting a sentence. The problem is larger than that. Here’s what I wrote in my post above: “It should be possible to present the relevant comparisons in the data in some sort of large grid rather than following the playbook of picking out statistically significant comparisons.”
