A short questionnaire regarding the subjective assessment of evidence

E. J. Wagenmakers writes:

Remember I briefly talked to you about the subjective assessment of evidence? Together with Richard Morey and myself, Annelies Bartlema created a short questionnaire that can be done online. There are five scenarios and it does not take more than 5 minutes to complete. So far we have collected responses from psychology faculty and psychology students, but we were also keen to get responses from a more statistically savvy crowd: the people who read your blog!

Try it out!

65 thoughts on “A short questionnaire regarding the subjective assessment of evidence”

  1. Yes. I got an error for Question 3 also, so couldn’t proceed.

    I also had a problem with Scenario 2, because the instructions said not to think about the questions too much when answering. But the question was asking for numbers associated with different strengths of relative conditional probabilities (without giving any absolute probabilities). The whole point of numerically formalising assessments of evidence is that you stop and think about it – numbers and their definitions are precise.

    • Although I am still not sure what exactly was causing the problem, I changed some settings, and it seems to work now for most people. For those who encountered problems but still want to finish the survey, please empty your cache and cookies first, or open the link in a different browser, because otherwise you will continue in the old version and might still encounter the error…

      Apologies for the error and thanks to everyone who already filled in the survey!

  2. Same here with question 3
    it keeps asking : “Please fill in a non-negative number.”
    even though my number is positive. I also tried with a negative number and I have the same error

  3. It also works for me on Chrome. We did check it thoroughly and it always worked. Weird. I do apologize and I hope it works for most people!
    Cheers,
    E.J.

  4. “Professor Groberman conducts an experiment to test two competing hypotheses, one of which is true.” This is a nonsense assumption.

    Also what do you do if you don’t buy that all these questions can be decided only knowing the likelihoods?

        • As stated, the statement seems to be asserting that they are exhaustive, i.e., smoking either changes your cancer risk or it doesn’t. But one of those has to be true. Right?

      • OK, without the whole question my comment may seem a bit too negative. But in the question, the hypotheses can be used to compute likelihoods, so they are statistical hypotheses, none of which is ever precisely true.

        • It sounds like you are making a case for estimation rather than decision between two specific competing hypotheses. That can be a very good argument to make as, in my estimation, most experimental results are more meaningful when viewed in the light of effect size estimation than when stripped down to a decision between two pre-specified hypotheses.

          Smoking changes the risk of cancer? Sure, that’s interesting, but how much it changes the risk of cancer is far more important.

        • These questions are flawed because they ask you to interpret one piece of evidence without context. As an example, consider the scenario below:

          An experiment is performed testing whether mind reading exists. The observed data show that it is Y times more likely to exist than not.

          What is the minimum value of Y for these results to be characterized as:

          a) Moderate evidence mind reading exists
          b) Strong evidence mind reading exists
          c) Very Strong evidence mind reading exists
          d) Extremely strong evidence mind reading exists

          Does anyone here think it is possible to choose a reasonable Y based on this information alone? Clearly the answer depends on the experimental conditions, what has been done regarding systematic error, and the plausibility of the two hypotheses.

          I could not bring myself to answer the questions.

        • An experiment is performed testing the age of an iPhone found in the great pyramid. The observed data show that it is Y times more likely to be 3000 years old than 1 year old.

          What is the minimum value of Y for these results to be characterized as:

          a) Moderate evidence the iPhone is 3000 years old vs 1 year old
          b) Strong evidence the iPhone is 3000 years old vs 1 year old
          c) Very Strong evidence the iPhone is 3000 years old vs 1 year old
          d) Extremely strong evidence the iPhone is 3000 years old vs 1 year old

        • You are confusing the evidence in the data with the opinion that might be held once the evidence is taken into account. The evidence is the evidence, and so the implausibility of the hypothesis being tested is not really important to the question.

          (If you are pretty much certain that mind reading exists then even very strong evidence will not much change your opinion.)

          Frequentists might say that the method by which the evidence was obtained is important, but that is not what you seem to be talking about.

        • Michael,

          Sorry the above anon posts were me. I am not confusing the two. How can I attribute a “strength-level” to the evidence other than via opinion regarding the experimental conditions? This logic is identical to that of Fisher.

        • Think about how an opinion is formed. It is (should be?) based on (at least) the evidence and the plausibility of the hypotheses under consideration. If that is true then the evidence and the plausibility are separable components that allow formation of opinion.

          You can attribute ‘strength-level’ to the evidence via the likelihood function. The likelihood principle says that, given a statistical model, the evidence from a dataset relevant to a parameter of interest is (entirely) contained in the relevant likelihood function. Quantitation of the evidence in favour of one hypothetical parameter value over another is obtained as the ratio of the likelihoods at those parameter values. Thus Wagenmakers’ question is answerable without reference to plausibility.

          The role of likelihood functions in the quantitation of evidence is missing in most statistics courses and textbooks (look up ‘likelihood function’ in the index of your favourite textbook: it may not even appear), but its omission is lamentable as it is essential for understanding inference. David Cox wrote this: “The likelihood function, when it can be calculated, is central to all approaches to statistical inference.” (p. 48, Cox 1978, Austral. J. Statist. 20, 43-59).
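The ratio of likelihoods described in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: a binomial model, made-up data (14 successes in 20 trials), and two hypothetical parameter values (0.7 and 0.5). None of these numbers come from the survey.

```python
from math import comb

def binomial_likelihood(theta, successes, trials):
    """Likelihood of the parameter value theta given the observed binomial data."""
    return comb(trials, successes) * theta**successes * (1 - theta)**(trials - successes)

def likelihood_ratio(theta_a, theta_b, successes, trials):
    """Ratio of likelihoods at two parameter values on the same likelihood function."""
    return (binomial_likelihood(theta_a, successes, trials)
            / binomial_likelihood(theta_b, successes, trials))

# Hypothetical data (an assumption for illustration): 14 successes in 20 trials.
# Compare theta = 0.7 against theta = 0.5.
lr = likelihood_ratio(0.7, 0.5, successes=14, trials=20)
print(f"Data are {lr:.2f} times more likely under theta=0.7 than under theta=0.5")
```

The binomial coefficient cancels in the ratio, which is one way to see why only relative likelihoods matter for this kind of comparison.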

        • Michael,

          My understanding is that the likelihood function describes the relative strength of evidence for a set of numerical values. This is different from the strength of evidence for a given hypothesis. The question I satirized asks for eg: “Moderate evidence (for the CURL hypothesis over the SWIRL hypothesis)”

        • Question, you are right about the likelihood function, in so far as that goes, but that doesn’t make the survey question wrong.

          I’m using the principle of charity, which requires the text to be read in a manner that maximises its sensibleness, along with the survey admonition to not over-think things. They lead me to suppose that the two hypotheses in question are about parameter values (they could be vectors of values) and also to suppose that the parameter values lie on a single likelihood function. Otherwise most of the questions become nonsense because the ratio of likelihoods asked about has no inferential meaning.

          (Did you look up ‘likelihood function’ in any indices?)

        • “suppose that the two hypotheses in question are about parameter values (they could be vectors of values) and also to suppose that the parameter values lie on a single likelihood function.”

          I interpreted it this way as well, but assigning strength to evidence without regard for systematic error is not possible for me. It is NEVER only about measurement or sampling error. Often that is the least of the problems although it is one that needs to be addressed. Because that is the only aspect of the problem we have algorithmic methods available to address it receives inappropriate focus.

          Even if hypothesis A is simply that “X=3” and hypothesis B is that “X=6”, I still cannot assign a strength of evidence without info on the assumptions made, reliability of the procedure, etc. Simply add extra information into the question and see if you would answer the same.

          “Professor Bumbledorf conducts an experiment. The results show that the observed data are X times more likely under the CURL hypothesis than under the SWIRL hypothesis. Also, a miscalibration occurred that biased the results in the direction predicted by the CURL hypothesis.”

          As to: “(Did you look up ‘likelihood function’ in any indices?)”
          I am not sure what you are asking here.

        • “I interpreted it this way as well, but assigning strength to evidence without regard for systematic error is not possible for me. It is NEVER only about measurement or sampling error. Often that is the least of the problems although it is one that needs to be addressed. Because that is the only aspect of the problem we have algorithmic methods available to address it receives inappropriate focus.”

          Most of the questions are about relative evidential strength.

          We need to produce assessments of relative evidential strength all the time. How do you do it? If you know of a better way of doing it quantitatively than with likelihoods, please share.

        • Anon,

          “We need to produce assessments of relative evidential strength all the time. How do you do it? If you know of a better way of doing it quantitatively than with likelihoods, please share.”

          Attempting to assess the strength of evidence without taking into account the reliability of the process that generated that evidence is a mistake. There simply is no way to do it quantitatively. Calculating the likelihood is only a part of an overall process.

          The researcher needs to investigate the data for sources of systematic error and then report these honestly. Other people should run the experiment to see if they get the same result and report their own ideas on what possible sources of systematic error there may be. Then future experiments try to control for each of the sources of systematic error according to their estimated influence. Any hypothesis is part of a larger framework (the theory) that makes other predictions. If the observations match up to these other predictions as well then it lends strength to the original result (this is where the plausibility of the hypothesis comes into play).

          It also matters whether the data was collected before or after the prediction was made.

        • I thought of another way to express my thoughts. Statistics provide answers to a “spherical cow” type problem. These approaches are useful in that they give an exactly correct number if a bunch of assumptions are true. In reality the set of assumptions is always an approximation to the real data generating process and experimental conditions.

          The deviation from the assumptions of the model and also about experimental design can be quite large. If the influence of the deviation is small we could say that the data is reliable. The questions in the survey do not provide enough information to answer the question of whether the data is reliable. In reality we never have complete information on this, but some estimate must be made to assign strength to the evidence.

        • Christian Hennig: I’m well aware that there are many who deny the likelihood principle. However, I’m not sure that the reasons for denial are sufficiently well founded to require no description beyond “good reasons”.

          Question: Have a close look at the implications of your last comment:
          “The researcher needs to investigate the data for sources of systematic error and then report these honestly. Other people should run the experiment to see if they get the same result and report their own ideas on what possible sources of systematic error there may be. Then future experiments try to control for each of the sources of systematic error according to their estimated influence. Any hypothesis is part of a larger framework (the theory) that makes other predictions. If the observations match up to these other predictions as well then it lends strength to the original result (this is where the plausibility of the hypothesis comes into play).”
          It seems to imply that there is no method that can quantify the evidence within a single dataset for and against any parameter values. That is perhaps more nihilistic than you intend it to be. How do you deal with recursive assessment of the “strength” lent to the original result?


        • “It seems to imply that there is no method that can quantify the evidence within a single dataset for and against any parameter values. That is perhaps more nihilistic than you intend it to be. How do you deal with recursive assessment of the “strength” lent to the original result?”

          When there is no information about systematic error any quantification of evidence is misleading. Once there is further evidence that such errors are minimal then the original evidence can be taken as somewhat reliable. This can be in the form of eg, replication studies, surprisingly correct predictions of the parent theory, or detailed records of all that occurred during the study showing no evidence of “weirdness”.

          I do not think it is nihilistic, only realistic. In the biomed field there have been a few reports indicating >70% of results are not successfully replicated, and zero reports saying otherwise even though it has been nearly a decade. For most papers there is not even an attempt to do so, people just make the error I am lamenting here.

          In my own sub-sub field there are groups that never see the drug “work” and groups that always see the drug “work”, with 5+ studies for each group. People do not want to believe how bad it is to trust a single study, but like I said above sampling error is often the least of the problems.

        • Question: So, if I read your comments correctly, by simply assuming charitably that there are no reasons to suppose that there are systematic errors in the data that is being used to provide evidence in the survey questions then those questions should, in principle, be answerable. Can you answer them? Can you answer them without recourse to likelihoods?

        • Michael,

          I am ahead of you there, I’m glad you concluded the same thing should be done. I attempted to answer the questions and found that in the absence of any other evidence and if a decision must be made I would not assign strength, but would simply go with the hypothesis that had the evidence in its favor.

          For example with the judge/murder question. If there was only DNA evidence I would say let both go free regardless, if one must go to jail it would be the one the evidence pointed to. Assigning a strength to the evidence did not enter into my thought process, it was either -1,0,+1.

        • Question: OK, that is a perfectly frequentist approach to evidence. It’s my opinion that it is a most unfortunate approach because it amounts to pretending that there is no such thing as evidence in a particular dataset that is relevant to a particular hypothesised parameter value.

        • Michael,

          My reasoning is not frequentist at all. It is just that I refuse to put a more exact value when I can’t justify it. If I need to make a decision *right now* based on one piece of evidence what difference does it make if I am 50% sure or 1% sure? Clearly that depends on the cost-benefit, information also missing from these questions. My answer to all of them is essentially 1/inf.

        • Michael (if you happen to notice this)

          > role of likelihood functions in the quantitation of evidence is missing in most statistics

          And I would say in many statistician’s thinking including many Bayesians who _only_ think of it as either a black box transformations of prior into posterior or the Frequentist route to an MLE and its SE.

          I have an appendix here that attempts to give an overview of likelihood functions including a straightforward (sans measure theory) account of why only relative likelihood functions are needed.

          http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

        • Anonymous (reply to 24 April 5:49; we have exhausted the comment hierarchy) – is it you, Michael?

          Anyway, all my comments in this thread (and elsewhere unless I make stupid mistakes) are under my proper name; I’m not identical to “question”, and the comment that you cite and apparently think was mine isn’t mine.
          It is not too bad though.
          Actually I think that one can come up with all kinds of measurements of evidence but which of these make sense in a given situation is a very tricky one. And as I wrote before, in most of the situations in the survey I’d like to know more before deciding what I think should be used there.
          I wrote about this, by the way:
          http://www.univie.ac.at/constructivism/journal/5/1/039.hennig

        • Christian Hennig: Sorry about my several anonymous posts. The system puts the cursor in the text box by default and so when I forget to back-track to enter my details (over and over again) my comments are anonymous. I didn’t think you were responsible for the comments of “question”, and where I wrote “Question” it was my intention that the comment applied to the person calling him- or herself “question”.

          I will read your article with interest, but I feel it necessary to point out the fact that a Bayesian posterior does not quantify evidence, as it is the product of the likelihood function and the prior. Your paper seems to be partly based on a false premise.

          You might be interested in reading an extended riff on P-values and evidence that I’m trying to get published: http://arxiv.org/abs/1311.0081
          The conclusions are that P-values are useful indices of evidence and their use persists because of their utility to scientists.
          (So far the referees’ reports have been strongly negative but their comments are mostly devoid of any real substance. It’s as if the conclusions are unpalatable and so the referees have to devise shortcomings to attack.)

        • Michael: “I feel it necessary to point out the fact that a Bayesian posterior does not quantify evidence, as it is the product of the likelihood function and the prior. Your paper seems to be partly based on a false premise.”
          The posterior measures to what extent a hypothesis is confirmed/to be believed post data. I know what you mean; the posterior doesn’t measure evidence in the data alone, but one can take it as measuring the available evidence post data in the sense above.

          The paper grew out of a presentation that made reference to discussions that took place in UCL’s evidence project over the years and a number of Bayesians there took this view. In my paper I don’t say that this is how you should measure evidence, I’m rather discussing that this is used as evidence measurement by some people, and I think that this is a legitimate use of the term, even though I understand why this is not the case from your point of view.

        • PS: I’ll have a look at your paper, too, but chances are it’s not going to make me a follower of the likelihood principle.

        • Christian: Out of interest, which particular variant of the likelihood principle do you feel resistant to and which of its urgings do you think are inappropriate?

    • I believe the purpose of this is simply to take the question as though it were true in some idealistic sense and respond from there. That was my assumption anyway. Of course if that’s the case, that direction should be included in the survey.

      I get the point of what they are attempting to gather though, and it does merit consideration. Will be curious to see the results.

      • You’re probably right. However, I’m always interested in what messages are implicitly communicated by survey questions, and the message here seems to be that such situations really exist, so I object to it. (I also object to the message that, based on the given information, it should be possible to give a reasonable answer to all the questions.)

        • Christian,

          You are correct. It is not possible for a thinking person to provide reasonable answers to these questions; you did the survey wrong. That’s why they tell you not to “overthink” in the instructions. Later tonight I am going to attempt reducing my thinking to a level capable of no more than placing my cursor in the box and typing the numbers the questions make me think up.

    • We are curious about the results too! :-). We already have a lot of responses from psychology faculty and from students (psychology and psychobiology). We will let Andrew know when we have a document that summarizes the results, and I hope he’s willing to post it on his blog. As for the objections that other people here have raised to the way specific questions were stated: yes, I definitely agree that the questions could be improved. However, I’d like to counter that the questions were the end product of a long process of continual refinement. At some point, however, we had to strike a balance between the examples being exhaustive and them being interesting; lengthy descriptions do not increase people’s motivation.
      Cheers,
      E.J.

      • @E.J.

        Can you track where the participants are getting to the survey from? I can’t help but think one would get different results if you asked these questions of psychologists, lawyers and physicists, even after considering differences in statistical proficiency. Each profession has to deal with probabilistic statements, but will differ on average as to what constitutes strong evidence. It’s not as if there is a “right” answer to any of these questions.

        In addition, do you mind if any of us repost the link to the survey?

        • @West

          We know whether the responses came from the faculty, the students, or this blog. However, there is only one link for this blog and we cannot track where people are getting the survey from. However, we can use the date. So it is perfectly fine for you to repost the link — it was never a secret anyway.

          Cheers,
          E.J.

  5. I was bothered by the question (I’m paraphrasing from memory): John and Richard are suspected of murder, and at the trial of John DNA evidence is presented showing that John is X times more likely than Richard to have committed the murder. John is convicted and sentenced to 20 years in jail. What is the minimum value of X justifying this conviction and sentence?

    As I see it, there are two issues here. One is “how much evidence do you need to justify jailing someone for 20 years”, and seems, based on the other questions, to be the focus of the survey. But the other is “what sorts of values for X is DNA evidence likely to produce?” If John and Richard are identical twins, 1 is a likely value. If they’re unrelated, I hope DNA evidence would distinguish between them with very high confidence, such that if (hypothetically) 40-to-1 odds justified jailing someone for 20 years, I might discount DNA evidence suggesting 40-to-1 odds anyway, on the grounds that such low confidence from a DNA test meant the test had been bungled.

    • That question seems to hope but not assume you know that DNA evidence can be misleading and that it can be presented in ways that make the odds astronomical when in fact there may be 100 people who fit or 10 in that area, etc.

      I thought that question, like the others, was a way of getting you to say how you weight proof. I particularly liked the last: the punishment is a smaller piece of pie so I don’t care that much about whether mom analyzed the data right.

      • I had problems with the murder conviction thing too, because it said that “one of them IS the murderer”. Now we have to weigh the consequences of wrongly convicting against the consequences of potentially letting someone go who is definitely the murderer. In most murder situations the prior probability that one of the other 6 billion people on the planet did the murder is not exactly zero. In fact it’s pretty high.

        Also I wasn’t clear on whether “likely” was used as a technical term (likelihood) or as an everyday word meaning “probable” and because of that it wasn’t clear whether prior information should be considered to have been included in the evaluation.

        • Weighing the consequences of mistakes is necessary to any rational approach to justice, but is not part of the determination of evidence. The question asks about the evidence, not about consequences of erroneous decisions.

          I would say that the word likelihood in the questions is used in its technical sense, but unlike most statistical technical terms the technical meaning of likelihood is essentially the same as its non-technical meaning.

        • It’s not asking about the evidence, it’s asking about what constitutes strong evidence *within the context of the decision to be made*. This is why I personally allowed a much lower ratio for the larger slice of pie than for the murder investigation. I think pretty much everyone is going to agree with me that something like a 2:1 ratio might be ok for a slice of pie, but would be hardly sufficient for “beyond a reasonable doubt” in a murder investigation.

        • “2:1 might be ok for a slice of pie”?

          If the children were (slightly) biased coins – named “0.45” and “0.55” – and the evidence came from 50 tosses, then, were 0.45 truly guilty, you’d wrongfully convict 0.55 roughly one in 8 times. You monster! Just imagine the psychological harm!
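The “roughly one in 8” figure in the coin example above can be checked with a short calculation. This sketch rests on one reading of the comment, which is an assumption: the 50 tosses are generated by the guilty coin (bias 0.45), and the innocent coin (bias 0.55) is “convicted” whenever the likelihood ratio in its favour reaches at least 2:1.

```python
from math import comb, log

def log_likelihood_ratio(heads, tosses, p_a, p_b):
    """Log of the likelihood ratio for coin bias p_a over coin bias p_b."""
    tails = tosses - heads
    return heads * log(p_a / p_b) + tails * log((1 - p_a) / (1 - p_b))

def binomial_pmf(heads, tosses, p):
    """Probability of observing `heads` heads in `tosses` tosses of a coin with bias p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

def wrongful_conviction_rate(tosses=50, p_guilty=0.45, p_innocent=0.55, threshold=2.0):
    """Probability that the data favour the innocent coin by at least `threshold`
    when the guilty coin actually generated the tosses."""
    return sum(
        binomial_pmf(k, tosses, p_guilty)
        for k in range(tosses + 1)
        if log_likelihood_ratio(k, tosses, p_innocent, p_guilty) >= log(threshold)
    )

rate = wrongful_conviction_rate()
print(f"Wrongful conviction rate: {rate:.3f} (about 1 in {1 / rate:.0f})")
```

Under these assumptions the 2:1 threshold is first reached at 27 heads out of 50, so the rate is the upper tail of a Binomial(50, 0.45) distribution from 27 upwards.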

  6. Like some others, I also found the sense that I was choosing always between two hypotheses “one of which is true” (or simply evaluating evidence for “one over the other”) to be, I guess, unrealistic. In that situation, I wouldn’t need very high ratios to let something count as evidence, whereas if I wasn’t sure either hypothesis is true, I’d need something stronger.

    In any case, this survey showed me how vague my statistical intuitions are. I’m curious to know whether I was even in the ballpark with my answers. I put “moderate” evidence at 1.2:1 and strong at 2:1 for the hypothesis testing. But when it came to rewards and punishments I wanted something even stronger like 5:1 (for the pie) and 10:1 (for the murder). But sometimes I wondered whether I shouldn’t be up in the 100s or 1000s. (Sometimes when people report results of even very implausible research, they say that the odds of the result being reached by “chance” are millions to one. But is that the sort of thing this survey was looking for?)

    • Very good! It does make a difference for answering whether one unrealistically takes for granted that one of the hypotheses is true, or whether one rather uses these as approximate idealisations, which they are in reality.
      (No new point in this posting, just thumbs up to Thomas for making this clear.)

    • If it’s reassuring at all, it sounds like we gave very similar responses, although I did make the leap to go for 1000 times more likely for the murder trial (hopefully this is not the only piece of evidence!).

  7. I didn’t much like question 1 and gave up at question 2. Suppose we construct a test for comparing (CURL-SWIRL) using a frequentist approach and get a p value of 5%, i.e. borderline between significant and non-significant; how, then, does this result translate into a value of X? I would need to think about this. We could try using the likelihood ratio but need a model. And we cannot give an answer without knowing the amount of replication. If X=100 it might indicate moderate evidence for a low level of replication but extreme evidence with high replication. So, without additional information the question cannot be answered.

    Because I gave up my contribution to the survey will not be included in the results. I am interested to see the results but I have no idea what they will tell us – not very much I guess.
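One way to see why a p value alone does not pin down X is to fix the p value and vary the sample size. The sketch below is not part of the survey and assumes a simple model chosen for illustration: n independent normal observations with known standard deviation `sigma`, a null hypothesis of mean 0, and a hypothetical point alternative of mean `delta`. With an observed z statistic, the likelihood ratio for the alternative over the null works out to exp(z·d − d²/2), where d = delta·√n/sigma.

```python
from math import exp, sqrt
from statistics import NormalDist

def lr_alternative_vs_null(z, delta, sigma, n):
    """Likelihood ratio for mean = delta over mean = 0, given an observed z statistic.

    Assumes n independent normal observations with known standard deviation sigma.
    """
    d = delta * sqrt(n) / sigma  # alternative effect in standard-error units
    return exp(z * d - d**2 / 2)

z = NormalDist().inv_cdf(0.975)  # z statistic matching a two-sided p value of 0.05

# delta = 0.2 and sigma = 1.0 are hypothetical values chosen for illustration.
for n in (10, 50, 200, 1000):
    lr = lr_alternative_vs_null(z, delta=0.2, sigma=1.0, n=n)
    print(f"n={n:5d}  LR(alternative vs null) = {lr:.3f}")
```

In this model the ratio peaks when the alternative coincides with the observed mean (d = z), and for very large n the same p value favours the null over this fixed alternative.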

  8. Are they mutually exclusive? 6% likely is 2 times as likely as 3%, but that does not convince me to convict someone of crimes like murder and pie-biting. If we’re talking 60% vs 30% (also twice as likely), then that’s much more evidence. I’m assuming similarly sized confidence intervals for both. Without absolute numbers (and an error margin), I found it hard to answer these questions…

    • In the murder case they say that one of the two suspects IS the murderer, so from a probability perspective we can say that p+q=1, this is one of the objections I raised above. Very rare for murder investigations. It’d have to be something like the murder was committed in a jammed elevator where we know that only the 3 occupants were involved… or something equally unusual.
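If the two hypotheses really are exhaustive (p+q=1), a ratio X can be turned into an absolute probability. The sketch below adds an assumption that the thread does not state: equal prior odds for the two suspects unless `prior_odds` says otherwise.

```python
def posterior_probability(likelihood_ratio, prior_odds=1.0):
    """Posterior probability of hypothesis A when A and B are exhaustive (p + q = 1).

    prior_odds is the prior odds of A over B; 1.0 (equal odds) is an assumption.
    """
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Hypothetical values of X, chosen only to show the scale.
for x in (2, 10, 40, 1000):
    print(f"X = {x:5d}  ->  P(John is the murderer) = {posterior_probability(x):.4f}")
```

With X = 2 and equal prior odds this gives 2/3, so “twice as likely” leaves a one-in-three chance of convicting the wrong person in the two-suspect setting.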

  9. Given there is no widely agreed upon way to calibrate Bayes factors or likelihood ratios (other than maybe comparing two unique points in the same parameter space) it will be interesting to find out what they were looking for and found.

    • K?: You are correct that the only way to use a likelihood is as part of a ratio of likelihoods that are part of the same likelihood function. However, many interesting inferential issues do boil down to comparisons of points in the same parameter space and so such a restriction is not always a limitation.

      It is notable that many of the alleged counter-examples to the likelihood principle consist of attempts to compare likelihoods that are not points of the same parameter space.
