Most would argue that the strength of evidence is dependent on the level of significance. According to the Oxford Dictionary online, ‘Evidence is the available body of facts or information indicating whether a belief or proposition is true or valid’.
In science, significance levels are measured using P-values, which indicate the strength of evidence against the null hypothesis. As psychologists we use the criterion of P<0.05, which means that, if the null hypothesis were true, there would only be a 5% chance of obtaining results at least as extreme as ours. Other areas of science may use P<0.10 or P<0.01 when carrying out research.
For example, if I were testing drugs that could save lives I would want my criterion to be stricter than if I were carrying out an experiment studying children catching a ball. By using a smaller P-value such as P<0.01, there would only be a one percent chance of obtaining results this extreme if the null hypothesis were true. In other words, the smaller the P-value, the greater the evidence is said to be against the null hypothesis.
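To make this concrete, here is a minimal sketch in Python (the groups, means, and sample sizes are all hypothetical, invented purely for illustration): it runs a two-sample t-test with SciPy and checks the resulting P-value against both criteria.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two hypothetical groups, e.g. scores under two conditions.
group_a = rng.normal(loc=500, scale=50, size=30)
group_b = rng.normal(loc=530, scale=50, size=30)

# Independent-samples t-test: p is the probability of a difference
# at least this large arising if the null hypothesis (no real
# difference) were true.
t, p = stats.ttest_ind(group_a, group_b)

print(f"t = {t:.2f}, p = {p:.4f}")
print("significant at .05:", p < 0.05)
print("significant at .01:", p < 0.01)  # the stricter criterion
```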
However, Royall (1997) disagreed and argued that significance levels are NOT, in general, good measures of the strength of evidence. He put forward the problem of researchers confusing the strength of evidence with the probability of obtaining that evidence, which are two completely different things.
Firstly, this probability is what the P-value measures: as seen above, P<0.01 meant a one percent chance of obtaining my results if the null hypothesis were true. But that probability depends on a set of things that could have happened but did not. These factors are often subjective, such as the intentions of the experimenter, which can lead to experimenter bias; this can be something as simple as the number of participants an experimenter decides to use, or the removal of outliers.
To explain this further I have used an example from Dienes (2008). Let's say an experimenter decides to use 30 participants, but obtains a P-value of P=0.07, so she decides to add 20 more participants. By adding these extra participants she changes the true significance level of her test: to calculate it correctly she needs to account for the fact that she would have stopped at 30 participants had the result been significant there. This means that, depending on the decisions the researcher made about how many participants to test, the very same data can come out as significant or not. In this case she runs an inflated risk of making a Type I error.
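Dienes' point is easy to check by simulation. The sketch below follows the example's stopping rule (test after 30 participants per group; if not significant, add 20 more and test again), with data generated so that the null hypothesis is actually true; everything else, such as the number of simulated experiments, is an arbitrary choice for illustration. The false-positive rate comes out noticeably above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # The null hypothesis is true: both groups come from the
    # same distribution, so any "significant" result is a false positive.
    a = rng.normal(size=50)
    b = rng.normal(size=50)

    # First look after 30 participants per group.
    _, p = stats.ttest_ind(a[:30], b[:30])
    if p >= 0.05:
        # Not significant, so add 20 more per group and test again.
        _, p = stats.ttest_ind(a, b)

    if p < 0.05:
        false_positives += 1

# With a single fixed-n test this would be about 5%;
# optional stopping pushes it higher.
print(f"Type I error rate: {false_positives / n_experiments:.3f}")
```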
Royall goes even further, arguing that P-values are determined not only by what the experimenter did and observed, but also by what he would have done had the observations been different. As Royall remarked: 'P-values are not determined just by what the experimenter did and what he observed. They depend also on what he would have done had the observations been different - not on what he says he would have done, or even on what he thinks he would have done, but on what he really would have done (which is, of course, unknowable).'
Although Royall's argument is sound, there are ways to prevent significance levels from being inflated by running multiple tests, and so to reduce the risk of making a Type I error. For instance, we often use post hoc tests after running an ANOVA, which include a number of procedures for controlling the significance level.
One is the Bonferroni t (Dunn's test), which compares pairs of levels of the variable while dividing the significance level by the number of comparisons, so that the overall error rate does not grow as the number of comparisons increases.
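A minimal sketch of the Bonferroni idea (the six comparisons and their P-values below are invented for illustration): each individual test is judged against alpha divided by the number of comparisons.

```python
# Bonferroni correction: divide the familywise alpha by the
# number of comparisons being made.
alpha = 0.05
n_comparisons = 6          # e.g. all pairwise comparisons among 4 groups
adjusted_alpha = alpha / n_comparisons

# Hypothetical P-values from the six pairwise tests.
p_values = [0.003, 0.012, 0.021, 0.040, 0.300, 0.870]

for p in p_values:
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at adjusted alpha = {adjusted_alpha:.4f}")
```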
The Newman-Keuls test is another option: it orders the group means from smallest to largest and then compares them in a stepwise fashion, using a stricter criterion for means that are further apart in the ordering. By doing this it controls the significance level while requiring the harshest correction for only the widest comparisons.
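The stepwise logic can be sketched with SciPy's studentized range distribution (a minimal illustration that computes only the critical values, not the full test; the number of groups and the error degrees of freedom are hypothetical): means that sit further apart in the ordering must clear a larger critical value.

```python
from scipy.stats import studentized_range

alpha = 0.05
df_error = 36          # hypothetical ANOVA error degrees of freedom
k = 4                  # number of ordered group means

# Newman-Keuls uses a critical value that depends on how many steps
# apart two means are in the ordering: the full range (r = k) gets the
# largest critical q, adjacent means (r = 2) the smallest.
for r in range(k, 1, -1):
    q_crit = studentized_range.ppf(1 - alpha, r, df_error)
    print(f"means {r} steps apart: critical q = {q_crit:.3f}")
```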
But are these tests precise enough that Royall's argument can be dismissed, or should we take it seriously and look for other methods of measuring the strength of evidence?
Royall, Richard M. 1997. Statistical evidence: A likelihood paradigm. New York: Chapman and Hall.
Dienes, Zoltan. 2008. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan.