## Why significance does not indicate evidence?

Most would argue that the strength of evidence is dependent on the level of significance. According to the Oxford Dictionary online, ‘Evidence is the available body of facts or information indicating whether a belief or proposition is true or valid’.

In science, significance levels are assessed using P values, which are taken to indicate the strength of evidence against the null hypothesis. As psychologists we use a threshold of P<0.05, which means that if the null hypothesis were true, results at least as extreme as ours would be expected less than 5% of the time. Other areas of science may use P<0.10 or P<0.01 when carrying out research.

For example, if I were to test drugs that could save lives I would want my P value to be smaller than if I were carrying out an experiment studying children catching a ball. Using a smaller threshold such as P<0.01 would mean that, if the null hypothesis were true, there would be less than a one percent chance of obtaining results as extreme as mine. So the smaller the P value, the greater the evidence there is said to be against the null hypothesis.
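As a rough illustration of how a P value is calculated, here is a minimal sketch for a made-up experiment (not one taken from the sources below): a child catches a ball on 60 of 100 throws, and the null hypothesis is that each catch is a 50/50 event. The function name and the numbers are my own assumptions. The P value is the probability, if the null hypothesis were true, of a result at least as far from 50 catches as the one observed.

```python
from math import comb

def two_sided_binomial_p(successes, trials):
    """Exact two-sided P value for H0: each trial succeeds with probability 0.5."""
    centre = trials / 2
    observed_gap = abs(successes - centre)
    # Count every outcome that is at least as far from the centre as the observed one.
    # Under H0 all 2**trials sequences are equally likely, so counting is enough.
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(k - centre) >= observed_gap)
    return extreme / 2 ** trials

# Hypothetical data: 60 catches out of 100 throws.
print(round(two_sided_binomial_p(60, 100), 4))
```

For these made-up numbers the P value comes out a little above 0.05, so the result would just miss the usual threshold used in psychology.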

However, Royall (1997) disagreed and argued that significance levels are NOT in general good measures of the strength of evidence. He pointed to the problem of researchers confusing the strength of evidence with the probability of obtaining that evidence, which are two completely different things.

Firstly, the probability of obtaining the evidence is what the P value expresses, as seen above, where P<0.01 meant that results like mine would occur less than one percent of the time if the null hypothesis were true. In other words, this probability depends on a set of things that could have happened but did not. What counts as part of that set is often subjective, depending on factors such as the intentions of the experimenter, which can lead to experimenter bias; this can be something as simple as the number of participants an experimenter decides to use or the removal of outliers.

To explain this further I will use an example from Dienes (2008). Let’s say that an experimenter decides to use 30 participants, but the P-value is P=0.07, so she decides to add 20 more participants. With these extra participants the P-value will change, and it may now fall below 0.05. But to calculate the significance correctly, she needs to account for the fact that she could have stopped at 30 participants and carried on only because the result was not significant. This means that her results can be significant or not depending on her decisions about the number of participants to use. In this case she runs the risk of making a Type I or Type II error.
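The effect of testing at 30 participants and then adding 20 more can be sketched with a small simulation. This is my own illustration rather than Dienes's analysis, and it rests on simplifying assumptions: the null hypothesis is true, scores are drawn from a normal distribution with a known standard deviation of 1, and a two-sided z-test is used. All function and variable names are hypothetical.

```python
import random
from statistics import NormalDist

def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test P-value for H0: population mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_sims=10000, first_n=30, extra_n=20, alpha=0.05, seed=1):
    """Return the false positive rates for (one test at n=50, test at 30 then at 50)."""
    rng = random.Random(seed)
    fixed_hits = 0      # experimenter always tests once, with all 50 participants
    optional_hits = 0   # experimenter tests at 30 and adds 20 more only if needed
    for _ in range(n_sims):
        # The null hypothesis is true: every score comes from N(0, 1).
        scores = [rng.gauss(0, 1) for _ in range(first_n + extra_n)]
        p_first = two_sided_p(scores[:first_n])
        p_all = two_sided_p(scores)
        if p_all < alpha:
            fixed_hits += 1
        # Stopping early on a significant result, or continuing to 50 otherwise,
        # counts as a "finding" if either look is significant.
        if p_first < alpha or p_all < alpha:
            optional_hits += 1
    return fixed_hits / n_sims, optional_hits / n_sims

fixed_rate, optional_rate = simulate()
print(f"one test at n=50: {fixed_rate:.3f}")
print(f"test at 30, then at 50: {optional_rate:.3f}")
```

Under these assumptions the experimenter who always tests once at 50 participants gets a false positive rate close to the nominal 5%, while the experimenter who looks twice gets a higher rate, even though both could end up with exactly the same 50 scores.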

Royall goes even further, arguing that P-values depend not only on what the experimenter did and what he observed, but also on what he would have done had the observations been different. As Royall remarked, ‘P-values are not determined just by what the experimenter did and what he observed. They depend also on what he would have done had the observations been different - not on what he says he would have done, or even on what he thinks he would have done, but on what he really would have done (which is, of course, unknowable)’.

Although Royall’s argument is sound, there are ways to prevent significance levels from rising and falling as variables are added, and so to reduce the risk of making a Type I or Type II error. For instance, we often use post hoc tests after running an ANOVA, which include a number of tests looking at significance levels.

The Bonferroni t (Dunn’s test) is used to compare the levels of a variable. It adjusts the critical value according to the number of comparisons being made, so that even when the number of comparisons increases, the overall chance of a Type I error does not grow with it.
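The arithmetic behind the Bonferroni correction can be sketched in a few lines. This is a simplified illustration, assuming independent comparisons that are each tested at the same threshold; the function names are my own, and it is not the full Dunn's test procedure.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_threshold(alpha, comparisons):
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons

alpha = 0.05
for m in (1, 3, 6, 10):
    uncorrected = familywise_error_rate(alpha, m)
    threshold = bonferroni_threshold(alpha, m)
    corrected = familywise_error_rate(threshold, m)
    print(f"{m:2d} comparisons: uncorrected {uncorrected:.3f}, "
          f"per-test threshold {threshold:.4f}, corrected {corrected:.3f}")
```

Without a correction the chance of at least one false positive climbs quickly as comparisons are added, whereas dividing the threshold by the number of comparisons holds it at or just below 0.05.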

The Newman-Keuls test is another post hoc test. It orders the group means from smallest to largest and first compares the largest mean with the smallest; only if that difference is significant does it go on to compare means that lie closer together. By doing this it reduces the number of comparisons needed, which helps to keep the Type I error rate from increasing.

But are these tests precise enough that Royall’s argument can be dismissed, or should it be taken into account, and should we find other methods for measuring the strength of evidence?

Royall, Richard M. 1997. Statistical evidence: A likelihood paradigm. New York: Chapman and Hall.

http://oxforddictionaries.com/definition/evidence

http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_newman_keuls.htm

Dienes, Zoltan. 2008. Understanding psychology as a science: An introduction to scientific and statistical inference.

### 4 Responses to Why significance does not indicate evidence?

1. psychmja1 says:

A good blog this week 🙂 You’ve presented plenty of research, but I wondered if you had considered that p-values are very subjective at times. This especially comes into play when you consider the probability of making a Type I or Type II error. A Type I error is when we report that we have found a significant result when in fact there is no effect. A Type II error, on the other hand, is when we fail to report a significant effect when actually there is one. I know you mentioned these above, but I wanted to expand on what you have said by proposing that these two types of errors can be detrimental to research.

So which is worse when it comes to reporting research findings? Using the prison example, if the probability of making a Type I error and putting an innocent individual in jail was 20%, would this be an acceptable probability level to use? Obviously not, as we would stand a 1 in 5 chance of sending an innocent person to prison. Something I wouldn’t want to take a chance on! This works the same way in psychology. It is worse to present something as true when in fact it is not.

Take the Thalidomide disaster in the early 1960s. Researchers had tested the drug on animals and proposed it to be safe for pregnant women to take to relieve morning sickness. The result was that over 10,000 children were born with deformities and other abnormalities, such as those of the eyes, ears, heart and kidneys. Unfortunately, researchers made a damaging error and concluded that the drug was safe for use as a cure for morning sickness, when in fact they should have conducted further tests. In the present day there are much stricter rules for drug testing, and we have newer techniques that mean we can test drugs using in vitro methods (in test tubes, petri dishes etc). We can relate this to a Type I error, as researchers concluded that the drug would have a positive effect on morning sickness, when in fact it would lead to serious problems for the babies. If researchers had made a Type II error in the testing of Thalidomide on animals and concluded there was no beneficial effect, then the results of the studies would not have led to such a disaster.

Although you suggest that significance does not mean evidence I would be far more likely to trust something that has a 1 in 10,000 (.01%) chance of being wrong than something with a 1 in 5 (20%) chance of being wrong. Wouldn’t you? It may not mean evidence but it certainly does give us an indication that evidence may be there.

http://medpharm.blogspot.co.uk/2008/07/thalidomide-disaster.html

Marx, U., & Sandig, V. (2007). Drug Testing In Vitro: Breakthroughs and Trends in Cell Culture Technology.

2. This is a very interesting topic, and something that I mostly have strong views on and sometimes I just sit on the fence about.
I’m not sure if it is because university was the first place that I encountered statistics, or whether it is just the fact that numbers and I do not mix well. But I can sometimes get on my soap box, shall we say, and I don’t get how a significant difference can actually show evidence for something that the researcher was measuring. I mean, how does he know it was not a confounding variable that gave him the significant value? This is something that you have mentioned, and I was quite oddly nodding in agreement (luckily I’m on my own; otherwise people would think I’m mad).
What bothers me is something that you mentioned; experimenter bias. I’m just always a bit dubious about whether the significance really represents what they claimed to have tested.
You’ve argued this topic really well, but I think I will most likely continue to be dubious about certain aspects of significance, but it’s funny I never feel this way when I’m analysing data.