In a recent exchange on Twitter, the Thesiswhisperer wondered why effects were disappearing, as described in this New York Times article. The feeling is that this should not happen in modern scientific culture, and yet I suspect modern academic scientific culture is partly to blame. To explain why, I have to introduce a little-known but rather simple statistical fact called regression towards the mean.

Let me do it by giving a common example used in teaching regression towards the mean. The lecturer tells a class that he has the ability to improve people’s psychic ability. He then writes a number between 1 and 100 on a piece of paper without showing the class, and asks the class to write down what they think the number is. He then reveals the number and asks the class to tell him how well they did.

Now the experiment starts. He takes a sample from the class of the people who got furthest from his value, say the worst half. He points out that these are obviously the less psychic, and so the group on which to show the effect. With these he performs some ritual, perhaps asking them to stand up, turn round three times and say “esn-esn-on” (“nonsense” backwards).

Then he repeats the experiment, but this time with only this worst half, and lo and behold, they perform better. That is, their average guess is closer to his value than it was in the first round.

The lecturer then admits there is no psychic ability involved in this, so what is going on? The trick is in the selection. Indeed, if he compared the standard deviation of the guesses for the whole class in the first round with the standard deviation for the sample in the second round, they should be of approximately the same size. People are choosing their numbers pretty much at random, so those who guessed badly at first did so by chance. In fact, if he had taken the full class, the performance would have been much the same as before; only the individuals who did better would have been different.

Regression to the mean basically means that if a group is selected because it is extreme on some measure, then as fresh data is produced its results will tend to go back towards the mean.
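The classroom trick is easy to try on a computer. What follows is only a minimal sketch under my own assumptions: a class of 200, every guess drawn uniformly at random from 1 to 100, and a ritual that does nothing at all. The function name and the numbers are made up for illustration.

```python
import random
import statistics


def classroom_trick(n_students=200, seed=1):
    """Simulate the lecturer's 'psychic training' with purely random guesses."""
    rng = random.Random(seed)
    target = rng.randint(1, 100)  # the lecturer's hidden number

    # Round one: every student guesses at random (an assumption of this sketch).
    first = [rng.randint(1, 100) for _ in range(n_students)]
    errors = [abs(guess - target) for guess in first]

    # Select the worst half: the students whose guesses were furthest away.
    ranked = sorted(range(n_students), key=lambda i: errors[i], reverse=True)
    worst = ranked[: n_students // 2]

    # Round two: the ritual does nothing, so the worst half guess at random again.
    second = [rng.randint(1, 100) for _ in worst]

    return {
        "worst_half_error_round1": statistics.mean(errors[i] for i in worst),
        "worst_half_error_round2": statistics.mean(abs(g - target) for g in second),
        "sd_whole_class_round1": statistics.pstdev(first),
        "sd_worst_half_round2": statistics.pstdev(second),
    }


if __name__ == "__main__":
    for name, value in classroom_trick().items():
        print(f"{name}: {value:.1f}")
```

The worst half should come out with a smaller average error in round two than in round one, purely because they were picked for being unlucky the first time, while the two standard deviations should be of similar size.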

So what has this to do with non-repeatability? For starters, I am not belittling this phenomenon. I have been involved in studies which aimed to replicate a previously conducted study. The prior study reported a huge effect, so the power calculation required only a small sample size, indeed so small that we upped the numbers just to persuade the ethics committee that this was a genuine attempt to replicate. Only when we had collected the data did we find that there was no effect visible in it. So I have experienced non-repeatability.

Nor am I accusing researchers of bad practice. They are honestly reporting the results they get. It is the ability to report the results, i.e. the selection process by journals, that produces the phenomenon!

Published results and accepted results aren’t just a random sample of all results. They are selected for the results that appear to demonstrate a genuine effect. Journals particularly like results that are significant at the p=.05 level. However, gaining a p-value of less than .05 (or any value) is no guarantee that you have a true effect. For a start, with p=.05, one in twenty of the results where there is genuinely no effect get reported as having an effect. Now that isn’t one in twenty of reported results (it might be a lot higher or lower) but one in twenty of the results where NOTHING is genuinely happening. Unfortunately we don’t know which these one in twenty are. Each looks like a result even though there isn’t an effect. We know there are type 1 errors; our selection criteria for publication let through one in twenty of the studies where there is genuinely no effect.
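To see the one in twenty for yourself, here is a rough simulation. It is a sketch under assumptions of my own choosing: 4000 imaginary studies, 20 people per group, scores that are normal with a known standard deviation of 1, and a simple two-sided z-test. The function name is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(n_studies=4000, n_per_group=20, alpha=0.05, seed=2):
    """Fraction of no-effect studies that still reach p < alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference in means, assuming each score has SD 1.
    standard_error = (2 / n_per_group) ** 0.5
    significant = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: NOTHING is happening.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / standard_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided z-test
        if p_value < alpha:
            significant += 1
    return significant / n_studies


if __name__ == "__main__":
    print(f"share of no-effect studies with p < .05: {false_positive_rate():.3f}")
```

The share should land close to .05, and nothing in any single significant study marks it out as one of the false alarms.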

But what if there is an effect? Well, we are more likely to detect it if our sample happens to over-estimate the effect than if it under-estimates it. In other words, there are studies out there where there was a genuine effect but it did not get published because they drew a sample that performed badly. On the other hand, in well-designed experiments all the studies that draw fortunate samples are likely to be significant. So the tendency is to over-estimate actual effects, because the selection criteria for publication favour those who draw fortunate samples.
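The same kind of simulation shows the over-estimation. Again this is only a sketch with assumed numbers: a true effect of 0.3 standard deviations, 20 people per group, a two-sided z-test with known standard deviation, and a pretend journal that publishes a study only if it is significant at the .05 level. The function name is hypothetical.

```python
import random
from statistics import fmean


def published_vs_true(true_effect=0.3, n_studies=4000, n_per_group=20, seed=3):
    """Compare the average estimate over all studies with the published ones."""
    rng = random.Random(seed)
    standard_error = (2 / n_per_group) ** 0.5  # scores assumed to have SD 1
    critical_z = 1.96  # two-sided 5% cut-off
    all_estimates, published = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        estimate = fmean(treated) - fmean(control)
        all_estimates.append(estimate)
        # The pretend journal only accepts significant results.
        if abs(estimate) / standard_error > critical_z:
            published.append(estimate)
    # Note: fmean raises an error if no study at all was significant.
    return fmean(all_estimates), fmean(published), len(published) / n_studies


if __name__ == "__main__":
    mean_all, mean_published, share = published_vs_true()
    print(f"true effect: 0.30, mean of all studies: {mean_all:.2f}")
    print(f"mean of published studies: {mean_published:.2f} ({share:.0%} published)")
```

Averaged over every study, the estimate sits near the true effect. Averaged over the published ones only, it should be noticeably larger, because with samples this small a study has to over-estimate the effect to be significant in the first place.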

This is not news and I have not suddenly found this inspiration; look, there are learned reviews exploring this very topic. There are approaches for when results start behaving in this way. One is to look at the sample size it would take to detect a difference between the original results and the fresh results; if that is larger than the two studies, then the discrepancy may well be just the result of regression to the mean. It is also why clinicians are moving towards meta-analysis rather than relying on the results from just one study, but meta-analysis itself is hampered by publication bias.

I also want to sound a warning: skewed data (data where a few people produce very high scores) can quite easily produce oddball results. This causes problems when sampling. There are statistical methods for analysing such data, but I have rarely seen them applied outside the statistics classroom.

So yes, I would expect the published results of effects on the whole to be over-estimates. The over-estimation is a product of the current scientific publishing culture. There are some approaches to alleviate this problem but at present no cure, because the cure involves a change of culture.