Scientific results lose their vigor
3 January, 2011 | Adie Chan
In a New Yorker article published last month and recently evaluated by F1000, Jonah Lehrer describes an effect that has plagued a variety of scientific disciplines: the more times researchers try to replicate a given result, the less robust the effect appears. The phenomenon calls into question not only the dwindling findings themselves but also the scientific methods used to obtain them, Lehrer writes, and, of course, the findings of all the experiments that have yet to be repeated.
Faculty Member Daniel Beard of the Medical College of Wisconsin argues that the problem stems from interpretation of the statistical methods used. Models are, by definition, simplifications of complex events. As long as this caveat is kept in mind, the scientific method itself is not at fault.
What do you think? Why are scientists increasingly unable to replicate their results? And what does this mean for the future of science?
–Jef Akst, Associate Editor, The Scientist
It seems to me that mostly it is an indication that very few people have a deep appreciation of statistics, or the depth, complexity and interrelatedness of the systems they are studying.
One of the first things my stats lecturer drummed into me was – correlation does not imply causation.
It is never sufficient simply to find a correlation. That is sufficient only to warrant further investigation. Sufficiency is only satisfied when a mechanism is fully delineated and understood in its full context.
Often such things are not simple, as there are many different factors involved, sometimes thousands of them, all with contributing probability functions across a range of dimensions.
I’ve been out of touch with this kind of research for many years now, but recall a dictum related by someone working in pharmaceutical discovery pharmacology –
a postulated biological effect isn't demonstrated until the experiment has been done on three completely independent occasions, though the result of each replicate need not be statistically significant when considered in isolation.
I have often been impressed by the way that significant findings seem to diminish with replications, and I was under the impression that most scientists, especially life scientists, have noticed the phenomenon. Over the years I have discussed a number of examples with some of my colleagues. Most would be of a somewhat narrow nature related to a specific field of research that would be difficult to convey briefly. In general, I do not think that it is related to a flaw or flaws in the scientific method, but it is related to the complexity of problems studied by the life sciences.
The Flynn effect can serve as a general example. This is the apparent increase in average IQ scores of three or more points per decade. It is not, however, limited to IQ tests but shows up in virtually every cognitive test. I say apparent because it is unclear whether the effect has stopped, reversed, or is continuing, and it may be doing all three in different places. Its cause is equally unclear, so I won't touch on the many theories now devoted to it. The implication of the effect, and others like it, is that they take a generation to detect and so are hard to control for even when they are recognized by the best-designed experiments. The effect is not unimportant, because cognitive tests have a bearing on evaluating many social issues, including deciding whether to invest in early childhood education, setting educational standards, apportioning international aid, etc. In one area in which I work, treating brain injury, the currency of assessment is standard cognitive testing. Subjects are evaluated over periods as long as a generation, and so assessments are prone to many unknown errors. It becomes difficult for a single generation of investigators to know whether treatments devised to help the brain injured really do work.
The same complexity is likely at work in drug treatments. It is now known that drugs change the brain to such an extent that taking even a single dose of a drug has a huge effect on the potency of a subsequent dose of the same drug, irrespective of the time interval. Many drugs also have effects that are interchangeable with those of stress and even of cognitive styles. Therefore, it does not really surprise me that pharmacological treatments for psychological disorders are hard to predict from dose to dose, decade to decade, and person to person. The problem is that most of the drug treatments devised for psychological disorders were discovered more than 50 years ago; the reasons they worked were completely unknown then and are still largely unknown now. Yet, 50 years later, we have learned enough about their neural basis that no one should be surprised that, in the face of changing treatment populations, improvements in experimental methods, and greater understanding of drug effects, treatments once thought to work may no longer appear to do so.
Now it is possible that a good course in statistics and experimental design can help. It might be particularly useful in getting students to understand that the assumptions underlying statistics cannot control for all of the complexities of scientific problems. But students should also be taught not to dismiss these strange effects; there may be a scientific reason for them.
I will use the Crabbe results as an example, because I ran a mouse testing company at the time and had to convince my clients that we behaviorists were not untrustworthy. Remember, in this experiment, mice given cocaine in Edmonton proved unusually active. At about the same time, a colleague in Michigan, Terry Robinson, discovered that giving a single dose of cocaine to a mouse resulted in a mouse that was hyperactive when next given exactly the same dose. He also discovered that you can substitute a stressful event, such as being briefly wrapped in a towel, for that first dose of cocaine. And again at the same time, my colleague Bryan Kolb discovered that batches of rats ordered from Quebec breeding farms produced strange behavioral results, an effect he was later able to attribute to the plane ride. This is what most likely happened to the Edmonton mice: they had had a similar unfortunate ride and so were hyperresponsive to a dose of cocaine that only moderately affected the mice tested in the two other laboratories. So it was not a matter of bad stats or poor experimental methods; it is simply the way things are in the life sciences. There are often unanticipated complexities.
I can’t blame Jonah Lehrer for coming up with a sexier conclusion (there is something wrong with science) than I would have liked. Newspapers, like scientific journals, have their own strange rules. But we can thank him for tying some of these interesting findings together for an interesting read. For me, a big part of the brain sciences is discovering these strange events, which I think are really interesting, and then finding out why they happen.
An example of non-replication is in some antipsychotic drug trials. Although antipsychotics are the largest drug market ($15 billion/year, exceeding that for statins), and although antipsychotic drugs have, over 50 years, been repeatedly found effective in alleviating psychosis in 80-90% of patients with schizophrenia, there has been at least one very large clinical trial finding that olanzapine (a market of $3 billion per year) had little effect. Possible reasons are that the clinical criteria for diagnosing patients have changed, and that placebo patients may be getting better.
Because many of these experiments are looking for small effects in a sea of randomness, or of confounding parameters, the original experiment may have found a false positive by statistical chance, or through a small, unrecognized bias. Since it is a near certainty that subsequent replications cannot exactly duplicate the original experiment (especially the experimenters’ prior knowledge of the previous results), they will, overtly or unwittingly, control for an increasing number of unwanted biasing variables, thus producing a move toward the true result, i.e., no effect.
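A minimal simulation sketch of this point, with hypothetical numbers not taken from the comment above: when a small true effect is studied with low power, the subset of studies that happen to cross p < 0.05 systematically overestimates it, so unbiased replications drift back toward the true value.

```python
# Hypothetical illustration (assumed numbers): a small true effect of 0.2 SD studied
# with 20 subjects per group. Conditioning on p < 0.05 inflates the apparent effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n, sims = 0.2, 20, 20_000

effects, pvals = [], []
for _ in range(sims):
    treated = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(treated, control)        # two-sample t-test
    effects.append(treated.mean() - control.mean())
    pvals.append(p)

effects, pvals = np.array(effects), np.array(pvals)
significant = (pvals < 0.05) & (effects > 0)

print(f"true effect:                     {true_effect:.2f}")
print(f"mean estimate, all studies:      {effects.mean():.2f}")
print(f"mean estimate, significant only: {effects[significant].mean():.2f}")  # inflated
```

A replication that simply reruns the experiment, without the p < 0.05 filter, will on average recover something close to the true 0.2, which then reads as a 'decline'.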
I’m always skeptical of biological science that relies solely on statistics to “prove” an effect. As I tell my students: “just because it’s statistically significant doesn’t necessarily mean that it’s biologically significant.”
Several biases are inherent in scientific studies. Most obvious is the bias against negative results. Consider a study in which an effect is only considered significant, and therefore publishable, when p < 0.05. If many groups test an effect that is actually absent, the negative results (p > 0.05) are not publishable, whereas the one result that reaches significance by chance is now significant and publishable. As additional studies are performed the results appear not to be replicable, but this is an artifact of the original bias.
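A short simulation sketch of the publication filter just described; the particular numbers (20 labs, 30 subjects per group) are assumptions for illustration only.

```python
# Twenty labs test an effect that is in fact absent; only a chance p < 0.05 result
# is "publishable", and later replications of that one finding will mostly fail.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_labs, n = 20, 30

pvals = []
for _ in range(n_labs):
    group_a = rng.normal(0.0, 1.0, n)   # true effect is zero
    group_b = rng.normal(0.0, 1.0, n)
    pvals.append(stats.ttest_ind(group_a, group_b).pvalue)

publishable = [p for p in pvals if p < 0.05]
print(f"labs with a 'publishable' result: {len(publishable)} of {n_labs}")
# On average about one lab in twenty clears the threshold purely by chance;
# the literature then contains only that result, which others cannot replicate.
```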
It seems to me a stretch to overgeneralize from medical trials and psychology experiments to the sciences (including the physical sciences) in general. In the former types of studies there will always be a large number of unavoidably uncontrolled variables, not the least of which is the enormous variation among individuals. In the latter case (e.g., the physical sciences), many investigations are performed on well-known and homogeneous systems for which it is relatively simple to control most experimental variables and establish robust controls. Such experiments are quite easy to reproduce, and they frequently are reproduced. The more unavoidably uncontrolled variables, the more variation (and irreproducibility) one might expect. I don’t find the revelation of irreproducibility of certain scientific studies particularly surprising, nor do I think it is a significant threat to the scientific method, which is, after all, self-correcting over time.
I think lack of replication is due to lack of rigor in experimental execution. My own training in biochemistry meant understanding that simple things, such as the shape of a shake flask and the volume in it, will affect the proteins produced by the growing bacteria. Now, with many kits available, students are not taught the methodology or the importance of why experiments are required to be performed in a specific manner every time.
I agree with the philosophy of three repetitions of the laboratory procedure.
I developed analytical methodology for FDA and USDA, and that was my practice.
Scientists in other countries consistently commented that my procedure worked very well for them. That gave me great satisfaction.
Some of the problem might be associated with the desire of new scientists to establish themselves in highly competitive fields. In several recent retraction cases, PIs could not replicate the work of their postdocs. Research results might have been overstated in an attempt to publish in a more prestigious journal, and attempts to replicate overstated results cannot help but fail.
I agree with Bob Hurst that publication bias towards the statistical anomaly is certainly a problem.
There also is the problem of sloppy science. I personally avoid anything presented as a ‘representative’ study. There rarely are sound reasons for not presenting the data and statistical analysis from all of the independent replicates conducted as part of a study. Moreover, some ‘descriptive’ science still relies on assessments that are not readily quantifiable and are prone to misinterpretation by a researcher with his or her own biases. I recommend reserving judgment on such claims until they have been amply repeated by the original authors or by others.
The problem remains one of cleaning up the statistical and interpretive errors. This is compounded by a secondary publication bias against findings that do not repeat the original findings of others. What if it became common practice for journals to allow the authors and others to post data online as an addendum to the original findings? These should not be thought of as errata, since supporting evidence must also be posted. We’re talking about an entirely new category of publication, the ‘Update’. Updates also should include links to figures published in other journals that specifically repeat the original data.
Hi
I posted the following on a listserv when this came up a few weeks ago:
1. The decline effect is nothing to worry about, as it should disappear with replication!
2. There are some truly egregious examples included here … really, Rhine and ESP illustrate the decline effect?
3. What proportion of scientific phenomena do the examples represent? That is, how ubiquitous is this effect?
4. There are innumerable areas of science where replication did in fact converge on a correct value for some physical quantity. Hence, is it not rather ridiculous to ask “Is there something wrong with the scientific method?” on the basis of a few phenomena (an unknown proportion of all studies) showing decline, versus the almost unlimited array of successful science?
5. The closing provides solace to those who want to ignore science and believe whatever they want to believe. The ideologues will be thrilled:
“The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.”
6. Of course, I’m only a regular guy, not like the author Lehrer, whose internet blurbs refer to his “profound understanding of the human mind.”
I would only add to this: what does it say about the reporting of science when “science” publications are citing The New Yorker? Surely one would hope, at the very least, for more evidence of a problem than a claim by some popular-media writer that scientists are increasingly unable to replicate their effects? Indeed, is it even a question that lends itself to being answered about science as a whole?
Take care
Jim
If you think about it, accepting a significance threshold of p = 0.05 means that, when there is no real effect, on average 1 in 20 tests will come out positive by chance, and such false positives will not replicate (a rough sketch of this baseline arithmetic follows the links below). So the really interesting question is why, and in which areas of research, the false positive rate may be higher than that. My colleague Hanno Würbel and I have published several papers on this ‘false positive phenomenon’, potential reasons why it occurs, and effective solutions to guard against it, e.g.:
http://dx.doi.org/10.1038/NMETH.1312
http://dx.doi.org/10.1038/nmeth0310-167
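To make the '1 in 20' arithmetic above concrete, here is a tiny, purely illustrative calculation (it is not the analysis from the linked papers): the chance of at least one false positive when several independent true-null hypotheses are each tested at alpha = 0.05.

```python
# Chance of at least one false positive among k independent tests of true nulls,
# each run at alpha = 0.05 (illustrative arithmetic only).
alpha = 0.05
for k in (1, 5, 10, 14, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {p_any:.2f}")
```

With about 14 such tests, the odds of at least one spurious 'finding' already exceed 50%, before any of the other biases discussed in this thread come into play.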
I agree that the prevailing bias against negative results is probably to blame in most cases. Negative results are just not popular, although they should be reported just as frequently as positive results. Can you imagine publication of an article entitled “No correlation found between age of Helix snails and mortality due to predation”? The reviewers would nab that one and direct the authors to study the subject until they found something “worth reporting.” I remember trying to publish a study that clearly demonstrated a lack of persistence of certain kinds of memory in spiders. That one didn’t even get to the reviewers; the editors screened it out as “not of interest.” Many researchers selectively report the trials that “work” and drop those that do not. Then there is the faking of data. Some of my colleagues are convinced that, given the high-stakes pressures of academic and industrial research today, faked or altered data has become a widespread factor in the “scientific” community.
I have had the experience of obtaining a particular result in a chromatography separation and replicating the same result many, many times over many years, but found that other labs usually had difficulty in reproducing the separation (and sometimes, albeit rarely, we had problems even in our own lab). I never quite understood this, but I suspect that subtle differences in technique must be responsible. I wonder if this sort of thing doesn’t happen more frequently than one might expect.
I am not trying to make excuses or to say that there is nothing to this article, but in my experience many experiments are ‘difficult’ to replicate because there are so many variables that could affect the outcome, and these vary from lab to lab and even from cell line to cell line. Some individual researchers may lack the experience to carry out a technique properly, or a cell line that is thought to be the same is not, or has genetically drifted over time. Sometimes there is a key element of an experiment that has simply been overlooked. I would offer that the article itself is not objective and is meant to be sensationalistic. It seems popular to justify ignoring science because it is inherently flawed somehow, or because those evil scientists have some personal agenda. But if it helps people sleep at night, I guess it’s good for something.
I think that there are a multitude of reasons for the lack of reproducibility between labs; many have already been well stated by others in their comments. I also agree with Fred Schaufele that publishing web updates (since we now can) would constitute a good approach to helping solve this issue. While scientific fraud and skewing data to survive in highly competitive environments are of grave concern to all of us, I offer the following observations from my 20+ years in the biological sciences:
I will say this: I am classically trained, and know that the devil is in the details. As Pamela David Gerecht also said, to paraphrase, technical details matter, most of all in the biological sciences, where there are inherent daily variables to contend with already. If you take most students, even postdocs, aside in many labs, they *don’t* understand the significance of many of the things they do, or why they are done. This is partly due to the use of kits (meant to take the chore out of doing research and to speed things up). What this has led to is people doing things without understanding where “slack” (for want of a better term) can be tolerated and where it will kill an experiment or cause it to turn out poorly or not at all. (Not that I’m ever advocating sloppiness!)
Another thing that I learned from my postdoc advisor: he personally trained each new person in the most critical techniques. He understood perfectly that differences can be innocently introduced through lack of rigor, and he did not want any bad habits that had crept in to be passed along. When PIs under pressure are holed up in their offices cranking out proposals like queen bees laying eggs, they don’t or can’t consider this investment of time, although it would repay them in spades.
Unfortunately, I’ve walked into other PIs’ labs and found cell culture practices that would have made my advisor blow his lid; also, routine use of antibiotics can mask sloppy cell culture technique, and using cells in poor states of health (or with undetected mycoplasma contamination) is a recipe for unexpected results, problems or no data at all.
I worked with mice for my dissertation, and my advisor routinely had mice kept for a week (YES, a week) in the animal quarters so they could settle down and let their stress hormones return to normal. Otherwise, we’d get unexpected deaths and sky-high values for the parameters we measured in our research protocols. Can you imagine many cash-strapped PIs doing that, given the expense of housing animals? Or maybe no one ever told them that it was necessary for obtaining valid results? The other variable is that lab chows have changed over time; the composition of lab chow today is not what it was in the 1980s, for example.
Another phenomenon that I didn’t see mentioned is epigenetics: our environment “tweaks” our gene expression, and those changes can be heritable. So findings reported as true 20 or 30 years ago may be slightly or altogether different today.
As for statistics, yes, they are valuable; but remember that statistical significance may not be biologically significant, and vice versa. Hear, hear, Larry Kane! Students need to hear that more often. A famous example: during an action potential, the bulk intracellular sodium concentration barely changes, not enough to register statistically, yet that minute influx is exactly what generates the spike.
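A back-of-the-envelope version of that sodium example, using typical textbook values that are assumptions here rather than anything stated in the comment: roughly 1 µF/cm² of membrane capacitance, a 100 mV depolarization, a spherical cell of 10 µm radius, and about 12 mM intracellular Na+.

```python
# Rough estimate of how little sodium needs to enter to produce a ~100 mV spike,
# compared with the sodium already inside the cell (all values are typical assumptions).
import math

C_m   = 1e-6        # F/cm^2, specific membrane capacitance
dV    = 0.1         # V, size of the depolarization
e     = 1.602e-19   # C, elementary charge
N_A   = 6.022e23    # 1/mol, Avogadro's number
r_cm  = 10e-4       # cm, cell radius (10 micrometers)
na_mM = 12.0        # mM, intracellular Na+ concentration

area   = 4 * math.pi * r_cm**2                  # cm^2
volume = (4 / 3) * math.pi * r_cm**3 / 1000.0   # cm^3 converted to liters

ions_moved  = C_m * dV / e * area               # Na+ ions needed to charge the membrane
ions_inside = na_mM * 1e-3 * volume * N_A       # Na+ ions already in the cell

print(f"ions entering per spike: {ions_moved:.2e}")
print(f"ions already inside:     {ions_inside:.2e}")
print(f"fractional change:       {ions_moved / ions_inside:.1e}")  # ~3e-4, i.e. ~0.03%
```

Under these assumptions only a few million ions enter, a change of roughly 0.03% against the tens of billions already inside: far too small to detect as a concentration change, yet entirely sufficient to charge the membrane.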
With all the criticisms of science, I still feel that a good majority of scientists strive to run as clean a shop as they can; their career is their life. However, I would like to see this topic brought front and center when training students and postdocs. As scientists, I feel we owe it to each other, and to the public, to be involved in the conversation about scientific studies. Otherwise, a few well-intentioned but badly conceptualized and poorly understood scientific facts can be splashed onto blogs and into print media by journalists, undermining the public’s trust in science and in the people who work in the field.
All the biases mentioned in the article should be obvious to any real scientist. The primary problem is using psychology as a case study, with the researchers foolish enough to be surprised by this phenomenon. The field had nothing to do with science, even before the acceptance of the ESP paper.
http://bioblog.biotunes.org/bioblog/2011/01/04/the-scientific-method-still-works-if-we-give-it-a-chance/