Dark Data

Thomas Goetz at Wired on "Freeing the Dark Data of Failed Scientific Experiments":

"In 1981, the New England Journal of Medicine published a Harvard study that showed an unexpected link between drinking coffee and pancreatic cancer. As it happened, researchers were anticipating a connection between alcohol or tobacco and cancer. But according to the survey of several hundred patients, booze and cigarettes didn't seem to increase your risk. Then came a surprise: An incidental survey question suggested that coffee did increase the chances of pancreatic cancer. So that's what got published.

Those positive results, alas, were entirely anomalous; 20 years of follow-up research showed the coffee-cancer connection to be bunk. Nonetheless, it's a textbook example of so-called publication bias, where science gets skewed because only positive correlations see the light of day. After all, the surprising findings are what makes the news (and careers).

So what happens to all the research that doesn't yield a dramatic outcome — or, worse, the opposite of what researchers had hoped? It ends up stuffed in some lab drawer. The result is a vast body of squandered knowledge that represents a waste of resources and a drag on scientific progress. This information — call it dark data — must be set free...

Freeing up dark data could represent one of the biggest boons to research in decades, fueling advances in genetics, neuroscience, and biotech.

So why doesn't it happen? In part, it's a logistics problem: Advocating the release of dark data is one thing, but it's quite another to actually collect it, juggling different formats and standards. And, of course, there's the issue of storage. These days, an astronomical study of quasars or an ambitious bioinformatics project can generate several terabytes of data. Few have the capacity to store that, let alone analyze it...

Technology is actually the simple part. The tougher problem lies in the culture of science. More and more, research is funded by commercial entities, which deem any results proprietary. And even among fair-minded academics, the pressures of time, tender, and tenure can make openness an afterthought. If their research is successful, many academics guard their data like Gollum, wringing all the publication opportunities they can out of it over years. If the research doesn't pan out, there's a strong incentive to move on, ASAP, and a disincentive to linger in eddies that may not advance one's job prospects...

Getting science comfortable with exposing its dark data is really just the beginning. Once you start looking for it, dark data is everywhere: It's locked away in out-of-print books and orphaned art, the stuff that Creative Commons and Google Book Search have been bringing to light. Speaking of which: Hey, Google! Know all those research projects your employees do that the company will never green-light? How about letting the rest of the world take a crack at them?"

