I’m going to be talking a lot about data sharing on my blog, so it’s worth investigating why I believe sharing data is beneficial. There are plenty of reasons for and against sharing–I will highlight some of them in this post–but I believe that the overall balance comes out in favor of sharing.
Reasons for sharing
Many of the reasons for sharing are driven by the desire to make science more transparent. Data sharing helps ensure that we are conducting research properly and that our analyses are reproducible. Freely available datasets (and code!) allow others to test data for anomalies and analyses for validity (example). It is frightening to let others delve into our data to look for errors, but this ultimately makes our science better.
Another reason for data sharing is the ability to conduct novel analyses on datasets. For example, meta-research brings together a variety of datasets, looking for connections that can’t be found in one dataset alone (example). It’s also possible that your data is useful in ways you’ve never dreamed of. Freely available data lets researchers create interesting mash-ups that can lead to new science.
Lest you think that data sharing is only good for others, there is also evidence that data sharing increases article citation counts. Data sharing also gives researchers a way to get credit for traditionally unpublishable results. Just because your dataset isn’t interesting enough to publish doesn’t mean it doesn’t have value.
Finally, data sharing benefits society as a whole. Data sharing represents a return on the public’s investment in federally supported research. It also lowers the barrier to entry into research for non-scientists. In addition, data sharing without cost or barrier to access can help spread scientific ideas faster.
Reasons against sharing
One of the biggest concerns about data sharing is being scooped. The fear is that if I share my data, someone will use it to publish my study before I do. There are two points to make here. First, data sharing should not be expected before the data’s corresponding paper is published, which prevents others from publishing your data before you do. Second, when someone uses your research outputs (ideas or data) without proper attribution, that person is committing a transgression. I don’t think that we should avoid doing something beneficial just because some people will never follow the rules.
Another major concern about data sharing is that it hurts researchers who invest significant time and effort into their datasets. A large dataset that takes years to acquire and may be used for several papers is not easy to share freely. I don’t have a good answer here, other than that it is worth having a discussion about embargoing data for a short period of time.
Finally, many datasets have issues that prevent sharing, such as human subject information and health information. This is a valid concern and such data should not be shared as is. It is possible to deidentify some datasets, but I recognize that there are other datasets that just can’t be shared.
Why I believe in data sharing
I think it’s important to remember in the context of data sharing that early scientists were secretive and did not publish their results in journal articles. This changed in 1665 with the arrival of the first scientific journal. Since then, the journal article has become the currency upon which scientific exchange is based–but the journal article is only the norm because we as scientists have made it so. If we find value in shared data, we researchers can change our norms to make data another research currency.
So why would we place research data at the same level as the journal article? We should because sharing data, with some limitations for privacy, benefits the greatest number of people. Scientists benefit through reproducibility, novel analyses, and more citations, while non-scientists benefit through better access to science and the ability to access the results that our tax dollars have paid for. Technology has enabled us to share data with unprecedented ease and, by doing so, we can dramatically further the cause of science.
This post represents the highlights of why I think data sharing is beneficial. I understand that not everyone agrees with my view, and for that reason we should move toward more data sharing in a smart and measured way. There is definite momentum in the direction of sharing original research data, and I’m looking forward to having more discussions about it on this blog.