Why Should I Share My Data?

I’m going to be talking a lot about data sharing on my blog, so it’s worth investigating why I believe sharing data is beneficial. There are plenty of reasons for and against sharing–I will highlight some of them in this post–but I believe that the overall balance comes out in favor of sharing.

Reasons for sharing

Many of the reasons for sharing are driven by the desire to make science more transparent. Data sharing helps ensure that we are conducting research properly and that our analyses are reproducible. Freely available datasets (and code!) allow others to test data for anomalies and analyses for validity. It is frightening to let others delve into our data to look for errors, but this ultimately makes our science better.

Another reason for data sharing is the ability to conduct novel analyses on datasets. For example, meta-research brings together a variety of datasets, looking for connections that can’t be found in one dataset alone. It’s also possible that your data is useful in ways you’ve never dreamed. Freely available data lets researchers create interesting mash-ups that can lead to new science.

Lest you think that data sharing is only good for others, there is also evidence that data sharing increases article citation counts. Data sharing also gives researchers a way to get credit for traditionally unpublishable results. Just because your dataset isn’t interesting enough to publish doesn’t mean it doesn’t have value.

Finally, data sharing benefits society as a whole. Data sharing represents a return on the public’s investment in federally supported research. It also lowers the barrier to entry into research for non-scientists. And data sharing without cost or barrier to access can help spread scientific ideas faster.

Reasons against sharing

One of the biggest concerns about data sharing is being scooped. The fear is that if I share my data, someone will use it to publish my study before me. There are two points to make here. First, data sharing should not be expected before the data’s corresponding paper is published, which prevents others from publishing your data before you. Second, when someone uses your research outputs (ideas or data) without proper attribution, that person is committing a transgression. I don’t think that we should avoid doing something beneficial just because some people will never follow the rules.

Another major concern about data sharing is that it hurts researchers who invest significant time and effort into their datasets. A large dataset that takes years to acquire and may be used for several papers is not easy to freely share. I don’t have a good answer here other than it is worth having a discussion on embargoing data for a short period of time.

Finally, many datasets have issues that prevent sharing, such as human subject information and health information. This is a valid concern and such data should not be shared as is. It is possible to deidentify some datasets, but I recognize that there are other datasets that just can’t be shared.

Why I believe in data sharing 

I think it’s important to remember in the context of data sharing that early scientists were secretive and did not publish their results in journal articles. This changed in 1665 with the arrival of the first scientific journal. Since then, the journal article has become the currency upon which scientific exchange is based–but the journal article is only the norm because we as scientists have made it so. If we find value in shared data, we researchers can change our norms to make data another research currency.

So why would we place research data at the same level as the journal article? We should because sharing data, with some limitations such as for privacy, benefits the greatest number of people. Scientists benefit through reproducibility, novel analyses, and more citations while non-scientists benefit through better access to science and the ability to access the results that our tax dollars have paid for. Technology has enabled us to share data with unprecedented ease and, by doing so, we can dramatically further the cause of science.

This post represents the highlights of why I think data sharing is beneficial. I understand that not everyone agrees with my view and for that reason we should move toward more data sharing in a smart and measured way. There is definite momentum in the direction of sharing original research data and I’m looking forward to having more discussions about it on this blog.

Posted in openData

The Problem with Paper Notebooks

The laboratory notebook is one of the most important tools for data management in the laboratory and, in its paper format, it’s also one of the most problematic.

The paper laboratory notebook has historically been the place to record all of the information about an experiment: experimental data, experimental observations, the researcher’s thoughts, etc. Much of this is still true today, with the exception that most of our research data is digital and doesn’t fit nicely into a paper format. Instead, we print out tables and graphs and tape them into our notebooks as a bad approximation of a complete laboratory record.

Because of digital research, we’ve fundamentally divided our data from the document that records the context of that data. This is a problem because data without context are useless, as is context without the data. So we do our best to partner the two disparate systems of paper and electrons in order to have a usable laboratory record. This is frustrating, difficult to do well, and has a major impact on the way we manage our data.

The best solution to the paper-digital divide is to fundamentally change the way we record information in the laboratory by using electronic lab notebooks. Having both the data and their context be digital and stored together will dramatically improve organization and searching. Additionally, e-lab notebook software is finally becoming viable and such systems are slowly being integrated into laboratories around the world. The benefits (and drawbacks) of e-lab notebooks require their own separate post, which I promise to write soon.

In the absence of an e-lab notebook, here are several suggestions for bridging the paper-digital divide:

  • Use one organization scheme. If your notebook is organized chronologically then your digital files should be organized in the same way. This will make it easier to find things.
  • Organize your digital files according to the notebook they belong to. This may involve keeping a separate folder for each notebook you use.
  • Record the computer on which the digital files are stored (but remember that files may move).
  • Utilize indexes. You should definitely have one for your paper notebook and it would be useful to have an electronic version.
  • Keep your data with your lab notebook by writing all of the relevant digital files to disk and tucking that disk inside the cover of the paper notebook.
  • Digitally back up your paper notebooks by scanning them and storing the notes in the same folder as the data.
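If you want a starting point for the electronic index suggested above, here is a minimal Python sketch. It assumes you keep one folder of digital files per paper notebook, and it simply lists every file in that folder with its last-modified date and size. The function names `build_index` and `write_index` and the column names are my own illustration, not part of any standard tool.

```python
import csv
from datetime import datetime
from pathlib import Path


def build_index(notebook_folder):
    """Return one row per file: (relative path, modified date, size in bytes)."""
    root = Path(notebook_folder)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            info = path.stat()
            # Store the date as YYYY-MM-DD so it lines up with dated notebook pages.
            modified = datetime.fromtimestamp(info.st_mtime).date().isoformat()
            rows.append((path.relative_to(root).as_posix(), modified, info.st_size))
    return rows


def write_index(rows, index_file):
    """Write the rows to a CSV file that opens in any spreadsheet program."""
    with open(index_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "modified", "bytes"])
        writer.writerows(rows)
```

A call such as `write_index(build_index("notebook_01"), "notebook_01_index.csv")` (both names hypothetical) would produce the index. If you save the index inside the notebook folder itself, it will show up as a row the next time you rebuild it.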

There are several other issues with paper lab notebooks (legibility, fragility, difficulty of searching, effort to back up), but the paper-digital divide is one of the biggest obstacles to good data management in the laboratory. This problem is solvable by transitioning entirely to digital, but we need to be sure to do this in a smart way that ensures access to our data for years to come. In the meantime, small changes can make a big difference.

Posted in documentation, labNotebooks

Who Owns My Data?

A recent post on the Retraction Watch blog–concerning a grad student retracting a sole-author paper when her advisor claimed ownership of her data–highlights the complicated nature of data ownership. This area is so complex that the only answer to the question of “who owns my data?” that I can provide is: it depends.

There are a lot of parties with an interest in your research data, including but not limited to: your funder, your institution, your boss/advisor, your collaborators, and you. Each of these entities may have a policy (explicit or not) on who has ownership of and who gets access to the data. My alma mater, for example, has a policy that establishes the university as the owner of the data but the PI as the steward of the data, meaning that PIs get to make almost all of the decisions about data generated in their labs. Funders, on the other hand, may not exert ownership over your data but may instead require you to share it.

My best advice in this area is to assume that you don’t have ownership of your research data–especially if you are a grad student–until you look into your local policies. Even if you do all of the research, the entities who provide the equipment, laboratory space, and money may still lay claim to the data.

When it comes to data ownership, it’s much better to be conservative than to unintentionally burn bridges, end up with a retraction, or be just another bit of news in the scientific blogosphere.

Posted in ownership

Big Changes in Data Ahead

The White House made an announcement on Friday that will significantly impact the way that you disseminate your research. The memorandum (pdf), covering all federal granting agencies with over $100 million in annual R&D expenditures (a list of agencies can be found in this Scholarly Kitchen blog post), directs granting agencies to make the products of their grants–both publications and data–accessible to the general public. Here’s an overview of the announcement and what it means for you, the researcher.

Changes in Publications

The publication portion of this directive is akin to what the NIH currently mandates. Articles will be published through normal channels but made freely available to the public after a 12-month embargo. The NIH houses these publications in PubMed Central, though I expect that other granting agencies will make use of different repositories. What this means is that in the short term you will publish articles as normal but in the long term more people will be able to read your articles.

Changes in Data

The publications announcement alone is really big news but, as this is a blog about data, I’m more interested in the data portion of the memo. Here’s where things get a little more complex because there isn’t nearly as much infrastructure to support data sharing.

The memo does not give an outright mandate for data sharing; instead, it seeks to “maximize access” to digital research data. I definitely view this as a step in the direction of full sharing; it gives time for researchers to adjust to the new model and for the rest of us to build infrastructure and work out the kinks.

It should be noted that even this ‘maximized access’ has its limitations. Among other things, researchers will not be forced to share classified data, data with privacy and confidentiality concerns, lab notebooks, preliminary analysis, and physical objects (see memo section 4 for the full list). Researchers can also exert intellectual property rights over data to prevent sharing and I’m looking forward to seeing how this manifests as policy.

Beyond encouraging data sharing, the memo’s directives model the current NSF mandate for data management plans. All researchers on federal grants will be required to create data management plans and agencies should monetarily support these plans, as appropriate.

How the memo affects you as a researcher is less clear for data than for publications. It is clear that you will be writing more data management plans and more resources will likely be put into supporting and evaluating these plans; I expect research institutions will be stepping up in this area as well as funding agencies. Researchers will also start feeling more pressure to share their digital data, but avenues for sharing data will increase and sharing (and citing) datasets will become easier over time. Any further requirements on researchers will become defined once the agencies develop their data policies.

Time Scale

Granting agencies will have 6 months from the time of publication of the White House memo to develop new policies that comply with the memo. These agencies will also need to do this work within their existing budgets. For these reasons, the changes outlined in the memorandum will not happen right away but you should be aware that they are coming soon.

My Final Thoughts

The White House memo is a measured step toward opening up research because it makes use of existing NSF and NIH policies that have been proven feasible. For this reason, I expect the publication and data management plan portions of the memo to proceed somewhat smoothly. As for data sharing, we are at the beginning of a big change in what we do with digital research data and it will take some time to settle into a new system of sharing. I don’t think that this memo is the last word in data sharing and any further changes are likely to be as measured and deliberate as this memo.

In the end, I am personally very excited about the White House announcement and will spend a future blog post discussing why opening up research is a good thing for science.

Posted in fundingAgencies, government, openAccess, openData

Starting Small: File Naming Conventions

Managing research data well can feel like an overwhelming task, so it’s important to start small. It’s much easier to make several small changes over time than to change the whole system at once.

One of the easier and more helpful changes you can make in the laboratory is to utilize consistent file naming. A file naming convention adds standardization to your files, making them easier to organize and locate. It will also help your coworkers sort through your files should you fall ill or leave your job; your naming scheme should be documented in your laboratory notebook (preferably at the front or back for easy access) for this reason.

There are a lot of conventions available for you to choose from, though you will probably want to customize one for your own purposes. Here are a few general tips for naming files:

The Basics

  • Files should be named consistently
  • File names should be descriptive but short (<25 characters)
  • Use underscores instead of spaces
  • Avoid these characters: “ / \ : * ? ‘ < > [ ] & $
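If you would rather have the computer check these basics for you, the rules above can be sketched in a few lines of Python. This is only an illustration under a couple of assumptions of mine: the 25-character limit is applied to the name without its extension, and both straight and curly versions of the quote characters are treated as characters to avoid. The function name `check_file_name` is hypothetical.

```python
import os

# Characters the list above says to avoid. Straight and curly quotes are
# both included, which is an assumption about what the list intends.
FORBIDDEN = set('"\'“”‘’/\\:*?<>[]&$')
MAX_LENGTH = 25  # "short" in the list above means fewer than 25 characters


def check_file_name(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    stem, _extension = os.path.splitext(name)  # assumption: extension not counted
    if len(stem) >= MAX_LENGTH:
        problems.append(f"name has {len(stem)} characters; keep it under {MAX_LENGTH}")
    if " " in name:
        problems.append("contains spaces; use underscores instead")
    bad = sorted(FORBIDDEN.intersection(name))
    if bad:
        problems.append("contains characters to avoid: " + " ".join(bad))
    return problems
```

For a made-up name like `2013-02-25_NMR_sample1.csv` the function returns an empty list, while `my data: final?.xls` is flagged for both its spaces and its punctuation.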

Dates

  • Date your files using the convention YYYY-MM-DD

I think that dating your files is one of the best ways to help organize your data, particularly because paper lab notebooks are organized by date. So if you have only the file, you can look at the date in its name and immediately know where to search for the corresponding notes in your notebook. (You can also reinforce this file-notebook connection by organizing your computer’s folders with reference to each notebook.)
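As a small illustration of the date convention, here is a Python sketch that builds a dated file name. The helper `dated_name` and the layout "date, underscore, description" are my own assumptions; the only part taken from the convention above is the YYYY-MM-DD format.

```python
from datetime import date


def dated_name(description, extension, day):
    """Build a name like 2013-02-25_NMR_sample1.csv from its parts."""
    # date.isoformat() gives YYYY-MM-DD; spaces become underscores.
    clean = description.strip().replace(" ", "_")
    return f"{day.isoformat()}_{clean}.{extension.lstrip('.')}"


# Example with made-up values:
example = dated_name("NMR sample1", "csv", date(2013, 2, 25))
```

One practical reason to put the year first: names that start with YYYY-MM-DD sort chronologically when your file browser sorts them alphabetically, which is not true of formats like MM-DD-YYYY.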

Versions

  • For analyzed data, use version numbers
  • Save files often to a new version
  • Label the final version FINAL

Versioning can be immensely helpful when you are manipulating data. If you make a change to your data that you don’t want to keep, it’s simple to go back to an earlier version of the file. The same is true if a file gets corrupted or if you simply want to change your analysis method. The key to making versioning work is being consistent with version names, periodically saving to new versions, and documenting the differences between versions.
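To show what consistent version names might look like in practice, here is a short Python sketch that works out the next version of a file name. The `_v01` style suffix is an assumption of mine for illustration (the tips above only say to use version numbers), and `next_version` is a hypothetical helper. It only computes a name; copying the file and noting what changed are still up to you.

```python
import re

# Assumed convention: a suffix like _v01 right before the extension.
VERSION_PATTERN = re.compile(r"_v(\d+)(\.[^.]+)?$")


def next_version(name):
    """Return the file name for the next version, e.g. _v01 becomes _v02."""
    match = VERSION_PATTERN.search(name)
    if match is None:
        # No version yet: start at _v01, placed before the extension if any.
        stem, dot, extension = name.rpartition(".")
        if not dot:
            return f"{name}_v01"
        return f"{stem}_v01.{extension}"
    number = match.group(1)
    # Keep the same zero padding so that names keep sorting in order.
    bumped = str(int(number) + 1).zfill(len(number))
    extension = match.group(2) or ""
    return f"{name[:match.start()]}_v{bumped}{extension}"
```

Zero-padded numbers (`_v01`, `_v02`, and so on) are used here so that version 10 sorts after version 9 in a file listing.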

These are some general thoughts on file naming. Feel free to leave a comment about a system that has worked well for you!

 

Sources:

http://researchdata.wisc.edu/manage-your-data/file-naming-and-versioning/

Posted in digitalFiles

The Blog I Wish I Had

When I was a practicing chemist, I wore many hats. One day I was a programmer, the next a plumber, and only sometimes was I actually a chemist. While I enjoyed the day-to-day variety, I found it hard to adapt to roles in which I had very little training. One such area was dealing with my research data.

As an undergraduate, I learned a little bit about keeping a laboratory notebook, but this wasn’t enough to prepare me for managing and organizing my data once I got to graduate school. Left on my own, I cobbled together a system that wasn’t very good and was absolutely no help to anyone else. I knew my system could be better but I didn’t have the knowledge to actually make it better.

I’ve learned a lot about dealing with data since then and I wish I could go back and share that information with my former self. Instead, I’m sharing that information with you.

My plan for this blog is to write about managing research data in the lab, in addition to discussing how data are becoming first class citizens in the world of research. There are a lot of exciting things going on around research data at the moment and I can’t wait to talk about them with you!

Posted in admin