Retraction Watch wrote an interesting post this week about the partial retraction of a 2007 article because the authors could not provide the original research data. The retraction concerns a figure in which two panels appear to show the same information. Because the authors could not provide the original data to either prove or disprove this concern, the figure was retracted.
What matters here is not whether or not there was an actual error (we don’t know the answer), but that the absence of original research data led to a retraction. In how many other cases has there been concern about research results that can’t be addressed for lack of original data? Retraction is one possible outcome in such situations and this should raise an alarm for many researchers.
All of this leads to a very interesting question: for how long does a researcher need to retain their original data after the corresponding article is published?
The article in question would put that time at a minimum of 6 years, but it’s not always clear what the ideal number should be. Even when you can find explicit retention policies, you’ll likely find a different duration in each one. The examples below illustrate how varied data retention policies are.
Data Retention Policies
One source for retention policies is universities, whose policies are perhaps the most heterogeneous of any group. For example, the University of Kentucky expects its researchers to keep data “for a period of five years after publication or submission of the final report […], whichever is longer” (source), while at Northwestern University data “must be retained for a minimum of three years after the financial report for the project period has been submitted” (source). These university policies really run the gamut. I’ve seen universities that mandate data retention for at least 7 years, universities that have no research data policies at all, and universities that do not provide a clear retention time even though they have other data policies.
Funders are another source for data retention policies. The Engineering directorate of the NSF states that the mandated retention time “is three years after conclusion of the award or three years after public release, whichever is later” (pdf source). NIH also expects its fundees to keep records for at least three years after the final grant report is submitted (source). Outside the US, many UK funders have required retention times of at least 10 years (pdf source) and the Australian Code for the Responsible Conduct of Research stipulates a general data retention time of 5 years, which increases to 15 years for clinical data (pdf source).
Finally, several US government groups have policies on research data retention. The OMB circular A-110 from the White House states that data “shall be retained for a period of three years from the date of submission of the final expenditure report” (source). The Office of Research Integrity (ORI) states that three years is a commonly cited number, but it’s often not that simple (source). Case in point, a recent ORI investigation apparently set a 6-year limit on addressing research misconduct (source), meaning that you should really keep your data around for at least 6 years post-publication.
In some sense, journals have the ultimate power in the data retention issue because, even if your funder states a retention time of three years, the journal may expect you to produce the data 6 or more years post-publication or risk retraction. So even though the majority of journals do not have explicit data retention policies, they’re the ones effectively setting the retention duration by asking you to produce the original data many years post-publication. This means that you should err on the side of longer retention times.
Some Final Thoughts
What conclusions can we draw from this mess of numbers? First, if you are conducting federally funded research in the US, you must keep data for three years after the completion of the grant at the absolute minimum. In practice, you should keep your data for longer.
Based on the highlighted retraction and the policies described above, I would say that the minimum time you need to keep your data around is currently somewhere between 6 and 10 years post-publication. This may change in the future. I will also note that any data being used in misconduct investigations or litigation (as for patents, etc.) should be retained for a longer period of time.
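If you want to turn these numbers into an actual date for a given project, the arithmetic is just "take the latest of the applicable deadlines." Here is a minimal Python sketch of that idea. The function names (`add_years`, `earliest_disposal_date`) and the default values are my own illustrative assumptions: 3 years after the final report stands in for a typical US federal rule, and 6 years post-publication is the low end of the range suggested above. Check your own funder, university, and journal policies before relying on any of these numbers.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reached when d is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def earliest_disposal_date(
    published: date,
    final_report: date,
    funder_years: int = 3,            # assumption: typical US federal minimum
    post_publication_years: int = 6,  # assumption: low end of the 6-10 year range
) -> date:
    """Return the later of the funder deadline and the post-publication deadline."""
    funder_deadline = add_years(final_report, funder_years)
    publication_deadline = add_years(published, post_publication_years)
    return max(funder_deadline, publication_deadline)


# Hypothetical project: article published May 2007, final report filed Sept 2008.
print(earliest_disposal_date(date(2007, 5, 1), date(2008, 9, 30)))
# prints 2013-05-01
```

In this made-up example the funder rule alone would let you discard the data in September 2011, while the post-publication rule pushes that to May 2013. Treat whatever date comes out as a floor, not a target.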
I would love to put a definite number on data retention, but there is obviously no easy answer. What is clear is that keeping your data on hand can help dispel charges of scientific misconduct and will satisfy requirements from your funder/university/etc. Plus, you never know when you will need to use a piece of old research data for a current project.
As the value of data continues to increase and data sharing becomes more prevalent, I suspect that long-term retention of digital data will just become a normal part of the research process. Until then, you should keep your research data on hand for a good long time.