The Absolute Most Important Things to Know in Order to Create a Data Management Plan (Part 2)

The last time I wrote about data management plans, I covered reasons that funders are starting to expect researchers to manage their data better and the 5 topics that make up most data management plans. The topics are as follows:

What types of data will I create?
What standards will I use to document the data?
How will I archive and preserve the data?
How will I protect private/secure/confidential data?
How will I provide access to and allow reuse of the data?

In this post, I want to dig into each of these topics a bit more.

1. What types of data will I create?

This is the background section of your data management plan, where you will provide an overview of your data and some of the most basic information on managing it. In general, you’ll want to answer the following questions:

What data will be collected?
Are my data unique? Are my data derived from existing data and are those data still available?
How big will my data be? How fast will my data grow?
How will my data be stored?
Who owns and is responsible for the data?
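For the size and growth questions, a back-of-the-envelope projection is usually enough. Here is a minimal Python sketch, assuming your data grow by a roughly constant amount each month and that you keep three copies in total. The function name and all of the numbers are made up for illustration; swap in your own estimates.

```python
def project_storage_gb(initial_gb, growth_gb_per_month, months, copies=3):
    """Estimate total storage needed after `months`, across all copies.

    Assumes linear growth, which is a simplification.
    """
    single_copy_gb = initial_gb + growth_gb_per_month * months
    return single_copy_gb * copies


# Hypothetical project: 50 GB to start, 10 GB of new data per month, 3 years.
total_gb = project_storage_gb(initial_gb=50, growth_gb_per_month=10, months=36)
print(f"Plan for roughly {total_gb} GB across all copies")
```

If your data arrive in bursts (field seasons, instrument runs), a linear estimate will be off, so treat the result as a starting point for the conversation with whoever provides your storage.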

A lot of the content of this section will be specific to your particular project, but there are some common themes to look out for.

First, you should consider how unique your data are. The management of observational data, for example, should be prioritized because these data are so tied to a time and a place that they cannot be recreated if lost. Simulation data, on the other hand, are easy to recreate; the management of these data should focus on the corresponding code rather than on the data themselves.

When storing data, the motto is: Lots of Copies Keeps Stuff Safe (or LOCKSS, which is also a preservation tool). Plan to follow the rule of 3, which dictates 2 onsite copies and 1 offsite copy. Automate your backups whenever you can.
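As one way to automate this, here is a minimal Python sketch that copies a file to several backup folders and checks each copy against the original with a checksum. It is an illustration under stated assumptions, not a full backup system: the function names and the example folder paths are hypothetical, and it assumes your second onsite location and your offsite location are both reachable as ordinary folders (for example, a mounted network drive).

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy `source` into each destination folder and verify every copy."""
    source = Path(source)
    expected = sha256(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies


# Example call (hypothetical paths): the original counts as the first onsite
# copy, so two destinations complete the rule of 3.
# back_up("results/run_01.csv", ["/mnt/lab_server/backups", "/mnt/offsite/backups"])
```

To make it automatic, you would run a script like this on a schedule using whatever scheduler your operating system provides. The checksum step matters because a copy that silently failed is not really a copy.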

Finally, you’ll want to designate someone who will be responsible for the data; usually this is the PI. Be aware that the responsible party might not be the data owner, as I mentioned in an earlier post on how complicated data ownership can be.

2. What standards will I use to document the data?

Documentation is a key part of data management, as I have mentioned several times on this blog already. In this section of your plan, you will want to cover:

Are there any community standards for documentation, such as an ontology or metadata schema?
How will I document and organize my data? What metadata schema will I use?
How will I document my methods and other information needed for reproducibility?

You’ll need some sort of documentation system no matter what, but you should really consider using a formal schema if you want to or are required to share your data. Formal metadata schemas document the context of a dataset in a standardized way, allowing datasets to be easily shared and interpreted by other parties.

If you decide to use a formal schema to document your data, it’s best to choose the schema before you collect your data so you know exactly what information to record. This is especially important if you know that you’ll be depositing your data in a particular repository. Take 2 minutes to look up this information before you acquire your data and save yourself a huge headache later when you go to deposit your data.

Besides looking at a disciplinary repository for the best documentation scheme, you can also consult your peers and your subject librarian. Be aware that your field might have not only metadata schemas, but also ontologies or taxonomies that will help you classify your datasets.

Finally, you should think about the other information that lets you understand your data and the methods by which you collected and interpreted them: code, surveys, codebooks, data dictionaries, and so on. This information not only adds context to your dataset but also makes it more trustworthy.
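A data dictionary does not have to be fancy. As a rough sketch, the Python below reads the header row of a CSV file and produces one blank dictionary entry per column for you to fill in. The function name, the field names ("description", "units", "allowed_values"), and the sample columns are my own assumptions for illustration, not part of any formal metadata schema.

```python
import csv
import io


def data_dictionary_skeleton(csv_text):
    """Build one blank data-dictionary entry per column of a CSV file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    for name in reader.fieldnames or []:
        entries.append(
            {"variable": name, "description": "", "units": "", "allowed_values": ""}
        )
    return entries


# Hypothetical dataset with three columns.
sample = "site_id,temp_c,collected_on\nA1,12.5,2013-05-01\n"
for entry in data_dictionary_skeleton(sample):
    print(entry["variable"])
```

The point is to capture the variable names while they are fresh and then write the descriptions by hand; only you know what each column really means.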

3. How will I protect private/secure/confidential data?

This section will not apply to all data plans, but is critical if it applies to you. Some of the issues you will need to address are:

What regulations apply to my data (HIPAA, FERPA, FISMA, etc.)?
What security measures will I put in place to protect my data?
Who is allowed access to my data?
Who will be responsible for data security?
Will my data lead to a patent or other intellectual property claim?

The best thing to do if you have data that fall under one of the listed regulations, local IRB constraints, or intellectual property claims is to talk to someone at your local institution. Almost all research institutions have policies as well as support systems for dealing with these issues. Data security is not the place where you want to cobble something together and hope it works (that can ruin careers).

Find your local experts. Cite your local policy. Make someone responsible for keeping on top of this.

4. How will I archive and preserve the data?

This section addresses one of the main reasons researchers are being asked to create data management plans: so their data outlive publication of the corresponding research article. The topics you should discuss are the following:

How long will I retain the data?
What file formats will I use? Do I need to preserve any software?
Where will I archive my data?
Who will be responsible for my data in the long term?

I addressed retention times and how to preserve data in my previous two blog posts, so I won’t go into those topics here.

What I will say is that usually the best method of preserving your data is to find a trustworthy partner to do it for you. A few good options are a disciplinary data repository, an institutional repository, or a journal that accepts data. Local servers come and go, whereas a repository’s mission is to keep things for a long period of time. You worry about the science and let them worry about the data.

5. How will I provide access to and allow reuse of the data?

The final portion of a data management plan is necessary for grant programs that require data sharing. If that condition applies to you, you should address the following questions:

Is there a relevant sharing policy?
Who is the audience for my data?
When and where will I make my data available? Do I have resources for hosting the data myself?

In addition to looking at funding agency and directorate policies that require sharing, check the journals you plan to publish in: a growing number of them require data sharing as a condition of publication.

I will give the same advice for data sharing as I did for data preservation: let someone else worry about this. It takes much more work (and is also more expensive) to make your data available by request or on your website than to hand them over to a repository to manage for you. Additionally, your data will be easier to find in a repository than on your website, making them more likely to be cited!

Data Management Plan Checklist

In addition to blogging about these key questions, I have also made them into a handy .pdf checklist to use while working through your data management plans. The checklist is intended for researchers at my institution but is still useful for others. It’s CC-BY licensed, so feel free to use and share!

Final Thoughts on Data Management Plans

Data management plans are going to be a standard part of any federally funded grant application due to stipulations from the recent White House public access memo. The exact requirements for each plan will vary between agencies and directorates, but there will be some common themes across plans, and those are the themes elaborated above.

The one thing I haven’t been able to touch on deeply in this post is the set of actual data management practices that underpin a good data management plan. But that topic requires a whole blog to cover. I will say that if you answer the above questions and customize your plan to the project at hand, you’ll have a good start on your data management plan.

I hope these two posts have clarified the growing importance of data management plans and what goes into them. Data management plans are here to stay but will become easier to write as we get more used to preserving and sharing digital data.