The Absolute Most Important Things to Know in Order to Create a Data Management Plan (Part 1)

I’m currently developing a workshop on creating a data management plan (DMP) and, as part of this development process, I want to identify the absolute most important things to know in order to create a DMP. This is partly because I have a finite amount of time to address DMPs in the session, but also because I don’t want to waste people’s time covering less important information.

To start my development process, I’ve come up with a list of some things a researcher might want to know when creating a data management plan:

  • What is a DMP?
  • Why create a DMP?
  • What are the benefits of a DMP (other than getting funding)?
  • What are key parts of a DMP?
  • What information do I need to know for each part of a DMP?
  • What are the specific DMP requirements for my grant program?
  • Where can I find an example DMP from my field?
  • Where can I get help on my DMP?
  • Do I really have to share my data?
  • How will my DMP be assessed?
  • I don’t have NSF funding, why should I care about a DMP?
  • Are there any tools/resources I can use to create my DMP?

From this list, it’s clear that some of these points may be better addressed on a webpage of resources than during an in-person session (i.e., finding DMP requirements, finding example plans, and a list of DMP tools/resources). Other points are simply not a priority to cover.

This leaves me with what I think are the most important things to know for creating a data management plan:

  • Why are researchers being asked to create a DMP (why create a DMP/benefits of a DMP)?
  • What are the key parts of a DMP?
  • How do I apply each of these key parts to my research?

These points also translate nicely into working through a DMP outline during my planned session, meaning researchers will leave with something usable and concrete.

With these three points identified, let’s dig into each one a bit more. I’ll cover the first two points in this post and the third in another post in a couple weeks.

 

Why Are Researchers Being Asked to Create a Data Management Plan?

Researchers with funding from NSF and the NEH Office of Digital Humanities (pdf link) are currently required to create a data management plan as part of their grant applications. In the next few years, other federal funders will add similar DMP requirements in response to the recent White House OSTP Public Access memo (pdf link). So everyone is getting on the DMP bandwagon, but the question is why?

From the funder perspective, data represent significant scholarly products that are not being used to their full potential (this is especially troubling to funders in the current financial environment). For this reason, we are seeing funder mandates for data sharing; the eventual goal is data sharing on a massive scale, akin to the distribution of scholarly articles. The barrier to reaching this goal is that most research data are not well managed and often aren’t maintained past the publication of the associated article. So data management plans are really the first step toward a new way of conducting research, because well-managed data are more easily shared data.

From the researcher perspective, DMPs are a requirement but also an aid to the research process. I’ve talked about it on this blog before, but deliberate management of data makes it easier to conduct research. Good data management means that researchers are less likely to lose data, more likely to find it when they need it, and better able to use it thanks to improved organization and documentation. I’ve even heard it said that one minute of data planning at the start of a project will save 10 minutes of headache later in the project.

The bottom line is: yes, you’re being asked to jump through another hoop to get funding, but if you’re already creating a plan, why wouldn’t you use it to make your research easier?

 

What Are the Key Parts of a Data Management Plan?

An NSF data management plan must include the following information:

  • The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
  • The standards to be used for data and metadata format and content
  • Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
  • Policies and provisions for re-use, re-distribution, and the production of derivatives
  • Plans for archiving data, samples, and other research products, and for preservation of access to them

The actual DMP requirements will vary from agency to agency and even between directorates within a particular agency, so you’ll want to look up the requirements for your grant before you write up your DMP. Still, we can distill NSF’s requirements into some common themes for the composition of a DMP. Basically, your plan should answer the following questions:

  • What types of data will I create?
  • What standards will I use to document the data?
  • How will I archive and preserve the data?
  • How will I protect private/secure/confidential data?
  • How will I provide access to and allow reuse of the data?

These are the key questions you need to ask yourself when creating any data management plan. They represent the many aspects of managing data, from creation and documentation through preservation and reuse. By answering these questions, you will come up with a way to manage your data throughout the project.

I’ll go into these 5 questions more in my next post and discuss how to apply each question to your individual research project.

Posted in dataManagement, dataManagementPlans, fundingAgencies

What Do You Mean By ‘Data Services’?

It’s been a very big month for me: I graduated from the University of Wisconsin-Madison in May and promptly got a job as Data Services Librarian at the University of Wisconsin-Milwaukee. The combination means big life changes and a lot of new information to process. The result has been a lot of thinking about data that I wasn’t able to organize until recently. Now that I have a better understanding of my new job, I think it will be good for me to put some of my recent thoughts into a (hopefully) coherent post.

One thing I should say is that I’m coming into an environment where there is a lot of interest in data management but no one person who is centrally responsible for data management on campus. This is an entirely new position and I am the first person on campus whose whole job it is to address data management issues. I’ll therefore play a large role in shaping the so-called ‘data services’ from which my job title derives.

So what are ‘data services’ exactly? Well, they can be a lot of things. A recent white paper (pdf) from the Association of College & Research Libraries (ACRL) surveyed libraries on the types of services they offer around research data. The services covered in the report include:

  • Consulting on data management plans
  • Consulting on data and metadata standards
  • Outreach with other data service providers on campus
  • Providing reference support in finding and citing data sets
  • Creating guides for finding data
  • Directly participating in research projects
  • Discussing data services with others on campus
  • Training librarians and others on campus
  • Providing repository services

A surprisingly large percentage of the surveyed libraries already provide some or all of these data services or plan to do so in the near future. And it isn’t just the large doctoral institutions that are doing this, though they are more likely to offer data services than other types of universities/colleges. It’s quite possible that your institution offers something similar, though be aware that it may not be through the library.

That’s the thing about data services: it’s not just a library issue. Certainly, there are particular data services that the library is in a unique position to provide (such as assistance with finding and citing data sets), but dealing with research data involves other stakeholders, such as IT, the campus divisions that support research, and the faculty whose data we’re supporting. For this reason, I’ve spent a lot of time at my new job meeting with a wide range of people from across campus. I’m not sure what the campus-wide efforts around research data will be, but I can say what I’m looking to do initially in my role as Data Services Librarian.

First, I’m focused on grant compliance. The requirement for NSF proposals to include data management plans means that this is a clear nucleus for discussion of data management on campus. Additionally, the White House OSTP memo’s promise that data plans will become standard for all federal grants means that the need for data management plan assistance will only grow in the coming years.

The other area I’m focusing on is training in data management and writing data management plans. If I only do one thing in this position, it will be to give researchers the tools they need to manage data well. These sessions will be aimed at faculty, students, and staff, though I must say that I have a soft spot in my heart for working with grad students in this area.

I’m still working out the details of these two services and the best ways to advertise them on campus, but those are my current thoughts. I wholly expect my ideas to evolve over time, just as I expect data services to evolve over time. Because data services isn’t a static, one-size-fits-all kind of thing. Such services must meet the needs of the individual university and, especially because it’s data, those needs are likely to be continually changing.

So those are my current thoughts on my new position and how I’m approaching data services at UW-Milwaukee. I hope these thoughts were coherent enough and you have a better sense of the types of things I will be working on. I’ll be sure to share any new and interesting things from the job as they arise!

Posted in dataManagement

Reinhart and Rogoff

I can’t tell you how happy I am to be back to this blog, talking about data. I’ve actually spent a lot of the last month writing about data issues, but for my last class of my Master’s degree in library and information studies instead of this blog. On that front, I’m happy to report that I graduated this past weekend!


My last assignment for my degree involved writing on data sharing. While all of my thoughts on the topic are too numerous to write about in a single blog post, there is one particular thread of the assignment worth elaborating upon here: the recent Reinhart and Rogoff news.

If you missed it, Reinhart and Rogoff are two Harvard economics professors who published a study (pdf) examining economic growth for countries with high debt-to-GDP ratios. Their findings have been used as evidence for austerity measures in both America and Europe. Unfortunately, their conclusions are wrong because their analysis is flawed.

The errors were discovered by Thomas Herndon, a UMass-Amherst grad student who read the paper and tried to reproduce the analysis. Failing to do so, he contacted the authors and was given access to the spreadsheet containing their data and analysis. Upon examining the spreadsheet, Herndon found erroneously discarded data points and coding errors. When the errors were fixed, the conclusions of the original paper were not supported.
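To get a feel for how a small spreadsheet slip can move a headline number, here’s a toy sketch in Python. The growth figures below are invented purely for illustration (they are not the Reinhart and Rogoff data); the point is just how a calculation that silently skips rows shifts an average:

```python
# Toy illustration only: invented growth rates, not the Reinhart-Rogoff data.
# The point is how a calculation that silently skips rows (like a spreadsheet
# formula whose range is too short) shifts the reported average.
growth_rates = [2.2, 1.8, -0.1, 2.5, 2.0]  # hypothetical values for five countries

full_average = sum(growth_rates) / len(growth_rates)

# A range that stops two rows early quietly drops the last two countries.
truncated = growth_rates[:3]
truncated_average = sum(truncated) / len(truncated)

print(f"Average over all five countries: {full_average:.2f}")      # 1.68
print(f"Average with two rows left out:  {truncated_average:.2f}")  # 1.30
```

With the data and the spreadsheet in hand, a check like this takes minutes; with only the published article, it’s nearly impossible.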

This story is important for a few reasons. First, the article has had a significant, and most likely negative, impact on the American and European economies. Second, it was only through the sharing of the original data and analysis that the errors were conclusively discovered and proven. Third, had the original authors not chosen to share their data (which is still not a common practice), the errors and the resulting economic policies could have persisted for years.

I find this story to be one of the best examples of the power of data sharing, between the paper’s significant impact and the fact that a careful reading of the article was not enough to conclusively prove the mistakes. Stanford statistics professor David Donoho once likened (pdf) journal articles to the advertising of scholarship, with the data and analysis being the actual scholarship. That perfectly encapsulates the issues here.

Science values reproducible work, but reproducibility often can’t be proven from articles alone. Thankfully, checking for reproducibility becomes easier if data sharing is part of the standard research process. Scientists can go directly to the data and analysis if they have questions about the work.

The ultimate goal is to have an accurate scientific record, preventing more studies like Reinhart and Rogoff’s from causing harm. And as evidenced from the Reinhart and Rogoff story, data sharing can play an important role in reaching this goal.

 

Resources:

Reinhart, Rogoff… and Herndon: The student who caught out the profs

Influential Reinhart-Rogoff economics paper suffers spreadsheet error

What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

Reinhart, Rogoff Backing Furiously Away From Austerity Movement

Posted in dataAnalysis

The Hidden Costs of Cloud Storage

Cloud storage is an increasingly popular way to store research data. Being able to upload and access files from any location is useful and makes transfer between computers much easier. But for all of the upsides of cloud storage, there are also a few downsides.

Data Ownership

While most of us don’t usually read terms of service agreements, it’s worth doing a little digging when it comes to your cloud provider. For example, Google Drive’s terms of service includes this little tidbit:

When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works… communicate, publish, publicly perform, publicly display and distribute such content.

You retain intellectual property rights over the content you put on Drive, but Google can still do a lot of things with your content. This should make you a little worried about any research data you put on Drive.

There are some ways around this problem. One example comes from UW-Madison, which has negotiated Google Drive terms of service for faculty and students under which Google has no ownership or use permissions. The other option is simply to pick a cloud storage provider that won’t use your data, but even that isn’t always perfect. Dropbox, for example, doesn’t take quite the same liberties with your data that Google does, but it does spell out in its terms of service how it can use your personal information (name, address, log-in information, etc.) or provide your files to law enforcement.

My best advice? Read the terms of service before choosing a cloud storage provider for your data.

Security

The other natural concern when giving your data to a third party is security. This is especially important when putting sensitive information or student information (covered under FERPA) in the cloud. You need to take a lot of extra precautions in the cloud if your data is sensitive.

One secure cloud storage option I’ve run across is SpiderOak. Unlike other cloud storage options, SpiderOak cannot actually read any of your data, because your data is encrypted before it even arrives at SpiderOak’s servers. And in this Ars Technica review, SpiderOak compares favorably with other popular cloud services like Dropbox and SugarSync.
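For the curious, here’s a minimal sketch of what that client-side approach looks like in Python, using the third-party cryptography package. This is my own illustration of the general idea, not anything tied to a particular service, and the file names are placeholders:

```python
# A minimal sketch of encrypting a file locally before it ever reaches a
# cloud-synced folder. Requires the third-party "cryptography" package
# (pip install cryptography). File names are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # keep this key somewhere safe, not in the cloud
cipher = Fernet(key)

with open("sensitive_data.csv", "rb") as infile:
    encrypted = cipher.encrypt(infile.read())

# Only the encrypted copy goes into the folder your cloud service syncs.
with open("sensitive_data.csv.enc", "wb") as outfile:
    outfile.write(encrypted)
```

Decrypting is just `cipher.decrypt(...)` with the same key, which is why keeping that key out of the cloud is the whole point.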

So unless you use a service like SpiderOak that guarantees security, the cloud is not the best place for your sensitive data.

The Limitations of the Cloud

Cloud storage can be a blessing in the laboratory, but putting your data in the cloud does not automatically mean that your data is well backed up or well managed. This is because your data is outside of your control once you give it to another entity. If your cloud storage provider folds or suddenly changes its terms of service (as seen in the recent Instagram debacle), you could find yourself in a tight spot. For safety’s sake, it’s better to have other backups besides your cloud drive.
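As a sketch of what “other backups” might look like in practice, here’s a small Python snippet that zips a cloud-synced project folder to an external drive. The paths are placeholders you’d swap for your own setup:

```python
# A minimal sketch of keeping a second, local backup of a cloud-synced
# project folder. The paths below are placeholders for your own setup.
import shutil
from datetime import date
from pathlib import Path

project = Path.home() / "Dropbox" / "my_project"     # hypothetical cloud-synced folder
backup_root = Path("/Volumes/BackupDrive/research")  # hypothetical external drive

# Date-stamped archive so older snapshots are kept rather than overwritten.
archive_base = backup_root / f"my_project_{date.today().isoformat()}"
shutil.make_archive(str(archive_base), "zip", root_dir=str(project))
print(f"Backup written to {archive_base}.zip")
```

Run on a schedule (or just before big milestones), something this simple means your data survives even if the cloud copy doesn’t.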

I’m in no way saying that you should not use cloud storage for research. Instead, you should be smart about choosing a service provider and know that service’s limitations. With a little bit of forethought, cloud storage can be a valuable asset in the laboratory instead of a potential security hole.

Posted in dataStorage, digitalFiles

The Proper Pen

Have you ever wondered what the scientifically optimal writing utensil is to use in your lab notebook? No? Well, this post contains the answer to a record-keeping question you never thought to ask.

The answer comes from one of my favorite books on managing laboratory records, Writing the Laboratory Notebook by Howard Kanare. It was published by the ACS in the 1980s (making the section on electronic record keeping highly entertaining) and thoroughly covers the hows and whys of keeping a proper notebook.

This book is so thorough, in fact, that it spends six pages (pp. 11-16) on the proper type of paper and ink to use. Kanare even conducted experiments with 15 different types of pens to determine the most colorfast and solvent-fast inks. I found his experiments so interesting that I thought it worth sharing the highlights with you.

Just say no to pencils

First, I should say that pencils are right out. They’re erasable, they smudge, and they don’t copy well when you’re backing up your notebook. If you want to be sure that data hasn’t been changed or lost to illegibility, it’s better to stick with a pen.

Ink color

The choice of ink color comes down to lightfastness, since modern inks no longer contain the harsh acids that historically ate through paper over time. Kanare tested inks under both fluorescent light and sunlight and found that red inks fade most easily, blue inks fade somewhat (the amount of fading depends on the pen type), and black inks fade the least.

Pen type

Felt-tip pens have a few things going against them from the start. Their inks are water-based, making the ink more likely to bleed and less permanent. On the positive side, these porous-tip pens held up to Kanare’s solvent tests (using water, hexane, HCl, acetone, and methanol) about as well as the ballpoint pens.

The other main option, a ballpoint pen, does pretty well under Kanare’s solvent tests and the pen’s solvent-based ink makes writing more permanent. Kanare’s only warning about these pens is that the ink can coagulate or settle during long-term storage, leading to performance problems in older pens.

Kanare also brings up the option of using archival-quality pens, but it’s not clear without testing whether they’re worth the added expense over the long term.

And the winner is…

You can’t go wrong with a humble black ballpoint pen when writing in your lab notebook. Its ink stands up best to fading and spills and provides good permanence, keeping your records readable for a long time.

Posted in labNotebooks