A Note on NonCommercial Licenses

I wrote about the best licenses for datasets in my previous post and I want to add to that information by pointing out two potentially problematic Creative Commons licenses for research products, data and publications alike: CC Attribution-NonCommercial (CC BY-NC) and CC Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)*. These are the two Creative Commons noncommercial licenses.

There are a couple of reasons to think twice before using this class of licenses. The first is that the meaning of noncommercial is unclear. You are excluding anyone from freely using CC-NC content for profit, which obviously covers corporations but might also include groups like nonprofits. For example, a nonprofit may use your content for promotional material intended to increase their membership. An increase in member dues can be considered a financial gain that is not allowed under a CC-NC license. There are a lot of ambiguities here and they are better laid out in this article on noncommercial licensing in biology. The important thing is to be aware that you are excluding more uses than you may realize by using a noncommercial license.

The other reason to hesitate before applying a noncommercial license is that publishers can still make a profit on this content even though it is “open access”. Licensing something under a CC-NC license doesn’t mean that it can’t be used commercially, only that it can’t be used commercially for free. This content can still be used in a commercial setting if you pay for permission, just like with traditional content. At least one publisher is guiding its authors toward this “open access” license while simultaneously charging others to commercially use this content; the profit rarely goes back to the author.

Noncommercial licenses are not recommended for data, for reasons expressed in my previous post, but are probably not ideal for your other research products either. I’m not saying that you shouldn’t use a noncommercial license, only that you should be aware of the limitations of these licenses before consciously applying them to your research products.

 

* “Share Alike” licenses require any derivative products to be similarly licensed. This is a “copyleft” style license, meant to make the content and all its derivatives free in perpetuity.

Posted in copyright

On Data and Copyright

As scientists, we aren’t necessarily trained in copyright. For a long time this hasn’t been a problem, as practices for distributing our scholarly work have been fairly standardized. Open access publishing and data sharing are changing things and providing researchers with a multitude of copyright options beyond just signing over our rights in order to be published. This post looks at some of those options for data.

Data and Copyright

Copyright is confusing, but it becomes even weirder when you apply it to research data. That’s because data are often considered facts, which do not fall under copyright in many countries. The exception is a creative compilation of those facts, to which copyright does apply in countries like the US. Such variations in copyright law from country to country make it difficult to determine if you need to worry about copyright on your research data.

In the US, the distinction between facts and a creative compilation of facts was laid out in the case “Feist Publications, Inc. v. Rural Telephone Service Company, Inc.” This case applied to the compilation of telephone numbers (i.e., a phone book), which was not deemed a creative arrangement. There must be some original selection and rejection in the compilation in order to justify copyright. So a curated database containing research data could be considered a creative arrangement, even though individual facts are not eligible for copyright.

The heterogeneous nature of data adds to the copyright confusion. It’s unclear if original research data that aren’t a stereotypical set of numbers (like image data, video data, etc.) are eligible for copyright, though my inclination is that in some situations they might be.

Confused yet?

Let’s take a step back from this muddle and talk about the two things I think you should know about copyright on datasets (caveat: I am not a copyright expert, so this does not constitute legal advice).

  1. You should recognize that your original research data may not be copyrightable, especially if you are based in the US.
  2. To avoid any copyright confusion, I strongly recommend applying a clear license to any datasets you share—preferably the CC0 license described below.

I recommend using a Creative Commons license because these licenses are easy to apply, legally enforceable, and becoming popular in scholarly publishing. Creative Commons (CC) itself is a nonprofit organization founded in 2001. They took the idea of the GNU GPL license for open source software and applied it to creative works. They offer several licenses, but I want to look at the two most often discussed for datasets.

Creative Commons Attribution (CC BY)

The Creative Commons Attribution license is the most basic of the CC licenses and the one that is often used on open access articles. If you license something under CC BY, you allow anyone to use and modify your content for any purpose, so long as you are given attribution. Because of the freedom to mine content, CC BY is often considered the best license for open access journal articles and for that reason is required by some funding agencies.

On the surface, CC BY seems like a great license for data because it enables data reuse while still requiring citation, which is always important in research. The problem with this license appears when you aggregate datasets. For example, if you are analyzing a group of 100 datasets to find patterns, under a CC BY license you would need to cite every last dataset in your published article. If you have a particularly large database, citation becomes even more difficult because you need to sort through which parts of the database were actually included in the analysis. Using CC BY datasets in aggregate is obviously problematic.

The limitations of CC BY licensed data are becoming more apparent as data mining emerges as an important research tool. Ironically, data mining is one of the reasons to want openly licensed data in the first place. So in order to enable easier data mining and reuse, Creative Commons does not recommend the use of CC BY for data.

Creative Commons Zero (CC0)

The Creative Commons Zero license is the only Creative Commons license intended for data. Using a CC0 license means that you waive all of your rights over a dataset, including the attribution requirement which hinders data mining. This may seem counter-intuitive but recognize that you probably didn’t have those rights to begin with in countries like the US.

The strength of CC0 is that it is explicitly intended for content that is copyrightable in some jurisdictions and not others. It does this by removing all copyright claims universally. In the words of Creative Commons:

CC0 should not be used to mark works already free of known copyright and database restrictions and in the public domain throughout the world. However, it can be used to waive copyright and database rights to the extent you may have these rights in your work under the laws of at least one jurisdiction, even if your work is free of restrictions in others. Doing so clarifies the status of your work unambiguously worldwide and facilitates reuse.

CC0 clears away the confusion on whether a dataset is copyrightable, noncopyrightable, or copyrightable in some countries by applying an open license that is unambiguous and usable worldwide. It also allows for data mining and reuse, which makes it the best license for research datasets.

The other big consideration when using a CC0 license is attribution. Attribution is not required with this license, but that doesn’t mean that you should not cite a dataset. Data Dryad addresses this issue nicely in their FAQ:

CC0 does not exempt those who reuse the data from following community norms for scholarly communication, in particular from citation of the original data authors. On the contrary, by removing unenforceable legal barriers, CC0 facilitates the discovery, reuse, and citation of that data. Any publication that makes substantive reuse of the data is expected to cite both the data package and the original publication from which it was derived.

So while CC0 does not require attribution, community norms do. Community norms and the corresponding ethics of doing research are powerful motivators even when there are no comparable legal requirements in place. All this means is that despite not being required to attribute a dataset, you will still be expected to.

Finally, I will note that CC0 fits into a broader idea that scientific data should be open to encourage the scholarly process, an idea which is outlined by the Panton Principles. The Panton Principles identify CC0 and the Public Domain Dedication and License (PDDL) as the two acceptable options for licensing datasets.

Final Thoughts

Many data repositories are already using the CC0 license: Dryad, figshare (which licenses data under CC0 and all other materials under CC BY), and, just announced this week, BioMed Central, among others. There is definitely growing consensus within the scientific community that CC0 is the preferred license for shared datasets.

Using a CC0 license removes any potential copyright ambiguity and makes it clear that someone else can freely use the licensed dataset. For US-based researchers, it is likely you never had copyright over your data to begin with, but it’s still best to be as explicit as possible that you are not exerting these rights. It makes data that much easier to share and reuse. Just remember that if you come across a CC0-licensed dataset you would like to use, you should cite the data creator even if it is not technically required.

 

Resources:

Elliott, R. (2005). Who owns scientific data? The impact of intellectual property rights on the scientific publication chain. Learned Publishing, 18(2), 91-94.

Murray-Rust, P. (2008). Open Data in Science. Serials Review, 34(1), 52-64.

Posted in copyright

The Absolute Most Important Things to Know in Order to Create a Data Management Plan (Part 2)

The last time I wrote about data management plans, I covered reasons that funders are starting to expect researchers to manage their data better and the 5 topics that make up most data management plans. The topics are as follows:

  • What types of data will I create?
  • What standards will I use to document the data?
  • How will I archive and preserve the data?
  • How will I protect private/secure/confidential data?
  • How will I provide access to and allow reuse of the data?

In this post, I want to dig into each of these topics a bit more.

 

1. What data will I create?

This is the background section of your data management plan, where you will provide an overview of your data and some of the most basic information on managing it. In general, you’ll want to answer the following questions:

  • What data will be collected?
  • Are my data unique? Are my data derived from existing data and are those data still available?
  • How big will my data be? How fast will my data grow?
  • How will my data be stored?
  • Who owns and is responsible for the data?

A lot of the content of this section will be specific to that particular project, but there are some common themes to look out for here.

First, you should consider how unique your data are. The management of observational data, for example, should be prioritized because that type of data is so tied to a time and a place that it cannot be recreated if lost. Simulation data, on the other hand, are easy to recreate; the management of these data should focus on the corresponding code over the data themselves.

When storing data, the motto is: Lots of Copies Keeps Stuff Safe (or LOCKSS, which is also a preservation tool). Plan to follow the rule of 3, which dictates 2 onsite copies and 1 offsite copy. Automate your backups whenever you can.
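As a rough illustration of automating the rule of 3, here is a minimal Python sketch that copies one file into a second onsite folder and an offsite folder, then confirms each copy with a checksum. The function names and the idea of treating your working file as the first onsite copy are assumptions made for this example, and the destination folders are whatever your own setup provides (a second drive, a mounted network share, and so on).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> dict[str, bool]:
    """Copy `source` into each destination folder and verify every copy.

    Returns a mapping of copy path -> True if its checksum matches the source.
    The destination folders are hypothetical; supply your own locations.
    """
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy_path = folder / source.name
        shutil.copy2(source, copy_path)  # copy2 also preserves timestamps
        results[str(copy_path)] = sha256_of(copy_path) == expected
    return results
```

Calling `back_up(Path("results.csv"), [second_onsite_folder, offsite_folder])` gives you the two extra copies; scheduling that call (with cron, Task Scheduler, or similar) is what turns it into an automated backup. A dedicated backup tool will do this more robustly, so treat the sketch as a way to see the moving parts rather than a replacement for one.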

Finally, you’ll want to designate someone who will be responsible for the data; usually this is the PI. Be aware that the responsible party might not be the data owner, as I mentioned in an earlier post on how complicated data ownership can be.

 

2. What standards will I use to document the data?

Documentation is a key part of data management, as I have mentioned several times on this blog already. In this section of your plan, you will want to cover:

  • Are there any community standards for documentation, such as an ontology or metadata schema?
  • How will I document and organize my data? What metadata schema will I use?
  • How will I document my methods and other information needed for reproducibility?

You’ll need some sort of documentation system no matter what, but you should really consider using a formal schema if you want to or are required to share your data. Formal metadata schemas document the context of a dataset in a standardized way, allowing datasets to be easily shared and interpreted by other parties.

If you decide to use a formal schema to document your data, it’s best to choose the schema before you collect your data so you know exactly what information to record. This is especially important if you know that you’ll be depositing your data in a particular repository. Take 2 minutes to look up this information before you acquire your data and save yourself a huge headache later when you go to deposit your data.

Besides looking at a disciplinary repository for the best documentation scheme, you can also consult your peers and your subject librarian. Be aware that your field might have not only metadata schemas, but also ontologies or taxonomies that will help you classify your datasets.

Finally, you should think about the other information that lets you understand your data and the method by which you collected and interpreted it; things like: code, surveys, codebooks, data dictionaries, etc. This information not only adds context to your dataset but also makes it more trustworthy.
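One low-effort way to start a data dictionary is to generate a blank one from the column names of a dataset and then fill it in by hand. The Python sketch below does that for a .csv file; the dictionary columns (`variable`, `description`, `units`, `allowed_values`) are illustrative choices for this example, not a formal metadata schema, so swap in whatever your field or repository expects.

```python
import csv
from pathlib import Path


def write_data_dictionary(data_file: Path, dictionary_file: Path) -> list[str]:
    """Create a fill-in-the-blanks data dictionary from a CSV header row.

    Returns the list of column names found in `data_file`.
    """
    with data_file.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])

    with dictionary_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # These field names are hypothetical, not taken from any standard.
        writer.writerow(["variable", "description", "units", "allowed_values"])
        for name in header:
            writer.writerow([name, "", "", ""])
    return header
```

The output is itself a .csv, so it can sit next to the dataset and travel with it when you deposit or share the data.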

 

3. How will I protect private/secure/confidential data?

This section will not apply to all data plans, but is critical if it applies to you. Some of the issues you will need to address are:

  • What regulations apply to my data (HIPAA, FERPA, FISMA, etc.)?
  • What security measures will I put in place to protect my data?
  • Who is allowed access to my data?
  • Who will be responsible for data security?
  • Will my data lead to a patent or other intellectual property claim?

The best thing to do if you have data that falls under one of the listed policies, local IRB constraints, or intellectual property claims is to talk to someone at your local institution. Almost all research institutions have policies as well as support systems for dealing with these issues. Data security is not the place where you want to cobble something together and hope it works (that can ruin careers).

Find your local experts. Cite your local policy. Make someone responsible for keeping on top of this.

 

4. How will I archive and preserve the data?

This section addresses one of the main reasons researchers are being asked to create data management plans: so their data outlive publication of the corresponding research article. The topics you should discuss are the following:

  • How long will I retain the data?
  • What file formats will I use? Do I need to preserve any software?
  • Where will I archive my data?
  • Who will be responsible for my data in the long term?

I addressed retention times and how to preserve data in my previous two blog posts, so I won’t go into those topics here.

What I will say is that usually the best method of preserving your data is to find a trustworthy partner to do it for you. A few good options are a disciplinary data repository, an institutional repository, or a journal that accepts data. Local servers come and go, whereas a repository’s mission is to keep things for a long period of time. You worry about the science and let them worry about the data.

 

5. How will I provide access to and allow reuse of the data?

The final portion of a data management plan is necessary for grant programs that require data sharing. If that condition applies to you, you should address the following questions:

  • Is there a relevant sharing policy?
  • Who is the audience for my data?
  • When and where will I make my data available? Do I have resources for hosting the data myself?

In addition to looking at funding agency and directorate policies that require sharing, check the journals you plan to publish in: a growing number of them require data sharing as a condition for publication.

I will give the same advice for data sharing as I did for data preservation: let someone else worry about this. It takes much more work (and is also more expensive) to make your data available by request or on your website than to hand it over to a repository to manage for you. Additionally, your data will be easier to find in a repository than on your website, making it more likely to be cited!

 

Data Management Plan Checklist

In addition to blogging about these key questions, I have also made them into a handy .pdf checklist to use while working through your data management plans. The checklist is intended for researchers at my institution but is still useful for others. It’s CC-BY licensed, so feel free to use and share!

 

Final Thoughts on Data Management Plans

Data management plans are going to be a standard part of any federally funded grant application due to stipulations from the recent White House public access memo. The exact requirements for each plan will vary between agencies and directorates, but there will be some common themes between plans—themes that have been elaborated above.

The one thing I haven’t been able to touch on deeply in this post is the actual data management practices that underpin a good data management plan. But that topic requires a whole blog to cover. I will say that if you answer the above questions and customize your plan to the project at hand, you’ll have a good start on your data management plan.

I hope these two posts have clarified the growing importance of data management plans and what goes into them. Data management plans are here to stay but will become easier to write as we get more used to preserving and sharing digital data.

Posted in dataManagementPlans

Keeping Data Long-Term

Last week, I discussed some of the policy requirements for retaining research data and I’d like to follow up by discussing how one goes about retaining research data for 10+ years. It’s a sad fact that many of us have digital files from 10 years ago that we are no longer able to open or read. For how common it is to have unreadable old files, we should not have to accept this fate for our research data.

The problem is that digital information is not like a book, which can be put on a shelf for 10 years and forgotten yet still be readable when you come back to it. Digital information requires upkeep so you can actually open and use your files 10 years into the future. Digital preservation also requires a little planning up front. The rule of thumb about data management really holds true here: 1 minute of planning now will save you 10 minutes of headache later.

There is a whole field dedicated to digital preservation, but I’d like to discuss a couple easy practices that will make it much easier for you to use your data 10 years from now. Because, as this recent study evidences, you never know when you’ll be out drinking with your research buddies and realize that the data you took 10 years ago could be repurposed for an awesome study (true story).

Do Immediately: Convert File Formats

One of the easiest things you can do to save your files for the future is to convert them to open file formats. The best formats are open, standardized, well-documented, and in wide use; examples include: .csv, .tiff, .txt, .dbf, and .pdf. Avoid proprietary file types whenever you can.
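To find out how much converting you have ahead of you, it can help to list the files in a project folder that are not already in one of the open formats named above. The sketch below is one way to do that in Python; the function name is made up for this example, and checking file extensions is only a rough heuristic (a file without an extension gets flagged, and an extension says nothing about whether the contents are well formed).

```python
from pathlib import Path

# Extensions named in this post as examples of open formats; extend as needed.
OPEN_EXTENSIONS = {".csv", ".tiff", ".txt", ".dbf", ".pdf"}


def files_needing_conversion(root: Path, open_extensions=OPEN_EXTENSIONS) -> list[Path]:
    """List files under `root` whose extension is not in the open-format set."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() not in open_extensions
    )
```

Running `files_needing_conversion(Path("my_project"))` returns a sorted list of paths to review, which makes a convenient to-do list for a conversion session.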

By choosing an open file format, you’re doing a lot to ensure that your files are readable down the road. For example, a lot of people have invested their information in .pdfs, meaning that there will be a need to read .pdf files well into the future. Your .pdf data will be safer because of this. Likewise, saving spreadsheet data as .csv instead of .xlsx means that your files aren’t tied to the fate of one particular software package.

I will also note that even when you convert your files, it’s a good idea to keep copies in both the old and new formats, just in case. You can sometimes lose functionality and formatting through the conversion, so it’s preferable to have the original files on hand if you can still read them.

Sometimes, it’s just not possible to convert to an open file format or retain the desired functionality. In this case, you’ll need to preserve any software and hardware necessary to open and interpret your digital files. This is more work than simple conversion, but can definitely save you a headache down the road.

The most important thing about converting file formats is to do it now instead of later. For example, even though I finished my PhD only a few years ago, a lot of my dissertation data is inaccessible to me because I no longer have access to the software program Igor Pro. If I had only converted my files to .csv’s before I left the lab, I would still be able to use my data if I needed to.

Do Periodically: Update Your Media

Beyond converting files to open formats, it’s also important to periodically examine the media that those files live on. It’s no use having a bunch of converted .txt files if they’re living on a floppy disk. I could probably track down a floppy disk drive if I had to, but it would have been a lot easier to use those .txt files if I had moved them to a CD a few years ago.

Updating your media is not something you’ll need to do frequently but you should pay attention to the general ebb and flow of storage types. Being aware of such things will remind you to transfer your video interview data, for example, from VHS to DVD before you lose all access to a VCR.

When in doubt, there are services that will recover files from old media, but be aware that it will cost you. It’s much easier, and cheaper, just to update your media periodically as you go along.

Always a Good Idea: Documentation

Finally, documenting your data goes a long way toward ensuring that it is usable in the future. This is because scanning through a file is usually not enough to understand what you’re looking at or how the data were acquired. Documentation provides the context of a dataset and allows for the data to truly be usable.

Documentation becomes more important when preserving data for the long term because you’re likely to forget the context of a dataset in 10 years. You’re much more likely to have the information you need if you document your dataset while you are acquiring it. So while it’s not directly related to the logistics of keeping files readable, documentation is a critical part of preserving data for future use.

For all that I have a whole post about documentation, I’m sure I will keep talking about documentation on this blog because it’s so important to good data management.

A Final Thought

I will end this post with this final thought: you may be required to keep your data on hand for 7 years post-publication or post-grant, but what is the point of keeping them if you have no way of reading them? By taking the simple steps of converting your files to open file formats, periodically moving everything to more modern media, and maintaining good documentation habits, you’ll be doing yourself a huge favor for when you need your data 10 years after you acquired them.

Posted in dataStorage, digitalFiles, digitalPreservation

How Long Should You Keep Data?

Retraction Watch wrote an interesting post this week about the partial retraction of a 2007 article because the authors could not provide the original research data. The retraction concerns a figure in which two panels appear to show the same information. Because the authors could not provide the original data to either prove or disprove this concern, the figure was retracted.

What matters here is not whether or not there was an actual error (we don’t know the answer), but that the absence of original research data led to a retraction. In how many other cases has there been concern about research results that can’t be addressed for lack of original data? Retraction is one possible outcome in such situations and this should raise an alarm for many researchers.

All of this leads to a very interesting question: for how long does a researcher need to retain their original data after the corresponding article is published?

The article in question would put that time at 6 years or more, but it’s not always clear what the ideal number should be. Even when you can find explicit retention policies, you’ll likely find different time durations in each policy. The many examples below illustrate how varied data retention policies are.

Data Retention Policies

One source for retention policies is universities, which are perhaps the most heterogeneous retention policy type. For example, the University of Kentucky expects their researchers to keep data “for a period of five years after publication or submission of the final report […], whichever is longer” (source), while Northwestern University data “must be retained for a minimum of three years after the financial report for the project period has been submitted” (source). These university policies really run the gamut. I’ve seen universities that mandate data retention for at least 7 years, universities that have no research data policies at all, and universities that do not provide a clear retention time even though they have other data policies.

Funders are another source for data retention policies. The Engineering directorate of the NSF states that the mandated retention time “is three years after conclusion of the award or three years after public release, whichever is later” (pdf source). NIH also expects its fundees to keep records for at least three years after the final grant report is submitted (source). Outside the US, many UK funders have required retention times of at least 10 years (pdf source) and the Australian Code for the Responsible Conduct of Research stipulates a general data retention time of 5 years, which increases to 15 years for clinical data (pdf source).

Finally, several US government groups have policies on research data retention. The OMB circular A-110 from the White House states that data “shall be retained for a period of three years from the date of submission of the final expenditure report” (source). The Office of Research Integrity (ORI) states that three years is a commonly cited number, but it’s often not that simple (source). Case in point, a recent ORI investigation apparently set a 6-year limit on addressing research misconduct (source), meaning that you should really keep your data around for at least 6 years post-publication.

In some sense, journals have the ultimate power in the data retention issue because, even if your funder states a retention time of three years, the journal may expect you to produce the data 6 or more years post-publication or risk retraction. So even though the majority of journals do not have explicit data retention policies, they’re the entity setting the retention duration by asking you to produce the original data many years post-publication. This means that you should err on the side of longer retention times.

Some Final Thoughts

What conclusions can we draw from this mess of numbers? First, if you are conducting federally funded research in the US, you must keep data for three years after the completion of the grant at the absolute minimum. In practice, you should keep your data for longer.

Based on the highlighted retraction and the policies described above, I would say that the minimum time you need to keep your data around is currently somewhere between 6 and 10 years post-publication. This may change in the future. I will also note that any data being used in misconduct investigations or litigation (as for patents, etc.) should be retained for a longer period of time.
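Because the clocks in these policies start at different events, the practical rule is to work out each deadline and keep the data until the latest one. The Python sketch below shows that arithmetic. The default of three years after the grant ends and six years after publication are taken from the numbers discussed in this post and are assumptions for illustration only; the function names are made up, and your own funder, university, and journal policies should supply the real values.

```python
from datetime import date


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:  # Feb 29 landing in a non-leap year
        return day.replace(year=day.year + years, day=28)


def keep_until(award_end: date, published: date,
               funder_years: int = 3, journal_years: int = 6) -> date:
    """Return the later of the funder deadline and the post-publication deadline.

    The default year counts are illustrative assumptions, not policy.
    """
    funder_deadline = add_years(award_end, funder_years)
    journal_deadline = add_years(published, journal_years)
    return max(funder_deadline, journal_deadline)
```

For a grant ending on 31 August 2012 with an article published on 1 May 2013, the funder deadline is 31 August 2015 and the post-publication deadline is 1 May 2019, so the later date wins. Data tied to a misconduct investigation or litigation would need a longer period than this calculation gives.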

I would love to put a definite number on data retention, but there is obviously no easy answer. What is clear is that keeping your data on hand can help dispel charges of scientific misconduct and will satisfy requirements from your funder/university/etc. Plus, you never know when you will need to use a piece of old research data for a current project.

As the value of data continues to increase and data sharing becomes more prevalent, I suspect that long-term retention of digital data will just become a normal part of the research process. Until then, you should keep your research data on hand for a good long time.

Posted in dataManagement, dataStorage

The Absolute Most Important Things to Know in Order to Create a Data Management Plan (Part 1)

I’m currently developing a workshop on creating a data management plan (DMP) and, as part of this development process, I want to identify the absolute most important things to know in order to create a DMP. Part of the reason is that I have a finite amount of time to address DMPs in this session; the other part is that I don’t want to waste people’s time covering less important information.

To start my development process, I’ve come up with a list of some things a researcher might want to know when creating a data management plan:

  • What is a DMP?
  • Why create a DMP?
  • What are the benefits of a DMP (other than getting funding)?
  • What are key parts of a DMP?
  • What information do I need to know for each part of a DMP?
  • What are the specific DMP requirements for my grant program?
  • Where can I find an example DMP from my field?
  • Where can I get help on my DMP?
  • Do I really have to share my data?
  • How will my DMP be assessed?
  • I don’t have NSF funding, why should I care about a DMP?
  • Are there any tools/resources I can use to create my DMP?

From this list, it’s clear that some of these points may be better addressed on a webpage of resources than during an in-person session (i.e., finding DMP requirements, finding example plans, and a list of DMP tools/resources). Other points are simply not a priority to cover.

This leaves me with what I think are the most important things to know for creating a data management plan:

  • Why are researchers being asked to create a DMP (why create a DMP/benefits of a DMP)?
  • What are the key parts of a DMP?
  • How do I apply each of these key parts to my research?

These points also translate nicely into working through an outline DMP during my planned session, meaning researchers will leave the session with something usable and concrete.

With these three points identified, let’s dig into each one a bit more. I’ll cover the first two points in this post and the third in another post in a couple weeks.

 

Why Are Researchers Being Asked to Create a Data Management Plan?

Researchers with NSF and NEH Digital Humanities Directorate (pdf link) funding are currently required to create a data management plan as part of their grant applications. In the next few years, the other federal funders will add similar requirements for DMPs in response to the recent White House OSTP Public Access memo (pdf link). So everyone is getting on the DMP wagon, but the question is why?

From the funder perspective, data represent significant scholarly products that are not being utilized to their full potential (this is especially troubling to funders in the current financial environment). For this reason we are seeing funder mandates for data sharing; the eventual goal is to have massive data sharing akin to the distribution of scholarly articles. The barrier to reaching this goal is the fact that most research data are not well managed and often aren’t maintained past the publication of the associated article. So data management plans are really the first step toward a new way of conducting research because well managed data are more easily shared data.

From the researcher perspective, DMPs are a requirement but also an aid to the research process. I’ve talked about it on this blog before, but deliberate management of data makes it easier to conduct research. Good data management means that researchers are less likely to lose data, more likely to find data when they need it, and can more easily use the data due to better organization and documentation. I’ve even heard it said that one minute of data planning at the start of the project will save 10 minutes of headache later in the project.

The bottom line is: yes, you’re being asked to jump through another hoop in order to get funding, but if you’re already creating a plan why wouldn’t you use it to make your research easier?

 

What Are the Key Parts of a Data Management Plan?

An NSF data management plan must include the following information:

  • The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
  • The standards to be used for data and metadata format and content
  • Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
  • Policies and provisions for re-use, re-distribution, and the production of derivatives
  • Plans for archiving data, samples, and other research products, and for preservation of access to them

The actual DMP requirements will vary from agency to agency and even between directorates within one particular agency, so you’ll want to look up the requirements for your particular grant before you write up your DMP. Still, we can distill NSF’s requirements into some common themes for the composition of a DMP. Basically, your plan should answer the following questions:

  • What types of data will I create?
  • What standards will I use to document the data?
  • How will I archive and preserve the data?
  • How will I protect private/secure/confidential data?
  • How will I provide access to and allow reuse of the data?

These are the key questions you need to ask yourself when creating any data management plan. They represent the many aspects of managing data from creation and documentation through preservation and reuse. By answering these questions, you will come up with a way to manage your data throughout the project.

I’ll go into these 5 questions more in my next post and discuss how to apply each question to your individual research project.

Posted in dataManagement, dataManagementPlans, fundingAgencies