On Data and Copyright

As scientists, we aren’t necessarily trained in copyright. For a long time this hasn’t been a problem, as practices for distributing our scholarly work have been fairly standardized. Open access publishing and data sharing are changing things and providing researchers with a multitude of copyright options beyond just signing over our rights in order to be published. This post looks at some of those options for data.

Data and Copyright

Copyright is confusing, but it becomes even weirder when you apply it to research data. That’s because data are often considered facts, which do not fall under copyright in many countries. A creative compilation of those facts is a different matter: in countries like the US, copyright can apply to the compilation itself. Such variations in copyright law from country to country make it difficult to determine whether you need to worry about copyright on your research data.

In the US, the distinction between facts and a creative compilation of facts was laid out in the 1991 Supreme Court case “Feist Publications, Inc. v. Rural Telephone Service Company, Inc.” This case concerned a compilation of telephone numbers (i.e., a phone book), which was not deemed a creative arrangement. There must be some original selection and rejection in the compilation in order to justify copyright. So a curated database containing research data could be considered a creative arrangement, even though the individual facts in it are not eligible for copyright.

The heterogeneous nature of data adds to the copyright confusion. It’s unclear if original research data that aren’t a stereotypical set of numbers (like image data, video data, etc.) are eligible for copyright, though my inclination is that in some situations they might be.

Confused yet?

Let’s take a step back from this muddle and talk about the two things I think you should know about copyright on datasets (caveat: I am not a copyright expert, so this does not constitute legal advice).

1. You should recognize that your original research data may not be copyrightable, especially if you are based in the US.
2. To avoid any copyright confusion, I strongly recommend applying a clear license to any datasets you share, preferably the CC0 license described below.

I recommend using a Creative Commons license because these licenses are easy to apply, legally enforceable, and becoming popular in scholarly publishing. Creative Commons (CC) itself is a nonprofit organization founded in 2001. It took the idea behind the GNU General Public License (GPL) for open source software and applied it to creative works. It offers several licenses, but I want to look at the two most often discussed for datasets.

Creative Commons Attribution (CC BY)

The Creative Commons Attribution license is the most basic of the CC licenses and the one that is often used on open access articles. If you license something under CC BY, you allow anyone to use and modify your content for any purpose, so long as you are given attribution. Because of the freedom to mine content, CC BY is often considered the best license for open access journal articles and for that reason is required by some funding agencies.

On the surface, CC BY seems like a great license for data because it enables data reuse while still requiring citation, which is always important in research. The problem with this license appears when you aggregate datasets. For example, if you are analyzing a group of 100 datasets to find patterns, under a CC BY license you would need to cite every last dataset in your published article. If you are working with a particularly large database, citation becomes even more difficult because you need to sort out which parts of the database were actually included in the analysis. Using CC BY datasets in aggregate quickly becomes impractical.

The limitations of CC BY licensed data are becoming more apparent as data mining emerges as an important research tool. Ironically, data mining is one of the reasons to want openly licensed data in the first place. So in order to enable easier data mining and reuse, Creative Commons does not recommend the use of CC BY for data.

Creative Commons Zero (CC0)

The Creative Commons Zero license is the only Creative Commons license intended for data. Using a CC0 license means that you waive all of your rights over a dataset, including the attribution requirement, which hinders data mining. This may seem counterintuitive, but recognize that you probably didn’t have those rights to begin with in countries like the US.

The strength of CC0 is that it is explicitly intended for content that is copyrightable in some jurisdictions and not others. It does this by removing all copyright claims universally. In the words of Creative Commons:

CC0 should not be used to mark works already free of known copyright and database restrictions and in the public domain throughout the world. However, it can be used to waive copyright and database rights to the extent you may have these rights in your work under the laws of at least one jurisdiction, even if your work is free of restrictions in others. Doing so clarifies the status of your work unambiguously worldwide and facilitates reuse.

CC0 clears away the confusion on whether a dataset is copyrightable, noncopyrightable, or copyrightable in some countries by applying an open license that is unambiguous and usable worldwide. It also allows for data mining and reuse, which makes it the best license for research datasets.

The other big consideration when using a CC0 license is attribution. Attribution is not required with this license, but that doesn’t mean that you should not cite a dataset. Data Dryad addresses this issue nicely in their FAQ:

CC0 does not exempt those who reuse the data from following community norms for scholarly communication, in particular from citation of the original data authors. On the contrary, by removing unenforceable legal barriers, CC0 facilitates the discovery, reuse, and citation of that data. Any publication that makes substantive reuse of the data is expected to cite both the data package and the original publication from which it was derived.

So while CC0 does not require attribution, community norms do. Community norms and the corresponding ethics of doing research are powerful motivators even when there are no comparable legal requirements in place. All this means is that despite not being required to attribute a dataset, you will still be expected to.

Finally, I will note that CC0 fits into a broader idea that scientific data should be open to encourage the scholarly process, an idea which is outlined by the Panton Principles. The Panton Principles identify CC0 and the Public Domain Dedication and License (PDDL) as the two acceptable options for licensing datasets.

Final Thoughts

Many data repositories are already using the CC0 license: Dryad, figshare (which licenses data under CC0 and all other materials under CC BY), and, just announced this week, BioMed Central, among others. There is definitely growing consensus within the scientific community that CC0 is the preferred license for shared datasets.

Using a CC0 license removes any potential copyright ambiguity and makes it clear that someone else can freely use the licensed dataset. For US-based researchers, it is likely you never had copyright over your data to begin with, but it’s still best to be as explicit as possible that you are not asserting these rights. It makes data that much easier to share and reuse. Just remember that if you come across a CC0-licensed dataset you would like to use, you should cite the data creator even if it is not technically required.
