Defining Data

I’m surprised that I haven’t discussed this on the blog yet, but there is a pretty fundamental question that needs to be addressed in order to discuss data management: what does “data” even mean?

Coming from a scientific background, it’s easy to imagine large tables of numbers as data or files filled with the repetitive A, T, G, and C’s of genetic code, but data regularly defies these stereotypes (particularly in non-science disciplines). Data can be videos, large collections of text, images, tweets, geospatial information, etc. What can be used as data is only limited by the research question and the creativity of the researcher.

Though research data can be a lot of things, it is still useful to define the term so we know what information needs management. So here is my working definition of data: anything upon which you can perform analysis. It’s a wide definition, but there are so many types of research out there that anything narrower won’t apply to all of them. Despite the broad definition, it is still possible to break the diversity of data into four general types.

What Data Are

Data are often categorized into the following four groups: observational data, experimental data, simulation data, and derived/compiled data. Not only are the data in each group different, but the way you should manage each type also differs. Let’s go through each group now.

Observational data are tied to a time and place and are a record of something that occurred there. This type of data includes everything from bird counts, to polling data, to weather sensor data, to recordings of dance performances. Proper management of these data is critical because the information cannot be reproduced.

Experimental data are created under a particular set of conditions that are (hopefully) reproducible. This type of data covers everything from gene sequences, to chromatography data, to measurements from the Large Hadron Collider, to psychology studies. Good data management is important here too because, depending on the experiment, it can be very expensive to reproduce data. Experimental data also require good documentation so that, should the need arise, the data can be accurately reproduced.

Simulation data are created using models and code. This type of data covers everything from climate models, to economic models, to simulations of experiments/experimental data.  In this group, it is more important to preserve the code that created the data than the data themselves, as the data can be recreated from the code.

Finally, derived/compiled data are compilations of other datasets that can be used for new types of analysis. This type of data covers databases, large corpora used for text mining, collections of images, etc. Standard data management applies, but you’re more likely to run up against data size concerns and licensing/copyright issues with this data type than with the others.

What Data Aren’t

I defined what data are but I think it’s also important to talk about what data aren’t. The OMB Circular A-110 contains a nice round-up of what the government does not consider to be research data.

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This “recorded” material excludes physical objects (e.g., laboratory samples). Research data also do not include:

(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and

(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

That doesn’t mean that these things aren’t important to manage and preserve, just that funders don’t consider them to be data for the purposes of sharing and other policies. I will also point out that while lab notebooks are not technically data nor do they need to be shared, they contain important information that gives context to data and should therefore be preserved alongside any data that they describe. The ultimate point is that research is built on multiple information sources, each with its own information management need, but not all of these sources fall under the umbrella of “data”.

Final Thoughts

It’s important to recognize that the term “data” is more broadly applicable than you may think. Something you would not consider to be data can be the critical foundation for research in another field. But that doesn’t mean that the term “data” applies to all research materials.

The broad definition of data goes hand in hand with the realization that not all data should be managed in the same way. Understanding the diversity and nuances of data allows us to make good management decisions to better preserve data and make research more reproducible.


A Note on NonCommercial Licenses

I wrote about the best licenses for datasets in my previous post and I want to add to that information by pointing out two potentially problematic Creative Commons licenses for research products, data and publications alike: CC Attribution-NonCommercial (CC BY-NC) and CC Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)*. These are the two Creative Commons noncommercial licenses.

There are a couple of reasons to think twice before using this class of licenses. The first is that the meaning of noncommercial is unclear. You are preventing anyone from freely using CC-NC content for profit, which obviously covers corporations but can also include groups like nonprofits. For example, a nonprofit may use your content for promotional material intended to increase its membership; an increase in member dues can be considered a financial gain that is not allowed under a CC-NC license. There are a lot of ambiguities here, and they are laid out well in this article on noncommercial licensing in biology. The important thing is to be aware that you are excluding more uses than you may realize when you apply a noncommercial license.

The other reason to hesitate before applying a noncommercial license is that publishers can still make a profit from this content even though it is “open access”. Licensing something under a CC-NC license doesn’t mean that it can’t be used commercially, only that it can’t be used commercially for free. The content can still be used in a commercial setting if you pay for permission, just like traditional content. At least one publisher is guiding its authors toward this “open access” license while simultaneously charging others to use the content commercially; the profit rarely goes back to the author.

Noncommercial licenses are not recommended for data, for reasons expressed in my previous post, but are probably not ideal for your other research products either. I’m not saying that you shouldn’t use a noncommercial license, only that you should be aware of the limitations of these licenses before consciously applying them to your research products.

 

* “Share Alike” licenses require any derivative products to be similarly licensed. This is a “copyleft” style license, meant to make the content and all its derivatives free in perpetuity.


On Data and Copyright

As scientists, we aren’t necessarily trained in copyright. For a long time this hasn’t been a problem, as practices for distributing our scholarly work have been fairly standardized. Open access publishing and data sharing are changing things and providing researchers with a multitude of copyright options beyond just signing over our rights in order to be published. This post looks at some of those options for data.

Data and Copyright

Copyright is confusing, but it becomes even weirder when you apply it to research data. That’s because data are often considered facts, which do not fall under copyright in many countries; create a creative compilation of those facts, however, and copyright can then apply in countries like the US. Such variations in copyright law from country to country make it difficult to determine whether you need to worry about copyright on your research data.

In the US, the distinction between facts and a creative compilation of facts was laid out in the case “Feist Publications, Inc. v. Rural Telephone Service Company, Inc.” That case concerned a compilation of telephone numbers (i.e., a phone book), which was not deemed a creative arrangement; there must be some original selection and arrangement in the compilation in order to justify copyright. So a curated database containing research data could be considered a creative arrangement, even though the individual facts it contains are not eligible for copyright.

The heterogeneous nature of data adds to the copyright confusion. It’s unclear if original research data that aren’t a stereotypical set of numbers (like image data, video data, etc.) are eligible for copyright, though my inclination is that in some situations they might be.

Confused yet?

Let’s take a step back from this muddle and talk about the two things I think you should know about copyright on datasets (caveat: I am not a copyright expert, so this does not constitute legal advice).

  1. You should recognize that your original research data may not be copyrightable, especially if you are based in the US.
  2. To avoid any copyright confusion, I strongly recommend applying a clear license to any datasets you share—preferably the CC0 license described below.

I recommend using a Creative Commons license because these licenses are easy to apply, legally enforceable, and increasingly popular in scholarly publishing. Creative Commons (CC) itself is a nonprofit organization founded in 2001; it took the idea behind the GNU GPL for open source software and applied it to creative works. CC offers several licenses, but I want to look at the two most often discussed for datasets.

Creative Commons Attribution (CC BY)

The Creative Commons Attribution license is the most basic of the CC licenses and the one that is often used on open access articles. If you license something under CC BY, you allow anyone to use and modify your content for any purpose, so long as you are given attribution. Because of the freedom to mine content, CC BY is often considered the best license for open access journal articles and for that reason is required by some funding agencies.

On the surface, CC BY seems like a great license for data because it enables data reuse while still requiring citation, which is always important in research. The problem with this license appears when you aggregate datasets. For example, if you are analyzing a group of 100 datasets to find patterns, under a CC BY license you would need to cite every last dataset in your published article. If you have a particularly large database, citation becomes even more difficult because you need to sort through which parts of the database were actually included in the analysis. Using CC BY datasets in aggregate is obviously problematic.

The limitations of CC BY licensed data are becoming more apparent as data mining emerges as an important research tool. Ironically, data mining is one of the reasons to want openly licensed data in the first place. So in order to enable easier data mining and reuse, Creative Commons does not recommend the use of CC BY for data.

Creative Commons Zero (CC0)

The Creative Commons Zero license is the only Creative Commons license intended for data. Using a CC0 license means that you waive all of your rights over a dataset, including the right to require attribution, a requirement that can hinder data mining. This may seem counter-intuitive, but recognize that you probably didn’t have those rights to begin with in countries like the US.

The strength of CC0 is that it is explicitly intended for content that is copyrightable in some jurisdictions and not in others; it resolves that ambiguity by waiving all copyright claims worldwide. In the words of Creative Commons:

CC0 should not be used to mark works already free of known copyright and database restrictions and in the public domain throughout the world. However, it can be used to waive copyright and database rights to the extent you may have these rights in your work under the laws of at least one jurisdiction, even if your work is free of restrictions in others. Doing so clarifies the status of your work unambiguously worldwide and facilitates reuse.

CC0 clears away the confusion on whether a dataset is copyrightable, noncopyrightable, or copyrightable in some countries by applying an open license that is unambiguous and usable worldwide. It also allows for data mining and reuse, which makes it the best license for research datasets.

The other big consideration when using a CC0 license is attribution. Attribution is not required with this license, but that doesn’t mean that you should not cite a dataset. Dryad addresses this issue nicely in its FAQ:

CC0 does not exempt those who reuse the data from following community norms for scholarly communication, in particular from citation of the original data authors. On the contrary, by removing unenforceable legal barriers, CC0 facilitates the discovery, reuse, and citation of that data. Any publication that makes substantive reuse of the data is expected to cite both the data package and the original publication from which it was derived.

So while CC0 does not require attribution, community norms do. Community norms and the corresponding ethics of doing research are powerful motivators even when there are no comparable legal requirements in place. All this means is that despite not being required to attribute a dataset, you will still be expected to.

Finally, I will note that CC0 fits into a broader idea that scientific data should be open to encourage the scholarly process, an idea outlined by the Panton Principles. The Panton Principles identify CC0 and the Public Domain Dedication and License (PDDL) as the two acceptable options for licensing datasets.

Final Thoughts

Many data repositories and publishers are already using the CC0 license: Dryad, figshare (which licenses data under CC0 and all other materials under CC BY), and, just announced this week, BioMed Central, among others. There is definitely a growing consensus within the scientific community that CC0 is the preferred license for shared datasets.

Using a CC0 license removes any potential copyright ambiguity and makes it clear that someone else can freely use the licensed dataset. For US-based researchers, it is likely you never had copyright over your data to begin with, but it’s still best to be as explicit as possible that you are not exerting these rights. It makes data that much easier to share and reuse. Just remember that if you come across a CC0-licensed dataset you would like to use, you should cite the data creator even if it is not technically required.
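
If you are sharing a dataset yourself rather than through a repository, one easy way to remove that ambiguity is to ship a license note alongside the data files. Below is a minimal sketch of what that could look like; the folder and file names are hypothetical, and most repositories will handle this step for you when you deposit.

```python
from pathlib import Path

# Hypothetical dataset folder; adjust the name to your own project.
dataset_dir = Path("my_shared_dataset")
dataset_dir.mkdir(exist_ok=True)

# A short, human-readable note pointing to the official CC0 dedication.
license_note = (
    "License: CC0 1.0 Universal (Public Domain Dedication)\n"
    "https://creativecommons.org/publicdomain/zero/1.0/\n"
    "\n"
    "To the extent possible under law, the author has waived all copyright\n"
    "and related rights to this dataset. Attribution is not legally required,\n"
    "but please cite the dataset if you reuse it.\n"
)

(dataset_dir / "LICENSE.txt").write_text(license_note)
```

A repository like Dryad or figshare applies the CC0 marking for you, which is one more reason to deposit your data rather than host it yourself.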

 

Resources:

Elliott, R. (2005). Who owns scientific data? The impact of intellectual property rights on the scientific publication chain. Learned Publishing, 18(2), 91-94.

Murray-Rust, P. (2008). Open Data in Science. Serials Review, 34(1), 52-64.


The Absolute Most Important Things to Know in Order to Create a Data Management Plan (Part 2)

The last time I wrote about data management plans, I covered reasons that funders are starting to expect researchers to manage their data better and the 5 topics that make up most data management plans. The topics are as follows:

  • What types of data will I create?
  • What standards will I use to document the data?
  • How will I protect private/secure/confidential data?
  • How will I archive and preserve the data?
  • How will I provide access to and allow reuse of the data?

In this post, I want to dig into each of these topics a bit more.

 

1. What data will I create?

This is the background section of your data management plan, where you will provide an overview of your data and some of the most basic information on managing it. In general, you’ll want to answer the following questions:

  • What data will be collected?
  • Are my data unique? Are my data derived from existing data and are those data still available?
  • How big will my data be? How fast will my data grow?
  • How will my data be stored?
  • Who owns and is responsible for the data?

A lot of the content of this section will be specific to your particular project, but there are some common themes to look out for.

First, you should consider how unique your data are. The management of observational data, for example, should be prioritized because those data are so tied to a time and a place that they cannot be recreated if lost. Simulation data, on the other hand, are easy to recreate; management here should focus on the code that generates the data rather than the data themselves.

When storing data, the motto is: Lots of Copies Keep Stuff Safe (or LOCKSS, which is also the name of a preservation tool). Plan to follow the rule of 3, which calls for 2 onsite copies and 1 offsite copy. Automate your backups whenever you can.
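
To make the rule of 3 concrete, here is a minimal sketch of a backup script; the paths are hypothetical, and you would point them at your own working copy, an external drive, and an offsite (networked or cloud) location, or simply use a dedicated backup tool that does the same thing.

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical locations: one working copy plus two backup destinations
# (a second onsite copy and one offsite copy, for three copies in total).
SOURCE = Path("C:/research/project_data")
ONSITE_BACKUP = Path("D:/backups/project_data")    # external drive
OFFSITE_BACKUP = Path("Z:/offsite/project_data")   # networked or cloud storage

def backup(source, destinations):
    """Copy the whole data folder to each backup location, stamped with today's date."""
    stamp = date.today().isoformat()
    for dest in destinations:
        target = dest / f"backup_{stamp}"
        shutil.copytree(source, target)
        print(f"Copied {source} -> {target}")

if __name__ == "__main__":
    backup(SOURCE, [ONSITE_BACKUP, OFFSITE_BACKUP])
```

Scheduling a script like this with cron (or Task Scheduler on Windows) is one way to take the “automate your backups” advice literally.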

Finally, you’ll want to designate someone who will be responsible for the data; usually this is the PI. Be aware that the responsible party might not be the data owner, as I mentioned in an earlier post on how complicated data ownership can be.

 

2. What standards will I use to document the data?

Documentation is a key part of data management, as I have mentioned several times on this blog already. In this section of your plan, you will want to cover:

  • Are there any community standards for documentation, such as an ontology or metadata schema?
  • How will I document and organize my data? What metadata schema will I use?
  • How will I document my methods and other information needed for reproducibility?

You’ll need some sort of documentation system no matter what, but you should really consider using a formal schema if you want to or are required to share your data. Formal metadata schemas document the context of a dataset in a standardized way, allowing datasets to be easily shared and interpreted by other parties.

If you decide to use a formal schema to document your data, it’s best to choose the schema before you collect your data so you know exactly what information to record. This is especially important if you know that you’ll be depositing your data in a particular repository. Take 2 minutes to look up this information before you acquire your data and save yourself a huge headache later when you go to deposit your data.
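
To give a feel for what this looks like, here is a hypothetical, stripped-down metadata record using a few Dublin Core-style fields, written out as JSON; the actual fields and format should come from whatever schema your repository or community uses.

```python
import json

# Hypothetical record with a few Dublin Core-style fields.
# The real field list should come from your chosen metadata schema.
metadata = {
    "title": "Hourly air temperature, Field Site A, 2012",
    "creator": "Jane Researcher",
    "date": "2012-01-01/2012-12-31",
    "description": "Air temperature logged every hour by a sensor at Field Site A.",
    "format": "text/csv",
    "subject": ["temperature", "microclimate", "field observations"],
    "rights": "CC0 1.0",
}

with open("site_a_temperature_2012_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Even if your repository wants the metadata in a different container (XML is common), deciding on the fields before data collection is the part that saves the headache.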

Besides looking at a disciplinary repository for the best documentation scheme, you can also consult your peers and your subject librarian. Be aware that your field might have not only metadata schemas, but also ontologies or taxonomies that will help you classify your datasets.

Finally, you should think about the other information that lets you understand your data and the methods by which you collected and interpreted them: code, surveys, codebooks, data dictionaries, and so on. This information not only adds context to your dataset but also makes it more trustworthy.
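
A data dictionary is one of the simplest ways to capture this kind of context. Here is a rough sketch of one written out as a small .csv file; the variable names, units, and allowed values are invented for illustration, so substitute your own.

```python
import csv

# Hypothetical data dictionary: one row per column in the dataset.
data_dictionary = [
    {"variable": "site_id", "description": "Unique identifier for the field site",
     "units": "", "allowed_values": "A01 through A20"},
    {"variable": "timestamp", "description": "Date and time of the measurement",
     "units": "ISO 8601", "allowed_values": ""},
    {"variable": "temp_c", "description": "Air temperature",
     "units": "degrees Celsius", "allowed_values": "-40 to 60"},
    {"variable": "qc_flag", "description": "Quality control flag",
     "units": "", "allowed_values": "0 = good, 1 = suspect"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "description", "units", "allowed_values"])
    writer.writeheader()
    writer.writerows(data_dictionary)
```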

 

3. How will I protect private/secure/confidential data?

This section will not apply to all data plans, but is critical if it applies to you. Some of the issues you will need to address are:

  • What regulations apply to my data (HIPAA, FERPA, FISMA, etc.)?
  • What security measures will I put in place to protect my data?
  • Who is allowed access to my data?
  • Who will be responsible for data security?
  • Will my data lead to a patent or other intellectual property claim?

The best thing to do if you have data that fall under one of the listed regulations, local IRB constraints, or intellectual property claims is to talk to someone at your local institution. Almost all research institutions have policies, as well as support systems, for dealing with these issues. Data security is not the place where you want to cobble something together and hope it works (that can ruin careers).

Find your local experts. Cite your local policy. Designate someone to keep on top of this.

 

4. How will I archive and preserve the data?

This section addresses one of the main reasons researchers are being asked to create data management plans: so their data outlive publication of the corresponding research article. The topics you should discuss are the following:

  • How long will I retain the data?
  • What file formats will I use? Do I need to preserve any software?
  • Where will I archive my data?
  • Who will be responsible for my data in the long term?

I addressed retention times and how to preserve data in my previous two blog posts, so I won’t go into those topics here.

What I will say is that usually the best method of preserving your data is to find a trustworthy partner to do it for you. A few good options are a disciplinary data repository, an institutional repository, or a journal that accepts data. Local servers come and go, whereas a repository’s mission is to keep things for a long period of time. You worry about the science and let them worry about the data.

 

5. How will I provide access to and allow reuse of the data?

The final portion of a data management plan is necessary for grant programs that require data sharing. If that condition applies to you, you should address the following questions:

  • Is there a relevant sharing policy?
  • Who is the audience for my data?
  • When and where will I make my data available? Do I have resources for hosting the data myself?

In addition to looking at funding agency and directorate policies that require sharing, there are a growing number of journals that require data sharing as a condition for publication.

I will give the same advice for data sharing as I did for data preservation: let someone else worry about this. It takes much more work (and is more expensive) to make your data available by request or on your own website than to hand it over to a repository to manage for you. Additionally, your data will be easier to find in a repository than on your website, making it more likely to be cited!

 

Data Management Plan Checklist

In addition to blogging about these key questions, I have also made them into a handy .pdf checklist to use while working through your data management plans. The checklist is intended for researchers at my institution but is still useful for others. It’s CC-BY licensed, so feel free to use and share!

 

Final Thoughts on Data Management Plans

Data management plans are going to be a standard part of any federally funded grant application due to stipulations from the recent White House public access memo. The exact requirements for each plan will vary between agencies and directorates, but there will be some common themes between plans—themes that have been elaborated above.

The one thing I haven’t been able to touch on deeply in this post is the actual set of data management practices that underpin a good data management plan. That topic requires a whole blog to cover. I will say that if you answer the above questions and customize your plan to the project at hand, you’ll have a good start on your data management plan.

I hope these two posts have clarified the growing importance of data management plans and what goes into them. Data management plans are here to stay, but they will become easier to write as we get more used to preserving and sharing digital data.


Keeping Data Long-Term

Last week, I discussed some of the policy requirements for retaining research data, and I’d like to follow up by discussing how one goes about retaining research data for 10+ years. It’s a sad fact that many of us have digital files from 10 years ago that we can no longer open or read. As common as unreadable old files are, we should not have to accept this fate for our research data.

The problem is that digital information is not like a book, which can be put on a shelf for 10 years and forgotten yet still be readable when you come back to it. Digital information requires upkeep so you can actually open and use your files 10 years into the future. Digital preservation also requires a little planning up front. The rule of thumb about data management really holds true here: 1 minute of planning now will save you 10 minutes of headache later.

There is a whole field dedicated to digital preservation, but I’d like to discuss a couple of easy practices that will make it much easier for you to use your data 10 years from now. Because, as this recent study shows, you never know when you’ll be out drinking with your research buddies and realize that the data you took 10 years ago could be repurposed for an awesome study (true story).

Do Immediately: Convert File Formats

One of the easiest things you can do to save your files for the future is to convert them to open file formats. The best formats are open, standardized, well-documented, and in wide use; examples include: .csv, .tiff, .txt, .dbf, and .pdf. Avoid proprietary file types whenever you can.

By choosing an open file format, you’re doing a lot to ensure that your files will be readable down the road. For example, a lot of people have invested their information in .pdf files, meaning that there will be a need to read .pdf files well into the future; your .pdf data will be safer because of this. Likewise, saving spreadsheet data as .csv instead of .xlsx means that your files aren’t tied to the fate of one particular software package.
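
As a rough sketch of what that conversion can look like, the snippet below reads a hypothetical Excel workbook with the pandas library and writes each sheet out as its own .csv file; if your data live in some other proprietary format, the idea is the same with whatever export option that software provides.

```python
import pandas as pd  # reading .xlsx files also requires the openpyxl package

# Hypothetical workbook; every sheet becomes a separate, openly formatted .csv file.
workbook = "field_measurements.xlsx"
sheets = pd.read_excel(workbook, sheet_name=None)  # dict of {sheet name: DataFrame}

for name, df in sheets.items():
    outfile = f"field_measurements_{name}.csv"
    df.to_csv(outfile, index=False)
    print(f"Wrote {outfile}")
```

Remember to keep the original .xlsx alongside the .csv copies, since formulas and formatting don’t survive the conversion.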

I will also note that even when you convert your files, it’s a good idea to keep copies in both the old and new formats, just in case. You can sometimes lose functionality and formatting through the conversion, so it’s preferable to have the original files on hand if you can still read them.

Sometimes, it’s just not possible to convert to an open file format or retain the desired functionality. In this case, you’ll need to preserve any software and hardware necessary to open and interpret your digital files. This is more work than simple conversion, but can definitely save you a headache down the road.

The most important thing about converting file formats is to do it now instead of later. For example, even though I finished my PhD only a few years ago, a lot of my dissertation data is inaccessible to me because I no longer have access to the software program Igor Pro. If I had only converted my files to .csv’s before I left the lab, I would still be able to use my data if I needed to.

Do Periodically: Update Your Media

Beyond converting files to open formats, it’s also important to periodically examine the media that those files live on. It’s no use having a bunch of converted .txt files if they’re living on a floppy disk. I could probably track down a floppy disk drive if I had to, but it would have been a lot easier to use those .txt files if I had moved them to a CD a few years ago.

Updating your media is not something you’ll need to do frequently, but you should pay attention to the general ebb and flow of storage types. Being aware of such things will remind you to transfer your video interview data, for example, from VHS to DVD before you lose all access to a VCR.

When in doubt, there are places to send your old media for recovery, but be aware that it will cost you. It’s much easier, and cheaper, to just update your media periodically as you go along.

Always a Good Idea: Documentation

Finally, documenting your data goes a long way toward ensuring that it is usable in the future. This is because scanning through a file is usually not enough to understand what you’re looking at or how the data were acquired. Documentation provides the context of a dataset and allows for the data to truly be usable.

Documentation becomes more important when preserving data for the long term because you’re likely to forget the context of a dataset in 10 years. You’re much more likely to have the information you need if you document your dataset while you are acquiring it. So while it’s not directly related to the logistics of keeping files readable, documentation is a critical part of preserving data for future use.

Even though I already have a whole post about documentation, I’m sure I will keep talking about it on this blog because it’s so important to good data management.

A Final Thought

I will end with this thought: you may be required to keep your data on hand for 7 years post-publication or post-grant, but what is the point of keeping them if you have no way of reading them? By taking the simple steps of converting your files to open formats, periodically moving everything to more modern media, and maintaining good documentation habits, you’ll be doing yourself a huge favor for when you need your data 10 years after you acquired them.


How Long Should You Keep Data?

Retraction Watch wrote an interesting post this week about the partial retraction of a 2007 article because the authors could not provide the original research data. The retraction concerns a figure in which two panels appear to show the same information. Because the authors could not provide the original data to either prove or disprove this concern, the figure was retracted.

What matters here is not whether or not there was an actual error (we don’t know the answer), but that the absence of original research data led to a retraction. In how many other cases has there been concern about research results that can’t be addressed for lack of original data? Retraction is one possible outcome in such situations and this should raise an alarm for many researchers.

All of this leads to a very interesting question: for how long does a researcher need to retain their original data after the corresponding article is published?

The article in question would put that time at 6 years or more, but it’s not always clear what the ideal number should be. Even when you can find explicit retention policies, you’ll likely find a different time duration in each one. The many examples below illustrate how varied data retention policies are.

Data Retention Policies

One source of retention policies is universities, whose policies are perhaps the most heterogeneous. For example, the University of Kentucky expects its researchers to keep data “for a period of five years after publication or submission of the final report […], whichever is longer” (source), while Northwestern University data “must be retained for a minimum of three years after the financial report for the project period has been submitted” (source). These university policies really run the gamut: I’ve seen universities that mandate data retention for at least 7 years, universities that have no research data policies at all, and universities that do not provide a clear retention time even though they have other data policies.

Funders are another source for data retention policies. The Engineering directorate of the NSF states that the mandated retention time “is three years after conclusion of the award or three years after public release, whichever is later” (pdf source). NIH also expects its fundees to keep records for at least three years after the final grant report is submitted (source). Outside the US, many UK funders have required retention times of at least 10 years (pdf source) and the Australian Code for the Responsible Conduct of Research stipulates a general data retention time of 5 years, which increases to 15 years for clinical data (pdf source).

Finally, several US government groups have policies on research data retention. The OMB circular A-110 from the White House states that data “shall be retained for a period of three years from the date of submission of the final expenditure report” (source). The Office of Research Integrity (ORI) states that three years is a commonly cited number, but it’s often not that simple (source). Case in point, a recent ORI investigation apparently set a 6-year limit on addressing research misconduct (source), meaning that you should really keep your data around for at least 6 years post-publication.

In some sense, journals have the ultimate power in the data retention issue because, even if your funder states a retention time of three years, a journal may expect you to produce the data 6 or more years post-publication or risk retraction. So even though the majority of journals do not have explicit data retention policies, they are effectively the ones setting the retention duration by asking you to produce the original data many years post-publication. This means that you should err on the side of longer retention times.

Some Final Thoughts

What conclusions can we draw from this mess of numbers? First, if you are conducting federally funded research in the US, you must keep data for three years after the completion of the grant at the absolute minimum. In practice, you should keep your data for longer.

Based on the highlighted retraction and the policies described above, I would say that the minimum time you need to keep your data around is currently somewhere between 6 and 10 years post-publication. This may change in the future. I will also note that any data being used in misconduct investigations or litigation (as for patents, etc.) should be retained for a longer period of time.

I would love to put a definite number on data retention, but there is obviously no easy answer. What is clear is that keeping your data on hand can help dispel charges of scientific misconduct and will satisfy requirements from your funder/university/etc. Plus, you never know when you will need to use a piece of old research data for a current project.

As the value of data continues to increase and data sharing becomes more prevalent, I suspect that long-term retention of digital data will just become a normal part of the research process. Until then, you should keep your research data on hand for a good long time.
