Defining Data

I’m surprised that I haven’t discussed this on the blog yet, but there is a pretty fundamental question that needs to be addressed in order to discuss data management: what does “data” even mean?

Coming from a scientific background, it’s easy to imagine data as large tables of numbers or as files filled with the repetitive As, Ts, Gs, and Cs of genetic code, but data regularly defy these stereotypes (particularly in non-science disciplines). Data can be videos, large collections of text, images, tweets, geospatial information, etc. What can be used as data is limited only by the research question and the creativity of the researcher.

Though research data can be a lot of things, it is still useful to define the term so we know what information needs management. So here is my working definition of data: anything that you can perform analysis upon. It’s a wide definition, but there are so many types of research out there that anything narrower won’t apply. Despite the broad definition, it is still possible to break the diversity of data into four general types.

What Data Are

Data are often categorized into the following four groups: observational data, experimental data, simulation data, and derived/compiled data. Not only are the data in each group different, but the way that you should manage each type differs. Let’s go through each group now.

Observational data are tied to a time and place and are a record of something that occurred there. This type of data includes everything from bird counts, to polling data, to weather sensor data, to recordings of dance performances. Proper management of these data is critical because the information is not reproducible: once the moment has passed, it cannot be observed again.

Experimental data are created under a particular set of conditions that are (hopefully) reproducible. This type of data covers everything from gene sequences, to chromatography data, to measurements from the Large Hadron Collider, to results from psychology studies. Good data management is important here too because, depending on the experiment, it can be very expensive to reproduce data. Experimental data also require good documentation so that, should the need arise, the data can be accurately reproduced.

Simulation data are created using models and code. This type of data covers everything from climate models, to economic models, to simulations of experiments/experimental data. In this group, it is more important to preserve the code that created the data (along with the inputs and settings it was run with) than the data themselves, as the data can be recreated from the code.
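To make the point about simulation data concrete, here is a minimal sketch in Python. It is a toy example of my own, not taken from any real project: the function name `simulate_random_walk` and its `steps` and `seed` parameters are hypothetical. It shows that, as long as the code and its settings (here, the random seed) are preserved, the same output can be regenerated on demand.

```python
import random


def simulate_random_walk(steps, seed):
    """Toy simulation: a one-dimensional random walk.

    Both parameters are illustrative. The seed is what makes
    the run repeatable, so it should be recorded with the code.
    """
    rng = random.Random(seed)  # private generator, isolated from global state
    position = 0
    path = []
    for _ in range(steps):
        position += rng.choice((-1, 1))  # step left or right
        path.append(position)
    return path


# Running the same code with the same settings recreates the same data.
first_run = simulate_random_walk(steps=1000, seed=42)
second_run = simulate_random_walk(steps=1000, seed=42)
print(first_run == second_run)  # True
```

Real simulations are rarely this tidy. Library versions, hardware, and long run times can all make regeneration harder or more expensive, so it is worth judging case by case whether keeping the output as well makes sense.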

Finally, derived/compiled data are compilations of other datasets that can be used for new types of analysis. This type of data covers databases, large corpora used for text mining, collections of images, etc. Standard data management applies, but you’re also more likely to run up against data size concerns and licensing/copyright issues with this data type than with the others.

What Data Aren’t

I’ve defined what data are, but I think it’s also important to talk about what data aren’t. OMB Circular A-110 contains a nice round-up of what the U.S. federal government does not consider to be research data:

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This “recorded” material excludes physical objects (e.g., laboratory samples). Research data also do not include:

(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and

(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

That doesn’t mean that these things aren’t important to manage and preserve, just that funders don’t consider them to be data for the purposes of sharing and other policies. I will also point out that while lab notebooks are not technically data and do not need to be shared, they contain important information that gives context to data and should therefore be preserved alongside any data that they describe. The ultimate point is that research is built on multiple information sources, each with its own information management needs, but not all of these sources fall under the umbrella of “data”.

Final Thoughts

It’s important to recognize that the term “data” is more broadly applicable than you may think. Something you would not consider to be data can be the critical foundation for research in another field. But that doesn’t mean that the term “data” applies to all research materials.

The broad definition of data goes hand in hand with the realization that not all data should be managed in the same way. Understanding the diversity and nuances of data allows us to make good management decisions to better preserve data and make research more reproducible.
