Have you ever had that experience where you’re analyzing a dataset and realize you should have written down some extra bit of information when you took the data? Sometimes this is just a minor inconvenience, but sometimes it makes your data unusable. The experience of missing information shows that the context of our data is just as important as the data themselves. Data aren’t usable on their own.
One of the best things you can do to manage your data better is simply to describe them better. Recording information such as acquisition conditions, acquisition date, a brief description of the data, and the file name makes it easier for you to find and use those data later.
(In the library realm, we call any information about a dataset that isn’t the dataset itself ‘metadata’. Metadata literally means data about data. It’s a weird word, but the concept is not wholly unfamiliar: most researchers already record metadata in their research/laboratory notebooks. It’s whatever you write down that isn’t your actual dataset.)
Contextual information about data runs the spectrum from the informal scribblings in a laboratory notebook to the highly standardized schemas, written in XML, that accompany digital data files. While both informal and formal approaches have their places in research, I’m an advocate for some amount of standardization.
Standardizing the information you record about your data helps you in several ways. First, it reminds you to record all of the necessary information about your data. Second, it helps you find datasets, because it’s easier to search through organized information. Third, standardization helps your colleagues understand your data, which is useful both during collaboration and after you leave a laboratory. Finally, standardization can be personalized and doesn’t have to be rigid. It should fit easily into your workflow and be adaptable enough to respond to any changes in your research.
To standardize the information you record about your data, you need to reflect on what is important about your data. This is likely to be different for different types of data. Once you’ve done some brainstorming, write down a list of the things that you should record each time you acquire that type of data. You can then type this up into a table, make a bunch of copies, and use them in your notebook or post a cheat sheet on your lab bench near where you take notes. Don’t be afraid to adapt this list as your needs change.
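If you keep notes digitally, the same checklist idea can be turned into a tiny script that flags missing information before you move on. This is just a minimal sketch; the field names below are examples I made up, and you would swap in whatever matters for your own data:

```python
# Hypothetical example fields for one type of data; adapt this list
# to whatever you decided was important during your brainstorming.
REQUIRED_FIELDS = [
    "date",
    "description",
    "acquisition_conditions",
    "file_name",
]

def missing_fields(record):
    """Return the template fields that a metadata record leaves blank."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

# A partially filled record for one acquisition.
record = {
    "date": "2013-04-15",
    "description": "Control image, standard staining",
    "file_name": "IMG00057.jpg",
}

print(missing_fields(record))  # -> ['acquisition_conditions']
```

The point isn’t the code itself but the habit it enforces: every record gets checked against the same list, so nothing is forgotten in the moment.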
Another standardization option is using a formal list, or schema, from your field. There are a ton of schemas out there, so I recommend consulting with your colleagues or your local reference librarian on what people use in your field. The nice thing about using such a formal list is that it identifies information that your community finds useful. It’s likely that you’ll find this information useful as well.
As an example of how to use a standard list, let’s look at the generic schema called Dublin Core. I really like Dublin Core because it lays out the most basic information one should record about an object. Here are Dublin Core’s categories:
- contributor
- coverage
- creator
- date
- description
- format
- identifier
- language
- publisher
- relation
- rights
- source
- subject
- title
- type
This selection of categories works for images, physical samples, spreadsheet data, text, and whatever else you need to describe. Some of these categories may be less useful depending on the project, but it’s still a nice starting point.
So let’s take this list and apply it to a fictitious microscope image:
- contributor – Jane Collaborator
- creator – Kristin Briney
- date – 2013 Apr 15
- description – A microscopy image of cancerous breast tissues under 20x zoom. This image is my control, so it has only the standard staining described on 2013 Feb 2 in my notebook.
- format – JPEG
- identifier – IMG00057.jpg
- relation – Same sample as images IMG00056.jpg and IMG00055.jpg
- subject – Breast cancer
- title – Cancerous breast tissue control
Even without using all of the Dublin Core categories, this gives you a pretty good sense of what is in my fictitious dataset. (If this were my real dataset, I would probably expand on the acquisition conditions using a subject-specific schema like OME-XML or DICOM.) Good data descriptions should stand on their own, meaning you shouldn’t have to look at the data to know what they are. The goal is to record all of the necessary information so that someone, myself included, can find and understand the data later. A formal list makes complete description easier and is not difficult to implement in the laboratory.
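If you ever want that same record in a machine-readable form, the Dublin Core elements have a standard XML namespace (the DCMI element set at `http://purl.org/dc/elements/1.1/`). Here’s a minimal sketch using only Python’s standard library, with values taken from the fictitious example above:

```python
import xml.etree.ElementTree as ET

# The standard Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# A few fields from the fictitious microscope-image record above.
record = {
    "creator": "Kristin Briney",
    "date": "2013-04-15",
    "format": "JPEG",
    "identifier": "IMG00057.jpg",
    "title": "Cancerous breast tissue control",
}

# Wrap each field in a namespaced Dublin Core element.
root = ET.Element("metadata")
for element, value in record.items():
    child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
    child.text = value

print(ET.tostring(root, encoding="unicode"))
```

You almost certainly don’t need this level of formality in a lab notebook, but it shows how a simple standardized list scales up to the XML files that repositories and archives expect.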
Nothing is more frustrating than trying to reconstruct older datasets from partial notes. This happened to me in grad school, and it’s now my goal for no one else to experience this frustration. The way to prevent it is by properly describing your datasets when you collect them. Standardization makes this easier, but any good data description will enable you to better manage and use your data.