Zero v. Null

[Image: “Null Schatten / Zero shade” by Winfried, https://www.flickr.com/photos/w-tommerdich/3630153810 (CC BY-NC-SA)]

An important aspect of data management is performing quality control on your data. This means checking your data for errors, ensuring consistent formatting, documenting the meaning of your variables, etc. Also under the umbrella of quality control is how to represent the absence of value in a dataset.

The absence of a value can mean many things in a dataset: a true zero, a data point that is missing, a data point that is not applicable to this entry, and so on. Unfortunately, many of these cases end up with the same label in a dataset (or several different labels for any one case!), whether that is “0”, a blank entry, “NA”, or something else. And when it comes to calculating values like averages, there is a big difference between a “0” that is a true zero and a “0” that is a placeholder for missing data. Therefore, we need to establish some best practices around the absence of value.

The first rule is that “0” always represents true zero and nothing else. This means that you’ve made a measurement and that measurement happens to be zero. Only using “0” for this case makes your subsequent calculations accurate.

The second rule is to pick a good null label. This label will represent a lack of measurement. One of the best null labels is the blank entry, which most programs will interpret as null (so long as you’re careful not to use a space instead of a blank). A secondary option is to use the null value preferred by your primary analysis program: “NA” in R, “NULL” in SQL, “None” in Python, etc. (see Table 1 of White et al.). However, this option is less ideal because it can cause unexpected problems if you don’t convert the nulls in your dataset before using it in a different program. So it’s best to stick with the blank entry for all of your null data points.
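To see why this matters in practice, here is a minimal sketch in Python using pandas (the data, file contents, and column names are made up for illustration): a blank entry is read as null and skipped when averaging, while a “0” always counts as a measured zero.

# A made-up four-row dataset: sample B is a true zero, sample C was never measured (blank).
import io

import numpy as np
import pandas as pd

csv_text = """sample,count
A,4
B,0
C,
D,8
"""

df = pd.read_csv(io.StringIO(csv_text))  # the blank entry for C is read as NaN (null)
print(df["count"].mean())                # 4.0 -- the NaN is skipped, the 0 counts as a true zero

# If that 0 were actually a placeholder for "not measured", the honest average is different:
print(df["count"].replace(0, np.nan).mean())  # 6.0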

The third rule is to be consistent. There is no point in standardizing something if you’re not going to be consistent about it and, in this case, consistency makes for accurate calculations. So pick a system and stick with it!

Finally, you should document anything that isn’t standard. Want to use blanks for missing data and “NA” for not applicable data points? You can do it, so long as you are clear and upfront (and consistent) about the system you use.
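If you do adopt a mixed scheme like that, you usually also have to tell your analysis program about it. As a hedged sketch with pandas (the file name is hypothetical), you can keep the literal text “NA” as a “not applicable” label and treat only truly blank cells as null:

import pandas as pd

# Only blank cells become null (NaN); cells containing the text "NA" stay as the
# string "NA" instead of being auto-converted to a missing value.
df = pd.read_csv("beetle_counts.csv", keep_default_na=False, na_values=[""])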

Keeping zero and null straight is not difficult, but it takes a little conscious effort to be sure that everything is accurate. This effort is worth it in the long run, as your datasets are streamlined and your calculations turn out correct.


References:

“This post is about nothing*.” Practical Data Management for Bug Counters, 30 Jan 2014.

White et al. “Nine simple ways to make it easier to (re)use your data.” Ideas in Ecology and Evolution 6(2), 2013. http://library.queensu.ca/ojs/index.php/IEE/article/view/4608/4898

Posted in dataAnalysis

Test Your Backups

A lot of data management is about risk prevention, and near the top of that list is having a good backup system (or two). But what I want to talk about today is that it’s not enough to have a backup system in place; you also need to know that it’s working.

There are an incredible number of stories about losing data due to poor backups, but one of the best comes from the makers of Toy Story 2.

The moral of the story is to check your backups.

You should check your backups for two reasons. First, you need to know that they are working properly. A backup that is not working is not a backup at all. You should test your backups periodically, say once or twice a year, and any time you make changes to your backup system. If your data are particularly complex to back up or particularly valuable, consider testing your backups more frequently.
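If your data live in ordinary files, one way to spot-check a backup is to compare checksums between your working copy and the backup copy. Here is a small Python sketch; the two paths are placeholders for wherever your data and backups actually live.

import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 hash of a file's contents."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()

original = Path("C:/Data/ImportantProject")   # placeholder: your working copy
backup = Path("Z:/Backups/ImportantProject")  # placeholder: your backup copy

# Flag any file that is missing from the backup or whose contents differ.
for src in original.rglob("*"):
    if src.is_file():
        copy = backup / src.relative_to(original)
        if not copy.exists():
            print(f"MISSING from backup: {src}")
        elif checksum(src) != checksum(copy):
            print(f"MISMATCH: {src}")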

The second reason to test your backups is to know how to restore from backup. Believe me, you don’t want to be learning how to restore from backup when you’re already in a panic over losing the main copy of your data. Knowing how to restore from backup ahead of time will make the data recovery process go much more smoothly.

It’s a small thing to periodically test restoring from backup, but it will give you the peace of mind that your data are being properly backed up and that you will be able to recover everything if something happens to your main copy. On balance, that’s definitely worth taking a few minutes out of your day for.

Posted in dataStorage

README.txt

I mentioned README.txt files in my previous post and I wanted to expand on this concept because READMEs are one of my favorite data management tools. The reason is that many of us keep notes separate from our digital data files, so our digital data is not always well documented or understandable at a glance. README.txt files cover this gap and allow you to add notes about the organization and content of your digital files and folders. This helps coworkers and your future self navigate through your data.

README.txt files originated with computer code, where the README is the first file someone should look at in order to understand the code (as implied by the name). Being a simple .txt file makes this information readable on practically any system. That simplicity and portability make READMEs a great tool to co-opt for data management.

I strongly recommend that you use a README.txt file at the top level of your project folder to explain the purpose of the project, the relevant summary and contact details, and general organization of your files. This is equivalent to using the first page of your laboratory notebook to give a general description of your project.

Here is an example of a top-level README.txt file for an imaginary chemistry project:

Project: Kristin’s important chemistry project
Date: June 2013-April 2014
Description: Description of my awesome project here
Funder: Department of Energy, grant no: XXXXXX
Contact: Kristin Briney, kristin@myemail.com

ORGANIZATION

All files live in the ‘ImportantProject’ folder, with content organized into subfolders as follows:

- ‘RawData’: All raw data goes into this folder, with subfolders organized by date
- ‘AnalyzedData’: Data analysis files
- ‘PaperDrafts’: Draft of paper, including text, figures, outlines, reference library, etc.
- ‘Documentation’: Scanned copies of my written research notes and other research notes
- ‘Miscellaneous’: Other information that relates to this project

NAMING

Raw data files will be named as follows:

“YYYYMMDD_experiment_sample_ExpNum”
(ex: “20140224_UVVis_KMnO4_2.csv”)

STORAGE

All files will be stored on my computer and backed up daily to the shared department server. I will also keep a backup copy in the cloud using SpiderOak.

If I hand someone this project folder, the README.txt contains enough information to understand the project and do basic navigation through the subfolders. Plus, I tell you where all of the copies of my data live if one should accidentally be lost. While not extensive, this information is invaluable to someone unfamiliar with my work trying to find and use my files, such as a boss or coworker.

Besides having one top-level README.txt file, I also recommend using these text files throughout your digital file structure whenever you need them. If you cannot tell, at a glance, what all of the files and subfolders contain, you should create a README.txt (and possibly rename your files and folders!).

Here is an example of a low-level README.txt, which documents the differences between several versions of an analyzed dataset:

Description of files in the “Analysis/ReactionTime/KMnO4” folder

- KMnO4rxn_v01: Organizing raw data into one spreadsheet
- KMnO4rxn_v02: Trying out first-order reaction rate
- KMnO4rxn_v03: Trying out second-order reaction rate
- KMnO4rxn_v04: Revert back to v02/first-order fitting and refining analysis
- KMnO4rxn_FINAL: Final fit and numbers for reaction rate

The graphs corresponding to each file version are in the ‘Graphs’ subfolder, with correspondence explained by the README.txt contained therein.

You can see that READMEs don’t have to be large files. Instead, they just need to contain enough information to know what you’re looking at.
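If you want to make a habit of it, a few lines of Python can even scaffold a README.txt for a folder by listing its contents and leaving a blank description for you to fill in. This is just a sketch, not a standard tool, and the folder name is only an example.

from pathlib import Path

def scaffold_readme(folder: str) -> None:
    """Create a skeleton README.txt listing the folder's files and subfolders."""
    folder_path = Path(folder)
    lines = [f"Description of files in the '{folder_path.name}' folder", ""]
    for item in sorted(folder_path.iterdir()):
        if item.name == "README.txt":
            continue  # don't list the README itself
        kind = "folder" if item.is_dir() else "file"
        lines.append(f"- {item.name}: <describe this {kind}>")
    (folder_path / "README.txt").write_text("\n".join(lines) + "\n")

scaffold_readme("AnalyzedData")  # creates AnalyzedData/README.txt, ready to fill in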

README.txt files are ostensibly for other people who might use your data, but they are also useful for you, the data creator, if and when you come back to an older set of data. We tend to forget small details over time, and a good README.txt serves as a reminder of those details and an easy way to reacclimate ourselves to our older data.

It takes only a small amount of time to create README.txt files, but they fill an important documentation gap and are incredibly useful for data given to others and for data with long-term value. I encourage you to create a few README.txt files and improve your data management!

Posted in dataManagement, digitalFiles

Wrapping Up A Project

Data managers talk a lot about planning your data management before a project starts, but there is another point in a project that is just as critical: when the project ends. My recent post on managing thesis data got me thinking about this critical project point, along with a recent tweet from Robin Rice, Data Librarian at the University of Edinburgh, on what usually happens to data post-thesis.

While all project data are susceptible to such loss, thesis data are particularly fragile because data are often handed off to a PI when the student leaves the university. This puts someone who does not have much knowledge of the data in charge of caring for the data in the long term. The truth is that your PI will be much happier, and you will be happier with your own access in the long term, if you prep your data a bit before this hand off.

You have a distinctive opportunity to care for your data at the time when you are wrapping up a project. Not only is the data still fresh in your mind, but you probably already perform some management actions, like backing up your data and storing your notebook, when wrapping up a project. Adding a few simple steps to this process will let you enjoy the products of your work well after you finish the actual project.

Back Up Your Written Notes

People always think to back up their digital data (which you should definitely do) but few ever remember to back up written notes. This is a shame because data without the corresponding notes are often unusable. Not only does a backup copy address the possibility of a lost notebook, but it also helps the dissertators who hand over their notes at the end of their degree. If those researchers want access to their written notes after they leave their university, they must make a copy for themselves before the handover.

You can back up your notes by making physical photocopies, but these days I recommend digital scans. The benefit of scans is that you can store them directly alongside your digital data, which saves you from having to track down stray notes later. It does take time to scan a notebook, but the reward is ensuring access to your notes and maintaining the usefulness of your data going forward.

Convert to Open File Formats

This is the one that has defeated me personally. Even though I have all my files from graduate school, most of my data are locked up in a proprietary format that I no longer have the software to open. Don’t get stuck in the trap where you have your data but cannot read or use them!

If you haven’t done so already, wrapping up a project is a great time to convert files to an open format. Look for formats that are open, standardized, well documented, and in wide use, such as .csv, .tiff, .txt, .dbf, and .pdf. These formats can be opened by many programs, meaning you have lots of options for getting your data back when you need them.
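As one hedged example, if your spreadsheets are stuck in Excel’s format, a few lines of Python with pandas (plus an Excel reader such as openpyxl) can export every sheet to .csv; the file name below is just an illustration.

import pandas as pd

# Read every sheet of a (hypothetical) workbook and write each out as its own .csv file.
sheets = pd.read_excel("KMnO4rxn_FINAL.xlsx", sheet_name=None)  # dict: sheet name -> DataFrame
for name, data in sheets.items():
    data.to_csv(f"KMnO4rxn_FINAL_{name}.csv", index=False)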

If there isn’t a good open format for your data type, or you would lose important information during conversion, you’ll want to plan how you will maintain access to the necessary software into the future. Realize that this option takes much more effort, so opt for an open file format if you at all can.

Utilize “README.txt” Files

I cannot recommend “README.txt” files enough for making sense of digital files and file organization. These simple text files answer the very important questions of “What the heck am I looking at?” and “Where do I find X?” in your project file folders. This information is useful at every level of your project, from the main project folder on down to the folder containing sets of data. Plan to create one README.txt file per folder in as many folders as you can.

By their name alone, README.txt files announce that they are the first file to open when you or someone else is looking through your old data. Their job is to provide a map for exploring your files. For example, a top-level README.txt should give the general project information and a very coarse overview of file contents and locations. A low-level README.txt would be more specific as to what each file contains. These files need not be large, but their contents should provide a framework for easy navigation through your digital files and folders.

When wrapping up a project, you should create a README.txt file for at least your top-level folder and your most important project folders. This is doubly important if you are handing off your data for someone else to maintain, as good READMEs make it exponentially easier for someone unfamiliar with the data to figure out what’s what. Still, this system is useful to you, the data creator, in the event you come back to the data in the future.

Keep Everything Together

Finally, you will want to track down stray files and folders when you wrap up a project. It is much easier to manage all of your data if it is in one place (or two places if you have both physical and digital collections). Note that this does not include backups, which are separate and can exist offsite. Don’t forget to include things like reference libraries and relevant paper drafts in this pile; you want to save everything related to the project in the same place.

Once you have everything together, save it to an appropriate place and back it up. Keep track of your files and backups and move everything to new media every few years or so; you don’t want to be that researcher hunting for a Zip disk reader in 5 years. Remember that just because your project is complete doesn’t mean that you can now ignore your data.

Final Thoughts

Researchers are often anxious to move on to the next thing when wrapping up a project, but you must resist the temptation to speed through the data preparation process. Taking an extra day to prepare your data properly can mean the difference between being able to use your data in 3 years and not having access to it at all. Between all of the time and effort you have invested in that data and the possibility that you may need it again in the future, it is worth taking a few extra steps to wrap up a project properly.

Posted in dataManagement, dataStorage

The Declining Availability of Data

The journal Current Biology published a paper yesterday that demonstrates what may be obvious to many of us: we’re really bad at keeping track of old data. Not only is it difficult to maintain data, particularly digital data, for many years, but researchers are also not trained in how to preserve their information. The result is a decay of data availability over time.

[Figure: data availability plot from Vines et al., “The Availability of Research Data Declines Rapidly with Article Age,” Current Biology (2014), http://dx.doi.org/10.1016/j.cub.2013.11.014]

This decay not only hurts us, the original data producers, by limiting what we can do with our own data, but it also hurts others in our field. The Nature commentary on the original article provides a great example of why, citing an ecologist who works with a plant studied 40 years ago by another scientist. Because the older data are now lost, that ecologist cannot draw any useful conclusions about the plant over the long term.

In the Nature commentary example, the original scientist is now dead but his data are still valuable, meaning that data are often assets that need care long after we are around to use them. To address this, one scientist has suggested we develop scientific wills, of sorts, to identify datasets of long-term value and who will care for them. No matter what, we need to start thinking about our data in the long term.

I’m not saying that every scientist needs to be an expert in digital preservation, but it does help to know the basics of keeping up with your data. Still, the best way to preserve data in the long term is by giving it to a preservation expert (aka. a data repository) to manage. This way, you don’t have to learn the ins and outs of preservation and you don’t have to worry about keeping track of the data yourself. It’s just what every scientist wants: a hands-off system that keeps track of your data while costing little to no money.

Data repositories come in two major flavors: disciplinary repositories run by an outside group and your local institutional repository run by your library. Either way, it’s their whole job to make sure that whatever is in their repository is available many years from now. I suggest starting with your local repository when looking for a home for your data, but be aware that many of these repositories were built for open access articles and cannot handle large datasets. In that case, consider one of the following repositories:

These repositories make data openly available because many journals and fields are coming to expect data publication alongside article publication. Still, it’s possible to upload your data and embargo it for a short period of time, allowing you to keep working with the data but not worry about preserving it. The repository figshare even has a new private repository feature, which I think is pretty cool: it keeps your data private (and privately shareable) for any amount of time but lets you easily switch a dataset to public when you need to.

This list represents my repository highlights but there are obviously many more available, especially in biology. Ask around to find out if there is one your peers prefer, which will make your data more likely to be found and cited.

Finally, I will add that we will be seeing much more about data repositories going forward. Between journal and funder requirements to publish data and the recent White House OSTP memo pushing for even more data sharing, data repositories and data publications are only going to grow from here. If it means that we stop hemorrhaging data over time, I think that’s a very good thing.

Posted in dataManagement, digitalPreservation

Save Your Thesis (and back it up too)

I remember being incredibly paranoid when I was writing my PhD thesis that my computer would crash and I would lose all of my files. After 5 long years of work, I did not want anything keeping me from finally graduating, lost dissertation and data included. Luckily, no such calamity befell me, but I did have a friend whose laptop was stolen in the middle of writing his thesis. He was forced to start over from scratch because he did not have a good backup copy. Sadly, this is not a unique occurrence.

It’s bad enough to deal with the stress of writing a thesis and worrying about moving on from school—you do not need the added paranoia about losing (or difficulty finding) important information on top of that. Thankfully, data management offers some practical tips that can keep your worries focused solely on writing the actual thesis.


Back up your files

One strategy that will save you a lot of thesis stress is having a good backup system. I recently wrote about “the Rule of 3”, and thesis time is a great opportunity to follow it. The rule basically says that you should have 3 copies of your files, two onsite and one offsite. If one of your onsite copies fails, you still have two copies to fall back on; this can reduce a lot of paranoia about losing your important files.

To further allay your fears, I recommend using automated backup systems and test restoring from them. Automation removes any work that you have to do beyond set up because, frankly, you have enough things to work on right now. Once you set up your backups, you should run through the procedure for getting your files back from the system. This ensures that you won’t be frantically searching for the restore procedure if you lose your main copy and that your backup system is actually working.

Finally, I will remind you about the hidden perils of cloud storage. In a way, cloud storage is great for thesis writing, especially if you want access from several locations. But you should definitely read your cloud storage service’s terms of service to be sure that they can’t do anything they want with your thesis files. Your thesis is too important to store in cloud storage that doesn’t protect your content.


Organize your information

A small thing that will smooth out the writing process is organizing your thesis documents as you create them. First, consider how you want to arrange your thesis files. It may be logical to organize things by chapter or section, keeping separate folders for figures, data, references, etc. Pick a system that feels logical to you so you’ll know where to find everything when cross-referencing and assembling the final document.

In conjunction with having a good organization structure for your files, think about consistent file naming. Labeling written draft files differently than figures and tables, and drafts differently than final versions makes it easier to find and use information. You can also tell, at a glance, what is done and what you have yet to do.

Another practice I highly recommend is to version your drafts. This means regularly saving a draft to a new file with a new version number. For example, I might save my first chapter drafts as the files “Ch01_v01.docx”, “Ch01_v02.docx”, etc. with each consecutive version being a more complete draft. The final version of this chapter would be named “Ch01_FINAL.docx”.

Not only does versioning allow you to easily revert to an earlier version of your draft or recover from a corrupt file but it also helps you keep track of the most current copy. This last point is very important if you are writing your thesis on multiple computers; you need to know which is the most current copy so that you don’t repeat effort or have to deal with merging edits.
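If you like, you can even automate the “save a new version” step. The sketch below copies the highest-numbered draft to the next version number using the example file names above; it assumes the drafts sit in the current folder and is only an illustration.

import re
import shutil
from pathlib import Path

def save_next_version(chapter: str = "Ch01") -> Path:
    """Copy the highest-numbered draft (e.g. Ch01_v02.docx) to the next version."""
    pattern = re.compile(rf"{chapter}_v(\d+)\.docx")
    versions = [int(m.group(1))
                for p in Path(".").glob(f"{chapter}_v*.docx")
                if (m := pattern.fullmatch(p.name))]
    if not versions:
        raise FileNotFoundError(f"No existing drafts found for {chapter}")
    latest = max(versions)
    new_draft = Path(f"{chapter}_v{latest + 1:02d}.docx")
    shutil.copy2(f"{chapter}_v{latest:02d}.docx", new_draft)
    return new_draft

print(save_next_version("Ch01"))  # e.g. Ch01_v03.docx if v01 and v02 already exist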

In the end, you want a clear workflow for where things will go and how they will be named. Taking a few minutes before you start your thesis to come up with a system and sticking with these workflows can save you time later when you are looking for that one particular file right before submission.


Manage your references

I cannot say enough about the value of a good citation manager while writing your thesis. You are going to be citing a lot of sources, so you want a system that both organizes your references and helps you format your actual citations. There are many options available to you—most notably Mendeley, Zotero, RefWorks, EndNote, and Papers—so pick one and run with it. Writing a thesis without a citation manager is just asking for more frustration and stress.


Think ahead

You should address all of the things mentioned in this post before you actually start writing. It will take a little time at the beginning, but once you have set up your backup systems, established your workflows, and chosen a citation manager, everything should fade into the background behind actually writing. That’s the whole point of data management—to build workflows that make it easier for you, in the long term, to do your work.

So take a few minutes at the beginning of the process to set things up. I can’t promise it will entirely relieve your stress, but at least you’ll be worried about your writing instead of losing your thesis.

Posted in dataManagement