“Data Is” or “Data Are”?

Want to start a disagreement amongst data managers? Ask them if “data” is a singular or plural noun. Does one say “data are” or is it better to say “data is”? Data people often have opinions about which is correct (and will let you know about it).

Personally, I’ve been on the “data are” side of this war for some time. This is partly because my PhD advisor drilled into my head that one must never say “spectrums”; it’s either one spectrum or many spectra. Likewise, “data” is the plural form. However, I recently had an opportunity to re-evaluate my viewpoint and am starting to lean more toward “data is”.

Much of the reason for my change of opinion came from feedback on my writing. As much as “data are” seems like it should be correct, many people stumble over reading this in a sentence. The meaning of the sentence gets lost as the brain tries to process the grammar. As a writer, this is the last thing that I want. Therefore, I started considering using “data” as a singular noun.

The other thing that moved me toward “data is” was an essay sent to me by a fellow data manager, Amanda, called “Data is a singular noun”. The author makes a good case, based on history and grammar, that it should always be “data is” instead of “data are”.

Part of this author’s reasoning is that a word’s usage and evolution in English matter more than how the word’s originating language says it should be used. So even though Latin suggests “data” should be plural, what matters most is how people actually use the word “data” in English. A second reason for choosing the singular is that we almost never use the word “datum” anymore, which leaves “data” as the de facto singular form. Either way, there’s a lot of history behind using “data” as a singular noun.

I’m sure that we’ll eventually reach a point where there is a conclusive answer to this question. Until then, I’m going to try to be more conscious about using “data is”. At least in my writing.

So, “data is” or “data are”? What do you think?

Posted in Uncategorized

2015 Data Resolutions

With only a little time left in 2014, I’m sure I’m not the only one making New Year’s resolutions. While many people put diet and exercise at the top of this list, this year I’m making a few data management resolutions. I hope you will consider adding a similar goal to your 2015 list.

For all that I blog about data management and do pretty well at managing my own digital content, there are a few things that I need to do better. Data management requires paying attention to your content and it’s easy to let things slide. For me, it’s my personal files that are getting out of hand. My work data are fairly organized and well backed up, but my personal files are a mess. Thankfully, most of the data management tricks that work for scientific data work for digital content in general.

Overall, I’m facing two problems. The first is data spread. I don’t have a consistent organization system for all of my personal content and have things randomly saved across multiple devices (laptop and external hard drive) and cloud storage platforms (Dropbox, Google Drive, and SpiderOak). Worse, there is no rhyme or reason to why a file gets saved, for example, to Dropbox instead of in my laptop’s documents folder. I spent a good 20 minutes last week failing to find a particular sewing pattern only to have it show up a day later in the most unlikely place (my SpiderOak Hive folder). Clearly, I need to be more conscious about where I’m saving which files.

Related to the data spread is the fact that I don’t have a good backup system in place for my personal files (though I do have a good backup system at work). With files in so many places, it’s practically impossible to make sure everything’s backed up properly. Additionally, my laptop is getting older and I need to be sure that all of my files are safe in case my hard drive dies in the near future.

So this year, I resolve to spend a little time getting my personal files organized, streamlined onto one central platform, and properly backed up. This will take a little time but will pay off hugely if/when my laptop finally dies.
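A first step toward consolidating scattered files is finding out what is duplicated across locations. Here is a minimal Python sketch of that idea, assuming each location is available as a local folder; the function names are my own, and this is one possible approach rather than a finished tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Group files under the given folders by content.

    Returns a dict mapping each digest to the list of paths that share it,
    keeping only content that appears in two or more places.
    """
    by_digest = defaultdict(list)
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates` with, say, a documents folder and a synced cloud folder returns every file that lives in more than one place, which makes it easier to decide which copy to keep on the central platform.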

I hope that by admitting my own flaws in data management you can see that nobody’s a perfect data manager. Instead, what matters most is that you make an effort. Any little bit I do in 2015 to take care of my personal files makes my files better protected from loss and easier to find when I need them. It’s not hard, it just requires a little work.

With the advent of the new year, I hope that you too take some time to care for your digital content. The start of the new year is the perfect time to review what needs attention or to resolve to improve your practices. It doesn’t have to be big, it just has to be something; every little bit helps. Therefore, I challenge you to make 2015 the year to start improving your data management habits.

Posted in dataManagement

How to Share Your Research Data

With so many new policies from funding agencies and journals requiring data sharing, it’s growing more likely that you will encounter a data sharing mandate at some point in time. However, it can be difficult to know how to comply if you are new to such requirements. This is because, while the act of sharing data is not complicated, data sharing comes with new systems and best practices that are unfamiliar to many researchers. So let’s walk through the process of sharing your data so you know what to do when faced with a data sharing requirement.

Policy sources

The two most common places you will encounter data sharing requirements are your funder and the journal in which you publish. A list of US funders with data management and sharing requirements is available from the DMPTool. A list of journals requiring data sharing is available from Dryad. Always refer to the specifics of the policies that apply to you, as they can vary from the general description of data sharing requirements I’m outlining here.

What to share

To satisfy most data sharing requirements, you should share any data that underlie a publication. This means making available any and all data necessary to prove or reproduce your findings. Since data are so heterogeneous, you do have some leeway in the exact form of the data you share. Use your best judgment as to whether your peers will prefer raw data, analyzed data, data in a particular file format, etc. Do be sure to perform quality control on your data and add documentation prior to sharing.

When to share

Data sharing should occur at or slightly after the time you publish the article to which the data belong. Note that a few journals want to see your data during peer review (see below). With a few exceptions, you are not required to share your data before you publish your findings.

How to share

The best way to share your data is to place it in a data repository. Repositories are preferable to sharing-by-request as the repository does all of the work to ensure data persistence and discoverability. A repository is a very hands-off way to share once you deposit the data. Repositories also make data more findable and citable, meaning you’re more likely to get recognition for your work. To find a repository, look for suggestions from your journal, your local librarian, or on the repository lists at DataBib and re3data.

Peer review

While peer review is not the norm for shared data, there are methods available for you to have your datasets peer reviewed. The first is that a few journals look at data as part of the peer review process. More common is publishing your data as a “data paper”. Whereas a normal article describes the analysis done on a dataset, a data paper describes the dataset itself and undergoes peer review in tandem with the data. Some researchers prefer sharing data via data papers because, besides being thoroughly documented and peer reviewed, data papers receive citations just like articles. To see the journals that accept data papers, refer to this list from the University of Michigan library.

Final thoughts on data sharing

Data sharing is not complicated but it does require work to clean up your data, add documentation, and deposit your data into a repository (though it does become hands-off at this point). One scientist estimated that he spent almost 10 hours preparing a dataset for public sharing, though he expected that preparation time for the next shared dataset would be shorter. I think that this demonstrates one of the biggest barriers to data sharing: we’re not used to doing it. The systems take time to learn and we have to think about preparing our data for sharing while we’re actively working on them in the middle of a project.

Eventually, everyone will get used to thinking about data as important research products and the systems for sharing data will become more established. In the meantime, I hope this post provides some clarity on complying with new data sharing requirements.

Posted in openData

Where to Start with Data Management

Managing your research data well feels like a big task. There are so many practices that make up good data management that there are whole classes (Oregon State, University of Minnesota) and curricula (NECDMC, MANTRA) on the topic. There’s even a book on data management for researchers coming out in 2015 that I am very excited about (I’m admittedly biased on this). This profusion of practices is a lot to take in if you’ve never consciously managed your data before. So how do you tackle data management if you are new to the subject?

Rather than try to do everything at once, I always recommend a slow and strategic approach to data management when you are getting started. You can really begin anywhere (the posts on this blog can provide inspiration), but I usually suggest one of the following as a starting point:

  • Make sure you have reliable backups
  • Improve your note-taking practices
  • Decide on an organizational system for your data and use it consistently
  • If you have sensitive data, review your data security plan and update it
  • Check to see if you can read your old data files and update file formats and media as needed

You can tackle any of these five practices without much data management experience, and each will have a big impact on the safety or ease of use of your research data.
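For the last item on the list, it helps to know which old files you have before you try opening them. The sketch below is one way to get that overview in Python: it counts files that have not been modified for a given number of days, grouped by file extension. The function name and parameters are my own choices for illustration, and modification time is only a rough proxy for how old a file format is.

```python
import time
from collections import Counter
from pathlib import Path


def old_files_by_extension(root, min_age_days, now=None):
    """Count files under root not modified in min_age_days, by extension.

    Extensions are lowercased; files without one are counted as "(none)".
    Pass now (seconds since the epoch) to make the result reproducible.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # 86400 seconds per day
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Running this on a project folder with `min_age_days=5 * 365` gives a short list of extensions to test first. Any format you can no longer open is a candidate for conversion.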

Once you pick a topic to work on, dedicate a month to slowly improve your habits in this area. With practices like note-taking and organization, try to make good habits part of your routine. With the other practices like reliable backups and keeping track of old files, work to make sure you’re using the best systems available to you; don’t forget to talk with your coworkers about systems you can use together. Basically, focus on improving one data management practice at a time until you are satisfied and comfortable with your new practice. Then choose another data management practice to focus on the next month.

Data management is really the compilation of a lot of small practices that add up over time, and the more you can make those practices routine, the easier it becomes to manage your data well. But every step you take to improve your data management helps. For example, if the only task you address on my list is adding reliable backups, then your data are safer from loss than they were before. So put a little conscious effort into managing your data, improve your practices slowly over time, and pretty soon you will discover that you have well-managed data.

Posted in dataManagement

Cloud Backup

I’m going to come right out and say that Dropbox is not a sufficient backup. If all you have are files in a Dropbox folder that are synced to the cloud, you should not consider your files to be backed up and safe. This is partly because your files are now entirely dependent on a company’s business model, one of the main perils of cloud storage, but also because synced cloud storage is not a true backup.

The reason Dropbox is not a good backup relates to how different cloud storage services work and the Rule of 3. The Rule of 3 states that you should have 3 copies of your data, 2 onsite and 1 offsite, for safest storage. The crux of the issue is that services like Dropbox, Box Sync, and OneDrive were designed to provide easy access to content from multiple locations and not to provide dedicated offsite backup. Because your files are synchronized across multiple locations, you really have one “copy” of the data that lives in both the cloud and locally. This is not enough to satisfy the Rule of 3.
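To make the counting concrete, here is a small Python sketch of a Rule of 3 check. The `Copy` class and the `sync_group` field are hypothetical names I made up for illustration. The sketch also assumes that a set of synced locations counts as a single copy, and that this copy only counts as offsite if every synced location is offsite.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Copy:
    """One place where the data lives (all names here are illustrative)."""
    name: str
    offsite: bool
    sync_group: Optional[str] = None  # copies sharing a group mirror each other


def count_independent(copies):
    """Return (onsite, offsite) counts after collapsing each sync group to one copy."""
    groups = defaultdict(list)
    for index, copy in enumerate(copies):
        key = ("sync", copy.sync_group) if copy.sync_group else ("solo", index)
        groups[key].append(copy)
    # Assumption: a synced group only counts as offsite if every member is offsite.
    offsite = sum(all(c.offsite for c in members) for members in groups.values())
    return len(groups) - offsite, offsite


def satisfies_rule_of_3(copies):
    """True with at least 2 independent onsite copies and 1 independent offsite copy."""
    onsite, offsite = count_independent(copies)
    return onsite >= 2 and offsite >= 1


dropbox_only = [
    Copy("laptop Dropbox folder", offsite=False, sync_group="dropbox"),
    Copy("Dropbox cloud", offsite=True, sync_group="dropbox"),
]
with_backups = dropbox_only + [
    Copy("external hard drive", offsite=False),
    Copy("independent cloud backup", offsite=True),
]
print(satisfies_rule_of_3(dropbox_only))  # False
print(satisfies_rule_of_3(with_backups))  # True
```

The Dropbox-only setup collapses to one copy and fails the check, while adding an external drive and an independent cloud backup brings it up to three.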

With syncing, the method of creation and destruction matters: when you update a file in one location, it gets updated everywhere. Likewise, when you delete a file in your local Dropbox folder, it gets deleted in the cloud and vice versa. So if you are using synced storage and something happens to your local device, there is a chance your synced files in the cloud are at risk. And if Dropbox accidentally loses data in the cloud, as happened with cloud storage provider Dedoose, your local data are at risk.
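Here is a toy Python model of why this matters. It treats each storage location as a dictionary of file names and contents, and the `mirror` function is my own simplification: real sync clients work in both directions and are far more sophisticated, but the propagation of deletions and overwrites is the same in spirit.

```python
def mirror(source, target):
    """Make target identical to source, as a sync client would.

    Edits overwrite the old content and deletions are propagated,
    so target never holds anything that source has lost.
    """
    for name in list(target):
        if name not in source:
            del target[name]
    target.update(source)


local = {"photo.jpg": b"original", "notes.txt": b"draft"}
cloud = {}
mirror(local, cloud)           # the cloud now matches the local folder

del local["photo.jpg"]         # accidental local deletion...
local["notes.txt"] = b"oops"   # ...and an unwanted overwrite
mirror(local, cloud)

print("photo.jpg" in cloud)    # False
print(cloud["notes.txt"])      # b'oops'
```

After the second sync, the cloud holds neither the deleted photo nor the earlier version of the notes. Both locations agree, and both are wrong.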

I wish I could say that this is all theoretical, but people using synced cloud storage have lost data. For example, one researcher lost 8,000 photos both locally and in the cloud after a syncing glitch in Dropbox. Another person lost all his Box files when Box.com rolled his account into an unrelated corporate account. The good news is that synced storage services like Box and Dropbox do hold on to deleted files for 30 days, but this is not always foolproof.

So what should you do to make your data safer in this case? Add a backup to this system. Put a copy of your data on a local hard drive in addition to storing it in Dropbox. Alternatively, you can use a cloud storage service that provides independent storage/backup. For example, I use SpiderOak as an offsite backup. SpiderOak monitors my local files and saves a new version of a file to the cloud whenever I update it. This process is automatic, just like with syncing, but my cloud copy is independent of my local copy. If I delete a local file, the copy in the cloud is unaffected and vice versa. This means my cloud storage provides a true offsite backup and I’m more likely to get my files back if something catastrophic happens locally to my computer.
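To show what “independent” means in practice, here is a minimal Python sketch of a backup that only ever adds versions. It is not how SpiderOak or any other service is implemented; the `back_up` function and its naming scheme (appending part of a content hash to the file name) are assumptions I chose for illustration. It also assumes the backup folder sits outside the source folder.

```python
import hashlib
import shutil
from pathlib import Path


def _digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup):
    """Copy new or changed files from source into backup as new versions.

    Nothing in backup is ever overwritten or deleted, so a local deletion
    or a bad edit does not touch versions that were saved earlier.
    Returns the list of version files written during this run.
    """
    source, backup = Path(source), Path(backup)
    saved = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        # Each version is stored as <name>.<first 12 hex digits of its hash>.
        target = backup / relative.parent / f"{relative.name}.{_digest(path)[:12]}"
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            saved.append(target)
    return saved
```

Running `back_up` on a schedule saves a new version whenever a file’s content changes and does nothing when a file is deleted locally, which is the property that synced storage lacks.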

Cloud storage is a wonderful development in terms of convenience and providing offsite backup or access, but you should never rely on the cloud alone. It’s always best to follow the Rule of 3 and get another backup for your data, just in case.

Posted in dataStorage