The Many Layers of Open

[Image: Open Data Layers sketchnote]

I was at OpenCon last week and left with lots of ideas about being open. In particular, my general understanding of open broadened: open is really just a means to other ends, advocacy is necessary, and improved access and data literacy need to go hand-in-hand with opening up data. I still have a lot to process, but I wanted to blog about one issue that I sketchnoted (right) during the “Open Data Brainstorming” unconference session: the many layers of opening up data.

The point is this: Open Data doesn’t have to be an all-or-nothing thing. This idea doesn’t exactly align with the big goals of OpenCon, but I think a layered approach to Open Data is very practical for researchers who aren’t used to sharing data.

Instead of making your data totally open or totally closed, it might be better to think of the following layers of openness:

  • Making data usable for yourself
  • Making data usable for your future self
  • Making data usable for your coworkers/boss
  • Making data usable for your discipline
  • Making data usable for everyone

[Image: Layers of Open Data]

While the last layer is the ultimate goal of “Open Data”, I definitely think that there is value in the inner layers. For example, even if your data isn’t totally open, it can be of huge benefit for your data to be usable by your coworkers/boss instead of just yourself. The other reason this model works is that data prepared for one layer is not automatically usable in the next layer out, though the reverse is usually true: data ready for an outer layer is usable at every layer inside it.

A related idea, and one that I’ve already blogged about, is the hidden cost of Open Data. (Basically, Open Data takes work but data management makes it easier to put your data in a form ready to be used by other people.) But if we think about Open Data in a layered approach, the cost comes in stages rather than all at once.

So instead of saying, “you must make your data totally open”, I challenge you to move one layer out. For example, are you terrible at data management? Try making your data more usable to your future self. Can your coworkers/boss already understand and use your data? Put practices into place to make it usable to others in your field. Each of these steps outward brings concrete benefit to yourself and others.

I really think that Open Data can be a layered process. Not only does this help us recognize the work that open requires, but it can also help bring researchers who aren’t used to sharing on board with the idea of Open Data.

Posted in openData

Spreadsheet Best Practices

I gave a webinar recently on tools and tips for data management. While many of the themes I spoke about have been covered here previously (though you should really check out the webinar slides – they’re awesome), I realized that I have never written about spreadsheets on the blog. Let’s rectify that.

The big thing to know about spreadsheets is that best practices emphasize computability over human readability; this is likely different from how most people use spreadsheets. The focus on computability is partly so that data can be ported between programs without any issue, but also to avoid capturing information solely via formatting. Formatted information is not computable information, which defeats the main purpose of using a spreadsheet. It’s better to have a clean table that is computable in any program than to have a spreadsheet that looks nice but is unusable outside of a particular software package.

With computability in mind, here are a few best practices for your spreadsheets:

  1. Spreadsheets should only contain one big table. The first row should hold variable names and all other rows data. Smaller tables should be collapsed into larger tables wherever possible.
  2. Kick graphs and other non-data items out of your spreadsheet tables. If you’re in Excel, move them to a separate tab.
  3. Keep documentation to a minimum in the spreadsheet. (This is where data dictionaries come in handy.)
  4. Differentiate zero from null; a measured value of 0 is not the same as a missing value.
  5. Remove all formatting; absolutely NO MERGED CELLS. Add variables as necessary to encode that information in another way.

If you follow these rules, you should create spreadsheets that are streamlined and can easily move between analysis programs.
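
To make this concrete, here’s a minimal sketch of what a table built this way looks like once it leaves Excel; the file and column names are hypothetical, and I’m using Python’s pandas library only as a stand-in for “any other program”.

```python
import pandas as pd

# "samples.csv" is a hypothetical file that follows the rules above: one table,
# variable names in the first row, no merged cells, and blank cells for nulls:
#
#   sampleID,treatment,weight_g
#   S001,control,12.4
#   S002,drug,0          <- a real measurement of zero
#   S003,drug,           <- null: no measurement was taken
df = pd.read_csv("samples.csv")

print(df["weight_g"].isna().sum())   # counts the nulls, not the zeros
print((df["weight_g"] == 0).sum())   # counts the zeros, not the nulls
```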

Such portability is important for two reasons. First, there are many great data analysis tools you may want to leverage, but they probably won’t import messy spreadsheet data. Second, there are known issues with Excel, research’s ubiquitous spreadsheet program; for example, a recent study found Excel-related gene name errors in up to one fifth of supplemental data files from top genomics journals, not to mention the fact that Excel is known to mangle dates. It’s therefore best to keep your options open and your data as neutral as possible.

I hope you use these best practices to streamline your spreadsheet data and take maximum advantage of it!

Posted in dataAnalysis

Open Data’s Dirty Little Secret

Earlier this week, I was very happy to take part in the Digital Science webinar on data management. I spoke about how data management should be accessible and understandable to all and not a barrier to research. I also made a small point, thrown in at the last minute, that really seemed to resonate with people: that open data has a dirty little secret.

The secret? Open data requires work.

In all of the advocacy for open data, we often forget to account for the fact that making data openly available is not as easy as flipping a switch. Data needs to be cleaned so that it doesn’t contain extraneous information, streamlined to make things computable, and documented so that another researcher can understand it. On top of this, you must choose a license and take time to upload the data and its corresponding metadata. One researcher estimated that this process required 10 hours for a recently published paper, with significantly more time spent preparing his code for sharing.

But there is another secret here. It’s that data management reduces this burden.

Managing your data well means that a good portion of the prep work is done by the time you go to make the data open. This is done via spreadsheet best practices, data dictionaries, README.txt files, etc. Well managed data is already streamlined and documented and thus presents a lower barrier to making it open.

These issues are reinforced by the recently published “Concordat on Open Research Data”. The Concordat is made up of 10 principles, and these two in particular stuck out to me:

  • Principle 3: Open access to research data carries a significant cost, which should be respected by all parties.
  • Principle 6: Good data management is fundamental to all stages of the research process and should be established at the outset.

As we advocate for open data, Principle 3 reaffirms that we need to recognize the costs. But – as with most things I blog about here – there is a solution, and it’s managing your data better.

Posted in dataManagement, openData

Version Control for Evolving Files

How often do your files evolve? For example, many researchers develop their protocols, paper drafts, code, slide decks, and analyses over time, going through many different versions of a file before settling on a final one (or even never finalizing a document!). This is a totally normal part of doing research but does make managing those files a bit challenging.

Thankfully, there is a solution to dealing with such ever-changing documents: version control. Version control allows you to keep track of how your files change over time, either by taking a snapshot of the whole document or making note of the differences between two versions of the same document. This makes it easy to go back and see what you did or even revert to an earlier version of the file if you made changes that didn’t work out (just think how helpful this would be for files like protocols!).

The good news is that there are several ways to accomplish version control, depending on your need and skills.

The Simple Way

The easiest way to implement version control is to periodically save a new version of the file. Each successive file should either have an updated version number or be labelled with the current date. Here’s how a set of versioned files might look:

  • mydata_v1.csv
  • mydata_v2.csv
  • mydata_v3.csv
  • mydata_FINAL.csv

I like using the designation “FINAL” to denote the very last version – it makes for easy scanning when you’re looking through your files.

Dates are also useful, especially if you don’t anticipate ever having a final version of the document. Don’t forget to use ISO 8601 here to format your dates!

  • myprotocol-20151213.txt
  • myprotocol-20160107.txt
  • myprotocol-20160325.txt

The downside of this system is that it takes up lots of hard drive space, as you’re keeping multiple copies of your file on hand.
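
If you like to script things, here’s a minimal sketch of that “save a dated copy” habit in Python (the file names are hypothetical); doing it by hand with Save As works just as well.

```python
import shutil
from datetime import date

# Save a dated snapshot of the working file, e.g. "myprotocol-20160325.txt",
# using ISO 8601 (YYYYMMDD) so the versions sort chronologically by name.
working_file = "myprotocol.txt"                    # hypothetical working copy
snapshot = f"myprotocol-{date.today():%Y%m%d}.txt"
shutil.copy2(working_file, snapshot)               # copy2 also preserves timestamps
print(f"Saved {snapshot}")
```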

The Robust Way

For something more robust than the simple solution described above, I recommend a version control system such as Git. These systems come out of computer science and track the small changes between versions of a file, which takes up far less disk space than keeping whole copies. While originally designed for code, Git is now being used by researchers to track many types of files.

The downside of version control systems is that they can have a high learning curve (though Git has a GUI version that is less onerous to learn than the command line). But once you get over the curve, version control software is a really powerful tool to add to your research arsenal, as it handles most of the legwork of versioning and offers other cool features besides, like collaboration support. Here’s a list of resources to help you get started with Git.

(A side note for those familiar with GitHub: be aware that Git and GitHub are two different things. Git is the system that handles versioning, while GitHub is an online service that hosts Git repositories. They are designed to work together, though it’s definitely possible to use Git without GitHub.)
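
To give a flavor of the basic workflow, here’s a rough sketch driven from Python via subprocess (assuming Git is installed and a hypothetical myprotocol.txt sits in the current folder); you’d normally type the same git commands at the command line or use a GUI client instead.

```python
import subprocess

def git(*args):
    """Run a git command in the current folder and stop if it fails."""
    subprocess.run(["git", *args], check=True)

git("init")                                    # one-time setup for the folder
git("add", "myprotocol.txt")                   # stage the current version
git("commit", "-m", "Update incubation step")  # record a snapshot with a note
# Later, `git log myprotocol.txt` lists every snapshot of that file,
# and `git checkout <commit> -- myprotocol.txt` brings an old version back.
```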

An Alternate Way

As a last resort, you can leverage the fact that some storage platforms have built-in versioning. For example, both Box and SpiderOak keep track of each version of a file uploaded onto their systems. This is not the best option for versioning, as it takes control out of your hands, but it’s better than nothing and is useful in a pinch.

Final Thoughts

I hope you will consider using one of these methods of version control for your files. Version control is downright handy for when you accidentally delete a portion of a file, are sending documents back and forth with collaborators, need to revert to an earlier version of a file, or want to be transparent about evolving protocols. So pick the system that works best for you and go for it!

Posted in digitalFiles

These are a Few of My Favorite Tools

If you follow me on Twitter, you may have noticed a small rant the other day about how much I dislike how Excel handles dates (it’s a proven fact that Excel is terrible with dates). My beef with Excel got me thinking more about the programs I actually like to use for data processing and cleanup. These are tools that I’d only stop using if you paid me serious money. And maybe even not then.

Since I love these tools so much and since I’m mistress of my own blog, I’m going to spend this post proselytizing about them to you because I want you to love them too. So here are my top 4 you-might-not-know-about-them-but-really-should tools for hacking at your data. Some of them only do a small job, but they do it incredibly well and are just what you need in a specific instance.

Regular Expressions

Regular expressions (or regex) are a somewhat obscure little tool for search and replace, but I’ve found nothing better for cleaning up text. Regular expressions work by matching patterns in text and have a lot of flexibility. Need to find every phone number in a document? Want to clean up dates? Have to reformat a document while keeping the text the same? Regex is the tool for you. Regex isn’t a standalone program but rather plugs into other tools (including all of the tools below, as well as many programming languages). I recommend this tutorial for getting started.
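
To show what that looks like in practice, here’s a small Python sketch (the sentence and date format are made up) that uses one regular expression to rewrite US-style dates as ISO 8601:

```python
import re

text = "Samples collected 3/7/2016 and 12/13/2015."

# Capture month, day, and year, then rewrite each date as YYYY-MM-DD.
iso = re.sub(
    r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b",
    lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}",
    text,
)
print(iso)  # Samples collected 2016-03-07 and 2015-12-13.
```

The same find pattern works in Notepad++ or any other regex-aware tool, though the replacement syntax differs from tool to tool.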

Notepad++

While not as dazzling as the other players on this list, Notepad++ is my go-to text editor and the main platform I use for leveraging regular expressions. It’s always good to have an open source text editor around and this one is my particular workhorse.

OpenRefine

OpenRefine (formerly Google Refine) is an open source tool for cleaning up spreadsheet data. This tool allows you to dig into your data by faceting it across any number of variables. I find it particularly handy for generating counts; for example, it’s incredibly easy to find out how many times {variable1=X AND variable2=Y} occurs versus {variable1=X AND variable2=Z}. Faceting also allows for editing of select subsets of data. You can also do stronger data manipulations, such as streamlining inconsistent spelling and formatting or breaking multi-component cells apart/collapsing columns together. I recommend this tutorial for getting started.
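
For a flavor of what faceted counting gives you, here’s the same kind of tally done in Python with pandas (the column names and values are hypothetical); OpenRefine gets you there entirely through menus, no code required.

```python
import pandas as pd

# Hypothetical table with two variables, mirroring the faceting example above.
df = pd.DataFrame({
    "variable1": ["X", "X", "X", "W"],
    "variable2": ["Y", "Z", "Y", "Y"],
})

# How many rows have each combination of variable1 and variable2?
print(df.groupby(["variable1", "variable2"]).size())
```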

Bulk Rename Utility

Need to rename a batch of files? Bulk Rename Utility is the tool for you! This software lets you rename files in very specific ways, such as adding characters to the end of a name, removing characters from the beginning, changing something at a specific position in the name, and much more. You can also add numbering and dates to file names or do a custom search and replace with regular expressions. I don’t use Bulk Rename Utility a lot, but it saves me a ton of time and energy when I do.
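
For comparison, here’s a minimal Python sketch of the same basic trick (the folder name and prefix are made up); Bulk Rename Utility does all of this through its interface.

```python
from pathlib import Path

# Add a project prefix to every CSV in a folder,
# e.g. "trial01.csv" becomes "exp2016_trial01.csv".
for f in Path("data").glob("*.csv"):          # "data" is a hypothetical folder
    f.rename(f.with_name("exp2016_" + f.name))
```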

Posted in dataAnalysis

Sustainability (a.k.a. Passing the Hit-by-a-Bus Test)

I’m finally back at work after a three-month maternity leave and trying to catch up on everything that I missed while I was home with the little one. It’s going fairly smoothly, mostly because I was able to do a lot of planning before I left.

Having time to plan ahead and a fairly strict deadline is definitely a benefit of taking maternity leave. But I’ve been thinking a lot recently about how this doesn’t always happen. For example, what happens if you suddenly get sick and can’t work for a while? In the worst case, your research could be retracted because you aren’t available to answer questions about the work.

All this has me thinking about sustainability. Basically, does your data pass the hit-by-a-bus test? I’ve had several conversations with my data peers on this topic in the last year and thought it worth exploring a little on the blog.

So how can you make sure your data lives on if you suddenly can’t work for a while? Or if you take a new job? Or if you actually get hit by a bus?

Documentation is probably the single most important piece of data sustainability. Not only should your notes be understandable to someone “trained in the art”, but it’s also a good idea to add some documentation to your digital files – I love README.txt files for this. You should document enough for someone (including your future self once you recover from the bus) to pick up exactly where you left off without taking weeks to decipher your work. And don’t forget about code and procedures.
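
If you’re wondering what belongs in a README, here’s a minimal sketch of the kind of skeleton I mean; the fields are just my suggestions, wrapped in a tiny Python snippet that drops the file into the current folder.

```python
from pathlib import Path

# A bare-bones README skeleton; fill in the angle-bracketed fields for your project.
readme = """\
Project: <project name>
Contact: <who to ask if I'm unavailable>
Last updated: <YYYY-MM-DD>
Contents: <what each file and subfolder holds>
Naming/versioning: <how files are named and versioned>
To reproduce: <scripts to run, software and versions needed>
"""

Path("README.txt").write_text(readme)
```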

There’s also a technical side to sustainability. Take file types, for example. Will your data live on outside of that weird software that only your lab uses? Making sure that your data is stored properly and well backed up also matters. Data shouldn’t be put on an external hard drive and forgotten.

Finally, ask yourself: what is the worst that could happen? The answer will vary from researcher to researcher, but thinking about this question lets you do a little disaster planning. It might lead to training a coworker to take care of your animals or taking extra precautions with your specialized equipment: whatever you need to do to make sure your work survives.

You may never get run over by that fictitious bus, but it can still be a useful exercise to think about sustainability. At the very least, you make your research more robust and easier to pick up if you need to go back to it in the future. And at worst, your research will be one less thing to worry about if things take a turn for the worse.

Posted in dataManagement