These are a Few of My Favorite Tools

If you follow me on Twitter, you may have noticed a small rant the other day about how much I dislike how Excel handles dates (it’s a proven fact that Excel is terrible with dates). My beef with Excel got me thinking more about the programs I actually like to use for data processing and cleanup. These are tools that I’d only stop using if you paid me serious money. And maybe not even then.

Since I love these tools so much and since I’m mistress of my own blog, I’m going to spend this post proselytizing about them to you because I want you to love them too. So here are my top 4 you-might-not-know-about-them-but-really-should tools for hacking at your data. Some of them only do a small job, but they do it incredibly well and are just what you need in a specific instance.

 

Regular Expressions

Regular expressions (or regex) are a somewhat obscure little tool for search and replace but I’ve found nothing better for cleaning up text. Regular expressions work by pattern matching text and have a lot of flexibility. Need to find every phone number in a document? Want to clean up dates? Have to reformat a document while keeping the text the same? Regex is the tool for you. Regex isn’t a standalone program but rather plugs into other tools (including all of the tools below, as well as some programming languages). I recommend this tutorial for getting started.
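To make the pattern matching idea a bit more concrete, here is a minimal sketch in Python, one of the programming languages with regex built in. The sample text, the US-style phone pattern, and the month/day/year date format are all my assumptions for illustration, so treat this as a starting point rather than a pattern that covers every case.

```python
import re

# Hypothetical sample text; the formats below are assumptions for illustration.
text = "Call 608-555-0123 or (414) 555-0199 before 3/7/2016 or 12/25/2015."

# US-style phone numbers: optional parentheses around the area code,
# then groups separated by a hyphen, dot, or space.
phone_pattern = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

# Dates written as month/day/year with a four-digit year.
date_pattern = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def to_iso(match):
    """Rewrite a month/day/year match as YYYY-MM-DD."""
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


phones = phone_pattern.findall(text)
cleaned = date_pattern.sub(to_iso, text)

print(phones)
print(cleaned)
```

The same two patterns can be pasted into the search box of any regex-aware tool, though the replacement syntax differs from tool to tool.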

Notepad++

While not as dazzling as the other players on this list, Notepad++ is my go-to text editor and the main platform I use for leveraging regular expressions. It’s always good to have an open source text editor around and this one is my particular workhorse.

OpenRefine

OpenRefine (formerly Google Refine) is an open source tool for cleaning up spreadsheet data. This tool allows you to dig into your data by faceting it across any number of variables. I find it particularly handy for generating counts; for example, it’s incredibly easy to find out how many times {variable1=X AND variable2=Y} versus {variable1=X AND variable2=Z}. Faceting also allows for editing of select subsets of data. You can also do stronger data manipulations, such as streamlining inconsistent spelling and formatting or breaking multi-component cells apart/collapsing columns together. I recommend this tutorial for getting started.
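If you are curious what a two-variable facet count is actually computing, here is a rough Python sketch of the idea. This is not OpenRefine’s code or API, just the same counting logic; the rows and the column names (species, treatment) are made up for the example.

```python
from collections import Counter

# Hypothetical spreadsheet rows; column names are invented for illustration.
rows = [
    {"species": "mouse", "treatment": "control"},
    {"species": "mouse", "treatment": "drug"},
    {"species": "mouse", "treatment": "control"},
    {"species": "rat", "treatment": "drug"},
]

# Count every combination of the two variables, like faceting on both columns.
facet_counts = Counter((row["species"], row["treatment"]) for row in rows)

# Compare {species=mouse AND treatment=control} with {species=mouse AND treatment=drug}.
print(facet_counts[("mouse", "control")])  # 2
print(facet_counts[("mouse", "drug")])  # 1
```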

Bulk Rename Utility

Need to rename a number of files? Bulk Rename Utility is the tool for you! This software allows you to rename files in very specific ways such as: adding characters to the ending, removing characters from the name beginning, changing something at a specific position in the name, and much more. You can also add numbering and dates to file names or do a custom search and replace with regular expressions. I don’t use Bulk Rename Utility a lot, but it saves me a ton of time and energy when I do.
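For anyone who prefers a script, the regex search-and-replace style of renaming can be sketched in Python. This is not how Bulk Rename Utility works internally; the function names, the file names, and the "_2016" suffix are all hypothetical. The sketch builds a rename plan first so you can review it before any file is touched.

```python
import re
from pathlib import Path


def plan_renames(names, pattern, replacement, prefix="", suffix=""):
    """Return (old_name, new_name) pairs without touching any files."""
    plan = []
    for name in names:
        path = Path(name)
        # Apply the regex to the name only, leaving the file extension alone.
        new_stem = re.sub(pattern, replacement, path.stem)
        new_name = f"{prefix}{new_stem}{suffix}{path.suffix}"
        if new_name != name:
            plan.append((name, new_name))
    return plan


def apply_renames(folder, plan):
    """Rename files in `folder`, refusing to overwrite an existing file."""
    folder = Path(folder)
    for old_name, new_name in plan:
        target = folder / new_name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        (folder / old_name).rename(target)


# Hypothetical file names: replace spaces with underscores and add a year.
names = ["scan 01.tif", "scan 02.tif"]
plan = plan_renames(names, r"\s+", "_", suffix="_2016")
for old_name, new_name in plan:
    print(old_name, "->", new_name)
```

Only call apply_renames on a real folder once the printed plan looks right.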

Posted in dataAnalysis

Sustainability (aka Passing the Hit-by-a-Bus Test)

I’m finally back at work after a three-month maternity leave and trying to catch up on everything that I missed while I was home with the little one. It’s going fairly smoothly, mostly because I was able to do a lot of planning before I left.

Having time to plan ahead and a fairly strict deadline is definitely a benefit of taking maternity leave. But I’ve been thinking a lot recently about how this doesn’t always happen. For example, what happens if you suddenly get sick and can’t work for a while? In the worst case, your research could be retracted because you aren’t available to answer questions about the work.

All this has me thinking about sustainability. Basically, does your data pass the hit-by-a-bus test? I’ve had several conversations with my data peers on this topic in the last year and thought it worth exploring a little on the blog.

So how can you make sure your data lives on if you suddenly can’t work for a while? Or if you take a new job? Or if you actually get hit by a bus?

Documentation is probably the single most important piece of data sustainability. Not only should your notes be understandable to someone “trained in the art”, but it’s also a good idea to add some documentation to your digital files – I love README.txt files for this. You should document enough for someone (including your future self once you recover from the bus) to pick up exactly where you left off without taking weeks to decipher your work. And don’t forget to document your code and procedures, too.

There’s also a technical side to sustainability. Take file types, for example. Will your data live on outside of that weird software that only your lab uses? Making sure that your data is stored properly and well backed up also matters. Data shouldn’t be put on an external hard drive and forgotten.

Finally, ask yourself ‘what is the worst that can happen’? This will vary from researcher to researcher, but thinking about this question will let you do a little disaster planning. It might lead to training a coworker on taking care of your animals or taking extra precautions with your specialized equipment, whatever you need to do to make sure your work survives.

You may never get run over by that fictitious bus, but it can still be a useful exercise to think about sustainability. At the very least, you make your research more robust and easier to pick up if you need to go back to it in the future. At worst, your research will be one less thing to worry about if things take a turn for the worse.

Posted in dataManagement

The Art of Discarding

I’m in the process of spring cleaning my house and am getting lots of inspiration from Marie Kondo’s “The Life-Changing Magic of Tidying Up.” Her big message is that to truly achieve an organized home, you must discard all unneeded/unloved items before you can even begin to tidy. We hold on to a lot of junk and it’s preventing us from enjoying and relaxing in our homes. Using the suggestions in Kondo’s book, the purging process is working really well for me and I’m already feeling better about a lot of my home spaces.

The act of cleaning my home by getting rid of unnecessary junk has me thinking about how underrated the discarding step is in the process of data management. It’s actually important to periodically get rid of useless data so that the good data is easier to find. Why wade through a bunch of files you’ll never use in order to locate the ones you want?

Besides clearing out the cruft, there are two other reasons to consider discarding data. First, junk data takes up hard drive space. I’m totally guilty of holding on to everything and anything digital – such as when I recently transferred all of my old laptop files to my new laptop – but this means I devote more and more disk space to stuff I don’t really need to keep. In the long term, it’s not a very sustainable solution.

The other good reason to discard is if you’re dealing with sensitive data. Sensitive data can be a pain to keep secure, but such security concerns go away after the data is destroyed. You can’t lose data that no longer exists! It’s usually best practice to destroy the data after a fixed retention period so you have access to it for some period of time but not forever.
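A fixed retention period is easy to check with a short script. Here is a minimal Python sketch, with some loud assumptions: the function name is mine, and it uses each file’s last-modified time as a stand-in for when the retention clock started, which may not match your actual policy. It only lists candidates and deletes nothing.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def files_past_retention(folder, retention_days, now=None):
    """List files whose last-modified time is older than the retention period.

    Assumption: modification time approximates the start of the retention
    period. Nothing is deleted here; the result is a list for review.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

For example, files_past_retention("old_study", 365 * 7) would list everything in a hypothetical old_study folder untouched for roughly seven years. Actually destroying sensitive data securely usually takes more than deleting the files, so check your institution’s policy before acting on the list.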

In many ways, data management is comparable to tidying your home; one must keep things organized and put away in the proper place in order to find them later. This analogy continues for the discarding process. Discarding is an important step in keeping a handle on what you have. So as you manage your data, I hope you consider how strategically trashing files can help keep your digital house in order.

Posted in dataManagement, dataStorage

Copyright and Data

In my quest to educate everyone on research data management, I’m always looking for easier ways to explain things. On the top of this list is copyright, which is weird to begin with and gets pretty squirrelly when applied to data.

My latest effort in this sphere comes as a flyer that covers the basics of research data and copyright. Hopefully, this will give you a better sense of how copyright does (and does not) apply to research data and how this affects what you can do with your and others’ data.

The flyer is CC-BY licensed, so you are free to use and reuse as you like so long as you attribute!

Posted in copyright

Love Your Data

It’s Love Your Data Week, an effort coordinated by one of my amazing data librarian peers, Heather Coates. Love Your Data Week celebrates how prevalent and important data is to research while also acknowledging that we need to give our data some love from time to time.

Each day has a theme and I encourage you to check out the resources on the main site and those appearing on Twitter under the #LYD16 hashtag. I’ve also identified some posts and videos I’ve created over the last few years relating to the 5 topics. Do check them out and think about ways to love your data a little more this week!

Monday – Keep Your Data Safe

Tuesday – Do You Know Where Your Data Is?

Wednesday – Write It Down!

Thursday – Give and Get Credit

Friday – Transforming, Extending, Reusing Data

Posted in Uncategorized

Hello, I’m Kristin Briney and I’m number 0000-0003-1802-0184

It’s only January, but it’s looking like one of the biggest trends of 2016 is going to be linking every researcher to a unique ID. I’m speaking, of course, of ORCID numbers and the recent news that even more publishers are now requiring authors to have an ORCID number when they publish articles in their journals. So if you don’t have an ORCID number, now is the perfect time to get one!

So what’s the big deal with these 16-digit numbers and why would anyone want to be a number?

The problem is best illustrated by the John Smiths and Zhang Weis of the world; have you ever tried to find a paper by someone when there are at least two people with that name in the same subfield? The problem is further exacerbated by the fact that people change institutions, women change their names after marriage, and journals don’t abbreviate names in the same way. How is anyone supposed to find someone else’s scholarship, let alone keep track of their own complete scholarly record?

The answer is to correlate a unique number to each researcher. That way, you know that 0000-0003-1802-0184 always means me and only me. I’m lucky to have a pretty unique name, but I’ve also worked at multiple institutions and in two completely different fields (chemistry and librarianship). Having an ORCID number means that someone else can find all of my scholarly work in an easy way.
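One nice detail of these numbers: the last character of an ORCID iD is a check digit computed from the first 15 digits with the ISO 7064 MOD 11-2 algorithm, which catches most typos (and means the last character is occasionally an “X” rather than a digit). Here is a minimal Python sketch of that check; the function names are my own, and passing the check only shows the number is well formed, not that it is registered to anyone.

```python
import re


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the shape and check digit of an ID like 0000-0003-1802-0184."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0003-1802-0184"))  # True
print(is_valid_orcid("0000-0003-1802-0185"))  # False (last digit mistyped)
```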

I’ve been a big fan of ORCID for a while now and am very excited to see these major adoption milestones happening. I know that many people have already grabbed their own ORCID numbers and now is definitely the time to claim your number if you haven’t! Getting an ORCID number is free and pretty straightforward. Registration is quick, though it may take a little time to associate all of your old papers with your new number when you fill out your profile. Once this is done, however, it’s very easy to maintain your ORCID by occasionally adding new publications. More information can be found at orcid.org.

I really expect that the recent news about ORCID integration will only be the tipping point for this useful system. So don’t be surprised if your publisher, funding agency, or other research-associated organization starts asking for your ORCID soon. This means that you’ll want to claim an ORCID number of your own, if you haven’t already. 2016 is likely to be the year that you need it.

Posted in publishing