Love Your Data 2017: Finding the Right Data

It’s that time of the year again – Love Your Data Week. This annual celebration focuses on getting more from your data with helpful data management tips and skills.

Each day of Love Your Data Week has its own theme, but I want to focus on today’s theme (Thursday) because it’s something that I have not discussed on the blog before: finding the right data.

You might be used to finding books and journals for your research, but finding data is often more difficult. That is because data systems are rarely connected and there is no guarantee that the data you want even exists. It can be hard to even know where to begin looking, which makes the search for data feel time-consuming and full of rabbit holes. Thankfully, I’m here to share some strategies!

The best strategy for finding data is to think about who may be creating data and search their websites/publications

  • Is it government data? Agencies like the Centers for Disease Control and Prevention (CDC), the Department of Education, the Census Bureau, etc. often make data available (see the search sketch after this list).
  • Consider non-governmental organizations who might have data, like the United Nations, World Health Organization, International Monetary Fund, etc.
  • Private businesses often make data available for purchase. These resources are sometimes available through your local library.
  • Individual researchers are increasingly sharing their study data for reproducibility purposes. Check out their publications or send the corresponding author an email.
  • Might the data live in a special data repository? re3data lists a huge variety of repositories, many of which are subject specific.
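
If you suspect the data lives behind a government data portal, you can often search it programmatically as well as through the website. As a rough sketch (assuming the standard CKAN search API that catalog.data.gov exposes; endpoints and field names can change, so treat this as illustrative), a quick dataset search in Python might look like:

    import requests

    # Search data.gov's CKAN catalog for datasets matching a keyword
    url = "https://catalog.data.gov/api/3/action/package_search"
    response = requests.get(url, params={"q": "graduation rates", "rows": 5})
    response.raise_for_status()

    for dataset in response.json()["result"]["results"]:
        print(dataset["title"])
        # Each dataset lists downloadable resources (CSV, JSON, etc.)
        for resource in dataset.get("resources", []):
            print("   ", resource.get("format"), resource.get("url"))

Many government and NGO portals offer documented APIs of their own, so the same idea applies well beyond data.gov.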

Be aware that your local library probably has a few data resources

  • Libraries sometimes subscribe to databases that contain datasets instead of articles. An added benefit here is that you can ask a librarian for help with these resources!

To get started, try scanning publications for data

  • Published articles (newspaper and journal articles) may contain data tables and references to data sources. This is a good place to start if you are looking for background information on a topic.
  • Journal articles are increasingly linking to the data and code used for analysis. See if the publication mentions accompanying data, check out the supplemental information, or email the author.

Add the word ‘statistics’ or ‘data’ to your searches

  • This works in Google, the library catalog, or any other search tool.

When in doubt, ask for help

  • Librarians are really good at finding information (and that includes data!).

Remember, finding data often involves brainstorming and rethinking your search strategy when you hit a dead end. Try taking a step back when you get stuck. If you can’t find the specific data you are looking for, is there a more general dataset that you can still use to build your case/provide background? Finally, don’t forget to cite your sources!

I hope these tips help you find data for your research and that you can learn other helpful strategies from Love Your Data Week and its Twitter stream #LYD17!

Book Review: Effective Data Visualization

I know a little bit about a lot of data things, but one area I’m weak in is data visualization. Sure, I can make a graph in Excel, but that doesn’t mean the graph is necessarily good. Thankfully, Sal Gore blogged a recommendation for the book Effective Data Visualization and, after a quick read, I’m feeling like a data viz wiz.

What I like about this book is that it doesn’t assume any data visualization knowledge beyond basic familiarity with Excel. That Excel focus is actually a plus, as the author, Stephanie Evergreen, shows you how to make most of these charts IN EXCEL. I know I’ve previously ragged on Excel on this blog, but it really is the first place most people start with data viz. So if we’re all going to start there, at least this book shows you how to make your Excel charts not suck. Even better, Evergreen tells you how difficult each chart will be to create in Excel by including a helpful Excel ninja rating.

The other thing that’s great about this book is that charts are organized by the type of data you want to present. Categories include: a single number, comparisons, beating a benchmark, survey results, parts of a whole, correlations, qualitative data, and data over time. Evergreen bases her selection of charts on research showing which chart types are more effective for information retention. It’s a different way to think about charts, but one that I’m finding really useful.

The range of covered charts includes the usual suspects, from bar charts to scatter plots, but Evergreen also details visuals that I haven’t used before. The ones I plan to immediately add to my graphing repertoire are: icon arrays, slopegraphs, dot plots, back-to-back bar charts, and small multiples graphs.
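
Evergreen builds all of these in Excel, but the ideas carry over to any charting tool. For readers who script their charts instead, here is a rough slopegraph sketch in Python with matplotlib (the program names and values are made up purely for illustration; this is not an example from the book):

    import matplotlib.pyplot as plt

    # Made-up before/after values for three hypothetical programs
    labels = ["Program A", "Program B", "Program C"]
    before = [42, 35, 58]
    after = [55, 30, 61]

    fig, ax = plt.subplots(figsize=(4, 5))
    for label, b, a in zip(labels, before, after):
        ax.plot([0, 1], [b, a], marker="o", color="steelblue")
        ax.text(-0.05, b, f"{label}  {b}", ha="right", va="center")
        ax.text(1.05, a, str(a), ha="left", va="center")

    # Strip chart junk: no frame, no y-axis, just the two time points
    for side in ("top", "right", "left", "bottom"):
        ax.spines[side].set_visible(False)
    ax.set_yticks([])
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["2015", "2016"])
    ax.set_xlim(-0.6, 1.6)
    ax.set_title("Scores rose for two of three programs")
    plt.tight_layout()
    plt.show()

The same skeleton works for dot plots and small multiples: plot the data plainly, label it directly, and remove everything that does not carry information.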

Beyond choosing the right chart and knowing how to make it in Excel (which, of themselves, are incredibly useful skills), this book gave me a framework for creating charts that are easy to read and convey a clear message. For example, I now understand how to write an effective chart title, select good colors, reduce data overload, and eliminate chart junk. It’s reached the point where I can’t even look at my old graphs without wanting to tweak them.

There is one downside to this book: it was produced with two-color printing, so all of the charts are limited to shades of blue and grey. While this makes for a visually cohesive (and cheaper) book, the printed figures occasionally do not fully convey the author’s point – most often when showing a bad chart. This is annoying, but it’s not enough to detract from the many good things about this book.

Overall, Effective Data Visualization is the perfect book for people who want to level up their data visualization skills beyond the defaults in Excel. I’ve learned so much from this book and it has fundamentally changed the way I think about visualizing data. I hope that you will find it just as useful.


#TrumpSci

There have been several discussions among my data librarian colleagues about the future of open data and science in 2017, spurred on by articles such as this one on the future of data sharing and these articles on the continued existence of government-held climate data.

These concerns are realistic. We’ve seen from our neighbors in Canada that politics can have a profound impact on the sharing of science. In turn, librarians have a role to play in advocating for continued access to information (shout out to the amazing John Dupuis for that last link).

Relevant to my work here, the two things I’m most concerned about are:

  • Continued existence of requirements for funding agency data sharing.
  • Muzzling of researchers, particularly climate scientists.

I’m going to try to keep up with what is going on in these two areas and will occasionally share my thoughts back here. In the meantime, I’ve started a #TrumpSci bookmarks list that you can follow along with here: list and RSS feed.

Please send me relevant stories as you find them!

Edited to add (2016-12-15): The wonderful John Dupuis preempted me with a Trump list. I’m still going to work on my list and talk about this topic on the blog, but in the meantime you should definitely check out his more thorough roundup.


The Many Layers of Open

[Sketchnote: Open Data Layers]

I was at OpenCon last week and left with lots of ideas about being open. In particular, my general understanding of open broadened: open is really just a means to other ends, advocacy is necessary, and improved access and data literacy need to go hand-in-hand with opening up data. I still have a lot to process, but I wanted to blog about one issue that I sketchnoted (above) during the “Open Data Brainstorming” unconference session: the many layers of opening up data.

The point is this: Open Data doesn’t have to be an all-or-nothing thing. This idea doesn’t really align with the big goals of OpenCon, but I think a layered approach to Open Data is very practical for those researchers who aren’t used to sharing data.

Instead of making your data totally open or totally closed, it might be better to think of the following layers of openness:

  • Making data usable for yourself
  • Making data usable for your future self
  • Making data usable for your coworkers/boss
  • Making data usable for your discipline
  • Making data usable for everyone
[Figure: Layers of Open Data]

While the last layer is the ultimate goal of “Open Data”, I definitely think that there is value in the inner layers. For example, even if your data isn’t totally open, it can be of huge benefit for your data to be usable to your coworkers/boss instead of just yourself. The other reason that this model works is that data tailored for one layer is not automatically usable in the next layer out – though the reverse is usually true!

A related idea, and one that I’ve already blogged about, is the hidden cost of Open Data. (Basically, Open Data takes work but data management makes it easier to put your data in a form ready to be used by other people.) But if we think about Open Data in a layered approach, the cost comes in stages rather than all at once.

So instead of saying “you must make your data totally open”, I challenge you to move a layer out. For example, are you terrible at data management? Try making data more useful to your future self. Can your coworkers/boss already understand and use your data? Then put practices into place to make that data usable to others in your field. Each of these steps outward brings concrete benefit to yourself and others.

I really think that Open Data can be a layered process. Not only does this help us recognize the work that open requires, it can also help bring researchers who aren’t used to sharing on board with the idea of Open Data.


Spreadsheet Best Practices

I gave a webinar recently on tools and tips for data management. While many of the themes I spoke about have been covered here previously (though you should really check out the webinar slides – they’re awesome), I realized that I have never written about spreadsheets on the blog. Let’s rectify that.

The big thing to know about spreadsheets is that best practices emphasize computability over human readability; this is likely different from how most people use spreadsheets. The focus on computability is partly so that data can be ported between programs without any issue, but also to avoid capturing information solely via formatting. Formatted information is not computable information, which defeats the main purpose of using a spreadsheet. It’s better to have a clean table that is computable in any program than to have a spreadsheet that looks nice but is unusable outside of a particular software package.

With computability in mind, here are a few best practices for your spreadsheets:

  1. Spreadsheets should only contain one big table. The first row should be variable names and all other rows data. Smaller tables should be collapsed into larger tables wherever possible.
  2. Kick graphs and other non-data items out of your spreadsheet tables. If you’re in Excel, move them to a separate tab.
  3. Keep documentation to a minimum in the spreadsheet. (This is where data dictionaries come in handy.)
  4. Differentiate zero from null (a measured value of zero is not the same as a missing value).
  5. Remove all formatting and absolutely NO MERGED CELLS. If formatting carries meaning (like highlighted rows), add a variable to encode that information instead.

If you follow these rules, you should create spreadsheets that are streamlined and can easily move between analysis programs.
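
To make that payoff concrete, here is a hedged sketch of loading such a spreadsheet into Python with pandas (the file name, column names, and the "NA" null code are made up for illustration):

    import pandas as pd

    # One table per file: the first row holds variable names, every other row holds data.
    # Nulls are recorded explicitly as "NA" so they are never confused with a true zero.
    df = pd.read_csv("field_measurements.csv", na_values=["NA"])

    print(df.dtypes)                  # columns import as proper numeric/text types
    print(df["count"].sum())          # zeros contribute to totals...
    print(df["count"].isna().sum())   # ...while nulls are counted separately as missing

Because nothing is encoded in formatting or merged cells, the same file opens identically in R, Python, or any other analysis tool.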

Such portability is important for two reasons. First, there are many great data analysis tools you may want to leverage, but they probably won’t import messy spreadsheet data. Second, Excel, ubiquitous as it is in research, has issues of its own; for example, a recent study showed Excel-related gene name errors in up to one-fifth of supplemental data files from top genomics journals, not to mention that Excel is known to mangle dates. It’s therefore best to keep your options open and your data as neutral as possible.

I hope you use these best practices to streamline your spreadsheets and take maximum advantage of your data!


Open Data’s Dirty Little Secret

Earlier this week, I was very happy to take part in the Digital Science webinar on data management. I spoke about how data management should be accessible and understandable to all and not a barrier to research. I also made a small point, thrown in at the last minute, that really seemed to resonate with people: that open data has a dirty little secret.

The secret? Open data requires work.

In all of the advocacy for open data, we often forget to account for the fact that making data openly available is not as easy as flipping a switch. Data needs to be cleaned so that it doesn’t contain extraneous information, streamlined to make things computable, and documented so that another researcher can understand it. On top of this, you must choose a license and take time to upload the data and its corresponding metadata. One researcher estimated that this process required 10 hours for a recently published paper, with significantly more time spent preparing his code for sharing.

But there is another secret here. It’s that data management reduces this burden.

Managing your data well means that a good portion of the prep work is done by the time you go to make the data open. This is done via spreadsheet best practices, data dictionaries, README.txt files, etc. Well-managed data is already streamlined and documented and thus presents a lower barrier to making it open.
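
As a small illustration of the documentation piece, here is a hypothetical, minimal README sketch (the file names, variables, and license are placeholders, not a prescription), written out here with a few lines of Python:

    # Sketch of a minimal README for a shared dataset; adapt every field to your project.
    readme = """\
    Dataset: field_measurements.csv
    Collected: May-September 2016 by the project team
    Variables:
      site_id - unique identifier for each sampling site
      count   - number of specimens observed (0 = none observed, NA = not recorded)
      date    - sampling date, formatted YYYY-MM-DD
    License: CC0
    Contact: corresponding author's email address
    """

    with open("README.txt", "w", encoding="utf-8") as f:
        f.write(readme)

A plain text file typed by hand works just as well; the point is that this documentation exists long before anyone asks for the data.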

These issues are reinforced by the recently published “Concordat on Open Research Data”. The Concordat consists of 10 principles, and these two in particular stuck out to me:

  • Principle 3: Open access to research data carries a significant cost, which should be respected by all parties.
  • Principle 6: Good data management is fundamental to all stages of the research process and should be established at the outset.

As we advocate for open data, Principle 3 reaffirms that we need to recognize the costs. But – as with most things I blog about here – there is a solution, and it’s managing your data better.
