Pivot Tables

One of my recent posts touched on how powerful R is for data cleaning and manipulation, but I want to take a step back and recognize that a ton of science gets done in Excel. And while Excel has many limitations (cough dates cough), it does have a place in the research toolkit. So it’s worth discussing some of the software’s more powerful features to get the most out of using it.

In this post, we’re going to talk about a useful feature in Excel that not everyone knows about: pivot tables. If you’re already using pivot tables, you can skip this post entirely and go have a cup of tea instead. But for everyone else, let me blow your mind.

If you know how to write functions in Excel, you know that Excel can easily calculate sums, averages, and counts across all values for one variable in a dataset. What’s more difficult is to segment that variable into subgroups and calculate sums, averages, and counts for each distinct subset. This is where pivot tables come in.

For example, say I did a survey of college students where one variable in my dataset lists year in school (a text value) and another variable contains Likert scale data (a 1-5 integer rating). I want to know how many freshmen, sophomores, juniors, and seniors there are in the dataset. A pivot table can do that. Another example would be calculating the average Likert response for each year-in-school. Or creating a simple table with year-in-school as the rows, Likert value as the columns, and counts of responses as the table entries. Pivot tables can do these too. Basically, any time you want to group your data into subsets and run some simple summary statistics on those subsets, you want a pivot table.

The nice thing about Excel is that it has a little wizard for making pivot tables that, with practice, is fairly straightforward to use. (Note: I’m working in Excel 2016.) To get started, highlight the data you want to analyze, click “Insert/PivotTable”, and select where you want the pivot table to go.

[Image: Insert menu for adding a pivot table in Excel]

Doing this drops in an empty pivot table and opens the PivotTable wizard.

[Image: Pivot table wizard and empty pivot table]

Let’s start with the example of getting counts of how many freshmen, sophomores, juniors, and seniors there are in the dataset. Drag and drop the “Year in School” label from the top of the wizard (under “PivotTable Fields”) down to the “Rows” box at the bottom of the wizard; this will put a list of years into your pivot table but no data. Next, drag the “Year in School” label from the top into the “Values” box at the bottom right of the wizard; this will add values to your table. Note that, for a text field like this, the value defaults to a count. Now you have the table you need!

[Image: Pivot table showing counts by year and wizard settings]
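
(Side note for anyone who read my R post: here’s a rough dplyr sketch of the same count-by-group operation. The data frame and column names below are made up purely for illustration.)

    library(dplyr)

    # Made-up stand-in for the survey data: one row per respondent
    survey <- tibble::tibble(
      year_in_school = c("Freshman", "Freshman", "Sophomore", "Junior", "Senior"),
      likert_value   = c(4, 2, 5, 3, 4)
    )

    # Count respondents per year, like the pivot table's count of "Year in School"
    survey %>% count(year_in_school)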

[Exercise: how would you create a table displaying how many times the Likert values 1-5 appear in your dataset?]

In the next example, we’ll look at the average Likert value by year-in-school. To remove the current count data, you can make a new pivot table, drag variables out of the “Values” box, or click the arrow next to a variable in the “Values” box and select “Remove Field”. With “Year in School” in the “Rows” box, drag and drop “Likert Value” into the “Values” box. Again, the default is count, which isn’t what we want here. Click the arrow next to “Count of Likert Value” in the “Values” box and select “Value Field Settings…”; this opens a menu where you can select different functions, one of which is “Average”. Change this to average and now you have a table of Likert averages displayed by year-in-school!

[Image: Pivot table showing Likert averages by year and wizard settings]
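
(The same average in R, continuing with the made-up survey data frame from the sketch above:)

    # Average Likert value per year-in-school
    survey %>%
      group_by(year_in_school) %>%
      summarise(avg_likert = mean(likert_value))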

[Exercise: how would you calculate standard deviation of Likert values for each year-in-school? Can you display both average and standard deviation in the same table?]

Our final example will add columns to our table. Reset your table. Drag and drop “Year in School” to the “Rows” box in the wizard and “Likert Value” into the “Columns” box. Then drag “Likert Value” into the “Values” box to populate the count data. And now we have our table of response counts broken down by both year-in-school and Likert value. The pivot table also shows totals across both rows and columns, which is a handy check to see if the data looks right.

[Image: Pivot table showing use of columns and wizard settings]
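
(And a rough R equivalent of this cross-tabulation, again using the made-up survey data; tidyr’s pivot_wider spreads the Likert values into columns:)

    library(tidyr)

    # Counts by both year and Likert value, with Likert values as columns
    survey %>%
      count(year_in_school, likert_value) %>%
      pivot_wider(names_from = likert_value, values_from = n, values_fill = 0)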

[Exercise: how would you re-arrange this table to show year-in-school as columns and Likert value as rows? Does this change the calculated counts?]

I’ll let you all play around with the PivotTable wizard further, but know that you can:

  • Filter your data to display only certain variable values (e.g. freshmen and sophomores only)
  • Re-sort your table by data value (e.g. so the highest response counts are at the top of the table); there’s an R sketch of both of these after this list
  • Do different calculations besides average and count
  • Create figures directly from pivot table data
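
(For the filtering and sorting items above, a rough dplyr sketch, once more with the made-up survey data from earlier:)

    # Keep only two groups, count them, and sort with the largest counts first
    survey %>%
      filter(year_in_school %in% c("Freshman", "Sophomore")) %>%
      count(year_in_school) %>%
      arrange(desc(n))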

Pivot tables are very powerful!

If you’ve never used pivot tables before, I hope this post shows you how useful they are and gives you enough information to get started making pivot tables of your own.

Posted in dataAnalysis, spreadsheets

Book Review: Storytelling with Data

[Image: Cover of the book “Storytelling with Data”]

One of the data areas that I’m currently interested in is data visualization, which has resulted in a pile of data viz books on my to-read list. I’ve been working my way through Show Me the Numbers, which may be a future book review, but got interrupted when the ebook of Storytelling with Data became available at the public library. It was a quick and enjoyable read and worth sharing with you!

Storytelling with Data is a useful overview of the best ways to use data to support whatever argument you’re making. In one sense it’s a data visualization book, but it also takes a larger view of how your data visualizations support the story you’re telling. Nussbaumer Knaflic does this through six lessons:

  1. Understand the context
  2. Choose an appropriate visual display
  3. Eliminate clutter
  4. Focus attention where you want it
  5. Think like a designer
  6. Tell a story

You can see from the lessons (chapters) that, while visualization practices make up the core of the book, everything is bookended by putting visualizations into the larger context of the message you are trying to convey.

What I like about this book is that it’s an accessible way to orient yourself to a number of design principles. It provides a framework for rethinking how you approach visualization and sets you on a path toward engaging with visuals in a new way. The storytelling aspect of the book is particularly helpful, and Nussbaumer Knaflic cites guides to effective presentations to underscore her points.

The downside of the book’s accessibility is that I found myself wishing to engage with a number of its ideas at a deeper level. Nussbaumer Knaflic gives some advice on how to choose the proper graph format, but I found Stephanie Evergreen’s Effective Data Visualization (my review here) to be much more helpful on that front. Similarly, Storytelling with Data summarizes cognitive load and the Gestalt principles of visualization, but Stephen Few’s Show Me the Numbers goes into these in enough depth that I feel able to actually apply the concepts to my own visualizations.

In summary, I would definitely recommend Storytelling with Data to someone looking to get started with improving their data visualizations, as it provides a good foundation in the topic. More experienced visualizers may also find useful information in the book’s pages.

Posted in bookReview, dataVisualization

Managing Your Literature

I’m putting my librarian hat on for today’s post as we’re looking at citation managers. There’s a lot to love about this type of software (easy citation formatting!!!), but in essence these tools manage a specific type of data: literature.

Thinking about citation management as a data management issue, we can start applying some of the principles of the latter to the former.

Choosing a tool
There are lots of great citation managers out there and they all work in roughly the same way. Since citations can be exported and imported, you won’t be locked into any one platform, so pick something that fits into your research workflow. Popular options include Zotero, Mendeley, and EndNote; if I had to recommend one, I would opt for Zotero due to its open platform.

Organization
As with any other data, organizing your citations will help you find what you need later (though the universal search in most tools is good). The key is to have a system and stick with it. The most common way to organize literature is by project/paper, and you can definitely take advantage of subfolders for further organization.

File naming
Citation managers allow users to upload article PDFs along with citation information, which can be helpful for keeping everything in one place; this isn’t a necessary practice, more of a personal workflow choice. So while you may not need nicely named PDF files if they live in your citation manager (as you can easily search for them), I still recommend using good names for your PDFs so you can move them in and out of the tool. The naming scheme I like for my literature is “FirstAuthorLastName_YYYY_ShortTitle.pdf”, e.g. “Briney_2018_TheProblemWithDates.pdf”.
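
(If you like to script this sort of thing, here’s a trivial R sketch of the naming scheme; the values simply recreate the example above.)

    # Illustrative only: assemble a file name in the
    # FirstAuthorLastName_YYYY_ShortTitle.pdf scheme
    author <- "Briney"
    year   <- "2018"
    title  <- "TheProblemWithDates"
    paste0(author, "_", year, "_", title, ".pdf")
    #> [1] "Briney_2018_TheProblemWithDates.pdf"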

Data quality control
My colleague who teaches students about citation management has a great demonstration where she’s imported the same reference from three different sources (the journal, a database, and Google Scholar) into the citation manager, resulting in three slightly different records. So even if your import is automatic, performing quality control is still a good idea. Good times to do this are when you import a citation and when you do the final proof of a manuscript.

Documentation
I’m a big fan of documentation, and I see two good ways to document the literature you keep in your citation manager. The first is to use any built-in notes tool that your citation manager provides. The second is to keep good notes yourself and be clear about which citation you are referring to (good file naming can also help reinforce this connection). Either way, I recommend making notes on articles if you’re doing a lot of reading or are likely to lose track of which article is which over time.

Data backup
As with any other data source, you should back up your citation library. Export your citations and save them to a file (I use the BibTeX format), then back this file up. With many citation managers using cloud-based storage, this gives an added layer of security and also allows you to easily switch platforms.

Data sharing
Citation sharing has gotten a lot easier in the past decade as citation managers have moved their content into the cloud. At this point, data sharing is mostly a permissions issue: making sure the right people have access to the right content.

If you’re using a citation manager, you’re probably doing many of the above practices. Still, I think it’s a valuable exercise to think of citations as data to make sure that we’re caring for this information in the best possible way!

Posted in dataManagement

Adding R to the Data Toolkit

I’ve officially jumped on the R bandwagon. I worked on a project last year for which R turned out to be the best solution for tackling a lot of messy data (OpenRefine was not reproducible enough and let’s not even talk about the disaster that was Access). Since then, I’ve thrown other data at R and now consider it part of my regular suite of data tools.

I want to emphasize that last point: R is just one piece of the data toolkit. Software like R has a steep learning curve if you’ve never programmed before, and there are other tools, like OpenRefine, that get the job done and are friendlier to the average user. But for processing large amounts of data in a reproducible way, R is definitely worth learning. (Here’s roughly how I break my data needs down: Excel is for everyday data work; OpenRefine is for one-off data cleaning; and R is for large-scale/reproducible data cleaning and processing.)

So if you find yourself with a lot of data to process, I have some tips for learning R:

  • Run R in RStudio.
    • It takes a little effort to learn the RStudio interface, but it will be a much better experience than base R if you’re not used to the command line.
  • Have a problem to solve.
    • Learning a programming language is always easier if you have a specific task to accomplish.
  • Take advantage of existing resources.

Finally, I should say that I’m a patron of the Tidyverse, which is a flavor of R that comes with its own tools and methods for data handling. The Tidyverse makes data cleaning easy but you do have to organize your data in a particular way, with columns as variables and rows as individual observations. Tidy data is not condensed data and usually leads to a few columns with rows and rows of data, but this formatting enables streamlined processing. It’s not necessary to use the Tidyverse to use R, but it can be quite useful.
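
To make the columns-as-variables, rows-as-observations idea concrete, here’s a minimal, made-up sketch of reshaping a “wide” table into tidy form with tidyr (the data and column names are invented):

    library(tidyr)

    # A made-up wide table: one row per student, one column per question
    wide <- tibble::tribble(
      ~student, ~q1, ~q2,
      "A",        4,   5,
      "B",        2,   3
    )

    # Tidy form: one row per observation (a student answering one question)
    pivot_longer(wide, cols = c(q1, q2),
                 names_to = "question", values_to = "likert_value")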

R is not the most efficient way to solve every data problem and it takes time to learn, but I think there is an advantage to learning a language like R (or Python or…) if you have serious data manipulation needs. Does it have a place in your data toolkit?

Posted in dataAnalysis

Breaking the Blogging Silence

Wow what a year it’s been. I know that it has been really quiet on the blog since this time last year and that’s because life has been anything but quiet!

I stopped posting here a year ago when we added a new member to the family. And just when things looked to be calming down, we decided to move cross-country from Wisconsin to California. It’s been a big change and I’m still getting used to the weather, the culture, and the commute.

The good news is that I’ve started a new job as a Biology Librarian and will continue to do some data work in this role. So there will be future posts on data management tips and tricks! I’m looking forward to being back.

Posted in admin

Data Management in Library Learning Analytics

My latest paper was published this week and I am so very excited to share it with you all. It is Data Management Practices in Academic Library Learning Analytics: A Critical Review.

Every article has a story behind it, and this one, as happens with the best articles, started with me getting very annoyed. I had just been introduced to the concept of library learning analytics and was reading a pair of studies for a different project. I couldn’t focus on the purpose of the studies because I kept running into concerns with how the researchers were handling the data. What annoyed me most was that one of the studies kept insisting that its data was anonymous when it clearly wasn’t, which has huge implications for data privacy. A little poking around made me realize that such data problems appear with terrible frequency in library learning analytics.

There’s quite a history of ethical debates around library learning analytics but almost no research on the data handling practices which impact patron privacy in very practical ways. After a little digging through the literature and a lot of shouting at my computer, I knew I had to write this paper.

So what did I find? Libraries: we need to do better. For all that we talk about patron privacy, there is sufficient evidence to show that we’re not backing up that intent with proper data protections. The best way to protect data is to collect limited amounts, de-identify it where possible, secure it properly, and keep it for a short time before deleting it. We’re not doing that. I’m also concerned about how we handle consent and opt-in/out, something I didn’t originally intend to study but couldn’t ignore once I started reading. There’s a lot more in the paper, including some explanations of why these are best practices, so I encourage you to go there for more details. And afterward go figure out how to protect your data better.

Finally, I need to again thank Abigail Goben and Dorothea Salo for acting as my sounding boards through this entire process. They listened to me rant, helped me work out a path for this research, and edited drafts for me. I am deeply grateful for their assistance, and I know this paper would not be half as good without their help.

Posted in dataManagement, libraries