Managing Your Literature

I’m putting my librarian hat on for today’s post as we’re looking at citation managers. There’s a lot to love about this type of software (easy citation formatting!!!) but in essence, they’re a tool for managing a specific type of data: literature.

Thinking about citation management as a data management issue, we can start applying some of the principles of the latter to the former.

Choosing a tool
There are lots of great citation managers out there, and they all work in roughly the same way. Because citations can be exported and imported, you won’t be locked into any one platform, so I would pick something that fits into your research workflow. Popular options include Zotero, Mendeley, and EndNote; if I had to recommend one, I would opt for Zotero due to its open platform.

As with any other data, organizing your citations will help you find what you need later (though universal search in most tools is good). The key is to have a system and stick with it. The most common way to organize literature is by project/paper, and you can definitely take advantage of subfolders for further organization.

File naming
Citation managers allow users to upload article PDFs along with citation information, which can be helpful for keeping everything in one place; this isn’t a necessary practice, more of a personal workflow choice. So while you may not need nicely named PDF files if they live in your citation manager (you can easily search for them), I still recommend using good names for your PDFs so you can move them in and out of the tool. The naming scheme I like for my literature is “FirstAuthorLastName_YYYY_ShortTitle.pdf”, e.g. “Briney_2018_TheProblemWithDates.pdf”.
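If you want to automate the scheme, here’s a minimal sketch in Python (the function name and the word-splitting rule are my own choices, not part of any citation manager):

```python
import re

def literature_filename(first_author_last, year, short_title):
    """Build a PDF name like 'Briney_2018_TheProblemWithDates.pdf'."""
    # CamelCase the short title and drop anything unsafe for filenames
    title = "".join(word.capitalize() for word in re.split(r"\W+", short_title) if word)
    return f"{first_author_last}_{year}_{title}.pdf"

print(literature_filename("Briney", 2018, "The problem with dates"))
# → Briney_2018_TheProblemWithDates.pdf
```

A few lines like this can rename a whole folder of downloads in one pass, which is handy when moving files in and out of the tool.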

Data quality control
My colleague who teaches students about citation management has a great demonstration where she’s imported the same reference from three different sources (the journal, a database, and Google Scholar) into the citation manager, resulting in three slightly different records. So even if your import is automatic, performing quality control is still a good idea. Optimal times to do this are when you first import a citation and as you do the final proof of a manuscript.
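That field-by-field comparison can itself be scripted. Here’s a hedged Python sketch (the records are invented for illustration) that flags the fields where imported copies of the same reference disagree:

```python
def diff_records(records):
    """Given citation records for the same paper from different sources,
    report fields whose values disagree -- likely spots for manual cleanup."""
    fields = set().union(*(r.keys() for r in records))
    return {f: sorted({str(r.get(f, "")) for r in records})
            for f in fields
            if len({str(r.get(f, "")) for r in records}) > 1}

journal = {"title": "The Problem with Dates", "year": "2018", "pages": "1-10"}
scholar = {"title": "The problem with dates", "year": "2018", "pages": "1–10"}
print(diff_records([journal, scholar]))  # 'title' and 'pages' disagree
```

Even small mismatches like capitalization or en-dashes in page ranges are exactly the kind of thing that slips through an automatic import.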

I’m a big fan of documentation, and I see two good ways to document the literature you keep in your citation manager. The first is to use any built-in notes tool that your citation manager provides. The second is to keep good notes yourself and be clear about which citation you are referring to (good file naming can also help reinforce this connection). Either way, I recommend making notes on articles if you’re doing a lot of reading or are likely to lose track of which article is which over time.

Data backup
Like any other data, your citation library should be backed up. Export your citations and save them to a file (I use the BibTeX format), then back that file up. With many citation managers using cloud-based storage, this gives an added layer of security and also allows you to easily switch platforms.
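The export-then-copy step is easy to script. Here’s a hedged Python sketch (the date-stamped naming and folder layout are my own choices, not any tool’s convention):

```python
import shutil
from datetime import date
from pathlib import Path

def backup_library(export_path, backup_dir):
    """Copy an exported citation file (e.g. a BibTeX export) into a
    backup folder, stamping the copy with today's date."""
    export_path, backup_dir = Path(export_path), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # e.g. library.bib -> backups/library_20240115.bib
    dest = backup_dir / f"{export_path.stem}_{date.today():%Y%m%d}{export_path.suffix}"
    shutil.copy2(export_path, dest)  # copy2 preserves file metadata
    return dest

# Hypothetical usage: backup_library("library.bib", "backups")
```

Dated copies mean you can also roll back to an earlier version of your library if an import ever mangles your records.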

Data sharing
Citation sharing has gotten a lot easier in the past decade as citation managers have moved their content into the cloud. At this point, data sharing is more of a permissions issue in making sure the right people have access to the right content.

If you’re using a citation manager, you’re probably doing many of the above practices. Still, I think it’s a valuable exercise to think of citations as data to make sure that we’re caring for this information in the best possible way!

Posted in dataManagement

Adding R to the Data Toolkit

I’ve officially jumped on the R bandwagon. I worked on a project last year for which R turned out to be the best solution for tackling a lot of messy data (OpenRefine was not reproducible enough, and let’s not even talk about the disaster that was Access). Since then, I’ve thrown other data at R and now consider it part of my regular suite of data tools.

I want to emphasize that last point: R is just one piece of the data toolkit. Software like R has a steep learning curve if you’ve never programmed before. There are other tools, like OpenRefine, that get the job done and are friendlier to the average user. But for processing large amounts of data in a reproducible way, R is definitely worth learning. (Here’s roughly how I break my data needs down: Excel is for everyday data work; OpenRefine is for one-off data cleaning; and R is for large-scale/reproducible data cleaning and processing.)

So if you find yourself with a lot of data to process, I have some tips for learning R:

  • Run R in RStudio.
    • It takes a little effort to learn the RStudio interface, but it’s a much better experience than working with base R at the command line.
  • Have a problem to solve.
    • Learning a programming language is always easier if you have a specific task to accomplish.
  • Take advantage of existing resources.
    • There are many free tutorials and workshops (Software Carpentry, for example) so you don’t have to learn everything from scratch.

Finally, I should say that I’m a patron of the Tidyverse, a flavor of R that comes with its own tools and methods for data handling. The Tidyverse makes data cleaning easy, but you do have to organize your data in a particular way: columns as variables and rows as individual observations. Tidy data is not condensed data (it usually means a few columns with rows and rows of data), but this formatting enables streamlined processing. It’s not necessary to use the Tidyverse to use R, but it can be quite useful.
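In Tidyverse terms, that wide-to-tidy reshaping is what tidyr’s pivot_longer does. To show the shape of the idea without assuming R, here’s the same transformation sketched in plain Python with made-up data:

```python
# A "wide" table: one row per sample, one column per measurement year
wide = [
    {"sample": "A", "2017": 1.2, "2018": 1.5},
    {"sample": "B", "2017": 0.9, "2018": 1.1},
]

# The tidy version: columns are variables (sample, year, value),
# and every row is a single observation
tidy = [
    {"sample": row["sample"], "year": year, "value": row[year]}
    for row in wide
    for year in ("2017", "2018")
]
print(tidy)  # 4 rows: one per sample-year observation
```

Notice how two wide rows become four tidy rows: fewer columns, more rows, and every value sits in the same place regardless of which year it came from.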

R is not the most efficient way to solve every data problem and it takes time to learn, but I think there is an advantage to learning a language like R (or Python or…) if you have serious data manipulation needs. Does it have a place in your data toolkit?

Posted in dataAnalysis

Breaking the Blogging Silence

Wow what a year it’s been. I know that it has been really quiet on the blog since this time last year and that’s because life has been anything but quiet!

I stopped posting here a year ago when we added a new member to the family. And just when things looked to be calming down, we decided to move cross-country from Wisconsin to California. It’s been a big change and I’m still getting used to the weather, the culture, and the commute.

The good news is that I’ve started a new job as a Biology Librarian and will continue to do some data work in this role. So there will be future posts on data management tips and tricks! I’m looking forward to being back.

Posted in admin

Data Management in Library Learning Analytics

My latest paper was published this week and I am so very excited to share it with you all. It is Data Management Practices in Academic Library Learning Analytics: A Critical Review.

Every article has a story behind it and this one, as happens with the best articles, started with me getting very annoyed. I had just been introduced to the concept of library learning analytics and was reading a pair of studies for a different project. I couldn’t focus on the purpose of the studies because I kept running into concerns with how the researchers were handling the data. What annoyed me most was that one of the studies kept insisting that their data was anonymous when it clearly wasn’t, which has huge implications for data privacy. A little poking around made me realize that such data problems appear with terrible frequency in library learning analytics.

There’s quite a history of ethical debates around library learning analytics but almost no research on the data handling practices which impact patron privacy in very practical ways. After a little digging through the literature and a lot of shouting at my computer, I knew I had to write this paper.

So what did I find? Libraries: we need to do better. For all that we talk about patron privacy, there is sufficient evidence to show that we’re not backing up that intent with proper data protections. The best way to protect data is to collect limited amounts, de-identify it where possible, secure it properly, and keep it for a short time before deleting it. We’re not doing that. I’m also concerned about how we handle consent and opt-in/out, something I didn’t originally intend to study but couldn’t ignore once I started reading. There’s a lot more in the paper, including some explanations of why these are best practices, so I encourage you to go there for more details. And afterward go figure out how to protect your data better.
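To make “de-identify where possible” a little more concrete, here’s a minimal Python sketch of one common step: replacing patron IDs with salted hashes. This is my illustration, not a method from the paper, and as the comments note it produces pseudonymous (not anonymous) data, which is exactly the distinction that annoyed me in the studies above:

```python
import hashlib
import secrets

# A secret salt, stored separately from the data and destroyed on a
# schedule; without it, the hashed IDs cannot be regenerated from
# the original patron IDs.
SALT = secrets.token_hex(16)

def pseudonymize(patron_id):
    """Replace a patron ID with a salted hash.
    Caveat: this is pseudonymization, not anonymization. Records can
    still be linked to each other, and re-identification may remain
    possible via the other fields in each record."""
    return hashlib.sha256((SALT + patron_id).encode()).hexdigest()[:16]

print(pseudonymize("patron-00123"))
```

Hashing alone doesn’t satisfy the other best practices (limited collection, security, short retention), but it removes direct identifiers from the working dataset while the analysis runs.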

Finally, I need to again thank Abigail Goben and Dorothea Salo for acting as my sounding boards through this entire process. They listened to me rant, helped me work out a path for this research, and edited drafts for me. I am deeply grateful for their assistance, and I know this paper would not be half as good without their help.

Posted in dataManagement, libraries

Taking a Break: Some Stories of Documentation

I’ve been thinking a lot about documentation this month as I prepare to take 12 weeks of leave away from my job. The upside is that I’ve had 9 months to plan for this, but I will also say that following good data management and reproducibility practices has greatly helped with the efforts to shift my duties temporarily to others.

In today’s post, I want to share some snapshots of how I’m documenting tasks so that others can perform them in my absence. I hope these vignettes offer inspiration to anyone aiming to provide enough documentation with their work, whether they are taking leave or not.

Story #1

My smoothest project to shift involved some R code I wrote over the summer to run automated reports for my library’s public services statistics. I knew going into the project that I would not be able to run the reports myself during the key period at the end of the semester, so I made sure to document everything at a level a novice could pick up. In practice, this meant including a README.txt that walks someone through everything from installing the software to adjusting key variables to running the code. I also tried to make clear within the code, via comments, which parts needed to be updated to customize the reports. Building code with the intention that others will use it is really the best practice, and I can see how the benefits of this approach will extend beyond my 12 weeks away.
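For anyone wanting a starting point, here’s a hedged sketch of the sections such a README.txt might contain (the headings, and the file name report.R, are invented for illustration, not my actual file):

```text
README.txt — Public services statistics reports

1. Setup: install R and RStudio; install the required packages (list them).
2. Inputs: where the raw statistics files live and the format they must be in.
3. Configuration: which variables in report.R to edit (date range, output folder).
4. Running: open report.R in RStudio and run it top to bottom; expected runtime.
5. Outputs: what files are produced and where they are saved.
6. Troubleshooting: common errors and who to contact.
```

The point is less the exact headings than covering the full path from a fresh machine to a finished report, so a novice never has to guess a step.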

Story #2

Another task I’m temporarily shifting is acting as Secretary for my professional association. Again, I’ve helped myself a lot here by having a good README.txt file laying out the structure and permissions for all of the files I manage as Secretary. So it was simply a matter of adding notes on my duties so that they could be adequately covered.

Story #3

A more involved project to shift is a research project I’m on for which I have an assistant. Key documentation here included a timeline for assistant onboarding tasks and a lot of communication with my collaborators. The timeline turned out to be a good idea, generally, as expectations are clear for everyone; this is likely a method I’ll use in the future. Otherwise, I’m trying to go for a more-communication-is-better approach, which requires extra work but will benefit everyone when I’m away.

These three vignettes show [what I hope are] successful efforts to document tasks for others. I think what makes them good is that I’ve built a lot of the documentation into the projects to begin with, making it easier for me to pass stuff on now. I admit that not every project I run is documented to the levels described here but my future self is usually more grateful than not for taking 10 minutes early on to write a README or send a status email.

I hope these stories provide you with some ideas for your own projects that may need to be passed along or picked up again by your future self. Even a little documentation created early in a project is helpful and usually doesn’t take a huge amount of time to create. The benefits to your collaborators and your future self usually make it all worthwhile.

Posted in documentation