Thoughts on Data Management as Housekeeping

My colleague Carolyn Bishoff at the MDLS20 conference introduced me to the idea that data management is like housekeeping: it’s a task that you have to continually do in order to live and thrive in your environment. It’s not something that we always enjoy doing, and we can get away with doing the bare minimum in order to survive, but it’s still something that needs to be done.

Carolyn continued this metaphor in the scope of teaching people how to do data management (which is something I do as part of my job). She likened it to teaching someone to do laundry; just because you know how to wash and fold your clothes doesn’t mean that you actually get your laundry done. I think this is a good reminder for everyone, data management instructors and practitioners alike, about the continual nature of the work and that knowing doesn’t necessarily translate into doing.

I’m also reflecting on another talk by my colleague Hannah Gunderman at the RDAP21 conference, who acknowledged how much anxiety exists around data management. When we talk about “data best practices” (which I’ve done frequently), that can create anxiety because we feel like we aren’t living up to that ideal data standard. Instead, Gunderman suggested using the term “recommended practices” and recognizing that the perceived ideal is impossible. We might desire to be the Martha Stewart of data management (I will admit to personally having this desire), but it’s not a realistic standard for everyday life.

A third reference I want to pull into this reflection is from the book Unf*ck Your Habitat by Rachel Hoffman. It’s a book about literal housekeeping but I think some of the lessons apply to data management. Namely, Hoffman recommends that, instead of doing deep cleaning sprees when your house gets super messy, you should regularly set aside small amounts of dedicated time (with the duration depending on your energy and ability) to try to improve your environment. This can be 5 minutes or 30 minutes, but when that time’s up you stop cleaning and take a short break. You won’t be able to clean everything during these short periods but you can actually make a positive difference in this short amount of time. This incremental method gives us the ability to make improvements while also relieving ourselves of the need for housekeeping perfection.

Finally, I’m thinking about housekeeping as care work, which is often invisible and gendered. If we use this metaphor, we need to recognize that housekeeping labor has mainly been the province of women in American society (and other Western countries) and has been historically undervalued. It’s unpaid labor and, even though it’s critical to a functioning society, it’s made invisible (see Abigail Goben’s Women’s Labor in COVID bibliography for all of the ways our reliance on unpaid care labor has broken us during this pandemic). I think data management is a form of care work, in that we are caring for our research results, yet the act of managing data is often rendered invisible until a data disaster happens. This perspective also makes me wonder whether data management is a gendered act within the research enterprise.

While metaphors always have their limitations, I think there is value in thinking of data management as housekeeping. There’s no one right way to keep your house clean and there’s no one right way to keep your data organized, but there’s value in making continual small steps to make things better. Embrace the imperfection and do what you can to make your data a little more organized than before; these small differences really do help. And finally, as a collective we must value the work of data management, even when there’s societal pressure to render it invisible.

Posted in dataManagement

Citation Omitted: A Story of Re-identification

I published the article “Data Management Practices in Academic Library Learning Analytics: A Critical Review” in 2019. Eagle-eyed readers may have noticed that I omitted a couple of citations, instead listing them as “citation omitted in order to protect students’ identities.” This is because the two studies in question published student details so identifying that it would be possible to attach names to those individuals. Of the two studies, one was incidentally identifying and one was egregiously identifying. This blog post is about the egregious one:

Murray, A., Ireland, A., & Hackathorn, J. (2016). The Value of Academic Libraries: Library Services as a Predictor of Student Retention. College & Research Libraries, 77(5), 631–642.

I had not publicly identified the study until January 6 of this year when, in my frustration about C&RL publishing yet another study with privacy concerns and the simultaneous unfolding of American democracy, I vented in a Twitter rant.

Then, to properly explain my concerns about privacy and re-identification in the article, I followed up with a second Twitter thread.

I’ve had several requests to turn those two Twitter threads into a citeable blog post, so here we are.

I have two privacy concerns with the study. My major concern centers on Table 1, which lists the study participants, including 2 Native American freshmen, 3 Native American sophomores, 1 Pacific Islander freshman, and 1 Pacific Islander sophomore. The study also lists the age range of participants as 17 to 83, meaning the oldest participant is 83 years old. By including this specific information, the article basically identified several students even without giving us their names.

In research this is called “n=1”, meaning that you’ve divided up demographics so much that you identify single people. It’s definitely not something that should be done when publishing research results. The individuals in this example are even more identifiable as they come from minority student populations (with examples of both race and age minorities), so it’s bad on two fronts.

If I were part of the university where the study was conducted, just knowing “83-year-old student” or “Pacific Islander sophomore” might be enough for me to come up with specific names because I would be familiar with the student body. As an outsider, it’s still a rather trivial process to go from n=1 identifiers to names.

Let’s take the “Pacific Islander sophomore” and work through a thought experiment (I’m not actually going to find a name, just talk about the process). We’ll pull in an outside dataset to make this work, in this case IPEDS. IPEDS is a national database that collects statistics on every U.S. academic institution. One of the statistics IPEDS collects is completions by year in different majors broken down by racial demographics, a.k.a. the “Completions” table. So now I can look up the university, look up the year, and find my single Pacific Islander to discover their major. Then it becomes a matter of visiting the department webpage or Facebook or the graduation program to get a list of names corresponding to that major in that year. Finally, using context and other available data I can whittle the names down to a likely candidate. The person’s minority status makes them easier to identify here, especially if they have a non-White name or do not pass for White in departmental photos. This whole process may take 30 minutes or so and uses information that is freely available on the web.

Coming back to the study, while putting a name to a study participant does not tell me what that student did in the library, it’s still not acceptable for the article to identify them. And it’s not okay that these issues slipped past peer reviewers and editors. And, when I contacted the editor about correcting the issue, it’s not okay that nothing was done (the conclusion was basically “it’s bad but not bad enough to merit correction”).

So, seeing as this is a blog on better data practices, what should be done instead? Whenever you have small populations, think carefully before you report data about those people. There’s no hard rule of thumb for size but consider: warnings at under 20, red flags at less than 10, and full stop for under 5. There are two common options for dealing with small populations: aggregate small subgroups into one “Other” group to add up small numbers into a larger number (e.g. there are 33 Asian, Native American, and Pacific Islanders in this study); or obscure the small/outlier number values (e.g. “<5 Pacific Islanders” or “>65 years old”). Be aware that the first option can hide the existence of minority racial populations by erasing their representation in the data, so be thoughtful to balance representation with privacy.
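To make the two options concrete, here is a minimal Python sketch. The function names, the group labels, and the counts are all hypothetical and made up for illustration; only the thresholds (20, 10, and 5) come from the rough guidance above, and they are not a formal standard.

```python
# Rough thresholds from the text above; not a formal standard.
WARNING, RED_FLAG, FULL_STOP = 20, 10, 5


def risk_level(n: int) -> str:
    """Classify a group size using the rough thresholds above."""
    if n < FULL_STOP:
        return "full stop"
    if n < RED_FLAG:
        return "red flag"
    if n < WARNING:
        return "warning"
    return "ok"


def aggregate_small(counts: dict[str, int], threshold: int = RED_FLAG) -> dict[str, int]:
    """Option 1: fold every group smaller than `threshold` into one 'Other' group."""
    result: dict[str, int] = {}
    other = 0
    for group, n in counts.items():
        if n < threshold:
            other += n
        else:
            result[group] = n
    if other:
        result["Other"] = other
    return result


def suppress_small(counts: dict[str, int], threshold: int = FULL_STOP) -> dict[str, str]:
    """Option 2: report small counts only as '<threshold'."""
    return {
        group: f"<{threshold}" if n < threshold else str(n)
        for group, n in counts.items()
    }


# Hypothetical counts, invented for this example.
counts = {"Group A": 412, "Group B": 57, "Group C": 3, "Group D": 1}
print(aggregate_small(counts))
print(suppress_small(counts))
```

Two cautions on this sketch: the combined "Other" group can itself still be small (here it would be 4, which is still a full stop), and a single suppressed cell can be recovered by subtraction if the overall total is also published. Neither function checks for those cases.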

The second thing that needs to be done is that we all need to be better at identifying and calling out these problems when we see them, especially peer reviewers and editors. I know I get on my soapbox periodically about “anonymization” versus “de-identification”, but it’s because many people fundamentally don’t understand the difference. We need to learn that datasets about people are never anonymous and that we should always operate from the perspective that they can be re-identified.

Finally, I won’t deny that there are a lot of power dynamics in play for why I haven’t told this story previously. I didn’t want to identify the article, and thereby identify the students, as the students have no power in this situation and didn’t ask to be identified just because they used the library. I was also leery, as a somewhat new librarian, of calling out one of the field’s preeminent journals. I have now done both because it’s important for people to understand just how easy it is to re-identify people from scant published information. I do this not to rehash the past but because I want people to do better going forward. So go, do better, and never publish n=1 again.

Thank you Dorothea and Callan and everyone else who suggested that this be a blog post.

Posted in libraries, privacy, publishing

Visualizing COVID in Fiber

A side effect of being a data specialist-by-day and a crafter-by-night is that the two sometimes combine. This leads to things like the bad passwords dress and the women in science dress, but more recently, I’ve been dabbling in data visualization with fiber. This is my first finished visualization project.

[Image: Woman holding a long scarf vertically. Scarf is made up of tiny woven hexagons sewn together. Hexagons in the top are white, then progress through pinks to red. There are three large blotches of dark red within the large red section.]

This is a scarf representing the daily U.S. COVID fatalities in 2020. Each 2″ woven hexagon represents the number of reported deaths from COVID on a single day in the United States, starting on January 1, 2020 and ending December 31, 2020.

To interpret the visualization, you need to know how the data is laid out and what the colors mean. Data is laid out by week exactly as it appears on the calendar, with each week starting on Sunday and ending on Saturday. I represented the number of deaths on a logarithmic scale by color, where:

  • white = 0 deaths
  • light pink = 1-9 deaths
  • dark pink = 10-99 deaths
  • red = 100-999 deaths
  • dark red = 1000+ deaths
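For anyone who wants to reproduce the encoding, the color key and calendar layout above can be sketched in a few lines of Python. This is my own illustration, not the process used to plan the scarf: the function names are hypothetical, and treating each week as a row with Sunday as column 0 is an assumption about how to index the layout.

```python
from datetime import date


def color_for(deaths: int) -> str:
    """Map a daily death count to a yarn color using the order-of-magnitude bins."""
    if deaths < 0:
        raise ValueError("death counts cannot be negative")
    if deaths == 0:
        return "white"
    if deaths < 10:
        return "light pink"
    if deaths < 100:
        return "dark pink"
    if deaths < 1000:
        return "red"
    return "dark red"


def grid_position(day: date) -> tuple[int, int]:
    """Return (week, weekday) for a date, with weeks running Sunday (0) to Saturday (6).

    Week 0 is the calendar week containing January 1 of that date's year.
    """
    jan1 = date(day.year, 1, 1)
    # date.weekday() counts Monday as 0, so shift it to make Sunday 0.
    offset = (jan1.weekday() + 1) % 7
    return divmod((day - jan1).days + offset, 7)


# January 1, 2020 was a Wednesday, so it lands in week 0, column 3.
print(grid_position(date(2020, 1, 1)), color_for(0))
```

Because each bin spans one power of ten, a single color covers a wide range of values: 100 deaths and 999 deaths both come out red.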

The data was pulled from The COVID Tracking Project at The Atlantic and used under a Creative Commons CC BY 4.0 license.

For those interested in the crafting part of the project, I wove the hexagons on a 2-inch, fine-sett turtle pin loom using KnitPicks Palette yarn. Full crafting details are available on this Ravelry page.

[Image: Small hexagon pin loom, half woven on top and bottom with red yarn.]

I don’t think we, as a society, have entirely processed the half million deaths (and counting) from this pandemic. This visualization was one way for me to work through the massive loss. It was very surreal to weave a dark red hexagon and realize that it represented over 1,000 people on one day. And then acknowledge that there are over 150 dark red hexagons on the scarf.

For me, this visualization also represents imperfection and I embrace that. Color-coding on a log scale distorts the numbers but was a necessary trade-off to limit yarn colors. I also see artifacts in the data, especially in August and November, where deaths are reported in lower numbers on Sundays and Mondays. Perhaps most importantly, I recognize that these numbers only represent reported deaths, so they are not an accurate picture of the total loss from COVID. Nor do the numbers show the impacts we all feel from that loss.

[Image: Close up of woven hexagons, showing all five colors used in the data visualization. This is data from around March 2020.]

I don’t know if I’ll do a second half of this scarf representing 2021. But I will say that it’s my fervent wish that that visualization would be made up of mostly white hexagons. Until then, wear a mask, keep your distance, and get vaccinated when you are able to.

Posted in dataVisualization

New NIH Data Management and Sharing Policy

The thud you might have heard yesterday was NIH dropping a new Data Management and Sharing Policy. It won’t go into effect until 2023-01-25 but the policy has so many ramifications that I don’t plan to waste any time before I start preparing.

I’m going to do a short overview of initial thoughts here. I expect that I’ll be working through all of the nuances more in the weeks to come.

Here are the highlights for how this policy affects researchers:

  • All NIH grants will be required to have a 2-page maximum data management plan (DMP). NIH expects researchers to: be clear in the DMP about where they plan to share (“to be determined” is no longer acceptable), notify them if plans change, and actually follow the plan.
  • You will be sharing more data, as NIH not only wants the data that underlies publications but all data that verifies results.
  • You will be sharing data sooner. NIH prefers that you share as soon as possible, but at the latest sharing should occur with publication or at the end of the grant period, whichever comes first. That last part is a huge change.
  • You will share your data in a repository. Criteria for data repositories are provided in a supplement and I expect to see more in this area between now and 2023.
  • You can ask for money to support data management and sharing activities, including pre-paying for long-term hosting of open data.
  • If you conduct research on people, sharing expectations are changing. NIH really wants researchers to fine-tune the balance between sharing and privacy. Two mechanisms explicitly called out are outlining sharing practices during informed consent and controlled data sharing, even for de-identified data. This is another area where I want to see more development.
  • If you are doing research on indigenous populations, you must respect Tribal sovereignty. This is a great addition to the policy.
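The timing rule in the list above (share by publication or the end of the grant period, whichever comes first) boils down to taking the earlier of two dates. Here is a tiny Python sketch of my reading of that rule; the function name and the dates are hypothetical, and this is not official NIH guidance.

```python
from datetime import date
from typing import Optional


def latest_sharing_date(grant_end: date, publication: Optional[date] = None) -> date:
    """Return the latest date by which data should be shared.

    This is the publication date or the end of the grant period, whichever
    comes first. With no publication, the grant end date applies.
    """
    if publication is None:
        return grant_end
    return min(publication, grant_end)


# Hypothetical dates: the paper comes out before the grant ends,
# so the publication date is the sharing deadline.
print(latest_sharing_date(date(2026, 6, 30), date(2025, 3, 15)))
```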

I think this is a good policy, though it’s definitely overdue. I don’t love the lack of clarity around retention times and I’m not sure how I feel about review of DMPs shifting from peer reviewers to program officers. But these are minor quibbles in what I think is a pretty solid policy.

The biggest takeaway is that this policy represents a shift in expectations for data sharing. It has stronger requirements than the NSF data policy and will really move things forward. Some people are going to hate it and it’s going to be a big adjustment, but it’s a win for reproducibility and open data.

Posted in dataManagementPlans, fundingAgencies

Foundational Practices of Research Data Management

If you’re a regular reader of my blog, you’ll know that one of my goals is for all researchers to adopt the basic data management practices that make conducting research easier. I’ve written a whole book on data management, done videos, created checklists, written numerous blog posts, etc., but it will never be enough until researchers are regularly taught these skills. Until that point, I’ll keep sending the gospel of data out in the world in different formats, hoping to reach new audiences.

My latest iteration of educating about the principles of data management is in the form of a research article in RIO. I really like the article format because it’s just enough space to provide a broad overview of the basic data management practices. And if readers want to learn more, we’ve provided a handy list of citations!

The new article covers 10 practices of data management that my coauthors and I consider to be foundational:

  • Practice 1: Keep sufficient documentation
  • Practice 2: Organize files and name them consistently
  • Practice 3: Version the files
  • Practice 4: Create a security plan, when applicable
  • Practice 5: Define roles and responsibilities
  • Practice 6: Back up the data
  • Practice 7: Identify tool constraints
  • Practice 8: Close out the project
  • Practice 9: Put the data in a repository
  • Practice 10: Write these conventions down [in a data management plan]

This is by no means the complete scope of data management but rather a good introduction. Honestly, if you implement all ten practices into your research, you’re going to be doing very well with your data.

So if you or a peer are looking for a general introduction to research data management, check out my new article “Foundational Practices of Research Data Management.”

Citation: Briney KA, Coates H, Goben A (2020) Foundational Practices of Research Data Management. Research Ideas and Outcomes 6: e56508.

Posted in dataManagement