Data Ab Initio

The White Supremacy of Library Learning Analytics

by Kristin Briney Posted on 2021-06-09

I’ve removed this post because it is problematic. I want to thank my colleagues of color for pointing this out to me and educating me.

It is a privilege to not see white supremacy in learning analytics research.
I jumped into an ongoing conversation without recognizing the work of peers, mostly people of color, who are already working in this area. Just because ideas are new to me does not mean that they are new. (I’ll point you to the work of Yasmeen Shorish if you want to learn more.)
I posted something that needed further reflection because I got excited to put something on my blog. Essentially, I tried to get a cookie when no cookies were deserved.

Thank you again to those willing to take time to educate me. I will work to be more thoughtful in the future.

Posted in libraries, socialJustice

Thoughts on Data Management as Housekeeping

by Kristin Briney Posted on 2021-05-04

My colleague Carolyn Bishoff at the MDLS20 conference introduced me to the idea that data management is like housekeeping: it’s a task that you have to continually do in order to live and thrive in your environment. It’s not something that we always enjoy doing and it’s something where we can get away with doing the bare minimum in order to survive, but it’s still something that needs to be done.

Carolyn continued this metaphor in the scope of teaching people how to do data management (which is something I do as part of my job). She likened it to teaching someone to do laundry; just because you know how to wash and fold your clothes doesn’t mean that you actually get your laundry done. I think this is a good reminder for everyone, data management instructors and practitioners alike, about the continual nature of the work and that knowing doesn’t necessarily translate into doing.

I’m also reflecting on another talk by my colleague Hannah Gunderman at the RDAP21 conference who acknowledged how much anxiety exists around data management. When we talk about “data best practices” (which I’ve done frequently), that can create anxiety because we feel like we aren’t living up to that ideal data standard. Instead, Gunderman suggested using the term “recommended practices” and recognizing that the perceived ideal is impossible. We might desire to be the Martha Stewart of data management (I will admit to personally having this desire), but it’s not a realistic standard for everyday life.

A third reference I want to pull into this reflection is from the book Unf*ck Your Habitat by Rachel Hoffman. It’s a book about literal housekeeping but I think some of the lessons apply to data management. Namely, Hoffman recommends that, instead of doing deep cleaning sprees when your house gets super messy, you should regularly set aside small amounts of dedicated time (with the duration depending on your energy and ability) to try to improve your environment. This can be 5 minutes or 30 minutes, but when that time’s up you stop cleaning and take a short break. You won’t be able to clean everything during these short periods but you can actually make a positive difference in this short amount of time. This incremental method gives us the ability to make improvements while also relieving ourselves of the need for housekeeping perfection.

Finally, I’m thinking about housekeeping as care work, which is often invisible and gendered. If we use this metaphor, we need to recognize that housekeeping labor has mainly been the province of women in American society (and other Western countries) and historically undervalued. It’s unpaid labor and, even though it’s critical to a functioning society, it’s made invisible (see Abigail Goben’s Women’s Labor in COVID bibliography for all of the ways our reliance on unpaid care labor has broken us during this pandemic). I think data management is a form of care work, in that we are caring for our research results, yet the act of managing data is often rendered invisible until a data disaster happens. This perspective also makes me wonder if data management is a gendered act within the research enterprise.

While metaphors always have their limitations, I think there is value in thinking of data management as housekeeping. There’s no one right way to keep your house clean and there’s no one right way to keep your data organized, but there’s value in making continual small steps to make things better. Embrace the imperfection and do what you can to make your data a little more organized than before; these small differences really do help. And finally, as a collective we must value the work of data management, even when there’s societal pressure to render it invisible.

Posted in dataManagement

Citation Omitted: A Story of Re-identification

by Kristin Briney Posted on 2021-04-01

I published the article “Data Management Practices in Academic Library Learning Analytics: A Critical Review” in 2019. Eagle-eyed readers may have noticed that I omitted a couple of citations, instead listing them as “citation omitted in order to protect students’ identities.” This is because the two studies in question published student details so identifying that it would be possible to attach names to those individuals. Of the two studies, one was incidentally identifying and one was egregiously identifying. This blog post will talk about the egregious study, this study:

Murray, A., Ireland, A., & Hackathorn, J. (2016). The Value of Academic Libraries: Library Services as a Predictor of Student Retention. College & Research Libraries, 77(5), 631–642. https://doi.org/10.5860/crl.77.5.631

I had not publicly identified the study until January 6 of this year when, in my frustration about C&RL publishing yet another study with privacy concerns and the simultaneous unfolding of American democracy, I vented out this Twitter rant:

I'm angry at everything right now, including the new the latest problematic article from C&RL.

So I think it's time we talked about C&RL's history of publishing identifying information in articles.

(Yes, @LibSkrat I'm finally telling this story) https://t.co/JPCks3tohr
— Dr. Kristin Briney (@KristinBriney) January 6, 2021

Then, to properly explain my concerns about privacy and re-identification in the article, I followed up with this Twitter thread:

I want to add an addendum to this thread from the other day to show why publishing an n=1 is so bad. It's because I can likely identify and put a name to this student.

(I'm not going to do that here but I am going to show you how easy it is.) https://t.co/bxLrKtdPT9
— Dr. Kristin Briney (@KristinBriney) January 8, 2021

I’ve had several requests to turn those two Twitter threads into a citeable blog post, so here we are.

I have two privacy concerns with the study. My major concern centers on Table 1, which lists the study participants as including: 2 Native American freshmen, 3 Native American sophomores, 1 Pacific Islander freshman, and 1 Pacific Islander sophomore. The study also lists the age range of participants as 17 to 83, meaning the oldest participant is 83 years old. By including this specific information, the article basically identified several students even without giving us their names.

In research this is called “n=1”, meaning that you’ve divided up demographics so much that you identify single people. It’s definitely not something that should be done when publishing research results. The individuals in this example are even more identifiable as they come from minority student populations (with examples of both race and age minorities), so it’s bad on two fronts.

If I were part of the university where the study was conducted, just knowing “83-year-old student” or “Pacific Islander sophomore” may be enough for me to come up with specific names because I’m familiar with the student body. As an outsider, it’s still a rather trivial process to go from n=1 identifiers to names.

Let’s take the “Pacific Islander sophomore” and work through the thought experiment (I’m not actually going to find a name, just talk about the process). We’ll pull in an outside dataset to make this work, in this case IPEDS. IPEDS is a national database that collects statistics on every U.S. academic institution. One of the statistics IPEDS collects is completions by year in different majors broken down by racial demographics, a.k.a. the “Completions” table. So now I can look up the university, look up the year, and find my single Pacific Islander to discover their major. Then it becomes a matter of visiting the department webpage or Facebook or the graduation program to get a list of names corresponding to that major in that year. Finally, using context and other available data I can whittle the names down to a likely candidate. The person’s minority status makes them easier to identify here, especially if they have a non-White name or do not pass for White in departmental photos. This whole process may take 30 minutes or so and uses information that is freely available on the web.

Coming back to the study, while putting a name to a study participant does not tell me what that student did in the library, it’s still not acceptable for the article to identify them. And it’s not okay that these issues slipped past peer reviewers and editors. And, when I contacted the editor about correcting the issue, it’s not okay that nothing was done (the conclusion was basically “it’s bad but not bad enough to merit correction”).

So, seeing as this is a blog on better data practices, what should be done instead? Whenever you have small populations, think carefully before you report data about those people. There’s no hard rule of thumb for size but consider: warnings at under 20, red flags at less than 10, and full stop for under 5. There are two common options for dealing with small populations: aggregate small subgroups into one “Other” group to add up small numbers into a larger number (e.g. there are 33 Asian, Native American, and Pacific Islanders in this study); or obscure the small/outlier number values (e.g. “<5 Pacific Islanders” or “>65 years old”). Be aware that the first option can hide the existence of minority racial populations by erasing their representation in the data, so be thoughtful to balance representation with privacy.
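
To make those two options concrete, here is a minimal Python sketch. The thresholds come from the rough rule of thumb above; the function names and the example counts are made up for illustration and are not taken from any real study or tool.

```python
def flag_cell(count):
    """Classify a group size using the rough thresholds above."""
    if count < 5:
        return "stop"
    if count < 10:
        return "red flag"
    if count < 20:
        return "warning"
    return "ok"


def mask_small_counts(counts, threshold=5):
    """Option 2: replace any count below the threshold with a '<threshold' label."""
    return {
        group: (f"<{threshold}" if n < threshold else str(n))
        for group, n in counts.items()
    }


def aggregate_small_groups(counts, threshold=5, label="Other"):
    """Option 1: fold every group below the threshold into one combined group."""
    combined = {}
    other_total = 0
    for group, n in counts.items():
        if n < threshold:
            other_total += n
        else:
            combined[group] = n
    if other_total:
        combined[label] = other_total
    return combined


# Hypothetical counts, for illustration only.
example_counts = {"Group A": 412, "Group B": 18, "Group C": 3, "Group D": 1}

print({group: flag_cell(n) for group, n in example_counts.items()})
print(mask_small_counts(example_counts))
print(aggregate_small_groups(example_counts))
```

Note that in this made-up example the combined “Other” group only adds up to 4, which is still under the threshold, so aggregation alone would not be enough here. Always check the combined number too.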

The second thing that needs to be done is that we all need to be better at identifying and calling out these problems when we see them, especially peer reviewers and editors. I know I get on my soapbox periodically about “anonymization” versus “de-identification”, but it’s because many people fundamentally don’t understand the difference. We need to learn that datasets about people are never anonymous and that we should always operate from the perspective that they can be re-identified.

Finally, I won’t deny that there are a lot of power dynamics in play for why I haven’t told this story previously. I didn’t want to identify the article, and thereby identify the students, as the students have no power in this situation and didn’t ask to be identified just because they used the library. I was also leery, as a somewhat new librarian, of calling out one of the field’s preeminent journals. I have now done both because it’s important for people to understand just how easy it is to re-identify people from scant published information. I do this not to rehash the past but because I want people to do better going forward. So go, do better, and never publish n=1 again.

Thank you Dorothea and Callan and everyone else who suggested that this be a blog post.

Posted in libraries, privacy, publishing

Visualizing COVID in Fiber

by Kristin Briney Posted on 2021-03-18

A side effect of being a data specialist-by-day and a crafter-by-night is that the two sometimes combine. This leads to things like the bad passwords dress and the women in science dress, but more recently, I’ve been dabbling in data visualization with fiber. This is my first finished visualization project.

Woman holding a long scarf vertically. Scarf is made up of tiny woven hexagons sewn together. Hexagons in the top are white, then progress through pinks to red. There are three large blotches of dark red within the large red section.

This is a scarf representing the daily U.S. COVID fatalities in 2020. Each 2″ woven hexagon represents the number of reported deaths from COVID on a single day in the United States, starting on January 1, 2020 and ending December 31, 2020.

To interpret the visualization, you need to know how the data is laid out and what the colors mean. Data is laid out by week exactly as it appears on the calendar, with each week starting on Sunday and ending on Saturday. I represented the number of deaths on a logarithmic scale by color, where:

white = 0 deaths
light pink = 1-9 deaths
dark pink = 10-99 deaths
red = 100-999 deaths
dark red = 1000+ deaths
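
For anyone who wants to plan a similar project, the color bins and the calendar layout can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not the process I used for the scarf; the function names are hypothetical.

```python
from datetime import date


def color_for_deaths(deaths):
    """Map a daily death count to one of the five color bins (powers of ten)."""
    if deaths < 0:
        raise ValueError("death counts cannot be negative")
    if deaths == 0:
        return "white"
    if deaths < 10:
        return "light pink"
    if deaths < 100:
        return "dark pink"
    if deaths < 1000:
        return "red"
    return "dark red"


def grid_position(day, start=date(2020, 1, 1)):
    """Return (week_row, weekday_column), with Sunday as column 0."""
    # date.weekday() gives Monday=0..Sunday=6; shift so Sunday=0..Saturday=6.
    column = (day.weekday() + 1) % 7
    start_column = (start.weekday() + 1) % 7
    # Days elapsed plus the offset of the first day within its own week.
    row = ((day - start).days + start_column) // 7
    return row, column


print(color_for_deaths(250))
print(grid_position(date(2020, 1, 1)))
print(grid_position(date(2020, 12, 31)))
```

Because January 1, 2020 fell on a Wednesday, the first row of the grid starts partway through the week, just like a wall calendar.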

The data was pulled from The COVID Tracking Project at The Atlantic and used under a Creative Commons CC BY 4.0 license. https://covidtracking.com/

For those interested in the crafting part of the project, I wove the hexagons on a 2-inch, fine-sett turtle pin loom using KnitPicks Palette yarn. Full crafting details are available on this Ravelry page.

Small hexagon pin loom, half woven on top and bottom with red yarn.

I don’t think we, as a society, have entirely processed the half million deaths (and counting) from this pandemic. This visualization was one way for me to work through the massive loss. It was very surreal to weave a dark red hexagon and realize that it represented over 1,000 people on one day. And then acknowledge that there are over 150 dark red hexagons on the scarf.

For me, this visualization also represents imperfection and I embrace that. Color-coding on a log scale distorts the numbers but was a necessary trade-off to limit yarn colors. I also see artifacts in the data, especially in August and November, where deaths are reported in lower numbers on Sundays and Mondays. Perhaps most importantly, I recognize that these numbers only represent reported deaths, so are not an accurate picture of the total loss from COVID. Nor do the numbers show impacts we all feel from that loss.

Close up of woven hexagons, showing all five colors used in the data visualization. This is data from around March 2020.

I don’t know if I’ll do a second half of this scarf representing 2021. But I will say that it’s my fervent wish that that visualization would be made up of mostly white hexagons. Until then, wear a mask, keep your distance, and get vaccinated when you are able to.

Posted in dataVisualization

New NIH Data Management and Sharing Policy

by Kristin Briney Posted on 2020-10-30

The thud you might have heard yesterday was NIH dropping a new Data Management and Sharing Policy. It won’t go into effect until 2023-01-25 but the policy has so many ramifications that I don’t plan to waste any time in preparing.

I’m going to do a short overview of initial thoughts here. I expect that I’ll be working through all of the nuances more in the weeks to come.

Here are the highlights for how this policy affects researchers:

All NIH grants will be required to have a 2-page maximum data management plan (DMP). NIH expects researchers to: be clear in the DMP about where they plan to share (“to be determined” is no longer acceptable), notify them if plans change, and actually follow the plan.
You will be sharing more data, as NIH not only wants the data that underlies publications but all data that verifies results.
You will be sharing data sooner. NIH prefers if you share as soon as possible, but at the latest sharing should occur with publication or at the end of the grant period, whichever comes first. That last part is a huge change.
You will share your data in a repository. Criteria for data repositories are provided in a supplement and I expect to see more in this area between now and 2023.
You can ask for money to support data management and sharing activities, including pre-paying for long-term hosting of open data.
If you collect data on people, sharing expectations are changing. NIH really wants researchers to fine-tune the balance between sharing and privacy. Two mechanisms explicitly called out are outlining sharing practices during informed consent and controlled data sharing, even for de-identified data. This is another area where I want to see more development.
If you are doing research on indigenous populations, you must respect Tribal sovereignty. This is a great addition to the policy.

I think this is a good policy, though it’s definitely overdue. I don’t love the lack of clarity around retention times and I’m not sure how I feel about review of DMPs shifting from peer reviewers to program officers. But these are minor quibbles in what I think is a pretty solid policy.

The biggest takeaway is that this policy represents a shift in expectations for data sharing. It has stronger requirements than the NSF data policy and will really move things forward. Some people are going to hate it and it’s going to be a big adjustment, but it’s a win for reproducibility and open data.

Posted in dataManagementPlans, fundingAgencies

Bulk File Renaming

by Kristin Briney Posted on 2020-09-02

Today, we’re going to discuss what happens when you don’t end up liking your file naming conventions. (Every time I think I’ve covered file naming conventions enough, I find something new on this topic to talk about. They are my favorite data management trick, after all. Sorry not sorry.)

So anyway, what do you do when your file naming convention isn’t working well for you? You use a file renaming tool to apply a new naming convention! I use Bulk Rename Utility for Windows, but there are other good tools available.

A file renamer lets you add information to your file name, remove information, and move pieces of your file name around, among other things. Don’t like the date at the end of the file name? A renamer can move it to the beginning of the file name, no problem.

The biggest benefit of a file renamer is that you can easily rename a whole set of files at the same time instead of renaming files one-by-one. A file renamer will save you so much time and can mean the difference between being able to rename your files or not.

The one thing to note about a file renamer is that it works best when you start with consistent file names to convert. If your file names are an inconsistent mess, a file renamer is not going to help you at all. But even a little consistency can help you break your files into manageable chunks. A file renamer also demonstrates the benefit of separating information (metadata) in file names with dashes or underscores, as they help you process particular sections of your file name independently of the others.
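
If you would rather script this than use a dedicated tool, here is a rough Python sketch of the "move the date from the end to the beginning" case. It assumes a hypothetical convention of description_YYYY-MM-DD.ext (for example, assay_results_2020-09-02.csv); the function names are my own, and this is not how Bulk Rename Utility works internally.

```python
import re
from pathlib import Path

# Assumed convention: "<description>_<YYYY-MM-DD>", with the extension handled separately.
DATE_AT_END = re.compile(r"^(?P<rest>.+)_(?P<date>\d{4}-\d{2}-\d{2})$")


def new_name(filename):
    """Move a trailing date to the front of the name; return None if there is no match."""
    path = Path(filename)
    match = DATE_AT_END.match(path.stem)
    if match is None:
        return None
    return f"{match['date']}_{match['rest']}{path.suffix}"


def plan_renames(folder):
    """List (old_path, new_path) pairs without touching any files."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        renamed = new_name(path.name)
        if renamed is not None:
            plan.append((path, path.with_name(renamed)))
    return plan


def apply_renames(plan):
    """Rename files, skipping any whose target name already exists."""
    for old_path, new_path in plan:
        if new_path.exists():
            print(f"skipping {old_path.name}: {new_path.name} already exists")
            continue
        old_path.rename(new_path)


print(new_name("assay_results_2020-09-02.csv"))
```

Building the plan first and reviewing it before calling apply_renames gives you a chance to catch surprises. Files that don’t fit the pattern are simply left alone, which is the consistency caveat from above in action.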

A file renamer is the type of tool that I don’t need often, but it saves me so much time when I do. I hope that, now that you know these tools exist, they will help you too!

Posted in dataManagement, metadata