DataWorks! Prize Symposium

I blogged in February about how my Library won a DataWorks! prize for our data reuse recipe: the file naming convention worksheet. This month FASEB, which sponsors the prize along with the NIH, announced the 2024 DataWorks! Prize Symposium, which will consist of talks by award winners about their winning recipes.

There are some amazing projects from the DataWorks! award winners, and I am looking forward to hearing about their work. The full agenda is here. I’ll be speaking about file naming conventions at 1:15pm EST / 10:15am PST and am lucky to share my session with my colleague Karen Yook from microPublication Biology.

I encourage you to register for the event and hope to see you there!

Posted in dataManagement, digitalFiles, openData

A Basic Strategy for Determining Where to Share Your Data

Deciding where to share your data can be difficult. The universal expectation for data sharing is swiftly becoming: put shared data in a data repository. The challenge for researchers is that the data repository landscape is still evolving. Given the number of data repositories available, and how many of them specialize in a particular type of data, it helps to know how to navigate that landscape.

As a librarian who helps people find data repositories, I have a basic strategy for picking a repository when sharing data. It goes as follows:

  1. Identify all of the data that needs to be shared.
  2. Is there a known disciplinary data repository, such as one used by everyone in your research field for a specific type of data? If so, deposit the relevant data in that repository; continue if there is more data to share, otherwise go to step 7.
  3. Is there a logical disciplinary data repository on this list of recommended repositories? If so, deposit the relevant data in that repository; continue if there is more data to share, otherwise go to step 7.
  4. Does your institution have a data repository? If so, deposit the remainder of your data in that repository and jump to step 7.
  5. Do you have a preferred generalist data repository? If so, deposit the remainder of your data in that repository and jump to step 7.
  6. Pick a generalist data repository, deposit the remainder of your data there, and continue to the next step.
  7. Record the permanent identifier, ideally a DOI, from each data deposit. If you didn’t receive a permanent identifier, go back and select a different repository for that data.
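The decision steps above can be sketched in code. This is a minimal sketch, assuming you have already identified the candidate repositories at each level; the function and argument names (and the placeholder repository names in the example) are hypothetical, and in practice each deposit happens through the repository’s own submission process.

```python
def choose_repository(known_disciplinary=None, recommended_disciplinary=None,
                      institutional=None, preferred_generalist=None):
    """Work from the most specific repository option to the most general
    (steps 2-6). Each argument is a repository you identified for your
    data, or None if that option doesn't exist for you."""
    for repo in (known_disciplinary, recommended_disciplinary,
                 institutional, preferred_generalist):
        if repo is not None:
            return repo
    return "a generalist repository of your choosing"  # step 6

def check_identifiers(deposits):
    """Step 7: check that every deposit came back with a permanent
    identifier (a DOI or accession number). Returns the repositories
    where a deposit needs to be redone elsewhere."""
    return [repo for repo, identifier in deposits.items() if not identifier]
```

For example, `choose_repository(known_disciplinary="GenBank")` returns the disciplinary repository, while calling it with no arguments falls all the way through to step 6.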

One important thing to know about this strategy is that it doesn’t assume that you will deposit all of your data in the same data repository. This often happens when you are required to deposit a specific type of data in one repository (e.g. genetic information in GenBank) but that repository doesn’t accept all of the data you need to share. To account for this, you should work your way through repositories from most specific to most general until all of the data has been deposited once.

The other notable thing about this strategy is tacked onto the end: you need a permanent identifier for each data deposit. Having a permanent identifier, such as a DOI, for shared data is a newish requirement but one that will soon be universal from funding agencies. For example, the recent NIH data management and sharing policy requires a permanent identifier for shared data; there isn’t a compliance mechanism for this yet, but expect to report these DOIs back to the NIH within a few years. For now, always make sure you get a permanent identifier such as a DOI or accession number when you deposit your data.

I can’t guarantee that this strategy is perfect but it’s a good place to start when you’re trying to figure out what to do with your data. Hopefully, the data repository landscape will get less confusing, but in the meantime, you have a way to navigate it!

Posted in openData

A Data Management Philosophy

I’ve been reviewing data management books recently and picked up a copy of “Ecological Data: Design, Management and Processing,” edited by William K. Michener and James W. Brunt. Despite this book being published in 2000, meaning it references outdated technology, this little book is a gem.

One section in the second chapter of the book, written by Brunt, really stuck with me: his data management philosophy. The philosophy relies on adherence to two principles:

1) Start small, keep it simple, and be flexible
2) Involve scientists in the data management process

The first principle is one I use extensively in teaching data management, even though I’ve never named it explicitly. For example, at the end of most teaching sessions, I remind my audience to start slow by incorporating one data practice at a time. After that practice is routine, researchers can add a second practice until that is routine, etc.

As for flexibility, this is what makes it a joy and a pain to teach data management. The answer to so many questions I get from researchers about how to manage data is “it depends”. It depends on their workflows and what works best for them. I can teach how to create a file naming convention, but the best file naming convention depends on the files and how one searches for them. And even then, sometimes people have to bulk rename files to make them easier to organize and find. Data management must be flexible because research is so heterogeneous.
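To make the bulk-renaming point concrete, here is a minimal Python sketch. The convention it applies (lowercase, underscores for spaces, no special characters) is just an illustration, since the best convention depends on your files; always do a dry run before renaming anything.

```python
import re
from pathlib import Path

def normalize_name(filename):
    """Normalize one filename to an example convention: lowercase,
    spaces become underscores, special characters dropped."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # file has no extension
        stem, ext = filename, ""
    stem = stem.lower().replace(" ", "_")
    stem = re.sub(r"[^a-z0-9_-]", "", stem)
    return stem + (("." + ext.lower()) if dot else "")

def bulk_rename(folder, dry_run=True):
    """Apply the convention to every file in a folder. With
    dry_run=True (the default), only report the planned renames."""
    planned = []
    for path in Path(folder).iterdir():
        if path.is_file():
            new_name = normalize_name(path.name)
            if new_name != path.name:
                planned.append((path.name, new_name))
                if not dry_run:
                    path.rename(path.with_name(new_name))
    return planned
```

For instance, `normalize_name("Sample 3 (Jan 5).CSV")` yields `sample_3_jan_5.csv` under this made-up convention.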

The second of Brunt’s principles really reinforces the need for data management to be context dependent. A small group of researchers is going to know best how to organize and manage their files. Similarly, scientific subfields have developed norms for metadata and data sharing, building systems that work best for those researchers.

Some of my favorite data management outcomes have come from consultations where I provide the structure, the researcher provides the context, and we jointly come up with something that works really well. Scientists don’t have to know all about data management, but data management really shines when scientists are involved in the data decisions.

I think Brunt’s data management philosophy lays a pretty good foundation. It aligns with the way I teach and consult on data management and will be a useful framework going forward.

If I had to come up with my own data management philosophy, I might borrow an adage from camping: leave the campsite better than you found it. For camping, this means to not only pack out whatever you packed in, but to also find ways to improve your surroundings so that the effect of humans being present is less notable. For data management, I interpret the adage to mean that you should keep making improvements, no matter how small. So, while it can be nice to have large structures to protect us from the elements, the small things (like keeping everything tidy and clean) really do have a big impact. It’s not a perfect metaphor but it still encapsulates the way I think about a lot of incremental data management strategies.

I know that these are not the only data management philosophies out there, but they do provide an interesting insight into some ways to engage with data management. Do you have a data management philosophy?

Posted in dataManagement

Just Published Commentary on Data Management and Research Misconduct

I really appreciate the blog Retraction Watch. I used this source heavily in writing my first book, Data Management for Researchers, and regularly cite stories from the blog in my teaching. It’s a fact of science that errors occur, and Retraction Watch makes those errors – both accidental and intentional – transparent.

The transparency brought about by Retraction Watch is part of a larger movement (see efforts such as the Center for Open Science and PubPeer) to stop scientific errors and research misconduct from occurring. It can be difficult to expose and fix such problems, but this is all part of the self-correction process that is fundamental to scientific research.

And here’s where data management comes in: good data management also prevents scientific errors and can curtail misconduct investigations. This is because managing data well results in a clear accounting of what was done to the data, in addition to well-organized and available data files. So when someone has a question about your research, it’s easy to put your hands on the relevant data and documentation to prove exactly what was done.

I had the honor of co-teaching a workshop about the relationship between data management and research misconduct at last year’s RDAP Summit with Heather Coates and Abigail Goben. And the ideas behind that workshop were recently published in the special RDAP issue of the Journal of eScience Librarianship as the commentary, “What if It Didn’t Happen: Data Management and Avoiding Research Misconduct”.

I’m not going to repeat the arguments of the commentary here in this blog post, but I will say that there are a lot of useful case studies in this area and there’s definitely potential for more work to be done on this topic. So I encourage you to jump over and read the commentary, and start thinking about the ways that data management can prevent research misconduct.

Citation: Coates, Heather, Abigail Goben, and Kristin Briney. 2023. “What if It Didn’t Happen: Data Management and Avoiding Research Misconduct.” Journal of eScience Librarianship 12(3): e746. https://doi.org/10.7191/jeslib.746.

Edited to add: the commentary was featured in the weekly round up on Retraction Watch on 2024-01-06!

Posted in dataManagement, researchMisconduct

Living Data Management Plans

It’s well past time we discuss living data management plans (living DMPs). Somehow, I’ve been running this blog for over 10 years and I don’t have a post specifically discussing this important document type. I obviously need to fix this right now.

You’re probably wondering what a living DMP is and how it differs from a more “traditional”, grant-related data management plan. Honestly, the two-page document you turn in for a grant application is important, but it’s often treated as a box you have to check to make sure your grant submission is complete. A living DMP, on the other hand, is an evolving document that actually helps you manage your data during a project.

A living DMP describes how you will organize, name, store, and handle your data during a research project. While this is helpful for single-researcher projects, it’s invaluable for research done by a group. The living DMP makes sure that everyone is in agreement about how and where data will be stored and used. When someone needs to know where to find a specific dataset collected by someone else on the project, the living DMP should be the map for finding the file.

What makes this DMP “living” is that it should be updated whenever data handling practices change. A living DMP should accurately reflect the current data practices in the research project and should be added to when new procedures are developed.

The idea of a “living DMP” has been around for a while (I’m not sure who first came up with the term but I would love to give them credit for it) and it’s a document type that I’ve used several times. I made a living DMP when generating files for my first book. More notably, I created three living DMPs for the Data Doubles project, one for each of the research phases of the project; we actually wrote up an article about the process of creating these DMPs and made the DMPs themselves publicly available.

So how do you create a living DMP and what should you put into it? To get started, see the Write a Living Data Management Plan (DMP) exercise in The Research Data Management Workbook. After that, add any data handling information you think is beneficial to record for later.

DMPs don’t have to just be boring documents for grant compliance. They can be helpful maps for decoding data practices when used as living documents. I hope I’ve convinced you to give this type of DMP a try in your next research project.

Posted in dataManagementPlans