Rethinking TXT Files

I’ve been doing a lot of research into accessibility recently, specifically thinking about how to make research data files more accessible. There is a lot of existing content about the accessibility of common file types used in business (e.g. Word, PowerPoint, Excel, etc.), but only a little content specific to the accessibility of research data. Part of the issue is that a lot of our guidance around research data focuses on reusability and computability – guidance that sometimes conflicts with accessibility principles.

All of this has me thinking about the humble TXT file. Data management and sharing experts commonly recommend writing README.txt files to accompany shared data files (I myself have given this guidance many, many times). The TXT file type is recommended because it’s a simple file type that can be opened by many software programs, including the command line, making it so users don’t need special proprietary software to read these files. TXT files come up a lot for open documentation and often for data files themselves, especially when doing text analysis.

The problem with TXT files, however, is that they are not very accessible. A TXT file has no formatting at all, which means it has no formatting for accessibility. Features that make text documents more accessible include headings, hyperlinks, bullet points, etc. TXT files don’t support these, let alone allow for content like images and tables. Unless the TXT file is very short, it’s going to be challenging to make a TXT document that is maximally accessible for a disabled user to navigate and read (that’s not to say someone using a screen reader can’t read a TXT file; rather, it will be inefficient to navigate).

The TXT file’s role in documentation is even more concerning when considering recent research on data reuse by Koesten et al. This group found that highly reused datasets on GitHub had more words, more headers, and more links in their README documentation files than less reused GitHub datasets. This is a correlation, not causation, but it makes sense that longer documentation makes for easier data reuse. My concern is that these helpful extras – like headers, links, and tables – are not supported by TXT files.

So where does that leave us? Microsoft Word has a ton of accessibility features, to the point where it’s the recommended file format in the U.S. government’s text document accessibility tutorial. But Word is a proprietary format owned by Microsoft. It’s now a bit easier to open and edit such files due to Google Docs, but using a proprietary file type for important data and documentation still raises concerns for me around reusability and computability.

Other alternatives for text-based document types are PDF and LaTeX (which can be converted into PDF). However, PDFs are notoriously difficult to make accessible; you need knowledge of how to make PDFs accessible and you have to use the paid version of Adobe Acrobat to edit the accessibility settings. LaTeX has some support for accessibility, but LaTeX accessibility is still a developing area and, again, requires a lot of specialized knowledge.

I’m personally very interested in Markdown (MD or RMD) for filling this documentation accessibility/reusability gap. In fact, Koesten’s research (cited earlier) looked at datasets on GitHub, which uses Markdown as the default file format for README documentation files. Markdown is an open text format that supports formatting like headings, hyperlinks, bullet points, etc. Markdown does this by using special characters to signify where formatting should be applied to specific text. This does have a learning curve, but it’s not as challenging to learn as LaTeX. Markdown also requires tools to convert the marked up text into HTML, PDF, Word, etc., which means Markdown integration into other systems may be a limiting factor for the general population’s adoption of this file format.
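To make the "special characters" idea concrete, here is a minimal Python sketch that assembles a small README in Markdown. The function name `build_readme`, the section titles, and the example.org link are all made up for illustration; the only Markdown syntax used is the standard `#` for headings, `-` for bullet points, and `[label](url)` for hyperlinks.

```python
def build_readme(title, summary, files, links):
    """Return README text in Markdown (illustrative sketch, not a standard)."""
    lines = [f"# {title}", "", summary, ""]

    # "##" marks a second-level heading that screen readers can jump between.
    lines += ["## Files", ""]
    # "-" starts a bullet point; backticks mark a file name as literal text.
    lines += [f"- `{name}`: {description}" for name, description in files]

    lines += ["", "## Related links", ""]
    # "[label](url)" becomes a hyperlink once the Markdown is rendered.
    lines += [f"- [{label}]({url})" for label, url in links]

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(build_readme(
        "Lake survey data",
        "Temperature readings collected for a hypothetical project.",
        [("readings.csv", "raw temperature readings")],
        [("Project page", "https://example.org/project")],
    ))
```

The output is still plain text that opens in any editor, which is the appeal: the file stays as portable as a TXT file while carrying the structure that a converter or renderer can turn into real headings and links.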

I’m not sure I have a clear answer to the challenge posed in this post. The bigger issue is that we must start considering accessibility in our default guidance for data management and sharing. Considering accessibility will start to change that default guidance, hopefully for the better. As for text files and documentation, I think Markdown can fill an important gap for accessible and reusable text, but I also recognize that many researchers don’t have the knowledge and infrastructure to make this switch at the present time.

What are your thoughts about the humble TXT file?

Posted in accessibility, documentation

Using Persistent Identifiers as Documentation

I recently attended an RDAP webinar about data sharing for physical samples. While the requirement to share this type of data is not universal, it is increasingly popping up in public access policies. Economically and scientifically, it makes sense to share samples, such as core samples taken from under a sea bed, that can cost thousands or tens of thousands of dollars to acquire. One of the things that struck me during this webinar was that the presenters were working as part of a larger team to build infrastructure for consistently identifying such samples using persistent identifiers (PIDs).

There is a larger movement in the research support ecosystem to create PID systems and to assign research products and components their own unique IDs. In fact, an often overlooked part of the U.S. funding agencies’ push for public access (stemming from the Nelson memo) is that these agencies are required to adopt persistent identifiers. As a researcher, you are probably familiar with DOIs and ORCIDs, though PIDs extend beyond these two systems.

All of this has me thinking about how PIDs occupy an important niche in documentation for data sharing. PIDs are a form of documentation, because they link a unique identifier with a list of information (metadata) about a particular thing. When you share the identifier with someone, you are actually sharing a lot of information about that specific thing and helping to distinguish it from related items.

There are a lot of PIDs that are relevant to data. That said, not every data sharing system has all of these PIDs integrated. So what should you do about PIDs as a researcher? Definitely share PIDs when you are asked for them. And if there’s no form field for a specific PID, you can always add it to your README.txt file.

This post reviews the PIDs that I think are most relevant to data sharing. Identifiers are listed from the most established to the least. There’s a lot of active work going on in the last two-to-three areas, so keep an eye open for these types of PIDs!

Identifying shared digital data

Just like we use DOIs for articles, DOIs are also becoming the go-to for identifying datasets. DOIs are extra special because we can use them like URLs to actively find something on the internet, but they are a whole lot more stable than URLs, which can move over time.
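As a small illustration of using a DOI like a URL, the Python sketch below turns a bare DOI into a link through the doi.org resolver. The helper name `doi_to_url` and the DOI `10.1234/example` are hypothetical placeholders, not a real dataset.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a resolvable https://doi.org/ link from a DOI string (sketch)."""
    cleaned = doi.strip()
    # Accept the common "doi:" prefix that appears in many citations.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Percent-encode unusual characters but keep the "/" between prefix and suffix.
    return "https://doi.org/" + quote(cleaned, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI used purely as a placeholder.
    print(doi_to_url("doi:10.1234/example"))
```

Writing the full link in a README, rather than only the bare identifier, gives readers something they can follow directly.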

DOIs are not the only PID used to identify shared data. There are also ARK, Handle, PURL, and others. In the absence of any of these, you can also use an accession number in a database to help identify your data. What matters most is that there is a unique ID of some sort for your shared digital data.

Identifying people

ORCID is the preferred system for uniquely identifying researchers. Individual researchers can create profiles in ORCID that list their publications and grants. Because ORCID is so well integrated into other scholarly systems, publishers can push new publications onto a researcher’s ORCID profile and other systems can pull from ORCID to populate bibliographies. If you don’t have an ORCID as a researcher, you need to get one!
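If you record ORCID iDs in your own documentation, a quick check can catch typos. The Python sketch below assumes the check-digit scheme that ORCID documents for its iDs (ISO 7064 MOD 11-2), where the final character is calculated from the first 15 digits. The function names are my own, and the sample iD is the one ORCID uses for its fictional researcher, Josiah Carberry.

```python
import re


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD has 16 characters and a matching check character."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's fictional example iD
```

This only checks that an iD is well formed. It does not confirm that the iD belongs to a particular person, which still requires looking up the ORCID record.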

There are actually several other systems for identifying researchers, but they are typically limited to identifiers used in article databases such as Scopus, Web of Science, Google Scholar, PubMed, and arXiv. It can be useful to officially claim these IDs, if only to ensure that your publication list in that database is complete and correct.

Identifying institutions

Data sharing systems are actively working to integrate the ROR identifier into infrastructure. RORs help identify institutions, such as funding agencies and universities, and publishing systems seem to have coalesced around ROR as the PID of choice for this. Using a ROR makes it easier to do things like search for all data generated by a specific university (a question that I’m definitely interested in). ROR operates behind the scenes, so it’s less important to know your institution’s ROR and more important to select your institution from a default list, when available.

Identifying materials and equipment

Identifiers for research materials and equipment are an area of active development, with several projects underway. The biggest is currently RRID, which combines several existing ID systems (for antibodies, plasmids, instruments, etc.) under one umbrella. There are also curated disciplinary resources that do work in this area; a good example is the Alliance of Genome Resources (with its child resources such as FlyBase, WormBase, etc.). Larger infrastructure is still in development, but if you have the opportunity to use identifiers that are consistent with a discipline-specific resource, definitely do so!

Identifying shared physical samples

This brings me back to IDs for physical samples. Honestly, this system is still in development, so there is no clear winner for how to assign IDs to and locate physical samples. I’m personally going to be looking into work done by ESIP, specifically their guides on Publishing Open Earth Science Samples and Publishing Open Research Using Physical Samples.

Posted in documentation, openData

2025 Wrap Up

We’ve thankfully reached the end of 2025. It’s been a rollercoaster of a year, with lots of ups and downs both personally and professionally. On the professional side, I’ve had some really solid highlights, so it seems fitting to review them in a blog post.

Publications

I had a very good publication year. My third book came out in December:

I also published two articles: a bibliography of data management books for researchers (a list that I recently made out of date); and an article about developing the exercises in my new book.

And I almost forgot about the book chapter that came out this year (books take forever between the writing and publication):

Public Access

I spent a good part of 2025, and plan to spend a good part of 2026, supporting the new public access policies from US funding agencies. As a librarian, it’s been incredibly frustrating to be caught between funder requirements for public access and publisher open access policies that often conflict. I think that most people agree that the current scholarly publishing industry is too expensive and isn’t working, but it’s really messy to be working in this area as we try to transition to something better. For everyone’s sake, I hope 2026 is easier in this area.

ASL

I’m currently 80% done with a certificate in American Sign Language (ASL); I have to take one more class this spring, ASL 4, in order to finish. I think it’s a great idea for someone in public service to know how to communicate with Deaf patrons, which is why I’m taking advantage of my university’s tuition assistance to work on this certificate. I don’t think I’ll ever be fluent in ASL, but I’m definitely more comfortable communicating in this language and have enjoyed learning about Deaf culture.

Looking Forward

I’m ending 2025 with some good news: I just signed a contract to write my fourth book. I’ve already drafted a few chapters and will share more information once I draft more. I can tell you it’s about data management and sharing, which is probably not a surprise.

I hope you have a restful holiday season and a wonderful 2026.

Posted in admin

The Data Management Workbook

I am thrilled to share that my book, The Data Management Workbook, will be published next month, on December 2, 2025, by Pelagic Publishing.

The Data Management Workbook is a collection of 24 hands-on exercises to help you improve your data management. For example, if you learned about the concept of file naming conventions in my first book, Data Management for Researchers, an exercise in the Workbook actually walks you through the steps to create a customized file naming convention for your research. The goal is for the book to help you implement data management strategies that fit your research workflows.

The Workbook is the traditionally published and updated version of the educational resource, The Research Data Management Workbook, which I blogged about previously. Do note that the downloadable versions of the old version have gone away, though the online edition is still up. This published edition is not only updated and polished, but also contains six completely new exercises. I’m really pleased with the updated version of the Workbook and I enjoyed spending the extra time and effort to make the exercises that much better for everyone.

Do you want a copy of The Data Management Workbook? You can order the book from the publisher, Pelagic, or find it on Amazon. Or encourage your library to buy a copy, so multiple people can enjoy it.

I hope you enjoy this book and it helps you improve your data management!

Posted in admin, dataManagement

Data Management for Collaborations

I’ve written a lot about the fundamentals of data management, usually from the viewpoint of a single researcher trying to make their data a little easier to deal with. However, a lot of research is collaborative, so it’s worth taking a little time to detail the data management practices that benefit collaborative research.

I actually co-wrote a paper a few years ago about the data management processes for a large collaborative project that I was a member of, the Data Doubles project. While a big part of the article centered on the project’s living data management plans (DMPs), I want to get explicit about some of the data management strategies that were particularly helpful:

  1. Common storage
    • Collaborative research requires a common storage area where researchers can expect to find shared files. It’s important that everyone knows where this storage is and has access to it.
  2. File organization
    • Shared files are even more likely to be disorganized than data files from a single researcher. It’s critical for collaborators to work out a system for organizing files so that everyone knows where data is expected to be within the storage system. A file organization system will save everyone so much time when searching for a specific file, especially if you didn’t create it yourself.
  3. File naming
    • File naming is the last piece of the storage-file organization-file naming trifecta that will help files move seamlessly between collaborators. If everyone knows and uses a shared file naming convention, it becomes easy for anyone to identify files at a glance and know what data has already been collected. Have someone propose a naming scheme then refine it as a group to help get buy-in.
  4. Documentation
    • I’m a big fan of having an index/inventory for data files when doing collaborative research. Of course you’ll need other project documentation like standard research notes, but an index provides a couple extra benefits: it allows for an alternate method to discover and locate data files, and allows collaborators to track the process of data collection. A spreadsheet works great for an index.
  5. Permissions
    • It’s extra important in collaborative research to be clear what people can and cannot do with the project data. This might be as formal as a Data Use Agreement but could also just be a discussion about how project members should ask for permission before reusing data for other research projects.
  6. Living DMP
    • The living DMP basically collects all of the above into one document to help collaborators remember data management decisions. The document should be reviewed and updated occasionally and is especially useful for onboarding new project members.
  7. Data manager
    • If the project is large enough, it’s worthwhile to designate someone as the “data manager.” This person might: propose the file organization and naming systems; update the data index; draft the living DMP; remind everyone of data management strategies; and clean up disorganized data, as needed. Not every project needs a data manager, but it’s usually good to make data management someone’s responsibility or else it might be no one’s responsibility and never get done.
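The file naming and index strategies above can be sketched together in a few lines of Python. Everything specific here is an assumption for illustration: the naming convention (`YYYYMMDD_site_measurement_vNN.ext`), the field names, and the sample file names are invented, and a real project would swap in whatever convention its collaborators agree on.

```python
import csv
import io
import re

# Hypothetical shared convention: YYYYMMDD_site_measurement_vNN.ext
NAME_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<site>[A-Za-z0-9]+)_(?P<measurement>[A-Za-z0-9]+)"
    r"_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)"
)

INDEX_FIELDS = [
    "filename", "date", "site", "measurement", "version", "ext",
    "follows_convention",
]


def parse_name(filename):
    """Split a file name into its parts, or return None if it breaks the convention."""
    match = NAME_PATTERN.fullmatch(filename)
    return match.groupdict() if match else None


def build_index(filenames):
    """Return CSV text listing every file and whether it follows the convention."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=INDEX_FIELDS, lineterminator="\n")
    writer.writeheader()
    for name in sorted(filenames):
        parts = parse_name(name) or {}
        # Fields that could not be parsed are left blank in the index.
        writer.writerow({"filename": name, "follows_convention": bool(parts), **parts})
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented file names: one follows the convention, one does not.
    print(build_index(["20250114_lakeA_temperature_v01.csv", "final data.xlsx"]))
```

The CSV text can be saved and opened as a spreadsheet, and the `follows_convention` column gives a data manager a quick list of files that need renaming.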

I’m sure there are more data management strategies that are useful for collaborative projects, but I think these are some of the most basic. What have you found helpful when managing data in collaborative research?

Posted in dataManagement

Data Management Books for Researchers

I’m inordinately proud of my first book, Data Management for Researchers, which is officially 10 years old this month! A lot has changed in the last decade, to the point where I really do need to update this book (though that is going to have to wait due to other exciting book news).

While I will always be partial to my own book, my colleague Abigail Goben and I were curious about what other data management books are available for researchers right now. The answer to this question can be found in an annotated bibliography that we published earlier this year.

Our article describes 17 data management books published between 1986 and 2023 that fully or partially cover data management strategies for a research audience. We list prices, open access availability, target audiences, and topics covered. The article also provides a brief summary of all 17 books, to help you decide which book is right for your needs. Basically, it’s a high-level overview of all of the books available for researchers that cover data management strategies.

As an author of one of the books on the list, I’m happy to see such a variety of offerings in the area of data management, though I know that there are gaps yet to be filled. If my book doesn’t speak to your personal research needs (or you’re just curious what’s available), I encourage you to check out the bibliography to find the right book for you!

Posted in bookReview, dataManagement, publishing