Using Persistent Identifiers as Documentation

I recently attended an RDAP webinar about data sharing for physical samples. While the requirement to share this type of data is not universal, it is increasingly popping up in public access policies. Economically and scientifically, it makes sense to share samples, such as core samples taken from under a sea bed, that can cost thousands or tens of thousands of dollars to acquire. One of the things that struck me during this webinar was that the presenters were working as part of a larger team to build infrastructure for consistently identifying such samples using persistent identifiers (PIDs).

There is a larger movement in the research support ecosystem to create PID systems and to assign research products and components their own unique IDs. In fact, an often overlooked part of the U.S. funding agencies’ push for public access (stemming from the Nelson memo) is that these agencies are required to adopt persistent identifiers. As a researcher, you are probably familiar with DOIs and ORCIDs, though PIDs extend beyond these two systems.

All of this has me thinking about how PIDs occupy an important niche in documentation for data sharing. PIDs are a form of documentation, because they link a unique identifier with a list of information (metadata) about a particular thing. When you share the identifier with someone, you are actually sharing a lot of information about that specific thing and helping to distinguish it from related items.

There are a lot of PIDs that are relevant to data. That said, not every data sharing system has all of these PIDs integrated. So what should you do about PIDs as a researcher? Definitely share PIDs when you are asked for them. And if there’s no form field for a specific PID, you can always add it to your README.txt file.
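If it helps to picture what that looks like in a README.txt, here is a minimal sketch in Python that formats a plain-text block of identifiers. The function name, the section heading, and the labels are my own assumptions rather than any standard. The DOI and ROR values are made-up placeholders, and the ORCID is the sample iD that ORCID uses for its fictitious example researcher.

```python
def format_pid_section(pids: dict[str, str]) -> str:
    """Return a plain-text block listing one identifier per line.

    `pids` maps a human-readable label (e.g. "Dataset DOI") to the
    identifier itself. The layout is illustrative, not a standard.
    """
    lines = ["PERSISTENT IDENTIFIERS", "----------------------"]
    for label, value in pids.items():
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are placeholders; substitute your own identifiers.
    example = {
        "Dataset DOI": "10.1234/example-dataset",        # made-up DOI
        "Author ORCID": "0000-0002-1825-0097",           # ORCID's sample iD
        "Institution ROR": "https://ror.org/XXXXXXXXX",  # placeholder
    }
    print(format_pid_section(example))
```

You could paste the printed block straight into a README.txt, or simply type the same lines by hand. What matters is that the identifiers are written down somewhere that travels with the data.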

This post reviews the PIDs that I think are most relevant to data sharing. Identifiers are listed from the most established to the least. There’s a lot of active work going on in the last two-to-three areas, so keep an eye open for these types of PIDs!

Identifying shared digital data

Just as we use DOIs for articles, DOIs are also becoming the go-to for identifying datasets. DOIs are extra special because we can use them like URLs to actively find something on the internet, but they are a whole lot more stable than URLs, which can move over time.
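To make the "use them like URLs" point concrete: a DOI becomes a clickable link when you put it after the https://doi.org/ resolver address. The sketch below is a small, hedged example of that conversion. The function name and the list of prefixes it strips are my own choices, and the DOI in the usage example is made up for illustration.

```python
from urllib.parse import quote

# Common ways a DOI gets written down; stripped before rebuilding the link.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Turn a DOI written in any common form into a resolver URL."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # DOIs start with "10." and have a "/" between prefix and suffix.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slashes readable.
    return "https://doi.org/" + quote(cleaned, safe="/")


if __name__ == "__main__":
    # Made-up DOI, for illustration only.
    print(doi_to_url("doi:10.1234/example-dataset"))
```

Whichever form someone hands you (a bare DOI, a "doi:" string, or an older dx.doi.org link), the result is the same stable link.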

DOIs are not the only PID used to identify shared data. There are also ARK, Handle, PURL, and others. In the absence of any of these, you can also use an accession number in a database to help identify your data. What matters most is that there is a unique ID of some sort for your shared digital data.

Identifying people

ORCID is the preferred system for uniquely identifying researchers. Individual researchers can create profiles in ORCID that list their publications and grants. Because ORCID is so well integrated into other scholarly systems, publishers can push new publications onto a researcher’s ORCID profile and other systems can pull from ORCID to populate bibliographies. If you don’t have an ORCID as a researcher, you need to get one!

There are actually several other systems for identifying researchers, but they are typically limited to identifiers used in article databases such as Scopus, Web of Science, Google Scholar, PubMed, and arXiv. It can be useful to officially claim these IDs, if only to ensure that your publication list in that database is complete and correct.

Identifying institutions

Data sharing systems are actively working to integrate the ROR identifier into infrastructure. RORs help identify institutions, such as funding agencies and universities, and publishing systems seem to have coalesced around ROR as the PID of choice for this. Using a ROR makes it easier to do things like search for all data generated by a specific university (a question that I’m definitely interested in). ROR operates behind the scenes, so it’s less important to know your institution’s ROR and more important to select your institution from a default list, when available.

Identifying materials and equipment

Identifiers for research materials and equipment are an area of active development, with several projects under way. The biggest of these is currently RRID, which combines several existing ID systems (for antibodies, plasmids, instruments, etc.) under one umbrella. There are also curated disciplinary resources that do work in this area, a good example of which is the Alliance of Genome Resources (with its child resources such as FlyBase, WormBase, etc.). Larger infrastructure is still in development, but if you have the opportunity to use identifiers that are consistent with a discipline-specific resource, definitely do so!

Identifying shared physical samples

This brings me back to IDs for physical samples. Honestly, this system is still in development, so there is no clear winner for how to assign IDs to and locate physical samples. I’m personally going to be looking into work done by ESIP, specifically their guides on Publishing Open Earth Science Samples and Publishing Open Research Using Physical Samples.