Data or It Didn’t Happen

There’s a story in the news this week about the requested retraction of a study on changing people’s minds on same-sex marriage. I always find it interesting when retraction stories are picked up by major news outlets, especially when the article’s data (or lack thereof) is central to the reasons for the retraction.

The likely retraction (currently an expression of concern) in question concerns a study published in Science last year looking at the effect of canvassing on changing people’s minds. Study participants took pre- and post-canvassing online surveys to judge the effect of canvassing on changing opinions. While the canvassing data appears to be real, it looks like the study’s first author, Michael LaCour, made up data for the online surveys.

The fact of the faked data is remarkable enough, but what particularly interests me is how it was discovered. Two graduate students at UC-Berkeley, David Broockman and Joshua Kalla, were interested in extending the study but had trouble reproducing the original study’s high response rate. Upon contacting the agency that supposedly conducted the surveys, they were told that the agency did not actually run or have knowledge of the pre- and post-tests. Evidence of misconduct mounted when Broockman and Kalla were able to access the original data from another researcher who posted it in compliance with a journal’s open data policy. They found anomalies once they started digging into the data.

In my work, I talk a lot about the Reinhart and Rogoff debacle from two years ago where a researcher gaining access to the article’s data led to the fall of one of the central papers supporting economic austerity practices. We’re seeing a similar result here with the LaCour study. But in this case, problems arose due to a common practice in research: using someone else’s study as a starting point for your own study. Building from previous work is a central part of research and bad studies have problematic downstream effects. Unfortunately, such studies aren’t easy to spot without digging into the data, which often isn’t available.

There’s an expression that goes “pictures or it didn’t happen,” suggesting that an event didn’t actually take place unless there is photographic proof. I think this expression needs to be coopted for research to be “data or it didn’t happen.” Unless you can show me the data, how do I know that you actually did the research and did it correctly?

I’m not saying that all research is bad, just that we need regular access to data if we’re going to be able to do research well. We can’t build a house on a shaky foundation, and without examining the foundation (data) in more detail, how will we find the problems or build the house well?

So next time you publish an article, share the data that support that article. Because remember, data or it didn’t happen.

Posted in openData, researchMisconduct


In my last post, I discussed my philosophy on documentation: most researchers need to take better notes and augment them with a few key types of documentation, as needed. I’ve already blogged about a few of these special documentation types – data dictionaries, README.txt files, and e-lab notebooks – but one structure we haven’t examined here is templates. Let’s correct that now.

Templates are one of my favorite recommendations for adding structure to research notes and making sure that you’ve recorded all of the necessary information. They coopt the benefits of a formal metadata schema – making documentation easy to search across, helping you record all essential information, and providing consistency – without all of the fiddliness or rigidity. This makes templates much easier to adopt and use.

So how do templates work? Basically, you sit down at the start of data collection and make a list of all the information that you have to record each time you acquire a particular dataset. Then you use this as a checklist whenever you collect that type of data. That’s it.

You can use templates as a worksheet or just keep a printout by your computer or in the front of your research notebook, whatever works best for you. Basically, you just want to have the template around to remind you of what to record about your data.

Let’s look at an example. When I was a practicing chemist, there were a few critical pieces of information I needed to record every time I ran an experiment. This list included the following:

  • Date
  • Experiment
  • Scan number
  • Laser beam powers
  • Laser beam wavelengths
  • Sample concentration
  • Calibration factors, like timing and beam size

Using this list as a template, I would then record the necessary information every time I did an experiment. The result might look something like the following:

  • 2010-06-05
  • UV pump/visible probe transient absorption spectroscopy
  • Scan #3
  • 5 mW UV, visible beam is too weak to measure accurately
  • 266 nm UV, ~400-1000 nm visible
  • 5 mMol trans-stilbene in hexane
  • UV beam is 4 microns, visible beam is 3 microns

Basically, the list is a memory aid to make sure my notes include everything they should for any given experiment. And I could even use different templates for different types of experiments to be more thorough.
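If you keep your notes electronically, the same checklist idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names and the `missing_fields` helper are hypothetical, chosen to mirror the list above, and are not part of any standard or tool.

```python
# Hypothetical template: field names mirror the checklist in this post.
TEMPLATE_FIELDS = [
    "date",
    "experiment",
    "scan_number",
    "laser_beam_powers",
    "laser_beam_wavelengths",
    "sample_concentration",
    "calibration_factors",
]


def missing_fields(record, template=TEMPLATE_FIELDS):
    """Return the template fields that are absent or blank in a notes record."""
    missing = []
    for field in template:
        value = record.get(field)
        # Treat a missing key, None, or whitespace-only text as "not recorded".
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example notes where the calibration factors were forgotten.
notes = {
    "date": "2010-06-05",
    "experiment": "UV pump/visible probe transient absorption spectroscopy",
    "scan_number": "3",
    "laser_beam_powers": "5 mW UV, visible beam is too weak to measure accurately",
    "laser_beam_wavelengths": "266 nm UV, ~400-1000 nm visible",
    "sample_concentration": "5 mMol trans-stilbene in hexane",
}

print(missing_fields(notes))  # ['calibration_factors']
```

A paper checklist does the same job; the point is simply that the template tells you what is missing before you walk away from the experiment.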

The biggest benefit of using a template is remembering to record the necessary details, since leaving them out is an easy mistake to make in documentation. Templates can also help you sort through handwritten notes if you always put the same information in the same place on a notebook page. Basically, templates are a way to add consistency to often chaotic research notes.

I challenge you to try out a template or two and see if they help you record better notes. Because, as I’ve said before, research data without documentation are useless and, honestly, having insufficient documentation can be just as frustrating. So make your data better by using a template!

Posted in documentation

On Documentation

I just got back from my favorite conference on data, Research Data Access and Preservation (Storify highlights), and am processing all of the great things I learned about there. While some of these things will probably end up in future blog posts, I did want to share a bit on what I talked about during my own panel presentation which is relevant here.

The panel itself was entitled “Beyond Metadata” and I spoke about different methods for teaching documentation types other than metadata. I was particularly excited to be on this panel because I think that librarians’ love of metadata doesn’t always translate into what’s needed in the laboratory. So even though your funder may ask in a data management plan for the metadata schema you plan to use, most of the time that’s not the documentation type you really need.

My general philosophy on research documentation is as follows:

  • Most researchers don’t need formal metadata schemas, unless they have a big (time/size/collaborative) project to organize or are actively sharing their data.
  • Your first strategy for documentation should be to improve your research notes/lab notebook that you are likely already using.
  • That said, you can augment your notes strategically with documentation structures such as README.txt files, data dictionaries, and templates.

It’s actually this latter category of documentation types that you find me talking about a lot, as these are the ones that can really help but that many researchers do not know about.

There are plenty of good reasons to improve your documentation (including giving you the ability to reuse your own data, making sure you don’t lose important details, and being transparent for the sake of reproducibility), but we often don’t teach documentation to researchers beyond the basics. So here are a few resources I’ve created so you can learn to improve your documentation:

Looking over this list, I realize that there are a few gaps in the content of this blog when it comes to documentation practices. So look for future posts on templates and good note taking practices!

Research may yet get to the point where metadata is commonplace but we have many useful documentation structures to employ in the meantime. Research notes in particular have been used effectively for hundreds of years and will continue to be useful. In the end, you should use whatever documentation type works well for you and ensures that you record the best information you can about your data.

Posted in documentation

New Data Requirements and How To Meet Them

Around the time when I started this blog in 2013, the White House Office of Science and Technology Policy (OSTP) decreed that all major federal funders would soon have to require data management plans and data sharing from their grantees. It’s been almost two years since the OSTP memo came out, but we are finally starting to see the funders’ plans for enacting public access requirements.

The biggest recent announcement came from the NIH. NIH previously had a data sharing requirement for grants over $500,000 per year, but the new policy requires data management plans and data sharing from everyone. This matches the NSF policy on data, which will not change significantly under the new mandates.

In addition to NIH and NSF, other US funding agencies have new data policies. DOE, for example, now requires a data management plan with grant applications and data sharing from funded researchers. Similar requirements now exist for NASA, the CDC, and others. Basically, if you are getting research money from a US agency, you should now plan to write a data management plan and share your data.

So, given these new requirements, how do researchers meet them? In terms of data management plans, I’m pushing people at my university toward the DMPTool. The tool is regularly updated with new policy requirements/templates, contains helpful information for writing a plan, and has features that enable collaboration and review. It’s a great resource for anyone writing a data management plan for a US-based funding group.

The harder part is the data sharing portion of the new data requirements. This is because a significant number of researchers who were not required to share data before will now have to do so. Additionally, funders haven’t been very good about specifying where to share data. So we have a huge need to figure out where to put data and not a lot of recommendations on where that actually should be.

In terms of what I’m doing on my campus, I have three recommendations. First, look for where your funder, journal, or peers recommend you put data. This is likely the best place to put your data. Second, look for lists of data repositories by discipline. I particularly like this one from the new journal Scientific Data and the master repository list at re3data. Finally, you can always contact your local data librarian. I expect finding repositories for people’s data is going to be a big part of how my university is responding to these new requirements.

Overall, I’m very excited about these new requirements as I think that data management will really help researchers take care of their data and data sharing will promote transparency in research. Still, there is not a lot of infrastructure or support behind these new demands. This makes it difficult for both those who support research data and those who generate it.

The good news is that this is an evolving process and that, over time, systems and workflows will develop to make it easier to comply with these requirements. Things will get better. Until then, remember that you likely have assistance at your institutional library.

Posted in dataManagement, government, openData

Clarification and Correction on My Uniform Guidance Post

After talking more yesterday with my university’s compliance person about the new Uniform Guidance, I realize that I misinterpreted the “new” part of the guidance relating to data, A-81 section 200.430, in my last post. Having now read through the guidance several more times (don’t you just love long, dry government documents?), I want to correct my comments on this section.

For clarity, here is the section in question:

(i) Allowable activities. Charges to Federal awards may include reasonable amounts for activities contributing and directly related to work under an agreement, such as delivering special lectures about specific aspects of the ongoing activity, writing reports and articles, developing and maintaining protocols (human, animals, etc.), managing substances/chemicals, managing and securing project-specific data, coordinating research subjects, participating in appropriate seminars, consulting with colleagues and graduate students, and attending meetings and conferences. [emphasis mine]

While I originally interpreted this as meaning all data management expenses can be charged to a federal grant (if you’re at an institute of higher education), really it is only people’s time spent managing data that is allowable. This is part of a larger expansion of allowable personnel charges, such as for administrative staff, under the new Uniform Guidance. My fault for not reading more carefully that this section applies to only people’s time.

Do note that this does not supersede any individual funders’ stipulations that allow a wider variety of data management expenses (e.g., storage infrastructure, preservation in a repository, etc.) to be charged to a grant.

While I’m obviously disappointed that my original interpretation is not correct, it is still nice to see the cost of data management explicitly allowed to be paid for by a federal grant, because data management certainly requires people’s time to perform. That said, it also usually requires infrastructure and I’d like to see funders do more to cover the total cost of taking care of research data.

Posted in dataManagement, fundingAgencies, government

New Federal Grants Guidance and How It Affects Data

If I made a list of the things I cite the most in the course of my job as a data management specialist, at the top would be ISO 8601, the recent Vines et al. study on data loss over time, and OMB Circular A-110. I’ve already written about the first two on my blog and I want to finally consider Circular A-110 in this post.

Circular A-110 comes from the White House Office of Management and Budget (OMB) and is the document that defines research data and retention requirements for all research supported by US federal funding. It’s also no longer applicable to federally-sponsored research in the US.

Replacing A-110 and several other Circulars is the new Uniform Guidance, also known as OMB Circular A-81. This document was designed to standardize guidance for everyone receiving federal funding in the US (hence the name “Uniform Guidance”). For this reason, it echoes many of the requirements that were in place before but with a few exceptions. Most of these exceptions concern grants administration and are not relevant to this blog, but I am interested in what the new guidance says about data.

On the whole, the new Uniform Guidance looks a lot like the old A-110. For instance, it includes a verbatim copy of the definition of “research data” from A-110 (see A-81 section 200.315):

(3) Research data means the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This “recorded” material excludes physical objects (e.g., laboratory samples). Research data also do not include:

(i) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and

(ii) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

Section 200.315, like the A-110 section 36, also states that the Federal government has a right to access and reproduce data produced under a federal award and delineates how to respond to a Freedom of Information Act request for data.

(d) The Federal government has the right to:

(1) Obtain, reproduce, publish, or otherwise use the data produced under a Federal award; and

(2) Authorize others to receive, reproduce, publish, or otherwise use such data for Federal purposes.

(e) Freedom of Information Act (FOIA).

(1) In addition, in response to a Freedom of Information Act (FOIA) request for research data relating to published research findings produced under a Federal award that were used by the Federal government in developing an agency action that has the force and effect of law, the Federal awarding agency must request, and the non-Federal entity must provide, within a reasonable time, the research data so that they can be made available to the public through the procedures established under the FOIA. If the Federal awarding agency obtains the research data solely in response to a FOIA request, the Federal awarding agency may charge the requester a reasonable fee equaling the full incremental cost of obtaining the research data. This fee should reflect costs incurred by the Federal agency and the non-Federal entity. This fee is in addition to any fees the Federal awarding agency may assess under the FOIA (5 U.S.C. 552(a)(4)(A)).

A-81 also still requires a 3-year retention period for all research records (see A-81 section 200.333), though the exceptions differ slightly from those in A-110:

Financial records, supporting documents, statistical records, and all other non-Federal entity records pertinent to a Federal award must be retained for a period of three years from the date of submission of the final expenditure report or, for Federal awards that are renewed quarterly or annually, from the date of the submission of the quarterly or annual financial report, respectively, as reported to the Federal awarding agency or pass-through entity in the case of a subrecipient. Federal awarding agencies and pass-through entities must not impose any other record retention requirements upon non-Federal entities. The only exceptions are the following:

(a) If any litigation, claim, or audit is started before the expiration of the 3-year period, the records must be retained until all litigation, claims, or audit findings involving the records have been resolved and final action taken.

(b) When the non-Federal entity is notified in writing by the Federal awarding agency, cognizant agency for audit, oversight agency for audit, cognizant agency for indirect costs, or pass-through entity to extend the retention period…

On the whole, these requirements are the same (and often verbatim copies of) requirements from OMB A-110.
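As a rough illustration of the basic 3-year retention rule quoted above, here is a small Python sketch that computes the earliest date the retention period would end. It is only a sketch under my own assumptions: the `retention_end` function is hypothetical, it ignores the exceptions for litigation, audits, and written extensions, and the handling of a February 29 submission date is a choice of mine, not something the guidance specifies.

```python
from datetime import date

# Assumption: the baseline period only; exceptions (a) and (b) are not modeled.
RETENTION_YEARS = 3


def retention_end(report_submitted, years=RETENTION_YEARS):
    """Return the date `years` after the final expenditure report was submitted."""
    try:
        return report_submitted.replace(year=report_submitted.year + years)
    except ValueError:
        # A Feb 29 submission has no matching day in a non-leap target year;
        # falling back to Feb 28 is an illustrative choice, not official guidance.
        return report_submitted.replace(year=report_submitted.year + years, day=28)


# Dates are written in ISO 8601 form (YYYY-MM-DD).
print(retention_end(date(2015, 2, 18)).isoformat())  # 2018-02-18
```

Your funder or institution may require longer retention, so treat the result as a minimum and check with your compliance office.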

There is, however, one section of the new Uniform Guidance concerning data that does not appear in Circular A-110. This is A-81 section 200.430, which states that grants to institutions of higher education may include the following items in their budgets:

(i) Allowable activities. Charges to Federal awards may include reasonable amounts for activities contributing and directly related to work under an agreement, such as delivering special lectures about specific aspects of the ongoing activity, writing reports and articles, developing and maintaining protocols (human, animals, etc.), managing substances/chemicals, managing and securing project-specific data, coordinating research subjects, participating in appropriate seminars, consulting with colleagues and graduate students, and attending meetings and conferences. [emphasis mine]

This means that you are allowed to charge people’s time spent managing data to your grant [ADDED 2015-02-18: this replaces my original wording, “data management expenses”; see follow up post on this]. Currently, many US funding agencies requiring data management plans already allow data management-related expenses to be added to the grant budget, but this appears to be an entirely new stipulation at the federal level. Personally, I’m very happy to see this allowance in the new Uniform Guidance because researchers often need funds to manage data properly.

Overall, there’s very little change to the research data landscape under the new Uniform Guidance with the exception that all university researchers can now charge data management expenses to their grants. This is definitely something I plan to promote more to the researchers on my campus!

Posted in fundingAgencies, government