Data Ab Initio

What Do You Mean By ‘Data Services’?

by Kristin Briney Posted on 2013-06-26

It’s been a very big month for me: I graduated from the University of Wisconsin-Madison in May and promptly got a job as Data Services Librarian at the University of Wisconsin-Milwaukee. The combination represents big life changes and a lot of new information to process. The result has been a lot of thinking about data but not being able to organize these thoughts until recently. Now that I have a better understanding of my new job, I think it will be good for me to put some of my recent thoughts into a (hopefully) coherent post.

One thing I should say is that I’m coming into an environment where there is a lot of interest in data management but no one person who is centrally responsible for data management on campus. This is an entirely new position and I am the first person on campus whose whole job it is to address data management issues. I’ll therefore play a large role in shaping the so-called ‘data services’ from which my job title derives.

So what are ‘data services’ exactly? Well, they can be a lot of things. A recent white paper (pdf) from the Association of College & Research Libraries (ACRL) surveyed libraries on the types of services they are offering around research data. The services covered in this report include:

Consulting on data management plans
Consulting on data and metadata standards
Outreach with other data service providers on campus
Providing reference support in finding and citing data sets
Creating guides for finding data
Directly participating in research projects
Discussing data services with others on campus
Training librarians and others on campus
Providing repository services

A surprisingly large percentage of the surveyed libraries already provide some or all of these data services or plan to do so in the near future. And it isn’t just the large doctoral institutions that are doing this, though they are more likely to offer data services than other types of universities/colleges. It’s quite possible that your institution offers something similar, though be aware that it may not be through the library.

That’s the thing about data services—it’s not just a library issue. Certainly, there are particular data services that the library is in a unique position to provide (such as assistance with finding and citing data sets), but dealing with research data involves other stakeholders, such as IT and the campus divisions that support research and the faculty whose data we’re supporting. For this reason, I’ve spent a lot of time at my new job meeting with a wide range of people from across campus. I’m not sure what the campus-wide efforts around research data will be, but I can say what I’m looking to do initially in my role as Data Services Librarian.

First, I’m focused on grant compliance. The requirement for NSF proposals to include data management plans means that this is a clear nucleus for discussion of data management on campus. Additionally, the White House OSTP memo’s promise that data plans will become standard for all federal grants means that the need for data management plan assistance will only grow in the coming years.

The other area I’m focusing on is training in data management and writing data management plans. If I only do one thing in this position, it will be to give researchers the tools they need to manage data well. These sessions will be aimed at faculty, students, and staff, though I must say that I have a soft place in my heart for working with grad students in this area.

I’m still working out the details of these two services and the best ways to advertise them on campus, but those are my current thoughts. I wholly expect my ideas to evolve over time, just as I expect data services to evolve over time, because data services aren’t a static, one-size-fits-all kind of thing. Such services must meet the needs of the individual university and, especially because it’s data, those needs are likely to be continually changing.

So those are my current thoughts on my new position and how I’m approaching data services at UW-Milwaukee. I hope these thoughts were coherent enough and you have a better sense of the types of things I will be working on. I’ll be sure to share any new and interesting things from the job as they arise!

Posted in dataManagement

Reinhart and Rogoff

by Kristin Briney Posted on 2013-05-22

I can’t tell you how happy I am to be back to this blog, talking about data. I’ve actually spent a lot of the last month writing about data issues, but for the last class of my Master’s degree in library and information studies rather than for this blog. On that front, I’m happy to report that I graduated this past weekend!

My last assignment for my degree involved writing on data sharing. While all of my thoughts on the topic are too numerous to write about in a single blog post, there is one particular thread of the assignment worth elaborating upon here: the recent Reinhart and Rogoff news.

If you missed it, Reinhart and Rogoff are two Harvard economics professors who published a study (pdf) examining economic growth for countries with high debt-to-GDP ratios. Their findings have been used as evidence for austerity measures in both America and Europe. Unfortunately, their conclusions are wrong because their analysis is flawed.

The errors were discovered by UMass-Amherst grad student Thomas Herndon, who read the paper and tried to reproduce the analysis. Failing to do so, he contacted the authors and was given access to the spreadsheet containing their data and analysis. Upon examining the spreadsheet, Herndon found data points erroneously discarded and coding errors. When the errors were fixed, the conclusions of the original paper were not supported.

This story is important for a few reasons. First, the article has had a significant and most likely negative impact on the American and European economies. Second, it was only through the sharing of the original data and analysis that the errors were conclusively discovered and proven. Third, had the original authors not chosen to share their data (which is still not a common practice), the errors and resulting economic policies could have persisted for years.

I find this story to be one of the best examples of the power of data sharing, between the paper’s significant impact and the fact that a careful reading of the article was not enough to conclusively prove mistakes. Stanford statistics professor David Donoho once likened (pdf) journal articles to the advertising of scholarship, whereas the data and analysis are the actual scholarship. That perfectly encapsulates the issues here.

Science values reproducible work, but reproducibility often can’t be proven from articles alone. Thankfully, checking for reproducibility becomes easier if data sharing is a part of the standard research process. Scientists can go directly to the data and analysis if they have questions about the work.

The ultimate goal is to have an accurate scientific record, preventing more studies like Reinhart and Rogoff’s from causing harm. And as evidenced by the Reinhart and Rogoff story, data sharing can play an important role in reaching this goal.

Resources:

Reinhart, Rogoff… and Herndon: The student who caught out the profs

Influential Reinhart-Rogoff economics paper suffers spreadsheet error

What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

Reinhart, Rogoff Backing Furiously Away From Austerity Movement

Posted in dataAnalysis

Describing Your Data

by Kristin Briney Posted on 2013-04-16

Have you ever had that experience where you’re analyzing a dataset and realize you should have written some extra bit of information down when you took the data? Sometimes this is just a minor inconvenience, but sometimes this makes your data unusable. The experience of missing information supports the fact that the context of our data is just as important as the data themselves. Data aren’t usable on their own.

One of the best things you can do to manage your data better is simply to describe it better. Recording information such as acquisition conditions, acquisition date, a brief description of the data, and file name makes it easier for you to find and use that data later.

(In the library realm, we call any information about a dataset that isn’t the dataset itself ‘metadata’. Metadata literally means data about data. It’s a weird word, but the concept is not wholly unfamiliar–most researchers already record metadata in their research/laboratory notebooks. It’s whatever you write down that isn’t your actual dataset.)

Contextual information about data runs the spectrum from the informal scribblings in a laboratory notebook to the very standardized schemas written in XML that accompany digital data files. While both informal and formal have their places in research, I’m an advocate for some amount of standardization.

Standardizing recorded information about your data helps you in several ways. First, it reminds you to record all of the necessary information about your data. Second, it helps you find datasets because it’s easier to search through organized information. Third, standardization helps your colleagues understand your data, which is useful during collaboration and should you leave a laboratory. Finally, standardization can be personalized and doesn’t have to be rigid. Standardization should easily fit into your workflow and should be adaptable enough to respond to any changes in your research.

To standardize the information you record about your data, you need to reflect on what is important about your data. This is likely to be different for different types of data. Once you’ve done some brainstorming, write down a list of the things that you should record each time you acquire that type of data. You can then type this up into a table, make a bunch of copies, and use them in your notebook or post a cheat sheet on your lab bench near where you take notes. Don’t be afraid to adapt this list as your needs change.
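If you keep your notes electronically, the same kind of checklist can be written down in a few lines of code. Here is a minimal sketch in Python, assuming a made-up checklist built from the fields mentioned earlier in this post (file name, acquisition date, acquisition conditions, description). The field names, the `missing_fields` helper, and the sample entry are all hypothetical, so swap in whatever matters for your own data.

```python
# Hypothetical checklist of things to record for one type of data.
# Adapt this list as your needs change.
REQUIRED_FIELDS = [
    "file_name",
    "acquisition_date",
    "acquisition_conditions",
    "description",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the checklist fields that are absent or blank in a record."""
    return [
        field
        for field in required
        if not str(record.get(field, "")).strip()
    ]


# A made-up notebook entry that forgot the acquisition conditions.
entry = {
    "file_name": "scan_001.csv",
    "acquisition_date": "2013-04-15",
    "description": "Control sample, first run",
}

print(missing_fields(entry))  # ['acquisition_conditions']
```

The point is not the code itself but the habit: a fixed list that gets checked every time you take data makes it much harder to forget something.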

Another standardization option is using a formal list, or schema, from your field. There are a ton of schemas out there, so I recommend consulting with your colleagues or your local reference librarian on what people use in your field. The nice thing about using such a formal list is that it identifies information that your community finds useful. It’s likely that you’ll find this information useful as well.

As an example of how to use a standard list, let’s look at the generic schema called Dublin Core. I really like Dublin Core because it lays out the most basic information one should record about an object. Here are Dublin Core’s categories:

contributor
coverage
creator
date
description
format
identifier
language
publisher
relation
rights
source
subject
title
type

This selection of categories works for images, physical samples, spreadsheet data, text, and whatever else you need to describe. Some of these categories may be less useful depending on the project, but it’s still a nice starting point.

So let’s take this list and apply it to a fictitious microscope image:

contributor – Jane Collaborator
creator – Kristin Briney
date – 2013 Apr 15
description – A microscopy image of cancerous breast tissues under 20x zoom. This image is my control, so it has only the standard staining described on 2013 Feb 2 in my notebook.
format – JPEG
identifier – IMG00057.jpg
relation – Same sample as images IMG00056.jpg and IMG00055.jpg
subject – Breast cancer
title – Cancerous breast tissue control

Even without using all of the Dublin Core categories, this gives you a pretty good sense of what is in my fictitious dataset. (If this were my real dataset, I would probably expand on the acquisition conditions using a subject-specific schema like OME-XML or DICOM.) Good data descriptions should stand on their own, meaning you shouldn’t have to look at the data to know what they are. The goal is to record all of the necessary information so that someone, myself included, can find and understand the data later. A formal list makes complete description easier and is not difficult to implement in the laboratory.
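For digital files, one way to keep a description like this with the data is to store it in a small text file next to the image. The sketch below is only an illustration, not an official Dublin Core tool: it assumes Python and JSON as the storage format, the `dublin_core_json` helper is a hypothetical name, and I have written the date in ISO form (2013-04-15) as my own choice. It checks that every key is one of the fifteen Dublin Core categories and then serializes the record.

```python
import json

# The fifteen Dublin Core categories.
DUBLIN_CORE = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_json(record):
    """Serialize a record to JSON, rejecting keys that are not Dublin Core categories."""
    unknown = sorted(set(record) - DUBLIN_CORE)
    if unknown:
        raise ValueError(f"Not Dublin Core categories: {unknown}")
    return json.dumps(record, indent=2, sort_keys=True)


# The fictitious microscope image, with a shortened description.
image_record = {
    "contributor": "Jane Collaborator",
    "creator": "Kristin Briney",
    "date": "2013-04-15",  # ISO form is an assumption made for this sketch
    "description": "A microscopy image of cancerous breast tissues under 20x zoom (control).",
    "format": "JPEG",
    "identifier": "IMG00057.jpg",
    "relation": "Same sample as images IMG00056.jpg and IMG00055.jpg",
    "subject": "Breast cancer",
    "title": "Cancerous breast tissue control",
}

print(dublin_core_json(image_record))
```

Saving that output as something like IMG00057.json (a hypothetical naming convention) keeps the description and the image together, even if the file gets separated from your notebook.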

Nothing is more frustrating than trying to reconstruct older datasets from partial notes. This happened to me in grad school and it’s now my goal for no one else to experience this frustration. The way to prevent it is by properly describing your datasets when you collect them. Standardization makes this easier, but any good data description will enable you to better manage and use your data.

Posted in dataManagement, documentation, metadata

The Hidden Costs of Cloud Storage

by Kristin Briney Posted on 2013-04-09

Cloud storage is an increasingly popular way to store research data. Being able to upload and access files from any location is useful and makes transfer between computers much easier. But for all of the upsides of cloud storage, there are also a few downsides.

Data Ownership

While most of us don’t usually read terms of service agreements, it’s worth doing a little digging when it comes to your cloud storage provider. For example, Google Drive’s terms of service include this little tidbit:

When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works… communicate, publish, publicly perform, publicly display and distribute such content.

You retain intellectual property rights over the content you put on Drive, but Google can still do a lot of things with your content. This should make you a little worried about any research data you put on Drive.

There are some ways around this problem. One example comes from UW-Madison, which has negotiated Google Drive terms of service for faculty and students where Google has no ownership or use permissions. The other option is just to pick a cloud storage provider that won’t use your data, but even that isn’t always perfect. Dropbox, for example, doesn’t take quite the same liberties Google does with your data, but it does spell out in its terms of service how it can use your personal information (name, address, log-in information, etc.) or provide your files to law enforcement.

My best advice? Read the terms of service before choosing a cloud storage provider for your data.

Security

The other natural concern when giving your data to a third party is security. This is especially important when putting sensitive information or student information (covered under FERPA) in the cloud. You need to take a lot of extra precautions in the cloud if your data is sensitive.

One secure cloud storage option I’ve run across is SpiderOak. Unlike other cloud storage options, SpiderOak cannot actually read any of your data because it gets encrypted before it even arrives at the SpiderOak servers. And in this Ars Technica review, SpiderOak compares favorably with other popular cloud services like Dropbox and SugarSync.

So unless you find a service like SpiderOak that guarantees security, the cloud is not the best place for your sensitive data.

The Limitations of the Cloud

Cloud storage can be a blessing in the laboratory, but putting your data in the cloud does not automatically mean that your data is well backed up or well managed. This is because your data is outside of your control when you give it to another entity. If your cloud storage provider folds or suddenly changes its terms of service (as seen in the recent Instagram debacle), you could suddenly be in a tight spot. For safety’s sake, it’s better to have other back-ups besides your cloud drive.
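As a rough illustration of keeping a second back-up, here is a minimal Python sketch that copies a file to another location and checks that the copy matches the original by comparing SHA-256 checksums. The checksum step is my own suggestion for this sketch rather than something any particular cloud provider requires, and the function names, file names, and folders are all hypothetical. The demo at the end uses a temporary folder so it runs anywhere; in practice the back-up folder would be something like an external drive.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dir):
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / Path(source).name
    shutil.copy2(source, copy_path)
    if sha256_of(source) != sha256_of(copy_path):
        raise IOError(f"Backup of {source} does not match the original")
    return copy_path


# Demo with a made-up data file in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "results.csv"
    data_file.write_text("sample,value\nA,1\n")
    copy = back_up(data_file, Path(tmp) / "second_copy")
    print(copy.name)  # results.csv
```

A script like this is no substitute for a real back-up plan, but it shows the basic idea: keep a copy you control, and check that the copy is actually good.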

I’m in no way saying that you should not use cloud storage for research. Instead, you should be smart about choosing a service provider and know that service’s limitations. With a little bit of forethought, cloud storage can be a valuable asset in the laboratory instead of a potential security hole.

Posted in dataStorage, digitalFiles

The Proper Pen

by Kristin Briney Posted on 2013-03-28

Have you ever wondered what the scientifically optimal writing utensil is to use in your lab notebook? No? Well, this post contains the answer to a record-keeping question you never thought to ask.

The answer comes from one of my favorite books on managing laboratory records, Writing the Laboratory Notebook by Howard Kanare. It was published in the 1980’s (making the section on electronic record keeping highly entertaining) by the ACS and thoroughly covers the how’s and why’s of keeping a proper notebook.

This book is so thorough, in fact, that it spends 6 pages (p. 11-16) on the proper type of paper and ink to use. Kanare even conducts experiments with 15 different types of pens to determine the most colorfast and solvent-fast inks. I found his experiments so interesting that I thought it worth sharing the highlights with you.

Just say no to pencils

First, I should say that pencils are right out. They’re erasable, they smudge, and they don’t copy well when you’re backing up your notebook. If you want to be sure that data hasn’t been changed or lost to illegibility, it’s better to stick with a pen.

Ink color

The choice of ink color comes down to lightfastness, since modern inks no longer contain the harsh acids that eat through paper over time–a historic problem. Kanare tested ink under both fluorescent light and sunlight and found that red inks fade most easily, blue ink fades some (the amount of fading depends on the pen type), and black inks fade the least.

Pen type

Felt-tip pens have a few things going against them from the start. Their inks are water based, making the ink more likely to bleed and less permanent. On the positive side, these porous-tip pens held up to Kanare’s solvent tests (using water, hexane, HCl, acetone, and methanol) about as well as the ballpoint pens.

The other main option, a ballpoint pen, does pretty well under Kanare’s solvent tests and the pen’s solvent-based ink makes writing more permanent. Kanare’s only warning about these pens is that the ink can coagulate or settle during long-term storage, leading to performance problems in older pens.

Kanare also brings up the option of using archival quality pens, but it’s not clear without testing if it’s worth the added expense over the long term.

And the winner is…

You can’t go wrong with a humble black ballpoint pen when writing in your lab notebook. This ink will stand up best to fading and spills and provide good permanence, making your records readable for a long time.

Posted in labNotebooks

Why Should I Share My Data?

by Kristin Briney Posted on 2013-03-19

I’m going to be talking a lot about data sharing on my blog, so it’s worth investigating why I believe sharing data is beneficial. There are plenty of reasons for and against sharing–I will highlight some of them in this post–but I believe that the overall balance comes out in favor of sharing.

Reasons for sharing

Many of the reasons for sharing are driven by the desire to make science more transparent. Data sharing helps ensure that we are conducting research properly and that our analyses are reproducible. Freely available datasets (and code!) allow others to test data for anomalies and analyses for validity (example). It is frightening to let others delve into our data to look for errors, but this ultimately makes our science better.

Another reason for data sharing is the ability to conduct novel analyses on datasets. For example, meta-research brings together a variety of datasets, looking for connections that can’t be found in one dataset alone (example). It’s also possible that your data is useful in ways you’ve never dreamed. Freely available data lets researchers create interesting mash-ups that can lead to new science.

Lest you think that data sharing is only good for others, there is also evidence that data sharing increases article citation counts. Data sharing also gives researchers a way to get credit for traditionally unpublishable results. Just because your dataset isn’t interesting enough to publish doesn’t mean it doesn’t have value.

Finally, data sharing benefits society as a whole. Data sharing represents a return on the public’s investment in federally supported research. It also lowers the barrier of entry into research for non-scientists. Finally, data sharing without cost or barrier to access can help spread scientific ideas faster.

Reasons against sharing

One of the biggest concerns about data sharing is being scooped. If I share my data, the fear is that someone will use my data to publish my study before me. There are two points to make here. The first is that data sharing should not be expected before the data’s corresponding paper is published, preventing others from publishing your data before you. Secondly, when someone uses your research outputs (ideas or data) without proper attribution, that person is committing a transgression. I don’t think that we should avoid doing something beneficial just because some people will never follow the rules.

Another major concern about data sharing is that it hurts researchers who invest significant time and effort into their datasets. A large dataset that takes years to acquire and may be used for several papers is not easy to freely share. I don’t have a good answer here other than it is worth having a discussion on embargoing data for a short period of time.

Finally, many datasets have issues that prevent sharing, such as human subject information and health information. This is a valid concern and such data should not be shared as is. It is possible to deidentify some datasets, but I recognize that there are other datasets that just can’t be shared.

Why I believe in data sharing

I think it’s important to remember in the context of data sharing that early scientists were secretive and did not publish their results in journal articles. This changed in 1665 with the arrival of the first scientific journal. Since then, the journal article has become the currency upon which scientific exchange is based–but the journal article is only the norm because we as scientists have made it so. If we find value in shared data, we researchers can change our norms to make data another research currency.

So why would we place research data at the same level as the journal article? We should because sharing data, with some limitations such as for privacy, benefits the greatest number of people. Scientists benefit through reproducibility, novel analyses, and more citations while non-scientists benefit through better access to science and the ability to access the results that our tax dollars have paid for. Technology has enabled us to share data with unprecedented ease and, by doing so, we can dramatically further the cause of science.

This post represents the highlights of why I think data sharing is beneficial. I understand that not everyone agrees with my view and for that reason we should move toward more data sharing in a smart and measured way. There is definite momentum in the direction of sharing original research data and I’m looking forward to having more discussions about it on this blog.

Posted in openData