Recently, New York City released a dataset containing information on over 170 million taxi rides in the city. While this dataset is a treasure trove for researchers, it is also problematic due to anonymization gone wrong. This is just one of many examples of anonymization problems in publicly released data, but a useful one to discuss on the blog.
The key issue is that the taxi dataset contains drivers’ taxi numbers and license numbers. In order to release the data to the public, this information must be translated into a form that cannot be directly linked back to any individual. The data administrators used a common process, hashing, to achieve this anonymization.
Hashing works by performing a prescribed computation on a textual input (or the bits that make up a file) to turn that input into a (nominally) unique value of consistent length. For example, the MD5 hash function spits out 32-character hashes, transforming the value zero into “cfcd208495d565ef66e7dff9f98764da”. Hashing is generally a one-way computation because it is difficult to recover the input value when you only know the hashed value. Hashing is a popular method for anonymization because a given input always results in the same hash value, allowing for correlation between related, but hashed, information within a dataset. Multiple hash algorithms are available, such as MD5 and SHA-1.
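As a quick illustration, here is how you can reproduce that MD5 example with Python's standard hashlib module:

```python
import hashlib

# MD5 produces a 128-bit digest, conventionally written as 32 hex characters.
digest = hashlib.md5(b"0").hexdigest()
print(digest)  # cfcd208495d565ef66e7dff9f98764da

# The same input always produces the same hash, which is what lets hashed
# IDs stay correlated across records in a dataset.
assert hashlib.md5(b"0").hexdigest() == digest
```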
Hashing has a major drawback that the taxi data make apparent: hashing becomes far less secure if your input values have a prescribed format. In the case of this dataset, taxi license numbers have one of the following formats (where ‘X’ = letter and ‘9’ = number):
- 9X99
- XX999
- XXX999
Given a prescribed format and a known hash function, it is easy to calculate all of the possible input values, run the hash function on each, and compare the complete set of hashed values to the values in the dataset. In the case of the taxi data, researchers almost immediately discovered that the data had been hashed with MD5. By computing all 22 million possible taxi license numbers and running MD5 on each of them, researchers were able to completely re-identify the dataset.
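Here is a minimal sketch of that brute-force attack in Python; the format strings come from the list above, but the function and variable names are my own for illustration:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

# The three published license-number formats ('X' = letter, '9' = number).
FORMATS = ["9X99", "XX999", "XXX999"]

def candidates(fmt):
    """Yield every license number matching a format string."""
    pools = [ascii_uppercase if ch == "X" else digits for ch in fmt]
    for combo in product(*pools):
        yield "".join(combo)

# Hash every candidate and build a reverse-lookup table from hash to plaintext.
lookup = {}
for fmt in FORMATS:
    for license_number in candidates(fmt):
        lookup[hashlib.md5(license_number.encode()).hexdigest()] = license_number

# Any hashed license in the released data can now be reversed in one step:
# original = lookup[hashed_value]
```

Enumerating and hashing every candidate takes only minutes on ordinary hardware, which is why a constrained input format defeats hashing so thoroughly.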
So does this mean that you can no longer use hashing to de-identify your data? No, but hash functions should not be your only anonymization tool. In the case of the taxi data, one way to improve on the hashing would be to substitute a random number for each license ID, then hash the random numbers. Not only does this make the hashed values harder to re-identify (because the inputs have no consistent format), but even if the hashes are reversed, an attacker recovers only the random values, which are not personally identifiable. Note that administrators must maintain a list of IDs and their corresponding random values if they ever want to re-identify the dataset.
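As a hypothetical sketch of that approach, the random surrogates could be generated with Python's secrets module (the function name and token size here are my own choices, not a prescription):

```python
import secrets

def build_alias_table(license_ids):
    """Map each real license ID to a random surrogate value.

    The returned table must be kept private by the data administrators;
    it is the only way to re-identify the dataset later.
    """
    return {license_id: secrets.token_hex(16) for license_id in license_ids}

# Example with made-up license numbers matching the published formats.
aliases = build_alias_table(["5X55", "AB123", "ABC123"])

# Publish (or hash) aliases[license_id] in place of the real ID. Because the
# surrogates are random 128-bit values, an attacker cannot enumerate them.
```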
Even running personally identifiable data through an alias may not completely anonymize a dataset. For example, it may be possible to identify drivers and passengers in this dataset by using geographic and timestamp information. In 2000, data privacy expert Dr. Latanya Sweeney showed that roughly 87% of Americans can be uniquely identified using only their ZIP code, birth date, and sex. This is what makes the anonymization of research datasets so challenging: the interaction of variables within the dataset (or between the dataset and outside datasets) can make information personally identifiable.
There is obviously more to anonymizing data than I can cover in one blog post, but the moral of the story is that you should not assume that removing or obscuring variables in a dataset makes your data anonymous. It is much better for your research (and your job security) if you put a little thought into the anonymization process. Proper anonymization is tricky, so don’t assume that the bare minimum is sufficient for your data.