Cloud Backup

I’m going to come right out and say that Dropbox is not a sufficient backup. If all you have are files in a Dropbox folder that are synced to the cloud, you should not consider your files to be backed up and safe. This is because your files are now entirely dependent on a company’s business model, one of the main perils of cloud storage, but also because synced cloud storage is not a true backup.

The reason Dropbox is not a good backup relates to how different cloud storage services work and the Rule of 3. The Rule of 3 states that you should have 3 copies of your data, 2 onsite and 1 offsite, for safest storage. The crux of the issue is that services like Dropbox, Box Sync, and OneDrive were designed to provide easy access to content from multiple locations and not to provide dedicated offsite backup. Because your files are synchronized across multiple locations, you really have one “copy” of the data that lives in both the cloud and locally. This is not enough to satisfy the Rule of 3.

With syncing, the method of creation and destruction matters – namely, when you update a file in one location it gets updated universally. Likewise, when you delete a file on your local Dropbox folder, it gets deleted in the cloud and vice versa. So if you are using synced storage and something happens to your local device, there is a chance your synced files in the cloud are at risk. And if Dropbox accidentally loses data in the cloud, as happened with cloud storage provider Dedoose, your local data are at risk.

I wish I could say that this is all theoretical, but people using synced cloud storage have lost data. For example, one researcher lost 8,000 photos both locally and in the cloud after a syncing glitch in Dropbox. Another person lost all of his Box files when his account was rolled into an unrelated corporate account. The good news is that synced storage services like Box and Dropbox do hold on to deleted files for 30 days, but even this safety net is not foolproof.

So what should you do to make your data safer in this case? Add a backup to this system. Put a copy of your data on a local hard drive in addition to storing it in Dropbox. Alternatively, you can use a cloud storage service that provides independent storage/backup. For example, I use SpiderOak as an offsite backup. SpiderOak monitors my local files and saves a new version of a file to the cloud whenever I update it. This process is automatic, just like with syncing, but my cloud copy is independent of my local copy. If I delete a local file, the copy in the cloud is unaffected and vice versa. This means my cloud storage provides a true offsite backup and I’m more likely to get my files back if something catastrophic happens locally to my computer.

Cloud storage is a wonderful development in terms of convenience and providing offsite backup or access, but you should never rely on the cloud alone. It’s always best to follow the Rule of 3 and get another backup for your data, just in case.


Data Dictionaries

Recently, I was reading through Christie Bahlai’s excellent roundup of spreadsheet best practices when I started thinking about documenting spreadsheets. You see, best practices say that spreadsheets should contain only one large data table with short variable names at the top of each column, which doesn’t leave room to describe the formatting and meaning of the spreadsheet’s contents. This information is important, especially if you are trying to use #otherpeoplesdata, but it honestly doesn’t belong in the spreadsheet.

So how do you give context to a spreadsheet’s contents? The answer is a data dictionary. And seeing as I haven’t found a good post on data dictionaries, and data dictionaries are right up there with README.txt files as a Documentation Structure of Awesomeness™, I obviously need to give them a whole post on this blog.

So what is a data dictionary? A data dictionary is a document that describes the contents and structure of a dataset. Generally, a data dictionary includes an overall description of the data along with more detailed descriptions of each variable, such as:

  • Variable name
  • Variable meaning
  • Variable units
  • Variable format
  • Variable coding values and meanings
  • Known issues with the data (systematic errors, missing values, etc.)
  • Relationship to other variables
  • Null value indicator
  • Anything else someone needs to know to better understand the data

This list represents the types of things you would want to know when faced with an unknown dataset. Not only is such information incredibly useful if you’re sharing a dataset, but it’s also useful if you plan to reuse a dataset in the future or you are working with a very large dataset. Basically, if there’s a chance you won’t remember the details or never knew them in the first place, a data dictionary is needed.
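If you want a head start on writing one, it’s easy to generate a fill-in-the-blanks skeleton from a spreadsheet’s header row. Here’s a minimal Python sketch; the sample variables are hypothetical stand-ins for a real spreadsheet:

```python
import csv
import io

# Hypothetical spreadsheet header, standing in for a real CSV file
sample = "AnimalID,Taxon,Weight_g,WeightDate\n1,CMED,352,2014-03-05\n"
variables = next(csv.reader(io.StringIO(sample)))

# One skeleton entry per variable, covering the fields listed above,
# to be filled in by hand
fields = ["Meaning", "Units", "Format", "Coded values and meanings",
          "Known issues", "Relationship to other variables",
          "Null value indicator"]
for name in variables:
    print(f"Variable name: {name}")
    for field in fields:
        print(f"  {field}: ")
    print()
```

The filling-in is still manual work, of course, but starting from a complete list of variables means nothing gets skipped.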

Lemur Spreadsheet Subset

Let’s look at a real world example from some newly released data from the Duke Lemur Center (data descriptor, dataset). I downloaded the “DataRecord_3_DLC_Weight_File_06Jun14.csv” file from Dryad and found that, while the dataset is very clean (yay!), I can’t interpret all of the data from the spreadsheet alone. For example, what does the variable “AgeAtWt_mo_NoDec” mean or what does the “Taxon” variable code “CMED” stand for? Enter the data dictionary in the form of the README.doc file.

Lemur Data Dictionary Subset

The Lemur data dictionary nicely lays out information on each variable in the dataset. For example, it defines the variable “AgeAtWt_mo_NoDec” as

Age in months with no decimal:  AgeAtWt_mo  value rounded down to a whole number for use in computing average individual weights (FLOOR(AgeAtWt_mo))

It also has a whole separate table listing the various Taxon codes. This is just the type of added context that describes the variables enough to make the data useful. It’s also the type of information that you can’t smoosh into a spreadsheet without ruining the spreadsheet’s order and computability. So this data dictionary is adding a lot of value and context to the data without messing up the data themselves.

The Lemur dataset can be easily understood and reused because it has clean data, well-named variables, and a nice data dictionary. If you are sharing your data publicly, or even just with your future self, plan to give your data the same treatment. And if you don’t have time to do all three preparations? Make the data dictionary. You can’t use data you don’t understand.

Now go out and make some data dictionaries!


Dating Your Data (or How I Learned to Stop Worrying and Love the Standard)

I’m going to come right out and admit something terribly nerdy: I have a favorite standard. It’s ISO 8601. My having a favorite standard probably doesn’t surprise you, as I am a person who writes a blog on data management for fun. Why wouldn’t I have a favorite standard? But this isn’t just a personal quirk (though I do use the standard often); ISO 8601 is incredibly useful for data management. Therefore, I want to make my favorite standard your favorite standard too.

The standard ISO 8601 concerns dates, a common type of information used for data and documentation. To understand why this standard is important, consider the following dates:

  • March 5, 2014
  • 2014-03-05
  • 3/5/14
  • 05/03/2014
  • 5 Mar 2014

All of these represent the same date but are expressed in different formats. The problem is that if I use all of these formats in my notes, how will I ever find everything that happened on March 5th? It’s simply too much work to search for all the possible variations. The answer to this problem is ISO 8601.

ISO 8601 dictates that dates should use the format “YYYYMMDD” or “YYYY-MM-DD”. So the example date above becomes “20140305” or “2014-03-05”. This provides you with a consistent format for all of your dates. Such consistency allows you to more easily find and organize your data, the hallmark of good data management.

ISO 8601’s consistency is nice in and of itself, but here’s where things get really awesome: when you use ISO 8601 dates at the beginning of file names. This is because dates using this standard sort chronologically by year, by month, and then by date. So if you date all of your file names using ISO 8601, you suddenly have a super easy way to find and sort through information.

Let me give you an example to show you how wonderful this is. I recently cleaned up over 10 years of files for a committee that I am currently on. The committee’s membership changes each calendar year and it was hard to find specific files from previous committees. My solution was to make all the file names start with a date. This makes everything super easy to find and I can now simply ignore content from years I don’t need.

The other great thing about using the “YYYY-MM-DD” format is that you can mix and match how specific your dates are. For example, all of my presentation files live in folders labeled by date. One-off presentations are given an exact date, e.g. “2014-04-30_DataManagementWebinar”, while presentations that I give multiple times are only given a year, e.g. “2013_CreatingADMP”. I also have files that end up with just a year and a month, e.g. “2012-09_Website”. No matter the specificity of the date, all of these files sort chronologically. It’s a beautiful thing.
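You can see this sorting behavior for yourself in a few lines of Python, using the file names from the examples above:

```python
from datetime import date

# Python's standard library formats dates per ISO 8601 out of the box
d = date(2014, 3, 5)
iso = d.isoformat()
print(iso)  # 2014-03-05

# File names that lead with ISO 8601 dates sort chronologically,
# even when the dates vary in specificity, because a plain
# alphabetical sort compares year, then month, then day
files = [
    "2014-04-30_DataManagementWebinar",
    "2012-09_Website",
    "2013_CreatingADMP",
]
for name in sorted(files):
    print(name)
```

No special tooling is needed; any file browser’s alphabetical sort gives you chronological order for free.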

I highly recommend using ISO 8601 as the way you write dates in your research. It’s a trivially small change, but can have a huge impact in terms of how easy it is to find and use your content. That is data management at its very best.


Anonymization Gone Wrong

Recently, New York City released a dataset containing information on over 170 million taxi rides in the city. While this dataset is a treasure trove for researchers, it is problematic due to anonymization gone wrong. This is just one of many examples of anonymization problems in publicly released data, but a useful one to discuss on the blog.

The key issue is that the taxi dataset contains drivers’ taxi numbers and license numbers. In order to release the data to the public, this information must be translated into a form that cannot be directly linked back to any individual. The data administrators used a common process, hashing, to achieve this anonymization.

Hashing works by performing a prescribed computation on a textual input (or the bits that make up a file) to turn the input into a (nominally) unique value of consistent length. For example, the MD5 hash function spits out 32-character hashes, transforming the value zero into “cfcd208495d565ef66e7dff9f98764da”. Hashing is generally a one-way computation because it is difficult to recover the input value when you only know the hashed value. Hashing is a popular method for anonymization because a given input always results in the same hash value, allowing for correlation between related, but hashed, information within a dataset. Multiple hash algorithms are available, such as MD5 and SHA-1.
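To illustrate with Python’s standard hashlib module:

```python
import hashlib

# MD5 always maps the same input to the same 32-character hex digest
print(hashlib.md5(b"0").hexdigest())  # cfcd208495d565ef66e7dff9f98764da

# A slightly different input produces a completely different digest,
# and there is no direct way to compute "1" back from its digest
print(hashlib.md5(b"1").hexdigest())
```

This determinism is exactly what makes hashing attractive for anonymization, and, as we’ll see, exactly what makes it attackable.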

Hashing has a major drawback that is made apparent by the taxi data: hashing becomes much less secure if your input values have a prescribed format. In the case of this dataset, taxi license numbers have one of the following formats (where ‘X’ = letter and ‘9’ = number):

  • 9X99
  • XX999
  • XXX999

Given a prescribed format and a known hash function, it is easy to calculate all of the possible input values, run the hash function, then compare the complete set of hashed values to the values in the dataset. In the case of the taxi data, researchers almost immediately discovered that the data used the MD5 hash function. By computing all 22 million possible taxi license numbers and running the MD5 hash function on them, researchers were able to completely re-identify this dataset.
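Here’s a sketch in Python of that attack for just the “9X99” format; the plate “5D72” is made up for illustration:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

# Enumerate every possible plate in the "9X99" format
# (digit, letter, digit, digit): 10 * 26 * 10 * 10 = 26,000 plates.
# Hash each one and build a reverse lookup from digest -> plate.
rainbow = {
    hashlib.md5(plate.encode()).hexdigest(): plate
    for plate in ("".join(p) for p in
                  product(digits, ascii_uppercase, digits, digits))
}

# Any "anonymized" digest of a 9X99 plate is now trivially reversed
anon = hashlib.md5(b"5D72").hexdigest()
print(rainbow[anon])  # 5D72
```

The full attack just repeats this for the other two formats; even the complete set of roughly 22 million inputs takes only seconds on a laptop.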

So does this mean that you can no longer use hashing to de-identify your data? No, but hash functions should not be your only anonymization tool. In the case of the taxi data, one way to improve on the hashing would be to substitute a random number for each license ID, then hash the random numbers. Not only does this make the hashed values harder to re-identify (because the input has no consistent format), but even if the data are re-identified, the random values are not personally identifiable. Note that administrators must maintain a list of IDs and their corresponding random values if they ever want to re-identify the dataset.
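A minimal Python sketch of this improved scheme; the function and table names here are my own invention, not anything from the taxi dataset:

```python
import hashlib
import secrets

# Private lookup table: license ID -> random alias. Administrators
# must keep this table if they ever want to re-identify the data.
alias_table = {}

def anonymize(license_id):
    # Assign each ID a random alias on first sight, then hash the
    # alias instead of the ID itself
    alias = alias_table.setdefault(license_id, secrets.token_hex(16))
    return hashlib.md5(alias.encode()).hexdigest()

# The same ID still maps to the same digest, so records stay linked...
print(anonymize("5D72") == anonymize("5D72"))  # True
# ...but the digest can no longer be reversed by enumerating
# plate formats, because the hash input is random
print(anonymize("5D72") == anonymize("9X41"))  # False
```

Because the hash inputs are now random 32-character strings rather than four-to-six-character plates, the enumeration attack above no longer applies.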

Even running personally identifiable data through an alias may not completely anonymize a dataset. For example, it may be possible to identify drivers and passengers in this dataset by using geographic and timestamp information. In 2000, data privacy expert Dr. Latanya Sweeney showed that 87% of Americans can be uniquely identified by just ZIP code, birthdate, and sex. This is what makes the anonymization of research datasets so challenging: the interaction of variables within the dataset (or between the dataset and outside datasets) can make information personally identifiable.

There is obviously more to anonymizing data than I can cover in one blog post, but the moral of the story is that you should not assume that removing or obscuring variables in a dataset makes your data anonymized. It is much better for your research (and your job security) if you put a little thought into the anonymization process. Proper anonymization is a tricky process, so don’t assume that the bare minimum is sufficient for your data.


A Great Password Policy

In response to my previous blog post on strong passwords, a friend pointed out Stanford’s new password policy, which I quite like and thought worth sharing. This policy plays off probabilities, meaning that if you decrease the number of total characters in your password (decreasing the total permutations), you must use more character types instead (increasing the total permutations).

The policy breaks requirements into 4 tiers by password length:

  • “8-11: requires mixed case letters, numbers, and symbols
  • 12-15: requires mixed case letters and numbers
  • 16-19: requires mixed case letters
  • 20+: any characters you like!”

This policy is also mobile friendly, as it’s much easier to type a bunch of letters on a phone than a few random symbols.
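As an illustration (my own sketch, not Stanford’s actual code), the four tiers can be expressed as a simple checker in Python:

```python
import string

def meets_policy(password):
    """Return True if the password satisfies the tiered policy above."""
    has_lower = any(c.islower() for c in password)
    has_upper = any(c.isupper() for c in password)
    has_digit = any(c.isdigit() for c in password)
    has_symbol = any(c in string.punctuation for c in password)
    n = len(password)
    if n >= 20:                  # 20+: any characters you like
        return True
    if n >= 16:                  # 16-19: mixed case letters
        return has_lower and has_upper
    if n >= 12:                  # 12-15: mixed case letters and numbers
        return has_lower and has_upper and has_digit
    if n >= 8:                   # 8-11: mixed case, numbers, and symbols
        return has_lower and has_upper and has_digit and has_symbol
    return False                 # under 8 characters: rejected

print(meets_policy("P@ssw0rd"))                   # short but complex
print(meets_policy("correcthorsebatterystaple"))  # long, letters only
print(meets_policy("passwordpass"))               # 12 chars, too simple
```

Note the trade at each tier: every four characters of extra length buys you one fewer required character class.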

I find the policy flexible and accessible and I hope it helps improve your understanding of how to make a strong password.

Stanford IT Services.

Strong Passwords

An important part of data management is protecting data from loss. While good storage practices are a first line of defense, there are several other things you can do to help keep your data secure. One is to use strong passwords.

Strong passwords prevent other people from accessing your systems, either because an outsider cannot guess the password or because a computer cannot brute-force attack the system until it stumbles upon the right password. Even if you don’t deal with sensitive data, it’s still a good idea to put a barrier, in the form of a password, between your data and people who might accidentally or purposefully harm your files.

Strong passwords have a number of characteristics. The first is that they are not obvious, meaning that they are not easy to guess. There are several flavors of obvious passwords, starting with the generally obvious password. This category includes passwords like:

  • 12345
  • qwerty
  • abc123
  • password (or passw0rd)

These examples are actually from a list of the 25 worst passwords of 2011. It’s worth perusing the list because it’s very enlightening.

The second category of obvious passwords includes passwords that are personal to you but still easy to guess. This includes things like:

  • Your pet’s name
  • A family member’s name
  • A birthdate, a marriage date, etc.
  • Your username
  • The name of your favorite band, movie, etc.

Personally obvious passwords offer a little more protection than generally obvious passwords, but are still easy to guess if the hacker knows something about you.

A third category of obvious passwords is the single dictionary word. You should avoid this category because dictionary words are more vulnerable to brute-force attack and are still fairly guessable. Here are some examples of passwords to avoid:

  • monkey
  • baseball
  • dragon
  • sunshine

These examples are actually pulled from the bad password list linked above, which is further proof of why you should avoid the single-word password.

Another characteristic of a strong password is that it is not used for more than one platform. Using the same password on multiple platforms means that if one platform is hacked, you are now vulnerable on other platforms. This does mean maintaining a lot of passwords, but using a different password for each system you work on makes everything more secure overall.

Now that we’ve looked at a few things that passwords shouldn’t be, let’s look at some characteristics that strong passwords should have. The first characteristic is that passwords should be long – strong passwords have at least 8 characters and preferably more. The reason for using long passwords is that they are harder to crack – a fact which comes back to basic probabilities. There are a total of (26)^8, or over 200 billion, possible options for 8-character passwords consisting of only lowercase letters. That’s a lot of passwords to try in order to find the right one. When you add more characters, you increase the total number of permutations and decrease the probability of finding the right password on any one guess. Therefore, long passwords make strong passwords.

A second quality of strong passwords is that they mix upper- and lowercase letters, numbers, and symbols. Coming back to probabilities, we can see why this is a good thing. If we add uppercase letters into our 8-character password, suddenly we have (52)^8 possible passwords, or over 50 trillion permutations. That number goes up when you add in numbers and symbols. Plus, a lot of variety makes it harder to outright guess a password. So always use many types of characters in your passwords.
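You can verify these numbers yourself (the symbol count below is approximate, since it varies by keyboard):

```python
# Search-space sizes from the probabilities above
lowercase_only = 26 ** 8     # 8 characters, lowercase letters only
mixed_case = 52 ** 8         # add uppercase letters
print(f"{lowercase_only:,}")  # 208,827,064,576 (~200 billion)
print(f"{mixed_case:,}")      # 53,459,728,531,456 (~50 trillion)

# Adding the 10 digits and roughly 32 keyboard symbols grows the
# character pool to about 94, and the search space even further
print(f"{94 ** 8:,}")
```

Each extra character class multiplies every position’s options, so the search space grows exponentially with both pool size and length.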

The final characteristic of a strong password is that it is something that you’ll actually remember. It’s not worth using a password that you have to physically write down in order to remember; that defeats the security of your password. The good news is you can now use a password manager to keep track of your passwords. Password managers also make it easy to maintain different passwords for different platforms, a problem mentioned above. The one thing to note is that you must use a really strong password for the manager software, as this will help keep all of your other passwords safe.

So how does one combine all of this advice into a single password? One strategy is to string a few words together and throw in some uppercase letters, numbers, and extra characters. For example, I can combine the words “badger” and “moon” into the password “1B@dger+1/4mOOn”. Not only is this a strong password, but it’s easy for me to remember by the phrase “one badger plus quarter moon”. A second strategy is to hack up a phrase you will remember. For example, I can abbreviate the phrase “wit beyond measure is man’s greatest treasure” from the Harry Potter books and make it into the password “WbMiMgT#Ravenclaw”. A third strategy is to borrow from l33t speak to swap out letters with equivalent l33t characters. All of these strategies reduce to using a bit of creativity to transform something you will remember into a strong password.

My last word on passwords is that you should never, ever share your passwords. Just don’t do it. Ever. You obviously don’t want strangers gaining access to your account, but even friends without malicious intent can mess with your files. You also give up control when sharing a password because the other person is free to further share your login information. Plus, if anything bad happens, you are to blame because the issue happened under your login credentials. Suffice to say, sharing a password is never a good idea. Take it from The Doubleclicks and these other geeky icons: Don’t Tell Anyone Your Password.
