There is a saying about storage in the library world: lots of copies keep stuff safe. The acronym, LOCKSS, not only sums up this principle but also gives its name to two digital preservation networks, LOCKSS and CLOCKSS, which libraries buy into to add redundancy to their collections. The idea behind the principle is that even if your local storage system fails, you still have access to your data.
LOCKSS is a great concept, but for everyday storage I boil it down to the ‘Rule of 3’. This rule of thumb says that you should keep 3 copies of your data: 2 onsite copies and 1 offsite copy. This is not only a good level of redundancy but also a very achievable one.
The third, offsite copy is actually critical to the success of the Rule of 3. Many people keep their data and a backup copy onsite, but this doesn’t account for scenarios where the building floods or burns down or a natural disaster strikes. One only has to look at universities recovering after Hurricane Katrina or the 2011 Japanese tsunami to see how devastating a natural disaster can be to research (among other things). Storing a copy of your data offsite makes the recovery process easier if everything local is lost.
While the Rule of 3 speaks mainly to redundancy, I also see it as a recommendation for variety; namely, that each copy should be on a different type of hardware. Usually the first copy is on your computer, so options for the other copies include external hard drives, cloud storage, a local server, CDs/DVDs, tape backup, etc. Each of these technologies has its own strengths and weaknesses, so you spread out your risk by not relying on a single storage type.
For example, if you keep your data backed up offsite in commercial cloud storage, keeping an extra copy on a hard drive onsite means that the safety of your data does not rest solely on the success of one business. Alternatively, tape backup is high quality but slow to restore from, which makes it a great option for the ‘if all else fails’ backup copy. The exact configuration of your backups will depend on the technology options available to you, but variety should be a factor when you choose your systems.
I personally love the Rule of 3 and follow it for my work information. For my data, I keep:
- a copy on my computer (onsite)
- a copy backed up weekly to the office shared drive (onsite)
- a copy backed up automatically to the cloud via SpiderOak (offsite)
The shared drive is the weak link in this chain because I transfer the files manually, but a weekly reminder in my calendar keeps me on top of it. Additionally, I would not use the office shared drive if I had security or privacy concerns about my data. Besides keeping my data in these 3 locations, I have practiced retrieving files from both backups, so I know that they work and how to restore my information if disaster strikes.
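If you would rather script the shared-drive step than rely on a calendar reminder, something like the following Python sketch would do it. The paths are hypothetical placeholders for your own data folder and mounted shared drive (this is an illustration of the idea, not the exact setup I run), and the verification pass at the end is the same kind of restore check I describe above.

```python
import filecmp
import shutil
from pathlib import Path

# Hypothetical paths -- replace with your own data folder and mounted share.
LOCAL_DATA = Path.home() / "research_data"
SHARED_DRIVE = Path("/Volumes/office_share/research_data_backup")

def backup_to_shared_drive(src: Path, dest: Path) -> None:
    """Mirror the local data folder onto the shared drive."""
    # dirs_exist_ok lets repeat runs refresh an existing backup (Python 3.8+).
    shutil.copytree(src, dest, dirs_exist_ok=True)

def verify_backup(src: Path, dest: Path) -> bool:
    """Check that every local file exists in the backup with identical contents."""
    for item in src.rglob("*"):
        if item.is_file():
            target = dest / item.relative_to(src)
            if not target.exists() or not filecmp.cmp(item, target, shallow=False):
                print(f"Problem with backup copy: {target}")
                return False
    return True

if __name__ == "__main__":
    backup_to_shared_drive(LOCAL_DATA, SHARED_DRIVE)
    if verify_backup(LOCAL_DATA, SHARED_DRIVE):
        print("Backup copied and verified.")
    else:
        print("Backup verification failed -- check it before trusting this copy.")
```

Scheduling a script like this with cron or Task Scheduler removes the manual step entirely, though an occasional hands-on restore test is still worth doing.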
In the end, the Rule of 3 is simply an interpretation of the old expression, ‘don’t put all of your eggs in one basket.’ This applies not only to the number of copies of your data but also to the technology on which they are stored. With a little bit of planning, it is very easy to ensure that your data are backed up in a way that dramatically reduces the risk of total loss.
Great post; two questions come to mind.
1. In the ‘Rule of 3’, where does a storage method with inherent redundancy built in (e.g., a RAID array) fall? My guess is you’ll say that’s a “free” 0.5 or so?
2. How do you simplify redundancy while keeping it easy (ideally automatic) and maintaining version consistency in a many-to-many model (multiple access points, multiple data redundancies)? Next, please add in an assortment of data that are classified, proprietary, limited license, protected, etc., all of which may be read-write-executable from a single workstation but have variable limitations on storage and access.
1. I think that’s less a technical question than a people question. Technically, RAID storage may be a free 0.5, but that depends on whether you trust the person running the server. How good are they at keeping things in good working order? If they stop supporting the server, will they let you know in enough time to get a copy of your data? If the person running the system is not trustworthy, then the extra half copy doesn’t matter.
2. Oh, the intricacies of research data. First, separate the issues of access and backup. You need a system that allows access to the data at multiple locations without propagating copies; perhaps a central server can sit behind all of this? That is your working system. For backup, consider a dark archive: storage with controls on who can access it, which may also be slower to retrieve data from. If you’re dealing with licensed and sensitive data, a dark archive off the grid that is hard to get to but safe (maybe tape?) is a good option.