Hurricane Ike has been in the news lately and my sympathy goes out to all those affected. It is events like these that test IT resiliency. The damage can range from slight to severe, and we invest in reliable, robust data protection processes precisely to guard against disasters like this. The unfortunate reality is that, no matter how much you plan for it, the recovery process often takes longer and is more difficult than expected.
In many respects, data protection is an insurance policy. You hate to pay your homeowner's premium every month, but you do it because you know it is your only protection if major damage ever happens to your house. In the case of data protection, you invest hours managing your backup environment so that you can recover from incidents like this. Even so, with the best planning and policies, things still may not turn out as expected. Four of the most common pitfalls I hear about from customers are:
Tape corruption
This is a scenario where a tape is unreadable during restoration. There are numerous causes, including dirty tape heads, old media, damaged media, and improperly stored media. The loss of even one tape can have a major impact on a restore operation, especially if it contains data from numerous backup sets.
The best solution to this problem is to perform DR tests periodically to ensure that your backup hardware and media are functioning properly. Furthermore, disk-based backup can often eliminate these types of problems, since disk is typically much more reliable than tape media. You should also consider replication so that your DR site benefits from the same reliability and performance as your primary site.
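To make those periodic DR tests less painful, it helps to automate the verification step. The sketch below assumes you keep a simple JSON manifest of file checksums recorded at backup time and compare it against a test restore; the manifest format, paths, and file names are my own illustration, not a feature of any particular backup product.

    # Hypothetical spot-check: compare checksums of restored files against a
    # manifest recorded at backup time. Paths and manifest format are assumptions.
    import hashlib
    import json
    from pathlib import Path

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_restore(manifest_file: Path, restore_root: Path) -> list:
        """Return the relative paths whose restored contents are missing or corrupt."""
        manifest = json.loads(manifest_file.read_text())  # {"relative/path": "sha256", ...}
        mismatches = []
        for rel_path, expected in manifest.items():
            restored = restore_root / rel_path
            if not restored.exists() or sha256(restored) != expected:
                mismatches.append(rel_path)
        return mismatches

    if __name__ == "__main__":
        bad = verify_restore(Path("backup_manifest.json"), Path("/mnt/test_restore"))
        print("Restore verified" if not bad else f"{len(bad)} files failed verification")

Even a spot-check like this, run against a sample of each month's test restores, will catch unreadable media long before a real disaster does.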
Missed servers
In today’s rapidly changing IT environments, it is vital that all required servers are protected. The problem is that new servers may be provisioned at any time by many different users. With so many computing resources coming online, it is very easy to miss adding a critical one to the backup environment. A disaster is the worst time to discover this.
This is a difficult problem to remedy since there may be numerous parties involved (e.g., the teams responsible for provisioning servers, storage, and backups). The backup team should periodically perform internal audits to ensure that critical systems are being adequately protected. Backup reporting tools can also help in this area, since they can provide detailed reports summarizing policies and servers.
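A basic version of such an audit can be automated: diff the server inventory against the list of clients the backup application knows about. The snippet below is a minimal sketch; the two plain-text input files (one hostname per line) and their names are assumptions for illustration, not the output of any specific tool.

    # Hypothetical audit: compare a server inventory (e.g., a CMDB export)
    # against the client list known to the backup application. Both inputs are
    # assumed to be plain text files with one hostname per line.
    def load_hosts(path):
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def find_unprotected(inventory_file, backup_clients_file):
        inventory = load_hosts(inventory_file)
        protected = load_hosts(backup_clients_file)
        return sorted(inventory - protected)

    if __name__ == "__main__":
        missing = find_unprotected("server_inventory.txt", "backup_clients.txt")
        for host in missing:
            print(f"WARNING: {host} has no backup policy")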
Failed jobs
Managing a backup environment is a complex process and can include hundreds of different policies. Such complexity puts a strain on backup administrators, especially since the reporting tools integrated with most backup applications are limited. The result is that some backup jobs can fail, leaving the administrator either unaware of the failure or unable to do anything about it. (Imagine a large server that is supposed to back up nightly and fails. The administrator discovers this in the morning and is left with the choice of skipping the backup or impacting the environment by starting a backup during operational hours.) This is a common problem, and it leaves the administrator with a difficult choice.
There are a couple of good solutions for this problem. First, reporting tools can help the administrator pinpoint exactly which jobs have failed and can even help pinpoint why. This is very powerful information, but it is only useful after the failure has occurred. The best solution is one that avoids the failure in the first place. A disk-based backup appliance provides dramatically improved reliability compared to a tape-based solution, so disk will reduce backup failures; look for a solution that also includes HA features such as redundant power, cooling, and data paths.
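Even a backup application with limited reporting usually lets you export job status, and a small script can then surface the failures first thing in the morning. The sketch below assumes an exported CSV with client, policy, status, and start_time columns; that layout and the file name are my own illustrative assumptions rather than any vendor's format.

    # Hypothetical failed-job report: scan an exported job log and summarize
    # failures by client so they can be rerun or escalated promptly.
    import csv
    from collections import Counter

    def summarize_failures(job_log_csv):
        failures = []
        with open(job_log_csv, newline="") as f:
            for row in csv.DictReader(f):
                if row["status"].strip().lower() != "success":
                    failures.append(row)
        by_client = Counter(row["client"] for row in failures)
        return failures, by_client

    if __name__ == "__main__":
        failures, by_client = summarize_failures("nightly_jobs.csv")
        print(f"{len(failures)} failed jobs across {len(by_client)} clients")
        for client, count in by_client.most_common():
            print(f"  {client}: {count} failed job(s)")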
Access to backup media
When restoring data, you must have access to your tapes. In some cases the tapes will be onsite, but in others they may be stored offsite with a vaulting service such as Iron Mountain. The problem is that you need to get those tapes back to either your primary or DR site, which can be especially difficult if fuel availability is limited or roads are impassable. The other problem is that in all cases there is a risk of tape loss. What happens if a tape is missing? Nothing good, I assure you.
The solution is to move to a disk-based environment with replication. In this case, your first option is to restore locally from your disk-based device. The good news is that deduplication enables you to retain massive amounts of data locally at a cost similar to tape; you get the ease of use and performance of disk without spending much more than you would on tape. This provides the fastest and most reliable restores and ensures that local data won't get lost. If your local site is no longer operational, you can use the replicated copy of your data at the DR site. However, restore performance is vital in this case since you will be restoring large amounts of data and, as I have posted before, deduplication algorithms can have a large negative impact on restore speeds.
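To see why deduplicated disk can land in the same cost ballpark as tape for retention, here is a back-of-the-envelope calculation. The deduplication ratio and the per-TB prices are purely illustrative assumptions; plug in your own numbers.

    # Rough comparison of cost per usable TB of retained backup data.
    # The 20:1 dedupe ratio and the prices below are illustrative assumptions, not quotes.
    def cost_per_retained_tb(raw_price_per_tb, dedupe_ratio=1.0):
        """Effective cost of retaining 1 TB of backup data."""
        return raw_price_per_tb / dedupe_ratio

    tape_cost = cost_per_retained_tb(raw_price_per_tb=40)                    # tape media, no dedupe
    disk_cost = cost_per_retained_tb(raw_price_per_tb=600, dedupe_ratio=20)  # deduplicating disk

    print(f"Tape: ${tape_cost:.0f}/TB retained; disk with 20:1 dedupe: ${disk_cost:.0f}/TB retained")

The point of the exercise is the ratio, not the specific prices: a healthy deduplication ratio divides the effective cost of disk capacity to the point where it competes with tape for local retention.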
As an end user, you should consider each of the above points carefully to ensure that these pitfalls do not happen to you. In all cases, they can be avoided with careful use of technology and policy. In part 2 of this article, I will cover some additional pitfalls that I encounter.