In part 1, I touched on four of the most common challenges with data restoration in a disaster scenario. In this post, I will review some other key considerations. These examples focus on the infrastructure required after a disaster has occurred.
Available restore target
After a disaster occurs, the immediate response is to try to get critical systems up and running. The process involves restoring data to servers that are in place at the DR site. The challenge is ensuring that you have enough computing resources available. What good is a restore operation if you have nothing to restore the data to? This problem gets worse in environments where new servers are frequently provisioned in the primary datacenter.
The biggest challenge is that IT budgets are limited, and companies often cannot afford a matching DR server for every production server. The good news is that server virtualization platforms such as VMware can help. With virtualization, you can provision multiple virtual servers on one physical server, reducing the number of physical servers you need at the remote site. The downside is that performance may suffer on some virtual servers due to resource contention; however, in most instances, a full recovery with reduced performance is better than no recovery. The best way to ensure that the DR site is adequately provisioned is to perform periodic tests in which you bring up all of your DR servers and confirm that they are in working order and that adequate computing resources are available.
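As a rough illustration of the capacity question, the sketch below totals the CPU and memory that a set of production servers would need and compares it with what a DR host can offer. The `Server` type, the `capacity_shortfall` function, the sample figures and the CPU overcommit ratio are all hypothetical; a real sizing exercise would use measured utilization from your own environment.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """A production server that must be recoverable at the DR site (hypothetical model)."""
    name: str
    vcpus: int
    memory_gb: int


def capacity_shortfall(servers, host_vcpus, host_memory_gb, cpu_overcommit=1.0):
    """Return (vcpu_shortfall, memory_gb_shortfall) for the DR host.

    A value of 0 means the host has enough of that resource.
    cpu_overcommit is an assumed ratio of virtual to physical CPUs;
    memory is treated as not overcommitted.
    """
    needed_vcpus = sum(s.vcpus for s in servers)
    needed_memory = sum(s.memory_gb for s in servers)
    usable_vcpus = host_vcpus * cpu_overcommit
    return (
        max(0, needed_vcpus - usable_vcpus),
        max(0, needed_memory - host_memory_gb),
    )


# Illustrative figures only.
production = [
    Server("order-entry", vcpus=8, memory_gb=32),
    Server("fileshare", vcpus=4, memory_gb=16),
]
cpu_gap, memory_gap = capacity_shortfall(production, host_vcpus=8, host_memory_gb=64, cpu_overcommit=2.0)
print(f"vCPU shortfall: {cpu_gap}, memory shortfall (GB): {memory_gap}")
```

A check like this does not replace a live DR test, but running it whenever new production servers are provisioned can flag a shortfall before the next test.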
After a disaster occurs, the administrator must decide which systems to recover first. This can be a complex decision given the competing needs of various departments and the limited availability of tape restoration resources. Trying to set these priorities after a disaster has occurred is a recipe for trouble. Administrators must be able to focus on restoring data, not on negotiating restore priorities with various stakeholders.
The best solution here is to understand the criticality of various servers BEFORE the disaster occurs. You need to understand your environment and categorize systems based on the level of criticality for your organization. For example, the corporate order entry system is probably more important than each employee’s fileshare. Once you have your priorities clearly defined, you can focus on restoring the most critical data after a disaster occurs.
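One way to capture such priorities ahead of time is a simple ranked inventory. The sketch below is a minimal example under assumed conventions: the `System` type, the numeric `tier` (1 being most critical) and the `rto_hours` field are hypothetical names, and the sample values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class System:
    """An application or server with an agreed criticality (hypothetical model)."""
    name: str
    tier: int         # 1 = most critical; agreed with stakeholders before any disaster
    rto_hours: float  # assumed target time to have the system running again


def restore_order(systems):
    """Sort by tier first, then by the tighter recovery target, then by name."""
    return sorted(systems, key=lambda s: (s.tier, s.rto_hours, s.name))


# Illustrative inventory only.
inventory = [
    System("employee-fileshare", tier=3, rto_hours=72),
    System("order-entry", tier=1, rto_hours=4),
    System("email", tier=2, rto_hours=24),
]
for position, system in enumerate(restore_order(inventory), start=1):
    print(position, system.name)
```

Keeping the list in a reviewed, version-controlled file means the restore team can follow it directly instead of negotiating priorities during the outage.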
After the disaster occurs, you will immediately need to provision computing resources to serve as targets for recovery operations. The challenge is ensuring that you have sufficient infrastructure for the new resources. Power, cooling and computing capacity are vital, and you must ensure that you have enough of each to meet your requirements. This can be especially troublesome if multiple entities at the DR site are all trying to implement their plans simultaneously.
This problem is avoidable with planning and testing. The best solution is to perform periodic DR tests to ensure that you have enough of these limited resources. Additionally, you must ensure that you have the appropriate SLAs in place with your DR provider to guarantee that they can provide enough power and cooling resources to meet your requirements.
For the DR process to begin, you need people to provision the DR environment and start the restore process. You can use existing team members to handle the process; however, in a major disaster, those people may be unable to reach the DR site or may be unreachable by telephone. This can be very problematic, and it raises an interesting debate about whether it is better to have a relatively close DR site (easy to access, but more at risk) or a distant one (difficult to access, but safer).
There is no simple solution to the above problem. As mentioned, one option is to choose a closer DR site, but that brings other issues. Each company is best placed to make this decision based on its business requirements. At the very least, companies should proactively monitor situations and be ready to send at least one person to the remote site in advance of any recognizable DR scenario. Of course, in some cases there is no advance warning, and for those cases you should ensure that you have IT experts geographically dispersed as part of your business operations.
All of the above challenges are addressable with careful planning. One of the most important activities a team can perform is periodic DR testing. These simulations allow companies to refine their DR processes and minimize downtime if a disaster occurs. All of these activities should happen in the context of a larger DR plan that covers testing, staffing, application prioritization and other elements.