Why Recovery Matters: Two Case Studies

I started this blog over two years ago to focus on the criticality of data protection and specifically data recovery. While technology continues to evolve, the importance of these two elements remains constant. Every company must have a recovery strategy to protect against data loss or corruption. Some people may be inclined to de-emphasize backup and recovery based on the faulty assumption that today’s virtualized hardware and software are more reliable or flexible, but this is a mistake. In the last month, we have seen two examples of why data recovery is critical, and both affected entities had large IT staffs and huge budgets. Without an effective protection strategy, massive data loss would have been unavoidable in both cases. Both organizations recovered the vast majority of their data, but each experienced an outage that was far longer and more damaging than anticipated.

The first situation occurred at the Virginia Information Technology Agency (VITA), the organization that provides IT for the state of Virginia. Jon Toigo has a detailed post on the event and the agency has a site dedicated to the outage. In summary, on August 25, 2010 a critical disk array experienced a problem which was compounded by human error and resulted in total data loss. VITA was forced to initiate a complete recovery from physical tape, and a week later, the system was not fully online. Eventually, VITA successfully recovered about 97% of their data and the remaining 3% may be gone forever. The combination of data loss and an extended outage reflected poorly on VITA, their subcontractors and suppliers, and resulted in the Governor of Virginia calling for a third-party investigation.

The second situation impacted Chase Bank’s online banking systems and occurred on Monday, September 13. During the 48-hour outage, Chase’s online banking systems were inaccessible, resulting in extreme customer frustration. The Register has an article on the subject. To summarize, the database software used for the online banking system experienced an error that resulted in unrecoverable data corruption. The same corruption was replicated to the hot backup, making it unusable as well, so Chase was forced to recover from offline media. They restored the full backup from Saturday and rolled forward the transaction logs. The result was a complete recovery, albeit with longer than desired downtime.

Both of these examples serve as important reminders that data loss can occur at any time. The sources of these outages can include hardware, software and human error, and are often a combination of these. When losses occur, the resumption of operations depends on the ability to rapidly recover data. Both VITA and Chase suffered from slower than desired recovery times and extended outages. Clearly, the longer the downtime, the greater the business impact, and in both cases the entities have had to deal with strong backlash from their constituencies.

Extended outages like these must be avoided. In order to assess restore risk, end users should closely analyze their recovery time objectives (RTO) and ensure that they choose technology that aligns with their requirements. A high performance disk-based backup solution like a SEPATON S2100-ES2 can assist in this area by providing recovery speeds of up to 17.3 TB/hr and the reliability of a highly redundant architecture. Users must also remember that recovery tests are critical and should perform complete restore tests periodically. The goal is to ensure that RTO requirements are met and that the appropriate critical data is being protected. The combination of these two strategies will not prevent outages, but it will ensure that a company is prepared when one occurs.
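To make the RTO analysis concrete, here is a rough back-of-the-envelope sketch in Python. It simply divides the amount of data by a sustained restore rate and compares the result to the RTO. The 100 TB data set, the 8-hour RTO, and the overhead allowance are hypothetical numbers chosen for illustration, and the function names are my own. Note that 17.3 TB/hr is an "up to" figure, so a real environment may see a lower sustained rate.

```python
def restore_hours(data_tb: float, throughput_tb_per_hr: float) -> float:
    """Estimate hours needed to restore data_tb at a sustained throughput."""
    if throughput_tb_per_hr <= 0:
        raise ValueError("throughput must be positive")
    return data_tb / throughput_tb_per_hr


def meets_rto(data_tb: float, throughput_tb_per_hr: float,
              rto_hours: float, overhead_hours: float = 0.0) -> bool:
    """Check whether the estimated restore fits inside the RTO.

    overhead_hours is a hypothetical allowance for everything that is not
    raw data movement (diagnosis, log replay, application validation).
    """
    total = restore_hours(data_tb, throughput_tb_per_hr) + overhead_hours
    return total <= rto_hours


if __name__ == "__main__":
    data_tb = 100.0    # hypothetical size of the protected data set
    rto_hours = 8.0    # hypothetical recovery time objective
    peak_rate = 17.3   # TB/hr, the "up to" figure quoted above

    hours = restore_hours(data_tb, peak_rate)
    print(f"Estimated restore time: {hours:.1f} hours")  # about 5.8 hours
    print("Meets RTO with no overhead:", meets_rto(data_tb, peak_rate, rto_hours))
    print("Meets RTO with 3h overhead:",
          meets_rto(data_tb, peak_rate, rto_hours, overhead_hours=3.0))
```

Even at the peak rate, a few hours of non-transfer work can push this hypothetical restore past its RTO, which is exactly the kind of gap a periodic full restore test is meant to expose.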
