
6 Reasons not to Deduplicate Data

Deduplication is a hot buzzword these days. I previously posted about how important it is to understand your business problems before evaluating data protection solutions. Here are six reasons why you might not want to deduplicate data.

1. Your data is highly regulated and/or frequently subpoenaed
The challenge with these types of data is whether deduplicated data meets compliance requirements. Jon Toigo over at Drunken Data has numerous posts on this topic, including feedback from a corporate compliance user group. In short, the answer is that companies need to carefully review deduplication in the context of their regulatory requirements. The issue is not one of actual data loss, but the risk of someone challenging the validity of subpoenaed data that was stored on deduplicated disk. The defendant would then face the added burden of proving the validity of the deduplication algorithm. (Many large financial institutions have decided that they will never deduplicate certain data for this reason.)

2. You are deduplicating at the client level
Products like PureDisk from Symantec, Televaulting from Asigra, and Avamar from EMC deduplicate data at the client level. With these solutions, the client bears the burden of deduplication and only transfers deduplicated (i.e., net new) data across the LAN. The master server maintains a disk repository containing only deduplicated data, so trying to deduplicate the already-deduplicated repository again will not yield additional storage savings.
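To make the "net new" idea concrete, here is a minimal sketch of client-side deduplication. This is purely illustrative (fixed-size chunks and a simple in-memory hash set), not any vendor's actual implementation, which would use smarter chunking and a persistent index:

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # fixed-size chunks, for simplicity

def dedupe_transfer(data: bytes, known_hashes: set) -> list:
    """Split data into chunks and return only the chunks whose hash
    the repository has not seen before -- the 'net new' data that
    would actually cross the LAN."""
    new_chunks = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            new_chunks.append(chunk)
    return new_chunks

# First backup: every chunk is new, so everything is transferred.
repo = set()
payload = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
first = dedupe_transfer(payload, repo)    # 2 chunks sent

# Second backup of identical data: nothing crosses the LAN.
second = dedupe_transfer(payload, repo)   # 0 chunks sent
```

Since the repository already stores exactly one copy of each unique chunk, running another deduplication pass over it finds no duplicates to eliminate, which is the point above.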


DeltaStor Deduplication, cont….

Scott from EMC, author of the backup blog, responded to my previous post on DeltaStor. First, thank you for welcoming me to the world of blogdom. This blog is brand new, and it is always interesting to engage in educated debate.

I do not want this to go down a “mine is better than yours” route. That just becomes annoying and can lead to a fight that benefits no one. I am particularly concerned since Scott, judging by his picture on EMC’s site, looks much tougher than me! 🙂

The discussion really came down to a few points. For the sake of simplicity I will quote him directly.

So, putting hyperbole aside, the support situation (and just as importantly, the mandate to test every one of those configurations) is a pretty heavy burden.

DeltaStor takes a different approach to deduplication than hash-based solutions like EMC/Quantum and Data Domain. It requires SEPATON to do some additional testing for different applications and modules. The real question under debate is how much additional work. In his first post, Scott characterized this as being entirely unmanageable. (My words, not his.) I continue to disagree with this assessment. Like most things, the Pareto Principle (otherwise known as the 80-20 rule) applies here. Will we support every possible combination of every application? Maybe not. Will we support the applications and environments that our customers and prospects use? Absolutely.


Deduplication, do I really need it?

I’m always puzzled when a customer tells me that they “need deduplication to run my backups better.” This drives me nuts. Deduplication in and of itself doesn’t make your backups run better; in fact, some technologies make backups and restores take much longer. These customers aren’t thinking about the root causes of their problems. They are like patients who see an ad for a prescription medication on TV and decide they need it. When the doctor asks why, the response is “because the ad sounded like something that will make me feel better.” Sounds ludicrous, doesn’t it? Well, that is no different from the dedupe statement above.

The simple reality is that, like the prescription drugs on TV, dedupe is not a silver bullet. It solves specific problems, such as extending retention on disk and reducing $/GB. If that is your problem, then by all means look at dedupe. But please understand the problem you are trying to solve. Trust me, you don’t want to be the guy taking the drugs just because they sound good on TV.


DeltaStor Deduplication

I track numerous sites focused on data protection and storage and am always interested in what other industry participants are discussing. The blogroll on this site includes links to some informative and interesting blogs. However, as with anything else on the web, everyone brings a view of the world that inevitably colors their perspective.

This brings me to my next point: I recently ran across a blog post where an EMC guy bad-mouths SEPATON’s DeltaStor technology, which is used in HP’s VLS platform. The obvious conclusion is “This is an EMC guy, so what do you expect?” and there is some truth to that. However, I think it deserves a brief response, and I wanted to address his major point. To summarize, he is commenting on the fact that DeltaStor requires an understanding of each supported backup application and data type, and that this makes testing difficult. Here is an excerpt:

Given that there are at least 5 or 6 common backup applications, each with at least 2 or 3 currently supported versions, and probably 10 or 15 applications, each with at least 2 or 3 currently supported versions, the number of combinations approaches a million pretty rapidly

This is a tremendous overstatement. HP’s (and SEPATON’s) implementation of DeltaStor is targeted at the enterprise. If you look at enterprise datacenters, there is actually a very small number of backup applications in use. You will typically see NetBackup, TSM, and much less frequently Legato; this narrows the scope of testing substantially.
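In fact, taking the excerpt’s own figures at face value (and these are only illustrative numbers, not actual support-matrix data), the combination count lands in the hundreds, nowhere near a million:

```python
# Back-of-the-envelope count using the figures quoted in the excerpt.
# Illustrative only -- not an actual product support matrix.
backup_apps = 6       # "at least 5 or 6 common backup applications"
app_versions = 3      # "at least 2 or 3 currently supported versions"
modules = 15          # "probably 10 or 15 applications"
module_versions = 3   # "at least 2 or 3 currently supported versions"

combinations = backup_apps * app_versions * modules * module_versions
print(combinations)   # 810 -- hundreds, not approaching a million

# Limiting qualification to the backup applications actually seen in
# enterprise datacenters (e.g., NetBackup, TSM, Legato) cuts it in half.
enterprise = 3 * app_versions * modules * module_versions
print(enterprise)     # 405
```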

This is not the case in small environments, where you see many more applications such as BackupExec, ARCserve, and others, which is why HP is selling a backup-application-agnostic product for those environments. (SEPATON is focused on the enterprise, and so our solution is not targeted here.)

He then talks about how there are many different versions of supported backup applications and many different application modules. This is true; however, it is misleading. The power of backup applications is their ability to maintain tape compatibility across versions. This feature means that tape and file formats change infrequently; in the case of NetBackup, the format has not changed substantially since version 4.5. The result is that qualification requirements for a new application version are typically minimal. (Clearly, if the tape format changes radically, the process is more involved.) The situation is similar with backup application modules.

The final question in his entry is:

do you want to buy into an architecture that severely limits what you can and cannot deduplicate? Or do you want an architecture that can deduplicate anything?

I would suggest an alternative question: do you want a generic deduplication solution that supports all applications but reduces your performance by roughly 90% (200 MB/sec vs. 2,200 MB/sec) and provides mediocre deduplication ratios, or do you want an enterprise-focused solution that provides the fastest performance, the most scalability, the most granular deduplication, and is optimized for your backup application?

You decide…

Oh, and as an afterthought: stay tuned to this blog if you are interested in more detail on the first part of my question above.


Why is this blog called About Restore?

You might be wondering why I chose the name About Restore for this blog. The primary objective of data protection is restore. Sure, you back up data every night, but you do this so you can restore the data. A recent situation reminded me of this.

I manage numerous programs, and one of my responsibilities is an internal portal. One morning, I found that the portal was inaccessible, and the primary culprit seemed to be database corruption. I had a manual backup, but it was old, so I called my IT department to ask about restoring the data. As it turned out, my server was older, had never held critical data, and so was not being backed up. (My data is important, but certainly the company won’t come screeching to a halt without it…) Uggghhh. The good news is that I later realized the problem was due to a minor misconfiguration and my data was intact.

I share the above story to illustrate a simple point: restore is what’s most important. Sure, you need to back up data every night, but you do it to enable restores. So as you go through your daily protection activities, remember: it is about restore!

BTW, have you conducted a restore test recently?


Welcome to the new blog!

This blog will focus on the industry where I work: storage and, more specifically, data protection. The comments here are my own. Comments and feedback are always welcome.