Deduplication Restore

Defragmentation, rehydration and deduplication

W. Curtis Preston recently blogged about The Rehydration Myth. In his post he discusses how restore performance on deduplicated data declines because of the way the fragmented deduplicated data on disk has to be reassembled. He also addresses the ways various technologies attempt to overcome this problem, including disk caching, forward referencing (used by SEPATON’s DeltaStor technology) and built-in defrag. In this post I want to discuss the last option because it is a widely used approach for inline deduplication that has some little-known pitfalls.
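To make the fragmentation problem concrete, here is a minimal sketch in Python. It is a toy model, not any vendor's actual implementation: the function names, the chunk labels and the idea of counting "contiguous runs" as a stand-in for disk seeks are all my own assumptions for illustration. It compares a layout where duplicates point back at older data with one where the newest backup is kept contiguous and older backups point forward into it.

```python
def count_runs(positions):
    """Count contiguous runs of disk positions read in order (a rough proxy for seeks)."""
    runs = 0
    prev = None
    for pos in positions:
        if prev is None or pos != prev + 1:
            runs += 1
        prev = pos
    return runs


def backward_reference(backups):
    """Hypothetical layout: only new chunks are appended; duplicates point at the older copy."""
    store = {}  # chunk label -> disk position
    layouts = []
    for backup in backups:
        for chunk in backup:
            if chunk not in store:
                store[chunk] = len(store)
        layouts.append([store[chunk] for chunk in backup])
    return layouts


def forward_reference(backups):
    """Hypothetical layout: the newest backup is contiguous; older backups point into newer data.

    Simplification: this is modeled by laying the backups out newest-first.
    """
    layouts = backward_reference(list(reversed(backups)))
    return list(reversed(layouts))


# Three nightly backups of the same six-chunk data set, with a few chunks changing each night.
backups = [
    ["a", "b", "c", "d", "e", "f"],
    ["a", "x", "c", "d", "y", "f"],
    ["a", "x", "z", "d", "y", "w"],
]

backward = backward_reference(backups)
forward = forward_reference(backups)

print("latest backup, backward referencing:", count_runs(backward[-1]), "runs")
print("latest backup, forward referencing: ", count_runs(forward[-1]), "runs")
print("oldest backup, forward referencing: ", count_runs(forward[0]), "runs")
```

In this toy data set the latest backup is scattered across six separate runs with backward referencing and sits in a single run with forward referencing; the cost is pushed onto the oldest backup, which is the one you are least likely to restore.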


Data Domain Announcement

Data Domain recently announced that their new OS release dramatically improved appliance performance. On the surface, the announcement seems compelling, but upon further review, it creates a number of questions.

Performance Improvement
Deduplication software such as Data Domain’s is complex and can contain hundreds of thousands of interrelated lines of code. As products mature, companies fine-tune and improve their code for greater efficiency and performance. You would expect these changes to yield performance improvements of about 20-30%. Clearly, if an application is very inefficiently coded, you will see greater gains. However, larger improvements like those quoted in the release are usually achieved only with major product architecture updates, and they coincide with a major new software release.

In this case, I am not suggesting that Data Domain’s software is bad, but rather that the stated performance improvement is suspect. They positioned this as a dot release, so it is not a major product re-architecture. Additionally, if it were a major architecture update, they would have highlighted it in the release.

To summarize, the stated performance gains in the release are too large to attribute to a simple code tweak and I believe that the gains are only attainable in very specific circumstances. Data Domain appears to have optimized their appliances for Symantec’s OST and is trumpeting their performance gains. However, OST represents only a small fraction of Data Domain’s customer base and it seems that customers using non-Symantec backup apps will see uncertain performance improvements. Read on to learn more.

Deduplication Virtual Tape

NetApp Dedupe: The Worst of Inline and Post-process Deduplication

NetApp has finally entered the world of deduplication in data protection. While they have supported a flavor of the technology in their filers since May 2007, they had never launched it for their VTL. Why? Because their VTL does not use any of the core filer IP. It relies on an entirely separate software architecture that they acquired from Alacritus. Thus none of the features of ONTAP apply to their VTL. However, I digress from the topic at hand.

I posted recently about three different approaches to deduplication timing: inline, post-process and concurrent process. I talked about the benefits of each. Post-process and concurrent process offer the fastest backup performance, since deduplication occurs outside of the primary data path. Inline uses the smallest possible amount of disk space, since undeduplicated data is never written to disk. Now comes NetApp with a whole new take. Their model combines the worst of post-process and inline by requiring a disk holding area and reducing backup performance. After all this time developing the product, this is what they come up with? Hmmm, maybe they should stick to filers.
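The disk space side of that trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a description of any product: the function name, the 10:1 ratio and the assumption that a holding area keeps the full raw backup until deduplication finishes are all mine, chosen only for illustration.

```python
def peak_disk_gb(raw_gb, dedupe_ratio, staged):
    """Peak disk needed for one backup under a toy model.

    Assumption: a staged (holding area) design keeps the full raw backup on disk
    until its deduplicated copy has been written; an inline design never writes it.
    """
    deduped_gb = raw_gb / dedupe_ratio
    if staged:
        return raw_gb + deduped_gb
    return deduped_gb


# Hypothetical 1,000 GB backup at an assumed 10:1 deduplication ratio.
inline_gb = peak_disk_gb(1000, 10, staged=False)
staged_gb = peak_disk_gb(1000, 10, staged=True)

print("inline peak disk (GB):      ", inline_gb)
print("holding-area peak disk (GB):", staged_gb)
```

Under those assumptions a holding area only makes sense if it buys faster backups, which is why a design that needs the holding area and also slows the backup gets the worst of both.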

Deduplication Marketing

Tradeshow perspectives

I spent last week at a tradeshow in New York. These events are interesting because of the various end user perspectives. Those of us in the industry often get embroiled in the minutiae of products and features, and so it is very useful to understand the views of the end users on the show floor. Storage Decisions is a show that prides itself on highly qualified attendees.

One of the most curious things about the show was attendees’ obsession with inline vs. post-process deduplication. Numerous end users stopped by asking only about when DeltaStor deduplicates data. In the rush of the show, there was little time to discuss the question in much detail. It struck me as odd that these attendees focused on this question, which in my opinion is the wrong question to ask. I can only surmise that they had gotten an earful from competing vendors who swore that inline is the best approach.