NetApp Dedupe: The Worst of Inline and Post-process Deduplication

NetApp has finally entered the world of deduplication for data protection. While they have supported a flavor of the technology in their filers since May 2007, they never launched it for their VTL. Why? Because their VTL does not use any of the core filer IP; it relies on an entirely separate software architecture acquired from Alacritus. Thus, none of ONTAP's features apply to their VTL. However, I digress from the topic at hand.

I posted recently about three different approaches to deduplication timing: inline, post-process, and concurrent process. I talked about the benefits of each and highlighted that post-process and concurrent process deliver the fastest backup performance, since deduplication occurs outside the primary data path, while inline delivers the smallest possible disk footprint, since undeduplicated data is never written to disk. Now comes NetApp with a whole new take. Their model combines the worst of post-process and inline by requiring a disk holding area *and* reducing backup performance. After all this time developing the product, this is what they come up with? Hmmm, maybe they should stick to filers.
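To make the timing distinction concrete, here is a hypothetical Python sketch of the two basic models. The function names, the fixed-size chunking, and the dictionary "store" are all illustrative assumptions, not anyone's actual implementation; the point is only where the hashing happens relative to the write.

```python
import hashlib

def chunks(data, size):
    """Split a byte stream into fixed-size chunks (a toy stand-in for real chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def inline_dedupe(stream, store, size=4):
    # Inline: hash in the data path; duplicate chunks never reach disk,
    # so no holding area is needed, at the cost of hashing during ingest.
    for chunk in chunks(stream, size):
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store

def post_process_dedupe(stream, staging, store, size=4):
    # Post-process: land everything in a staging area first (fast raw ingest),
    # then hash and collapse duplicates in a later pass.
    staging.extend(chunks(stream, size))
    while staging:
        chunk = staging.pop()
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store
```

Both paths end up with the same set of unique chunks; the trade-off is that post-process briefly holds the full, undeduplicated stream in `staging`, while inline pays the hashing cost during the backup itself.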

Below is a screenshot from NetApp's website where they specify performance with deduplication on and off.



According to The Register, they are post-processing (albeit with some deduplication work done upfront). The article indicates that they generate "rolling hashes" in real time while still storing the data in its entirety for later deduplication.

This is the worst possible combination. They are doing the CPU-intensive part of deduplication inline, reducing ingest performance, yet gaining no immediate disk space savings. Additionally, since this is post-process, the actual removal of redundant data takes place later, requires substantial I/O resources, and will likely slow concurrent backup and restore operations. NetApp does not disclose the performance impact of the deduplication pass itself, but this is a single-node system, so performance is not likely to exceed 400 MB/sec.
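For readers unfamiliar with the term, a rolling hash is a windowed hash that can be updated in O(1) per byte as the window slides, which is why it can be computed in the data path. The sketch below is a generic Rabin-Karp-style polynomial hash; the base, modulus, and window size are arbitrary illustrative choices, and nothing here reflects NetApp's actual fingerprinting scheme. Note that the function only *computes* fingerprints; the full data stream would still be written to disk, which is exactly the objection above.

```python
def rolling_hashes(data, win=4):
    """Yield a polynomial hash for every win-byte window in O(1) work per byte."""
    BASE, MOD = 257, (1 << 31) - 1          # illustrative constants
    top = pow(BASE, win - 1, MOD)           # weight of the outgoing byte
    h, out = 0, []
    for i, b in enumerate(data):
        if i >= win:
            h = (h - data[i - win] * top) % MOD   # drop the byte leaving the window
        h = (h * BASE + b) % MOD                  # admit the incoming byte
        if i >= win - 1:
            out.append(h)                         # one fingerprint per full window
    return out
```

Identical windows produce identical fingerprints, so a later pass can use these values to find candidate duplicates without rereading and rehashing everything, but the redundant bytes have already hit disk by then.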

One final point: NetApp continues to use RAID 5 on this platform. This is dangerous because their deduplication engine replaces redundant data with pointers that can reference blocks anywhere on the disk system. A double disk fault on any shelf would likely result in catastrophic data loss. Most other deduplication vendors, including SEPATON, utilize RAID 6 to further protect data. EMC blogger Scott Waterhouse has posted on this topic on numerous occasions, and I defer to his analysis of the issue. Fundamentally, this is a major flaw in NetApp's VTLs, and it is ironic that NetApp's own whitepapers recommend against using RAID 5 on their filers, yet they offer no alternative on the VTL.
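The reason a double disk fault is so much worse on a deduplicated store is the fan-in: once redundant data is replaced with pointers, a single shared chunk may be referenced by many backup images. This toy illustration (all names and data are made up) shows how losing one chunk corrupts every backup that points to it:

```python
# Hypothetical deduplicated store: each backup is just a list of pointers
# to shared chunks held once in the chunk store.
store = {"c1": b"base OS image", "c2": b"monday delta", "c3": b"tuesday delta"}
backups = {
    "mon": ["c1", "c2"],
    "tue": ["c1", "c3"],
    "wed": ["c1", "c2", "c3"],
}

def affected(lost_chunk):
    """Return every backup rendered unrestorable by the loss of one chunk."""
    return [name for name, ptrs in backups.items() if lost_chunk in ptrs]
```

Here a fault that destroys only chunk `c1` makes all three backups unrestorable, whereas on an undeduplicated store the same fault would damage only the tapes it physically touched. That amplification is why the extra parity drive of RAID 6 matters more here than on a filer.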
