W. Curtis Preston recently blogged about The Rehydration Myth. In his post, he discusses how restore performance on deduplicated data declines because of the way fragmented chunks must be reassembled from disk. He also addresses the ways various technologies attempt to overcome this problem, including disk caching, forward referencing (used by SEPATON's DeltaStor technology), and built-in defragmentation. In this post, I want to discuss the last option because it is a widely used approach for inline deduplication that has some little-known pitfalls.
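To make the fragmentation problem concrete, here is a minimal sketch in Python of why rehydration slows down as backups accumulate. The chunk size, layout, and data below are illustrative assumptions, not any vendor's actual implementation: the point is that a restore follows a recipe of chunk references, and once those references mix old and newly written chunks, sequential reads turn into seeks.

```python
# Minimal sketch of deduplicated restore ("rehydration"); chunk size and
# layout are illustrative assumptions, not any vendor's implementation.
import hashlib
import os

CHUNK_SIZE = 8 * 1024          # assumed 8 KB chunks

chunk_store = {}               # fingerprint -> (disk_offset, chunk bytes)
next_offset = 0

def ingest(data: bytes) -> list:
    """Deduplicate data into the store; return the file's chunk recipe."""
    global next_offset
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:            # new chunk: append at the tail
            chunk_store[fp] = (next_offset, chunk)
            next_offset += len(chunk)
        recipe.append(fp)                    # duplicates reference old offsets
    return recipe

def restore_seeks(recipe) -> int:
    """Count non-sequential reads needed to rehydrate the file."""
    seeks, expected = 0, None
    for fp in recipe:
        offset, chunk = chunk_store[fp]
        if offset != expected:               # chunk is not at the disk head
            seeks += 1
        expected = offset + len(chunk)
    return seeks

day1 = os.urandom(CHUNK_SIZE * 32)           # first backup: all chunks new
r1 = ingest(day1)
# Second backup: mostly unchanged, but 4 chunks differ. The new chunks land
# at the store's tail, so the recipe jumps between old and new regions.
day2 = day1[:CHUNK_SIZE * 10] + os.urandom(CHUNK_SIZE * 4) + day1[CHUNK_SIZE * 14:]
r2 = ingest(day2)
print(restore_seeks(r1), restore_seeks(r2))  # 1 vs. 3; grows with every backup
```

With one small change the seek count already triples; after weeks of backups, the recipe for the newest virtual tape points all over the disk, which is exactly the behavior that defrag, caching, and forward referencing try to mitigate.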
I recently blogged my thoughts about EMC acquiring Data Domain and wanted to follow up with a post discussing some key points about a NetApp/Data Domain merger. Since that post, there have been numerous developments: EMC suggesting that it might raise its offer, the inevitable threat of a class action lawsuit, Data Domain endorsing NetApp's second offer, and the government initiating an antitrust review. In this context, I want to dissect some key points to consider regarding this acquisition.
NetApp is backed into a corner
Reuters indicates that EMC will raise its bid for Data Domain to as much as $35 per share. As previously posted, Data Domain's products will fit easily into EMC's product line, replacing EMC's current Quantum-based appliances. With this increased offer, EMC is raising the pressure on NetApp and reaffirming its commitment to acquiring Data Domain.
What does this mean?
I was surprised when NetApp offered $1.5B for Data Domain and was even more surprised when EMC countered with an all-cash offer of $1.8B. NetApp has since raised its offer to $1.9B in cash and stock. It is in the context of this uncertainty that I want to comment on a possible EMC/Data Domain acquisition.
What about EMC’s DL3D product line?
EMC sells target deduplication solutions (the DL3D product line) through a partnership with Quantum. These products compete directly with Data Domain's and rely on similar technology. (Data Domain disclosed in its own IPO documents that it had licensed Quantum's deduplication patents.) Even though EMC strengthened its commitment to Quantum with a $100 million loan back in March, the Data Domain bid raises serious questions about EMC's commitment to Quantum. If Quantum's technology were really good, why bid almost $2B for a competing technology, especially when EMC could buy Quantum outright for less than half that amount?
Some have suggested that EMC is bidding on Data Domain simply to hurt NetApp. This is certainly a possibility. However, EMC made a very strong counter-offer and has to recognize that it may own Data Domain in the end.
NetApp’s initial bid for Data Domain came as a surprise to many. EMC’s counter was even more of a shock. These discussions have very important implications for data protection and deduplication. Two thoughts immediately come to mind:
It’s hard to do deduplication well.
EMC and NetApp say that they have robust deduplication solutions in their DL3D (Quantum technology) and NearStore VTL series products. Before these negotiations, you might have believed them. Now, they are both bidding aggressively on Data Domain. What does that say about their confidence in their own solutions? Remember, these are large companies with hundreds, if not thousands, of engineers experienced in storage. Why wouldn't they just build their own deduplication technology? The simple answer is that developing really good, enterprise-class deduplication technology is difficult.
Software and Hardware Deduplication
CA recently announced the addition of deduplication to ARCserve. Every time an ISV releases deduplication technology, I get inundated with questions about hardware-based (i.e., appliance) versus software-based (i.e., software-only, where separate hardware is required) deduplication. In this post, I will discuss the difference between these two models when using target-based deduplication (i.e., deduplication happens at the media server or virtual tape appliance). Client-based deduplication (i.e., deduplication happens at the backup client) is another option offered by some vendors and will be covered in another post.
Most backup software ISVs offer target-based deduplication in one form or another. In some cases it is a separate application, like Symantec's PureDisk; in others it is a built-in feature, as with CommVault, ITSM, and the new ARCserve release. In all cases, it is packaged as a software option and does not include server or storage infrastructure. Contrast this with appliance-based solutions, like those from SEPATON, that include hardware and storage.
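Under either packaging, the target runs the same core loop: chunk the incoming backup stream, fingerprint each chunk, and physically store only chunks it has not seen before. Here is a minimal sketch of that loop and its space bookkeeping; the fixed 128 KB chunk size and in-memory index are simplifying assumptions (real products typically use variable-size chunking and persistent indexes):

```python
# Minimal sketch of target-side deduplication bookkeeping. Fixed-size
# chunking and an in-memory index are simplifying assumptions.
import hashlib
import os

CHUNK_SIZE = 128 * 1024        # assumed

class DedupTarget:
    def __init__(self):
        self.index = set()     # fingerprints of chunks already stored
        self.bytes_in = 0      # logical bytes received from backups
        self.bytes_stored = 0  # physical bytes actually written

    def write(self, stream: bytes) -> None:
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            self.bytes_in += len(chunk)
            fp = hashlib.sha256(chunk).digest()
            if fp not in self.index:   # unique chunk: store it
                self.index.add(fp)
                self.bytes_stored += len(chunk)

    def ratio(self) -> float:
        return self.bytes_in / max(self.bytes_stored, 1)

target = DedupTarget()
monday = os.urandom(CHUNK_SIZE * 100)                            # first full backup
target.write(monday)
tuesday = monday[:CHUNK_SIZE * 98] + os.urandom(CHUNK_SIZE * 2)  # ~2% changed
target.write(tuesday)
print(f"dedup ratio: {target.ratio():.2f}:1")                    # ~1.96:1 after two fulls
```

The software-versus-hardware question is then really about where this loop runs: on a general-purpose media server you size and tune yourself, or inside an appliance whose CPU, memory, and disk the vendor sized for it.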
War Stories: Diligent
As I have posted before, IBM/Diligent requires Fibre Channel drives due to the highly I/O-intensive nature of its deduplication algorithm. I recently came across a situation that provides an interesting lesson and an important data point for anyone considering IBM/Diligent technology.
A customer was backing up about 25 TB nightly and was searching for a deduplication solution. Most vendors, including IBM/Diligent, initially specified systems in the 40–80 TB range using SATA disk drives.
Initial pricing from all vendors was around $500K. However, as discussions continued and final performance and capacity metrics were defined, the IBM/Diligent configuration changed dramatically: the system grew from 64 TB to 400 TB, a capacity increase of over 6x that more than doubled the price. The added disk capacity was not driven by increased storage requirements (none of the other vendors changed their configurations) but by performance requirements. In short, IBM/Diligent could not deliver the required performance with 64 TB of SATA disk and was forced to include more spindles.
The key takeaway is that if you are considering IBM/Diligent, you must be cognizant of the disk configuration. The I/O-intensive nature of ProtecTIER makes it highly sensitive to disk technology, which is why Fibre Channel drives are the standard requirement for Diligent solutions. End users should always request Fibre Channel disk systems for the best performance, and SATA configurations must be scrutinized. Appliance-based solutions can help avoid this situation by providing known disk configurations and performance guarantees.
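A back-of-the-envelope model shows why the configuration ballooned. If every megabyte ingested costs some number of random disk I/Os for index and data lookups, then ingest throughput is capped by aggregate spindle IOPS, and the only levers are faster drives or more of them. Every number below is an assumption chosen purely for illustration, not a measured figure for ProtecTIER or any particular drive:

```python
# Back-of-the-envelope model of an I/O-bound deduplication target.
# Every constant is an illustrative assumption, not a measured figure.

IOPS_SATA = 80        # assumed random IOPS per 7.2k RPM SATA spindle
IOPS_FC = 180         # assumed random IOPS per 15k RPM FC spindle
LOOKUPS_PER_MB = 12   # assumed random disk I/Os per MB ingested

def max_ingest_mb_s(spindles: int, iops: int) -> float:
    """Ingest ceiling when every MB costs LOOKUPS_PER_MB random I/Os."""
    return spindles * iops / LOOKUPS_PER_MB

needed = 25e6 / (8 * 3600)   # 25 TB nightly in an 8-hour window ~= 870 MB/s
for label, spindles, iops in [("64 SATA spindles", 64, IOPS_SATA),
                              ("400 SATA spindles", 400, IOPS_SATA),
                              ("64 FC spindles", 64, IOPS_FC)]:
    rate = max_ingest_mb_s(spindles, iops)
    print(f"{label:18s}: {rate:6.0f} MB/s "
          f"({'meets' if rate >= needed else 'misses'} the window)")
```

Under these assumed numbers, hitting the window takes either roughly six times the SATA spindle count or a switch to faster drives, which mirrors the 64 TB-to-400 TB jump the customer saw.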
SEPATON Versus Data Domain
One of the questions I often get asked is "how do your products compare to Data Domain's?" In my opinion, we really don't compare because we play in different market segments. Data Domain's strength is in the low end of the market (think SMB/SME), while SEPATON plays in the enterprise segment. These two segments have very different needs, which are reflected in the fundamentally different architectures of the SEPATON and Data Domain products. Here are some of the key differences to consider.
W. Curtis Preston recently wrote an article on the state of physical tape for SearchDataBackup. He discusses the technologies that backup software vendors have created to stream tape drives more effectively. As I have posted before, if you cannot stream your tape drives, their performance declines dramatically.
In enterprise environments, performance is the key driver of data protection. You must ensure that you can back up and recover massive amounts of data in prescribed windows, and tape's inconsistent performance and complex manageability make it difficult to use as a primary backup target. The same issues can make tape a challenging solution in small environments.
The problem with tape drive streaming is a common one, and Preston agrees that it is the key reason for adopting disk-based backup technologies. Our customers typically see a dramatic improvement in performance with SEPATON's VTL solutions because they are no longer limited by the streaming requirements of tape.
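The mechanics behind that improvement are easy to model. A tape drive runs at full speed only while data arrives at least as fast as its native rate; when the feed falls short, the drive empties its buffer, stops, and repositions ("shoe-shining") before it can write again. A rough steady-state model, where the drive rate, buffer size, and reposition penalty are assumed round numbers rather than any specific drive's specification:

```python
# Rough model of tape "shoe-shining". All constants are assumed round
# numbers, not a specific drive's specification.

NATIVE = 120.0      # assumed native drive rate, MB/s
BUFFER = 32.0       # assumed drive buffer, MB
REPOSITION = 3.0    # assumed stop/rewind/restart penalty, seconds

def effective_rate(feed: float) -> float:
    """Steady-state throughput (MB/s) when data arrives at `feed` MB/s."""
    if feed >= NATIVE:
        return NATIVE   # streaming: the drive never has to stop
    # Underrun cycle: the buffer drains at (NATIVE - feed) MB/s, then the
    # drive repositions for REPOSITION seconds while the buffer refills.
    return NATIVE * BUFFER / (BUFFER + REPOSITION * (NATIVE - feed))

for feed in (150, 120, 90, 60, 30):
    print(f"feed {feed:3d} MB/s -> effective {effective_rate(feed):5.1f} MB/s")
```

Under these assumptions, a backup that can feed only 90 MB/s achieves roughly 31 MB/s of effective throughput, far below even the feed rate itself. That nonlinearity is why slightly-too-slow backups collapse tape performance, and why removing the streaming requirement helps so much.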
Even with new disk and deduplication technologies, most customers are still using tape today and will continue to do so. However, tape will likely be used more for archiving than for secondary backup storage. Deduplication enables longer retention, but most customers are unlikely to retain more than a year of backups online. Tape is a good medium for deep archives, where data is stored for years, but it is complex and costly as a target for enterprise backup.
Recent Comment
Recently, an end user commented that the replication performance on his DL3D 1500 was less than expected. As he retained more data online, his replication speed decreased substantially, and EMC support responded that this is normal behavior. This is a major challenge because slow replication lengthens replication windows and can make DR goals unachievable.
The key takeaway from the comment is that testing is vital. When considering any deduplication solution, you must test it thoroughly with both limited and extended retention. In this case, the degradation appeared only as data was retained and would not have been caught by a test with limited retention. The key elements you should test include (see the harness sketch after this list):
- Backup performance
  - On the first backup
  - With retention
- Restore performance
  - On the first backup
  - With retention
- Replication performance
  - On the first backup
  - With retention
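One way to organize that testing is as a simple matrix of operation by retention point, timing every cell so that degradation with retention shows up as a trend rather than a surprise. In the skeleton below, the run_* hooks and age_system_to are hypothetical placeholders you would wire to your own backup environment; only the timing scaffold is real:

```python
# Skeleton harness for the test matrix above. The run_* hooks and
# age_system_to are hypothetical placeholders for your own environment.
import time

def run_backup(dataset):      return 1024.0   # hypothetical stub: MB moved
def run_restore(dataset):     return 1024.0   # hypothetical stub: MB moved
def run_replication(dataset): return 1024.0   # hypothetical stub: MB moved

def age_system_to(days):      pass            # hypothetical: load `days` of backups

OPERATIONS = {"backup": run_backup,
              "restore": run_restore,
              "replication": run_replication}

def test_matrix(dataset, retention_days=(0, 30, 90, 180)):
    """Time each operation at each retention point and report MB/s."""
    results = {}
    for days in retention_days:        # 0 = first backup, rest = with retention
        age_system_to(days)
        for name, op in OPERATIONS.items():
            start = time.monotonic()
            mb_moved = op(dataset)
            elapsed = max(time.monotonic() - start, 1e-9)
            results[(name, days)] = mb_moved / elapsed
            print(f"{name:12s} @ {days:3d} days retained: "
                  f"{results[(name, days)]:10.1f} MB/s")
    return results

results = test_matrix(dataset="nightly-25TB")  # hypothetical dataset label
```

A replication problem like the DL3D comment above would appear here as the replication row trending downward across retention points, which is exactly the signature a short proof-of-concept misses.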