Backup Deduplication Virtual Tape

War Stories: Diligent

As I have posted before, IBM/Diligent requires Fibre Channel drives due to the highly I/O intensive nature of their deduplication algorithm. I recently came across a situation that provides an interesting lesson and an important data point for anyone considering IBM/Diligent technology.

A customer was backing up about 25 TB nightly and was searching for a deduplication solution. Most vendors, including IBM/Diligent, initially specified systems in the 40 – 80 TB range using SATA disk drives.

Initial pricing from all vendors was around $500k. However as discussions continued and final performance and capacity metrics were defined, the IBM/Diligent configuration changed dramatically. The system went from 64TB to 400TB resulting in a price increase of over 2x and capacity increase of 6x. The added disk capacity was not due to increased storage requirements (none of the other vendors had changed their configs) but was due to performance requirements. In short, they could not deliver the required performance with 64TB of SATA disk and were forced to include more.

The key takeaway is that if considering IBM/Diligent you must be cognizant of disk configuration. The I/O intensive nature of ProtectTier means that it is highly sensitive to disk technology and so Fibre Channel drives are the standard requirement for Diligent solutions. End users should always request Fibre Channel disk systems for the best performance and SATA configurations must be scrutinized. Appliance-based solutions can help avoid this situation by providing known disk solutions and performance guarantees.


IBM Deduplication Appliances

I have been on hiatus as of late and apologize for my tardiness in blogging.

IBM released their new deduplication applications based on the technology they acquired from Diligent. At first glance, it might appear that this could be a competitive alternative to SEPATON, but when you look at it, it quickly becomes apparent that this is not the case.

IBM previously sold one product, TS7650G gateway which they now target at the enterprise. The new appliance products use similar server hardware and a de-featured version of the DS4700 disk array. As with all Diligent installations, the solution uses Fibre Channel drives that reduce density and add cost. They will never be price leaders. The configurations are as follows:

Capacity Nodes
7 TB One
18 TB One
36 TB One
36 TB Two

You can’t move beyond the configurations listed above. If you want to grow the system beyond 36 TB, you are out of luck. Your only choice is a forklift upgrade to the TS7650G gateway. What if you want dual nodes and less than 36TB? Same answer. How about replication? Same answer. (That is, if you can consider the array-based approach in the TS7650G a realistic replication option.)

The ultimate irony is that by creating appliance VTLs, IBM has actually made their customers’ lives more difficult. Customers now have to choose whether to purchase a gateway (which adds complexity and cost) or a simple bounded appliance (which has limited configurations). Why should a customer have to make this trade-off? Why not offer an appliance that is simple, cost-effective AND scalable? Well, the simple answer to the question is to get a SEPATON S2100-ES2!

Deduplication Virtual Tape

Customer perspectives on SEPATON, IBM and Data Domain

SEPATON issued a press release on Monday that is worth mentioning here on the blog. SearchStorage also published a related article here. The release highlights MultiCare a SEPATON customer that uses DeltaStor deduplication software in a two-node VTL.

In the release, the customer characterizes their testing of solutions from Diligent/IBM (now IBM TS7650G) and Data Domain. Specifically, they mention that the TS7650G was difficult to configure and get running and that the gateway head nature of the product also made it difficult for them to scale capacity. These difficulties illustrate the challenges of implementing the TS7650G’s head only design. With this solution, the burden of integrating and managing the deduplication software and disk subsystem falls on the end user. Contrast this with a SEPATON appliance that manages the entire device in a fully integrated, completely automated fashion.

They had a typical Data Domain experience. That is, their initial purchase looked simple and cost effective but rapidly become complex and costly. In this case, MultiCare hit the Data Domain scalability wall, requiring them to purchase multiple separate units. The result is that MultiCare had to perform two costly upgrades and had to rip and replace their Data Domain solutions with newer, faster units. Scalability is the challenge with Data Domain solutions and it is not uncommon for customers to purchase one unit to meet their initial needs and then be forced to add additional units or perform a forklift upgrade.

As MultiCare found, customers must thoroughly understand their requirements when considering deduplication solutions. They tested the head-only approach and found it to be too complex to operate and manage to meet their needs. They tried the small appliance approach and found that they outgrew their initial system and were forced to pursue costly upgrades. In the end, they recognized that the best solution for their environment was a highly scalable S2100-ES2 solution which provided the performance and scalability that could not be achieved with either the TS7650G or Data Domain.


TS7650G and Fibre Channel Drives

IBM/Diligent TS7650G uses a pattern matching approach to deduplication, which is different from the hash-based solutions used by many vendors or the ContentAwareTM approach pioneered by SEPATON.

Diligent’s technology requires Fibre Channel (FC) drives for the best performance because pattern matching is highly I/O intensive and needs the additional I/O from FC drives. FC drives in turn, negatively affect disk density, require more power and dramatically increase the price of the system.

The pattern matching technology used in the TS7650G is an inline process. Therefore, all duplicate data has to be identified before data is committed to disk. Pattern matching only provides an approximate match on redundant data and requires a byte-level compare to verify the redundancy. All byte-level compares must be completed before any data is written to disk and the next piece of data accepted. FC drives are required because they provide the random I/O performance needed to handle inline byte-level comparisons. Diligent specified a 110 disk FC array for the ESG performance whitepaper that they sponsored back in July of 2006. (Local copy of the ESG whitepaper.)  This is not to say that the algorithm will not work with SATA, but these drives will dramatically reduce performance.

If you are considering the TS7650G, you must carefully consider the associated disk sub-system. It is not clear what disk system and capacity was used when IBM/Diligent generated their performance specifications. As part of the evaluation you should also test single stream and aggregate backup performance because as previously mentioned single stream performance may be a challenge.

Deduplication Virtual Tape

Falconstor, SIR and OEMs

This article on highlights enhancements to FalconStor’s SIR deduplication platform, but I have to wonder whether anyone cares. FalconStor was a big player in providing VTL software to OEMs; but their deduplication software has been largely ignored.

FalconStor had their heyday in VTL. They aggressively pursued OEM deals with large vendors including EMC, IBM, and Sun. EMC was the most successful with their EDL family of products. As the market moved to deduplication, you would think that FalconStor would be the default OEM supplier of deduplication software as well. You would be wrong.

Ironically, FalconStor’s VTL success was their downfall in deduplication. Their OEMs realized that they were all selling the same VTL software and did not want to repeat the situation with deduplication.  EMC and IBM, have already announced that they are using alternative deduplication providers.

Backup Deduplication

IBM Storage Announcement

As previously posted, I was confused about the muted launch of IBM’s XIV disk platform. Well, the formal launch finally occurred at IBM Storage Symposium in Montpelier, France. Congratulations to IBM, although I am still left scratching my head why they informally announced the product a month ago!

Another part of the announcement was the TS7650G which is Diligent’s software running on an IBM server. Surprisingly, there is not much new; it appears that they are banking on the IBM brand and salesforce to jumpstart Diligent’s sales. Judging by the lack of success in selling the TS75xx series, it will be interesting to see whether they will have any more success with this platform.

From a VTL perspective, IBM has backed themselves into a box. Like EMC, they have a historic relationship with FalconStor and have chosen a different supplier for deduplication. This creates an interesting dichotomy. Let’s look at the specs of their existing FalconStor-based VTL and newly announced technology.