Software and Hardware Deduplication

CA recently announced the addition of deduplication to ARCserve. Every time an ISV releases deduplication technology, I get inundated with questions about hardware (i.e., appliance-based) vs. software (i.e., software-only, where separate hardware is required) deduplication. In this post, I will discuss the difference between these two models when using target-based deduplication (i.e., deduplication happens at the media server or virtual tape appliance). Client-based deduplication (i.e., deduplication happens at the client) is another option offered by some vendors and will be covered in another post.

Most backup software ISVs offer target-based deduplication in one form or another. In some cases it is a separate application, like PureDisk from Symantec; in others it is a plugin, as with CommVault, ITSM and the new ARCserve release. In all cases, it is packaged as a software option and does not include server or storage infrastructure. Contrast this with appliance-based solutions like those from SEPATON that include hardware and storage.

Backup Deduplication Virtual Tape

War Stories: Diligent

As I have posted before, IBM/Diligent requires Fibre Channel drives due to the highly I/O intensive nature of their deduplication algorithm. I recently came across a situation that provides an interesting lesson and an important data point for anyone considering IBM/Diligent technology.

A customer was backing up about 25 TB nightly and was searching for a deduplication solution. Most vendors, including IBM/Diligent, initially specified systems in the 40 – 80 TB range using SATA disk drives.

Initial pricing from all vendors was around $500k. However, as discussions continued and final performance and capacity metrics were defined, the IBM/Diligent configuration changed dramatically: the system grew from 64 TB to 400 TB, a price increase of over 2x and a capacity increase of over 6x. The added disk capacity was not due to increased storage requirements (none of the other vendors changed their configurations) but to performance requirements. In short, IBM/Diligent could not deliver the required performance with 64 TB of SATA disk and was forced to include more.

The key takeaway is that anyone considering IBM/Diligent must be cognizant of disk configuration. The I/O-intensive nature of ProtecTIER makes it highly sensitive to disk technology, which is why Fibre Channel drives are the standard requirement for Diligent solutions. End users should always request Fibre Channel disk systems for the best performance, and SATA configurations must be scrutinized. Appliance-based solutions avoid this situation by providing known disk configurations and performance guarantees.

Backup Deduplication Restore

SEPATON Versus Data Domain

One of the questions I often get asked is “how do your products compare to Data Domain’s?” In my opinion, we really don’t compete directly because we play in different market segments. Data Domain’s strength is the low end of the market (think SMB/SME), while SEPATON plays in the enterprise segment. These two segments have very different needs, which are reflected in the fundamentally different architectures of the SEPATON and Data Domain products. Here are some of the key differences to consider.


TSM Deduplication

IBM recently announced the addition of deduplication technology to their Tivoli Storage Manager (ITSM) backup application. ITSM is a powerful application that uses a progressive incremental approach to data protection that is completely different from most other backup applications. The addition of deduplication to ITSM provides a benefit in disk space utilization, but also creates some new challenges.

The first challenge for many TSM environments is that administrators are already over-burdened with having to manage numerous discrete processes to ensure that backup operations are meeting their business requirements. The deduplication functionality within ITSM adds another process to an already complex backup environment. In addition to scheduling and managing processes such as reclamation, migration and expiration as part of daily operations, administrators now have to manage deduplication as well. This management may involve activities as disparate as capacity planning, fine-tuning, and system optimization. The alternative is to use a VTL-based deduplication solution like a SEPATON® S2100®-ES2 VTL with DeltaStor® software, which will provide deduplication benefits without having to create and manage a new process.

Deduplication Virtual Tape

Customer perspectives on SEPATON, IBM and Data Domain

SEPATON issued a press release on Monday that is worth mentioning here on the blog. SearchStorage also published a related article here. The release highlights MultiCare, a SEPATON customer that uses DeltaStor deduplication software in a two-node VTL.

In the release, the customer characterizes their testing of solutions from Diligent/IBM (now the IBM TS7650G) and Data Domain. Specifically, they mention that the TS7650G was difficult to configure and get running, and that the gateway nature of the product made it difficult for them to scale capacity. These difficulties illustrate the challenges of implementing the TS7650G’s head-only design: the burden of integrating and managing the deduplication software and disk subsystem falls on the end user. Contrast this with a SEPATON appliance that manages the entire device in a fully integrated, completely automated fashion.

They had a typical Data Domain experience: the initial purchase looked simple and cost-effective but rapidly became complex and costly. In this case, MultiCare hit the Data Domain scalability wall, requiring them to purchase multiple separate units. The result is that MultiCare had to perform two costly upgrades, ripping and replacing their Data Domain systems with newer, faster units. Scalability is the challenge with Data Domain solutions, and it is not uncommon for customers to purchase one unit to meet their initial needs and then be forced to add units or perform a forklift upgrade.

As MultiCare found, customers must thoroughly understand their requirements when considering deduplication solutions. They tested the head-only approach and found it to be too complex to operate and manage to meet their needs. They tried the small appliance approach and found that they outgrew their initial system and were forced to pursue costly upgrades. In the end, they recognized that the best solution for their environment was a highly scalable S2100-ES2 solution which provided the performance and scalability that could not be achieved with either the TS7650G or Data Domain.

Deduplication General Marketing

Surviving A Down Economy – A Vendor Perspective

The outlook on the economy continues to be less than stellar. The National Bureau of Economic Research formally declared that we are in a recession. Thanks, guys, for stating the obvious! Tough times create difficulties for everyone. We have already seen vendors including NetApp, Quantum and Copan announce cutbacks, and Sequoia Capital added to the bleak forecast with their gloomy outlook slide deck. The big question is what this means for technology vendors.

In these difficult times, companies must focus on their bottom line. Every technology purchase will be scrutinized and the payback must be clearly quantified. As I posted previously, ROI is vital.

The good news for data protection companies is that data volumes do not go down in a recession and retention times do not shorten. The current difficulties in the financial sector suggest that we may see even stricter regulations and longer retention periods. Deduplication-enabled solutions can still thrive in this environment because they provide compelling value: they reduce backup administration time and cost while dramatically lowering acquisition cost. However, remember that not all systems are alike, and you must consider future performance and capacity requirements; adding multiple independent systems will negatively impact ROI. The result is that scalable deduplication solutions like those sold by SEPATON can provide strong ROI and thus weather the storm of a tough economy better than technologies with weaker value propositions.

Recently, an independent market research firm that tracks purchasing trends at companies of all sizes told us that their research indicates companies over-purchased primary storage in the first half of 2008 and that the outlook for this sector is gloomy. In contrast, deduplication technology was the one bright spot. So far, our experience suggests that their analysis is accurate.

A difficult economy is a test of everyone’s staying power. Companies are scrutinizing every purchase and focusing only on those technologies that provide truly compelling value. Deduplication-enabled solutions are fortunate because of the value they bring. This is not to say that these technologies are immune, but rather that they will fare better than most.

Backup Deduplication Restore

Deduplication: It’s About Performance

I have recently been thinking about the real benefits of deduplication. Although the technology is nominally all about capacity, when you analyze the costs and benefits in the real world, the thing that jumps out at you is performance.

Performance is the key driver in sizing and in assessing the number of units required, which means it also drives cost. Deduplication enables longer retention but usually reduces backup and restore performance. For example, a 40 TB system can hold 800 TB of data assuming a 20:1 deduplication ratio. That is a large number, but it soon becomes clear that the system’s usable capacity is limited by backup speed. The graph below shows the relationship between data protected and backup window, assuming a throughput of 400 MB/sec.

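To make the arithmetic concrete, here is a back-of-the-envelope sketch of the same relationship. This is my own illustration of the numbers above (40 TB physical, 20:1 ratio, 400 MB/sec), not a vendor sizing tool:

```python
# Back-of-the-envelope sizing sketch for a deduplicated backup target.

def logical_capacity_tb(physical_tb, dedupe_ratio):
    """Logical data a deduplicated system can retain."""
    return physical_tb * dedupe_ratio

def backup_window_hours(data_tb, rate_mb_per_sec):
    """Hours needed to ingest data_tb terabytes at a given throughput."""
    seconds = (data_tb * 1_000_000) / rate_mb_per_sec  # 1 TB = 1,000,000 MB
    return seconds / 3600

print(logical_capacity_tb(40, 20))             # 800 TB logical, as above
print(round(backup_window_hours(10, 400), 1))  # ~6.9 hours to ingest 10 TB
```

The second function is the crux: at 400 MB/sec, every additional 10 TB of nightly backup adds roughly seven hours to the window, regardless of how much logical capacity the deduplication ratio promises.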


HIFN – Commoditizing hash-based deduplication?

HIFN recently announced a card that accelerates hash-based deduplication. For those unfamiliar with HIFN, they provide infrastructure components that accelerate CPU intensive processes such as compression, encryption and now deduplication. The products are primarily embedded inside appliances, and you may be using one of their products today.

The interesting thing about the HIFN card is that they are positioning it as an all-in-one hash deduplication solution. Here are the key processes that the device performs:

  1. Hash creation
  2. Hash database creation and management
  3. Hash lookups
  4. Write to disk
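The four steps above can be sketched in a few lines of code. This is my own simplified illustration of generic hash-based deduplication, not HIFN’s implementation; the chunk size and in-memory hash database are assumptions for clarity:

```python
import hashlib

CHUNK_SIZE = 4096
hash_db = {}   # step 2: hash database (fingerprint -> stored chunk)
store = []     # stand-in for the disk back end

def dedupe_write(data: bytes) -> int:
    """Write data, storing only chunks whose fingerprint is new.
    Returns the number of unique chunks actually written."""
    written = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # step 1: hash creation
        if fp not in hash_db:                   # step 3: hash lookup
            hash_db[fp] = chunk
            store.append(chunk)                 # step 4: write to disk
            written += 1
    return written

first = dedupe_write(b"A" * 8192)   # two identical chunks -> 1 unique write
second = dedupe_write(b"A" * 4096)  # duplicate chunk -> 0 new writes
```

Every step here is CPU- or lookup-bound, which is exactly why offloading it to dedicated silicon is an attractive proposition.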

Backup Deduplication

Inline Deduplication: What Your Mother Never Told You

I recently attended a show and enjoyed speaking with a variety of end users with different levels of interest and knowledge. One thing I found was that attendees were obsessed with the question of inline vs. post-process vs. concurrent-process deduplication. Literally, people would come up and say, “Do you do inline or post process dedupe?” This is crazy. Certainly there are differences between the approaches, but the real issue should be data protection, not arcane techno-speak.

Before I go into details, let me start with the basics. Inline deduplication means that deduplication occurs in the primary data path; no data is written to disk until the deduplication process is complete. The other two approaches, post-process and concurrent process, first store data on disk and then deduplicate. As the name suggests, post-process approaches do not begin deduplication until all backups are complete. The concurrent-process approach can begin deduplication before the backups are completed, backing up and deduplicating at the same time. Let’s look at each of these in more detail.
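A toy sketch makes the distinction concrete. This is my own illustration with made-up names; real products obviously chunk, fingerprint and index far more efficiently:

```python
import hashlib

def inline_ingest(chunks, disk, index):
    """Inline: deduplicate in the data path; only unique chunks reach disk."""
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index:
            index[fp] = True
            disk.append(chunk)

def post_process_ingest(chunks, disk):
    """Post-process: land everything on disk first; deduplicate later."""
    disk.extend(chunks)

def post_process_dedupe(disk):
    """The deferred pass: collapse duplicates after backups complete."""
    seen, unique = set(), []
    for chunk in disk:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(chunk)
    return unique

data = [b"alpha", b"beta", b"alpha"]

inline_disk, idx = [], {}
inline_ingest(data, inline_disk, idx)  # only 2 chunks ever hit disk

pp_disk = []
post_process_ingest(data, pp_disk)     # all 3 chunks hit disk first
pp_disk = post_process_dedupe(pp_disk) # reduced to 2 afterwards
```

Both paths end at the same place; the trade-off is where the deduplication work happens relative to the backup window and how much landing space the system needs in the meantime.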

Deduplication Marketing

Tradeshow perspectives

I spent last week at a tradeshow in New York. These events are interesting because of the variety of end-user perspectives. Those of us in the industry often get embroiled in the minutiae of products and features, so it is very useful to hear the views of the end users on the show floor. Storage Decisions is a show that prides itself on highly qualified attendees.

One of the most curious things about the show was attendees’ obsession with inline vs. post-process deduplication. Numerous end users stopped by asking only about when DeltaStor deduplicates data. In the rush of the show, there was little time to discuss the question in detail. It struck me as odd that attendees focused on this question, which in my opinion is the wrong one to ask. I can only surmise that they had gotten an earful from competing vendors who swore that inline is the best approach.