A little bit off topic – deduplication and primary storage

I am digressing slightly from my usual data protection focus, but I found a recent announcement from Riverbed very interesting. They are developing a deduplication solution for primary storage. As an employee of a vendor of deduplication solutions, I wanted to provide commentary.

First, some background: Riverbed makes a family of WAN acceleration appliances that reduce the amount of traffic sent over a WAN using their proprietary compression and deduplication algorithms. SEPATON is a Riverbed partner, and our Site2 software has been certified with their Steelhead platform. (A bit of disclosure here: I have worked with many people from Riverbed in the past, including the VP of Marketing.)

Riverbed’s announcement is summarized in posts on ByteandSwitch and The Register. In short, they are developing a deduplication solution for primary storage. It will incorporate their existing Steelhead WAN accelerators and another appliance, code-named “Atlas”, which will contain the deduplication metadata. (The Steelhead platform has a small amount of storage for deduplication metadata since little is needed when accelerating WAN traffic. The Atlas provides the metadata storage space required for deduplicating larger amounts of data, plus additional functionality.) A customer would place the Steelhead/Atlas appliance combination in front of primary storage, and these devices would deduplicate data as it is written to the storage platform and reconstitute it as it is read back. This is an interesting approach and brings up a number of questions:

How much duplication is there in primary data?
In the case of backups, it is obvious that there is lots of duplication (full backups, for example), but it is not clear how much redundancy exists in primary storage. Riverbed states that they can eliminate “as much as 90%” of storage requirements (a 10:1 reduction); however, we have seen that dedupe ratios can vary widely in data protection, which is a more favorable environment than primary storage. The “as much as” part of this statement suggests that this is a best-case number, so the question is: what is a real-world number? What kind of deduplication ratios are needed to justify the purchase?
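To put the “90%” and “10:1” figures on the same footing, here is a minimal sketch of the arithmetic that relates a deduplication ratio to the fraction of capacity saved. The function names are my own, and the sample ratios are illustrative, not measurements from any product.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of capacity saved for a deduplication ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("a deduplication ratio below 1:1 means the data grew")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Deduplication ratio (N in N:1) implied by a fractional saving."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in the range [0, 1)")
    return 1.0 / (1.0 - savings)


# Illustrative ratios only; real-world values depend heavily on the data.
for ratio in (1.5, 2, 5, 10, 20):
    print(f"{ratio}:1 -> {savings_from_ratio(ratio):.0%} of capacity saved")
```

Note how the curve flattens: going from 2:1 to 10:1 moves the saving from 50% to 90%, while going from 10:1 to 20:1 adds only another five percentage points. For primary storage, the difference between a modest ratio and no deduplication matters far more than the difference between two large ratios.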

How would this impact I/O performance?
I have talked in the past about the performance impact of deduplication in backups and restores. Riverbed’s inband appliances will probably impact read and/or write performance. (If they are similar to existing deduplication appliances, then you will need to be particularly careful with read performance.)

What interfaces are supported?
In today’s datacenters, there are numerous storage interfaces including Fibre Channel (FC), iSCSI and NAS. FC is typically used for the highest performing applications and so would not be a good fit for this technology. (If performance is your number one priority, why would you ever implement something that would reduce it?) Alternatively, iSCSI and NAS are often used with applications that have less stringent performance needs and would be better suited. Riverbed already has extensive expertise in IP and these interfaces would be the best fit with their current technologies.

Are you willing to trust some or all of your data to these appliances?
With the Steelhead/Atlas modifying the data being written to the disk array, it is unlikely that the disk array would be readable if the deduplication appliance failed. They will offer a clustered solution, but some end users are not comfortable with appliances that sit in their primary storage data path. (This brings up another question: many vendors provide tools for migrating data from one array to another. How would those work in this scenario?)
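To illustrate why the array may be unreadable without the appliance, here is a toy sketch of a deduplicating store. It is not Riverbed's algorithm, which is proprietary; it assumes fixed-size chunks and SHA-256 fingerprints purely for illustration, and every name in it (`ToyDedupStore`, `CHUNK_SIZE`, and so on) is hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for illustration


class ToyDedupStore:
    """Toy model: unique chunks go to 'disk', recipes live in the 'appliance'."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (what lands on the array)
        self.recipes = {}  # object name -> ordered fingerprints (the metadata)

    def write(self, name: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.recipes[name] = recipe

    def read(self, name: str) -> bytes:
        # Reassembly requires the recipe; the chunks alone carry no ordering.
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


store = ToyDedupStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
file1 = block_a * 3 + block_b
file2 = block_a * 2
store.write("file1", file1)
store.write("file2", file2)

logical = len(file1) + len(file2)
print(f"logical bytes written: {logical}, bytes stored: {store.stored_bytes()}")
```

In this toy model the array holds only two unique chunks, and nothing on it records which chunks belong to which object or in what order. Lose `recipes` and the data is effectively gone, which is why the metadata appliance becomes as critical as the array itself.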

In summary, Riverbed has created an interesting technology proposition. The success of their solution is going to depend on 3 key metrics:

  1. Deduplication ratios – The better the deduplication, the greater the storage savings
  2. Performance impact – The greater the performance impact, the less compelling the solution
  3. Cost of the system – This is vital because it drives the ROI, although administration cost is also important
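As a rough way to connect the first and third metrics, the sketch below estimates the deduplication ratio at which the capacity saved pays for the appliances. It is a deliberately simplified model of my own: it ignores performance impact, administration cost, power, and maintenance, and every figure in it is a made-up placeholder, not Riverbed pricing.

```python
from typing import Optional


def break_even_ratio(logical_tb: float,
                     cost_per_tb: float,
                     appliance_cost: float) -> Optional[float]:
    """Deduplication ratio (N in N:1) at which total cost matches no-dedupe cost.

    Without dedupe: logical_tb * cost_per_tb
    With dedupe:    appliance_cost + (logical_tb / ratio) * cost_per_tb
    Returns None when the appliance costs at least as much as the storage
    it would replace, since no ratio can break even in that case.
    """
    baseline = logical_tb * cost_per_tb
    if appliance_cost >= baseline:
        return None
    return baseline / (baseline - appliance_cost)


# Hypothetical numbers: 100 TB of data, 5,000 per TB, 250,000 for appliances.
print(break_even_ratio(logical_tb=100, cost_per_tb=5000, appliance_cost=250000))
```

With these placeholder inputs the appliances cost half as much as the storage they front, so the break-even ratio is 2:1. The more expensive the appliances are relative to the disk behind them, the higher the ratio a customer needs to see before the purchase makes sense.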

Riverbed is betting that their solution will provide a compelling combination of the above metrics. Deduplication ratios will vary in primary storage environments (as they do in backup environments), so the value a customer experiences will vary as well. This is also a sale to a different person than the traditional buyer of WAN optimization technology. Riverbed therefore faces two challenges: proving the value of the new technology and effectively entering a new market segment. They are not expecting product GA until the first half of 2009, so they have time to figure this out. From my perspective, it will be interesting to see how it plays out in the market.
