W. Curtis Preston, the author of the Mr. Backup Blog, recently voiced his frustration with certain bloggers censoring visitor comments. He was annoyed that some folks from EMC configured their blogs for comment moderation (all comments must be approved before they appear on the site) and used that power to delete certain responses. He contrasted this with NetApp, whose blogs are not moderated. (As a point of clarification, AboutRestore.com’s comments are not moderated; reader comments are posted immediately.) Whether you believe in comment moderation or not, at least these blogs all provide a mechanism for visitors to respond.
I was recently attending a show and enjoyed speaking with a variety of end users with different levels of interest and knowledge. One of the things I found was that attendees were obsessed with the question of inline vs. post process vs. concurrent process deduplication. Literally, people would come up and say “Do you do inline or post process dedupe?” This is crazy. Certainly there are differences between the approaches, but the real issue should be data protection, not arcane techno-speak.
Before I go into details, let me start with the basics. Inline deduplication means that deduplication occurs in the primary data path; no data is written to disk until the deduplication process is complete. The other two approaches, post process and concurrent process, first store data on disk and then deduplicate. As the name suggests, post process approaches do not begin deduplication until all backups are complete. The concurrent process approach can begin deduplication before the backups are completed, backing up and deduplicating at the same time. Let’s look at each of these in more detail.
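The scheduling difference between the three approaches can be sketched in a few lines of code. This is a toy illustration, not any vendor's implementation: the function names and the use of SHA-256 chunk hashes are my own assumptions, and real products obviously do far more. The point is simply *when* deduplication runs relative to the backup jobs.

```python
import hashlib

def dedupe(chunks, seen):
    """Keep only chunks whose hash has not been stored before."""
    unique = []
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(c)
    return unique

def inline(chunk_stream):
    # Inline: dedupe sits in the data path, so only unique
    # chunks ever reach disk.
    return dedupe(chunk_stream, set())

def post_process(backup_jobs):
    # Post process: land every chunk on disk first; dedupe only
    # after ALL backup jobs have finished.
    staging = [c for job in backup_jobs for c in job]
    return dedupe(staging, set())

def concurrent(backup_jobs):
    # Concurrent: dedupe each landed job while later jobs are
    # still running -- dedupe starts before the last backup ends.
    seen, disk = set(), []
    for job in backup_jobs:
        disk.extend(dedupe(job, seen))
    return disk
```

All three end up storing the same unique chunks; they differ in when the work happens and how much staging disk is needed in the meantime.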
I spent last week at a tradeshow in New York. These events are interesting because of the various end user perspectives. Those of us in the industry often get embroiled in the minutiae of products and features, and so it is very useful to understand the views of the end users on the show floor. Storage Decisions is a show that prides itself on highly qualified attendees.
One of the most curious things about the show was attendees’ obsession with inline vs. post process deduplication. Numerous end users stopped by asking only about when DeltaStor deduplicates data. In the rush of the show, there was little time to discuss the question in much detail. It struck me as odd that these attendees focused on this question, which in my opinion is the wrong question to ask. I can only surmise that they had gotten an earful from competing vendors who swore that inline is the best approach.
W. Curtis Preston, the author of the Mr. Backup Blog, recently posted an article about the blogs that he frequents. I was honored that he recognized AboutRestore.com along with blogs from other major vendors.
Curtis mentioned his frustration with the comment filtering policies on some blogs, and I wanted to clarify AboutRestore.com’s policy. (A synopsis of the policy is contained in the disclaimer in the sidebar.) Comments are not moderated; whatever you post appears on the site instantly. I have little interest in censorship; however, I reserve the right to delete comments containing abusive language or personal attacks. I hope I never have to use my power of deletion, but as Uncle Ben said to Peter Parker/Spider-Man:
With great power comes great responsibility.
Now back to regularly scheduled programming…
Scott over at EMC recently posted his thoughts about deduplication ratios and how widely they vary. I agree with his assessment that compression ratios, change rates and retention are key ingredients in deduplication ratios. However, he makes a global statement, “If you don’t know those three things, you simply cannot state a deduplication ratio with any level of honesty….It is impossible”, and uses this point to suggest that SEPATON’s Exchange guarantee program is “ridiculous”. Obviously the blogger, being an EMC employee, brings his own perspectives, as do I, a SEPATON employee. Let’s dig into this a bit more.
As the original author mentioned, the key metrics for deduplication include compression, change rate and retention. Clearly these can vary by data types; however, certain data types provide more consistent deduplication results. As you can imagine, these are applications that are backed up fully every night, have fixed data structures and relatively low data change rates. Some examples include Exchange, Oracle, VMware and others.
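To see how those three metrics interact, here is a back-of-the-envelope calculation. The formula and all the numbers in it are illustrative assumptions of mine (a nightly-full workload, a fixed nightly change rate, 2:1 compression), not SEPATON's or EMC's model; the point is only that for a stable application backed up fully every night, the ratio is largely determined by change rate and retention.

```python
def dedupe_ratio(full_gb, change_rate, retained_fulls, compression=2.0):
    """Rough deduplication ratio for nightly full backups.

    logical = every retained full as the application sees it
    stored  = one baseline plus the changed blocks of each later
              full, all compressed
    """
    logical = full_gb * retained_fulls
    stored = (full_gb + full_gb * change_rate * (retained_fulls - 1)) / compression
    return logical / stored

# Illustrative only: a 1 TB Exchange store, 1% nightly change,
# 30 retained fulls, 2:1 compression.
print(round(dedupe_ratio(1000, 0.01, 30), 1))  # → 46.5
```

Double the retention and the ratio climbs; double the change rate and it falls. That is why the predictable, low-change applications listed above give the most consistent results.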
On the surface, the idea of deduplicated replication is compelling. By replicating deltas, the technology sends data across a WAN and dramatically reduces the required bandwidth. Many customers are looking to this technology to allow them to move to a tapeless environment in the future. However, there is a major challenge that most vendors gloss over.
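The bandwidth appeal is easy to quantify. The numbers below are illustrative assumptions of mine (a 1 TB nightly full, a 1% nightly change rate, a 100 Mb/s WAN link, and no protocol overhead), not measurements of any product:

```python
def hours_to_replicate(gb, wan_mbps):
    """Hours to push `gb` of data over a WAN link (no overhead)."""
    return gb * 8 * 1000 / wan_mbps / 3600

full_gb = 1000                  # illustrative 1 TB nightly full
changed_gb = full_gb * 0.01     # deltas only: 1% nightly change

print(round(hours_to_replicate(full_gb, 100), 1))     # → 22.2
print(round(hours_to_replicate(changed_gb, 100), 1))  # → 0.2
```

Replicating the whole full would blow through the backup window; replicating only the deduplicated deltas fits easily. That is the promise. The challenge comes next.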
The most common approach to deduplication in use today is hash-based technology which uses reverse referencing. I covered the implications of this approach in another post. To summarize, the issue is that restore performance degrades as more data is retained in a reverse referenced environment. Now let’s look at how this impacts deduplicated replication.
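A small sketch makes the reverse-referencing problem concrete. This is my own simplified model, not any vendor's code: each duplicate chunk in a new backup becomes a pointer back to the location where that chunk was *first* written, so the newest backup's read pattern scatters across the oldest storage locations.

```python
import hashlib

store = {}      # chunk hash -> disk location of its FIRST write
next_loc = 0

def ingest(backup):
    """Store a backup; duplicates become back-pointers (reverse refs)."""
    global next_loc
    recipe = []
    for chunk in backup:
        h = hashlib.sha256(chunk).hexdigest()
        if h not in store:
            store[h] = next_loc   # new chunk lands at the next location
            next_loc += 1
        recipe.append(store[h])   # duplicate points back to the old copy
    return recipe

# Nightly fulls: mostly the same 20 chunks plus one new chunk per night.
recipes = []
for night in range(5):
    backup = [b"block%d" % i for i in range(20)] + [b"new%d" % night]
    recipes.append(ingest(backup))

# Restoring the NEWEST backup reads almost entirely from locations
# written on night 0 -- the restore fragments as retention grows.
print(sorted(set(recipes[-1])))
```

The recipe for the latest backup is dominated by the oldest locations, which is exactly why restores of recent data slow down over time in a reverse-referenced system, and why that matters once replication makes the remote copy your disaster-recovery restore source.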
In part 1, I touched on four of the most common challenges with data restoration in a disaster scenario. In this post, I will review some other key considerations. These examples focus on the infrastructure required after a disaster has occurred.
Hurricane Ike has been in the news lately and my sympathy goes out to all those affected. It is events like these that test IT resiliency. The damage can range from slight to severe, and we invest in reliable and robust data protection processes to protect against disasters like this. The unfortunate reality is that, no matter how much you plan for it, the recovery process often takes longer and is more difficult than expected.
In many respects, data protection is an insurance policy. You hate to pay your homeowner’s premium every month, but you do it because you know it is your only protection if major damage ever happens to your house. In the case of data protection, you invest hours managing your backup environment to enable recovery from incidents like this. The unfortunate reality is that even with the best planning and policies, things still may not turn out as expected. Four of the most common pitfalls I hear from customers include:
I am digressing slightly from my usual data protection focus, but I found a recent announcement from Riverbed very interesting. They are developing a deduplication solution for primary storage. As an employee of a vendor of deduplication solutions, I wanted to provide commentary.
First, some background: Riverbed makes a family of WAN acceleration appliances that reduce the amount of traffic sent over a WAN using their proprietary compression and deduplication algorithms. SEPATON is a Riverbed partner, and our Site2 software has been certified with their Steelhead platform. (A bit of disclosure here: I have worked with many people from Riverbed in the past, including the VP of Marketing.)
Riverbed’s announcement is summarized in posts on ByteandSwitch and The Register. In short, they are developing a deduplication solution for primary storage. It will incorporate their existing Steelhead WAN accelerators and another appliance code named “Atlas” which will contain the deduplication metadata. (The Steelhead platform has a small amount of storage for deduplication metadata since little is needed when accelerating WAN traffic. The Atlas provides the metadata storage space required for deduplicating larger amounts of data and additional functionality.) A customer would place the Steelhead/Atlas appliance combination in front of primary storage and these devices would deduplicate/undeduplicate data as it is written/read from the storage platform. This is an interesting approach and brings up a number of questions:
Howard Marks recently posted an interesting article about NEC’s HYDRAstor over on his blog at InformationWeek. He discusses the product and how the device is targeted at backup and archiving applications. He makes some interesting points and mentions SEPATON. I wanted to respond to some of the points he raised.
…[the system starts with] a 1-accelerator node – 2-storage node system at $180,000…