Categories
Deduplication, Restore

Deduplication, Restore Performance and the A-Team

I have posted in the past about the challenges of restoring data from a reverse referenced deduplication solution. In short, the impact can be substantial. You might wonder whether I am the only one pointing out this issue, and what the impact really is.

An EMC blogger recently posted on this topic and provided insight into the reduction in restore performance he sees from both the DL3D and Data Domain. He said, “I will have to rely on what customers tell me: data reads from a DD [Data Domain] system are typically 25-33% of the speed of data writes.” He then goes on to confirm that “…the DL3D performs very similarly to a Data Domain box”. He is referring to restore performance on deduplicated data in a reverse referenced environment. (Both Data Domain and EMC/Quantum rely on reverse referencing.) He recommends that you maintain a cache of undeduplicated data on the DL3D to avoid this penalty. Of course, this raises a range of additional questions: how much extra storage will the holding area require, how many days of data should you retain, and what does this do to deduplication ratios?

The simplest solution to the above problem is to use forward referencing, but neither Data Domain nor EMC/Quantum supports this technology. EMC’s workaround is to force the customer to use more disk to store undeduplicated data, which adds to the management burden and cost.
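For readers who like to see the mechanics, here is a minimal Python sketch of the difference (my own simplified model with made-up chunk counts and change rates, not any vendor's actual on-disk format). It counts how many separate backup generations have to be read to restore the most recent backup under each referencing scheme:

```python
import random

def reverse_referenced_layout(backups):
    """Each chunk lives in the generation where it FIRST appeared;
    newer backups merely point back at it."""
    location = {}                                   # chunk id -> generation holding it
    for gen, chunks in enumerate(backups):
        for c in chunks:
            location.setdefault(c, gen)
    newest = backups[-1]
    return len({location[c] for c in newest})       # generations a restore must touch

def forward_referenced_layout(backups):
    """The newest backup is always stored whole; older backups
    point forward into it, so the newest restore reads one place."""
    return 1

# Ten nightly fulls of a 1,000-chunk data set with ~2% daily change.
random.seed(0)
backups, data = [], list(range(1000))
for night in range(10):
    data = [c if random.random() > 0.02 else 10_000 + night * 1000 + i
            for i, c in enumerate(data)]
    backups.append(list(data))

print("generations touched restoring the newest backup:")
print("  reverse referencing:", reverse_referenced_layout(backups))
print("  forward referencing:", forward_referenced_layout(backups))
```

With reverse referencing the newest backup fans out across nearly every generation retained, which is exactly where the restore penalty comes from; with forward referencing it stays in one place.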

This reminds me of a classic quote from John “Hannibal” Smith from the A-Team:

I love it when a plan comes together!

What more confirmation do you need?

Categories
Backup, Deduplication, Restore

Trials and Tribble-lations of Deduplication

One of my favorite episodes of Star Trek was “The Trouble with Tribbles.” In the episode, Uhura adopted a creature called a tribble, only to find that it immediately started to reproduce uncontrollably, resulting in an infestation of the Enterprise’s critical business (err, spaceship) systems. You can read a synopsis of the episode here or, even better, watch it here. What does this have to do with restoration and deduplication? I’m glad you asked.

As I previously posted, the key driver in sizing deduplication environments and solutions is performance. This is because most solutions are performance constrained by deduplication. Like the tribbles from Star Trek, the risk end-users run is rapid growth in the number of deduplication appliances. It may seem easy to size the environment initially, but what happens if your data growth is faster than expected or stricter SLAs require you to reduce your backup and/or restore windows? The inevitable answer in most cases is more deduplication appliances. All of a sudden what seemed like one cute tribble (err, deduplication appliance) becomes a massive quantity of independent devices with different capacity and performance metrics. This large growth in machines will add complexity to your environment and will dramatically reduce any cost savings that you may have originally expected.

To avoid these issues, you need to think about your needs not just for today but for the future. The ideal solution is to purchase a system today that can grow to meet your needs going forward. This underscores the importance of performance scalability, and you must understand how any given solution scales.

In the world of Star Trek, Scotty easily beamed the excess tribbles to a nearby Klingon vessel. In the world of the data center, we are not so lucky. Besides who would be the unwilling recipient? Perhaps you could beam them to Data Domain?

Categories
Deduplication, Virtual Tape

NetApp Dedupe: The Worst of Inline and Post-process Deduplication

NetApp has finally entered the world of deduplication in data protection. While they have supported a flavor of the technology in their filers since May 2007, they had never brought it to their VTL. Why? Because their VTL does not use any of the core filer IP; it relies on an entirely separate software architecture acquired from Alacritus, so none of the features of ONTAP apply to it. However, I digress from the topic at hand.

I posted recently about three different approaches to deduplication timing: inline, post process and concurrent process. I talked about the benefits of each and highlighted the fact that post process and concurrent process deliver the fastest backup performance, since deduplication occurs outside the primary data path, while inline requires the smallest possible disk space, since undeduplicated data is never written to disk. Now comes NetApp with a whole new take. Their model combines the worst of post process and inline: it requires a disk holding area and reduces backup performance. After all this time developing the product, this is what they come up with? Hmmm, maybe they should stick to filers.

Categories
Backup, Deduplication, Restore

Deduplication: It’s About Performance

I have recently been thinking about the real benefits of deduplication. Although the technology is all about capacity, when you analyze the cost and benefits in the real world, the thing that jumps out at you is performance.

Performance is the key driver in sizing and in assessing the number of units required, which means it also drives cost. Deduplication enables longer retention but usually reduces backup and restore performance. For example, a 40 TB system can hold 800 TB of data assuming a ratio of 20:1. That is a large number, but it soon becomes clear that the system’s effective capacity is limited by backup speed. The graph below shows the relationship between data protected and backup window, assuming performance of 400 MB/sec.


[Graph: data protected vs. backup window at 400 MB/sec]
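For those who prefer numbers to a picture, here is a quick back-of-envelope version of the same relationship; the 40 TB, 20:1 and 400 MB/sec figures are the assumptions used above:

```python
# Back-of-envelope: a 40 TB system at an assumed 20:1 ratio holds ~800 TB
# logically, but the backup window, not capacity, is the binding constraint.
USABLE_TB = 40
DEDUPE_RATIO = 20
INGEST_MB_PER_SEC = 400

logical_capacity_tb = USABLE_TB * DEDUPE_RATIO           # ~800 TB protected
tb_per_hour = INGEST_MB_PER_SEC * 3600 / 1_000_000       # ~1.44 TB ingested per hour

for window_hours in (8, 12, 24):
    print(f"{window_hours:>2}-hour window: at most "
          f"{tb_per_hour * window_hours:5.1f} TB backed up per night")

print(f"logical capacity: {logical_capacity_tb} TB -- far more than any "
      f"realistic nightly window can feed at this ingest rate")
```

Even running flat out around the clock, a single 400 MB/sec device ingests well under 40 TB a day, so backup speed, not the 800 TB of logical capacity, is what you end up sizing against.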

Categories
Deduplication

HIFN – Commoditizing hash-based deduplication?

HIFN recently announced a card that accelerates hash-based deduplication. For those unfamiliar with HIFN, they provide infrastructure components that accelerate CPU-intensive processes such as compression, encryption and now deduplication. Their products are primarily embedded inside appliances, and you may be using one of them today.

The interesting thing about the HIFN card is that they are positioning it as an all-in-one hash deduplication solution. Here are the key processes that the device performs (sketched in software after the list):

  1. Hash creation
  2. Hash database creation and management
  3. Hash lookups
  4. Write to disk
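To make those four steps concrete, here is a minimal software sketch of a hash-based pipeline. It assumes fixed-size chunks and SHA-1 fingerprints purely for illustration; HIFN's card implements this sort of flow in silicon, and its internals are obviously not reflected here.

```python
import hashlib
import io

CHUNK_SIZE = 8 * 1024     # fixed-size chunks for simplicity; real products vary
hash_index = {}           # steps 1-2: the hash database (fingerprint -> location)
unique_store = []         # step 4: stand-in for the disk holding unique chunks

def ingest(stream):
    """Deduplicate a byte stream using the four steps above."""
    recipe = []                                   # how to rebuild the stream later
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        fp = hashlib.sha1(chunk).digest()         # step 1: hash creation
        loc = hash_index.get(fp)                  # step 3: hash lookup
        if loc is None:
            loc = len(unique_store)
            unique_store.append(chunk)            # step 4: write unique chunk
            hash_index[fp] = loc                  # step 2: update hash database
        recipe.append(loc)
    return recipe

# A second copy of the same data stores nothing new -- only references.
payload = b"example payload " * 4096
ingest(io.BytesIO(payload))
print("unique chunks after first copy: ", len(unique_store))
ingest(io.BytesIO(payload))
print("unique chunks after second copy:", len(unique_store))
```

The hard part, and the reason dedicated hardware is interesting at all, is doing the hashing and index lookups fast enough to keep up with backup-speed ingest.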
Categories
Backup, Deduplication

Inline Deduplication: What Your Mother Never Told You

I was recently at a show and enjoyed speaking with a variety of end users with different levels of interest and knowledge. One thing I found was that attendees were obsessed with the question of inline vs post process vs concurrent process deduplication. Literally, people would come up and say, “Do you do inline or post process dedupe?” This is crazy. Certainly there are differences between the approaches, but the real issue should be data protection, not arcane techno-speak.

Before I go into details, let me start with the basics: inline deduplication means that deduplication occurs in the primary data path, and no data is written to disk until the deduplication process is complete. The other two approaches, post process and concurrent process, first store data on disk and then deduplicate. As the name suggests, post process approaches do not begin deduplicating until all backups are complete. The concurrent process approach can begin deduplication before the backups are complete, backing up and deduplicating at the same time. Let’s look at each of these in more detail.
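A rough timing model makes the trade-off easier to see. All of the rates below are assumptions chosen purely for illustration, not measurements of any product:

```python
# Illustrative timing model for the three approaches (assumed rates only).
BACKUP_TB = 10
RAW_INGEST_MBS = 800      # ingest speed when writing straight to disk
INLINE_INGEST_MBS = 400   # ingest speed when deduplicating in the data path
DEDUPE_MBS = 500          # speed of the out-of-band deduplication engine

def hours(tb, mb_per_sec):
    return tb * 1_000_000 / mb_per_sec / 3600

inline_window = hours(BACKUP_TB, INLINE_INGEST_MBS)          # dedupe done when backup ends
post_window = hours(BACKUP_TB, RAW_INGEST_MBS)
post_done = post_window + hours(BACKUP_TB, DEDUPE_MBS)       # dedupe starts only afterwards
concurrent_window = post_window                              # same raw ingest speed
concurrent_done = max(post_window, hours(BACKUP_TB, DEDUPE_MBS))  # overlapped; never starved here

print(f"inline:     window {inline_window:.1f} h, dedupe done {inline_window:.1f} h, no holding area")
print(f"post:       window {post_window:.1f} h, dedupe done {post_done:.1f} h, full holding area")
print(f"concurrent: window {concurrent_window:.1f} h, dedupe done {concurrent_done:.1f} h, partial holding area")
```

The exact numbers are beside the point; the shape of the trade-off is what matters. Inline stretches the backup window, post process finishes deduplication last and needs the most holding space, and concurrent process sits in between.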

Categories
Deduplication, Marketing

Tradeshow perspectives

I spent last week at a tradeshow in New York. These events are interesting because of the various end user perspectives. Those of us in the industry often get embroiled in the minutiae of products and features, and so it is very useful to understand the views of the end users on the show floor. Storage Decisions is a show that prides itself on highly qualified attendees.

One of the most curious things about the show was attendees’ obsession with inline vs post process deduplication. Numerous end users stopped by asking only about when DeltaStor deduplicates data. In the rush of the show, there was little time to discuss the question in much detail. It struck me as odd that these attendees focused on this question, which in my opinion is the wrong question to ask. I can only surmise that they had gotten an earful from competing vendors who swore that inline is the best approach.

Categories
Deduplication, General, Marketing

Exchange deduplication ratio guarantee

Scott over at EMC recently posted his thoughts about deduplication ratios and how they vary widely. I agree with his assessment that compression ratios, change rates and retention are key ingredients in deduplication ratios. However, he makes a global statement, “If you don’t know those three things, you simply cannot state a deduplication ratio with any level of honesty….It is impossible”, and uses this point to suggest that SEPATON’s Exchange guarantee program is “ridiculous”. Obviously the blogger, being an EMC employee, brings his own perspectives as do I, a SEPATON employee. Let’s dig into this a bit more.

As the original author mentioned, the key metrics for deduplication include compression, change rate and retention. Clearly these can vary by data type; however, certain data types provide more consistent deduplication results. As you can imagine, these are applications that are backed up in full every night, have fixed data structures and relatively low data change rates. Examples include Exchange, Oracle, VMware and others.
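As a rough illustration of how those three inputs drive the ratio, here is a simple model of a hypothetical nightly-full environment. The figures are assumptions chosen for the example, not the numbers behind our guarantee:

```python
# Rough model: nightly fulls, a fixed daily change rate, a retention period
# and a compression ratio. All figures are assumptions for illustration.
full_size_tb = 1.0     # size of one full backup
change_rate = 0.01     # ~1% of the data is new or changed each day
retention_days = 30    # number of nightly fulls kept
compression = 2.0      # compression applied to what is actually stored

logical = full_size_tb * retention_days                          # what the backup app wrote
stored = (full_size_tb                                           # first full stored once
          + full_size_tb * change_rate * (retention_days - 1)    # daily new data
          ) / compression

print(f"logical data protected: {logical:.1f} TB")
print(f"physical data stored:   {stored:.2f} TB")
print(f"deduplication ratio:    {logical / stored:.0f}:1")
```

With these inputs the ratio works out to roughly 47:1; nudge the change rate up or the retention down and it moves quickly, which is exactly why the three inputs matter so much.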

Categories
Deduplication, Restore

The hidden cost of deduplicated replication

On the surface, the idea of deduplicated replication is compelling. By replicating only the deltas, the technology dramatically reduces the bandwidth required to send data across a WAN. Many customers are looking to this technology to allow them to move to a tapeless environment in the future. However, there is a major challenge that most vendors gloss over.

The most common approach to deduplication in use today is hash-based technology, which uses reverse referencing. I covered the implications of this approach in another post. To summarize, the issue is that restore performance degrades as data is retained in a reverse referenced environment. Now let’s look at how this impacts deduplicated replication.
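A quick back-of-envelope sketch, with assumed figures, shows both sides of the coin: the bandwidth that delta replication saves, and the read work a reverse referenced target still has to do when you actually need the data back:

```python
# Assumed figures for illustration only.
nightly_full_tb = 5.0
daily_change = 0.01      # only ~1% of the segments are new each night
wan_mbps = 100           # WAN bandwidth in megabits per second

def hours_to_send(tb, mbps):
    return tb * 8_000_000 / mbps / 3600    # 1 TB = 8,000,000 megabits

print(f"replicate a full copy nightly: {hours_to_send(nightly_full_tb, wan_mbps):.0f} h")
print(f"replicate only the deltas:     {hours_to_send(nightly_full_tb * daily_change, wan_mbps):.1f} h")

# The catch: to restore (or rehydrate to tape) at the target site, all 5 TB of
# referenced segments must be read back, and in a reverse referenced store those
# segments are scattered across every generation retained.
```

Shipping deltas turns an impossible nightly transfer into a trivial one, but the replica is only as useful as the speed at which it can be read back.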

Categories
Deduplication

A little bit off topic – deduplication and primary storage

I am digressing slightly from my usual data protection focus, but I found a recent announcement from Riverbed very interesting. They are developing a deduplication solution for primary storage. As an employee of a vendor of deduplication solutions, I wanted to provide commentary.

First, some background: Riverbed makes a family of WAN acceleration appliances that reduce the amount of traffic sent over a WAN using their proprietary compression and deduplication algorithms. SEPATON is a Riverbed partner, and our Site2 software has been certified with their Steelhead platform. (A bit of disclosure here: I have worked with many people from Riverbed in the past, including the VP of Marketing.)

Riverbed’s announcement is summarized in posts on ByteandSwitch and The Register. In short, they are developing a deduplication solution for primary storage. It will incorporate their existing Steelhead WAN accelerators and another appliance, code-named “Atlas,” which will contain the deduplication metadata. (The Steelhead platform has only a small amount of storage for deduplication metadata, since little is needed when accelerating WAN traffic. Atlas provides the metadata storage space required for deduplicating larger amounts of data, plus additional functionality.) A customer would place the Steelhead/Atlas combination in front of primary storage, and these devices would deduplicate and undeduplicate data as it is written to and read from the storage platform. This is an interesting approach and brings up a number of questions: