There is an interesting discussion on The Backup Blog related to deduplication and EMC’s DL3D. The conversation relates to performance, and the two participants are W. Curtis Preston, author of the Mr. Backup Blog, and Scott from EMC, author of The Backup Blog. Here are some excerpts that I find particularly interesting, with my commentary included. (Note that I am directly quoting Scott below.)
VTL performance is 2,200 MB/sec native. We can actually do a fair bit better than that…. 1,600 MB/sec with hardware compression enabled (and most people do enable it for capacity benefits.)
The 2200 MB/sec figure is not new; it is what EMC specifies on their datasheet. However, it is interesting that performance declines to 1600 MB/sec with hardware compression. The hardware compression card must be a performance bottleneck. Is a performance reduction of roughly 27% meaningful? It depends on the environment, and it is certainly worth noting, especially for datacenters where backup and restore performance are the primary concern.
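As a quick check, the size of that drop can be computed directly from the two rates Scott quotes. This is only arithmetic on his numbers; the variable names are mine.

```python
# Ingest rates quoted by Scott, in MB/sec.
native_mb_s = 2200
compressed_mb_s = 1600

# Fractional drop in throughput when hardware compression is enabled.
drop = (native_mb_s - compressed_mb_s) / native_mb_s
print(f"Throughput drop with compression: {drop:.1%}")
```

The result is a little over 27%.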
Now, on to deduplication:
Deduplication performance is 400 MB/s. That can run 24 hours a day, because it is a post-process deduplication. I typically don’t recommend a scenario in which it would run for more than 20 hours a day on average, because I want to leave room for future growth, for restore requests, and the like. But at 20 hours per day, that is, roughly, 30 TB per day of deduplication capability.
This is interesting: the EDL can ingest at either 2200 or 1600 MB/sec, as mentioned above, but deduplication runs at 400 MB/sec, which is 4 to 5.5x slower than the ingest rate. Scott rightly suggests that you don’t want to configure the system to deduplicate all day, and so recommends planning for no more than about 30 TB of deduplication per day at installation. Taking the math a step further, 400 MB/sec sustained for a full 24 hours works out to roughly 35 TB (or 36 TB if you simply scale Scott’s rounded 30 TB figure), so that is the largest nightly backup the DL3D can deduplicate. Anything larger and the DL3D will be deduplicating 24 hours a day and will never catch up. This is very important information for customers with larger environments who are looking at deduplication.
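The arithmetic behind these numbers can be sketched in a few lines of Python. This is a simplified model of my own, not anything EMC publishes: it assumes a constant sustained deduplication rate, decimal units (1 TB = 1,000,000 MB), and the same backup size every night. The function names are hypothetical.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units, as vendors usually quote them

def daily_dedup_tb(rate_mb_s: float, hours_per_day: float) -> float:
    """TB that can be deduplicated in one day at a sustained rate."""
    return rate_mb_s * hours_per_day * 3600 / MB_PER_TB

DEDUP_RATE_MB_S = 400  # from Scott's comment

# How much faster the VTL ingests than the dedup engine processes.
for ingest in (2200, 1600):
    print(f"ingest at {ingest} MB/sec is {ingest / DEDUP_RATE_MB_S:.1f}x the dedup rate")

print(f"20 hours/day: {daily_dedup_tb(DEDUP_RATE_MB_S, 20):.1f} TB")
print(f"24 hours/day: {daily_dedup_tb(DEDUP_RATE_MB_S, 24):.1f} TB")

def backlog_after(days: int, nightly_tb: float, hours_per_day: float = 24.0) -> float:
    """TB still waiting to be deduplicated after `days` of nightly backups."""
    capacity = daily_dedup_tb(DEDUP_RATE_MB_S, hours_per_day)
    backlog = 0.0
    for _ in range(days):
        # Each night adds a backup; each day removes at most `capacity`.
        backlog = max(0.0, backlog + nightly_tb - capacity)
    return backlog
```

Under these assumptions, a 30 TB nightly backup never builds a backlog when the engine runs around the clock, while a 40 TB nightly backup falls further behind every day, which is the "never catch up" case.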
Now, on to my favorite topic: Restores!
…some things you will want to be able to restore faster than deduplicated storage permits–so you want to leave them on the VTL for that period of time for which you want high speed restore.
In a future post, I will discuss the impact of deduplication on data restoration. In short, like other hash-based solutions, there is an impact on restoration speed with DL3D. Scott does not quantify the impact, but we have seen performance degradations on the order of 60% or greater on solutions from other hash-based deduplication vendors. The degradation grows over time and can appear after you retain data on the deduplication device for as little as 3 weeks. SEPATON’s DeltaStor technology uses a different approach that avoids this behavior. Stay tuned for more on this topic.
Net net? The 4406 3D can ingest more than it can dedup in a day. There is a very big VTL space of 675 TB to write data to that is exclusive of the additional 148 TB of deduplicated storage.
The data above supports his point, but of particular note is his comment about storage requirements. He indicates that VTL storage space is separate from the deduplicated storage (675 TB vs. 148 TB); thus the DL3D requires separate disk capacity for VTL and deduplication. (This is not to suggest that a separate disk array is required, but rather that you need to carve out separate capacity for VTL and deduplication data.) Most other systems use a common repository for deduplicated and non-deduplicated data, and the EMC approach introduces a level of complexity that is unique to the DL3D. Someone (EMC? end user? reseller?) needs to constantly monitor the space utilization of the VTL and the deduplication environment and manage the process of provisioning and/or purchasing new storage.
In summary, EMC provides a VTL that can accept data at 2200 MB/sec, which is reasonably fast. Conversely, their deduplication engine accepts data at 400 MB/sec, which results in a fundamental performance imbalance. Scott suggests that a customer manage the imbalance by not deduplicating rapidly changing data or data that needs high-speed restore. This creates a management challenge for the end user. Unlike a traditional VTL, where you back up all your data in the same manner, the DL3D requires environments to manually decide which data to deduplicate and to allocate storage based on that decision. What happens over time as new data is added to the system? How do you decide which data to deduplicate or not? What are the implications for provisioning new disk capacity to the VTL or deduplication repository? These are major challenges that someone has to deal with. Either the customer dedicates time to managing this complexity themselves or they hire EMC to do it for them. In either case, the customer ends up paying.
In closing, this scenario reminds me of a famous quote from J. Wellington Wimpy from Popeye:
I will gladly pay you Tuesday for a hamburger today!
You may get your hamburger/DL3D at a low cost today, but remember it may cost you substantially more in the future…