I have recently been thinking about the real benefits of deduplication. Although the technology is all about capacity, when you analyze the cost and benefits in the real world, the thing that jumps out at you is performance.
Performance is the key driver in sizing and in determining the number of units required, which means it also drives cost. Deduplication enables longer retention but usually reduces backup and restore performance. For example, a 40 TB system can hold 800 TB of data at a deduplication ratio of 20:1. This is a large number, but it soon becomes clear that the amount of data the system can actually protect is limited by backup speed. The graph below shows the relationship between data protected and backup window, assuming performance of 400 MB/sec.
The graph shows that you can back up roughly 35 TB in 24 hours, 7 TB in 5 hours, and everything in between. Any end user whose requirements place them above the line will require multiple systems. Remember, these numbers are based on 400 MB/sec performance, which is about the fastest for today’s hash-based solutions, and must be adjusted downward for slower systems.
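The line in the graph is simple arithmetic: sustained throughput multiplied by the length of the backup window. The sketch below reproduces it under two assumptions of mine: decimal units (1 TB = 1,000,000 MB) and a rate that is sustained for the whole window. The function name `protectable_tb` is hypothetical and used only for illustration.

```python
def protectable_tb(throughput_mb_s: float, window_hours: float) -> float:
    """Data (in decimal TB) one system can ingest within the backup window.

    Assumes the throughput is sustained for the entire window.
    """
    seconds = window_hours * 3600
    return throughput_mb_s * seconds / 1_000_000  # 1 TB = 1,000,000 MB


if __name__ == "__main__":
    # Walk along the line for a 400 MB/sec system.
    for hours in (5, 8, 12, 24):
        print(f"{hours:>2} h window: {protectable_tb(400, hours):.1f} TB")
```

At 400 MB/sec this gives 7.2 TB for a 5 hour window and 34.56 TB for a 24 hour window, which matches the rounded figures above. For a slower system, pass its rate instead of 400 to see how far the line drops.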
In modeling numerous deduplication solutions, I found that many growing environments will quickly surpass the line above and require either additional separate units, each providing 400 MB/sec, or a scalable deduplication appliance like the S2100-ES2. Post-process solutions will also require multiple systems, since most of them bottleneck on the same 400 MB/sec hash-based deduplication performance. Solutions that require separate VTL and deduplication hardware and software are often more expensive to purchase than solutions with a unified design, since adding an extra system means adding an entirely separate VTL and deduplication infrastructure.
As always, the blog is entitled AboutRestore for a reason, and end users must always focus on restore performance. The challenge with many deduplication solutions is that restore performance is a fraction of backup performance. While you may want to size your environment to the smallest possible number of systems that meets your backup window SLAs, remember that restore performance will be much slower than backup performance. (The sole exception is solutions that use a forward referencing approach like SEPATON’s DeltaStor software.) Restore performance is the key, and you must size to ensure that you can meet your restore needs. When you do, think about how many systems you are going to have to manage. Is that complexity worth it?
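To make the sizing argument concrete, here is a rough sketch that counts the units needed to meet both a backup window and a restore window, then takes the larger of the two. Everything beyond the 400 MB/sec figure is an assumption for illustration: the names `units_required` and `size_environment`, the `restore_fraction` parameter (restore speed as a fraction of backup speed, which varies by product), and the simplification that work spreads evenly across independent units.

```python
import math


def units_required(data_tb: float, window_hours: float, throughput_mb_s: float) -> int:
    """Number of units needed to move data_tb within window_hours.

    Assumes decimal units (1 TB = 1,000,000 MB) and that the load
    divides evenly across independent units.
    """
    per_unit_tb = throughput_mb_s * window_hours * 3600 / 1_000_000
    return math.ceil(data_tb / per_unit_tb)


def size_environment(
    backup_tb: float,
    backup_window_h: float,
    restore_tb: float,
    restore_window_h: float,
    backup_mb_s: float = 400.0,
    restore_fraction: float = 0.5,  # hypothetical value, not a measured figure
) -> int:
    """Units needed to satisfy both the backup and the restore requirement."""
    backup_units = units_required(backup_tb, backup_window_h, backup_mb_s)
    restore_units = units_required(
        restore_tb, restore_window_h, backup_mb_s * restore_fraction
    )
    # Whichever requirement is harder to meet drives the unit count.
    return max(backup_units, restore_units)


if __name__ == "__main__":
    # 30 TB nightly backup in 24 h, but 20 TB must be restorable in 8 h
    # on a system that restores at a quarter of its backup speed.
    print(size_environment(30, 24, 20, 8, restore_fraction=0.25))
```

In that example a single unit covers the backup window, yet the restore requirement pushes the count to seven. The exact numbers depend entirely on the assumed restore fraction, so substitute measured restore rates for any system you are evaluating.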
In summary, the key factor driving the implementation of multiple deduplication systems is performance. If you end up with multiple systems, it is most likely due to a need for more performance, NOT more capacity. When selecting a solution, you must focus on your performance needs both today and in the future and not be distracted by the promise of massive deduplication capacity.