One of the hidden landmines of deduplication is its impact on restore performance. Most vendors gloss over this issue in their quest to sell bigger and faster systems. Credit goes to Scott from EMC, who acknowledged that restore performance declines on deduplicated data in the DL3D. We have seen other similar solutions suffer restore performance degradation of more than 60% over time. Remember, the whole point of backing up is to be able to restore when necessary. If you are evaluating deduplication solutions, you must consider several questions.
- What are the implications for your business of declining restore performance?
- What is it about deduplication technology that hurts restore performance?
- Can you reduce the impact on restore performance?
- Is there a solution that does not have this limitation?
All deduplication algorithms replace redundant data with pointers. The primary differences between the various approaches revolve around how redundancies are found and how the redundant information is stored. Restore performance suffers because re-creating a backup requires the system to traverse hundreds or even thousands of pointers, which is very I/O intensive. To understand why, we need to look more closely at how deduplication works.
When deduplicating data, there is an initial backup which is maintained in its entirety. All subsequent backups are analyzed for redundancies, with pointers replacing the redundant data. The pointers can link to data that is stored anywhere in the disk system. An example will clarify. Suppose you implement a deduplication solution on January 1, 2008 and continually send full nightly backups to it. When you run tonight’s backup, the deduplication algorithm will probably find some data that has not changed since the first backup and therefore create a pointer back to January. It will also likely find redundant data from the previous night’s backup and create a pointer to that. It follows that the algorithm can and will create pointers to any past backup, and the newest backup will always include the most pointers. (Note that this is a simplistic model that assumes nightly full backups and also assumes that each full backup includes some net new data that is not deduplicated.)
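The simplistic model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the class name `ReverseDedupStore`, the fixed 4-byte chunk size and the sample data are all hypothetical, and real products use far larger chunks or entirely different methods of finding redundancy.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only to keep the example readable


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Cut a backup stream into fixed-size chunks (an assumption of this sketch)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ReverseDedupStore:
    """Toy store: the first copy of a chunk is kept, later copies become pointers."""

    def __init__(self):
        self.owner = {}       # fingerprint -> id of the backup that first stored the chunk
        self.chunk_data = {}  # fingerprint -> chunk bytes
        self.recipes = []     # per backup: list of (fingerprint, owning backup id)

    def backup(self, data: bytes) -> int:
        backup_id = len(self.recipes)
        recipe = []
        for chunk in split_chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.owner:
                # Net new data: store it and record this backup as its owner.
                self.owner[fp] = backup_id
                self.chunk_data[fp] = chunk
            recipe.append((fp, self.owner[fp]))
        self.recipes.append(recipe)
        return backup_id

    def pointer_count(self, backup_id: int) -> int:
        """Number of chunks in this backup that point back to an older backup."""
        return sum(1 for _, owner in self.recipes[backup_id] if owner != backup_id)

    def restore(self, backup_id: int) -> bytes:
        return b"".join(self.chunk_data[fp] for fp, _ in self.recipes[backup_id])


store = ReverseDedupStore()
nights = [
    b"AAAABBBBCCCCDDDD",      # first full backup, stored in its entirety
    b"AAAABBBBCCCCEEEE",      # one chunk changed
    b"AAAABBBBCCCCEEEEFFFF",  # one chunk of net new data added
]
ids = [store.backup(n) for n in nights]
pointer_counts = [store.pointer_count(i) for i in ids]
```

With this sample data the first backup holds no pointers, and the third backup points back to both the first and the second night, which is the pattern described above.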
Restore performance (and vaulting or tape copy performance) will suffer due to the fragmentation caused by pointers. As mentioned above, the more data you retain online, the more pointers you create and the more fragmented your data becomes. Restoring data then becomes a random I/O intensive process, resulting in declining performance. Additionally, restore performance is always going to be worst on the newest backup, since it typically contains the most pointers; conversely, the best restore performance will be found on the oldest data (i.e. the data with the fewest pointers).
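To make the fragmentation effect concrete, here is a rough Python sketch that counts how often a restore has to jump to a non-adjacent location on disk. The append-only disk layout, the `ingest` and `seeks` helpers and the chunk labels are assumptions made for illustration; real disk systems and deduplication engines lay data out in more complicated ways.

```python
def ingest(disk, index, chunks):
    """Append unseen chunks to an append-only disk log.

    Returns the backup's recipe: the disk offset of every chunk, in restore order.
    """
    recipe = []
    for chunk in chunks:
        if chunk not in index:
            index[chunk] = len(disk)  # new data lands at the end of the log
            disk.append(chunk)
        recipe.append(index[chunk])   # redundant data is just a pointer to an old offset
    return recipe


def seeks(recipe):
    """Count reads that do not follow directly after the previous chunk on disk."""
    return sum(1 for prev, cur in zip(recipe, recipe[1:]) if cur != prev + 1)


disk, index = [], {}
# Chunk labels stand in for chunk contents; one chunk changes each night.
nights = [
    ["a1", "b1", "c1", "d1", "e1", "f1"],
    ["a1", "b2", "c1", "d1", "e1", "f1"],
    ["a1", "b2", "c1", "d3", "e1", "f1"],
    ["a1", "b2", "c1", "d3", "e1", "f4"],
]
recipes = [ingest(disk, index, night) for night in nights]
seek_counts = [seeks(r) for r in recipes]
```

In this toy run the first backup restores as one sequential read, while each later backup needs more jumps than the one before it, so the newest backup is the most fragmented.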
What is forward referencing and how can it help?
With forward referencing technology, the pointers are reversed. In the reverse referencing process described above, the newest backup is made up of pointers that point backwards in time to older backups. With forward referencing, the newest backup is maintained in its entirety and all past backups are made up of pointers that point forward to the newest data. This approach, as used in SEPATON’s DeltaStor technology, ensures the fastest restore performance on the newest backup because the newest data is maintained in its entirety (i.e. with no pointers).
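The idea of reversing the pointers can be illustrated with a small Python sketch. It is a toy model only and does not describe how DeltaStor is actually implemented: the `ForwardRefStore` class, its methods and the chunk labels are hypothetical. Each new backup is stored whole, and older backups are rewritten so that any data they share with the newest backup becomes a pointer forward to it.

```python
class ForwardRefStore:
    """Toy model: the newest backup is stored whole, older backups point forward."""

    def __init__(self):
        # Per backup: a list whose entries are either a chunk (str, real data)
        # or a (backup_id, position) tuple, i.e. a pointer to a newer backup.
        self.backups = []

    def _resolve(self, entry):
        """Follow pointers until real chunk data is reached."""
        while isinstance(entry, tuple):
            backup_id, position = entry
            entry = self.backups[backup_id][position]
        return entry

    def backup(self, chunks):
        new_id = len(self.backups)
        first_pos = {}
        for i, chunk in enumerate(chunks):
            first_pos.setdefault(chunk, i)
        self.backups.append(list(chunks))  # newest backup kept in its entirety
        for old in self.backups[:new_id]:
            for i, entry in enumerate(old):
                content = self._resolve(entry)
                if content in first_pos:
                    # Shared data: point the old backup forward to the newest copy.
                    old[i] = (new_id, first_pos[content])
        return new_id

    def pointer_count(self, backup_id):
        return sum(isinstance(e, tuple) for e in self.backups[backup_id])

    def restore(self, backup_id):
        return [self._resolve(e) for e in self.backups[backup_id]]


store = ForwardRefStore()
nights = [
    ["a1", "b1", "c1", "d1"],
    ["a1", "b2", "c1", "d1"],
    ["a1", "b2", "c3", "d1"],
]
ids = [store.backup(n) for n in nights]
pointer_counts = [store.pointer_count(i) for i in ids]
```

Here the newest backup contains no pointers at all and can be read straight through, while the two older backups carry the pointers instead. The trade-off in this sketch is extra work at ingest time, since older backups are rewritten whenever a new one arrives.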
What does this mean to me?
Restore performance is important for companies of all sizes and should be evaluated carefully. When reviewing deduplication solutions, you must test restore performance with retained data. Simply completing one backup and restoring it is not an accurate reflection of performance over the long term and you will be disappointed if you assume otherwise. Forward referencing avoids the restore problems inherent in the algorithms from EMC/Quantum and most other deduplication vendors and should also be considered. Only you can decide how important restore performance is to your business, and it is important to understand the issue so you can make an educated decision.