Defragmentation, rehydration and deduplication
W. Curtis Preston recently blogged about The Rehydration Myth. In his post, he discusses how restore performance on deduplicated data declines because of the way fragmented, deduplicated data must be reassembled from disk. He also addresses the ways various technologies attempt to overcome this problem, including disk caching, forward referencing (used by SEPATON’s DeltaStor technology) and built-in defrag. In this post, I want to discuss the last option because it is a widely used approach for inline deduplication with some little-known pitfalls.
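To make the fragmentation problem concrete, here is a minimal sketch of rehydration. It is my own illustration, not any vendor's implementation; the chunk_offsets and backup_recipe structures are hypothetical stand-ins for a real chunk index and backup recipe.

```python
# A minimal sketch (illustrative only) of why "rehydration" slows down as a
# deduplicated store fragments: a backup is stored as an ordered recipe of
# chunk references, and restoring it means fetching every chunk from
# wherever it happens to live on disk.
import random

random.seed(1)

# Hypothetical chunk store: chunk id -> byte offset somewhere on disk.
chunk_offsets = {i: random.randrange(0, 10**12) for i in range(10_000)}

# A backup image is just an ordered list of chunk ids (the "recipe").
backup_recipe = [random.randrange(10_000) for _ in range(2_000)]

def seek_cost(recipe):
    """Rough proxy for restore cost: total head movement visiting each chunk in order."""
    total, position = 0, 0
    for chunk_id in recipe:
        offset = chunk_offsets[chunk_id]
        total += abs(offset - position)
        position = offset
    return total

print(f"Total seek distance to rehydrate this backup: {seek_cost(backup_recipe):,} bytes")
```

The more scattered the referenced chunks, the larger that number becomes, and on spinning disk that translates directly into a slower restore.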
Defrag is resource intensive
Despite the benefits of defragmentation technology highlighted by vendors, the actual restore time improvement it delivers depends on the deduplication technology it is paired with, and it always requires substantial processing resources. Defragmentation typically requires numerous reads and writes to the filesystem as well as frequent access to the deduplication database, causing significant processing overhead and slowing system performance. Some technologies throttle the process, which may reduce the impact on system performance but lengthens the time the system is affected. It forces you to choose between very slow performance for a shorter time and somewhat slow performance for a longer time.
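A minimal sketch, assuming a generic chunk-store design rather than any specific product, shows where that overhead comes from: every chunk the defragmenter relocates costs a read, a write and an update to the deduplication database, and throttling only stretches the same work over a longer window.

```python
# Illustrative defragmenter for a toy chunk store (assumed design, not a
# specific product's code). Each relocated chunk is read from its old
# location, rewritten contiguously, and re-indexed in the dedup database.
import time

def defragment(recipe, disk, chunk_index, throttle_seconds=0.0):
    """Rewrite the chunks referenced by one backup into contiguous order.

    recipe      -- ordered list of chunk ids making up the backup
    disk        -- list of chunk payloads; the index position is the "offset"
    chunk_index -- dict of chunk id -> offset (the deduplication database)
    """
    for chunk_id in recipe:
        data = disk[chunk_index[chunk_id]]      # read from the old, scattered location
        disk.append(data)                       # write to a new, contiguous region
        chunk_index[chunk_id] = len(disk) - 1   # update the deduplication database
        if throttle_seconds:
            time.sleep(throttle_seconds)        # throttling: gentler, but for longer
```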
The other challenge is that defragmentation activities are typically scheduled. Most systems require you to set a window during which the defragmentation operation happens; Data Domain defaults to 6:00 AM on Tuesdays. (Do they have something against Tuesdays? 😀 ) The customer must know when the operation is scheduled and ideally minimize system activity during the window, which can be a challenge in rapidly changing enterprise environments. The process is typically performed in conjunction with a high-overhead deduplication process that Data Domain calls housekeeping; other inline systems have similar processes. During housekeeping, which can take as long as 14 hours, the deduplication software acknowledges expired data, updates pointers, eliminates duplicate data from the disk, and, as the name suggests, cleans up its databases.
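Housekeeping is essentially garbage collection over the chunk store. The sketch below assumes a generic mark-and-sweep design, not Data Domain's actual implementation, but it gives a feel for why the process touches so much metadata and can run for many hours on a large system.

```python
# Illustrative housekeeping pass over a toy chunk store (assumed
# mark-and-sweep design, not any vendor's actual code): drop expired
# backups, find every chunk a live backup still references, then reclaim
# the rest from disk and from the database.
def housekeeping(backups, chunk_store):
    """Acknowledge expired backups and reclaim chunks no live backup references.

    backups     -- dict of backup name -> {"expired": bool, "recipe": [chunk ids]}
    chunk_store -- dict of chunk id -> payload (stands in for both disk and database)
    """
    # Acknowledge expired data by dropping those backups' recipes.
    live = {name: b for name, b in backups.items() if not b["expired"]}

    # Mark: collect every chunk still referenced by a live backup.
    referenced = {chunk_id for b in live.values() for chunk_id in b["recipe"]}

    # Sweep: remove unreferenced chunks from disk and clean up the database.
    for chunk_id in list(chunk_store):
        if chunk_id not in referenced:
            del chunk_store[chunk_id]
    return live
```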
Defrag may not be beneficial
This seems counterintuitive, but in some implementations defrag will actually slow restore performance, particularly when it runs after cleaning and housekeeping have completed. I know of an enterprise customer that ran into this in its test environment. The customer performed 20 weeks of full and incremental backups with the defrag or “clean” process disabled and then tested restore performance. They then ran the same restore test after running “clean” three times. The results are shown in the graph below.
The customer’s restore performance declined by about 40 percent after cleaning. This is because the clean process replaced data with pointers, increasing the number of pieces of data that had to be retrieved and reassembled during a restore. In effect, the process actually adds to the data fragmentation. It did increase deduplication ratios, but at the cost of a dramatic reduction in restore performance.
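A rough back-of-envelope model shows why a drop of that size is plausible. The numbers below are assumptions I made up for illustration, not the customer's measurements: if restore reads are seek-dominated, restore time grows with the number of separate on-disk fragments even though the amount of data is unchanged.

```python
# Back-of-envelope restore model (assumed numbers, not measured data):
# one seek per fragment plus the sequential transfer of the data itself.
SEEK_MS = 8.0             # assumed average seek + rotational latency per fragment
TRANSFER_MS_PER_MB = 5.0  # assumed sequential transfer time per megabyte

def restore_seconds(total_mb, fragments):
    """Model restore time as one seek per fragment plus the data transfer."""
    return (fragments * SEEK_MS + total_mb * TRANSFER_MS_PER_MB) / 1000

before = restore_seconds(total_mb=500_000, fragments=2_000_000)   # pre-clean layout
after = restore_seconds(total_mb=500_000, fragments=3_500_000)    # post-clean layout
print(f"restore throughput change: {(before / after - 1) * 100:.0f}%")
# With these assumed inputs the model prints roughly -39%, i.e. a decline
# on the order of what the customer measured.
```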
The key takeaway is that end users must carefully review their deduplication solutions and understand how backup and restore performance changes over time. Testing a single backup and restore is not an accurate representation of the long-term performance of the device. The concepts of disk caching, forward referencing and defragmentation may seem unimportant, but they can have a major long-term impact on the manageability and performance of a device.