
TSM and Deduplication: 4 Reasons Why TSM Deduplication Ratios Suffer

TSM presents unique deduplication challenges due to its progressive incremental backup strategy and architectural design. This contrasts with the traditional full/incremental model used by competing backup software vendors. The result is that TSM users will see smaller deduplication ratios than their counterparts using NetBackup, NetWorker or Data Protector. This post explores four key reasons why TSM is difficult to deduplicate.

Progressive incremental model
TSM’s incremental-only approach presents the biggest challenge to deduplication. By sending only changed data, TSM limits the amount of redundant information stored and hence the deduplication benefit. This is in stark contrast to most other backup applications, which rely on frequent full backups; it is these jobs that provide the best data reduction. However, TSM is often configured to perform nightly full backups of data types such as Exchange, Oracle or SQL. These backups provide deduplication benefits similar to those seen in other applications, but they typically represent only a fraction of the total data protected.
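To make the effect concrete, here is a minimal back-of-the-envelope sketch in Python. The sizes and change rate are assumptions, not measurements; the point is that repeated fulls inflate the amount of logical data sent while TSM sends it only once, even though both approaches leave roughly the same unique data on disk.

# Back-of-the-envelope comparison (hypothetical sizes and change rate) of the
# deduplication ratio seen by a weekly-full/daily-incremental scheme versus
# TSM's progressive incremental model over one month.

def dedup_ratio(logical_bytes_sent, unique_bytes_stored):
    # Ratio reported by the appliance: data written by the backup application
    # divided by the unique data actually kept on disk.
    return logical_bytes_sent / unique_bytes_stored

FULL_GB = 1000        # size of one full backup of the client (assumed)
DAILY_CHANGE = 0.02   # fraction of data that changes each day (assumed)
DAYS = 30

# Traditional model: a full every week, incrementals on the other days.
fulls = DAYS // 7 + 1
traditional_sent = fulls * FULL_GB + (DAYS - fulls) * FULL_GB * DAILY_CHANGE

# TSM progressive incremental: one initial full, then changed files only.
tsm_sent = FULL_GB + (DAYS - 1) * FULL_GB * DAILY_CHANGE

# Both approaches end up storing roughly the same unique data.
unique_stored = FULL_GB + DAYS * FULL_GB * DAILY_CHANGE

print(f"traditional ratio ~ {dedup_ratio(traditional_sent, unique_stored):.1f}:1")
print(f"TSM ratio         ~ {dedup_ratio(tsm_sent, unique_stored):.1f}:1")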

Data movement
Deduplication algorithms use various methods to recognize redundant information, analyzing data as it is written to the system to find the redundancies. In traditional backup applications, data is typically written to a disk or tape device and remains there until expiration. In contrast, TSM writes data to a given storage pool and then runs multiple processes, such as reclamation, that move the data. These processes create deduplication challenges because they force the appliance to constantly re-hydrate and then re-deduplicate the data. The frequent data movement creates inconsistent data patterns that can be difficult for deduplication mechanisms to recognize, resulting in decreased data reduction.
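As a rough sketch of what that extra work looks like (a generic, simplified model, not any particular appliance's actual pipeline), the snippet below shows the two steps a deduplicating target has to repeat every time TSM rewrites data it has already stored: rebuild the original stream from its segments, then chunk and fingerprint it all over again.

# Generic sketch (assumed, simplified model) of the rehydrate/re-deduplicate
# cycle a deduplicating target performs when TSM reclamation rewrites data.

import hashlib

def dedupe(stream, segment_store, chunk_size=4096):
    # Split the stream into fixed-size chunks and store only unseen ones.
    # Returns the "recipe" of fingerprints needed to rebuild the stream.
    recipe = []
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        segment_store.setdefault(fp, chunk)
        recipe.append(fp)
    return recipe

def rehydrate(recipe, segment_store):
    # Reassemble the original byte stream from its stored segments.
    return b"".join(segment_store[fp] for fp in recipe)

store = {}
original = dedupe(b"example TSM volume contents " * 1000, store)

# When TSM moves the volume to another pool, the appliance effectively does:
stream = rehydrate(original, store)   # read and reassemble every segment
rewritten = dedupe(stream, store)     # chunk and hash the whole stream again

# No new data is stored, but every byte was read, hashed and compared again.
assert rewritten == original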

Data fragmentation
TSM is designed to maximize performance by using multi-streaming. This means that a given job could back up to almost any disk or tape device and its data could be multiplexed, or mixed, with data from any other server. This is a challenge because the data becomes fragmented and is written to many different physical tape or disk locations. Deduplication struggles because these segments can be small, and the algorithm must effectively recognize the varying block sizes to achieve the best possible data reduction. In practice, this fragmentation reduces data reduction.
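The toy example below (assumed data and chunk size, not TSM's actual on-media format) shows the mechanism: when a few hundred bytes from another client land ahead of an otherwise identical stream, every fixed-size chunk boundary shifts, the fingerprints no longer match, and none of the repeated data is recognized.

# Toy illustration (assumed data and chunk size) of how interleaved streams
# shift chunk boundaries and defeat fixed-size deduplication.

import hashlib
import random

def fingerprints(data, chunk_size=1024):
    # Fixed-size chunking: fingerprint every chunk_size slice of the stream.
    return {hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)}

random.seed(0)
client_a = bytes(random.getrandbits(8) for _ in range(16 * 1024))  # one client's stream
client_b = bytes(random.getrandbits(8) for _ in range(512))        # fragment of another client

stored = fingerprints(client_a)                  # stream written on its own
multiplexed = fingerprints(client_b + client_a)  # same payload behind 512 B of other data

matches = len(stored & multiplexed)
print(f"duplicate chunks recognized: {matches} of {len(stored)}")
# Prints 0 of 16 here; variable-length (content-defined) chunking copes better,
# but small, shifting segments still cost data reduction.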

Reclamation/Overwrites
Since TSM typically spreads a backup job across multiple pieces of media, data becomes fragmented. TSM addresses this with a process called reclamation, which reduces fragmentation by moving unexpired data onto new cartridges and expiring the old ones. Reclamation is designed to minimize the number of cartridges required in a TSM environment, but it is I/O intensive. This is a challenge for deduplication systems that rely on a batch cleaning process: they run a weekly, or more frequent, job to delete expired cartridges, and free space is not realized until that cleaning completes. Deduplication ratios decline in the meantime because the expired cartridges still consume disk space. The other challenge is that the cleaning process itself can be I/O intensive and can negatively impact other TSM processes.
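The sketch below (a simplified, hypothetical space-accounting model, not a specific product's garbage collector) shows why the ratio dips between cleaning cycles: segments referenced only by expired volumes keep consuming disk until the next batch cleaning run.

# Simplified, hypothetical model of batch cleaning on a deduplicating target:
# expiring a TSM volume does not free disk space until the next cleaning run.

class DedupStore:
    def __init__(self):
        self.segments = {}   # fingerprint -> segment size in bytes
        self.refs = {}       # fingerprint -> set of volumes that reference it

    def write(self, volume, fingerprint, size):
        self.segments[fingerprint] = size
        self.refs.setdefault(fingerprint, set()).add(volume)

    def expire_volume(self, volume):
        # TSM reclamation/expiration drops the volume, but space is not freed yet.
        for owners in self.refs.values():
            owners.discard(volume)

    def clean(self):
        # Batch cleaning: remove segments that no live volume references.
        dead = [fp for fp, owners in self.refs.items() if not owners]
        for fp in dead:
            del self.segments[fp]
            del self.refs[fp]

    def physical_bytes(self):
        return sum(self.segments.values())

store = DedupStore()
store.write("VOL001", "seg-aa", 4096)
store.write("VOL002", "seg-bb", 4096)

store.expire_volume("VOL001")
print("before cleaning:", store.physical_bytes(), "bytes")  # 8192: expired data still on disk
store.clean()                                               # weekly (or more frequent) batch job
print("after cleaning: ", store.physical_bytes(), "bytes")  # 4096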

TSM is a very powerful application with a unique data protection model. Unfortunately, the same technology that reduces backup windows also creates challenges for deduplication. The result is that deduplication ratios decrease in TSM environments. However, even with smaller ratios, deduplication still provides a strong benefit for TSM environments particularly with deduplicated replication.

2 replies on “TSM and Deduplication: 4 Reasons Why TSM Deduplication Ratios Suffer”

Can you talk about how much lower the ratios are? Is it safe to say half of other systems or is it less?

BTW, TSM users will object to the use of the word multiplexing to describe what they do, but I believe it is appropriate. It multiplexes at the job/client level, though, where other products mpx at the block level.

Curtis, thank you for your comment.

Unfortunately, there is no absolute answer to your question. TSM deduplication results will vary depending on the data protected. As mentioned, a TSM environment performing frequent full backups will see deduplication ratios closer to those of traditional environments. Conversely, if the backups are primarily file-level progressive incrementals, the ratios will be much lower.

In a typical environment, I think that a 50% reduction in ratios is a reasonable assumption.
