Backup Deduplication

6 Reasons not to Deduplicate Data

Deduplication is a hot buzzword these days. I previously posted about how important it is to understand your business problems before evaluating data protection solutions. Here are six reasons why you might not want to deduplicate data.

1. Your data is highly regulated and/or frequently subpoenaed
The challenge with these types of data is whether deduplicated data meets compliance requirements. Jon Toigo over at Drunken Data has numerous posts on this topic, including feedback from a corporate compliance user group. In short, companies need to review deduplication carefully in the context of their regulatory requirements. The issue is not actual data loss, but the risk of someone challenging the validity of subpoenaed data that was stored on deduplicated disk. The defendant would then face the added burden of proving the validity of the deduplication algorithm. (Many large financial institutions have decided that they will never deduplicate certain data for this reason.)

2. You are deduplicating at the client level
Products like PureDisk from Symantec, Televaulting from Asigra or Avamar from EMC deduplicate data at the client level. With these solutions, the client bears the burden of deduplication and only transfers deduplicated (i.e., net new) data across the LAN. The master server maintains a disk repository containing only deduplicated data. Trying to deduplicate the already deduplicated repository again will yield no additional storage savings.
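To see why a second pass adds nothing, here is a minimal sketch of client-side, hash-based deduplication (the chunk size, function names, and dictionary "server" are all illustrative assumptions, not any vendor's actual design). The client hashes fixed-size chunks and only "transfers" chunks the server has not seen; once the repository holds only unique chunks, running the same process over it again finds nothing new to remove.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size for illustration

def dedupe_backup(data: bytes, server_index: dict) -> int:
    """Send only chunks whose hash the server has not seen; return bytes sent."""
    sent = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in server_index:
            server_index[digest] = chunk  # "transfer" and store the new chunk
            sent += len(chunk)
    return sent

index = {}
payload = b"A" * 16384                  # four identical 4 KB chunks
first = dedupe_backup(payload, index)   # only one unique chunk crosses the wire
second = dedupe_backup(payload, index)  # repository already has it: nothing sent
```

Here `first` is 4096 bytes (one unique chunk out of four) and `second` is 0, which is exactly the point: the repository already contains only unique chunks.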

3. You constantly generate new and unique data
Deduplication looks for redundancies inside data. If all the data you generate is completely new and unique, there are no redundancies to find and deduplication provides no benefit.
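A quick way to see this is to chunk a stream of fresh, unique data (random bytes stand in for it here) and count distinct chunk fingerprints; the chunk size and the use of SHA-256 are illustrative assumptions.

```python
import hashlib
import os

CHUNK = 4096
data = os.urandom(CHUNK * 64)  # 256 KB of fresh, unique data

# Fingerprint every chunk; duplicates would collapse into one hash.
hashes = {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
          for i in range(0, len(data), CHUNK)}

savings = 1 - len(hashes) / 64  # fraction of chunks that could be shared
```

Every one of the 64 chunks hashes differently, so `savings` is 0: the deduplication engine does all its work and reclaims nothing.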

4. You encrypt your data
Encryption secures data by randomizing its contents. This provides security benefits, but makes the data effectively impossible to deduplicate because the randomization hides the redundancies.
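The effect can be demonstrated with a toy stream cipher (SHA-256 in counter mode, for illustration only; it is not real cryptography and the key/nonce handling is an assumption of this sketch). Two identical plaintext blocks encrypted under different nonces produce completely different ciphertexts, so their chunk fingerprints no longer match and deduplication sees nothing shared.

```python
import hashlib
import os

def toy_stream_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy XOR stream cipher: keystream = SHA-256(key | nonce | counter).
    Illustration only; do not use for real encryption."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(plaintext):
        keystream += hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

key = b"secret-key"
block = b"B" * 4096                    # the same plaintext block, twice
n1, n2 = os.urandom(16), os.urandom(16)
c1 = toy_stream_encrypt(key, n1, block)
c2 = toy_stream_encrypt(key, n2, block)
# c1 != c2: identical plaintext, but no redundancy survives encryption
```

Because XOR is its own inverse, re-running `toy_stream_encrypt` with the same nonce recovers the plaintext, but a deduplication engine only ever sees the divergent ciphertexts.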

5. You compress data at the client level
This is similar to the previous point. Like encryption, compression changes the structure of the data, making it difficult to find redundancies. If you enable client-side compression in your backup application, you will see little or no benefit from deduplication.
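A small experiment with zlib (the chunk size is an illustrative assumption) shows why. Change one byte at the front of a repetitive file: uncompressed, every chunk after the first still matches its counterpart, so deduplication shares almost everything; compressed, the streams diverge from the start and no chunks line up.

```python
import zlib

base = b"log line: user logged in\n" * 2000   # 50,000 bytes, highly repetitive
edited = b"X" + base[1:]                      # one-byte change at the front

z1, z2 = zlib.compress(base), zlib.compress(edited)

# Count matching 4 KB chunks before and after compression.
same_plain = sum(base[i:i + 4096] == edited[i:i + 4096]
                 for i in range(0, len(base), 4096))
same_comp = sum(z1[i:i + 4096] == z2[i:i + 4096]
                for i in range(0, min(len(z1), len(z2)), 4096))
```

Uncompressed, 12 of the 13 chunks still match; compressed, none do. The compressed representation has scrambled exactly the redundancy a deduplication engine depends on.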

6. You are backing up test data
Test data is information used for short-term testing and is frequently overwritten; its retention time is very short. The problem with deduplicating this data is that the rapid churn prevents you from realizing the benefits of deduplication, a scenario similar to reason 3.

These six points illustrate scenarios where deduplication should be avoided. That said, companies can still benefit from deduplication even when one or more of these issues apply. The key is to identify the data that cannot be deduplicated and, ideally, back it up without deduplication. In these cases, look for solutions that allow deduplication to be disabled for specific data types or backup jobs. SEPATON’s DeltaStor meets this requirement, and other solutions should be evaluated carefully with this feature in mind.

