I recently attended a show and enjoyed speaking with a variety of end users with different levels of interest and knowledge. One thing I found was that attendees were obsessed with the question of inline vs. post process vs. concurrent process deduplication. Literally, people would come up and say, “Do you do inline or post process dedupe?” This is crazy. Certainly there are differences between the approaches, but the real issue should be data protection, not arcane techno-speak.
Before I go into details, let me start with the basics. Inline deduplication means that deduplication occurs in the primary data path; no data is written to disk until the deduplication process is complete. The other two approaches, post process and concurrent process, first store data on disk and then deduplicate. As the name suggests, post process approaches do not begin deduplication until all backups are complete. The concurrent process approach can begin deduplication before the backups are complete, backing up and deduplicating at the same time. Let’s look at each of these in more detail.
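First, though, to ground the terminology: all three approaches rest on the same basic step of fingerprinting incoming chunks of data and storing only the chunks that have not been seen before. Here is a minimal Python sketch of that step. It is purely illustrative; the fixed-size chunks, SHA-1 fingerprints, in-memory dictionaries, and the name `dedupe_stream` are all my simplifying assumptions, not any vendor's actual design.

```python
import hashlib
import io

CHUNK_SIZE = 8 * 1024  # fixed-size chunks keep the example simple

def dedupe_stream(stream, index, store):
    """Fingerprint each chunk; store only chunks not seen before.

    `index` tracks known fingerprints, `store` holds unique chunk data.
    Returns the fingerprint list ("recipe") that reconstructs the stream.
    """
    recipe = []
    while chunk := stream.read(CHUNK_SIZE):
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in index:           # new data: written exactly once
            store[fp] = chunk
            index[fp] = True
        recipe.append(fp)             # duplicates become cheap references
    return recipe

index, store = {}, {}
recipe = dedupe_stream(io.BytesIO(b"A" * 4 * CHUNK_SIZE), index, store)
print(len(recipe), len(store))        # 4 references, but only 1 stored chunk
```

Where the three approaches differ is not this step itself but *when* it runs relative to the backup stream hitting disk.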
Inline
The benefit of this approach is that it uses the least disk space because duplicate data is never written to disk. In the case of deduplicated replication, this also means that replication can begin immediately. However, there is a major trade-off: by deduplicating inline, your performance is limited by the speed of your deduplication engine, and scalability is typically limited as well. A classic example is Data Domain, whose fastest solution is 388 MB/sec and 48 TB raw. This may sound impressive, but remember that we still have the problem of declining restore performance (due to reverse referencing), and there is no way to grow this system’s capacity or performance without adding separate systems. Diligent provides a better scalability model, although their performance is still limited and they require costly Fibre Channel drives.
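To see why the engine becomes the bottleneck, here is a rough sketch of an inline ingest loop. Note that nothing lands on disk until the fingerprint lookup completes, which is exactly what caps throughput, and why deduped replication can start chunk by chunk. The names (`inline_ingest`, `replicate`) and the hash-per-chunk scheme are hypothetical simplifications.

```python
import hashlib

def inline_ingest(stream, index, store, replicate, chunk_size=8192):
    """Inline: fingerprinting sits in the primary data path.

    Nothing is written until the lookup completes, so ingest speed is
    capped by the hash-and-lookup rate. Because only unique chunks are
    ever written, each one can be handed to replication immediately.
    """
    while chunk := stream.read(chunk_size):
        fp = hashlib.sha1(chunk).hexdigest()   # happens before any write
        if fp not in index:
            store[fp] = chunk                  # only unique data hits disk
            index[fp] = True
            replicate(fp, chunk)               # deduped replication starts now

# e.g.: inline_ingest(backup_stream, {}, {},
#                     lambda fp, c: wan_queue.append((fp, c)))
```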
Post Process
The benefit of this approach is that data is moved to the safety of the VTL without being slowed by deduplication, so these systems typically accept data at much faster rates than the inline approach. They require more disk space than inline systems because they need a holding area to store data before deduplication begins. When implementing this approach, you must provision enough holding space to prevent the device from running out and failing backups. Replication with these systems may be delayed because you cannot replicate data until the deduplication process is complete. Sample vendors using this approach include EMC and Quantum.
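The two-phase structure is the key point, so here is a sketch of it, again with hypothetical names and a generic hash-based engine standing in for any real product: backups land in the holding area at full disk speed, and deduplication runs only after every job has finished.

```python
import hashlib

holding_area = []                     # raw landing zone: size it generously

def land_backup(stream, chunk_size=8192):
    """Phase 1: stream raw data to the holding area at full disk speed.
    No dedup work happens here, which is why ingest is fast."""
    job = []
    while chunk := stream.read(chunk_size):
        job.append(chunk)             # stand-in for a raw write to disk
    holding_area.append(job)

def post_process_dedupe(index, store):
    """Phase 2: runs only after *all* backup jobs have landed."""
    for job in holding_area:
        for chunk in job:
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in index:
                store[fp] = chunk
                index[fp] = True
    holding_area.clear()              # only now is holding space reclaimed
```

The gap between phase 1 and phase 2 is also why replication lags: there is nothing deduplicated to send until phase 2 finishes.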
Concurrent Process
This approach is a mix of the other two. Like inline, it deduplicates data while backups are ongoing. Like post process, it can move data to the safety of the VTL without being limited by the deduplication engine. With this approach, data is backed up to a holding area inside the VTL, and deduplication of that data begins as soon as the first backup job completes. Thus, unlike post process, deduplication begins immediately and runs as a background process even as other backups continue. The benefit of this approach is that it uses a smaller disk holding area than the post process approach, and replication can begin sooner. It also provides scalability benefits when compared to inline. SEPATON’s DeltaStor software uses this approach.
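Structurally, this looks like a background worker consuming finished jobs from a queue while new backups keep landing. The sketch below shows that shape; it is a generic illustration of the concurrent model, not a representation of how DeltaStor is actually implemented, and all names are mine.

```python
import hashlib
import queue
import threading

completed_jobs = queue.Queue()        # each backup job is queued as it finishes
index, store = {}, {}

def dedupe_worker():
    """Starts on the first finished job and keeps working while other
    backups are still streaming into the holding area."""
    while (job := completed_jobs.get()) is not None:   # None = shutdown signal
        for chunk in job:
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in index:
                store[fp] = chunk
                index[fp] = True
        # this job's slice of the holding area can be freed right away,
        # which is why the holding area stays smaller than post process

threading.Thread(target=dedupe_worker, daemon=True).start()
# backup jobs call completed_jobs.put(job_chunks) as each one completes,
# then completed_jobs.put(None) once the backup window ends
```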
In summary, each of these approaches has different strengths and weaknesses. Inline has the benefit of the smallest disk footprint and the fastest time to start replication, but the trade-off is slower time to safety and limited capacity and performance; it is often best for smaller environments. Post process uses the most disk space and has the longest delay until replication can begin, but it provides faster performance and greater scalability than inline. Concurrent process provides a balance between the other two: it requires more disk space than inline and less than post process, has a shorter replication lag than post process, and provides strong performance and scalability.
From a business standpoint, each customer must evaluate the technologies in the context of their requirements. The limited scalability inherent in most inline deduplication solutions may be an issue in larger environments. If you are a small enterprise, the approach matters less because any of these solutions will likely work for you; evaluate the options in terms of acquisition cost and business SLAs.
Larger environments have much more stringent performance and capacity needs. It can be difficult to use inline solutions in these enterprises given their limited scalability and performance; an inline approach will often require the end user to implement multiple systems to meet SLAs, which adds cost and complexity and negatively affects ROI. Post process and concurrent process solutions make the most sense for these environments.
Before choosing a deduplication solution, an end user must first understand their SLAs, and business requirements should be the driver of which solutions to evaluate. The opposite approach of choosing the technology first and then trying to fit it into the environment is problematic. As an end user, you should focus on how individual deduplication approaches will enable you to meet your business requirements. This post was meant to clarify the general differences between the various approaches; however, I encourage end users to focus on business value because that is what matters in the long run.