Deduplication

Deduplication Definition

Deduplication is a method of eliminating redundant data in a dataset. In a secure data deduplication process, a deduplication assessment tool identifies extra copies of data and deletes them so that only a single instance is stored.

Data deduplication software analyzes data to identify duplicate byte patterns. It confirms that one copy of each byte pattern is correct and valid, stores that copy, and uses it as a reference. Any further request to store the same byte pattern results in an additional pointer to the previously stored copy.
 

What is deduplication?

Data deduplication allows users to reduce redundant data and manage backup activity more effectively, which leads to more efficient backups, cost savings, and load-balancing benefits.

 

 

Data deduplication explained

There is more than one kind of data deduplication. In its most basic form, the process happens at the level of single files, eliminating identical files. This is also called single instance storage (SIS) or file-level deduplication.
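File-level deduplication can be illustrated with a minimal Python sketch. It assumes files are held in memory as a dictionary of names to bytes; the function name `dedupe_files` and the sample file names are hypothetical and are not taken from any particular product.

```python
import hashlib

def dedupe_files(files):
    """Keep one stored copy per distinct file content.

    files: dict mapping file name -> file content (bytes).
    Returns (store, catalog): store maps digest -> content,
    catalog maps file name -> digest (a pointer to the stored copy).
    """
    store = {}
    catalog = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        # Only the first file with this content is actually stored.
        store.setdefault(digest, content)
        catalog[name] = digest
    return store, catalog

# Hypothetical sample data: two identical files and one different file.
files = {
    "report.pdf": b"quarterly numbers",
    "report-copy.pdf": b"quarterly numbers",
    "notes.txt": b"meeting notes",
}
store, catalog = dedupe_files(files)
print(len(files), "files,", len(store), "stored copies")
```

Note that a file differing by even one byte gets a different digest and is stored in full, which is the limitation that block-level deduplication addresses.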

At the next level, deduplication identifies and eliminates redundant segments of data, even when the files that contain them are not entirely identical. This is called block-level deduplication or sub-file deduplication, and it frees up more storage space than file-level deduplication alone. When most people say deduplication, they are referring to block-level deduplication. If they mean file-level deduplication, they will use that modifier.

Most block-level deduplication occurs at fixed block boundaries, but there is also variable-length deduplication or variable block deduplication, where data is split up at non-fixed block boundaries. Once the dataset has been split into a series of small pieces of data, referred to as chunks or shards, the rest of the process usually remains the same.
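The difference between fixed and variable block boundaries can be sketched in Python. This is a toy illustration only: the boundary rule in `variable_chunks` (cut where the sum of the last few bytes is divisible by a chosen number) is a simplified stand-in for the rolling-hash fingerprints real systems use, and all function names and parameter values are assumptions made for the example.

```python
import random

def fixed_chunks(data, size=8):
    """Split data at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=4, divisor=8, max_size=64):
    """Split data where the content itself signals a boundary.

    A chunk ends when the sum of the last `window` bytes is divisible by
    `divisor`, or when the chunk reaches `max_size` bytes.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < window:
            continue  # enforce a minimum chunk length of `window` bytes
        fingerprint = sum(data[i - window + 1:i + 1])
        if fingerprint % divisor == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

def shared(a, b):
    """Count chunks of b that also occur somewhere in a."""
    seen = set(a)
    return sum(1 for chunk in b if chunk in seen)

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(2000))
shifted = b"!" + original  # the same data with one byte inserted at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
variable_shared = shared(variable_chunks(original), variable_chunks(shifted))
print("fixed-size chunks still matching:", fixed_shared)
print("variable-size chunks still matching:", variable_shared)
```

Inserting a single byte shifts every fixed boundary, so the fixed-size chunks of the two versions no longer line up. Content-defined boundaries move with the data, so most variable-size chunks should still match.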

The deduplication system runs each shard through a hashing algorithm, such as SHA-1 or SHA-256 (a member of the SHA-2 family), which produces a cryptographic alphanumeric value (referred to as a hash) for the shard. The value of that hash is then checked against a hash table or hash database to see whether it has been seen before. If it has never been seen before, the new shard is written to storage and the hash is added to the hash table/database. If it has been seen before, the shard is discarded and an additional reference to the existing shard is added to the hash table/database.
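The lookup-and-store step described above might look like the following minimal Python sketch. It assumes an in-memory dictionary in place of a real hash database and uses fixed 4-byte shards for readability; the `ChunkStore` class and `store_stream` function are hypothetical names.

```python
import hashlib

class ChunkStore:
    """In-memory stand-in for the hash table/database and shard storage."""

    def __init__(self):
        self.chunks = {}    # hash -> shard bytes (each unique shard stored once)
        self.refcount = {}  # hash -> number of references to that shard

    def put(self, chunk):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.chunks:
            # Never seen before: write the shard and record its hash.
            self.chunks[digest] = chunk
            self.refcount[digest] = 1
        else:
            # Seen before: discard the shard and add a reference instead.
            self.refcount[digest] += 1
        return digest

    def get(self, digest):
        return self.chunks[digest]

def store_stream(store, data, size=4):
    """Split data into fixed-size shards and return the list of their hashes."""
    return [store.put(data[i:i + size]) for i in range(0, len(data), size)]

data = b"ABCDABCDABCDEFGH"
store = ChunkStore()
recipe = store_stream(store, data)

# The ordered list of hashes is enough to rebuild the original data.
restored = b"".join(store.get(digest) for digest in recipe)
print(len(recipe), "shards referenced,", len(store.chunks), "shards stored")
```

The ordered list of hashes (called `recipe` here) is what lets the system reassemble the original data from the unique shards at restore time.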

 

 

What are the benefits of deduplication?

Imagine how many times you make a tiny change to a document. A full-file incremental backup will back up the entire file again, even though you may have changed only one byte. Every critical business asset has the potential to hold duplicate data. In many organizations, up to 80 percent of corporate data is duplicate.

A customer using target deduplication (also called target-side deduplication), where the deduplication process runs inside a storage system once the native data is stored there, can save a lot of money on storage, cooling, floor space, and maintenance. A customer using source deduplication (also called source-side deduplication or client-side deduplication), where redundant data is identified at the source before being sent across the network, can save money on both storage and network bandwidth. This is because the redundant segments of data are identified before being transmitted.
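The bandwidth saving from source-side deduplication can be sketched as a simple exchange: the client sends hashes first and uploads only the segments the server does not already hold. The `BackupServer` class, its methods, and `source_side_backup` are hypothetical names for this illustration, and a real network call is replaced by a direct method call.

```python
import hashlib

class BackupServer:
    """Stand-in for the backup target; a real one would sit across a network."""

    def __init__(self):
        self.chunks = {}  # hash -> segment bytes

    def missing(self, digests):
        """Return the hashes the server has not stored yet."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk

def source_side_backup(server, data, size=4):
    """Back up data, sending only unseen segments. Returns payload bytes sent."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    sent = 0
    # Ask first, then transmit only what the server lacks.
    for digest in server.missing(list(by_digest)):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent

server = BackupServer()
first = source_side_backup(server, b"AAAABBBBCCCC")
second = source_side_backup(server, b"AAAABBBBDDDD")
print("first backup sent", first, "bytes; second backup sent", second, "bytes")
```

In this toy run the second backup transmits only the one segment that changed. The hashes themselves still cross the network, but they are far smaller than the data they describe.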

Source deduplication works very well with cloud storage and can improve backup speed notably. By reducing the amount of data and network bandwidth backup processes demand, deduplication streamlines the backup and recovery process. To decide when to use deduplication, consider if your business could benefit from these improvements.

 

 

What is a real-life deduplication example?

Imagine the manager of a business sends the same 1 MB file, a financial outlook report with graphics, to all 500 members of the team. The company’s email server is now storing all 500 copies of that file. If all email inboxes are then backed up, all 500 copies are saved, eating up 500 MB of server space. Even a basic file-level data deduplication system would save just one instance of the report. Every other instance just refers back to that single stored copy. This means the bandwidth and storage burden on the server is only the 1 MB of unique data.

Another example is what happens when companies perform full-file incremental backups, which copy an entire file even when only a few bytes have changed, and periodically perform full backups because of long-standing design limitations in backup systems. A 10 TB file server would create 80 TB of backups from eight weekly fulls alone, and probably another 8 TB or so of incremental backups over the same period. A good deduplication system can reduce this 88 TB to a small fraction of that size without lowering restore speed.

 

 

How does deduplication ratio to percentage work?

The deduplication ratio is the ratio of the amount of data that would be transmitted or stored without deduplication to the amount stored with deduplication. Deduplication can have a great impact on backup size, reducing it by up to 25:1 in a standard enterprise backup setting. This depends on how much duplicate data exists and how efficient the file deduplication algorithm is.
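Converting a ratio to a percentage follows directly from the definition: with a ratio of N:1, the stored data is 1/N of the original, so the space saved is (1 - 1/N) × 100 percent. A small Python sketch of that arithmetic, using a hypothetical function name:

```python
def ratio_to_percent_saved(ratio):
    """Convert a deduplication ratio of N:1 into percent of space saved."""
    if ratio < 1:
        raise ValueError("a deduplication ratio cannot be below 1:1")
    return (1 - 1 / ratio) * 100

for ratio in (2, 10, 25):
    print(f"{ratio}:1 saves {ratio_to_percent_saved(ratio):.1f}% of the space")
```

The relationship is not linear. Going from 2:1 to 10:1 raises the saving from 50 to 90 percent, while going from 10:1 to 25:1 adds only six more percentage points.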

However, a customer’s deduplication ratio can give an inaccurate picture of the effectiveness of a dedupe system. If you backed up the same file 400 times, you would get a dedupe ratio of 400:1, but that speaks more to the inefficiency of your storage system than to how good your dedupe system is. When comparing different dedupe

 

 

What is post-process deduplication?

Post-process deduplication (PPD) describes a system in which deduplication software identifies and deletes redundant data only after it resides in a target deduplication data storage system. This technique may be necessary if it is not feasible or efficient to delete duplicate data during transfer or beforehand. It is also sometimes referred to as asynchronous deduplication: the dedupe process often runs while backups are still being written, but each segment is deduped only after it has first been written to storage.
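The write-first, dedupe-later ordering can be sketched in Python. This is a simplified model under stated assumptions: segments land in a raw list as they arrive, and a separate pass later replaces them with unique segments plus references. The class `PostProcessTarget` and its methods are hypothetical names.

```python
import hashlib

class PostProcessTarget:
    """Toy target system that dedupes only after data has been written."""

    def __init__(self):
        self.landing = []  # raw segments, written as-is during the backup
        self.chunks = {}   # hash -> unique segment, filled by the dedupe pass
        self.recipe = []   # ordered hashes needed to rebuild the backup

    def write(self, segment):
        # No hashing or lookup here, so the incoming write is not delayed.
        self.landing.append(segment)

    def dedupe_pass(self):
        # Runs later (or in the background) over data already on disk.
        for segment in self.landing:
            digest = hashlib.sha256(segment).hexdigest()
            self.chunks.setdefault(digest, segment)
            self.recipe.append(digest)
        self.landing.clear()

    def stored_bytes(self):
        raw = sum(len(s) for s in self.landing)
        unique = sum(len(c) for c in self.chunks.values())
        return raw + unique

target = PostProcessTarget()
for segment in (b"AAAA", b"BBBB", b"AAAA"):
    target.write(segment)

before = target.stored_bytes()
target.dedupe_pass()
after = target.stored_bytes()
print("bytes stored before the pass:", before, "and after:", after)
```

The trade-off this illustrates is that the target needs enough capacity to hold the duplicate data until the pass has run.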

 

 

How to implement deduplication

The best way to implement data deduplication technology will change depending on the user’s data protection goals, the data deduplication vendors used, and the sort of deduplication application in question. For example, a backup deduplication appliance or storage solution often includes deduplication technology and therefore has a much different implementation process than a freestanding deduplication software tool.

However, data deduplication technology is generally deployed either at the target or at the source. The differences here concern not just where the deduplication process takes place but also when: before the data is stored in the backup system, or after it is already there.

 

 

How does deduplication encryption work?

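There is an intimate relationship between deduplication and encryption because a tool can only detect duplicate data and delete it if it can read that data. Well-designed encryption makes identical plaintext produce different ciphertext each time it is encrypted, so encrypted copies of the same data no longer look alike. This means that any encryption must happen after the dedupe process. If it were to happen before the dedupe process, no duplicate data would be found.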
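The ordering problem can be demonstrated with a short Python sketch. The `toy_encrypt` function below is a deliberately simplified stand-in for real encryption, built only to show the effect of a random nonce; it is an assumption made for this example, is not secure, and should not be used to protect data.

```python
import hashlib
import os

def toy_encrypt(key, plaintext):
    """Toy cipher for illustration only. NOT secure.

    XORs the plaintext with a keystream derived from the key and a fresh
    random nonce, which mimics how real encryption randomizes its output.
    """
    nonce = os.urandom(16)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream += block
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body

def unique_count(chunks):
    """Number of distinct chunks a hash-based dedupe system would store."""
    return len({hashlib.sha256(c).hexdigest() for c in chunks})

key = b"example-key"
segments = [b"same segment of data", b"same segment of data"]

# Dedupe first: the two identical segments collapse to one before encryption.
dedupe_then_encrypt = unique_count(segments)

# Encrypt first: each copy gets its own nonce, so the ciphertexts differ.
encrypted = [toy_encrypt(key, s) for s in segments]
encrypt_then_dedupe = unique_count(encrypted)

print("unique segments when deduping first:", dedupe_then_encrypt)
print("unique segments when encrypting first:", encrypt_then_dedupe)
```

Because each encryption uses a fresh random nonce, the two ciphertexts differ even though the plaintext is identical, which leaves a hash-based dedupe process with nothing to match.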

 

 

Druva data deduplication solutions

Druva defines its patented approach to global data deduplication using these five qualities:

  • Global deduplication. Druva compares all of a given customer’s backup data against all other data from that customer, even from other locations. This reduces duplicate data more than any other vendor.
  • Source-side deduplication (i.e., client-side deduplication). Druva’s deduplication process starts at the client, not the backup system, reducing how much data must be transmitted over the network. However, Druva does use its service running in the cloud to do most of the work, in order to reduce the load on the client.
  • Block-level, sub-file analysis. This level of deduplication allows the tool to identify duplicate data within files.
  • Awareness of data-generating applications. Druva inSync searches inside application data for duplicate data.
  • Scalable performance. Druva’s deduplication locates and deletes duplicate data far beyond a single user, scaling across multiple devices and users.

Discover Druva's innovative products to optimize data storage, improve network performance, and accelerate backup on the deduplication page of the website.

 

 

Related terms

Now that you’ve learned about deduplication, brush up on these related terms with Druva’s glossary: