What Is Data Deduplication? Methods and Benefits

David Johnson

Technology

3 min read

Feb 14, 2024

The data deduplication process systematically eliminates redundant copies of data and files, which can help reduce storage costs and improve version control. In an era when every device generates data and entire organizations share files, data deduplication is a vital part of IT operations. It’s also a key part of the data protection and continuity process. When data deduplication is applied to backups, it identifies and eliminates duplicate files and blocks, storing only one instance of each unique piece of information. This not only can help save money but can also improve backup and recovery times, because less data needs to be sent over the network.


What Is Data Deduplication?

Data deduplication is the process of removing identical files or blocks from databases and data storage. Depending on the algorithm, it can operate at the file level, the block level, the individual byte level, or somewhere in between. Results are often measured by what’s called a “data deduplication ratio.” After deduplication, organizations should have more free space, though just how much varies because some activities and file types are more prone to duplication than others. While IT departments should regularly check for duplicates, the benefits of frequent deduplication also vary widely and depend on several variables.
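
To make this concrete, the following is a minimal sketch of file-level deduplication in Python: it hashes every file under a directory and groups byte-for-byte duplicates. The directory path and the choice of SHA-256 are illustrative assumptions, not any particular product’s behavior.

```python
# Minimal sketch: find byte-for-byte duplicate files by content hash.
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 64 KB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; groups with more than one entry are duplicates."""
    groups: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups.setdefault(hash_file(path), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    # "./data" is an illustrative path; point this at any directory tree.
    for digest, paths in find_duplicates("./data").items():
        names = ", ".join(str(p) for p in paths)
        print(f"{digest[:12]}... appears {len(paths)} times: {names}")
```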

Key Takeaways

  • Data deduplication is the process of scanning for and eliminating duplicate data.

  • Deduplication tools offer a range of precision levels, from file-by-file to file segment or block dedupe.

  • The more precise a deduplication process, the more compute power it requires.

  • For backups and archiving, deduplication can take place before or after data transfer. The former uses less bandwidth, while the latter consumes more bandwidth but fewer local resources.


Data Deduplication Explained

In the data deduplication process, a tool scans storage volumes for duplicate data and removes flagged instances. To find duplicates, the system compares unique identifiers, or hashes, attached to each piece of data. If a match is found, only one copy of the data is stored, and duplicates are replaced with references to the original copy.
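
For illustration, here is a minimal block-level sketch in Python, assuming fixed-size 4 KB blocks and SHA-256 fingerprints (production systems often use variable-size chunking and other hash functions). Each unique block is stored once, and subsequent writes of the same block become references to the stored copy.

```python
# Minimal block-level deduplication sketch with an in-memory block store.
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

block_store: dict[str, bytes] = {}  # hash -> unique block contents

def dedupe_write(data: bytes) -> list[str]:
    """Split data into blocks, store each unique block once, and return the reference list."""
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i : i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h not in block_store:   # store only the first copy of each block
            block_store[h] = block
        refs.append(h)             # duplicate blocks become references
    return refs

def rehydrate(refs: list[str]) -> bytes:
    """Rebuild the original data from its block references."""
    return b"".join(block_store[h] for h in refs)

# Two "files" that share their first 8 KB: the shared blocks are stored only once.
file_a = b"A" * 8192 + b"unique tail A"
file_b = b"A" * 8192 + b"unique tail B"
refs_a, refs_b = dedupe_write(file_a), dedupe_write(file_b)
assert rehydrate(refs_a) == file_a and rehydrate(refs_b) == file_b
print(f"Logical bytes: {len(file_a) + len(file_b)}, unique blocks stored: {len(block_store)}")
```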

The dedupe system searches local storage, management tools such as data catalogs, and data stores, and it scans both structured and unstructured data.

To fully understand what’s involved, the following terms and definitions are key:

  • Data deduplication ratio: A metric used to measure the success of the deduplication process. This ratio compares the size of the original data store with its size following deduplication, as illustrated in the short calculation after this list. While a high ratio indicates an effective process, variables such as frequency of deduplication, type of data, and other factors can skew the final ratio. Virtualization technology, for example, creates virtual machines that can be backed up and replicated easily, providing multiple copies of data. Keeping some copies is important for redundancy and to recover from data loss.

  • Data retention: The length of time that data is kept in storage, usually defined by policy. Financial reports must be kept longer than, say, emails. Typically, the longer the retention span, the greater the chance that data will be duplicated during backups, transfers, or through the use of virtual machines.

  • Data type: The format of data kept in storage. Typical data types are executables, documents, and media files. The file’s purpose, criticality, access frequency, and other factors define whether it’s duplicated and how long it’s retained.

  • Change rate: A metric measuring the frequency with which a file is updated or changed. Files with higher change rates are often duplicated less frequently.

  • Location: The place data is stored. Duplicate files often stem from the exact same files existing in multiple locations, either intentionally, as with a backup, or unintentionally, as when a file meant to be moved is copied instead and the original is left behind. In some cases, virtual machines stored in multiple locations contain duplicate files.
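
To make the deduplication ratio concrete, here is a small, hypothetical calculation; the 10 GB logical and 2 GB stored figures are assumed purely for illustration.

```python
# Hypothetical example: deduplication ratio and resulting storage savings.
logical_bytes = 10 * 1024**3  # 10 GB of data as applications see it (assumed)
stored_bytes = 2 * 1024**3    # 2 GB actually stored after deduplication (assumed)

ratio = logical_bytes / stored_bytes
savings = 1 - stored_bytes / logical_bytes
print(f"Deduplication ratio: {ratio:.0f}:1")  # -> Deduplication ratio: 5:1
print(f"Storage savings: {savings:.0%}")      # -> Storage savings: 80%
```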
