In the past few years data deduplication has emerged as one of the most sought-after technologies in data storage. Data deduplication is a technology wherein duplicate data within a storage system is identified and, instead of storing multiple copies of the same data, the duplicates are eliminated and only the unique data is stored. Deduplication translates to large savings in storage capacity requirements and, in the case of a backup storage system, allows data to be retained on the storage system for a longer period of time. While the goals of deduplication are clear, the approach adopted for deduplication has varied among vendors. In this document we describe the various deduplication technologies available with respect to disk-based backup systems and their pros and cons. Finally we describe QUADStor's VTL deduplication technology and its benefits.
File, Sub‐file, Block level Deduplication
The first aspect that needs to be addressed is how exactly data is determined to be duplicate or unique. Newer data can be compared against older data at a file level, wherein the contents of the file are compared against a previously stored file. However, in such a case, even if only a single byte in the newer file has changed when compared to an older one, the entire file is considered changed.
Files can be compared at a sub-file level, wherein a file's data is broken down into chunks and each chunk is compared against the chunks of an older file. Such an approach gives a higher chance of finding duplicate data.
The other approach is the block-level approach, wherein blocks of data are compared against previously stored blocks of data. The blocks compared can be of fixed sizes such as 32 KB or 64 KB, or, in some vendor implementations, of variable sizes. This approach is independent of the type of data that is stored on the system.
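The fixed-size block approach above can be sketched as follows. This is a minimal, hypothetical Python illustration of splitting a stream into fixed-size blocks, not any vendor's implementation; the 64 KB size is only an example.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB fixed-size blocks; sizes vary by vendor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte stream into fixed-size blocks.

    The final block may be shorter than block_size if the stream
    length is not an exact multiple of it.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Three full blocks plus a 10-byte tail -> 4 blocks in total
blocks = split_into_blocks(b"A" * (3 * BLOCK_SIZE + 10))
print(len(blocks))  # 4
```

Each resulting block, rather than the whole file, then becomes the unit of comparison against previously stored data.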
Block fingerprinting vs Byte-level comparison
The second aspect that needs to be addressed is how exactly the comparison is done. The comparison between newer and older data can be at a byte level, wherein each byte of the newer and older data is compared. The comparison can also be between data fingerprints calculated for the newer and older data; if the fingerprints match, the newer and older data are considered identical. Fingerprints are usually calculated by hash algorithms such as MD5, SHA-1, SHA-256 etc. Hash algorithms are usually preferred over byte-level comparison due to the lower processing time.
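The fingerprint approach, with optional byte-level verification, can be sketched in Python. This is a toy in-memory index assumed for illustration only; real systems keep the fingerprint index and block store on disk.

```python
import hashlib

stored = {}  # toy dedup index: fingerprint -> stored block


def fingerprint(block: bytes) -> str:
    # SHA-256 fingerprint; blocks with equal digests are treated as identical
    return hashlib.sha256(block).hexdigest()


def is_duplicate(block: bytes, verify: bool = True) -> bool:
    """Return True if an identical block has already been stored."""
    fp = fingerprint(block)
    if fp not in stored:
        stored[fp] = block  # unseen fingerprint: store as unique
        return False
    # Optional byte-level comparison guards against hash collisions
    return stored[fp] == block if verify else True


print(is_duplicate(b"hello"))  # False: first occurrence, stored as unique
print(is_duplicate(b"hello"))  # True: fingerprint already in the index
```

The `verify` flag shows the trade-off the text describes: fingerprint-only matching is faster, while a follow-up byte comparison removes any collision risk.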
Source vs. Target based deduplication
Source-based deduplication is wherein data is deduplicated at the source itself, the advantage being that duplicate data never travels over the network. Currently provided by leading backup application software vendors, this approach is preferred for backups from remote offices to a central backup location. Target-based deduplication is when deduplication is performed by the storage subsystem itself, for example a NAS system, a VTL system or a D2D backup system. Target deduplication is preferred for high-volume backups usually performed locally in a data center.
Inline vs. Post‐process Deduplication
In the inline deduplication approach, as and when data is received, the system tries to determine whether the received data is unique or a duplicate already exists. Duplicate data in the write stream is discarded and only the unique data is written to disk. The advantage of such an approach is that duplicate data is never stored and hence space savings are immediately reflected. The downside is that additional CPU processing is required for fingerprint and data comparison, but most modern systems employ a large number of CPUs or CPU cores.
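The inline path can be sketched as a toy in-memory store, assumed purely for illustration: each incoming block is fingerprinted before it is written, unique blocks are stored once, and duplicates leave behind only a metadata reference.

```python
import hashlib


class InlineDedupStore:
    """Toy inline deduplication: unique blocks are written once;
    duplicates are recorded only as references to existing blocks."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block data (stands in for "on disk")
        self.stream = []  # logical write order, as fingerprint references

    def write(self, block: bytes):
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = block  # unique: written to the store
        self.stream.append(fp)       # duplicate or not, keep the reference


store = InlineDedupStore()
for b in (b"aaa", b"bbb", b"aaa", b"aaa"):
    store.write(b)

print(len(store.blocks), len(store.stream))  # 2 4
```

Four logical writes land as two unique blocks, which is exactly why space savings appear immediately with inline deduplication.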
Post‐process (Out-of-Band) Deduplication
In the post-process deduplication approach, backup streams are written to disk as before. Deduplication is performed only after a backup has completed. The advantage of such an approach is that backup performance is never affected. However, the deduplication operation is now slower as the newer data has to be read from disk again. Recent implementations optimize this by performing the hash computation during the backup itself, thereby avoiding the need to read data back from disk. The hashes computed are stored on disk to be read later during the actual deduplication process. Since deduplication is done at a later stage, it may impact subsequent backups if not properly scheduled.
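The optimization described above, computing hashes during the backup so the deduplication pass does not re-read data, can be sketched in two phases. This is a hypothetical in-memory model, not a description of any product; real systems stage the hashes on disk.

```python
import hashlib


def backup_phase(blocks):
    """During backup: write raw blocks and record their fingerprints,
    avoiding a later re-read of the data just to compute hashes."""
    return [(hashlib.sha256(b).hexdigest(), b) for b in blocks]


def dedup_phase(staged):
    """Later, scheduled pass: keep one copy per fingerprint and
    replace duplicates with references."""
    unique, stream = {}, []
    for fp, block in staged:
        unique.setdefault(fp, block)  # first occurrence wins
        stream.append(fp)
    return unique, stream


staged = backup_phase([b"x", b"y", b"x"])
unique, stream = dedup_phase(staged)
print(len(unique), len(stream))  # 2 3
```

Because `dedup_phase` consumes the staged fingerprints rather than the raw data, only duplicate elimination itself remains for the scheduled window.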
With most deduplication technologies, restores from deduplicated data can be slower when compared with restores without, or before, deduplication. As a result, vendors resort to workarounds such as storing the most recently received backup without any deduplication optimizations, since the last backup is the one most likely to be needed for a restore. The factors that may affect restore performance are:
- Data may have to be reconstructed during a restore.
- Additional disk lookups may be required to retrieve the restore data.
- Data may be spread across the disk due to a mix of unique and duplicate data.
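The first factor, reconstructing data during a restore, can be sketched as resolving each fingerprint reference back to its stored block. This toy in-memory example is assumed for illustration; in a real system each lookup may translate into a disk seek, which is what slows restores down.

```python
import hashlib


def restore(stream, blocks):
    """Reconstruct the original byte stream by resolving each
    fingerprint reference back to its stored unique block."""
    return b"".join(blocks[fp] for fp in stream)


# Toy "disk": unique blocks keyed by their SHA-256 fingerprints
disk = {hashlib.sha256(b).hexdigest(): b for b in (b"foo", b"bar")}

# Logical stream written as fingerprint references: foo, bar, foo
fps = [hashlib.sha256(b).hexdigest() for b in (b"foo", b"bar", b"foo")]

print(restore(fps, disk))  # b'foobarfoo'
```

Every entry in the stream triggers a lookup, and blocks shared between backups may live far apart on disk, which illustrates the second and third factors as well.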
QUADStor virtual tape library with inline data deduplication
QUADStor VTL implements inline data deduplication for a simple approach and maximum performance. Its benefits are:
- With inline deduplication, only unique data is stored. Byte-by-byte comparison of possible duplicate data against stored data does not require additional disk operations.
- The SHA-256 hash function is used for fingerprint computation, and byte-by-byte comparison of possible duplicate data against the stored data can also be enabled.
- Unlike delta-based deduplication systems, the impact on restore performance is reduced since the metadata required for restores always points directly to the data on disk.
- Inline data deduplication with support for backup application software from leading vendors.
- Employs an intelligent read cache to minimize the effect of deduplication on restore performance.