Deduplication is a technology used to save precious space on your storage systems. In short, deduplication removes redundant data but at the same time leaves the logical view intact and is thus transparent to all applications.
Redundant data is everywhere - on many file servers each member of a project team has her own copy of a document which differs only marginally from the other versions. If we compare two Windows servers they contain lots of identical data: the whole operating system, many applications… and - at the disk level - even the free space is often identical! A third area, which originally sparked interest in deduplication, is backup: if you back up hundreds of almost identical workstation PCs there is a lot of potential for deduplication because many files are identical.
Data deduplication vs. data compression
The idea of removing redundancy is also used in data compression: if you consider the string
ABCDDDDDDEEEEFAAAAAAABBBBBB
you can save space by writing it as
ABC,6xD,4xE,F,7xA,6xB
This is a very primitive example of run-length compression but you get the idea. Another possibility is to find frequent patterns in data and then assign shortcuts to them. Unfortunately data compression has three disadvantages for filesystem applications:
- “finding patterns” is extremely time-intensive for hundreds of GB of data, as you have to remember lots of possible combinations to be efficient
- data has to be decompressed before it can be used (performance)
- it is difficult to change something “in the middle”, as everything after it starts moving around
Because of these problems - which all boil down to poor performance - transparent compression is not widely used in filesystems. It is available (e.g. in NTFS) but hardly ever used for production data.
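For the curious, the run-length idea from the example above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the comma notation is just for display (it would be ambiguous for text that itself contains digits or commas).

```python
from itertools import groupby


def rle_encode(text):
    """Return a list of (count, char) runs, e.g. "AAB" -> [(2, "A"), (1, "B")]."""
    return [(len(list(group)), char) for char, group in groupby(text)]


def rle_decode(runs):
    """Rebuild the original text from (count, char) runs."""
    return "".join(char * count for count, char in runs)


def format_runs(runs):
    """Render runs in the notation used above; single characters are kept as literals."""
    tokens = []
    literal = ""
    for count, char in runs:
        if count == 1:
            literal += char          # collect non-repeating characters
            continue
        if literal:
            tokens.append(literal)   # flush the pending literal first
            literal = ""
        tokens.append(f"{count}x{char}")
    if literal:
        tokens.append(literal)
    return ",".join(tokens)


sample = "ABC" + "D" * 6 + "E" * 4 + "F" + "A" * 7 + "B" * 6
print(format_runs(rle_encode(sample)))
```

Note that decoding has to walk through all runs from the start, which hints at why changing something "in the middle" of compressed data is awkward.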
How deduplication works
Deduplication divides data into chunks, finds all chunks which are identical and then stores only one copy of each chunk together with all locations where it appeared. To identify chunks, hash algorithms like MD5 or SHA can be applied, which generate a digital “fingerprint” for a given piece of data. If the fingerprints of two chunks are equal we can compare the chunks themselves and will most likely find a duplicate. The trick is that comparing fingerprints (which are small) is much faster than comparing the data itself - it’s basically a “did I see that fingerprint before?” lookup.
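To make this more concrete, here is a minimal sketch of the fingerprint lookup in Python. It assumes fixed-size chunks and SHA-256 from the standard library; the class name `ChunkStore` and the tiny chunk size are invented for the demo and do not describe any particular product.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, just to keep the demo readable


class ChunkStore:
    """Stores each distinct chunk once, keyed by its fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data, chunk_size=CHUNK_SIZE):
        """Store data and return its 'recipe': the ordered list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            fingerprint = hashlib.sha256(chunk).digest()
            stored = self.chunks.get(fingerprint)  # "did I see that fingerprint before?"
            if stored is None:
                self.chunks[fingerprint] = chunk
            elif stored != chunk:
                # Equal fingerprints, different data: extremely unlikely, but checked.
                raise ValueError("fingerprint collision")
            recipe.append(fingerprint)
        return recipe

    def get(self, recipe):
        """Reassemble the original data from its recipe."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)


store = ChunkStore()
recipe = store.put(b"AAAABBBBAAAACCCC")
print(len(recipe), "chunks referenced,", len(store.chunks), "chunks stored")
```

The logical view (the recipe) still lists every chunk, while the store keeps each distinct chunk only once.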
The size of a “chunk” is the magic key to efficiency: one of the few filesystem-level implementations of deduplication is Microsoft’s Single Instance Storage (SIS) in Windows Server 2003 R2, and there a “chunk” is a whole file. That means if two files are identical Microsoft SIS can detect that, but if they differ in only one single bit you’ll not see any savings.
Let’s assume two files are identical - what can we do now? Microsoft’s SIS basically removes all but one of the duplicate files and replaces them with a hardlink to the “original”. In this way all duplicates are still visible but don’t take up space on disk. In fact some virtualization systems (e.g. Virtuozzo) can do the same thing for multiple copies of virtual environments (VEs), where multiple VEs share a single copy of the operating system. The “hardlink” analogy is however problematic if you consider write operations to a duplicate: we don’t want to overwrite the original but to obtain a separate copy. That’s what differentiates Microsoft’s SIS from normal hardlinks.
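The difference to a plain hardlink can be illustrated with a toy model. The sketch below is not how SIS is implemented; it is a hypothetical in-memory store (the name `SingleInstanceStore` is made up) that only shows the desired behaviour: identical files share one copy, and writing to one of them leaves the others untouched.

```python
import hashlib


class SingleInstanceStore:
    """Toy model: file names point at shared, immutable content blobs."""

    def __init__(self):
        self.blobs = {}  # content id -> file content
        self.links = {}  # file name -> content id

    def write(self, name, data):
        """Point `name` at `data`; other names sharing the old content are unaffected."""
        old = self.links.get(name)
        content_id = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(content_id, data)  # store content only once
        self.links[name] = content_id
        if old is not None and old not in self.links.values():
            del self.blobs[old]  # nobody references the old content any more

    def read(self, name):
        return self.blobs[self.links[name]]


store = SingleInstanceStore()
store.write("alice/report.doc", b"quarterly report")
store.write("bob/report.doc", b"quarterly report")     # duplicate, shares the blob
store.write("bob/report.doc", b"quarterly report v2")  # write gets its own copy
print(store.read("alice/report.doc"))
```

With a real hardlink the last write would have changed Alice's file as well; here she keeps the original content.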
Netapp’s ASIS
Let’s go back to Netapp’s deduplication system called ASIS (Advanced Single Instance Storage). It is integrated into the WAFL file system used on all Netapp filers and uses the 4 kB filesystem block as a chunk. If multiple 4 kB blocks are equal ASIS will find them using a fingerprint database. Of course this is much more efficient than Microsoft’s solution because small differences between files leave most of the savings intact.
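A quick back-of-the-envelope experiment shows how much the chunk size matters. The sketch below is a simplified model, not ASIS itself: it takes two 64 kB "files" that differ in a single bit and counts how many bytes would have to be stored with whole-file chunks versus 4 kB blocks. The helper `stored_bytes` is a name made up for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 kB, the block size mentioned above


def stored_bytes(files, chunk_size=None):
    """Bytes needed after deduplication; chunk_size=None means one chunk per file."""
    seen = {}  # fingerprint -> chunk length
    for data in files:
        size = chunk_size or max(len(data), 1)
        for offset in range(0, len(data), size):
            chunk = data[offset:offset + size]
            seen.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(seen.values())


# 16 blocks, each filled with a different byte value, so no block repeats.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(16))
changed = bytearray(original)
changed[5000] ^= 1  # flip a single bit somewhere in the second block
changed = bytes(changed)

print("whole-file chunks:", stored_bytes([original, changed]))
print("4 kB block chunks:", stored_bytes([original, changed], BLOCK_SIZE))
```

With whole-file chunks both files are stored in full (131072 bytes), while with 4 kB blocks only the one modified block is stored twice (69632 bytes).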
ASIS deduplicates in the background - new data written to the filesystem is not deduplicated immediately but only at scheduled intervals. Read performance is comparable to a normal volume without ASIS. I won’t dive into the gory technical details of how this is done as there is already material available on the net:
- Netapp patents
- The technical report TR-3505
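The general idea of deduplicating in the background rather than at write time can still be sketched in a few lines. The following toy model is an assumption-laden illustration and has nothing to do with the actual WAFL internals: writes always allocate a new block, and a separate pass (which a scheduler would trigger) later collapses identical blocks. The names `Volume` and `dedup_pass` are invented.

```python
import hashlib


class Volume:
    """Toy volume: logical block numbers point at physical blocks."""

    def __init__(self):
        self.physical = {}   # physical id -> block bytes
        self.block_map = []  # logical block number -> physical id
        self._next_id = 0

    def write(self, block):
        """Foreground write: always allocates a new block, no deduplication work."""
        self.physical[self._next_id] = block
        self.block_map.append(self._next_id)
        self._next_id += 1

    def read(self, logical):
        return self.physical[self.block_map[logical]]

    def dedup_pass(self):
        """Background pass: repoint duplicates at one shared block, free the rest."""
        by_fingerprint = {}  # fingerprint -> physical id of the block we keep
        for logical, pid in enumerate(self.block_map):
            block = self.physical[pid]
            fingerprint = hashlib.sha256(block).digest()
            keeper = by_fingerprint.setdefault(fingerprint, pid)
            if keeper != pid and self.physical[keeper] == block:
                self.block_map[logical] = keeper
        live = set(self.block_map)
        freed = [pid for pid in self.physical if pid not in live]
        for pid in freed:
            del self.physical[pid]
        return len(freed)


volume = Volume()
for block in (b"a" * 4096, b"b" * 4096, b"a" * 4096):
    volume.write(block)
print("blocks freed:", volume.dedup_pass())
```

Reads simply follow the block map, so they look the same before and after the pass.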
VMware datastores and deduplication
When you look into a typical VMware datastore with dozens of virtual disks you can already imagine the potential for deduplication: many virtual disks are based on the same templates and are thus by definition largely identical. ASIS can help to recover the wasted space.
Part 2 of this series will contain a short hands-on session using a VMware datastore and a Netapp. Stay tuned…