How to check your deduplication potential

Filesystem deduplication can save up to 90% of disk space (see my three-part series on deduplication: Part 1, Part 2 and Part 3) but exact numbers depend on the type and structure of your files.

A de-dupe estimator

In practice it is sometimes useful to verify potential savings using your own data. I didn’t find an easy way to do this without actually deduplicating real data, so I decided to write a little program which estimates the savings. It’s a little Python script (dedupe-estimate.py) and you can find it here.

How does it work?

The script goes recursively through a given directory and reads all files in 4 kB chunks. Why 4 kB? Because that is exactly the filesystem block size on a Netapp. A “fingerprint” (I used an MD5 checksum) is calculated for every block. The script keeps a list of all fingerprints already known - so it is easy to find duplicated pieces of data. Finally a ratio of unique to total blocks is calculated which tells us how much real data there is.
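
To make the idea concrete, here is a minimal sketch of the approach described above. It is not the original dedupe-estimate.py: the function name estimate_dedupe is made up for this illustration, and hashing a short final block of a file as it is (without padding) is an assumption.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # 4 kB, the block size mentioned above


def estimate_dedupe(root):
    """Return (unique_blocks, total_blocks) for all regular files below root."""
    seen = set()  # fingerprints of every block encountered so far
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            with open(path, "rb") as handle:
                while True:
                    block = handle.read(BLOCK_SIZE)
                    if not block:
                        break
                    # A set ignores fingerprints it already contains,
                    # so duplicated blocks are counted only once here.
                    seen.add(hashlib.md5(block).digest())
                    total += 1
    return len(seen), total
```

Calling estimate_dedupe on a directory returns the two counters; dividing the first by the second gives the ratio of unique to total blocks.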

Of course the script gives only an estimate of potential savings as in reality keeping the fingerprint database and other internal metadata also needs some space.

Reality check

To find out how accurate the results are I created a new volume on a Netapp and copied some real data into it:

nas6070> df -i deduptest
Filesystem               iused      ifree  %iused  Mounted on
/vol/deduptest/          72768    1656638      4%  /vol/deduptest/        

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       17GB       22GB      43%  /vol/deduptest/
You can see there are approximately 72000 files and directories using 17 GB. I then started dedupe-estimate.py on a client which mounted the test volume:
bash-3.2# time ./dedupe-estimate.py /tmp/dedup/
Checking directory: /tmp/dedup/
[... lots of output here ...]
Stats: # files: 66935, Used: 3545390 blocks, total 4496052 blocks, Ratio 78.8556271146%      

real 18m22.185s
user 4m34.638s
sys 2m54.753s
We can see the script recognized almost 67000 files (the rest were directories) and estimated a ratio of 79%. That means 79% of the total data blocks were unique or, the other way round, 21% are redundant and could be de-duplicated.
nas6070> sis on /vol/deduptest
SIS for "/vol/deduptest" is enabled.     

nas6070> sis start -s /vol/deduptest
The file system will be scanned to process existing data in /vol/deduptest.
This operation may initialize related existing metafiles.
Are you sure you want to proceed with scan (y/n)? y

Fri May  2 10:51:53 CEST [nas6070: wafl.scan.start:info]: Starting SIS volume scan on volume deduptest.
The SIS operation for "/vol/deduptest" is started.

[... time passes ...]

nas6070b> sis status -l
Path:                    /vol/deduptest
State:                   Enabled
Status:                  Idle
Progress:                Idle for 00:03:19
Type:                    Regular
Schedule:                sun-sat@0
Last Operation Begin:    Fri May  2 10:53:52 CEST 2008
Last Operation End:      Fri May  2 11:09:32 CEST 2008
Last Operation Size:     17 GB
Last Operation Error:    -

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       13GB       26GB      35%  /vol/deduptest/

nas6070> df -s deduptest
Filesystem                used      saved       %saved
/vol/deduptest/       14571364    3721248          20%
The “real” de-duplication saved us 20%, so in this case our estimate of 21% was quite OK :-) 
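
Both percentages can be recomputed from the numbers in the output above. This small check assumes that the used and saved columns of df -s are in the same unit (presumably kB) and that %saved is calculated as saved / (used + saved); neither is stated in the output itself.

```python
# Numbers from the "Stats:" line printed by the estimator.
used_blocks, total_blocks = 3545390, 4496052
estimated_saving = 1 - used_blocks / total_blocks

# Numbers from "df -s deduptest" (assumed: same unit in both columns).
used, saved = 14571364, 3721248
measured_saving = saved / (used + saved)  # assumed definition of %saved

print(f"estimated: {estimated_saving:.1%}, measured: {measured_saving:.1%}")
# estimated: 21.1%, measured: 20.3%
```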

Big fat obligatory warning 

Please remember: this is an experimental tool - USE AT YOUR OWN RISK and only if you understand exactly what it does. It is based on my understanding of how things work and is not at all guaranteed to be accurate. It does not take sparse files into account. While intermediate results are printed after every file, for “real” results you need to process ALL data (why?).

Deduplication and file formats

If you experiment a bit with the script some important stuff comes up. Create a directory and put a file into it. Run the estimator. Make a copy of the file in the same directory. Run the estimator again. Now modify the copy and test again. Surprise? Yes, it depends on the file format - if you take a (larger) plain text file and add something at the end savings will remain big. If you delete a word in the beginning… try it yourself :-) 
Best results are thus achieved with files which do not lose their 4 kB alignment when something changes - like for example virtual disk files from VMware or other virtualization products. It might be interesting to research how popular file formats can be made more “deduplication-friendly”.
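
The alignment effect can also be reproduced without touching the disk. The following sketch uses synthetic data (lines of exactly 16 bytes, so the file is exactly 100 blocks long); the helper names are made up for this illustration.

```python
import hashlib

BLOCK_SIZE = 4096


def fingerprints(data):
    """Return the MD5 fingerprint of every 4 kB block in data."""
    return [hashlib.md5(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]


def shared_blocks(original, modified):
    """Count blocks of modified whose fingerprint also occurs in original."""
    known = set(fingerprints(original))
    return sum(1 for fp in fingerprints(modified) if fp in known)


# Synthetic "text file": 25600 lines of 16 bytes = 100 distinct 4 kB blocks.
original = b"".join(f"line {i:010d}\n".encode() for i in range(25600))

appended = original + b"one more line\n"  # add something at the end
trimmed = original[5:]                    # delete the first word ("line ")

print(shared_blocks(original, appended), "of", len(fingerprints(appended)))
# 100 of 101
print(shared_blocks(original, trimmed), "of", len(fingerprints(trimmed)))
# 0 of 100
```

Appending leaves all existing blocks untouched, while removing five bytes at the start shifts every block boundary so that not a single block matches any more.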

 

Feel free to post your own numbers or feedback - comments are highly welcome (moderated).

 

Netapp filesystem deduplication and VMware, Part 1

Deduplication is a technology used to save precious space on your storage systems. In short, deduplication removes redundant data but at the same time leaves the logical view intact and is thus transparent to all applications.

Redundant data is everywhere - on many file servers each member of a project team has her own copy of a document which differs only marginally from other versions. If we compare two Windows servers they contain lots of identical data: the whole operating system, many applications… and - at the disk level - even the free space is often identical! A third area which originally sparked interest in deduplication is backup: if you back up hundreds of almost identical workstation PCs there is a lot of potential for deduplication because many files are identical.

Data deduplication vs. data compression

The idea of removing redundancy is also used in data compression: if you consider the string

ABCDDDDDDEEEEFAAAAAAABBBBBB

you can save space by writing it as 

ABC,6xD,4xE,F,7xA,6xB

This is a very primitive example of run-length compression but you get the idea. Another possibility is to find frequent patterns in data and then assign shortcuts to them. Unfortunately data compression has three disadvantages for filesystem applications:

  • “finding patterns” is extremely time-intensive for hundreds of GB of data as you have to remember lots of possible combinations to be efficient
  • data has to be de-compressed before it can be used (performance)
  • it is difficult to change something “in the middle” as everything starts moving around

Because of these problems - which all boil down to bad performance - transparent compression is not widely used in filesystems: it is available (e.g. in NTFS) but hardly used for production data. 
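
Coming back to the run-length example from the beginning of this section, the encoding can be sketched in a few lines. This is only an illustration of the notation used above, not a real compression format (it could not be decoded reliably if the data itself contained digits, commas or an “x”). The function name and the min_run threshold are assumptions.

```python
from itertools import groupby


def run_length_encode(text, min_run=3):
    """Write runs of at least min_run equal characters as '<count>x<char>'."""
    tokens = []
    literal = ""  # characters that are not part of a long run
    for char, group in groupby(text):
        count = len(list(group))
        if count >= min_run:
            if literal:
                tokens.append(literal)
                literal = ""
            tokens.append(f"{count}x{char}")
        else:
            literal += char * count
    if literal:
        tokens.append(literal)
    return ",".join(tokens)


sample = "ABC" + "D" * 6 + "E" * 4 + "F" + "A" * 7 + "B" * 6
print(run_length_encode(sample))  # ABC,6xD,4xE,F,7xA,6xB
```

Note that the single F is kept as a literal because encoding a run of length one would not save anything.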

How deduplication works

Deduplication divides up data into chunks, finds all chunks which are identical and then stores only one copy of each chunk together with all locations where it appeared. To identify chunks, hash algorithms like MD5 or SHA can be applied which generate a digital “fingerprint” for a given piece of data. If the fingerprints of two chunks are equal we can compare the chunks and hopefully find a duplicate. The trick is that comparing fingerprints (which are small) is much faster than comparing the data itself - it’s basically a “did I see that fingerprint before?”.
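
As a rough sketch of this principle (a toy in-memory model, not how any particular product implements it), the class below keeps one copy per chunk plus a list of all places where the chunk occurred. The class name ChunkStore and its attributes are made up for this illustration.

```python
import hashlib


class ChunkStore:
    """Toy deduplicating store: one copy per chunk plus all its locations."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}     # fingerprint -> the single stored copy
        self.locations = {}  # fingerprint -> list of (name, offset)

    def add(self, name, data):
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fp = hashlib.sha256(chunk).digest()
            stored = self.chunks.get(fp)
            if stored is None:
                self.chunks[fp] = chunk  # first time we see this fingerprint
            elif stored != chunk:
                # Equal fingerprints but different data: a hash collision.
                raise RuntimeError("fingerprint collision")
            self.locations.setdefault(fp, []).append((name, offset))


store = ChunkStore()
store.add("a", b"x" * 8192 + b"y" * 4096)
store.add("b", b"x" * 4096)
print(len(store.chunks))  # 2
```

Four chunks were written but only two are stored; the chunk consisting of “x” bytes is remembered with three locations.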

The size of a “chunk” is the magic key to efficiency: one of the few systems which use deduplication is Microsoft’s Single Instance Storage in Windows Server 2003 R2, and there a “chunk” is a whole file. That means if two files are identical Microsoft SIS can detect that, but if they differ in only one single bit you’ll not see any savings.

Let’s assume two files are identical - what can we do now? Microsoft’s SIS basically removes all but one of the duplicate files and replaces each of them with a hardlink to the “original”. In this way all duplicates are still visible but don’t take up space on disk. In fact some virtualization systems (e.g. Virtuozzo) can do the same thing for multiple copies of virtual environments (VEs) where multiple VEs share a single copy of the operating system. The “hardlink” analogy is however problematic if you consider write operations to a duplicate: we don’t want to overwrite the original but to obtain a separate copy. That’s what differentiates Microsoft’s SIS from normal hardlinks.
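
The difference to a plain hardlink can be shown with a toy model. This is not how Microsoft’s SIS is implemented; it is a sketch under the assumption that a file is just a name pointing to a shared blob, and all names in it are hypothetical. For brevity it trusts the hash and does not compare the data byte by byte.

```python
import hashlib


class SingleInstanceStore:
    """Toy whole-file deduplication with copy-on-write semantics."""

    def __init__(self):
        self.blobs = {}  # fingerprint -> file contents (stored once)
        self.refs = {}   # fingerprint -> number of files using the blob
        self.files = {}  # file name -> fingerprint

    def write(self, name, data):
        """Create or overwrite a file without touching other files."""
        if name in self.files:
            self._release(self.files[name])
        fp = hashlib.sha256(data).digest()
        if fp not in self.blobs:
            self.blobs[fp] = data
            self.refs[fp] = 0
        self.refs[fp] += 1
        self.files[name] = fp

    def read(self, name):
        return self.blobs[self.files[name]]

    def _release(self, fp):
        """Drop one reference; remove the blob when nobody uses it."""
        self.refs[fp] -= 1
        if self.refs[fp] == 0:
            del self.blobs[fp], self.refs[fp]


store = SingleInstanceStore()
store.write("server1/os.img", b"operating system")
store.write("server2/os.img", b"operating system")  # duplicate, stored once
store.write("server2/os.img", b"operating system + patch")
print(store.read("server1/os.img"))  # b'operating system'
```

With a real hardlink the last write would have changed both files; here the writer gets its own copy and the “original” stays intact.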

Netapp’s ASIS

Let’s go back to Netapp’s deduplication system called ASIS (Advanced Single Instance Storage). It is integrated into the WAFL file system used on all Netapp filers and uses the 4 kB filesystem block as a chunk. If multiple 4 kB blocks are equal ASIS will find them using a fingerprint database. Of course this is much more efficient than Microsoft’s solution because small differences between files leave a large part of the savings intact.
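
To illustrate why the chunk size matters so much, the following sketch compares whole-file chunks with 4 kB chunks for two files that differ in a single byte. It only counts fingerprints and says nothing about how ASIS works internally; the function name unique_units is made up for this illustration.

```python
import hashlib

BLOCK_SIZE = 4096


def unique_units(files, unit_size=None):
    """Return (unique, total) chunk counts; unit_size=None means whole files."""
    seen, total = set(), 0
    for data in files:
        step = unit_size or max(len(data), 1)
        for offset in range(0, len(data), step):
            seen.add(hashlib.md5(data[offset:offset + step]).digest())
            total += 1
    return len(seen), total


# 50 distinct 4 kB blocks, and a copy with one single byte changed.
base = b"".join(i.to_bytes(4, "big") * 1024 for i in range(50))
copy = bytearray(base)
copy[100000] ^= 0xFF
copy = bytes(copy)

print(unique_units([base, copy]))              # (2, 2)
print(unique_units([base, copy], BLOCK_SIZE))  # (51, 100)
```

With whole files as chunks both files have to be stored completely; with 4 kB chunks only the one modified block needs extra space.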

ASIS deduplicates in the background - new data written to the filesystem is not deduplicated immediately but only at scheduled intervals. Read performance is comparable to a normal volume without ASIS. I won’t dive into the gory technical details of how this is done as there is already material available on the net:

VMware datastores and deduplication

When you look into a typical VMware datastore with dozens of virtual disks you can already imagine the potential for deduplication: many virtual disks are based on the same templates and thus by definition largely identical. ASIS can help to recover the wasted space.

Part 2 of this series will contain a short hands-on session using a VMware datastore and a Netapp. Stay tuned…

Comments welcome! (moderated)