How to check your deduplication potential

Filesystem deduplication can save up to 90% of disk space (see my three-part series on deduplication: Part 1, Part 2 and Part 3) but exact numbers depend on the type and structure of your files.

A de-dupe estimator

In practice it is sometimes useful to verify potential savings using your own data. I didn’t find an easy way to do this without actually deduplicating real data, so I decided to write a little program which estimates the savings. It’s a little Python script (dedupe-estimate.py) and you can find it here.

How does it work?

The script goes recursively through a given directory and reads all files in 4 kB chunks. Why 4 kB? Because that is exactly the filesystem block size on a NetApp. A “fingerprint” (I used an MD5 checksum) is calculated for every block. The script keeps a list of all fingerprints already known, so it is easy to find duplicated pieces of data. Finally the ratio of unique to total blocks is calculated, which tells us how much real data there is.
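
The steps above could be sketched roughly as follows. This is not the actual dedupe-estimate.py, just a minimal Python 3 illustration of the same idea. The function name estimate_dedupe, the use of a set for the fingerprint list, and the decision to hash a short final block as-is are assumptions made for this example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # 4 kB, the block size discussed above


def estimate_dedupe(root):
    """Return (unique_blocks, total_blocks) for all regular files below root."""
    seen = set()  # fingerprints of every block encountered so far
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: symlinks and special files are skipped.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    while True:
                        block = handle.read(BLOCK_SIZE)
                        if not block:
                            break
                        # A short last block is hashed as-is (assumption).
                        seen.add(hashlib.md5(block).digest())
                        total += 1
            except OSError:
                continue  # unreadable file: ignored in this sketch
    return len(seen), total
```

Calling estimate_dedupe("/tmp/dedup/") returns the two block counts, and 100.0 * unique / total then gives the percentage of unique blocks. The set holds one 16-byte digest per unique block, so memory use grows with the amount of unique data.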

Of course the script gives only an estimate of potential savings, as in reality keeping the fingerprint database and other internal metadata also needs some space.

Reality check

To find out how accurate the results are, I created a new volume on a NetApp and copied some real data into it:

nas6070> df -i deduptest
Filesystem               iused      ifree  %iused  Mounted on
/vol/deduptest/          72768    1656638      4%  /vol/deduptest/        

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       17GB       22GB      43%  /vol/deduptest/

You can see there are approximately 72000 files and directories using 17 GB. I then started dedupe-estimate.py on a client which mounted the test volume:

bash-3.2# time ./dedupe-estimate.py /tmp/dedup/
Checking directory: /tmp/dedup/
[... lots of output here ...]
Stats: # files: 66935, Used: 3545390 blocks, total 4496052 blocks, Ratio 78.8556271146%      

real 18m22.185s
user 4m34.638s
sys 2m54.753s

We can see the script recognized almost 67000 files (the rest were directories) and estimated a ratio of 79%. That means 79% of the total data blocks were unique or, the other way round, 21% were redundant and could be de-duplicated.

nas6070> sis on /vol/deduptest
SIS for "/vol/deduptest" is enabled.     

nas6070> sis start -s /vol/deduptest
The file system will be scanned to process existing data in /vol/deduptest.
This operation may initialize related existing metafiles.
Are you sure you want to proceed with scan (y/n)? y

Fri May  2 10:51:53 CEST [nas6070: wafl.scan.start:info]: Starting SIS volume scan on volume deduptest.
The SIS operation for "/vol/deduptest" is started.

[... time passes ...]

nas6070b> sis status -l
Path:                    /vol/deduptest
State:                   Enabled
Status:                  Idle
Progress:                Idle for 00:03:19
Type:                    Regular
Schedule:                sun-sat@0
Last Operation Begin:    Fri May  2 10:53:52 CEST 2008
Last Operation End:      Fri May  2 11:09:32 CEST 2008
Last Operation Size:     17 GB
Last Operation Error:    -

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       13GB       26GB      35%  /vol/deduptest/

nas6070> df -s deduptest
Filesystem                used      saved       %saved
/vol/deduptest/       14571364    3721248          20%

The “real” de-duplication saved us 20%, so in this case our estimate of 21% was quite OK :-)
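
For anyone who wants to redo the arithmetic, here is a small Python 3 sketch using only the numbers printed above. It assumes that the %saved column of df -s is computed as saved / (used + saved), which is consistent with the 20% shown. The variable names are just for illustration.

```python
# Block counts copied from the estimator's "Stats" line above.
unique_blocks = 3545390
total_blocks = 4496052
estimated_redundant = 100.0 * (1 - unique_blocks / total_blocks)

# Values copied from the "df -s" output above.
used = 14571364
saved = 3721248
# Assumption: %saved = saved / (used + saved).
measured_saved = 100.0 * saved / (used + saved)

print(f"estimated redundant: {estimated_redundant:.1f}%")  # 21.1%
print(f"measured saved:      {measured_saved:.1f}%")       # 20.3%
```

So the estimate and the measured value differ by less than one percentage point for this data set.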

Big fat obligatory warning 

Please remember: this is an experimental tool - USE AT YOUR OWN RISK and only if you understand exactly what it does. It is based on my understanding of how things work and is not at all guaranteed to be accurate. It does not take sparse files into account. While intermediate results are printed after every file, for “real” results you need to process ALL data (why?).

Deduplication and file formats

If you experiment a bit with the script, some important things come up. Create a directory and put a file into it. Run the estimator. Make a copy of the file in the same directory. Run the estimator again. Now modify the copy and test again. Surprise? Yes, it depends on the file format - if you take a (larger) plain text file and add something at the end, savings will remain big. If you delete a word at the beginning… try it yourself :-)
Best results are thus achieved with files which do not lose their 4 kB alignment when something changes, such as virtual disk files from VMware or other virtualization products. It might be interesting to research how popular file formats can be made more “deduplication-friendly”.
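
The alignment effect can also be reproduced without touching any files. The following Python 3 sketch is only an illustration under simple assumptions: fixed 4 kB blocks, MD5 fingerprints, and random bytes standing in for file content. The helper names fingerprints and shared_blocks are made up for this example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # fixed 4 kB blocks, as in the estimator


def fingerprints(data):
    """Return the set of MD5 fingerprints of the 4 kB blocks in data."""
    return {
        hashlib.md5(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


def shared_blocks(original, modified):
    """Count distinct blocks of modified that also occur in original."""
    return len(fingerprints(original) & fingerprints(modified))


original = os.urandom(100 * BLOCK_SIZE)  # 100 blocks of random data
appended = original + b"some new text"   # change at the very end
shifted = original[1:]                   # one byte deleted at the start

# Appending keeps every existing block aligned, so all 100 still match.
print(shared_blocks(original, appended))
# Deleting one byte shifts every block boundary, so nothing matches.
print(shared_blocks(original, shifted))
```

With fixed-size blocks a single byte removed at the start changes every following block, while appended data only adds new blocks after the existing ones.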

 

Feel free to post your own numbers or feedback - comments are highly welcome (moderated).