Filesystem deduplication can save up to 90% of disk space (see my three-part series on deduplication: Part 1, Part 2 and Part 3) but exact numbers depend on the type and structure of your files.
A de-dupe estimator
In practice it is sometimes useful to verify potential savings using your own data. I didn’t find an easy way to do this without actually deduplicating real data, so I decided to write a little program which estimates the savings. It’s a little Python script (dedupe-estimate.py) and you can find it here.
How does it work?
The script goes recursively through a given directory and reads all files in 4 kB chunks. Why 4 kB? Because that is exactly the filesystem block size on a Netapp. A “fingerprint” (I used an MD5 checksum) is calculated for every block. The script keeps a list of all fingerprints already known - so it is easy to find duplicated pieces of data. Finally a ratio of unique to total blocks is calculated which tells us how much real data there is.
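The steps described above can be sketched roughly as follows. This is not the original dedupe-estimate.py, only a minimal illustration of the same idea; the function name `estimate_dedupe` and the choice to hash a short final block as-is (without padding it to 4 kB) are my own assumptions.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # 4 kB, the block size mentioned in the post


def estimate_dedupe(root, block_size=BLOCK_SIZE):
    """Return (unique_blocks, total_blocks) for all regular files under root.

    Hypothetical helper for illustration; not the author's script.
    """
    seen = set()   # fingerprints of every block encountered so far
    total = 0      # number of blocks read in total
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while True:
                        block = fh.read(block_size)
                        if not block:
                            break
                        # Assumption: a short last block is hashed unpadded.
                        seen.add(hashlib.md5(block).digest())
                        total += 1
            except OSError:
                # Unreadable files are skipped in this sketch.
                continue
    return len(seen), total
```

Calling `estimate_dedupe("/some/dir")` returns the two counts; dividing the first by the second gives the ratio of unique to total blocks. Keeping the fingerprints in a set makes each lookup cheap, at the cost of holding 16 bytes per unique block in memory.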
Of course the script gives only an estimate of potential savings as in reality keeping the fingerprint database and other internal metadata also needs some space.
Reality check
To find out how accurate the results are I created a new volume on a Netapp and copied some real data into it:
nas6070> df -i deduptest
Filesystem               iused      ifree  %iused  Mounted on
/vol/deduptest/          72768    1656638      4%  /vol/deduptest/
nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       17GB       22GB      43%  /vol/deduptest/
bash-3.2# time ./dedupe-estimate.py /tmp/dedup/
Checking directory: /tmp/dedup/
[... lots of output here ...]
Stats: # files: 66935, Used: 3545390 blocks, total 4496052 blocks, Ratio 78.8556271146%

real    18m22.185s
user    4m34.638s
sys     2m54.753s
nas6070> sis on /vol/deduptest
SIS for "/vol/deduptest" is enabled.
nas6070> sis start -s /vol/deduptest
The file system will be scanned to process existing data in /vol/deduptest.
This operation may initialize related existing metafiles.
Are you sure you want to proceed with scan (y/n)? y
Fri May 2 10:51:53 CEST [nas6070: wafl.scan.start:info]: Starting SIS volume scan on volume deduptest.
The SIS operation for "/vol/deduptest" is started.
[... time passes ...]
nas6070b> sis status -l
Path:                    /vol/deduptest
State:                   Enabled
Status:                  Idle
Progress:                Idle for 00:03:19
Type:                    Regular
Schedule:                sun-sat@0
Last Operation Begin:    Fri May 2 10:53:52 CEST 2008
Last Operation End:      Fri May 2 11:09:32 CEST 2008
Last Operation Size:     17 GB
Last Operation Error:    -
nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       13GB       26GB      35%  /vol/deduptest/
nas6070> df -s deduptest
Filesystem                used      saved  %saved
/vol/deduptest/       14571364    3721248     20%
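To put the two results side by side, the numbers from the output above can be plugged into a few lines of Python. Note the assumption: I take the filer's %saved to be saved / (used + saved), which is consistent with the 20% it prints but is my reading of the output, not something taken from documentation.

```python
# Figures copied from the terminal output in this post.
unique_blocks = 3545390   # "Used" blocks reported by the script
total_blocks = 4496052    # "total" blocks reported by the script
used_after = 14571364     # "used" column of df -s
saved = 3721248           # "saved" column of df -s

# The script prints the share of unique blocks; savings are the remainder.
estimated_savings = (1 - unique_blocks / total_blocks) * 100

# Assumption: %saved = saved / (used + saved).
actual_savings = saved / (used_after + saved) * 100

print("estimated: %.2f%%, actual: %.2f%%" % (estimated_savings, actual_savings))
```

Under that assumption the estimate comes out at about 21.1% and the filer's own figure at about 20.3%, so for this data set the two are within roughly one percentage point of each other.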
Big fat obligatory warning
Please remember: this is an experimental tool - USE AT YOUR OWN RISK and only if you understand exactly what it does. It is based on my understanding of how things work and is not at all guaranteed to be accurate. It does not take sparse files into account. While intermediate results are printed after every file, for “real” results you need to process ALL data (why?).
Deduplication and file formats
Feel free to post your own numbers or feedback - comments are highly welcome (moderated).
1 comment so far ↓
Hi.
Thanks for the script.
Unfortunately, it won’t work inside VMware ESX 3.5 due to complaints regarding the os.walk call.
I’m too lazy to find out why and it works on a regular Linux CentOS5 so I’ll run it there.
Now, if only Netapp would finally release Ontap 7.3 so I can use ASIS on our Gateways..