How to check your deduplication potential

Filesystem deduplication can save up to 90% of disk space (see my three-part series on deduplication: Part 1, Part 2 and Part 3), but the exact numbers depend on the type and structure of your files.

A de-dupe estimator

In practice it is sometimes useful to verify potential savings using your own data. I didn’t find an easy way to do this without actually deduplicating real data, so I decided to write a little program which estimates the savings. It’s a small Python script (dedupe-estimate.py) and you can find it here.

How does it work?

The script goes recursively through a given directory and reads all files in 4 kB chunks. Why 4 kB? Because that is exactly the filesystem block size on a Netapp. A “fingerprint” (I used an MD5 checksum) is calculated for every block. The script keeps a list of all fingerprints seen so far, so it is easy to find duplicated pieces of data. Finally, the ratio of unique to total blocks is calculated, which tells us how much real data there is.

Of course, the script gives only an estimate of the potential savings: in reality, keeping the fingerprint database and other internal metadata also needs some space.
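
The approach described above can be sketched in a few lines of Python. This is not the original dedupe-estimate.py, just a minimal illustration under a few assumptions: the function name estimate_unique_ratio is made up for this example, symbolic links and unreadable files are skipped, and a short last block of a file is fingerprinted as it is, without padding.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # 4 kB, the block size mentioned above


def estimate_unique_ratio(root, block_size=BLOCK_SIZE):
    """Return (unique_blocks, total_blocks) for all regular files under root.

    Hypothetical helper for illustration; not the original script.
    """
    seen = set()  # fingerprints of every block read so far
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: only regular files count, symlinks are ignored.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    while True:
                        block = handle.read(block_size)
                        if not block:
                            break
                        seen.add(hashlib.md5(block).digest())
                        total += 1
            except OSError:
                continue  # unreadable file: skip it
    return len(seen), total
```

Calling estimate_unique_ratio("/tmp/dedup/") returns the two block counts; dividing the first by the second gives the ratio of unique to total blocks. Keeping every fingerprint in an in-memory set is the simplest choice, but it also means memory use grows with the amount of unique data.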

Reality check

To find out how accurate the results are I created a new volume on a Netapp and copied some real data into it:

nas6070> df -i deduptest
Filesystem               iused      ifree  %iused  Mounted on
/vol/deduptest/          72768    1656638      4%  /vol/deduptest/        

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       17GB       22GB      43%  /vol/deduptest/

You can see there are roughly 72,800 files and directories using 17 GB. I then started dedupe-estimate.py on a client which had the test volume mounted:

bash-3.2# time ./dedupe-estimate.py /tmp/dedup/
Checking directory: /tmp/dedup/
[... lots of output here ...]
Stats: # files: 66935, Used: 3545390 blocks, total 4496052 blocks, Ratio 78.8556271146%      

real 18m22.185s
user 4m34.638s
sys 2m54.753s

We can see the script recognized almost 67,000 files (the rest were directories) and estimated a ratio of 79%. That means 79% of all data blocks were unique or, the other way round, 21% are redundant and could be de-duplicated. I then enabled deduplication on the volume and started a scan of the existing data:

nas6070> sis on /vol/deduptest
SIS for "/vol/deduptest" is enabled.     

nas6070> sis start -s /vol/deduptest
The file system will be scanned to process existing data in /vol/deduptest.
This operation may initialize related existing metafiles.
Are you sure you want to proceed with scan (y/n)? y

Fri May  2 10:51:53 CEST [nas6070: wafl.scan.start:info]: Starting SIS volume scan on volume deduptest.
The SIS operation for "/vol/deduptest" is started.

[... time passes ...]

nas6070b> sis status -l
Path:                    /vol/deduptest
State:                   Enabled
Status:                  Idle
Progress:                Idle for 00:03:19
Type:                    Regular
Schedule:                sun-sat@0
Last Operation Begin:    Fri May  2 10:53:52 CEST 2008
Last Operation End:      Fri May  2 11:09:32 CEST 2008
Last Operation Size:     17 GB
Last Operation Error:    -

nas6070> df -h deduptest
Filesystem               total       used      avail capacity  Mounted on
/vol/deduptest/           40GB       13GB       26GB      35%  /vol/deduptest/

nas6070> df -s deduptest
Filesystem                used      saved       %saved
/vol/deduptest/       14571364    3721248          20%

The “real” de-duplication saved us 20%, so in this case our estimate of 21% was quite OK :-)
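
The two percentages can be recomputed from the raw numbers shown above. The following is a small worked check, assuming that the df -s columns are in kB and that %saved is calculated as saved / (used + saved); both assumptions reproduce the 20% reported by the filer.

```python
BLOCK_SIZE = 4096  # bytes per block, as used by the estimator

# Numbers copied from the "Stats:" line of dedupe-estimate.py above.
unique_blocks = 3545390
total_blocks = 4496052

# Numbers copied from "df -s deduptest" above (assumed to be in kB).
used_kb = 14571364
saved_kb = 3721248

# Estimate: share of blocks that are duplicates, and their size.
estimated_redundant = 1 - unique_blocks / total_blocks
estimated_saving_gib = (total_blocks - unique_blocks) * BLOCK_SIZE / 2**30

# Measurement: assumed formula for the %saved column.
measured_saved = saved_kb / (used_kb + saved_kb)
measured_saving_gib = saved_kb * 1024 / 2**30

print(f"estimated: {estimated_redundant:.1%} (~{estimated_saving_gib:.2f} GiB)")
print(f"measured:  {measured_saved:.1%} (~{measured_saving_gib:.2f} GiB)")
```

With these inputs the estimate comes out at about 21.1% and the measurement at about 20.3%, so the two differ by less than one percentage point.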

Big fat obligatory warning 

Please remember: this is an experimental tool - USE AT YOUR OWN RISK and only if you understand exactly what it does. It is based on my understanding of how things work and is not at all guaranteed to be accurate. It does not take sparse files into account. While intermediate results are printed after every file, for “real” results you need to process ALL data (why?).

Deduplication and file formats

If you experiment a bit with the script, some important stuff comes up. Create a directory and put a file into it. Run the estimator. Make a copy of the file in the same directory. Run the estimator again. Now modify the copy and test again. Surprise? Yes, it depends on the file format: if you take a (larger) plain text file and add something at the end, savings will remain big. If you delete a word at the beginning… try it yourself :-)

Best results are thus achieved with files which do not lose their 4 kB alignment when something changes, for example virtual disk files from VMware or other virtualization products. It might be interesting to research how popular file formats can be made more “deduplication-friendly”.
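
The alignment effect can be reproduced without touching any real files. The sketch below uses a synthetic “text file” made of numbered lines (an assumption chosen so that no two blocks are identical by accident) and compares fixed 4 kB block fingerprints of the original with two modified copies. The helper names are made up for this example.

```python
import hashlib

BLOCK_SIZE = 4096


def block_fingerprints(data, block_size=BLOCK_SIZE):
    """MD5 fingerprints of consecutive fixed-size blocks of data."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def shared_fraction(original, modified):
    """Fraction of the modified copy's blocks that also occur in the original."""
    known = set(block_fingerprints(original))
    blocks = block_fingerprints(modified)
    return sum(fp in known for fp in blocks) / len(blocks)


# Synthetic plain text: 20000 numbered lines of 35 bytes each.
original = b"".join(b"line %08d of a plain text file\n" % n for n in range(20000))

appended = original + b"one more line at the end\n"  # add something at the end
shifted = original[5:]                                # delete the first word

print("appended at the end:", shared_fraction(original, appended))
print("deleted at the start:", shared_fraction(original, shifted))
```

On this input, appending leaves all blocks except the last one shared with the original, while deleting five bytes at the start shifts every following block boundary and leaves no shared block at all.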

Feel free to post your own numbers or feedback - comments are highly welcome (moderated).

An introduction to Ontap GX

If you have never heard of Ontap GX, you might want to first read Netapp’s Technical Report 3468 and have a look at Mike Eisler’s excellent presentation and paper from the FAST ’07 conference. You can find them on Mike’s personal website. There are also additional answers to questions from the session on the blog.

In a nutshell, Ontap GX combines multiple file servers into a single, clustered system with a single name space. This is very similar to mounting a volume in UNIX, but in GX different volumes can reside on different servers. You can ask any server about any volume and it will either fetch your data itself or internally redirect the request to the server hosting the volume.

The concept itself is not really new - in fact it’s quite old: the Andrew File System (AFS) introduced it back in the 1980s. AFS is still the only production-quality, secure filesystem with a real world-wide, global namespace. Unfortunately, AFS relies on client software to locate volumes, and this software has to be installed on every client. In Ontap GX, standard NFS and CIFS protocols are used and, as all the magic happens inside the cluster, clients are not aware of what is really happening behind the scenes.

What is the advantage of a single name space across multiple file servers? You can, for example, migrate volumes between physical machines without disrupting client access, and this can improve the utilization of your storage. If you add new file servers and storage, existing data can be moved to the new hardware transparently. You can also mix different filer models (add them in pairs) and disk types (FC/SATA). The goal is to be able to scale both capacity and performance without adding too much administration overhead or creating new islands of storage.
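
To make the idea of a single name space and transparent volume migration a bit more concrete, here is a toy model in Python. It is purely illustrative and says nothing about how Ontap GX is implemented internally: the Cluster class, the volume location map and the node names filer1 to filer3 are all made up for this example.

```python
class Cluster:
    """Toy model of a single name space; not how Ontap GX is implemented."""

    def __init__(self, node_names):
        # node name -> {volume name -> {path -> data}}
        self.nodes = {name: {} for name in node_names}
        # Hypothetical volume location map: volume name -> node name.
        self.location = {}

    def create_volume(self, volume, node, files):
        self.nodes[node][volume] = dict(files)
        self.location[volume] = node

    def move_volume(self, volume, target):
        """Migrate a volume to another node and update the location map."""
        source = self.location[volume]
        self.nodes[target][volume] = self.nodes[source].pop(volume)
        self.location[volume] = target

    def read(self, asked_node, volume, path):
        """Ask any node for any volume; return (data, node that served it)."""
        if volume in self.nodes[asked_node]:
            owner = asked_node             # the data is local
        else:
            owner = self.location[volume]  # redirect inside the cluster
        return self.nodes[owner][volume][path], owner


cluster = Cluster(["filer1", "filer2", "filer3"])
cluster.create_volume("home", "filer2", {"/readme.txt": b"hello"})

# The client always talks to filer1 and uses the same volume and path.
before = cluster.read("filer1", "home", "/readme.txt")
cluster.move_volume("home", "filer3")
after = cluster.read("filer1", "home", "/readme.txt")
print(before, after)
```

The client request is identical before and after the move; only the node that actually serves the data changes, which is the property the paragraph above describes.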

In the basic scenario a single volume still resides on a single filer. If you need larger volumes or better performance you can use a striped volume which distributes data across multiple filers. Normal and striped volumes can be mixed in the single namespace.

Of course this is not really news, as other companies like Bluearc or Isilon offer similar capabilities: at Bluearc it’s called Cluster Name Space, but there striping across multiple cluster nodes is not possible. In contrast, Isilon clusters always stripe everything across multiple nodes. With Ontap GX you have both options and can use them depending on workload type and data size.

At the moment many features of Netapp’s “Ontap Classic” aka Ontap 7 are not available in GX (for example iSCSI, quality of service (Flexshare) and many more), but both platforms will be integrated in the future. We have two different GX clusters on site (8 nodes and 2 nodes, 70 + 12 TB), and in future posts I’ll try to show a couple of practical hands-on examples of how stuff works on GX and what is different compared to “normal” filers and AFS (yes, we also use AFS).

Of course I’d love to hear from other GX users out there - please comment :-)