In part 1 we talked about data deduplication technology in general and how live filesystem deduplication works on NetApp filers. Let's have a look at how this really works.
The license to use ASIS should be free, but ask your NetApp dealer. For the sake of this demonstration I made a copy of one of our VMware NFS volumes so we can watch how the whole process works.
nas6070> df -h vmware_data_asistest
Filesystem                             total    used   avail  capacity  Mounted on
/vol/vmware_data_asistest/            1200GB   504GB   695GB       42%  /vol/vmware_data_asistest/
/vol/vmware_data_asistest/.snapshot    300GB    50GB   249GB       17%  /vol/vmware_data_asistest/.snapshot
The volume has a total capacity of 1500 GB, with 1200 GB available for data and the standard 20% snapshot reserve. A VMware cluster uses the volume as a datastore, and it holds approximately 35 virtual machines (both Windows and Linux) with disk sizes from 5 to 25 GB, using a total of 504 GB. This is our initial situation before we start the deduplication process.
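The numbers in the df output can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the inputs are the rounded GB values printed above, so the results are approximate.

```python
def split_volume(total_gb, snap_reserve_pct):
    """Return (data_gb, snapshot_gb) for a volume with a percentage snapshot reserve."""
    snapshot_gb = total_gb * snap_reserve_pct / 100
    return total_gb - snapshot_gb, snapshot_gb


def capacity_pct(used_gb, total_gb):
    """Used space as a rounded percentage, like the 'capacity' column of df."""
    return round(100 * used_gb / total_gb)


data_gb, snapshot_gb = split_volume(1500, 20)
print(data_gb, snapshot_gb)             # 1200.0 300.0
print(capacity_pct(504, data_gb))       # 42
print(capacity_pct(50, snapshot_gb))    # 17
```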
Initial deduplication
To enable deduplication and deduplicate the existing data, two simple commands on the filer's command line are all you need:
nas6070> sis on /vol/vmware_data_asistest
SIS for "/vol/vmware_data_asistest" is enabled.
Already existing data could be processed by running "sis start -s /vol/vmware_data_asistest".
nas6070> sis start -s /vol/vmware_data_asistest
The file system will be scanned to process existing data in /vol/vmware_data_asistest.
This operation may initialize related existing metafiles.
Are you sure you want to proceed with scan (y/n)? y
Fri Apr 18 06:13:23 CEST [nas6070: wafl.scan.start:info]: Starting SIS volume scan on volume vmware_data_asistest.
The SIS operation for "/vol/vmware_data_asistest" is started.
The initial scan of the volume will build the fingerprint database and then deduplicate data. This is a long background process which can be monitored:
nas6070> sis status
Path                         State     Status   Progress
/vol/vmware_data_asistest    Enabled   Active   47 GB Scanned
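If you want to watch the scan from a script instead of typing sis status by hand, the table is easy to parse. The sketch below is hypothetical: parse_sis_status is a made-up helper, it assumes the four-column layout shown above (with a free-text Progress column), and it leaves out how you fetch the output from the filer.

```python
def parse_sis_status(output):
    """Parse the table printed by `sis status` into a list of dicts.

    Assumes four columns (Path, State, Status, Progress) where only the
    last column may contain spaces.
    """
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("/vol/"):
            continue  # skip the prompt, the header, and blank lines
        path, state, status, progress = line.split(None, 3)
        rows.append({"path": path, "state": state,
                     "status": status, "progress": progress})
    return rows


sample = """\
Path                         State     Status   Progress
/vol/vmware_data_asistest    Enabled   Active   47 GB Scanned
"""
rows = parse_sis_status(sample)
print(rows[0]["status"], "-", rows[0]["progress"])  # Active - 47 GB Scanned
```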
For this particular volume the first scan and the deduplication took 4 hours 45 minutes. Remember - this is a one-time process, as all subsequent deduplication runs are incremental. Of course data can be accessed and changed during deduplication. Let's see the results using the df -s command, which shows the savings:
nas6070> sis status
Path                         State     Status   Progress
/vol/vmware_data_asistest    Enabled   Idle     Idle for 05:28:35
nas6070> df -sh vmware_data_asistest
Filesystem                     used    saved   %saved
/vol/vmware_data_asistest/    219GB    378GB      63%
At first sight I thought: 63% savings? Nice, but not that spectacular. But remember - the WAFL file system never overwrites existing data on disk, so if you had snapshots before the deduplication started, all the “old” data is frozen and still takes up space. Normally that space comes out of the area reserved for snapshots, but here so much data changed that the 20% reserve was not enough. Let's look at the amount of space currently used by the snapshots and compare it to the original state above.
nas6070> df -h vmware_data_asistest
Filesystem                             total    used   avail  capacity  Mounted on
/vol/vmware_data_asistest/            1200GB   219GB   980GB       18%  /vol/vmware_data_asistest/
/vol/vmware_data_asistest/.snapshot    300GB   388GB     0GB      130%  /vol/vmware_data_asistest/.snapshot
You can now either wait until the old snapshots expire or delete them manually (only if you don’t need them). I did exactly that for the purpose of this demonstration, and here is what the volume looks like now:
nas6070> df -sh vmware_data_asistest
Filesystem                     used    saved   %saved
/vol/vmware_data_asistest/    130GB    378GB      74%
Before deduplication we used 504 GB of storage and now it’s only 130 GB - 74% saved. This additional space is immediately available for more VMware data inside the datastore.
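The %saved column can be reproduced from the other two columns as saved / (used + saved). Here is a small sketch using the rounded GB values from the two df -s outputs above; percent_saved is just an illustrative name. The drop in used space from 219 GB to 130 GB appears to correspond roughly to the 88 GB by which the old snapshots exceeded their 300 GB reserve.

```python
def percent_saved(used_gb, saved_gb):
    """Share of the logical data (used + saved) that deduplication removed."""
    return 100 * saved_gb / (used_gb + saved_gb)


# With the old snapshots still holding on to pre-deduplication blocks
print(round(percent_saved(219, 378)))  # 63

# After the old snapshots were deleted
print(round(percent_saved(130, 378)))  # 74
```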
The VMware servers are not even aware of the process: their view of the NFS datastore is exactly the same as before deduplication, and reads deliver the same data. New data is written to disk normally and is deduplicated afterwards at regular intervals (there is no inline deduplication). There is a schedule for every deduplicated volume:
nas6070> sis status -l
Path:                    /vol/vmware_data_asistest
State:                   Enabled
Status:                  Idle
Progress:                Idle for 07:46:29
Type:                    Regular
Schedule:                sun-sat@0
Last Operation Begin:    Sat Apr 19 00:59:00 CEST 2008
Last Operation End:      Sat Apr 19 00:59:52 CEST 2008
Last Operation Size:     23221 KB
Last Operation Error:    -
Here you can see the default schedule: sun-sat@0, which is a day range and means every day from Sunday through Saturday at midnight. You can change it to suit your needs (useful in combination with backup snapshots).
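To make the schedule string easier to read, here is a rough sketch that expands it. It assumes the day_list@hour_list form seen in the output above, where sun-sat is a range from Sunday through Saturday; parse_schedule and expand_days are made-up helpers, and other forms the filer may accept (hour ranges, ranges that wrap around the week) are not handled.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]


def expand_days(spec):
    """Expand a day list such as 'sun-sat' or 'mon,wed-fri' into day names."""
    result = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            start, end = DAYS.index(first), DAYS.index(last)
            result.extend(DAYS[start:end + 1])
        else:
            result.append(part)
    return result


def parse_schedule(schedule):
    """Split 'day_list@hour_list' into (list of days, list of hours)."""
    day_spec, hour_spec = schedule.split("@")
    hours = [int(h) for h in hour_spec.split(",")]
    return expand_days(day_spec), hours


days, hours = parse_schedule("sun-sat@0")
print(days)   # all seven days, sun through sat
print(hours)  # [0]
```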
Of course you can replicate your deduplicated volumes to another machine using SnapMirror, and you will see the same savings on the target side. Caveat: if you back up to tape using NDMP you’ll get the original size.
In my opinion the combination of VMware and filesystem deduplication is a disruptive technology: with 60-90% savings, the amount of data you can store on the filer changes dramatically - and there is no new software, no new appliance and no additional complexity (see above - that's all!). It also saves energy. Beginning with Ontap 7.3 the fingerprint database is stored in the aggregate instead of inside the volume. As far as I can tell deduplication itself still works per volume, so don't count on additional savings across multiple datastore volumes (we have several).
In part 3 we’ll discuss how ASIS can help to optimize thin provisioning. Stay tuned…
Do you use ASIS with VMware? What savings do you see? Comments welcome! (moderated).