Entries from April 2008 ↓

VMware on NFS datastores: sequential performance

We recently tested the sequential performance of a virtual disk in an ESX virtual machine when the datastore is placed on NFS. The ESX servers have a separate VMkernel Gigabit Ethernet interface which is used exclusively for NFS. On the other side, a NetApp FAS 6070 attached via 10 GigE exports a volume placed on a 2×16 disk (300 GB 10k FC) RAID-DP aggregate.

The server is physically a Sun 4150 with additional GigE interfaces. The VM has been configured with 4 virtual CPUs, a 25 GB virtual disk (placed on the NFS datastore) and an OpenSuse 10.3 template.

On the virtual disk I created an XFS file system and then tested with iozone (iozone -t1 -i0 -i1 -r1m -s10g). Results are pretty good:

write:   106970 kB/s
rewrite: 108379 kB/s
read:    103768 kB/s
reread:  109418 kB/s

In this setup the single GigE network connection of the ESX host is the limit but I’m quite satisfied with these numbers. There are two knobs I used to tune performance in this benchmark:

options nfs.tcp.recvwindowsize 64240

on the filer increases writes from 75 to 100 MB/s (NetApp’s default seems pretty low). On the Linux VM side you can turn up readahead using /sys/block/<your virtual disk device>/queue/read_ahead_kb (I used 2048 kB instead of the default 128 kB). This will help a lot with the read numbers.
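The readahead knob on the Linux side can also be set from a small script. This is only a sketch: the function name and the sysfs_root parameter are made up for illustration (the latter so the helper can be tried against a scratch directory), and the device name in the comment is a placeholder. Writing to the real file needs root.

```python
from pathlib import Path


def set_read_ahead_kb(device: str, kb: int, sysfs_root: str = "/sys") -> int:
    """Set the readahead (in kB) of a block device and return the value read back.

    `device` is the kernel name of the virtual disk (for example "sdb");
    `sysfs_root` is a hypothetical parameter that only exists to allow dry runs.
    """
    knob = Path(sysfs_root) / "block" / device / "queue" / "read_ahead_kb"
    previous = int(knob.read_text())      # the post mentions a default of 128
    knob.write_text(f"{kb}\n")            # requires root on a real system
    current = int(knob.read_text())
    print(f"{device}: read_ahead_kb {previous} -> {current}")
    return current


# Example call (placeholder device name, value used in the test above):
# set_read_ahead_kb("sdb", 2048)
```

The setting is not persistent, so it has to be reapplied after a reboot, for example from a boot script.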

I’m really looking forward to Neterion’s X3100 10 GigE cards - we’ll put a couple of them into our VMware servers and then the current bottleneck (GigE interface) will go away. The simplicity of VMware with NFS datastores is really amazing (particularly for larger numbers of VMware servers) and I didn’t even tell you about NetApp’s ASIS deduplication yet :-)
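To put the GigE bottleneck into numbers, here is a rough calculation of how much of the raw 1 Gbit/s line rate the iozone results above correspond to. It assumes that iozone's kB means 1024 bytes and it ignores Ethernet, IP, TCP and NFS overhead, so the real headroom is smaller than the percentages suggest.

```python
def link_utilization(throughput_kb_s: float,
                     link_bit_s: float = 1e9,
                     bytes_per_kb: int = 1024) -> float:
    """Fraction of the raw line rate consumed by a payload throughput.

    bytes_per_kb=1024 is an assumption about how iozone reports kB/s.
    """
    return throughput_kb_s * bytes_per_kb * 8 / link_bit_s


# iozone results from the test above, in kB/s
results = {"write": 106970, "rewrite": 108379, "read": 103768, "reread": 109418}

for name, kb_s in results.items():
    print(f"{name:8s} {kb_s} kB/s = {link_utilization(kb_s):.1%} of raw GigE line rate")
```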

Edit: there is also a post on NFS performance inside a virtual machine

What numbers do you see in your environment? Comments welcome.

Tuning sequential I/O performance on LSI RAID controllers

RAID controllers based on the LSI 6998 platform (for example the InfiniteStorage IS 4500 from SGI, or equivalent machines from IBM and Sun/StorageTek) are workhorses in many data centers. Don’t worry if you do not know this particular model: it’s basically a standard FibreChannel-based RAID system with two redundant controllers, 4×4 Gb FC host connections and 4×4 Gb FC drive loop connections.

RAID group size

There are a couple of tricks if you want to achieve top sequential performance from this type of controller. First, make sure your RAID groups are configured using the “magic” numbers 4+1 or 8+1. That means you should use either 5 or 9 disks per RAID5 group. Other numbers also work, but these configurations are the fastest. Considering the size of modern drives and the admittedly aged RAID5 protection, you might want to opt for the 4+1 config.

Mirror magic

The next factor is an interaction of the write cache in the controller, the stripe size on the RAID groups and the I/O size on your host. Let’s begin with the write cache: the 6998 can have up to 16 GB of cache, and in the default configuration data coming from the host is first written to the cache and then committed to disk. As read-modify-write cycles in RAID5 are expensive, the cache gives the controller an opportunity to batch operations.

Unfortunately there is a catch: if cache mirroring is turned on, the controller hosting your LUN will send all updates to its partner’s cache using the drive-side FC loops. This is not a problem for mixed workloads, but if you need maximum sequential performance, write traffic to disk and inter-cache traffic will interfere with each other.

There are two ways to avoid cache mirroring: the first is to turn off mirroring completely (which is bad for data integrity). The other one is to bypass the cache. How does it work? If the controller receives a write which has the size of a complete RAID5 stripe and is aligned to the beginning of the stripe, it will write it directly to disk.

What does that mean for our config? When building RAID5 groups you can specify the “stripe size” in the configuration using values up to 512 kB. For example, 128 kB with a “4+1” config means that four disks will receive 128 kB of data each (the fifth one holds the parity information) and a full RAID stripe needs 4×128 kB = 512 kB of data. If the stripe size is 512 kB you’ll need 2048 kB.
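The stripe arithmetic and the bypass condition can be written down in a few lines. This is just a sketch of the rule as described here (write size equals one full stripe, offset aligned to a stripe boundary); the function names are made up and the controller firmware may well apply additional conditions.

```python
def full_stripe_kb(data_disks: int, segment_kb: int) -> int:
    """Data needed to fill one RAID5 stripe: per-disk "stripe size" times data disks."""
    return data_disks * segment_kb


def is_full_stripe_write(offset_kb: int, length_kb: int,
                         data_disks: int, segment_kb: int) -> bool:
    """True if a write covers exactly one stripe and starts on a stripe boundary."""
    stripe = full_stripe_kb(data_disks, segment_kb)
    return offset_kb % stripe == 0 and length_kb == stripe


# The two 4+1 examples from the text
print(full_stripe_kb(4, 128))   # 4 x 128 kB
print(full_stripe_kb(4, 512))   # 4 x 512 kB

# A 512 kB write at offset 1024 kB is aligned; the same write at 64 kB is not
print(is_full_stripe_write(1024, 512, data_disks=4, segment_kb=128))
print(is_full_stripe_write(64, 512, data_disks=4, segment_kb=128))
```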

Host I/O size

Let’s assume you have a Linux-based host attached to your 6998. For your data writes to align nicely with RAID stripes, files have to “begin” on a stripe boundary. How to do that really depends on the filesystem: with XFS, for example, you can pass information about the RAID stripe size to XFS when running mkfs.xfs. And remember to either use raw devices without a partition table or (alternatively) align the beginning of the partition to the next stripe boundary.
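As an illustration, the helper below builds a mkfs.xfs call using the su (stripe unit, per disk) and sw (stripe width, number of data disks) suboptions of -d, and computes the first partition start sector that sits on a stripe boundary. The helper names are made up, /dev/sdX is a placeholder, and 512-byte sectors are an assumption; check the resulting command before running it, since mkfs destroys existing data.

```python
def mkfs_xfs_command(device: str, segment_kb: int, data_disks: int) -> list[str]:
    """Build (but do not run) a mkfs.xfs command line with RAID geometry hints."""
    return ["mkfs.xfs", "-d", f"su={segment_kb}k,sw={data_disks}", device]


def next_aligned_sector(first_free_sector: int, stripe_kb: int,
                        sector_bytes: int = 512) -> int:
    """Round a start sector up to the next full-stripe boundary."""
    sectors_per_stripe = stripe_kb * 1024 // sector_bytes
    # Ceiling division, then back to sectors
    return -(-first_free_sector // sectors_per_stripe) * sectors_per_stripe


# 4+1 RAID5 group with 128 kB per disk, i.e. a 512 kB full stripe
print(" ".join(mkfs_xfs_command("/dev/sdX", segment_kb=128, data_disks=4)))

# A partition that would traditionally start at sector 63 gets moved up
print(next_aligned_sector(63, stripe_kb=512))
```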

The I/O size can be tuned using /sys/block/<device>/queue/max_sectors_kb, up to the value given in max_hw_sectors_kb. The default of 512 kB is fine if your setting on the 6998 is 128 kB (using 4+1), because a full stripe is then exactly 512 kB. If you use the device mapper (dm), remember there is a bug in the Linux kernel, at least until 2.6.23, which causes devices using dm to ignore increased limits (more than 512 kB).
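A small sketch of that tuning step: read the hardware limit, clamp the wanted value to it and write the result. The function name and the sysfs_root parameter are invented for this example (the latter allows a dry run against a scratch directory), and writing to the real sysfs file needs root.

```python
from pathlib import Path


def raise_max_sectors_kb(device: str, wanted_kb: int, sysfs_root: str = "/sys") -> int:
    """Set max_sectors_kb to wanted_kb, capped at max_hw_sectors_kb.

    Returns the value that was actually written.
    """
    queue = Path(sysfs_root) / "block" / device / "queue"
    hw_limit = int((queue / "max_hw_sectors_kb").read_text())
    target = min(wanted_kb, hw_limit)     # the kernel rejects values above the hw limit
    (queue / "max_sectors_kb").write_text(f"{target}\n")
    return target


# Example call for a 4+1 group with a 512 kB stripe size (2048 kB full stripe);
# "sdc" is a placeholder device name:
# raise_max_sectors_kb("sdc", 2048)
```

If the hardware limit is lower than a full stripe, the host cannot issue full-stripe writes in a single request, and a smaller stripe size on the controller is probably the better choice.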

An example configuration of two SGI Altix XE 240 with 2×4 Gb FC each and one SGI IS 4500 system with 8 shelves and 20×(4+1) LUNs on 300 GB 10k FC drives delivers more than 1 GB/s to and from disk when measured using iozone on XFS. The usable capacity of the system is approximately 21 TB. When comparing to other setups, remember that this configuration is redundant, without a single point of failure.
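A quick sanity check of the capacity figure, assuming that drive vendors count 300 GB in decimal units and that only the four data disks of each 4+1 group hold user data. Reading the "approximately 21 TB" as binary units is my interpretation, not something the configuration tool states.

```python
def usable_capacity_bytes(luns: int, data_disks: int, drive_gb: int) -> int:
    """Net capacity of several RAID5 groups, parity disks excluded (decimal GB)."""
    return luns * data_disks * drive_gb * 10**9


capacity = usable_capacity_bytes(luns=20, data_disks=4, drive_gb=300)
print(f"{capacity / 10**12:.1f} TB (decimal)")
print(f"{capacity / 2**40:.1f} TiB (binary)")
```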

 

Comments welcome (moderated).