Tuning sequential I/O performance on LSI RAID controllers

RAID controllers based on the LSI 6998 platform - for example the InfiniteStorage IS 4500 from SGI, or equivalent machines from IBM and Sun/StorageTek - are workhorses in many data centers. Don’t worry if you do not know this particular model: it is basically a standard Fibre Channel-based RAID system with two redundant controllers, 4×4 Gb FC host connections and 4×4 Gb FC drive loop connections.

RAID group size

There are a couple of tricks for achieving top sequential performance from this type of controller. First, make sure your RAID groups are configured using the “magic” numbers 4+1 or 8+1, that is, either 5 or 9 disks per RAID5 group. Other group sizes work too, but these configurations are the fastest. Considering the size of modern drives and the admittedly aged RAID5 protection, you might want to opt for the 4+1 config.

Mirror magic

The next factor is an interaction between the write cache in the controller, the stripe size of the RAID groups and the I/O size on your host. Let’s begin with the write cache: the 6998 can have up to 16 GB of cache, and in the default configuration data coming from the host is first written to the cache and then committed to disk. Because read-modify-write cycles in RAID5 are expensive, this gives the controller an opportunity to batch operations.

Unfortunately there is a catch: if cache mirroring is turned on, the controller hosting your LUN will send all updates to its partner’s cache using the drive-side FC loops. This is not a problem for mixed workloads, but if you need maximum sequential performance, write traffic to disk and inter-cache traffic will compete for the same loops.

There are two ways to avoid cache mirroring. The first is to turn off mirroring completely (which is bad for data integrity). The other is to bypass the cache. How does that work? If the controller receives a write that has the size of a complete RAID5 stripe and is aligned to the beginning of the stripe, it will write it directly to disk.

What does that mean for our config? When building RAID5 groups you can specify the “stripe size” in the configuration, using values up to 512 kB. For example, 128 kB with a “4+1” config means that four disks will receive 128 kB of data each (the fifth one holds parity information), so a full RAID stripe needs 4×128 kB = 512 kB of data. If the stripe size is 512 kB you’ll need 2048 kB.
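The stripe arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the full-stripe check simply restates the rule described in the text (exactly one stripe, aligned to its start) rather than the controller's actual firmware logic.

```python
def full_stripe_kb(data_disks: int, segment_kb: int) -> int:
    """Data needed to fill one RAID5 stripe (parity disk excluded)."""
    if data_disks < 1 or segment_kb < 1:
        raise ValueError("data_disks and segment_kb must be positive")
    return data_disks * segment_kb


def is_full_stripe_write(offset_kb: int, length_kb: int,
                         data_disks: int, segment_kb: int) -> bool:
    """True if a write covers exactly one stripe and starts on a stripe boundary.

    Assumption: this mirrors the rule as stated in the text, nothing more.
    """
    stripe = full_stripe_kb(data_disks, segment_kb)
    return length_kb == stripe and offset_kb % stripe == 0


if __name__ == "__main__":
    # The two group layouts and two per-disk sizes mentioned in the text.
    for data_disks in (4, 8):
        for segment_kb in (128, 512):
            print(f"{data_disks}+1 with {segment_kb} kB per disk: "
                  f"{full_stripe_kb(data_disks, segment_kb)} kB per full stripe")
```

For a 4+1 group with 128 kB per disk, a 512 kB write starting at offset 0 qualifies, while the same write starting at offset 64 kB does not.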

Host I/O size

Let’s assume you have a Linux-based host attached to your 6998. For your data writes to align nicely with RAID stripes, files have to “begin” on a stripe boundary. How to do that really depends on the filesystem: with XFS, for example, you can pass information about the RAID stripe size to XFS when running mkfs.xfs. And remember to use raw devices without a partition table or (alternatively) align the beginning of the partition to the next stripe boundary.
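As a rough sketch of both points, the helper below builds an mkfs.xfs command line using the su (stripe unit, the per-disk chunk) and sw (stripe width, the number of data disks) options, and rounds a partition start up to the next full-stripe boundary. The device name /dev/sdX is a placeholder, the function names are invented for this example, and the 512-byte sector size is an assumption. The script only prints the command, it does not run it.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def mkfs_xfs_command(device: str, data_disks: int, segment_kb: int) -> list:
    """Build (not run) an mkfs.xfs call that tells XFS about the RAID geometry."""
    # su = per-disk stripe unit, sw = number of data disks in the group
    return ["mkfs.xfs", "-d", f"su={segment_kb}k,sw={data_disks}", device]


def next_aligned_sector(start_sector: int, data_disks: int, segment_kb: int) -> int:
    """Round a partition start up to the next full-stripe boundary, in sectors."""
    stripe_sectors = data_disks * segment_kb * 1024 // SECTOR_BYTES
    # Ceiling division, then scale back to sectors.
    return -(-start_sector // stripe_sectors) * stripe_sectors


if __name__ == "__main__":
    # 4+1 group with 128 kB per disk; /dev/sdX is a placeholder device.
    print(" ".join(mkfs_xfs_command("/dev/sdX", 4, 128)))
    # A partition starting at the traditional sector 63 is not aligned.
    print(next_aligned_sector(63, 4, 128))
```

With 4×128 kB stripes, one full stripe is 1024 sectors, so a partition that would start at sector 63 should be moved to sector 1024.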

The I/O size can be tuned using /sys/block/<device>/queue/max_sectors_kb, up to the value given in max_hw_sectors_kb. The default of 512 kB is fine if your setting on the 6998 is 128 kB (using 4+1), because 4×128 kB is exactly one 512 kB full stripe. If you use the device mapper (dm), remember there is a bug in the Linux kernel, at least until 2.6.23, which causes devices using dm to ignore increased limits (more than 512 kB).
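A minimal sketch of that tuning step, assuming the two sysfs files named above: read the hardware limit, clamp the wanted I/O size to it, and write the result back. The function name and the directory argument are made up for this example. On a real host the directory would be /sys/block/<device>/queue and writing requires root; the demo at the bottom uses a temporary directory as a stand-in so it can be run safely.

```python
import tempfile
from pathlib import Path


def apply_max_sectors_kb(queue_dir: str, wanted_kb: int) -> int:
    """Set max_sectors_kb to wanted_kb, clamped to max_hw_sectors_kb.

    Returns the value that ends up in max_sectors_kb.
    """
    queue = Path(queue_dir)
    hw_limit = int((queue / "max_hw_sectors_kb").read_text().strip())
    current = int((queue / "max_sectors_kb").read_text().strip())
    target = min(wanted_kb, hw_limit)  # the kernel will not accept more than this
    if target != current:
        (queue / "max_sectors_kb").write_text(f"{target}\n")
    return target


if __name__ == "__main__":
    # Stand-in for /sys/block/<device>/queue with made-up values.
    with tempfile.TemporaryDirectory() as fake_queue:
        Path(fake_queue, "max_hw_sectors_kb").write_text("1024\n")
        Path(fake_queue, "max_sectors_kb").write_text("512\n")
        # Asking for 2048 kB (a 4 x 512 kB stripe) gets clamped to the limit.
        print(apply_max_sectors_kb(fake_queue, 2048))
```

If the hardware limit is lower than one full stripe, as in the demo, a smaller stripe size on the controller is the only way to keep host writes and stripes the same size.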

An example configuration of two SGI Altix XE 240 hosts with 2× 4 Gb FC each and one SGI IS 4500 system with 8 shelves and 20×(4+1) LUNs on 300 GB 10k FC drives delivers more than 1 GB/s to and from disk when measured using iozone on XFS. The usable capacity of the system is approximately 21 TB. When comparing with other setups, remember that this configuration is redundant, with no single point of failure.
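The capacity figure can be cross-checked with a little arithmetic. The sketch below counts only the data disks of each 4+1 group; reading the quoted 21 TB as binary units (TiB) minus some filesystem overhead is an assumption of this example, not something the measurements state.

```python
def raw_data_capacity_gb(luns: int, data_disks: int, drive_gb: int) -> int:
    """Capacity of the data disks only (one parity disk per group excluded)."""
    return luns * data_disks * drive_gb


def decimal_gb_to_tib(gb: float) -> float:
    """Convert decimal gigabytes (10^9 bytes) to binary tebibytes (2^40 bytes)."""
    return gb * 10**9 / 2**40


if __name__ == "__main__":
    gb = raw_data_capacity_gb(luns=20, data_disks=4, drive_gb=300)
    print(f"{gb} GB decimal, about {decimal_gb_to_tib(gb):.1f} TiB")
```

Twenty 4+1 groups of 300 GB drives give 24000 GB of data capacity, which is about 21.8 TiB before any filesystem overhead.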

 

