Entries Tagged 'RAID'
June 7th, 2008 — Filesystems, Hardware, RAID
Lustre is the most popular network file system for Linux clusters and high-bandwidth applications. I’d like to show you our basic server “building block” for Lustre storage. Each “building block”…
- fits into a single rack
- has no single point of failure
- delivers at least 850-900 MB/s of write/read bandwidth to disk as seen from Lustre clients
- provides 20 Lustre Object Storage Targets (OSTs) with ~1 TB each
- has a total net usable capacity of 21+ TB
- and draws about 3.3 kW of power.
The hardware

This is not really an el-cheapo storage system - we used Fibre Channel disks, redundant controllers and redundant servers which make this solution much more expensive than a non-redundant setup based on, say, a Sun X4500.
You will need the following components:
- a standard 42U rack (e.g. APC)
- two SGI Altix XE 240 servers, each with a 10 GigE network card and a dual-port 4 Gbit FC HBA
- one SGI InfiniteStorage 4500 RAID controller
- eight disk shelves
- 104 disks
All these components easily fit into the single 42U rack and need approximately 3.3 kW when running on dual PDUs. What is nice about the setup is that all Fibre Channel cables stay inside the rack and you do not need expensive FC switches: both servers are connected directly to the RAID controllers. The only connections to the outside world are two 10 GigE network cables, some management interfaces and the power cables.
Disk subsystem setup
In our configuration we used 104 disks. Why 104? We needed 100 disks for the RAID sets and 4 hot spares. Every RAID set has 4+1 disks (four for data, one for parity), so there are 20 RAID sets, and each one is visible as a 1.1 TB LUN.
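The disk arithmetic above can be checked with a short sketch. The 300 GB drive size is taken from the April 14th entry further down; reading the "1.1 TB" per LUN as binary units (TiB) is an assumption made here to reconcile the numbers, and the function name is purely illustrative.

```python
def raid_layout(raid_sets=20, data_disks=4, parity_disks=1,
                hot_spares=4, disk_gb=300):
    """Return (total disks, usable TiB per LUN, usable TiB overall)."""
    total_disks = raid_sets * (data_disks + parity_disks) + hot_spares
    # Only the data disks of a RAID set contribute usable capacity.
    lun_bytes = data_disks * disk_gb * 10**9   # drive sizes are quoted in decimal GB
    lun_tib = lun_bytes / 2**40                # assumption: LUN sizes are shown in TiB
    return total_disks, lun_tib, raid_sets * lun_tib


disks, per_lun, total = raid_layout()
print(f"{disks} disks, {per_lun:.2f} TiB per LUN, {total:.1f} TiB usable")
```

With the defaults this gives 104 disks, roughly 1.09 TiB per LUN and about 21.8 TiB in total, which is consistent with the "~1 TB each" and "21+ TB" figures in the list at the top. Formatted file system capacity will be somewhat lower.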
The IS 4500 controller (which is an OEM system from LSI; you can get similar hardware from Sun/STK and IBM) has two controller blades which - during normal operations - are responsible for their “own” LUNs. Here it was important to distribute them evenly, so each server has 5 LUNs on one RAID controller blade and 5 on the other one. The stripe size on the RAID controller is set to 128 kB - with 4 data disks and 512 kB writes the controller firmware will recognize full-stripe writes and bypass the cache (and thus the cache mirroring), which improves performance. By the way: 4+1 and 8+1 are “magic” configurations on these controllers which deliver the best performance.
Our RAID controller only supported RAID5, but with newer controllers you might want to use RAID6 - in this case you’ll need 20 additional disks (one more per RAID set), which will also fit into the 8 shelves. Alternatively one could use 8+2 RAID groups, but this could require additional testing because of the full-stripe write issue.

Each server has a dual-port Fibre Channel card and is connected to both RAID controller blades. If an FC cable or a controller blade fails, the server can switch to the other blade.
On the client-facing side each server has a 10 Gigabit Ethernet network card - we used Myricom 10 GigE cards. Please make sure you use the PCIe x8 slot in the server and not the x4 one (be careful while ordering the server to get the right riser cards). Check the “*dma*” speed entries in “ethtool -S” - they should show more than 1200 MB/s.
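The check of the “*dma*” entries can be scripted. The following is a minimal sketch that filters “name: value” lines for names containing “dma” and flags anything that does not exceed 1200 MB/s. The statistic names in the sample are made up for illustration; the real names depend on the driver version, so treat them as assumptions.

```python
def parse_dma_stats(ethtool_output):
    """Collect numeric 'name: value' statistics whose name contains 'dma'."""
    stats = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or "dma" not in name.lower():
            continue
        try:
            stats[name] = float(value)
        except ValueError:
            pass  # ignore entries without a numeric value
    return stats


def slow_dma_entries(stats, minimum_mb_s=1200):
    """Return the entries that do not exceed the expected bandwidth."""
    return {name: v for name, v in stats.items() if v <= minimum_mb_s}


# Hypothetical sample text, NOT real driver output.
sample = """NIC statistics:
     hypothetical_read_dma_MBs: 1450
     hypothetical_write_dma_MBs: 980
     rx_packets: 12345
"""
print(slow_dma_entries(parse_dma_stats(sample)))
```

On a real server the input could come from something like subprocess.run(["ethtool", "-S", "eth2"], capture_output=True, text=True).stdout, where "eth2" is only a placeholder for the interface name. An entry flagged here would hint at the card sitting in the wrong slot.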
More details on the Lustre setup in Part 2 (tbd) - Comments welcome (moderated).
April 14th, 2008 — Hardware, RAID
RAID controllers based on the LSI 6998 platform - for example the InfiniteStorage IS 4500 from SGI or equivalent machines from IBM and Sun/StorageTek - are workhorses in many data centers. Don’t worry if you do not know this particular model - it’s basically a standard Fibre Channel-based RAID system with two redundant controllers, 4×4 Gb FC host connections and 4×4 Gb FC drive loop connections.
RAID group size
There are a couple of tricks if you want to achieve top sequential performance from this type of controller. First, make sure your RAID groups are configured using the “magic” numbers 4+1 or 8+1. That means you should use either 5 or 9 disks per RAID5 group. While other numbers also work, these configurations are the fastest. Considering the size of modern drives and the admittedly aged RAID5 protection you might want to opt for the 4+1 config.
Mirror magic
The next factor is an interaction of the write cache in the controller, the stripe size on the RAID groups and the I/O size on your host. Let’s begin with the write cache: the 6998 can have up to 16 GB of cache, and in the default configuration data coming from the host is first written to the cache and then committed to disk. Because read-modify-write cycles in RAID5 are expensive, caching gives the controller an opportunity to batch operations.
Unfortunately there is a catch: if cache mirroring is turned on, the controller hosting your LUN will send all updates to its partner’s cache using the drive-side FC loops. This is not a problem for mixed workloads, but if you need maximum sequential performance, write traffic to disk and inter-cache traffic will interfere with each other.
There are two ways to avoid cache mirroring: the first is to turn off mirroring completely (which is bad for data integrity). The other one is to bypass the cache. How does it work? If the controller receives a write which has the size of a complete RAID5 stripe and is aligned to the beginning of the stripe, it will write it directly to disk.
What does that mean for our config? When building RAID5 groups you can specify the “stripe size” in the configuration using values up to 512 kB. For example, 128 kB with a “4+1” config means that four disks will receive 128 kB of data each (the fifth one holds the parity for that stripe) and a full RAID stripe needs 4×128 kB = 512 kB of data. If the stripe size is 512 kB you’ll need 2048 kB.
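The stripe arithmetic and the bypass condition can be written down as a small sketch. It only models the rule as described here (write length equals one full stripe, offset aligned to a stripe boundary); actual firmware behavior may differ, and the function names are illustrative.

```python
def full_stripe_kb(data_disks, stripe_size_kb):
    """Amount of data (in kB) that fills one complete RAID stripe."""
    return data_disks * stripe_size_kb


def is_full_stripe_write(offset_kb, length_kb, data_disks=4, stripe_size_kb=128):
    """True if a write covers exactly one stripe and starts on a stripe boundary."""
    stripe = full_stripe_kb(data_disks, stripe_size_kb)
    return offset_kb % stripe == 0 and length_kb == stripe


print(full_stripe_kb(4, 128))            # 4+1 with 128 kB per disk
print(full_stripe_kb(4, 512))            # 4+1 with 512 kB per disk
print(is_full_stripe_write(1024, 512))   # aligned 512 kB write
print(is_full_stripe_write(256, 512))    # same size, but misaligned
```

The first two calls reproduce the 512 kB and 2048 kB figures from the text; the last two show that the same 512 kB write only qualifies when it starts on a multiple of 512 kB.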
Host I/O size
Let’s assume you have a Linux-based host attached to your 6998. To ensure your writes align nicely with RAID stripes, make sure that files “begin” on a stripe boundary. How to do that depends on the filesystem: with XFS, for example, you can pass information about the RAID stripe size to mkfs.xfs when creating the filesystem. And remember to use raw devices without a partition table or (alternatively) align the beginning of the partition to the next stripe boundary.
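As a rough sketch of both steps: mkfs.xfs accepts the stripe geometry through its -d su=...,sw=... options (stripe unit per disk and number of data disks), and a partition start can be rounded up to the next full-stripe boundary. The device name "/dev/sdX" is a placeholder and 512-byte sectors are an assumption; the helper names are illustrative.

```python
def mkfs_xfs_command(device, data_disks=4, stripe_size_kb=128):
    """Build a mkfs.xfs command line that tells XFS about the RAID geometry."""
    # su = stripe unit (per-disk chunk), sw = stripe width in units of su
    return ["mkfs.xfs", "-d", f"su={stripe_size_kb}k,sw={data_disks}", device]


def align_partition_start(start_sector, data_disks=4, stripe_size_kb=128,
                          sector_bytes=512):
    """Round a partition start sector up to the next full-stripe boundary."""
    stripe_sectors = data_disks * stripe_size_kb * 1024 // sector_bytes
    # Ceiling division, then scale back to sectors.
    return -(-start_sector // stripe_sectors) * stripe_sectors


print(" ".join(mkfs_xfs_command("/dev/sdX")))
print(align_partition_start(63))
```

For the 4+1 layout with 128 kB per disk a full stripe is 1024 sectors, so a partition that would traditionally start at sector 63 is moved to sector 1024. The command is only printed here, not executed.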
The I/O size can be tuned using /sys/block/<device>/queue/max_sectors_kb up to the value given in max_hw_sectors_kb. The default of 512 kB is fine if your stripe size setting on the 6998 is 128 kB (using 4+1), because a full stripe is then exactly 512 kB. If you use the device mapper (dm), remember there is a bug in the Linux kernel, at least until 2.6.23, which causes devices using dm to ignore increased limits (more than 512 kB).
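A minimal sketch for checking these limits against the full-stripe size follows. It only reads the two sysfs files named above and does not change anything; the sysfs_root parameter and the advice strings are additions for illustration, and treating any value at or above the full stripe as sufficient is an assumption.

```python
from pathlib import Path


def queue_limits(device, sysfs_root="/sys/block"):
    """Return (max_sectors_kb, max_hw_sectors_kb) for a block device."""
    queue = Path(sysfs_root) / device / "queue"
    current = int((queue / "max_sectors_kb").read_text())
    hardware = int((queue / "max_hw_sectors_kb").read_text())
    return current, hardware


def io_size_advice(current_kb, hardware_kb, full_stripe_kb):
    """Compare the host I/O size limit with the RAID full-stripe size."""
    if current_kb >= full_stripe_kb:
        return "ok"
    if hardware_kb >= full_stripe_kb:
        return f"raise max_sectors_kb to {full_stripe_kb}"
    return "full stripe exceeds max_hw_sectors_kb"


# 4+1 with 128 kB per disk: the 512 kB default is already large enough.
print(io_size_advice(512, 32767, 4 * 128))
# 4+1 with 512 kB per disk: the limit would have to be raised.
print(io_size_advice(512, 32767, 4 * 512))
```

On a live system, queue_limits("sdb") (with "sdb" as a placeholder device name) supplies the first two arguments. Raising the limit means writing the new value to max_sectors_kb as root.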
An example configuration of two SGI Altix XE 240 servers with 2×4 Gb FC each, one SGI IS 4500 system with 8 shelves and 20×(4+1) LUNs on 300 GB 10k FC drives delivers more than 1 GB/s to and from disk when measured using iozone on XFS. The usable capacity of the system is approximately 21 TB. When comparing with other setups, remember that this configuration is redundant, without a single point of failure.