
How to build a Lustre server (Part 1)

Lustre is the most popular network file system for Linux clusters and high-bandwidth applications. I’d like to show you our basic server “building block” for Lustre storage. Each “building block”…

  • fits into a single rack
  • has no single point of failure
  • delivers at least 850-900 MB/s of write/read bandwidth to disk as seen from Lustre clients
  • provides 20 Lustre Object Storage Targets (OSTs) with ~1 TB each
  • has a total net usable capacity of 21+ TB 
  • and a power consumption of 3.3 kW.

The hardware 

This is not really an el-cheapo storage system - we used Fibre Channel disks, redundant controllers and redundant servers which make this solution much more expensive than a non-redundant setup based on, say, a Sun X4500.

You will need the following components:

  • a standard 42U rack (e.g. APC)
  • two SGI Altix XE 240 servers, each with a 10 GigE network card and a dual-port 4 Gbit FC HBA
  • one SGI InfiniteStorage 4500 RAID controller
  • eight disk shelves
  • 104 disks

All these components easily fit into a single 42U rack and need approximately 3.3 kW when running on dual PDUs. What is nice about the setup is that all Fibre Channel cables stay inside the rack and you do not need expensive FC switches: both servers are connected directly to the RAID controller blades. The only connections to the outside world are two 10 GigE network cables, some management interfaces and the power cables.

 

Disk subsystem setup

In our configuration we used 104 disks. Why 104? We needed 100 disks for the RAID sets (data plus parity) and 4 hot spares. Every RAID set has 4+1 disks, so there are 20 RAID sets available and each one is visible as a 1.1 TB LUN.
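
As a quick sanity check of the disk count, here is a minimal Python sketch that uses only the numbers given above. The 22 TB it prints is the sum of the LUN sizes; presumably the 21+ TB net figure is what remains after file system overhead.

```python
# Disk budget for one building block, using only numbers from the text.
DATA_DISKS_PER_SET = 4      # "4+1": four data disks ...
PARITY_DISKS_PER_SET = 1    # ... plus one parity disk (RAID5)
RAID_SETS = 20
HOT_SPARES = 4
LUN_SIZE_TB = 1.1           # size of each LUN as seen by the servers


def disk_budget(raid_sets=RAID_SETS, spares=HOT_SPARES):
    """Return (disks in RAID sets, total disks, summed LUN capacity in TB)."""
    per_set = DATA_DISKS_PER_SET + PARITY_DISKS_PER_SET
    in_sets = raid_sets * per_set
    total = in_sets + spares
    lun_capacity_tb = raid_sets * LUN_SIZE_TB
    return in_sets, total, lun_capacity_tb


in_sets, total, lun_capacity_tb = disk_budget()
print(in_sets, total, round(lun_capacity_tb, 1))   # prints: 100 104 22.0
```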

The IS 4500 controller (an OEM system from LSI; you can get similar hardware from Sun/STK and IBM) has two controller blades which, during normal operation, are each responsible for their “own” LUNs. It was important to distribute the LUNs evenly, so each server has 5 LUNs on one RAID controller blade and 5 on the other. The stripe size on the RAID controller is set to 128 kB: with 4 data disks per RAID set a full stripe is 4 × 128 kB = 512 kB, so with 512 kB writes the controller firmware will recognize full-stripe writes and bypass the cache (and thus the cache mirroring), which improves performance. By the way: 4+1 and 8+1 are “magic” configurations on these controllers which deliver the best performance.
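
The full-stripe arithmetic can be sketched in a few lines of Python. This is only an illustration: the alignment check is my assumption about what makes a write a full-stripe write, not a description of what the controller firmware actually does.

```python
SEGMENT_KB = 128   # per-disk stripe size set on the RAID controller
DATA_DISKS = 4     # data disks in a 4+1 RAID5 set


def full_stripe_kb(segment_kb=SEGMENT_KB, data_disks=DATA_DISKS):
    """Amount of data that fills exactly one stripe across all data disks."""
    return segment_kb * data_disks


def is_full_stripe_write(offset_kb, length_kb, stripe_kb):
    """Assumed criterion: the write starts on a stripe boundary and
    covers a whole number of stripes."""
    return (
        length_kb > 0
        and offset_kb % stripe_kb == 0
        and length_kb % stripe_kb == 0
    )


stripe = full_stripe_kb()
print(stripe)                                   # 512
print(is_full_stripe_write(0, 512, stripe))     # True
print(is_full_stripe_write(128, 512, stripe))   # False: not aligned
```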

Our RAID controller only supported RAID5, but with newer controllers you might want to use RAID6. In this case you’ll need 20 additional disks (one more per RAID set), which will also fit into the 8 shelves. Alternatively, one could use 8+2 RAID groups, but this could require additional testing because of the full-stripe write issue.
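
Here is a small Python sketch of both points, assuming the same 20 RAID sets, 4 hot spares and 128 kB stripe size as in our setup. It shows where the 20 extra disks come from and why an 8+2 group no longer lines up with 512 kB writes.

```python
SEGMENT_KB = 128   # per-disk stripe size from the text
WRITE_KB = 512     # write size from the text
RAID_SETS = 20
HOT_SPARES = 4


def total_disks(data_disks, parity_disks, raid_sets=RAID_SETS, spares=HOT_SPARES):
    """Disks needed for the given RAID layout, including hot spares."""
    return raid_sets * (data_disks + parity_disks) + spares


def full_stripe_kb(data_disks, segment_kb=SEGMENT_KB):
    """Amount of data that fills one stripe across all data disks."""
    return data_disks * segment_kb


raid5_disks = total_disks(4, 1)          # 4+1 RAID5, as built
raid6_disks = total_disks(4, 2)          # 4+2 RAID6
print(raid6_disks - raid5_disks)         # 20 additional disks
print(full_stripe_kb(4) == WRITE_KB)     # True: 512 kB fills a 4+x stripe
print(full_stripe_kb(8) == WRITE_KB)     # False: an 8+2 stripe is 1024 kB
```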

Each server has a dual-port Fibre Channel card and is connected to both RAID controller blades. If an FC cable or a controller blade fails, the server can switch to the other blade.
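
To illustrate the layout, here is a toy Python model of the LUN distribution and the failover path. The host names, the blade labels and the round-robin assignment are all hypothetical; the only facts taken from our setup are 20 LUNs, two servers, two blades and 5 LUNs per server on each blade.

```python
from collections import Counter
from itertools import product

SERVERS = ["oss1", "oss2"]   # hypothetical host names
BLADES = ["A", "B"]          # hypothetical labels for the two controller blades
LUNS = list(range(20))


def assign_luns(luns, servers, blades):
    """Spread LUNs round-robin over every (server, preferred blade) pair."""
    pairs = list(product(servers, blades))
    return {lun: pairs[lun % len(pairs)] for lun in luns}


def active_blade(preferred, failed_blades):
    """Use the preferred blade, or fall back to a surviving one."""
    if preferred not in failed_blades:
        return preferred
    survivors = [b for b in BLADES if b not in failed_blades]
    if not survivors:
        raise RuntimeError("no controller blade reachable")
    return survivors[0]


layout = assign_luns(LUNS, SERVERS, BLADES)
per_pair = Counter(layout.values())
print(all(count == 5 for count in per_pair.values()))   # True
print(active_blade("A", failed_blades={"A"}))           # B
```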

On the client-facing side, each server has a 10 Gigabit Ethernet network card - we used Myricom 10 GigE cards. Please make sure you use the PCIe x8 slot in the server and not the x4 one (be careful while ordering the server to get the right riser cards). Check the “*dma*” speed entries in the output of “ethtool -S” for the 10 GigE interface - they should show more than 1200 MB/s.
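
If you want to script this check, a minimal Python sketch could look like the following. It assumes that every statistic with “dma” in its name is a bandwidth in MB/s, which may not hold for every driver. The sample text and the statistic names in it are made up for illustration; only the “name: value” layout and the 1200 MB/s threshold come from the real tool and from the text above.

```python
import subprocess

MIN_DMA_MBS = 1200   # threshold quoted in the text


def read_nic_stats(interface):
    """Run `ethtool -S <interface>` and return its text output."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_dma_stats(ethtool_output):
    """Collect integer 'name: value' entries whose name contains 'dma'."""
    stats = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and "dma" in name.lower() and value.isdigit():
            stats[name] = int(value)
    return stats


def slow_dma_entries(stats, minimum=MIN_DMA_MBS):
    """Return the entries that do not exceed the minimum."""
    return {name: mbs for name, mbs in stats.items() if mbs <= minimum}


# Hypothetical sample output; real statistic names depend on the driver.
SAMPLE = """NIC statistics:
     rx_packets: 1024
     read_dma_bw_MBs: 1390
     write_dma_bw_MBs: 980
"""

stats = parse_dma_stats(SAMPLE)
print(slow_dma_entries(stats))   # {'write_dma_bw_MBs': 980}
```

On a real server you would call parse_dma_stats(read_nic_stats("eth2")) instead of using the sample text, where “eth2” is just a placeholder for your 10 GigE interface name.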

More details on the Lustre setup in Part 2 (tbd) - Comments welcome (moderated).

 

An introduction to Ontap GX

If you have never heard of Ontap GX, you might want to first read NetApp’s Technical Report 3468 and have a look at Mike Eisler’s excellent presentation and paper from the FAST ’07 conference. You can find them on Mike’s personal website. There are also additional answers to questions from the session on the blog.

In a nutshell, Ontap GX combines multiple file servers into a single, clustered system with a single name space. This is very similar to mounting a volume in UNIX, but in GX different volumes can reside on different servers. You can ask any server about any volume and it will either fetch your data itself or internally redirect the request to the server hosting the volume.
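
The idea can be illustrated with a toy model in Python. This is not how Ontap GX is implemented; the class names, node names and the volume location table are all made up. It only shows the behaviour described above: a client may ask any node, and the node either answers itself or forwards the request to the node hosting the volume.

```python
class Node:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.volumes = {}   # volume name -> {path: data}

    def read(self, volume, path):
        """Serve locally if this node hosts the volume, else forward internally."""
        if volume in self.volumes:
            return self.name, self.volumes[volume][path]
        owner = self.cluster.locate(volume)
        return owner.read(volume, path)


class Cluster:
    def __init__(self, node_names):
        self.nodes = {name: Node(name, self) for name in node_names}
        self.location = {}  # volume name -> name of the hosting node

    def create_volume(self, volume, node_name, files):
        self.nodes[node_name].volumes[volume] = dict(files)
        self.location[volume] = node_name

    def locate(self, volume):
        return self.nodes[self.location[volume]]


cluster = Cluster(["n1", "n2"])
cluster.create_volume("home", "n1", {"/a.txt": "hello"})
cluster.create_volume("scratch", "n2", {"/b.txt": "world"})

# The client talks to n1 in both cases; the second request is served by n2.
print(cluster.nodes["n1"].read("home", "/a.txt"))      # ('n1', 'hello')
print(cluster.nodes["n1"].read("scratch", "/b.txt"))   # ('n2', 'world')
```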

 

The concept itself is not really new, in fact it’s quite old: the Andrew File System (AFS) introduced this concept almost 15 years ago. AFS is still the only production-quality, secure filesystem with a real world-wide, global namespace. Unfortunately AFS relies on client software to be able to locate volumes, and this software has to be installed on every client. In Ontap GX, standard NFS and CIFS protocols are used, and as all the magic happens inside the cluster, clients are not aware of what is really happening behind the scenes.

What is the advantage of a single name space across multiple file servers? You can, for example, migrate volumes between physical machines without disrupting client access, and this can improve the utilization of your storage. For example, if you add new file servers and storage, existing data can be moved to the new hardware transparently. You can also mix different filer models (add them in pairs) and disk types (FC/SATA). The goal is to be able to scale both capacity and performance without adding too much administration overhead or creating new islands of storage.

In the basic scenario a single volume still resides on a single filer. If you need larger volumes or better performance, you can use a striped volume, which distributes data across multiple filers. Normal and striped volumes can be mixed in the same namespace.
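
To show what striping means in general, here is a generic Python sketch that spreads fixed-size chunks round-robin over several filers and puts them back together. The chunk size, the filer names and the round-robin placement are assumptions for illustration; I am not describing the actual on-disk layout of GX striped volumes.

```python
def stripe(data, filers, chunk_size):
    """Distribute fixed-size chunks of data round-robin across filers."""
    placement = {filer: [] for filer in filers}
    for start in range(0, len(data), chunk_size):
        filer = filers[(start // chunk_size) % len(filers)]
        placement[filer].append(data[start:start + chunk_size])
    return placement


def reassemble(placement, filers):
    """Read the chunks back in the order they were distributed."""
    chunks = []
    rows = max(len(placement[filer]) for filer in filers)
    for row in range(rows):
        for filer in filers:
            if row < len(placement[filer]):
                chunks.append(placement[filer][row])
    return b"".join(chunks)


filers = ["f1", "f2", "f3"]          # hypothetical filer names
data = bytes(range(26))
placement = stripe(data, filers, chunk_size=4)
print([len(placement[f]) for f in filers])     # [3, 2, 2] chunks per filer
print(reassemble(placement, filers) == data)   # True
```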

Of course this is not really news, as other companies like BlueArc or Isilon offer similar capabilities: at BlueArc it’s called Cluster Name Space, but there striping across multiple cluster nodes is not possible. In contrast, Isilon clusters always stripe everything across multiple nodes. With Ontap GX you have both options and can use them depending on workload type and data size.

At the moment many features of NetApp’s “Ontap Classic” (aka Ontap 7) are not available in GX (for example iSCSI, quality of service (FlexShare) and many more), but both platforms will be integrated in the future. We have two different GX clusters on site (8 nodes and 2 nodes, 70 + 12 TB), and in future posts I’ll try to show a couple of practical hands-on examples of how stuff works on GX and what is different compared to “normal” filers and AFS (yes, we also use AFS).

Of course I’d love to hear from other GX users out there - please comment :-)