Netapp’s on-board disk diagnostics

If you have ever wondered what Netapp learned from studying the causes of disk failures, here is how current filers handle disk problems. Last week a disk failed on one of our storage systems, and I have pasted a few lines from the log files to show what happened.

Sun Apr  6 13:16:30 CEST [nas: shm.threshold.mediumErrors7days:error]: shm: Disk 0d.20 has crossed the medium error threshold in a 7 day window.
Sun Apr  6 13:16:35 CEST [nas: raid.rg.diskcopy.start:notice]: /tsmiscsi/plex0/rg1: starting disk copy from 0d.20 to 0a.45
Sun Apr  6 13:17:17 CEST [nas: raid.disk.predictiveFailure:warning]: Disk /tsmiscsi/plex0/rg1/0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] S/N [VAX123] reported a predictive failure and it is prefailed; it will be copied to a spare and failed
Sun Apr  6 13:19:14 CEST [nas: shm.threshold.allMediaErrors:error]: shm: Disk 0d.20 has crossed the combination media error threshold in a 10 minute window.
Sun Apr  6 13:19:14 CEST [nas: raid.disk.maint.start:notice]: Disk /tsmiscsi/plex0/rg1/0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] S/N [VAX123] will be tested.
Sun Apr  6 13:19:14 CEST [nas: raid.rg.recons.missing:notice]: RAID group /tsmiscsi/plex0/rg1 is missing 1 disk(s).
Sun Apr  6 13:19:14 CEST [nas: raid.rg.recons.info:notice]: Spare disk 0d.45 will be used to reconstruct one missing disk in RAID group /tsmiscsi/plex0/rg1.
Sun Apr  6 13:19:14 CEST [nas: raid.rg.recons.start:notice]: /tsmiscsi/plex0/rg1: starting reconstruction, using disk 0d.45
Sun Apr  6 13:19:14 CEST [nas: raid.rg.diskcopy.aborted:notice]: /tsmiscsi/plex0/rg1: disk copy from 0d.20 to 0a.45 aborted at disk block 1224576 after 2:39.29

 

The filer tracks the number of media errors on every single disk in the system, and there are built-in thresholds for how many recoverable errors are allowed in a given period. The defective disk “0d.20” apparently went beyond one of these thresholds (medium errors in a 7-day window), so the filer decided to take it out of service. As the disk still seemed to work, the simplest way to do that was to copy the whole disk to a spare (“0a.45” in this case). The performance impact of a disk copy is much lower than that of a complete RAID rebuild.
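To make the idea of an error threshold in a time window concrete, here is a minimal Python sketch. It is not Netapp's implementation: the class name `ErrorWindow`, the window length and the limit of 3 errors are all assumptions chosen for illustration, since the real thresholds are not shown in the logs.

```python
from collections import deque


class ErrorWindow:
    """Counts errors inside a sliding time window (hypothetical sketch)."""

    def __init__(self, window_seconds, max_errors):
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self.events = deque()  # timestamps of recent errors, oldest first

    def record(self, timestamp):
        """Record one recoverable error; return True if the threshold is crossed."""
        self.events.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.max_errors


# Assumed example: more than 3 errors within 10 minutes trips the threshold.
window = ErrorWindow(window_seconds=600, max_errors=3)
results = [window.record(t) for t in (0, 10, 20, 30, 700)]
```

In this example the fourth error trips the threshold, while the error at second 700 does not, because the earlier ones have aged out of the window by then. A filer presumably keeps several such counters per disk with different windows, which would match the separate 7-day and 10-minute messages in the log.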

Unfortunately the copy was aborted because the disk crossed another error threshold (combined media errors in a 10-minute window). The filer then started a normal RAID-DP reconstruction, in which the missing data is rebuilt from the data and parity on the other disks in the RAID group. This time spare disk “0d.45” was the replacement.
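The reason a reconstruction costs more than a disk copy is that every block of the missing disk has to be recomputed from all the surviving disks. The following Python sketch shows the generic XOR parity technique for a single missing disk. It is only an illustration with made-up two-byte "disks": RAID-DP adds a second, diagonal parity on top of this, and the helper name `xor_blocks` is my own.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny example data blocks, one per data disk (illustrative values only).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)  # what the parity disk would hold for this stripe

# Pretend the second disk is gone: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Here `rebuilt` equals the lost block `data[1]`. Note that the rebuild reads every surviving disk in the stripe, whereas a disk copy only reads the one disk being replaced.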

At the same time a diagnostic process began on the problematic disk “0d.20”. I don’t know exactly what Netapp does here, but I suspect some kind of write/re-read test for every sector of the disk. Of course this takes some time…

Sun Apr  6 14:40:48 CEST [nas: raid.disk.maint.failed:error]: Disk 0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] S/N [VAX123] tests failed.
Sun Apr  6 14:40:48 CEST [nas: raid.config.disk.failed:error]: Disk 0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] S/N [VAX123] failed.
Sun Apr  6 14:40:48 CEST [nas: raid.disk.unload.done:info]: Unload of Disk 0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] S/N [VAX123] has completed successfully
Sun Apr  6 14:41:00 CEST [nas: monitor.globalStatus.nonCritical:warning]: Disk on adapter 0d, shelf 1, bay 4, failed.  

About eighty minutes later the disk test process decided that the disk really was defective and marked it as “failed”. At the same time an autosupport message (not shown here) was sent to Netapp and a replacement disk was ordered.

Sun Apr  6 15:52:17 CEST [nas: raid.rg.recons.done:notice]: /tsmiscsi/plex0/rg1: reconstruction completed for 0d.45 in 2:33:02.18

Another 70 minutes later the RAID reconstruction completed: it took about 2.5 hours for the 300 GB, 10k FC disk “0d.45” to be rebuilt with all the data of “0d.20”. During this period the RAID group was still parity protected because we use RAID-DP (double parity): even with one disk missing there is still enough redundancy to tolerate another disk failure (rare) or defective sectors on other disks (more frequent).
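The intervals quoted above can be checked directly from the log timestamps. This is a small Python sketch that takes three of the log lines shown earlier and computes the elapsed time between them. The helper names `log_time` and `elapsed` are my own, and the code assumes all events fall on the same day, as they do here.

```python
from datetime import datetime

recons_start = ("Sun Apr  6 13:19:14 CEST [nas: raid.rg.recons.start:notice]: "
                "/tsmiscsi/plex0/rg1: starting reconstruction, using disk 0d.45")
maint_failed = ("Sun Apr  6 14:40:48 CEST [nas: raid.disk.maint.failed:error]: "
                "Disk 0d.20 Shelf 1 Bay 4 [NETAPP   X276_HPYTA288F10 NA03] "
                "S/N [VAX123] tests failed.")
recons_done = ("Sun Apr  6 15:52:17 CEST [nas: raid.rg.recons.done:notice]: "
               "/tsmiscsi/plex0/rg1: reconstruction completed for 0d.45 in 2:33:02.18")


def log_time(line):
    """Extract the HH:MM:SS field (fourth whitespace-separated token)."""
    return datetime.strptime(line.split()[3], "%H:%M:%S")


def elapsed(earlier, later):
    """Time between two log lines from the same day."""
    return log_time(later) - log_time(earlier)


print("disk test:      ", elapsed(recons_start, maint_failed))
print("rest of rebuild:", elapsed(maint_failed, recons_done))
print("total rebuild:  ", elapsed(recons_start, recons_done))
```

The total differs by about a second from the 2:33:02.18 the filer reports itself, because the log timestamps only have one-second resolution.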

It seems that more companies are considering on-board disk diagnostics. It is definitely cheaper to check how bad the situation is at the customer’s site than to ship the disk back to the factory and do it there.

Comments welcome (moderated).

 
