Today I finally upgraded my home file server from Debian 6 ("Squeeze") to Debian 7 ("Wheezy"). The upgrade itself was fairly smooth, particularly because the file server now has relatively few key functions (file storage, rsnapshot backup server, DHCP/DNS), so certain shortcuts were possible. After the upgrade, I rebooted onto the new (Linux 3.2) kernel.

One of the key reasons for upgrading was to be able to install git annex, so that the file server could also keep (backup) copies of the files that I am tracking in git annex. That also went fairly smoothly, copying over gigabytes of data... until the file server locked up with disk errors.

After some investigation, including trying to boot with just a single drive, I found that running smartctl seemed to trigger the first error in the kernel messages, and that things seemed to get worse after that error. I also noticed -- as I had with the Debian Squeeze version -- that the information reported by smartctl seemed somewhat invalid, which I had hoped would be fixed by upgrading to Debian Wheezy. Apparently not.

On a hunch I tried disabling smartmontools from starting up, which is now done with:

insserv -r smartmontools
insserv -r smartd

(since insserv is used to handle Debian Dependency Based Boot, which is enabled by default in Debian Wheezy; update-rc.d ... remove no longer seemed to work.)
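To verify that the boot symlinks are actually gone -- and to re-enable the service later, once it is safe to do so -- something like this should work; a hedged sketch, assuming the stock Debian init script name:

# check that no start/kill symlinks remain for smartmontools
ls /etc/rc?.d/ | grep -i smart

# to re-enable at some later date
insserv smartmontools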

Once I shut the system down, powered the drives off, waited a minute, powered the drives on again, and then booted up again, the system seemed much more stable than it had been while I was running various smartctl -a commands trying to figure out if something was suddenly wrong with the drive...

Possible cause: SMART commands

My current theory is that the problem was being caused by smartd and/or smartctl sending SMART commands that the drive did not understand, similar to what is reported in Ubuntu Bug #1092622 (additional notes). They report the problem with Western Digital Green drives:

Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD20EARX-32PASB0

and my drives are also Western Digital drives, with a similar model number:

Device Model: WDC WD20EFRX-68AX9N0

which are Western Digital Red NAS drives (related Wiki page), and definitely have some similarities with the Caviar Green drives; there are reports for other WDC drives too.

The kernel error report looks something like this:

Dec 22 11:56:56 wheezy kernel: [   57.824032] ata3: lost interrupt (Status 0x51)
Dec 22 11:56:56 wheezy kernel: [   57.825148] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 22 11:56:56 wheezy kernel: [   57.826288] ata3.00: failed command: SMART
Dec 22 11:56:56 wheezy kernel: [   57.827436] ata3.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 22 11:56:56 wheezy kernel: [   57.827437]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 22 11:56:56 wheezy kernel: [   57.829877] ata3.00: status: { DRDY }
Dec 22 11:56:56 wheezy kernel: [   57.831120] ata3: soft resetting link
Dec 22 11:56:57 wheezy kernel: [   58.052281] ata3.00: configured for UDMA/133
Dec 22 11:56:57 wheezy kernel: [   58.064280] ata3.01: configured for UDMA/133
Dec 22 11:56:57 wheezy kernel: [   58.065575] ata3: EH complete

which may happen alone, or may then be followed by more disk errors (possibly related to how quickly the drive can be recovered and what else happens in the meantime -- or what else was outstanding at the time).

In Ubuntu the problem seemed worse than in Debian, since udisks2 seems to run automatically and uses libatasmart, which was sending commands that the drive did not understand (AFAICT SMART commands that were previously legal, but were removed from more recent SMART versions). On my Debian 7 system, so far, it appears that running smartctl manually was what triggered it, which at least I could easily avoid -- and I could disable smartd in case its periodic polling caused similar problems.

Upstream, libatasmart bug #63883 was reported for the issue, but did not seem to get any response. However for Ubuntu it is reported as fixed in Ubuntu 14.10 if not earlier.

Further investigation turns up smartctl error reports, which relate to the old SMART commands being removed from newer devices. It appears to be SmartMonTools Ticket #89 (on their new Trac server rather than the old SourceForge one). But since that indicates the problem was fixed in SmartMonTools 5.41 (included in Debian Squeeze), it is not clear if it is the same issue.

There are hangs reported with smartctl -a on Western Digital drives, and issues reported when running smartctl commands while the drive is under load on Western Digital drives (which they speculate is a firmware bug in the handling of SMART commands -- possibly "once were legal, but now are not" SMART commands).

And there is this report of a WDC WD30EFRX-68EUZN0 with similar SMART errors and lockups from a year ago, which tends to confirm the issue is related to Western Digital drives and SMART commands. In particular, that one reports the

Warning! SMART Attribute Data Structure error: invalid SMART checksum.

output that I also saw when I was having this problem. (Someone else reported that similar errors went away after they successfully ran a self-test, which may be related to invalid checksums on empty self-test logs -- and may or may not be related to my issue with the "Attribute Data Structure"s.)

I also found RedHat Bug #549981, which has a long discussion of SATA:

exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

lockups, dating from 2009/2010, along with the note that "the exception often occurs when doing SMART commands" and is relatively easy to trigger in that situation, which matches my symptoms.

The RedHat bug links to kernel.org bug #14831, which has its own long discussion of the issue -- which seems to suggest that SMART commands make it much more likely to happen, but may not be the only cause. There is additional discussion on AskUbuntu which confirms that disabling smartd makes it happen much less frequently, but does not seem to be a complete fix -- although as seen above, in the case of Ubuntu there are other sources of SMART requests, such as udisks-probe-ata-smart (see also Ubuntu bug #445852 and Ubuntu bug #574462).

For the record, apparently using smartctl -x ... instead of smartctl -a ... will try the new logs ("xerror") first, before the old logs ("error"), and thus sidestep at least some of these problems. In particular, the old "error" log cannot handle 48-bit LBA drives, whereas the "xerror" log can; "smartctl -x" will try to fetch the "xerror" log before the "error" log. Possibly the Western Digital drive firmware does not recover well from being asked for the older "error" information.
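For completeness, the newer log can also be requested on its own, which is one of the individual items worth testing -- a hedged example, with an illustrative device name:

# fetch only the Extended Comprehensive Error log ("xerror"),
# skipping the old Summary Error log entirely
smartctl -l xerror /dev/sda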

However a test run of smartctl -x showed (a) that it cannot be used with "-d ata" (the old ATA driver path, which I had been using for historical reasons, but which does not support LBA-48), and (b) that it will still trigger the errors resulting in a lockup :-(

Also FTR, it appears smartd.conf defaults to -a, which periodically pulls a bunch of SMART items -- including -l error -- off the drive (if no other specific list is given), and that is potentially the trigger. In which case it may be possible to resume running smartd, but explicitly tell it to fetch specific things including "-l xerror", and avoid "-l error". But it would take some careful testing of individual items to figure out what can and cannot safely be fetched. Which I'm not going to try to do for a while.
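When I do get to it, something like this smartd.conf is where I would start -- a hedged sketch, assuming a smartd new enough to understand the "-l xerror" directive, with illustrative device names, and definitely not a tested configuration:

# /etc/smartd.conf: approximate -a, but fetch xerror instead of error
# (-H health check, -f usage attribute failures, -t attribute tracking,
#  -C/-U current-pending / offline-uncorrectable sector counts)
/dev/sda -H -f -t -l xerror -l selftest -C 197 -U 198
/dev/sdb -H -f -t -l xerror -l selftest -C 197 -U 198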

So for now, smartd is staying disabled. The main thing I lose is reports of developing drive issues, and periodic scheduled self-tests. Which would be nice to have, but not if they're causing more problems than they are finding...

Of note, hdparm -J may be necessary on Western Digital Green drives to at least decrease the frequency of idle drive parking -- or alternatively the idle3-tools utility can be used. On my WD Red drives I've long had:

/dev/sda {
    spindown_time = 0
}

/dev/sdb {
    spindown_time = 0
}

to clear the (vendor-generic) spindown time, which previously seemed to be causing issues with these drives (possibly just because they took long enough to spin up that the RAID set was throwing them out). But "hdparm -J" is for a Western Digital specific "idle3" timer, which defaults to 8 seconds; that particular value is saved into the drive, and requires a (drive) power cycle for the update to take effect.
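For reference, the idle3 timer can be inspected and changed with either tool -- a hedged sketch with illustrative device names (note the hdparm man page flags -J as experimental, so idle3-tools may be the safer choice for changing it):

# query the current WD idle3 timer via hdparm
hdparm -J /dev/sda

# or query / disable it with idle3-tools
idle3ctl -g /dev/sda
idle3ctl -d /dev/sda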

One might think that Red drives should park themselves less often than Green drives, by default. But apparently a batch of more recent Red drives also have the idle3 timer set to 8 seconds. Fortunately my drives are the older -68AX9N0 variant, which seems to avoid this issue, but any replacement drives -- and any updated drive firmware -- may well be affected. (There are some notes that late-2014 drives are supplied with a 300 second idle3 timer by default, which seems more likely to be robust.)

(It appears Western Digital also have a Linux tool to configure this idle3 timeout on Red drives. However it appears there is no drive firmware update, since none is listed in the drive downloads section.)

Recovery from disk "failures"

Unfortunately by the time I had figured all of this out, some of the Linux Software RAID (md) arrays had become out of sync, particularly because I had rebooted a couple of times with only one (eSATA) drive powered on. In this situation it is sometimes possible to --re-add a drive that was previously part of the array, for a quick catch-up sync. You can tell if this is possible with:

sudo mdadm --examine /dev/sd[ab]1 | egrep 'Event|^/dev/sd'

if the "Events" counts are fairly close then there is a chance that it will work to do:

sudo mdadm --manage /dev/md21 --re-add /dev/sdb1

for a quick "catch up" sync, but depending on the changes that have happened in between, it may report:

mdadm: --re-add for /dev/sdb1 to /dev/md21 is not possible

In which case it is necessary to do a full resync with:

sudo mdadm --manage /dev/md21 --add /dev/sdb1

This will be particularly true if the two devices have received different events -- ie, they were both run separately at different times. Which I think happened in my case due to different drives being ejected from the RAID array by the (apparently SMART induced) drive lockups.
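Putting those two steps together, a small shell sketch of the "try --re-add first, fall back to --add" flow (array and partition names are illustrative only):

#!/bin/sh
# try the quick catch-up sync first; fall back to a full resync
MD=/dev/md21
PART=/dev/sdb1
if ! mdadm --manage "$MD" --re-add "$PART"; then
    echo "re-add not possible; falling back to full --add resync" >&2
    mdadm --manage "$MD" --add "$PART"
fi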

The difference with --add is that you are manually picking a winner of the different sets of changes that happened; whereas I suspect that --re-add just rolls forward missed changes, assuming one got changes and the other did not.

You can request that mdadm trigger an array verification with:

sudo /usr/share/mdadm/checkarray /dev/md21

and then look in the kernel log messages for something like:

[ 1518.529148] md: data-check of RAID array md21
[ 1518.529200] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[ 1518.529257] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[ 1518.529345] md: using 128k window, over a total of 248896k.
[ 1521.991529] md: md21: data-check done.

to check that it went okay. For non-trivially sized arrays you can also watch the progress in /proc/mdstat (for trivially sized arrays it will be done before you can even look there!).
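To keep an eye on a longer run, the standard approach works (the refresh interval is arbitrary):

# re-display /proc/mdstat every 2 seconds until interrupted
watch -n 2 cat /proc/mdstat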

Fortunately I usually divide larger disks up into several smaller RAID sets (and then reassemble them into a larger block with LVM), so that recovery runs can be done in smaller chunks. Several of these smaller RAID arrays did not have any changes, due to "(auto read-only)" -- ie, there were no changes to them while I was testing with a single drive. But there was still a non-trivial amount of resyncing to be done, and --re-add did not work for any of them :-(
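For what it's worth, a hedged sketch of building that sort of layout (device names, array numbers, and volume group names are illustrative only, not my actual configuration):

# pair up matching partitions into small RAID-1 arrays...
mdadm --create /dev/md21 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md22 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# ...then aggregate the small arrays into one large volume with LVM
pvcreate /dev/md21 /dev/md22
vgcreate filevg /dev/md21 /dev/md22
lvcreate -l 100%FREE -n files filevg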