Today I finally upgraded my home file server from Debian 6 ("Squeeze") to Debian 7 ("Wheezy"). The upgrade itself was fairly smooth, particularly because the file server now has relatively few key functions (file storage, rsnapshot backup server, DHCP/DNS), so certain shortcuts are possible in the upgrade. After the upgrade, I rebooted onto the new (Linux 3.2) kernel.
One of the key reasons for upgrading was to be able to install git annex, so that the file server could also keep (backup) copies of the files that I am tracking in git annex. That also went fairly smoothly, copying over gigabytes of data... until the file server locked up with disk errors.
After some investigation, including trying to boot with just a single drive, I found that running smartctl seemed to trigger the first error in the kernel messages, and that things seemed to get worse after that error. I also noticed -- as I had with the Debian Squeeze version -- that the information reported by smartctl seemed somewhat invalid, which I had hoped would be fixed by upgrading to Debian Wheezy. Apparently not.
On a hunch I tried disabling smartmontools from starting up, which is now done with:
insserv -r smartmontools
insserv -r smartd
(since insserv is used to handle Debian Dependency Based Boot, which is newly enabled by default in Debian Wheezy; update-rc.d ... remove no longer seems to work.)
Once I shut the system down, powered the drives off, waited a minute, powered the drives on again, and then booted up again, the system seemed much more stable than it had been when I was running various smartctl -a commands to try to figure out if there was something wrong with the drive all of a sudden...
Possible cause: SMART commands
My current theory is that the problem was being caused by smartd and/or smartctl sending SMART commands that the drive did not understand, similar to what is reported in Ubuntu Bug #1092622 (additional notes). They report the problem with Western Digital Green drives:
Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD20EARX-32PASB0
and my drives are also Western Digital drives, with a similar model number:
Device Model: WDC WD20EFRX-68AX9N0
which are Western Digital Red NAS drives (related Wiki page), and definitely have some similarities with the Caviar Green drives; there are reports for other WDC drives too.
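As an aside, checking which model you have should not need the SMART logs at all, since the identity section comes from ATA IDENTIFY rather than a SMART log read. A sketch, using captured sample output in place of a live drive (the serial and firmware values below are placeholders, and the Model Family string is approximate):

```shell
#!/bin/sh
# Extract the model lines from `smartctl -i` style output.
# On a live system you would instead run:  sudo smartctl -i /dev/sda
# The here-document is approximate sample output for my drives, with
# placeholder serial/firmware values.
sample_identity() {
    cat <<'EOF'
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-XXXXXXXXXXXX
Firmware Version: XX.XXXXX
EOF
}

# Pull out just the model lines, in the format quoted in the bug reports
sample_identity | grep -E '^(Model Family|Device Model):'
```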
The kernel error report looks something like this:
Dec 22 11:56:56 wheezy kernel: [ 57.824032] ata3: lost interrupt (Status 0x51)
Dec 22 11:56:56 wheezy kernel: [ 57.825148] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 22 11:56:56 wheezy kernel: [ 57.826288] ata3.00: failed command: SMART
Dec 22 11:56:56 wheezy kernel: [ 57.827436] ata3.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 22 11:56:56 wheezy kernel: [ 57.827437] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 22 11:56:56 wheezy kernel: [ 57.829877] ata3.00: status: { DRDY }
Dec 22 11:56:56 wheezy kernel: [ 57.831120] ata3: soft resetting link
Dec 22 11:56:57 wheezy kernel: [ 58.052281] ata3.00: configured for UDMA/133
Dec 22 11:56:57 wheezy kernel: [ 58.064280] ata3.01: configured for UDMA/133
Dec 22 11:56:57 wheezy kernel: [ 58.065575] ata3: EH complete
which may happen alone, or may then be followed by more disk errors (possibly related to how quickly the drive can be recovered and what else happens in the meantime -- or what else was outstanding at the time).
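If you want to check whether your own kernel log shows the same pattern, the distinctive signature is the "lost interrupt" line paired with "failed command: SMART". A rough check along these lines (the functions take the log file as an argument; /var/log/kern.log is the usual Debian location, adjust if your syslog setup differs):

```shell
#!/bin/sh
# Count SMART-triggered ATA lockup signatures in a kernel log file.
smart_lockup_count() {
    grep -c 'failed command: SMART' "$1"
}

# Show the matching lines, including the "lost interrupt" context
smart_lockup_context() {
    grep -E 'lost interrupt|failed command: SMART' "$1"
}

# Typical usage on Debian:
#   smart_lockup_count /var/log/kern.log
#   smart_lockup_context /var/log/kern.log
```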
In Ubuntu it seemed worse than in Debian, since udisks2 seems to run automatically and uses libatasmart, which was sending commands that the drive did not understand (AFAICT SMART commands that were previously legal, but were removed from more recent SMART versions).
On my Debian 7 system, so far, it appears running smartctl manually was triggering it, which at least I could easily avoid -- and I could disable smartd in case it periodically caused similar problems.
Upstream, libatasmart bug #63883 was reported for the issue, but did not seem to get any response. However for Ubuntu it is reported as fixed in Ubuntu 14.10 if not earlier.
Further investigation turns up smartctl error reports, which relate to the old SMART commands being removed from newer devices. It appears to be SmartMonTools Ticket #89 (on their new Trac server rather than the old SourceForge one). But since that indicates the problem was fixed in SmartMonTools 5.41 (included in Debian Squeeze) it is not clear if it is the same issue.
There are hangs reported with smartctl -a on Western Digital drives, and issues running smartctl commands while the drive is under load on Western Digital drives (which they speculate is a firmware bug in the handling of SMART commands -- possibly "once were legal, but now are not" SMART commands).
And there is this WDC WD30EFRX-68EUZN0 with similar SMART errors and lockups from a year ago, which tends to confirm the issue is related to Western Digital and SMART commands. In particular that one reports the
Warning! SMART Attribute Data Structure error: invalid SMART checksum.
output that I also saw when I was having this problem (someone else reported that similar errors went away when they successfully ran a selftest, which may be related to invalid checksums on empty selftest logs -- and may or may not be related to my issue with the "Attribute Data Structure"s).
I also found RedHat Bug #549981, which has a long discussion of SATA:
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
lockups, dating from 2009/2010, along with the note that "the exception often occurs when doing SMART commands" and is relatively easy to trigger in that situation, which matches my symptoms.
The RedHat bug links to kernel.org bug #14831, which has its own long discussion of the issue -- which seems to suggest that SMART commands make it much more likely to happen, but may not be the only cause. There is additional discussion on AskUbuntu which confirms disabling smartd makes it happen much less frequently, but does not seem to be a complete fix -- although as seen above in the case of Ubuntu there are other causes of SMART requests, such as udisks-probe-ata-smart (see also Ubuntu bug #445852 and Ubuntu bug #574462.)
For the record, apparently using smartctl -x ... instead of smartctl -a ... will try the new logs ("xerror") first, before the old logs ("error"), and thus sidestep at least some of these problems. In particular the old "error" log cannot handle 48-bit LBA drives, whereas the "xerror" log can handle them; "smartctl -x" will try to fetch the "xerror" log before the "error" log. Possibly the Western Digital drive firmware does not recover well from being asked for the older "error" information.
However a test running smartctl -x showed (a) that it cannot be run with "-d ata" (the old ATA driver path which I had been using for historical reasons, but which does not support LBA-48), and (b) it will still trigger the errors resulting in a lockup :-(
Also FTR, it appears smartd.conf defaults to -a, which pulls a bunch of SMART items including -l error off the drive periodically (if no other specific list is given), and that is potentially the trigger. In which case it may be possible to resume running smartd, but explicitly tell it to fetch specific things including "-l xerror", and avoid "-l error". But it would take some careful testing of individual things to figure out what can and cannot safely be fetched. Which I'm not going to try to do for a while.
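For illustration, my understanding is that -a expands to roughly "-H -f -t -l error -l selftest -C 197 -U 198", so an explicit smartd.conf entry substituting the newer log might look something like this (untested, and assuming this build of smartd accepts -l xerror):

```
# /etc/smartd.conf sketch (untested): explicit monitoring list instead
# of the -a default, with "-l xerror" in place of the problematic
# "-l error"
/dev/sda -H -f -t -l xerror -l selftest -C 197 -U 198
/dev/sdb -H -f -t -l xerror -l selftest -C 197 -U 198
```

Even then, each directive would need individual testing against these drives before trusting it.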
So for now, smartd
is staying disabled. The main thing I lose is
reports of developing drive issues, and periodic scheduled self-tests.
Which would be nice to have, but not if they're causing more problems
than they are finding...
Of note, hdparm -J may be necessary on Western Digital Green drives to at least decrease the frequency of idle drive parking -- or use the idle3-tools utility. On my WD Red drives I've long had:
/dev/sda {
spindown_time = 0
}
/dev/sdb {
spindown_time = 0
}
to clear the (vendor generic) spindown time, which previously seemed to be causing issues with these drives (possibly just because they took long enough to spin up that the RAID set was throwing them out). But "hdparm -J" is for a Western Digital specific "idle3" timer, which defaults to 8 seconds; that particular value seems to be saved into the drive, and requires a (drive) power cycle for the update to take effect.
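For what it's worth, my understanding (from reading about idle3-tools) is that the raw idle3 value is stored in 0.1 second units up to 128, and in 30 second units above that -- which fits both the 8 second default (raw 80) and the 300 second value on later drives (raw 138). A small conversion helper, under that assumption:

```shell
#!/bin/sh
# Convert a raw WD idle3 timer value to seconds.
# Assumption (not verified against the drive spec): raw values up to
# 128 are in 0.1 second units; values above 128 are in 30 second units.
idle3_raw_to_seconds() {
    raw="$1"
    if [ "$raw" -le 128 ]; then
        # POSIX shell has no floats; print tenths as a decimal
        echo "$((raw / 10)).$((raw % 10))"
    else
        echo "$(( (raw - 128) * 30 ))"
    fi
}
```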
One might think that Red drives should park themselves less often than Green drives, by default. But apparently a batch of more recent Red drives also have the idle3 timer set to 8 seconds.
Fortunately my drives are the older -68AX9N0 variant, which seems to avoid this issue, but any replacement drives -- and any updated drive firmware -- may have this issue. (There are some notes that late 2014 drives are supplied with a 300 second idle3 timer by default, which seems more likely to be robust.)
(It appears Western Digital also have a Linux tool to configure this idle3 timeout on Red drives. However it appears there is no drive firmware update, since none is listed in the drive downloads section.)
Recovery from disk "failures"
Unfortunately by the point I figured all of this out, some of the Linux Software RAID (md) arrays had become out of sync, particularly because I had rebooted a couple of times with only one (eSATA) drive powered on. In this situation it is sometimes possible to --re-add a drive that was previously part of the array, for a quick catch-up sync. You can tell if this is possible with:
sudo mdadm --examine /dev/sd[ab]1 | egrep 'Event|^/dev/sd'
If the "Events" counts are fairly close then there is a chance that it will work to do:
sudo mdadm --manage /dev/md21 --re-add /dev/sdb1
for a quick "catch up" sync, but depending on the changes that have happened in between, it may report:
mdadm: --re-add for /dev/sdb1 to /dev/md21 is not possible
In which case it is necessary to do a full resync with:
sudo mdadm --manage /dev/md21 --add /dev/sdb1
This will be particularly true if the two devices have received different events -- ie, they were both run separately at different times. Which I think happened in my case due to different drives being ejected from the RAID array by the (apparently SMART induced) drive lockups.
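The Events comparison can be scripted. A sketch, using canned `mdadm --examine` output in place of the live command shown above (the counter values in the sample are made up):

```shell
#!/bin/sh
# Compute the gap between the Events counters of two array members,
# as a rough guide to whether "mdadm --re-add" is worth trying.
# The here-document stands in for:  sudo mdadm --examine /dev/sd[ab]1
sample_examine() {
    cat <<'EOF'
/dev/sda1:
         Events : 5216
/dev/sdb1:
         Events : 5198
EOF
}

# Extract the two counters and print the difference between them
set -- $(sample_examine | awk '/Events/ {print $3}')
echo "Event count gap: $(( $1 - $2 ))"
```

A small gap suggests --re-add has a chance; a large one means planning for a full --add resync.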
The difference with --add is that you are manually picking a winner of the different sets of changes that happened; whereas I suspect that --re-add just rolls forward missed changes, assuming one got changes and the other did not.
You can request that mdadm trigger an array verification with:
sudo /usr/share/mdadm/checkarray /dev/md21
and then look in the kernel log messages for something like:
[ 1518.529148] md: data-check of RAID array md21
[ 1518.529200] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 1518.529257] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[ 1518.529345] md: using 128k window, over a total of 248896k.
[ 1521.991529] md: md21: data-check done.
to check that it went okay. For non-trivially sized arrays you can also watch the progress in /proc/mdstat (for trivially sized arrays it will be done before you can even look there!).
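During a longer resync, the progress line is easy to pull out of /proc/mdstat. A sketch, with sample file content standing in for the real thing (the percentages and speeds here are made up):

```shell
#!/bin/sh
# Show just the resync/recovery progress lines from mdstat-style output.
# On a real system:  grep -E 'resync|recovery' /proc/mdstat
# (or `watch cat /proc/mdstat` to follow it live)
sample_mdstat() {
    cat <<'EOF'
md21 : active raid1 sdb1[1] sda1[0]
      248896 blocks [2/2] [UU]
      [==>..................]  resync = 12.6% (31488/248896) finish=0.3min speed=10496K/sec
EOF
}

sample_mdstat | grep -E 'resync|recovery'
```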
Fortunately I usually divide larger disks up into several smaller RAID sets (and then reassemble them into a larger block with LVM), so that recovery runs can be done in smaller chunks. Several of these smaller RAID arrays did not have any changes, due to "(auto read-only)" -- ie, there were no changes for them while I was testing with a single drive. But there was still a non-trivial amount of resyncing to be done, and re-add did not work for any of them :-(