Some weeks ago I experienced something a bit weird on my FreeNAS system. I received an email at 02:40 AM with this content:
The volume Disk1 (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The kernel log messages, which also arrived by email about half an hour later, showed that the first drive had been disconnected:
ahcich0: Timeout on slot 26 port 0
ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0000fa17
(ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST31000333AS SD35> s/n 6TE03QP9 detached
GEOM_ELI: Device ada0p1.eli destroyed.
GEOM_ELI: Detached ada0p1.eli on last close.
(ada0:ahcich0:0:0:0): Periph destroyed
Not a problem, I thought: luckily this is a mirrored volume, so I will just put in another drive and I won’t lose anything (hoping that nothing goes wrong during the resilvering process, of course). First, though, I took a backup of my data and, while doing so, I received a third email:
Device: /dev/ada0, 10 Currently unreadable (pending) sectors
Device: /dev/ada0, 10 Offline uncorrectable sectors
This really looked like the drive’s life was coming to an end. Once the backup had finished, however, I decided to play around with the system a bit, as I had never seen a drive fail in FreeNAS before. So I forced a scrub and ran both a short and a long SMART test on the drive. Boy, was I surprised.
The result? 0 (zero) unreadable or offline uncorrectable sectors on that drive. That looked weird, but the drive is still working to this day, about three months after I received that first email.
I’ll keep an eye on this, of course, but it goes to show that these error messages do not always mean the drive is going to fail immediately. Keep a new drive handy when this happens, though: these things are not going to last forever anyway.
Update 18 January 2016
An update from late in the evening of January 18th, 2016, the day after writing this blog post :D
Device: /dev/ada0, 7 Offline uncorrectable sectors
Device: /dev/ada0, 7 Currently unreadable (pending) sectors
So now the uncorrectable and unreadable sectors are back! But there are still fewer of them than when this problem first appeared.
At this point this has become a personal challenge: I know this drive is going to fail sometime, but I have nothing on my FreeNAS system that I have not backed up somewhere else, so I am just going to leave things as they are and see what new developments come out of this magical drive.
Update January 2017
The situation has remained stable for a while, but this morning I received this email:
Device: /dev/ada0, 25 Currently unreadable (pending) sectors
Device: /dev/ada0, 25 Offline uncorrectable sectors
So yeah, now the situation has gotten worse for real, even though it took my drive almost a year to get there. Time for a replacement drive!
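For anyone who would rather track these numbers than eyeball the emails, here is a minimal sketch in Python that pulls the sector counts out of the three notifications quoted in this post and shows how they moved from one alert to the next. It assumes the emails always use exactly the wording shown above; the names `SMARTD_LINE`, `parse_alert` and `trend` are made up for this example and are not part of FreeNAS or smartmontools.

```python
import re

# Matches notification lines in the form quoted in this post, e.g.:
#   Device: /dev/ada0, 10 Currently unreadable (pending) sectors
# Assumption: the wording of the emails never changes.
SMARTD_LINE = re.compile(
    r"Device:\s+(?P<device>[^,\s]+),\s+(?P<count>\d+)\s+"
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable)\s+sectors"
)


def parse_alert(text):
    """Return {(device, kind): count} for every matching line in text."""
    return {
        (m.group("device"), m.group("kind")): int(m.group("count"))
        for m in SMARTD_LINE.finditer(text)
    }


def trend(readings):
    """Compare each reading with the previous one: 'up', 'down' or 'same'."""
    labels = []
    for prev, cur in zip(readings, readings[1:]):
        labels.append("up" if cur > prev else "down" if cur < prev else "same")
    return labels


# The three alerts quoted in this post, in the order they arrived.
alerts = [
    "Device: /dev/ada0, 10 Currently unreadable (pending) sectors\n"
    "Device: /dev/ada0, 10 Offline uncorrectable sectors\n",
    "Device: /dev/ada0, 7 Offline uncorrectable sectors\n"
    "Device: /dev/ada0, 7 Currently unreadable (pending) sectors\n",
    "Device: /dev/ada0, 25 Currently unreadable (pending) sectors\n"
    "Device: /dev/ada0, 25 Offline uncorrectable sectors\n",
]

KIND = "Currently unreadable (pending)"
pending = [parse_alert(a)[("/dev/ada0", KIND)] for a in alerts]

print(pending)         # [10, 7, 25]
print(trend(pending))  # ['down', 'up']
```

The dip from 10 to 7 followed by the jump to 25 is the pattern described in this post: a lower count in one alert did not mean the drive had recovered for good.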