How To Deal With Checksum Errors In ZFS – FreeNAS

Some days ago I received an email from one of my FreeNAS boxes letting me know that a pool scrub was starting:

starting scrub of pool 'Disk1'

About 3 hours after this, I received another email with an excerpt from the kernel logs:

freenas.local kernel log messages:
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Error 5, Retries exhausted

-- End of security output --

This was immediately followed by a third email, with a status summary of my pool:

Checking status of zfs pools:
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
Disk1   928G   494G   434G    53%  1.00x  ONLINE  /mnt

 pool: Disk1
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
 scan: scrub repaired 96K in 1h38m with 0 errors on Sun Aug  9 01:38:43 2015
config:

	NAME                                            STATE     READ WRITE CKSUM
	Disk1                                           ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/4f54a386-8f88-11e4-a049-009c02975356  ONLINE       0     0     0
	    gptid/62701945-eb48-11e4-bdc4-009c02975356  ONLINE       0     0     2

errors: No known data errors

-- End of daily output --

In short, one of the two drives in the mirror reported 2 checksum errors, and these are most likely connected to the read errors on ada1 shown in the kernel log.
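
If you want to pick out the affected device automatically instead of reading the table by eye, here is a minimal Python sketch. It assumes you have already captured the text of zpool status (here, the config section from the email above is pasted in as a string). The function name devices_with_errors is my own, not part of any ZFS tooling, and the parser only understands plain integer counters, so abbreviated values would be skipped.

```python
import re

# Config section copied from the status email above.
SAMPLE_STATUS = """\
config:

    NAME                                            STATE     READ WRITE CKSUM
    Disk1                                           ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/4f54a386-8f88-11e4-a049-009c02975356  ONLINE       0     0     0
        gptid/62701945-eb48-11e4-bdc4-009c02975356  ONLINE       0     0     2

errors: No known data errors
"""

# One config row: device name, state, then the READ, WRITE and CKSUM counters.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header row, blank lines and other sections
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            found[name] = counts
    return found
```

Calling devices_with_errors(SAMPLE_STATUS) returns only the second gptid entry, with counters (0, 0, 2), which matches what the table shows.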

Often this means the drive is failing, but it could also be a bad SATA cable, or something temporary. In my case, I was told that there had been a power outage recently. So output like this doesn't necessarily mean you need to go out and replace the disk straight away.
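
The kernel log itself gives a hint about the cause. As a rough sketch, the following Python snippet summarizes a log excerpt per device: which error flags were reported, how many times the command was retried, and whether the retries were exhausted. The sample is a shortened copy of the log above, and the function name summarize_cam_log is my own invention. As far as I know, UNC stands for an uncorrectable read from the media, while cable trouble more often shows up as interface CRC errors (ICRC), but treat that as a rule of thumb rather than a diagnosis.

```python
import re
from collections import Counter

# Shortened excerpt of the kernel log above (one attempt plus the final line).
SAMPLE_LOG = """\
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 80 df 4d 40 46 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ahcich1:0:0:0): RES: 51 40 98 df 4d 40 46 00 00 c0 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): Error 5, Retries exhausted
"""

# "(ada1:ahcich1:0:0:0): message" -> device name and message text.
LINE = re.compile(r"^\((\w+):[^)]*\):\s*(.*)$")
# "ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )" -> the two flag lists.
ATA = re.compile(r"ATA status: \w+ \(([^)]*)\), error: \w+ \(([^)]*)\)")


def summarize_cam_log(log_text):
    """Return per-device error flags, retry count and an 'exhausted' marker."""
    summary = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # not a CAM device message
        device, message = match.groups()
        entry = summary.setdefault(
            device, {"error_flags": Counter(), "retries": 0, "exhausted": False}
        )
        ata = ATA.search(message)
        if ata:
            # Count the names printed in the error register, e.g. "UNC".
            entry["error_flags"].update(ata.group(2).split())
        elif message.startswith("Retrying command"):
            entry["retries"] += 1
        elif "Retries exhausted" in message:
            entry["exhausted"] = True
    return summary
```

For the excerpt, summarize_cam_log(SAMPLE_LOG)["ada1"] reports one UNC error, one retry, and exhausted retries.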

From the error message page linked in the status output (ZFS-8000-9P):

Find the device with a non-zero error count for READ, WRITE, or CKSUM. This indicates that the device has experienced a read I/O error, write I/O error, or checksum validation error. Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data.

If these errors persist over a period of time, ZFS may determine the device is faulty and mark it as such. However, these error counts may or may not indicate that the device is unusable. It depends on how the errors were caused, which the administrator can determine in advance of any ZFS diagnosis. For example, the following cases will all produce errors that do not indicate potential device failure:

  • A network attached device lost connectivity but has now recovered
  • A device suffered from a bit flip, an expected event over long periods of time
  • An administrator accidentally wrote over a portion of the disk using another program

In these cases, the presence of errors does not indicate that the device is likely to fail in the future, and therefore does not need to be replaced.

However, there are some things you can do to find out whether this is indeed a hard drive issue or not:

  1. Clear the error count:
    zpool clear pool_name

    This will reset all the read, write and checksum error counters.

  2. You can either wait for the next scheduled scrub or, if you are impatient, force one immediately:
    zpool scrub pool_name

  3. Once the scrub has finished, check the output of
    zpool status pool_name

    If the error counters are still 0, the issue you experienced was likely a temporary one. If no errors appear over the next few days, your hard drive is probably fine for now.

  4. If you still see errors (or if you simply want to make sure everything is ok), you can check the SMART status of the drive. To run a short SMART self-test, type
    smartctl -t short /dev/ada0

    To run a long SMART self-test, type
    smartctl -t long /dev/ada0

    instead. Of course, replace /dev/ada0 with your drive identifier (in the kernel log above, the errors were reported on ada1). Note: the test runs in the background on the drive itself. A short test takes a minute or two, while a long test will take around 4 hours on a drive like this one, and longer on larger drives.

  5. Once the test has finished, check the output of
    smartctl -a /dev/ada0

    for errors.

  6. At this point, if you don’t see any error messages, your drives might be ok. You can force a scrub again with
    zpool scrub pool_name

    and you should be good.
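
When reading the smartctl output in step 5, the attribute table is usually the part worth a closer look. Below is a small Python sketch that flags a few attributes commonly associated with bad sectors when their raw value is above zero. Please note the assumptions: the sample table is made up for illustration (it is not from my drive), the attribute names vary between vendors, and the choice of which attributes to watch is mine, not an official rule.

```python
# Hypothetical attribute table in the layout printed by smartctl.
# The values are invented for illustration only.
SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4312
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

# Assumption: these names are common on many drives, but not universal.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def suspicious_attributes(table_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged
```

With the invented sample, suspicious_attributes(SAMPLE_ATTRIBUTES) flags only Current_Pending_Sector. An empty result does not prove the drive is healthy, so keep watching the pool as well.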

This is all to say that, even though this type of message often points to a failing drive, it doesn't necessarily have to be a hardware issue. Go through this list before deciding to buy a new drive, and keep monitoring your system for a while after the first occurrence to make sure it was just a temporary issue.


© 2017 Daniel's TechBlog
