OpenBSD HDD health status monitoring

Many articles about hard drive health checks and failures have surfaced over the past couple of months. This paper is an attempt to gather the information needed to check and monitor the health status of our hard drives, in the hope of detecting potential failures before they show themselves in a more catastrophic manner.

Toolset

Thankfully OpenBSD comes with utilities that can achieve what we want without installing much.

The tool that we will use extensively for the tasks is atactl(8), a program to manipulate ATA (IDE) devices.

Furthermore, the SMART Monitoring Tools package (smartmontools-x.y.tgz) is available for OpenBSD if you want to keep your checking scripts consistent across platforms.

Depending on your hard drive vendor and how closely it follows the SMART specification, it might be a good idea to search for the hardware specs and have them at hand before you start. This will prove particularly useful when reading the attributes the device exposes.

Status codes

The following SMART metrics seem to indicate conditions where the disk is about to die (values taken from the Backblaze article):

SMART 5: Reallocated Sectors Count. Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks that sector as “reallocated” and transfers data to a special reserved area (spare area). This process is also known as remapping, and reallocated sectors are called “remaps”. The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This allows a drive with bad sectors to continue operation; however, a drive which has had any reallocations at all is significantly more likely to fail in the near future. While primarily used as a metric of the life expectancy of the drive, this number also affects performance. As the count of reallocated sectors increases, the read/write speed tends to become worse because the drive head is forced to seek to the reserved area whenever a remap is accessed. If sequential access speed is critical, the remapped sectors can be manually marked as bad blocks in the file system in order to prevent their use.
SMART 187: Reported Uncorrectable Errors. The count of errors that could not be recovered using hardware ECC (see SMART 195).
SMART 188: Command Timeout. The count of operations aborted due to HDD timeout. Normally this attribute value should be zero; if the value is far above zero, there is most likely a serious problem with the power supply or an oxidized data cable.
SMART 197: Current Pending Sector Count. Count of “unstable” sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it’s written. However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector, and if the write operation is successful the sector will be marked good (in this case, the “Reallocation Event Count” (0xC4) will not be increased). This is a serious shortcoming: if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, the drive will never remap these problem sectors.
SMART 198: Uncorrectable Sector Count (also called Offline Uncorrectable or Off-Line Scan Uncorrectable Sector Count). The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

Of the attributes listed above, the disks on our server seem to report only SMART 5, 197 and 198.
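
If smartmontools is installed, the attributes listed above can be picked out of the attribute table with a small filter. The following is only a sketch, assuming the usual smartctl -A table layout (attribute ID in the first column, raw value in the tenth); the function name critical_attrs is made up for this example.

```shell
#!/bin/sh
# Print the watched attributes (5, 187, 188, 197, 198) whose raw value
# is above zero. Reads a "smartctl -A" style table on standard input.
# Assumption: column 1 is the attribute ID and column 10 the raw value.
critical_attrs() {
    awk '
        # Data rows start with a numeric attribute ID.
        $1 ~ /^[0-9]+$/ && ($1 == 5 || $1 == 187 || $1 == 188 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0)
                printf "SMART %d %s raw=%s\n", $1, $2, $10
        }
    '
}

# Usage (device name as used elsewhere in this document):
#   smartctl -A /dev/sd0c | critical_attrs
```

No output means that none of the watched attributes has a non-zero raw value, or that the drive does not report them at all.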

Read hard drive attributes

atactl /dev/sd0c readattr
Attributes table revision: 10
ID      Attribute name                  Threshold       Value   Raw
   Raw Read Error Rate               50             95     0x0000006fe42b
   Reallocated Sector Count           3            100     0x000000000000
   Power-On Hours Count               0             95     0xacda00001396
   Device Power Cycle Count           0            100     0x00000000002b
  *Unknown                            0              0     0x000000000000
  *Unknown                            0              0     0x000000000000
  *Unknown                            0              0     0x000000000018
  *Unknown                            0              0     0x000000000000
  *Unknown                            0              0     0x000000000000
  *Unknown                            0              0     0x000000000000
   Unknown                            0            100     0x000000000000
   High Fly Writes                    0             40     0x0010002d0028
   Temperature                        0             40     0x0010002d0028
   Hardware ECC Recovered             0            120     0x0000006fe42b
   Reallocation Event Count           3            100     0x000000000000
   Soft Read Error Rate               0            120     0x0000006fe42b
   Soft ECC Correction                0            120     0x0000006fe42b
   GMR Head Amplitude                 0            100     0x000000000064
   Temperature                       10            100     0x000000000000
  *Unknown                            0              0     0x000000000824
  *Unknown                            0              0     0x000000000960
  *Unknown                            0              0     0x000000000960
  *Unknown                            0              0     0x0000000000e1
One or more threshold values exceeded!
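
In the listing above, the rows flagged with * are exactly those whose Value is not above the Threshold. A rough way to pull those rows out for a report is sketched below. It assumes only that each attribute row ends with the three columns Threshold, Value and Raw (in hex), as in the listing; failing_attrs is a hypothetical helper name.

```shell
#!/bin/sh
# List attribute rows whose normalized Value is at or below the Threshold.
# Reads "atactl <device> readattr" style output on standard input.
# Assumption: the last three fields of a row are Threshold, Value, Raw.
failing_attrs() {
    awk '
        # Only rows ending in a hex raw value are attribute rows.
        NF >= 4 && $NF ~ /^0x[0-9a-fA-F]+$/ {
            thr = $(NF - 2) + 0
            val = $(NF - 1) + 0
            if (val <= thr) {
                # Rebuild the attribute name from the leading fields.
                name = ""
                for (i = 1; i <= NF - 3; i++)
                    name = name (name == "" ? "" : " ") $i
                gsub(/\*/, "", name)   # drop the failure marker
                printf "%s (value %d, threshold %d)\n", name, val, thr
            }
        }
    '
}

# Usage:
#   atactl /dev/sd0c readattr | failing_attrs
```

Attributes the tool cannot name show up as Unknown, so the vendor documentation is still needed to decide whether a flagged row is a real problem.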

Identify device characteristics

atactl /dev/sd0c identify
Model: KINGSTON SV300S37A60G, Rev: 527ABBF0, Serial #: 50026B774501EAB0
Device type: ATA, fixed
Cylinders: 16383, heads: 16, sec/track: 63, total sectors: 117231408
Device capabilities:
        ATA standby timer values
        IORDY operation
        IORDY disabling
Device supports the following standards:
ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7 ATA-8
Master password revision code 0xfffe
Device supports the following command sets:
        NOP command
        READ BUFFER command
        WRITE BUFFER command
        Host Protected Area feature set
        Read look-ahead
        Write cache
        Power Management feature set
        Security Mode feature set
        SMART feature set
        Flush Cache Ext command
        Flush Cache command
        48bit address feature set
        Set Max security extension commands
        Set Features subcommand required
        Power-up in standby feature set
        Advanced Power Management feature set
        DOWNLOAD MICROCODE command
        IDLE IMMEDIATE with UNLOAD FEATURE
        SMART self-test
        SMART error logging
Device has enabled the following command sets/features:
        NOP command
        READ BUFFER command
        WRITE BUFFER command
        Host Protected Area feature set
        Read look-ahead
        Write cache
        Power Management feature set
        SMART feature set
        Flush Cache Ext command
        Flush Cache command
        48bit address feature set
        Set Features subcommand required
        Advanced Power Management feature set
        DOWNLOAD MICROCODE command
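
The identify output lists SMART twice: once under the supported command sets and once under the enabled ones. A script can tell the two apart by looking only at the second section. This is a minimal sketch that relies on the section headings shown above; smart_enabled is a name chosen for illustration.

```shell
#!/bin/sh
# Succeed (exit 0) if "SMART feature set" is listed in the
# "Device has enabled ..." section of identify output on standard input.
smart_enabled() {
    awk '
        /^Device has enabled/ { in_enabled = 1; next }
        # Any other unindented line starts a different section.
        /^[A-Za-z]/ { in_enabled = 0 }
        in_enabled && /SMART feature set/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage: enable SMART only when it is not enabled yet.
#   atactl /dev/sd0c identify | smart_enabled || atactl /dev/sd0c smartenable
```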

Enable SMART status on the device

atactl /dev/sd0c smartenable

Read SMART values

atactl /dev/sd0c smartread
Off-line data collection:
    status: never started
    activity completion time: 0 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: completed ok or not started
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported

Check SMART status violations

atactl /dev/sd0c smartstatus
No SMART threshold exceeded
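
For unattended monitoring, the status check can be wrapped in a test on that message. The sketch below treats anything other than the line shown above, including an error from the tool, as a reason to warn; smart_status_ok and the mail recipient are assumptions made for the example.

```shell
#!/bin/sh
# Succeed only if the smartstatus output on standard input contains
# the "No SMART threshold exceeded" line.
smart_status_ok() {
    grep -q '^No SMART threshold exceeded'
}

# Usage, for example from a daily cron job:
#   if ! atactl /dev/sd0c smartstatus 2>&1 | smart_status_ok; then
#       echo "SMART status check failed on sd0" | mail -s "disk warning" root
#   fi
```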

Begin short device self checks

This takes approximately 90 seconds.

atactl /dev/sd0c smartoffline shortoffline

While the offline checks are running on the device, smartread will report something like this:

atactl /dev/sd0c smartread
Off-line data collection:
    status: (null)
    activity completion time: 32 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: (null)
remains 50% of total time
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported

Once the short self-check routine is done, the output will look like this:

sudo atactl /dev/sd0c smartread
Off-line data collection:
    status: completed ok
    activity completion time: 0 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: completed ok or not started
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported
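
Rather than re-running smartread by hand, a script can poll until the "remains N% of total time" line disappears. The following sketch depends on that line looking exactly as in the output above; selftest_remaining and the 30 second interval are arbitrary choices for the example.

```shell
#!/bin/sh
# Print the remaining percentage of a running self-test, or nothing
# when smartread output on standard input has no "remains" line.
selftest_remaining() {
    sed -n 's/^[[:space:]]*remains \([0-9][0-9]*\)% of total time.*/\1/p'
}

# Usage: wait for the self-test to finish.
#   while pct=$(atactl /dev/sd0c smartread | selftest_remaining); [ -n "$pct" ]; do
#       echo "self-test: ${pct}% remaining"
#       sleep 30
#   done
```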

Initiate off-line status collection

atactl /dev/sd0c smartoffline collect

References

https://www.backblaze.com/blog/hard-drive-smart-stats/
Comment by Martijn van Duren on misc

From a quick glance in the code atactl calls ATAPI_SMART/ATA_SMART_STATUS, which maps to section 7.31.6 of the ATA-3 spec[0] (there’s newer standards, but shouldn’t have changed). Here it states: “NORMAL OUTPUTS - If the device has not detected a threshold exceeded condition, the device sets the Cylinder Low register to 4Fh and the Cylinder High register to C2h. If the device has detected a threshold exceeded condition, the device sets the Cylinder Low register to F4h and the Cylinder High register to 2Ch.”

The message “No SMART threshold exceeded” only shows when req.cylinder == 0xc24f. So unless I’m missing something it’s your disk’s SMART that returns something that doesn’t make sense.
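
The comparison described in the comment can be illustrated with a few lines of shell arithmetic. This is not atactl's code, only a sketch of how the two register values from the spec quotation combine into the 16-bit values 0xc24f and 0x2cf4; the function name is hypothetical.

```shell
#!/bin/sh
# Classify a SMART RETURN STATUS result from the two cylinder registers.
# $1 = Cylinder Low register, $2 = Cylinder High register (e.g. 0x4F 0xC2).
smart_return_status() {
    combined=$(( ($2 << 8) | $1 ))
    if [ "$combined" -eq $((0xC24F)) ]; then
        echo "no threshold exceeded"
    elif [ "$combined" -eq $((0x2CF4)) ]; then
        echo "threshold exceeded"
    else
        # Neither documented value: the drive returned something odd.
        printf 'unexpected value 0x%04x\n' "$combined"
    fi
}

# Usage:
#   smart_return_status 0x4F 0xC2
```

A drive that returns neither pair falls into the last branch, which is the situation the comment describes.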

echothrust/howtos

A list of OpenBSD (mostly) material