echothrust/howtos

A list of OpenBSD (mostly) material

OpenBSD HDD health status monitoring

Many articles about hard drive health checks and failures have surfaced over the past couple of months. This paper gathers the information needed to check and monitor the health of our hard drives and, hopefully, to detect potential failures before they show themselves in a more catastrophic manner.

Toolset

Thankfully OpenBSD comes with utilities that can achieve what we want without installing much.

The tool that we will use extensively for the tasks is atactl(8), a program to manipulate ATA (IDE) devices.

Furthermore, the SMART Monitoring Tools package (smartmontools-x.y.tgz) is available for OpenBSD if you want to keep your checking scripts consistent across platforms.

Depending on your hard drive vendor and how closely it follows the SMART specification, it might be a good idea to look up the hardware specs and keep them at hand before you start. This will prove particularly useful when reading the attributes the device exposes.

Status codes

The following SMART metrics seem to indicate that a disk is about to die (values taken from the Backblaze article):

It seems that our disks on that server report SMART attributes 5, 197 and 198.
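A minimal sketch of how these attributes could be watched from a script, assuming the table layout that atactl readattr prints (shown in the next section): a numeric ID in the first column and a hexadecimal raw value in the last. The names WATCH_IDS and watch_raw are hypothetical, and what a non-zero raw value means for a given attribute is vendor-specific, so treat a hit as a prompt to look closer rather than a verdict.

```sh
#!/bin/sh
# watch_raw: read `atactl <dev> readattr` output on stdin and print every
# watched attribute ID whose raw value is not all zeros.
# WATCH_IDS is a hypothetical setting; 5, 197 and 198 come from the text above.
WATCH_IDS="5 197 198"

watch_raw() {
    awk -v ids="$WATCH_IDS" '
        BEGIN {
            # Turn the space-separated ID list into a lookup table.
            n = split(ids, a, " ")
            for (i = 1; i <= n; i++) watch[a[i]] = 1
        }
        # Data rows start with a numeric ID and end with a hex raw value.
        $1 ~ /^[0-9]+$/ && ($1 in watch) && $NF ~ /^0x/ {
            if ($NF !~ /^0x0+$/) print $1, $NF
        }'
}

# Usage on a real system (needs root):
#   atactl /dev/sd0c readattr | watch_raw
```

Empty output means none of the watched attributes carries a non-zero raw value.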

Read hard drive attributes

atactl /dev/sd0c readattr
Attributes table revision: 10
ID      Attribute name                  Threshold       Value   Raw
  1     Raw Read Error Rate               50             95     0x0000006fe42b
  5     Reallocated Sector Count           3            100     0x000000000000
  9     Power-On Hours Count               0             95     0xacda00001396
 12     Device Power Cycle Count           0            100     0x00000000002b
171    *Unknown                            0              0     0x000000000000
172    *Unknown                            0              0     0x000000000000
174    *Unknown                            0              0     0x000000000018
177    *Unknown                            0              0     0x000000000000
181    *Unknown                            0              0     0x000000000000
182    *Unknown                            0              0     0x000000000000
187     Unknown                            0            100     0x000000000000
189     High Fly Writes                    0             40     0x0010002d0028
194     Temperature                        0             40     0x0010002d0028
195     Hardware ECC Recovered             0            120     0x0000006fe42b
196     Reallocation Event Count           3            100     0x000000000000
201     Soft Read Error Rate               0            120     0x0000006fe42b
204     Soft ECC Correction                0            120     0x0000006fe42b
230     GMR Head Amplitude                 0            100     0x000000000064
231     Temperature                       10            100     0x000000000000
233    *Unknown                            0              0     0x000000000824
234    *Unknown                            0              0     0x000000000960
241    *Unknown                            0              0     0x000000000960
242    *Unknown                            0              0     0x0000000000e1
One or more threshold values exceeded!
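The table can also be screened automatically. The following is a sketch only, assuming the column layout shown above, where the last three fields of each row are Threshold, Value and Raw. The function name flag_attrs is hypothetical. It prints the ID of every row whose value is at or below its threshold, which, judging from this output alone, appears to be what the * marker denotes.

```sh
#!/bin/sh
# flag_attrs: read `atactl <dev> readattr` output on stdin and print the ID of
# each attribute whose normalized value is at or below its threshold.
# Assumes the last three columns are Threshold, Value, Raw, as in the table above.
flag_attrs() {
    awk '
        # Data rows start with a numeric attribute ID; this skips the
        # revision line, the column header and the trailing message.
        $1 ~ /^[0-9]+$/ && NF >= 5 {
            thr = $(NF - 2)
            val = $(NF - 1)
            if (val + 0 <= thr + 0) print $1
        }'
}

# Usage on a real system (needs root):
#   atactl /dev/sd0c readattr | flag_attrs
```

Attribute names contain spaces, so the sketch counts columns from the right instead of relying on fixed field positions.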

Identify device characteristics

atactl /dev/sd0c identify
Model: KINGSTON SV300S37A60G, Rev: 527ABBF0, Serial #: 50026B774501EAB0
Device type: ATA, fixed
Cylinders: 16383, heads: 16, sec/track: 63, total sectors: 117231408
Device capabilities:
        ATA standby timer values
        IORDY operation
        IORDY disabling
Device supports the following standards:
ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7 ATA-8
Master password revision code 0xfffe
Device supports the following command sets:
        NOP command
        READ BUFFER command
        WRITE BUFFER command
        Host Protected Area feature set
        Read look-ahead
        Write cache
        Power Management feature set
        Security Mode feature set
        SMART feature set
        Flush Cache Ext command
        Flush Cache command
        48bit address feature set
        Set Max security extension commands
        Set Features subcommand required
        Power-up in standby feature set
        Advanced Power Management feature set
        DOWNLOAD MICROCODE command
        IDLE IMMEDIATE with UNLOAD FEATURE
        SMART self-test
        SMART error logging
Device has enabled the following command sets/features:
        NOP command
        READ BUFFER command
        WRITE BUFFER command
        Host Protected Area feature set
        Read look-ahead
        Write cache
        Power Management feature set
        SMART feature set
        Flush Cache Ext command
        Flush Cache command
        48bit address feature set
        Set Features subcommand required
        Advanced Power Management feature set
        DOWNLOAD MICROCODE command
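Because the identify output lists supported and enabled features separately, a script can tell whether SMART is actually switched on before it relies on the other subcommands. A rough sketch, assuming the section headings shown above; the function name smart_enabled is hypothetical.

```sh
#!/bin/sh
# smart_enabled: read `atactl <dev> identify` output on stdin and succeed only
# if "SMART feature set" is listed under the *enabled* features section.
smart_enabled() {
    awk '
        # The enabled-features section starts at this heading.
        /^Device has enabled the following/ { in_enabled = 1; next }
        # Any other unindented line ends the section.
        /^[^ \t]/ { in_enabled = 0 }
        in_enabled && /SMART feature set/ { found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Usage on a real system (needs root):
#   if atactl /dev/sd0c identify | smart_enabled; then
#       echo "SMART is enabled"
#   else
#       echo "SMART is not enabled"
#   fi
```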

Enable SMART status on the device

atactl /dev/sd0c smartenable

Read SMART values

atactl /dev/sd0c smartread
Off-line data collection:
    status: never started
    activity completion time: 0 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: completed ok or not started
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported

Check SMART status violations

atactl /dev/sd0c smartstatus
No SMART threshold exceeded
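For unattended monitoring, this check can be wrapped in a small script. The sketch below treats only the exact message shown above as healthy and anything else (a different message, an error, no output) as a reason to alert. The function names smart_ok and check_disk are hypothetical.

```sh
#!/bin/sh
# smart_ok: read `atactl <dev> smartstatus` output on stdin and succeed only
# if it contains the healthy message shown above.
smart_ok() {
    grep -q '^No SMART threshold exceeded'
}

# check_disk: run the status check for one device and report the result.
# $1 is the device node, e.g. /dev/sd0c.
check_disk() {
    if atactl "$1" smartstatus 2>&1 | smart_ok; then
        echo "$1: SMART status OK"
    else
        echo "$1: SMART threshold exceeded or status unreadable" >&2
        return 1
    fi
}

# Usage on a real system (needs root):
#   check_disk /dev/sd0c
```

Matching the positive message instead of a failure message is a deliberate choice here: an unexpected reply then fails safe.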

Begin short device self checks

This takes approximately 90 seconds.

atactl /dev/sd0c smartoffline shortoffline

While the offline checks are running on the device, smartread will report something like this:

atactl /dev/sd0c smartread
Off-line data collection:
    status: (null)
    activity completion time: 32 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: (null)
remains 50% of total time
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported

Once the short self-check routine is done, the output will look like this:

sudo atactl /dev/sd0c smartread
Off-line data collection:
    status: completed ok
    activity completion time: 0 seconds
    capabilities:
        execute immediate
        read scanning
        self-test routines
Self-test execution:
    status: completed ok or not started
    recommended polling time:
        short routine: 1 minutes
        extended routine: 36 minutes
SMART capabilities:
    saving SMART data
    enable/disable attribute autosave
Error logging: supported
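Rather than rerunning smartread by hand, a script can poll until the test finishes. This is a sketch under one assumption taken from the two outputs above: the "remains N% of total time" line is present only while a self-test is running. The function names selftest_remaining and wait_selftest, and the 10 second interval, are hypothetical.

```sh
#!/bin/sh
# selftest_remaining: read `atactl <dev> smartread` output on stdin and print
# the remaining percentage, or nothing if no "remains N%" line is present.
selftest_remaining() {
    sed -n 's/^[[:space:]]*remains \([0-9][0-9]*\)% of total time.*/\1/p'
}

# wait_selftest: poll the device until the progress line disappears.
# $1 is the device node, e.g. /dev/sd0c.
wait_selftest() {
    dev=$1
    while :; do
        left=$(atactl "$dev" smartread | selftest_remaining)
        # No progress line: assume the test has finished (or never started).
        [ -z "$left" ] && break
        echo "$dev: self-test running, $left% remaining"
        sleep 10
    done
}

# Usage on a real system (needs root):
#   atactl /dev/sd0c smartoffline shortoffline
#   wait_selftest /dev/sd0c
```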

Initiate off-line status collection

atactl /dev/sd0c smartoffline collect

References

From a quick glance at the code, atactl calls ATAPI_SMART/ATA_SMART_STATUS, which maps to section 7.31.6 of the ATA-3 spec[0] (there are newer standards, but this should not have changed). It states: NORMAL OUTPUTS - If the device has not detected a threshold exceeded condition, the device sets the Cylinder Low register to 4Fh and the Cylinder High register to C2h. If the device has detected a threshold exceeded condition, the device sets the Cylinder Low register to F4h and the Cylinder High register to 2Ch.

The message “No SMART threshold exceeded” is only shown when req.cylinder == 0xc24f. So unless I’m missing something, it’s your disk’s SMART that returns something that doesn’t make sense.
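The register arithmetic can be illustrated in a few lines of shell. This is only a sketch of the mapping quoted from the spec, not atactl's actual code: the Cylinder High and Cylinder Low bytes are combined into one 16-bit value and compared against 0xc24f and 0x2cf4. The function name smart_verdict is hypothetical.

```sh
#!/bin/sh
# smart_verdict: combine the two cylinder registers into a 16-bit value and
# interpret it according to the spec text quoted above.
# $1 is Cylinder Low, $2 is Cylinder High (e.g. 0x4f 0xc2).
smart_verdict() {
    cyl=$(( ($2 << 8) | $1 ))
    if [ "$cyl" -eq "$((0xc24f))" ]; then
        echo "no threshold exceeded"
    elif [ "$cyl" -eq "$((0x2cf4))" ]; then
        echo "threshold exceeded"
    else
        # Neither signature: the device returned something outside the spec.
        echo "unexpected response"
    fi
}

smart_verdict 0x4f 0xc2
smart_verdict 0xf4 0x2c
```

Any other register pair falls through to the last branch, which corresponds to the case described above where the disk returns something that does not make sense.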