FreeNAS smartctl disks behind RAID controller

Captain’s Log:
Recently I got strange error messages in dmesg on my FreeNAS 9.10.2 (Dell PowerEdge 2950):

mfi0: 26208 (536790704s/0x0020/info) - Patrol Read started
mfi0: 26261 (536801289s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 2f 00 34 8c 00 00 00 10 00 00, Sense: 3/11/00
mfi0: 26267 (536802664s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 2f 00 36 33 68 99 00 10 00 00, Sense: 3/11/00

… and around 50 more. That is definitely not fun. But what is the problem?

I looked up the SCSI sense codes (sense key/ASC/ASCQ) and found the following for “Sense: 3/11/00”:

3     11     00     Medium Error - unrecovered read error
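To avoid looking this up by hand every time, the triple can be decoded with a few lines of Python. This is only a minimal sketch: the function name decode_sense and the two small tables are my own invention, and the tables cover just the codes that show up in this post, not the full SCSI lists.

```python
# Lookup tables limited to the codes seen in this post (assumption:
# anything else is simply reported as unknown).
SENSE_KEYS = {0x0: "No Sense", 0x3: "Medium Error"}
ASC_ASCQ = {(0x11, 0x00): "Unrecovered read error"}


def decode_sense(triple):
    """Decode a 'key/asc/ascq' string such as '3/11/00' (all fields are hex)."""
    key, asc, ascq = (int(part, 16) for part in triple.split("/"))
    key_text = SENSE_KEYS.get(key, "unknown sense key 0x%x" % key)
    detail = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ %02x/%02x" % (asc, ascq))
    return "%s - %s" % (key_text, detail)


print(decode_sense("3/11/00"))  # Medium Error - Unrecovered read error
```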

Now that's fun! But how do I access each disk behind the LSI MegaRAID card with smartctl?

On Ubuntu it was easy. For my RAID 10 setup with 4 disks, the following command had to be used:

$ sudo smartctl -d megaraid,N -a /dev/sda

Here N is the number of the disk behind the RAID controller. For my 4-disk RAID 10 setup: [0-3].
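If you want to query all four disks in one go, the command line can be generated instead of typed. The sketch below is an assumption-laden helper, not anything smartmontools ships: smartctl_cmd and run_smartctl are hypothetical names, and only the -d and -a options already used above are relied on. Running it for real needs root and smartmontools installed.

```python
import subprocess


def smartctl_cmd(device, dev_type=None, options=("-a",)):
    """Build the smartctl argument list; dev_type is passed via -d."""
    cmd = ["smartctl"]
    if dev_type is not None:
        cmd += ["-d", dev_type]
    cmd += list(options)
    cmd.append(device)
    return cmd


def run_smartctl(device, dev_type=None, options=("-a",)):
    """Run smartctl and return its text output (hypothetical wrapper)."""
    result = subprocess.run(
        smartctl_cmd(device, dev_type, options),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Print the commands for the four disks of the RAID 10 example (N = 0..3).
for n in range(4):
    print(" ".join(smartctl_cmd("/dev/sda", "megaraid,%d" % n)))
```

The same builder works for the FreeBSD case by passing a different device and device type, for example smartctl_cmd("/dev/pass1", "sat", ("-A",)).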

On FreeBSD/FreeNAS the drive can only be accessed directly through /dev/passN, not through mfiN. Here N is the disk number:

# smartctl -a /dev/pass1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/pass0 [SAT] failed: SATA device detected,
MegaRAID SAT layer is reportedly buggy, use '-d sat' to try anyhow

… and we need to pass the device type…

# smartctl -d sat -A /dev/pass1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5630
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       18
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       53698
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       46
194 Temperature_Celsius     0x0022   115   108   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

You might notice two high values:

  • Raw_Read_Error_Rate: 5630
  • Power_On_Hours: 53698 (6 years, 45 days, 22 hours)
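For the curious, here is a rough sketch of how the attribute table can be parsed and how the power-on hours translate into years. The function names are made up for illustration, the parser assumes the 10-column layout shown above with a plain integer in RAW_VALUE, and a year is taken as 8766 hours (365.25 days).

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5630
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       53698
"""


def parse_attributes(text):
    """Map ATTRIBUTE_NAME -> (VALUE, WORST, THRESH, RAW_VALUE)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), int(fields[4]),
                                int(fields[5]), int(fields[9]))
    return attrs


def hours_to_ydh(hours, hours_per_year=8766):
    """Split an hour count into (years, days, hours)."""
    years, rest = divmod(hours, hours_per_year)
    days, remaining_hours = divmod(rest, 24)
    return years, days, remaining_hours


attrs = parse_attributes(SAMPLE)
print(attrs["Raw_Read_Error_Rate"][3])             # 5630
print(hours_to_ydh(attrs["Power_On_Hours"][3]))    # (6, 45, 22)
```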

I might need a replacement disk for HDD2 (PD 01). Well, this is why you buy server grade stuff. It will last.

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE3 Serial ATA
Device Model:     WDC WD7502ABYS-18A6B0
Serial Number:    WD-WMAU00124517
LU WWN Device Id: 5 0014ee 0abf40031
Add. Product Id:  DELL
Firmware Version: 03.00C06
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.5, 3.0 Gb/s
Local Time is:    Wed Jan  4 09:19:22 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
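Since the controller does not pass the SMART status through, smartctl falls back to an attribute check. As far as I understand it, an attribute counts as failed when its normalized VALUE is less than or equal to its THRESH. A minimal sketch of that rule follows; the function name is hypothetical, and skipping attributes with a threshold of 0 is my own simplification.

```python
def failing_attributes(rows):
    """rows: iterable of (name, value, thresh); return names at or below threshold."""
    return [name for name, value, thresh in rows
            if thresh > 0 and value <= thresh]


# The three Pre-fail attributes from the table in this post.
ROWS = [
    ("Raw_Read_Error_Rate", 200, 51),
    ("Spin_Up_Time", 253, 21),
    ("Reallocated_Sector_Ct", 197, 140),
]

print(failing_attributes(ROWS))  # []
```

With these numbers nothing is below its threshold, which would explain why the disk still reports PASSED even though the raw error counters look bad.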

What’s next?

  • My FreeNAS storage was built as a RAIDZ2 pool, so it survives up to two disk failures. Well, that is a relief.
  • Since it is working hours, I can only run short SMART tests on the disks. They did not fail, at least for now. At night I can run the long tests (~3h).
  • Do an additional/extraordinary backup.
  • Order the replacement disk…

The story continues:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       33238
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       18
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   023   023   000    Old_age   Always       -       56216
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       46
194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
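To see what actually changed between the two readings, the raw values can be compared. A small sketch, with the numbers copied by hand from the two tables in this post and a hypothetical helper name:

```python
BEFORE = {
    "Raw_Read_Error_Rate": 5630,
    "Reallocated_Sector_Ct": 18,
    "Power_On_Hours": 53698,
    "Current_Pending_Sector": 0,
}
AFTER = {
    "Raw_Read_Error_Rate": 33238,
    "Reallocated_Sector_Ct": 18,
    "Power_On_Hours": 56216,
    "Current_Pending_Sector": 7,
}


def raw_changes(before, after):
    """Return {name: delta} for attributes whose raw value changed."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] != before[name]}


for name, delta in raw_changes(BEFORE, AFTER).items():
    print("%-24s %+d" % (name, delta))
```

Over 2518 more power-on hours the read error counter grew by 27608 and 7 sectors became pending, while the reallocated sector count stayed at 18.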

I am waiting on a deco to get my hands on a replacement disk. Living dangerously, aren’t we?

Murdered: 2017-05-04

mfi0: COMMAND 0xfffffe0000985850 TIMEOUT AFTER 38 SECONDS
mfi0: 34209 (547188641s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 28 00 46 ed bf 80 00 00 80 00, Sense: 3/11/00
mfi0: 34210 (547188650s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 28 00 46 ed c9 98 00 00 c0 00, Sense: 3/11/00
mfi0: 34211 (547188658s/0x0002/WARN) - PD 01(e0x20/s1) Path 1221000001000000  reset (Type 03)
mfi0: 34212 (547188659s/0x0002/WARN) - Removed: PD 01(e0x20/s1)
mfi0: I/O error, cmd=0xfffffe0000986378, status=0x33, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1189986176-1189986303
mfi0: I/O error, cmd=0xfffffe0000982110, status=0x33, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1189991440-1189991631
mfi0: I/O error, cmd=0xfffffe0000985850, status=0x33, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1189988760-1189988951
mfi0: I/O error, cmd=0xfffffe0000982ee0, status=0x33, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=write 1189978320-1189978383
mfi0: I/O error, cmd=0xfffffe0000982aa0, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190011784-1190012007
mfi0: I/O error, cmd=0xfffffe0000983650, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1189994096-1189994159
mfi0: I/O error, cmd=0xfffffe0000984db0, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190011560-1190011783
mfi0: I/O error, cmd=0xfffffe0000985388, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 4194960-4194975
mfi0: I/O error, cmd=0xfffffe0000982330, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1464072848-1464072863
mfi0: I/O error, cmd=0xfffffe00009848e8, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=write 1189980896-1189981023
mfi0: I/O error, cmd=0xfffffe0000984750, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1464073360-1464073375
mfi0: 34213 (547188659s/0x0002/info) - Removed: PD 01(e0x20/s1) Info: enclPd=20, scsiType=0, portMap=01, sasAddr=1221000001000000,0000000000000000
mfi0: I/O error, cmd=0xfffffe0000982220, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1190012008-1190012239
mfi0: I/O error, cmd=0xfffffe0000982b28, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190012240-1190012463
mfi0: I/O error, cmd=0xfffffe0000982e58, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1190012464-1190012519
mfi0: I/O error, cmd=0xfffffe0000982550, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=write 1189983496-1189983623
pass1 at mfi0 bus 0 scbus0 target 1 lun 0
pass1: <ATA WDC WD7502ABYS-1 0C06> s/n      WD-WMAU00124517 detached
(pass1:mfi0:0:1:0): Periph destroyed
mfi0: 34214 (547188659s/0x0002/info) - State change on PD 01(e0x20/s1) from ONLINE(18) to FAILED(11)
mfi0: 34215 (547188659s/0x0001/info) - State change on VD 01/1 from OPTIMAL(3) to OFFLINE(0)
mfid1: detached
mfi0: 34216 (547188659s/0x0001/FATAL) - VD 01/1 is now OFFLINE
mfi0: 34217 (547188659s/0x0002/info) - State change on PD 01(e0x20/s1) from FAILED(11) to UNCONFIGURED_BAD(1)
mfi0: 34218 (547188659s/0x0041/info) - Deleted VD 01/1
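The CDB bytes in the "Unexpected sense" lines tell which blocks the controller was touching. Below is a minimal sketch that decodes a 10-byte CDB, assuming the usual layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function name and the tiny opcode table are mine and only cover the two opcodes seen in this post.

```python
# Only the opcodes that appear in the logs above.
OPCODES = {0x28: "READ(10)", 0x2F: "VERIFY(10)"}


def decode_cdb10(cdb_hex):
    """Return (opcode name, start LBA, block count) for a 10-byte CDB."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB, got %d bytes" % len(cdb))
    opcode = OPCODES.get(cdb[0], "opcode 0x%02x" % cdb[0])
    lba = int.from_bytes(cdb[2:6], "big")
    length = int.from_bytes(cdb[7:9], "big")
    return opcode, lba, length


opcode, lba, length = decode_cdb10("28 00 46 ed bf 80 00 00 80 00")
print(opcode, lba, lba + length - 1)  # READ(10) 1189986176 1189986303
```

The decoded range matches the "mfid1: hard error cmd=read 1189986176-1189986303" line in the log, so the failed read and the unexpected sense refer to the same blocks.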

Author: Gajdos Tamás

A "barefoot physicist" with some IT skills in system administration.
