Replacing a dying HDD on FreeNAS

I had a degrading HDD that the LSI MegaRAID SAS card kept alive until its last breath. The drive is PD 01, presented to the OS as /dev/mfid1. This can be checked with mfiutil:

# mfiutil show drives
mfi0 Physical Drives:
 0 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00124313> SATA E1:S0
 1 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00124517> SATA E1:S1
 2 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00124488> SATA E1:S2
 3 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00120759> SATA E1:S3
 4 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00124038> SATA E1:S4
 5 (  699G) ONLINE <WDC WD7502ABYS-1 0C06 serial=WD-WMAU00124851> SATA E1:S5

The LED on the enclosure can be blinked with mfiutil locate 1 on. I enabled this to make the physical replacement easier.
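The locate LED is just a toggle, so it is worth turning it back off once the drive has been swapped; a minimal sketch, using drive index 1 from the listing above:

```shell
# Blink the locate LED on physical drive 1 to identify it in the enclosure
mfiutil locate 1 on

# ... pull and replace the drive ...

# Turn the LED back off afterwards
mfiutil locate 1 off
```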

Since this is a FreeNAS system, each HDD carries a swap partition. Before the procedure, the failing disk's swap has to be taken offline with swapoff so no swapped-out pages are lost.

$ swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/mfid0p1.eli   2097152    65676  2031476     3%
/dev/mfid1p1.eli   2097152    63760  2033392     3%
/dev/mfid2p1.eli   2097152    63280  2033872     3%
/dev/mfid3p1.eli   2097152    64104  2033048     3%
/dev/mfid4p1.eli   2097152    65728  2031424     3%
/dev/mfid5p1.eli   2097152        0  2097152     0%
Total            12582912   322548 12260364     3%
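Taking the failing disk's swap offline could look like this; a sketch, with the device name taken from the swapinfo listing above:

```shell
# Disable swap on the failing disk so no pages are lost when it is removed
swapoff /dev/mfid1p1.eli

# Verify the device no longer appears in the list
swapinfo
```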

There were some hiccups, and I/O operations became slow. Then, finally, the last dying words:

mfi0: COMMAND 0xfffffe0000985850 TIMEOUT AFTER 38 SECONDS
mfi0: 34209 (547188641s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 28 00 46 ed bf 80 00 00 80 00, Sense: 3/11/00
mfi0: 34210 (547188650s/0x0002/info) - Unexpected sense: PD 01(e0x20/s1) Path 1221000001000000, CDB: 28 00 46 ed c9 98 00 00 c0 00, Sense: 3/11/00
mfi0: 34211 (547188658s/0x0002/WARN) - PD 01(e0x20/s1) Path 1221000001000000  reset (Type 03)
mfi0: 34212 (547188659s/0x0002/WARN) - Removed: PD 01(e0x20/s1)
mfi0: I/O error, cmd=0xfffffe0000986378, status=0x33, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1189986176-1189986303
mfi0: I/O error, cmd=0xfffffe0000982110, status=0x33, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1189991440-1189991631
mfi0: I/O error, cmd=0xfffffe0000985850, status=0x33, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1189988760-1189988951
mfi0: I/O error, cmd=0xfffffe0000982ee0, status=0x33, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=write 1189978320-1189978383
mfi0: I/O error, cmd=0xfffffe0000982aa0, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190011784-1190012007
mfi0: I/O error, cmd=0xfffffe0000983650, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1189994096-1189994159
mfi0: I/O error, cmd=0xfffffe0000984db0, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190011560-1190011783
mfi0: I/O error, cmd=0xfffffe0000985388, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 4194960-4194975
mfi0: I/O error, cmd=0xfffffe0000982330, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1464072848-1464072863
mfi0: I/O error, cmd=0xfffffe00009848e8, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=write 1189980896-1189981023
mfi0: I/O error, cmd=0xfffffe0000984750, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1464073360-1464073375
mfi0: 34213 (547188659s/0x0002/info) - Removed: PD 01(e0x20/s1) Info: enclPd=20, scsiType=0, portMap=01, sasAddr=1221000001000000,00mfi0: 00000000000000
I/O error, cmd=0xfffffe0000982220, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1190012008-1190012239
mfi0: I/O error, cmd=0xfffffe0000982b28, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=read 1190012240-1190012463
mfi0: I/O error, cmd=0xfffffe0000982e58, status=0xc, scsi_status=0
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfid1: hard error cmd=read 1190012464-1190012519
mfi0: I/O error, cmd=0xfffffe0000982550, status=0xc, scsi_status=0
mfi0: sense error 112, sense_key 3, asc 0, ascq 0
mfid1: hard error cmd=write 1189983496-1189983623
pass1 at mfi0 bus 0 scbus0 target 1 lun 0
pass1: <ATA WDC WD7502ABYS-1 0C06> s/n      WD-WMAU00124517 detached
(pass1:mfi0:0:1:0): Periph destroyed
mfi0: 34214 (547188659s/0x0002/info) - State change on PD 01(e0x20/s1) from ONLINE(18) to FAILED(11)
mfi0: 34215 (547188659s/0x0001/info) - State change on VD 01/1 from OPTIMAL(3) to OFFLINE(0)
mfid1: detached
mfi0: 34216 (547188659s/0x0001/FATAL) - VD 01/1 is now OFFLINE
mfi0: 34217 (547188659s/0x0002/info) - State change on PD 01(e0x20/s1) from FAILED(11) to UNCONFIGURED_BAD(1)
mfi0: 34218 (547188659s/0x0041/info) - Deleted VD 01/1

I did not plan to do an online replacement (this can be done with mfiutil), so I rebooted and created a new virtual disk with the LSI tool after POST.
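For reference, the online path would roughly mean telling the controller to accept the new drive and re-creating the single-drive volume it exports to the OS; a hedged sketch with mfiutil, assuming the new drive shows up at index 1 again:

```shell
# If the controller flagged the new drive as bad/foreign, mark it as an
# unconfigured good drive
mfiutil good 1

# Re-create the single-drive JBOD volume the controller presents to the OS
mfiutil create jbod 1
```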

The OS detected the new disk, and I used the FreeNAS web interface to perform the replacement.

pool: vd-storage
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri May  5 06:33:23 2017
        130G scanned out of 3.43T at 272M/s, 3h31m to go
        21.6G resilvered, 3.71% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vd-storage                                      ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/18ca038b-453b-11e4-9b9b-0024e86886b3  ONLINE       0     0     0
            gptid/f14f4816-314b-11e7-ba2f-0024e86886b3  ONLINE       0     0     0  (resilvering)
            gptid/193b7e17-453b-11e4-9b9b-0024e86886b3  ONLINE       0     0     0
            gptid/196ea492-453b-11e4-9b9b-0024e86886b3  ONLINE       0     0     0
            gptid/19a17178-453b-11e4-9b9b-0024e86886b3  ONLINE       0     0     0
            gptid/19d78741-453b-11e4-9b9b-0024e86886b3  ONLINE       0     0     0

errors: No known data errors
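The resilver can be watched from the shell until it finishes:

```shell
# Repeat until the "scan:" line reports the resilver has completed
zpool status vd-storage

# Overall pool health and capacity at a glance
zpool list vd-storage
```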

Takeaway: the original drive survived 6.5 years of runtime, and even after the first errors appeared it kept going for another ~3,000 hours.


Author: Gajdos Tamás

A "barefoot physicist" with some IT skills in system administration.
