This is a post about the robustness of ZFS, and it can serve as a mini how-to for people who want to replace disks and do not have a hot spare in the system.
Last Monday, our local area was hit by a tremendous rainfall which caused our basement to be flooded. You can see the pictures of the flood here. Sorry about the quality. The primary objective was to salvage various floating hardware :-\
Wet hardware is also the reason for this post. Upon entering the basement I remembered my fileserver that was standing on the floor and quickly (and heroically) dashed to its rescue.
Unfortunately the server had already taken in quite a lot of water and three of its four raid-z (raid5) disks were already ankle deep in water.
I did not manage to take any pictures at the time, but took some today in order to illustrate where the waterline was.
My crude drawing skills were put to the test in order to create this.
Needless to say, I was quite worried about the state of my data. I quickly pulled the power plug and rushed the computer off to dry land (the living room), where a brave team consisting of my girlfriend and son started drying the disk components after I had disassembled them – well, removed the circuit boards at least.
After each disk had been dried, I carefully put them back together and tried to power them on – one by one.
Surprisingly, they all spun up, meaning that the motors were okay – yay!
Next step was to put them back into the fileserver and hope for the best.
And, to my relief, it booted! And the zpool came online! That was amazing! Apparently, nothing was lost. But just to be sure, I ran a scrub on the pool.
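Starting a scrub is a one-liner; a minimal sketch, using the pool name from this post:

```shell
# Kick off a scrub, which reads and checksum-verifies every block in the pool.
zpool scrub pool1p0

# Progress can be checked at any time while it runs:
zpool status pool1p0
```

The scrub runs in the background, so the pool stays usable while it works.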
This is the result:
  pool: pool1p0
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 5h0m with 0 errors on Tue Aug 2 03:20:10 2011
config:

        NAME        STATE     READ WRITE CKSUM
        pool1p0     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad10    ONLINE      51     0     0  1.50M repaired
            ad12    ONLINE       0     0     0

errors: No known data errors
I consider myself a very lucky man. Only 1.5 MB of corruption, with three of four disks partially submerged in water. Wow!
Anyway, I rushed out to buy three new disks, and as soon as they arrived I started replacing them, one by one.
I of course first did a full rsync of the data in the storage pool to another computer.
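Assuming the pool is mounted at /pool1p0 and the other machine is reachable over SSH (both the mountpoint and the hostname below are placeholders, not taken from the original post), such a backup could look like:

```shell
# Mirror the pool's contents to another host; -a preserves permissions
# and timestamps, --delete keeps the remote copy an exact mirror.
rsync -a --delete /pool1p0/ backuphost:/backup/pool1p0/
```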
Replacing the disks
Upon replacing the first disk (I chose ad10, as this was the one that was marked as bad), I got this error:
nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 6h22m, 86.62% done, 0h59m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        pool1p0                    DEGRADED     0     0    10
          raidz1                   DEGRADED     0     0    60
            ad4                    ONLINE       0     0     0  194M resilvered
            ad6                    ONLINE       0     0     0  194M resilvered
            replacing              DEGRADED     0     0     0
              6658299902220606505  REMOVED      0     0     0  was /dev/ad10/old
              ad10                 ONLINE       0     0     0  353G resilvered
            ad12                   ONLINE       0     0     0  161M resilvered

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x32>
The ZFS administrator's guide explains that the corruption is located in the meta-object set (MOS), but does not give any hint on how to remove or replace the set. Admittedly, I have not looked thoroughly into what the MOS actually is.
I put the original (faulted) ad10 disk back in, and the error went away (after a reboot).
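In hindsight, a disk can also be detached administratively before it is physically pulled, so the pool is never surprised by a vanishing device. A sketch of that approach (I simply swapped the hardware instead):

```shell
# Tell ZFS the disk is going away before physically removing it.
zpool offline pool1p0 ad10

# ...swap the hardware, then either bring the same disk back:
zpool online pool1p0 ad10

# ...or, if a new disk took its place on the same channel:
zpool replace pool1p0 ad10
```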
Then I decided to try again, this time with ad4. Physically replacing the disk on the SATA channel revealed this:
nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        pool1p0                  DEGRADED     0     0     0
          raidz1                 DEGRADED     0     0     0
            2439714831674233987  UNAVAIL      0    32     0  was /dev/ad4
            ad6                  ONLINE       0     0     0
            ad10                 ONLINE       0     0     0
            ad12                 ONLINE       0     0     0

errors: No known data errors
Okay, then the replacement.
nas1:~# zpool replace pool1p0 2439714831674233987 /dev/ad4
… And the resilvering started. The ETA eventually settled at roughly five hours, but the resilver ended up taking about 7.5 hours – probably because the relatively slow Atom processor was the bottleneck.
nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 708h0m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        pool1p0                    DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            replacing              DEGRADED     0     0     0
              2439714831674233987  REMOVED      0     0     0  was /dev/ad4/old
              ad4                  ONLINE       0     0     0  2.30M resilvered
            ad6                    ONLINE       0     0     0  1.53M resilvered
            ad10                   ONLINE       0     0     0  1.52M resilvered
            ad12                   ONLINE       0     0     0  1.38M resilvered

errors: No known data errors
The resilvering revealed a total of 4 corrupted files, which I could replace from backup.
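The affected files show up with their full paths in the verbose status output, which is how one finds what to restore:

```shell
# -v lists the individual files with permanent errors,
# instead of only object IDs like <metadata>:<0x32>.
zpool status -v pool1p0
```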
However, this led me to the next challenge:
Clearing errors, and merging replacement disks
I could not get rid of the errors, which effectively left the zpool in a permanently degraded state. Every document I could dig up led me to the conclusion that I should remove the affected files – which I did – and then run zpool clear on the pool to clear the errors.
The solution was to reboot after I had removed the files, and let the pool resilver again. This worked, and led me to believe that I could simply have done a zpool clear and then a scrub to verify the consistency of the data.
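In other words, the recovery sequence after restoring the damaged files would presumably have been:

```shell
# Reset the pool's error counters once the damaged files are gone.
zpool clear pool1p0

# Then re-verify every block; status should afterwards report 0 errors.
zpool scrub pool1p0
zpool status pool1p0
```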
After this, I could repeat the somewhat lengthy process for the next disk.
In total I have had about 10 minutes of downtime, caused by replacing the disks.
Plus of course a couple of hours downtime while the server dried. This is, in my opinion, very impressive. Another vote for zfs, or +1 on google+ 🙂
I have actually found this ZFS recovery exercise very enlightening. It is something you usually do not get to do under such “relaxed” circumstances as I was privileged with here.
Update: The new disks do not support temperature polling; apparently Western Digital has removed the feature.
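For drives that do report it, the temperature can be read from the SMART attributes, for example with smartmontools (assuming it is installed; the device name below is just an example from this system):

```shell
# Dump the SMART attribute table and pick out the temperature line, if any.
smartctl -A /dev/ad4 | grep -i temperature
```

If the drive has dropped the attribute, the grep simply comes back empty.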