Data recovery

Yet another goodbye

For those who are about to halt -p, we salute you.

I’m moving out – of my first vserver that is.

I have had the server running for, well

lucretia:~# uptime 
 19:19:59 up 700 days, 22:51,  2 users,  load average: 0.02, 0.06, 0.02

Wow.. Time flies.

I have been migrating the services running on the machine to another server. Now is the time to do the inevitable.

Sorry old chap, better you than me.

lucretia:~# halt -p

Broadcast message from root@lucretia (pts/1) (Fri Sep 30 19:23:24 2011):

The system is going down for system halt NOW!
lucretia:~# Connection to lucretia.greenpc.dk closed by remote host.
Connection to lucretia.greenpc.dk closed.

Of course, prior to this, I made sure to take a complete copy of the file system, like so. Isn’t rsync the best tool?

rsync --progress -poazuHK -e ssh --delete --exclude /proc --exclude /sys --exclude /dev / home.greenpc.dk:/mnt/primary/backup/lucretia.greenpc.dk

Is ZFS overly hyped?

When ZFS first appeared, it was well received and praised.
The time was ripe for a modernization of file systems to eliminate the tedious task of having to plan and create volume groups.

My first experience with ZFS was when I was hired to build a NAS for a small business.
Initially, I chose to go the traditional way and build the array as a raid5. This took ages (about a day) to complete.
I was therefore quite surprised when I saw the difference in creation time between a traditional raid5 and a raidz: the raidz was ready seconds after I had created it.
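
For comparison, creating the two kinds of array boils down to something like this. Device and pool names are purely illustrative, and the mdadm variant stands in for whatever traditional software raid that first NAS used:

# Traditional Linux software raid5: the array must run an initial sync
# before it is fully redundant, which can take many hours on large disks
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# ZFS raidz: the pool is created, mounted and usable the moment the command returns
zpool create tank raidz ada0 ada1 ada2 ada3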

Also, when my server drowned, ZFS helped me salvage my data very easily.

So, what are the cool things about ZFS:

  • Creating raids takes seconds instead of hours (or days)
  • It has filesystem-level data integrity
  • Built-in snapshots that use copy-on-write to preserve disk space. It’s like Apple’s Time Machine – only at the filesystem level (see the sketch after this list)
  • The L2ARC feature is brilliant, though I have not yet had a chance to try it out
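
To give a rough idea of how lightweight the snapshots are, here is a small sketch; the dataset and snapshot names are invented for the example:

# take an instant snapshot - it consumes no extra space until data changes
zfs snapshot tank/media@before-cleanup

# list snapshots and the space they currently hold on to
zfs list -t snapshot

# roll the dataset back to the snapshot if the cleanup went wrong
zfs rollback tank/media@before-cleanup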

I know btrfs is said to be the upcoming zfs killer, but from what I have read, it still lacks most of the features that have been fundamental to zfs for some time now.
The advantage of btrfs is that it has the potential to be a clean-room implementation of the same ideas, addressing some of the design flaws that zfs has.

I’ve only seen the tip of the iceberg, and for all my usage it has proved itself more than worthy. I am still trying to convince a business partner to engage in a bolder and larger-scale zfs implementation, but this is still in the idea stage.

The next step for file systems will probably come more from the userspace perspective. Modern devices, like the iPad, require a different filesystem layout – or none at all. The next evolutionary step for file systems will be metadata-driven, and storage will be a large pool distributed over different mediums, like cloud-based storage.

The applicability of ZFS is enormous – and in my opinion the only thing holding people back is the lack of trust in the technology.
Even so, it is technology as I like it best: complexity made simple, with rich opportunity to dive into the sea of technical details. So, to answer my own question: no, I don’t think so.

ZFS drive replacement

This is a post about the robustness of zfs, and it can serve as a mini how-to for people who want to replace disks and do not have a hot spare in the system.

Background

Last Monday, our local area was hit by a tremendous rainfall which caused our basement to be flooded. You can see the pictures of the flood here. Sorry about the quality. The primary objective was to salvage various floating hardware :-\

Wet hardware is also the reason for this post. Upon entering the basement I remembered my fileserver that was standing on the floor and quickly (and heroically) dashed to its rescue.

Unfortunately the server had already taken in quite a lot of water and three of its four raid-z (raid5) disks were already ankle deep in water.

I did not manage to take any pictures at the time, but took some today in order to illustrate where the waterline was.

 

This is the inside of the case side. If you look carefully, you can see the traces left by the water.

My crude drawing skills were put to the test in order to create this.

An approximation of the water level

Needless to say, I was quite worried about the state of my data. I quickly removed the power plug and rushed the computer off to dry land (the living room), where a brave team consisting of my girlfriend and son started drying the disk components after I had disassembled them – well, removed the circuit boards at least.

After each disk had been dried, I carefully put them back together and tried to power them on – one by one.
Surprisingly, they all spun up, meaning that the motors were okay – yay!

Next step was to put them back into the fileserver and hope for the best.

And, to my relief, it booted! And the zpool came online! That was amazing! Apparently, nothing was lost. But just to be sure I ran a scrub on the pool.
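
The scrub itself is a one-liner; the pool name matches the status output below:

zpool scrub pool1p0

# progress, and eventually the result, can be followed with
zpool status pool1p0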

This is the result:

  pool: pool1p0
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 5h0m with 0 errors on Tue Aug  2 03:20:10 2011
config:

	NAME        STATE     READ WRITE CKSUM
	pool1p0     ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0
	    ad10    ONLINE      51     0     0  1.50M repaired
	    ad12    ONLINE       0     0     0

errors: No known data errors

I consider myself a very lucky man. Only 1.5M of corruption, with 3 of 4 disks partially submerged in water? Wow!

Anyway, I rushed out to buy three new disks, and as soon as they arrived I started replacing the old ones, one by one.

I did, of course, do a full rsync of the data in the storage pool to another computer first.
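
Roughly along these lines; the mount point and target host are only illustrative:

rsync -aH --progress /pool1p0/ backuphost:/backup/pool1p0/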

Replacing the disks

Upon replacing the first disk (I chose ad10, as this was the one that was marked as bad), I got this error:

nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 6h22m, 86.62% done, 0h59m to go
config:

	NAME                       STATE     READ WRITE CKSUM
	pool1p0                    DEGRADED     0     0    10
	  raidz1                   DEGRADED     0     0    60
	    ad4                    ONLINE       0     0     0  194M resilvered
	    ad6                    ONLINE       0     0     0  194M resilvered
	    replacing              DEGRADED     0     0     0
	      6658299902220606505  REMOVED      0     0     0  was /dev/ad10/old
	      ad10                 ONLINE       0     0     0  353G resilvered
	    ad12                   ONLINE       0     0     0  161M resilvered

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x32>

The zfs administrator's guide explains that the corruption is located in the meta-object set (MOS), but it does not give any hint about how to remove or replace the set. Admittedly, I have not looked thoroughly into what the MOS actually is.

I put the original (faulted) ad10 disk back in, and the error went away (after a reboot).

Then I decided to try again, this time with ad4. Physically replacing the disk on the SATA channel revealed this:

nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

	NAME                     STATE     READ WRITE CKSUM
	pool1p0                  DEGRADED     0     0     0
	  raidz1                 DEGRADED     0     0     0
	    2439714831674233987  UNAVAIL      0    32     0  was /dev/ad4
	    ad6                  ONLINE       0     0     0
	    ad10                 ONLINE       0     0     0
	    ad12                 ONLINE       0     0     0

errors: No known data errors

Okay, then the replacement.

nas1:~# zpool replace pool1p0 2439714831674233987 /dev/ad4

… And the resilvering started. The ETA eventually settled at around 5 hours, but the resilver took about 7.5 hours – probably because the relatively slow Atom processor was the bottleneck.

nas1:~# zpool status
  pool: pool1p0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 708h0m to go
config:

	NAME                       STATE     READ WRITE CKSUM
	pool1p0                    DEGRADED     0     0     0
	  raidz1                   DEGRADED     0     0     0
	    replacing              DEGRADED     0     0     0
	      2439714831674233987  REMOVED      0     0     0  was /dev/ad4/old
	      ad4                  ONLINE       0     0     0  2.30M resilvered
	    ad6                    ONLINE       0     0     0  1.53M resilvered
	    ad10                   ONLINE       0     0     0  1.52M resilvered
	    ad12                   ONLINE       0     0     0  1.38M resilvered

errors: No known data errors

The resilvering revealed a total of 4 corrupted files, which I could replace from backup.

However, this led me to the next challenge:

Clearing errors, and merging replacement disks

I could not get rid of the errors, which effectively left the zpool in a permanently degraded state. Every document I could dig up led me to the conclusion that I should remove the affected files – which I did – and then run zpool clear on the pool to clear the errors.

The solution was to reboot after I had removed the files, and let the pool resilver again. This worked, and it led me to believe that I could simply have done a zpool clear and then a scrub to verify the consistency of the data.
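
That sequence would look roughly like this, using the same pool name as above:

# reset the error counters on the pool
zpool clear pool1p0

# re-read and verify every block
zpool scrub pool1p0

# confirm that no permanent errors remain
zpool status -v pool1p0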

After this, I could repeat the somewhat lengthy process for the next disk.

Summary

In total I have had about 10 minutes of downtime, caused by replacing the disks.
Plus, of course, a couple of hours of downtime while the server dried. This is, in my opinion, very impressive. Another vote for zfs, or +1 on google+ :-)

I have actually found this zfs recovery exercise very enlightening. It is something you usually do not get to do under such “relaxed” circumstances as I was privileged with here.

Update: The new disks do not support temperature polling; apparently Western Digital has removed the feature.

Screenshot: only the remaining "old" disk now supports temperature monitoring.
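
For anyone who wants to poll the temperature themselves, smartmontools is one way to do it; the device name is illustrative and this is not necessarily the tool behind the screenshot:

smartctl -A /dev/ad10 | grep -i temperature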

A lesson in recovery techniques

I recently got this message from fsck.jfs:

Unrecoverable error writing M to /dev/sdb3. CANNOT CONTINUE.

Okay, so this is an error that can be ignored – right? I can just force-mount the partition and extract the data with the superblock marked as dirty… right?!

krc@X61s % mount -o ro -f /dev/sdb3 /mnt/rec_mount
krc@X61s % ls /mnt/rec_mount
krc@X61s %

Damn it! This was a 1.4 TB partition with 900 GB of data, including home videos and .mkv rips of my DVDs. Most of the data could be restored, but a lot of work would be lost.

I am running JFS on all my storage drives, as I have found it a good all-round file system, especially on smaller devices with limited resources. Unfortunately it is a somewhat niche file system that does not have a broad variety of recovery tools.
I found jfsrec as the only non-commercial tool. Unfortunately it was unable to read from the partition directly and stopped with an early EOF marker error.

Jfsrec pointed me in the direction of the dd_rhelp tool, which turned out to be a life saver. There was just one catch: I needed a disk big enough to hold a complete dump of the partition.

A few days later, armed with a new disk, I was able to continue. I used this guide at debianadmin.com to get started. The command could not be simpler to use:

krc@X61s % dd_rhelp /dev/sdb3 /mnt/rec_target/bad_disk.img

And it started copying data! Yay!
After some time, it settled on a transfer rate of 2500… KBps! Wow… This is rather slow…
Quick calculation: (1,400,000,000 KB / 2500 KB/s) / 3600 / 24 ≈ 6.48 days.

One week later:

krc@X61s % ssh atom1
ssh: connect to host atom1 port 22: No route to host

Hmm… I had been doing this periodically over the last week.

krc@X61s % ping atom1
PING atom1 (172.16.0.122) 56(84) bytes of data.
From atom1 (172.16.0.122) icmp_seq=1 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=2 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=3 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=4 Destination Host Unreachable
^C
--- atom1 ping statistics ---
6 packets transmitted, 0 received, +4 errors, 100% packet loss, time 5059ms

Hmm… That's odd. I didn't remember putting a ; halt -p after the dd_rhelp command.

A few pings and some reflections later I actually got up and checked the room where the recovery setup is located.

This was what I found:

20110218-154606_redone.jpg

To quote Freddie Frinton:

I’ll kill that cat!

20110218-154628_redone.jpg
Notice the dangling SATA power cables at the top of the photo… I have always found Linux a stable operating system, but a system disk physically disappearing is a valid excuse for a crash!

Fortunately, dd_rhelp got to finish the disk dump – which was very lucky, because after the fall the damaged disk is now officially dead. It no longer spins up, and it is not recognised by the BIOS.

I tried running fsck.jfs directly on the disk image, and it managed to fix the errors in the partition. Now I could mount the disk image like so:

krc@X61s % sudo mount -o loop /mnt/rec_target/bad_disk.img /mnt/rec_target

And copy the files from /mnt/rec_target.
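
For completeness, the repair step boils down to something like this; attaching the image to a loop device is one safe way to point fsck.jfs at an image file (device names are illustrative):

# expose the image file as a block device
losetup /dev/loop0 /mnt/rec_target/bad_disk.img

# force a full check and repair problems automatically
fsck.jfs -f /dev/loop0

After that, the image can be mounted read-only and the files copied off, as shown above.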

Whew!