A lesson in recovery techniques

I recently got this message from fsck.jfs:

Unrecoverable error writing M to /dev/sdb3. CANNOT CONTINUE.

Okay, so this is an error that can be ignored – right? I can just force mount the partition and extract the data with the superblock marked as dirty … right?!

krc@X61s % mount -o ro -f /dev/sdb3 /mnt/rec_mount
krc@X61s % ls /mnt/rec_mount
krc@X61s %

Damn it! This was a 1.4 TB partition with 900 GB of data, including home videos and .mkv rips of my DVDs. Most of the data could be restored, but a lot of work would be lost.

I run JFS on all my storage drives, as I have found it to be a good all-round file system, especially on smaller devices with limited resources. Unfortunately, it is a rather niche file system without a broad variety of recovery tools.
The only (non-commercial) tool I could find was jfsrec, but it was unable to read from the partition directly and stopped with an early EOF error.

Jfsrec pointed me in the direction of the dd_rhelp tool, which turned out to be a life saver. There was just one catch: I needed a disk big enough to hold a complete dump of the partition.
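To figure out how big that disk needed to be, a quick check of the source partition size and the free space on the target is enough. This is just a sketch using standard tools; the paths are the ones from my setup:

krc@X61s % sudo blockdev --getsize64 /dev/sdb3   # size of the source partition in bytes
krc@X61s % df -h /mnt/rec_target                 # free space on the disk that will hold the image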

A few days later, armed with a new disk, I was able to continue. I used this guide at debianadmin.com to get started. The command could not be simpler to use:

krc@X61s % dd_rhelp /dev/sdb3 /mnt/rec_target/bad_disk.img

And it started copying data! Yay!
After some time, it settled at a transfer rate of 2500 … KB/s! … Wow… This is rather slow…
Quick calculation: 1,400,000,000 KB / 2500 KB/s / 3600 / 24 ≈ 6.48 days.
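The same estimate as a shell one-liner, letting bc handle the decimals:

krc@X61s % echo "scale=2; 1400000000/2500/3600/24" | bc
6.48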

One week later:

krc@X61s % ssh atom1
ssh: connect to host atom1 port 22: No route to host

Hmm… I had done this periodically over the last week.

krc@X61s % ping atom1
PING atom1 (172.16.0.122) 56(84) bytes of data.
From atom1 (172.16.0.122) icmp_seq=1 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=2 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=3 Destination Host Unreachable
From atom1 (172.16.0.122) icmp_seq=4 Destination Host Unreachable
^C
--- atom1 ping statistics ---
6 packets transmitted, 0 received, +4 errors, 100% packet loss, time 5059ms

Hmm… That's odd. I didn't remember putting a ; halt -p after the dd_rhelp command.

A few pings and some reflection later, I actually got up and checked the room where the recovery setup was located.

This was what I found:

20110218-154606_redone.jpg

To quote Freddie Frinton:

I’ll kill that cat!

20110218-154628_redone.jpg
Notice the dangling SATA power cables at the top of the photo… I have always found Linux to be a stable operating system, but a system disk physically disappearing is a valid excuse for a crash!

Fortunately, dd_rhelp had already finished the disk dump – which was very lucky, because after the fall the damaged disk was officially dead. It no longer spins up and is not recognised by the BIOS.

I tried running fsck.jfs directly on the disk image, and it managed to fix the errors in the partition. Now I could mount the disk image like so:

krc@X61s % sudo mount -o loop /mnt/rec_target/bad_disk.img /mnt/rec_target

And copy the files from /mnt/rec_target.
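For reference, the repair step itself was just fsck.jfs pointed straight at the image file. I did not note the exact flags at the time, so treat this as a sketch (-f should force a full check):

krc@X61s % sudo fsck.jfs -f /mnt/rec_target/bad_disk.img   # full check and repair of the file system inside the image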

Whew!

FreeNAS ITX setup

As a result of a complete NAS breakdown, one of my customers decided to get a new server with a bit more power than the old one.

I saw this as quite an interesting challenge and got started.

Since the rack cabinet that had been put up was only ~68 cm deep, I had to find a rack chassis that would fit these constraints.
It turns out that Travla makes some very nice chassis with 8 front-access hot-swap drive bays for the RAID.

Components:

At first, I tried the Jetway NC9C-550-LF mainboard with the 4xSATA daughterboard. Unfortunately, the daughterboard was unsupported, which defeated the whole point of using this board (8xSATA in total). The LAN interface was not supported out of the box either.

The installation went smoothly, and a software RAID5 was created using the five disks. Creating it was a real pain and took forever.
Initial benchmarks looked fine, but after deployment a significant slowdown was noticed: ~250 Mbit/s of LAN usage when transferring large files, and as low as 50 Mbit/s when transferring small files. That is unacceptable on a gigabit LAN.

After swapping the switch and the NIC, I turned, as a last resort, to what could not possibly be the bottleneck – the server itself!

nas:~# dd if=/dev/zero of=/mnt/storage/zerofile.000 bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 271.362496 secs (38641154 bytes/sec)
nas:~# dd of=/dev/zero if=/mnt/storage/zerofile.000 bs=1m 
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 96.963503 secs (108141308 bytes/sec)

40/100 MB/s is not very impressive for sequential write/read – especially not on a RAID5!
Guess the bottleneck was the server itself after all.

After a bit of reading and research, I came across a story quite similar to mine – the exact same disks in a software RAID5. The problem was misaligned partitions, caused by the change in standard disk sector size from 512 bytes to 4K ("Advanced Format") since – well, I don't know when; I usually don't follow hardware evolution that closely.
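On FreeNAS (FreeBSD underneath) a quick way to spot this is to look at where the partitions start: offsets are reported in 512-byte sectors, so a 4K-aligned partition starts on a multiple of 8. The device name below is just an example:

nas:~# gpart show ada0     # partition start offsets, in 512-byte sectors
nas:~# diskinfo -v ada0    # sector size and geometry the disk reports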

Next thing, I persuaded the customer to back up the data so that I could re-create the RAID – only this time as a RAID-Z.
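With ZFS the pool creation itself is essentially a one-liner. In practice FreeNAS drives this through its web interface, and the pool name and device names below are placeholders, so take it as a sketch:

nas:~# zpool create storage raidz ada0 ada1 ada2 ada3 ada4   # single-parity RAID-Z across the five disks
nas:~# zpool status storage                                  # verify the pool is online and healthy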

nas:~# dd if=/dev/zero of=/storage/zerofile.000 bs=1m count=10000 && dd of=/dev/null if=/storage/zerofile.000 bs=1m && rm /storage/zerofile.000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 98.727775 secs (106208815 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 46.398998 secs (225991087 bytes/sec)

This is a nice improvement! The customer is also satisfied with the speed increase, but then again – who wouldn’t be?

Finally, a photo of the setup.

20110208-154924.jpg

This is a sight that I just had to document. It is a collection of external disks, and the document on top is the index. The index was created by mounting each disk and taking a screenshot of the Finder window. A very nice ad-hoc solution, if you ask me.

20110208-154905.jpg