Recovering a hard drive

The hard drive first started showing symptoms two weeks ago, while I was on a business trip in Hungary! There were more and more unreadable sectors, and the SMART daemon complained about them daily. When I finally got back to Finland the server was still standing, but barely: many libraries had already been corrupted, and starting new programs was hit and miss. But the services that were already running, such as Apache, Zope, Exim, and sshd, were still working.

The original disk had been in continuous use for over 5 years, so I guess it was about time for it to fail. It was an IBM Deskstar, and a good quality product it was. I quickly replaced it with two hard disks that had some mirrored space and a preinstalled Debian GNU/Linux on them. The backups were for the most part OK (done with flexbackup), but I hadn’t included some important files in them. So I needed to try to recover the data from the failing disk.

I got a copy of Recovery Is Possible, a small Linux CD image that boots up in just about any decent machine with enough RAM and a bootable CD drive, and which contains a good selection of recovery tools. So I set up an old machine with the broken disk and a brand-new disk, and booted up RIP. After adding more memory and replacing the very old CD drive with a more modern one I got the system to boot properly.

The basic tool that did all the work was ddrescue, which is bundled in the RIP distribution. ddrescue needs somewhere to write its log files, so I created a small partition on the new disk for this purpose and mounted it. Then I created empty partitions that matched the sizes of the partitions I needed to recover. After that it’s just a matter of running

ddrescue /dev/hda1 /dev/hdb1 /work/rescue_hda1.log

to start the rescue of /dev/hda1 into /dev/hdb1, writing the log file into /work/rescue_hda1.log. ddrescue will try different ways to read problematic areas and can usually read lots of sectors that in normal use would just be classified as unreadable. However, a single run usually doesn’t get everything out, and several runs are required. ddrescue stores in its log file which parts of the drive had problems, and subsequent runs just retry the parts that have not been recovered yet. I noticed that the hard drive doesn’t work as reliably once it heats up, so letting the machine cool down between attempts helped. Also, adding a retry parameter (“-r 10” in GNU ddrescue) will retry the bad areas up to ten times, so you can leave ddrescue to do its work while you go do something else.
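The run, cool down, run again routine is easy to script. What follows is only a rough sketch: the function name rescue_with_cooldown, the number of passes and the length of the pause are my own illustrative choices and not something ddrescue provides, and the DDRESCUE variable is there just so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Sketch: run ddrescue several times, pausing between passes so the
# failing drive can cool down. Names and defaults are illustrative.

DDRESCUE="${DDRESCUE:-ddrescue}"   # override for a dry run, e.g. DDRESCUE=echo

# usage: rescue_with_cooldown SOURCE TARGET LOGFILE [PASSES] [PAUSE_SECONDS]
rescue_with_cooldown() {
    src=$1; dst=$2; log=$3
    passes=${4:-5}
    pause=${5:-1800}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i of $passes"
        # Same log file every time, so each pass only retries what is
        # still missing. A failed pass does not stop the loop.
        "$DDRESCUE" "$src" "$dst" "$log" || echo "pass $i exited with an error"
        # No need to wait after the final pass.
        if [ "$i" -lt "$passes" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}

# Example (commented out so nothing touches a disk by accident):
# rescue_with_cooldown /dev/hda1 /dev/hdb1 /work/rescue_hda1.log 5 1800
```

Because every pass uses the same log file, it doesn’t matter how many passes you ask for: a pass that finds nothing left to recover simply finishes quickly.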

Another good trick was to throw the broken hard drive into the freezer. Just wrap it up in a static-protective pouch, let it cool off at room temperature or in the fridge, then put it in the freezer and leave it there overnight. Then take it out, let it slowly return to room temperature, take it out of the pouch, hook it up to the machine and rerun ddrescue a few times. This helped me with one of the partitions, where the errors were right smack in the ext2 inode tables (roughly equivalent to the file allocation tables of Windows disks), meaning that locating files was a bit of a problem. I did work a bit with lde and recovered some critical files manually, but in the end, after two visits to the freezer, ddrescue was able to recover the inode tables as well and I got the data out much more easily.

The recovered image on the new drive is of course partially broken, since usually not all of the data can be recovered. You can either mount the partition read-only and just copy the data you need, or you can see what fsck can do to repair the partition. In any case, you then have most of the data on a working disk, from which you can copy it to wherever you need it. I had two partitions that weren’t fully backed up, and of the 11 GB and 6 GB partitions only 15 kB and 22 kB were left unreadable (after just the first ddrescue run, more than 200 kB had been unreadable).
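For looking at the rescued partition without changing anything on it, something along these lines should do. It is a sketch only: the function name inspect_rescued, the mount point /mnt/rescued and the RUN dry-run switch are made up for illustration. By default the commands are just printed, so nothing happens until you really mean it.

```shell
#!/bin/sh
# Sketch: look at a rescued partition without modifying it.
# RUN defaults to "echo", so the commands are only printed.
# Set RUN= (empty) to really run them, as root.

RUN="${RUN-echo}"

# usage: inspect_rescued DEVICE MOUNTPOINT
inspect_rescued() {
    dev=$1; mnt=$2
    # Check only: -n answers "no" to every repair question.
    $RUN fsck -n "$dev"
    # Mount read-only so nothing on the rescued copy gets changed.
    $RUN mkdir -p "$mnt"
    $RUN mount -o ro "$dev" "$mnt"
}

# Example: prints the three commands for the rescued copy of hda1.
# inspect_rescued /dev/hdb1 /mnt/rescued
```

The check comes before the mount on purpose, so that fsck never looks at a mounted filesystem.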

Server disk crash

This server’s old hard drive failed last week. “Did you have backups?” Yes, I did. “Had you tested your backups?” No, I hadn’t. It seems my backup routine did not have enough access to all the files that needed to be backed up, so restoring the server has been a bit of a task. But just about everything is now more or less coming back up. And at the same time I decided to get rid of the old static xstl pages and replace everything with just this blog. At least for now. I’ll write more on the restoration process a bit later…

So why did the backup not work completely? All of the servers I administer have a centralized backup location on one of the servers. There’s a RAID 1 stack that receives all backups. The backup software is “flexbackup”, which quite nicely does backups of remote machines over the net. The problem was that I had decided to use the “backup” user account to do the backups, and of course this user did not have enough access to some of the more secure files. The lesson: add the backup user to the groups that have access to the stuff that needs to be backed up, e.g. users, staff, www-data, zope, mysql. You get the idea. Also, after setting up the backups, take a look at /var/log/flexbackup (or wherever you’re saving the flexbackup logs), look for “access denied” messages, and see if those files and folders should be included in the backup. If they should, then you need to grant more access to the backup user account.
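Checking the logs is quick to script. This is a minimal sketch, assuming the logs are plain text files under /var/log/flexbackup; the function name find_denied is made up, and the exact wording of the error depends on your setup, so it looks for both “access denied” and the more usual “Permission denied”.

```shell
#!/bin/sh
# Sketch: list the files the backup run could not read.
# The log location and the message wording are assumptions; adjust both.

# usage: find_denied [LOGDIR]
find_denied() {
    logdir=${1:-/var/log/flexbackup}
    # -r: search every log under the directory, -i: ignore case,
    # -h: leave out the log file names so duplicates collapse in sort -u.
    grep -rih -e 'access denied' -e 'permission denied' "$logdir" | sort -u
}

# To give the backup user more access, add it to a group (as root):
#   usermod -a -G www-data backup
```

If find_denied prints nothing, either the backup user could read everything or the logs are somewhere else, so it’s worth checking that the directory really contains recent logs before trusting an empty result.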