Linux DevCenter    
 Published on Linux DevCenter (http://www.linuxdevcenter.com/)


System Failure and Recovery Practice

by Jeff Dike
11/29/2001

User-Mode Linux (UML) is a Linux virtual machine running on Linux that allows you to boot Linux on a "software" machine. These virtual machines can be easily created and destroyed, and allow you to do virtually anything that can be done with a physical system. Because of this, UML has turned out to have a wide variety of uses. In this article, I will talk about an application that has not received anywhere near the attention I think it deserves.

UML virtual machines are nearly identical to physical machines in their behavior, except that they are far more convenient to configure and boot. This makes them ideal for system administrator training and practice. In particular, they are very well-suited for creating admin disasters in order to practice recovering from them. I will be describing the creation of and recovery from three disasters, plus the creation (but not recovery) of a fourth.

To get started, you will need to download UML and install it. Go to http://user-mode-linux.sourceforge.net/dl-sf.html and grab and install either the UML RPM or deb, whichever is appropriate for your system. These will install UML itself, plus a number of utilities. You will also need a filesystem image to boot UML on. These are available from the same page. I will be using the Debian root filesystem in the examples below. If you are too short of bandwidth to download that one, get the tomsrtbt filesystem instead.

Where have all the files gone?

To help you get used to using UML, I'll start off with a special introductory disaster which I'll make no attempt to recover from. Even if you are an experienced UML user, you'll probably want to follow along because we're going to do something that you've always wanted to do anyway.

We're going to do a rm -rf / just to see what happens.

So, start UML as follows:

% linux ubd0=cow,root_fs

This tells UML to boot from the root_fs file with the file cow as a copy-on-write (COW) layer above it. The name cow is arbitrary, and UML creates the file automatically if it doesn't exist, so you can use any name you like as long as you are consistent about it. You'll see the utility of this a bit later. After you uncompress it, your root filesystem is likely named something like root_fs_debian2.2_small or root_fs_tomrtbt_1.7.205. You can either rename it to root_fs to follow the instructions below verbatim or replace root_fs everywhere with the actual name.

As it boots, take note of a line in the console output that looks like this:

mconsole initialized on /tmp/uml/d4oIw6/mconsole
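
If you save the console output, the identifier can be pulled out of that line instead of being copied by hand. This is a minimal sketch that assumes the line has exactly the form shown above; the variable names are illustrative only.

```shell
# Sample console line, copied from the boot output above.
line='mconsole initialized on /tmp/uml/d4oIw6/mconsole'

# Strip the fixed prefix to get the socket path, then take the name
# of the directory that contains the socket.
socket=${line#mconsole initialized on }
umid=$(basename "$(dirname "$socket")")

echo "$umid"
```

The directory name it prints is the argument that uml_mconsole is given later on.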

Now, when it comes up and gives you a login prompt, log in as root (password "root"), and do the following:

usermode:~# cd /
usermode:/# rm -rf /

Let it crank for a while until things break horribly. With the Debian filesystem from the UML site, I ultimately get this:

rm: cannot remove directory '//dev/pty': Directory not empty
rm: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following two directories have the same inode number:

//dev
//dev/pts

If you're the morbid type, you might poke around to see what, if anything, you can still do. You'll need the bash built-ins because your favorite utilities are likely to be gone.

When you've had enough of this trashed system, you'll need to shut it down cleanly. Since halt won't work, the best way is to use the uml_mconsole utility to halt it. On the host, run uml_mconsole, giving it the directory name that you took careful note of when it was booting, and tell it to halt UML:

% uml_mconsole d4oIw6
(d4oIw6) halt
OK

Now, you get to see why we used the COW file. The damage to the filesystem is contained entirely within the COW file. The underlying root_fs file is completely untouched. To see this, you can throw out the COW file:

% rm cow

and boot UML just as you did before.

% linux ubd0=cow,root_fs

You'll see that it boots fine, and that the filesystem is intact. We'll be using this technique to create disasters without irreversibly damaging the real filesystem.
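
Since every disaster starts with the same reset, it can be wrapped in a small shell function. The helper below is hypothetical and not part of UML: it deletes the COW file and prints the boot command rather than running it, and it is exercised here on empty stand-in files in a scratch directory.

```shell
# Hypothetical helper (not part of UML): throw away the COW layer and
# print the command that boots a pristine system again.
reset_cow() {
    cow=${1:-cow}          # COW file name, as used in this article
    backing=${2:-root_fs}  # backing filesystem image, left untouched
    rm -f -- "$cow"
    printf 'linux ubd0=%s,%s\n' "$cow" "$backing"
}

# Try it in a scratch directory with empty stand-in files.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
: > root_fs
: > cow
boot_cmd=$(reset_cow)
echo "$boot_cmd"
```

Only the COW file is removed; the backing file is never touched, which is the property the rest of the article relies on.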

The case of the missing password file

Now, we'll create a relatively simple disaster and recover from it.

% rm cow
% linux ubd0=cow,root_fs

Now, remove the password file and try to halt the machine:

usermode:~# rm /etc/passwd
usermode:~# halt
You don't exist. Go away.

OK, halt doesn't work any more, so we'll shut it down from the mconsole:

% uml_mconsole zJwanV
(zJwanV) sysrq u
OK
(zJwanV) halt
OK

The sysrq u flushes the filesystems to disk and remounts them read-only. This will save us an fsck on the next boot. Boot it again, this time specifying only the cow file on the command line:

% linux ubd0=cow

Now, we see how well Linux works without a password file:

Debian GNU/Linux 2.2 usermode ttys/0

usermode login: root
Password: 
Login incorrect

It boots fine, but it's (surprise!) impossible to log in. So, let's shut this down from the mconsole again and fix it:

% uml_mconsole b9cpus
(b9cpus) sysrq u
OK
(b9cpus) halt
OK

We'll boot up only to single-user, and recreate enough of the password file so that root can log in:

% linux ubd0=cow single

Distributions differ on their interpretation of single. If you don't get a shell with single, then try emergency instead. On my Debian filesystem, both give me a shell.

/etc/passwd: No such file or directory
Give root password for maintenance
(or type Control-D for normal startup):

Anything here, including hitting Return, seems to work.

sh-2.03# cat > /etc/passwd
sh: /etc/passwd: Read-only file system

Here's the first problem. We need to remount the root filesystem read-write before doing anything else:

sh-2.03# mount / -o remount

OK, back to our regularly scheduled disaster. I use cat here, but if you prefer vi, go ahead and use that.

sh-2.03# cat > /etc/passwd
root::0:0:root:/root:/bin/bash
^D
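
For reference, a passwd entry is seven colon-separated fields. The sketch below only parses the line typed above and labels each field; it runs anywhere and does not touch the real /etc/passwd.

```shell
# The single line written to /etc/passwd above.
entry='root::0:0:root:/root:/bin/bash'

# Split on ':' and label each of the seven fields.
echo "$entry" | awk -F: '{
    printf "fields:   %d\n", NF
    printf "login:    %s\n", $1
    printf "password: %s\n", ($2 == "" ? "(empty)" : $2)
    printf "uid:gid:  %s:%s\n", $3, $4
    printf "comment:  %s\n", $5
    printf "home:     %s\n", $6
    printf "shell:    %s\n", $7
}'
```

The empty second field means that this account has no password at all, which is acceptable on a throwaway virtual machine and nowhere else.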

So far, so good. Let's do a sanity check to make sure the utilities think the password file is good:

sh-2.03# whoami
root

That's fine, so let's continue the boot by exiting the single-user shell:

sh-2.03# exit

And now let's see if root can log in:


Debian GNU/Linux 2.2 usermode ttys/0

usermode login: root
Last login: Tue Nov 13 18:28:32 2001 on ttys/0
Linux usermode 2.4.13-1um #2 Fri Oct 26 15:42:47 EDT 2001 i686 unknown
usermode:~#

Yes, root can log in again. If this had happened on a physical machine, your next job would be to chase down the most recent backup tape and restore /etc/passwd from it.

No shell

This time, we're going to get rid of bash, which can't be fixed by booting into single-user mode.

While writing this article, I discovered a bug in the UML block driver which causes COW files not to work properly when they aren't mounted as the root filesystem. So, we are going to dispense with them for the time being.

Copy root_fs to no_bash, boot it up, log in, and get rid of bash:

% cp root_fs no_bash
% linux ubd0=no_bash

usermode:~# rm /bin/bash
usermode:~# halt

If the halt hangs, halt UML with the mconsole.

Let's boot it up again and see how it does without a shell:

% linux ubd0=no_bash

It boots very quickly and it's impossible to log in:

INIT: cannot execute "/etc/init.d/rcS"
INIT: Entering runlevel: 2
INIT: cannot execute "/etc/init.d/rc"

Debian GNU/Linux 2.2 (none) ttys/0

(none) login: root
Unable to determine your tty name.

So, we need to shut it down with the mconsole and figure out how to fix it.

We're going to simulate booting from a rescue disk by using root_fs as the rescue disk, assigning it to be disk 0, and moving the damaged filesystem to disk 1:

% linux ubd0=root_fs ubd1=no_bash

So, log in, mount the damaged filesystem on /mnt and make sure that bash is missing:

usermode:~# mount /dev/ubd/1 /mnt
usermode:~# ls /mnt/bin/bash
ls: /mnt/bin/bash: No such file or directory

OK, this is now easy to fix. We can just copy the shell from the rescue disk:


usermode:~# cp -p /bin/bash /mnt/bin/bash
usermode:~# ls -l /bin/bash /mnt/bin/bash
-rwxr-xr-x  1 root   root    461400 Feb 20  2000 /bin/bash
-rwxr-xr-x  1 root   root    461400 Feb 20  2000 /mnt/bin/bash
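
Matching sizes and dates in ls -l are a good sign, but cmp compares the actual contents. The following is a sketch on stand-in files in a scratch directory; on the rescue system the two paths would be /bin/bash and /mnt/bin/bash.

```shell
# Stand-in files in a scratch directory.
scratch=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$scratch/original"
chmod 755 "$scratch/original"

# -p preserves the mode and timestamps, as in the copy above.
cp -p "$scratch/original" "$scratch/copy"

# cmp -s prints nothing and exits 0 only when the contents are identical;
# the -x test confirms that the execute permission came along too.
if cmp -s "$scratch/original" "$scratch/copy" && [ -x "$scratch/copy" ]; then
    result=identical
else
    result=different
fi
echo "$result"
```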

Now, you can halt UML and boot it on no_bash to confirm that it again boots OK.

Backups, backups, backups

For our finale, we are going to make a backup of the filesystem and destroy enough of it that fixing it requires restoring the backup. The backup device will be an empty file that's large enough to hold our filesystem:

% dd if=/dev/zero of=backup seek=600 bs=$((1024*1024)) count=1

My filesystem is just over 500MB, so I created a backup file of roughly 600MB to allow for any overhead of the backup format. The seek value is counted in units of bs, which is 1MB here, so replace the 600 in seek=600 with whatever size in megabytes is appropriate for you. Now copy root_fs to trashed and boot it up with backup as disk 1.

% cp root_fs trashed
% linux ubd0=trashed ubd1=backup
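
The dd command used to create the backup file seeks past most of the file and writes only its last block, so the file comes out sparse where the host filesystem supports that. Here is a scaled-down sketch of the same form, run in a scratch directory with a seek of 6 instead of 600:

```shell
scratch=$(mktemp -d)
mb=$((1024*1024))

# Same form as the command above, scaled down: skip 6 blocks of 1MB,
# then write a single block of zeros.
dd if=/dev/zero of="$scratch/backup" seek=6 bs=$mb count=1 2>/dev/null

# Apparent size in bytes: the 6 skipped blocks plus the 1 written.
size=$(( $(wc -c < "$scratch/backup") ))
echo "apparent size: $((size / mb))MB"

# Space actually allocated, in KB; on filesystems that support sparse
# files this is much smaller than the apparent size.
du -k "$scratch/backup"
```

The apparent size is always one block more than the seek value, which is why seek=600 gives a file slightly over 600MB.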

Log in, and make the backup on /dev/ubd/1. I'm using tar here. If you favor a different backup tool, feel free to use it. Notice that we're not creating a filesystem on this device. It's being used as a raw data device in exactly the same way as a tape.

If it fails with an I/O error, the backup file you created was too small. You can extend it by simply running dd on the file with a larger seek argument and retrying the backup.

usermode:~# tar clf /dev/ubd/1 /
tar: Removing leading '/' from member names
tar: Removing leading '/' from link names
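
If the backup does run out of room, growing the file with a larger seek can be sketched as follows. The sizes are scaled down and the file lives in a scratch directory; the marker text is only there to show that dd leaves the blocks it seeks over alone.

```shell
scratch=$(mktemp -d)
backup=$scratch/backup
mb=$((1024*1024))

# Start with a 2MB file (seek=1 plus one written block) ...
dd if=/dev/zero of="$backup" seek=1 bs=$mb count=1 2>/dev/null
# ... and put some recognizable data at the front without truncating.
printf 'partial backup' | dd of="$backup" conv=notrunc 2>/dev/null

# Grow it to 4MB by rerunning dd with a larger seek. dd preserves the
# blocks it seeks over, so the data written so far survives.
dd if=/dev/zero of="$backup" seek=3 bs=$mb count=1 2>/dev/null

size=$(( $(wc -c < "$backup") ))
head=$(dd if="$backup" bs=14 count=1 2>/dev/null)
echo "$((size / mb))MB, starts with: $head"
```

A retried backup overwrites the file from the start anyway, so the surviving data matters less than the new size.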

When it's done, we will make "trashed" live up to its name:

usermode:~# rm -rf /bin /lib /usr/lib

Remove anything you like. Feel free to corrupt things, too. When you're done having fun, shut it down, using the mconsole, if necessary.

Now, it's time to fix it back up. Boot UML with root_fs as the rescue disk, backup as disk 1 again, and trashed as disk 2:

% linux ubd0=root_fs ubd1=backup ubd2=trashed

Now, log in, mount the damaged filesystem on /mnt, cd to it, and restore the backup:

usermode:~# mount /dev/ubd/2 /mnt
usermode:~# cd /mnt
usermode:/mnt# tar xpf /dev/ubd/1  
tar: : Cannot mkdir: No such file or directory
tar: Error exit delayed from previous errors

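It succeeded, despite the error. The complaint presumably comes from the archive's entry for / itself, whose name is empty once tar has stripped the leading '/':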

usermode:/mnt# ls bin
arch   dd        fgrep     ls       pidof     run-parts  touch
...
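
The whole round trip can be rehearsed on the host without UML. In this sketch a fixed-size, zero-filled file stands in for the backup device, and the directory names under the scratch directory are made up for the example. dd with conv=notrunc is used to write the archive because pointing tar directly at a regular file would truncate it, which a device would not allow.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/root/etc" "$scratch/restore"
echo 'root::0:0:root:/root:/bin/bash' > "$scratch/root/etc/passwd"

# A fixed-size, zero-filled file stands in for the backup device.
dd if=/dev/zero of="$scratch/backup" bs=1024 count=64 2>/dev/null

# Write the archive over the start of the file without truncating it.
# Members are stored with relative names, like the names tar produced
# above by stripping the leading '/'.
(cd "$scratch/root" && tar cf - etc) |
    dd of="$scratch/backup" conv=notrunc 2>/dev/null

# Restore relative to the current directory, as with "cd /mnt" above.
# tar stops at the end-of-archive marker and ignores the padding.
(cd "$scratch/restore" && tar xpf "$scratch/backup")

restored=$(cat "$scratch/restore/etc/passwd")
echo "$restored"
```

Because the member names are relative, the archive unpacks into whatever directory tar is run from, which is why the restore above begins with a cd to /mnt.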

Now, halt UML and boot it on "trashed" again to confirm that it's fixed:

% linux ubd0=trashed

In conclusion

Hopefully this article has convinced you that UML can be a valuable system administration tool. I've demonstrated the creation of, and recovery from, several different types of sysadmin catastrophes.

Obviously, this is only a tiny sample of the possible disasters that can happen. You can ensure that you are prepared for them by making them happen and figuring out how to fix them. It is possible to make them happen on a physical machine, but it should be apparent that simulating them with UML is far more convenient, and almost completely authentic. The devices may have different names, but the procedures are exactly the same as on a physical machine.

With the publication of this article, I am inaugurating the Sysadmin Disaster of the Month on the UML web site at http://user-mode-linux.sourceforge.net/sdotm.html. I will present a disaster and take submissions of solutions. I will arbitrarily choose a winner each month based on criteria such as originality, subtlety, brevity, and parsimony. I will also take submissions of proposed disasters. If you have a disaster that you'd like featured, submit it, along with a proposed solution, if you have one.

Jeff Dike


Copyright © 2009 O'Reilly Media, Inc.