Troubleshooting


With any luck, you're just reading this chapter for fun, and not because your server has just erupted in a tower of flame. (Of course, sysadmins being almost comically lazy, it's most likely the latter, but the former is at least vaguely possible, right?)

If the machine is in fact already broken -- don't panic. Xen is complex, but the issues discussed here are fixable problems with known solutions. There's a vast arsenal of tools, a great deal of information to work from, and a lot of expertise available.

In this section, we're going to outline a number of troubleshooting steps and techniques, with particular reference to Xen's peculiarities. We've tried to include explanations for some of the vague error messages that you might come across, and made some suggestions on where to get help, if all else fails.

Let's start with a general overview of our approach to troubleshooting, which will help to put the specific discussion of Xen-related problems in context.

The most important thing when troubleshooting is to get a clear idea of the machine's state -- what it's doing, what problems it's having, what telegraphic errors it's spitting out, and where the errors are coming from. This is doubly important in Xen's case, because its modular, standards-based design brings together diverse and unrelated tools, each with their own methods of logging and error handling.

Our usual troubleshooting technique is to:

  • Reproduce the problem.
  • If the problem generates an error message, use that as a starting point.
  • If the error message doesn't provide enough information to solve the problem, consult the logs.
  • If the logs don't help, use set -x to make sure that the scripts are firing correctly, and closely examine the control flow of the non-Xen-specific parts of the system.
  • Use strace or pdb to track the flow of execution in the more Xen-specific bits and see what's failing.
  • Ask for help.



Troubleshooting phase 1: Error messages

The first sign that something's amiss is likely to be an error message and an abrupt exit. These usually occur in response to some action -- booting the machine, perhaps, or creating a domU.

Xen's error messages can be, frankly, infuriating. They're somewhat vague and developer-oriented, and usually come from somewhere deep in the bowels of the code, where it's difficult to determine what particular class of user error is responsible, or even if it's user error at all.

Better admins than us have been driven mad, have thrown their machines out the window and vowed to spend the rest of their lives wearing animal skins, killing dinner with fire-hardened spears. And who can say that they are wrong?


Errors at boot

The first place to look for information on systemwide problems (if only because there's nothing else to do while the machine boots) is the boot output, both from the hypervisor and the dom0 kernel.

[Textbox: Reading Boot Error Messages]

When a machine's broken badly enough that it can't boot, it often reboots itself immediately. This can lead to difficulty when trying to diagnose the problem. We suggest using a serial console with some sort of scrollback buffer to preserve the messages on another computer. This also makes it easy to log output, for example by using GNU screen.

If you refuse to use serial consoles, or if you wish to otherwise 'do something' before the box reboots, you can append "noreboot" to both the Xen and Linux kernel lines in GRUB. (If you miss either, it'll reboot. It's finicky that way.)
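
For example, a GRUB entry with noreboot in both places might look something like this (paths and kernel names are illustrative; adjust for your install):

title Xen (noreboot)
        root (hd0,0)
        kernel /boot/xen.gz noreboot
        module /boot/vmlinuz-2.6-xen root=/dev/sda1 ro console=tty0 noreboot
        module /boot/initrd-2.6-xen.img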


[end box]

Many of the Xen-specific problems we've encountered at boot have to do with kernel/hypervisor mismatches. The hypervisor must match the dom0 kernel in terms of PAE support, and if the hypervisor is 64-bit, the dom0 must be 64-bit or i386-PAE. Of course, if the hypervisor is 32-bit, so must be the dom0.

You can run an i386-PAE dom0 with an x86_64 hypervisor and x86_64 domUs, but only on recent Xen kernels. (In fact, this is what some versions of the Citrix Xen product do.) In no case can you mismatch the PAE-ness. Modern versions of Xen don't even include the compile-time option to run in i386 non-PAE mode, causing all sorts of problems if you want to run NetBSD-STABLE.


Of course, many of the problems at boot that we've had aren't especially Xen-specific -- often people have trouble when running the Xen.org kernel because it puts the drivers for the root device into an initrd, rather than into the kernel.

If your distro, like CentOS, expects an initrd, you probably want to use your distro's initrd creation script. In CentOS, you can install a Xen.org kernel, then run mkinitrd as usual with the new kernel's uname, and it just works -- assuming that you've got a correctly written /etc/modprobe.conf.
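
For example (the kernel version is illustrative; use the uname of the newly installed Xen kernel):

# mkinitrd /boot/initrd-2.6.18.8-xen.img 2.6.18.8-xen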

Error Messages

Assuming you've booted successfully, there are a variety of informative error messages that Xen can give you. Usually these are in response to an attempt to do something, like start xend or create a domain.

LVM errors

Can't remove open logical volume

[root@taney cancelled]# lvremove /dev/guests/luke
  Can't remove open logical volume "luke"

This usually means that something still holds a reference to the volume -- often leftover device-mapper partition mappings, such as those created by kpartx or pygrub for the volume's partitions. Remove the mappings with dmsetup, and the lvremove will succeed:

[root@taney cancelled]# dmsetup remove /dev/mapper/guests-lukep1
[root@taney cancelled]# dmsetup remove /dev/mapper/guests-luke
[root@taney cancelled]# lvremove /dev/guests/luke
  Logical volume "luke" successfully removed
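
If the stray mappings were created by kpartx, you can also clear them all in one step (a sketch; assumes kpartx is installed):

[root@taney cancelled]# kpartx -d /dev/guests/luke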

Troubleshooting Disks

Most disk-related errors will cause domU creation to fail immediately. This makes them fairly easy to troubleshoot. Here are some examples:

Error: DestroyDevice() takes exactly 3 arguments (2 given)

These pop up frequently and usually mean that something's wrong in the device specification. Check the config file for typos in the vif= and disk= lines. If the message refers to a block device, the problem is often that you're referring to a non-existent device or file.

There are a few other errors that have similar causes. For example:

Error: Unable to find number for device (cdrom)

This, too, is usually caused by a phy: device with an incorrectly specified backing device.

However, this isn't the only possible cause. If you're using file-backed block devices, rather than LVM volumes, the kernel may have run out of loop devices on which to mount them. (In this case, the message is particularly frustrating because it seems entirely independent of the domain's config.) You can confirm this by looking for an error in the logs like:

Error: Device 769 (vbd) could not be connected. Backend device not found.

Although this message usually means that you've mistyped the name of the domain's backing storage device, it may instead mean that you've run out of loop devices. The default loop driver only creates 7 of the things -- barely enough for three domains with root and swap devices.

We might suggest that you move to LVM, but that's probably overkill. The more direct answer is to make more loops. If your loop driver is a module, edit /etc/modprobe.conf and add:

options loop max_loop=64

(or another number of your choice -- each domU file-backed VBD will require one loop device in dom0.) [note: Or whatever domain is used as the backend -- /usually/ dom0, although Xen's new stub domains promise to make non-dom0 driver domains much more prevalent.] Then reload the module. Shut down all domains that use loop devices (and detach loops from the dom0) and then run:

rmmod loop
modprobe loop

(We use modprobe rather than insmod to reload, since modprobe picks up the options from modprobe.conf.)
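
To confirm that the new limit took effect, count the loop device nodes (a quick sanity check; the exact output depends on your setup):

# ls /dev/loop* | wc -l
64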

If the loop driver is built into the kernel, you can add the max_loop option to the dom0 kernel command line. For example, in /boot/grub/menu.lst :

module linux-2.6-xen0 max_loop=64

Reboot and the problem should go away.

VM restarting too fast

Disk problems, if they don't announce themselves through a specific error message, often manifest in log entries like the following:

[2007-08-23 16:06:51 xend.XendDomainInfo 2889] ERROR
(XendDomainInfo:1675) VM sebastian restarting too fast (4.260192 
seconds since the last restart).  Refusing to restart to avoid loops.

This one is really just Xen's way of asking for help -- the domain's stuck in a reboot cycle. Start the domain with the -c option (for console autoconnect) and look at what's causing it to die on startup. In this case, the domain booted and immediately panicked for lack of root device.

Note, in this case, the VM is restarting every 4.2 seconds -- long enough to get console output. If the 'restarting too fast' number is less than a second or two, xm create -c often shows no output at all.

DomU Pre-boot Errors

If you're using Pygrub [note: or another bootloader, such as PyPXEboot], you may see the message "VmError: Boot loader didn't return any data!" This means that Pygrub, for some reason, wasn't able to find a kernel. Usually this is either because the disks aren't specified properly, or because there isn't a valid GRUB configuration in the domU. Check the disk configuration and make sure that /boot/grub/menu.lst exists in the filesystem on the first domU VBD. [note: There's some leeway -- pygrub will check a bunch of filenames, including but not limited to /boot/grub/menu.lst, /boot/grub/grub.conf, /grub/menu.lst, and /grub/grub.conf. Remember that Pygrub is a good emulation of GRUB, but it's not exact.]

You can troubleshoot Pygrub problems by running Pygrub by hand:

# /usr/bin/pygrub /path/to/disk/image

This should give you a Pygrub boot menu. When you choose a kernel from the menu, Pygrub exits with a message like:

Linux (kernel /var/lib/xen/boot_kernel.hH9kEk)(args "bootdev=xbd1")

This means that Pygrub successfully loaded a kernel and placed it in the dom0 filesystem. Check the listed location to make sure that it's actually there. You can also use strace on pygrub when you call it by hand.

Pygrub is quite picky about the terminal it's connected to. If Pygrub exits, complaining about curses, or if Pygrub on the same domain works for some people and not for others, you might have a problem with the terminal.

With the version of Pygrub that comes with CentOS 5.1, you can repeatably get a failure by executing xm create -c from a terminal window less than 19 lines long. If you suspect this may be the problem, resize your console to 80x24 and try again.

Pygrub will also expect to find your terminal type (the value of the TERM variable) in the terminfo database. Manually setting TERM=vt100 before creating the domain is usually sufficient.
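
For example (the domain name is illustrative):

# TERM=vt100 xm create -c sebastian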

Configuring Devices in the DomU

Most likely, if the domU fails to start because of missing devices, the problem's tied to storage. (Broken network setups don't usually cause the boot to fail outright, although they can render your VM less-than-useful after booting.)

Sometimes the domU will load its kernel and get through the first part of its boot sequence, but then complain about not being able to access its root device, despite a correctly-specified root kernel parameter. Most likely, the problem is that the domU doesn't have the root device node in the /dev/ directory in the initrd.

This can lead to trouble when attempting to use the semantically more correct xvd* devices. Since many distros don't include the appropriate device nodes, they'll fail to boot. The solution, then, is to use the hd* or sd* devices in the disk= line, thus:

disk = ['phy:/dev/tempest/sebastian,sda1,r']
root = "/dev/sda1"

Once the domain's started, you can create the xvd devices properly, or edit your udev configuration. Note that the Xen block driver may have trouble attaching to virtual drives that use the sdX naming convention if the domU kernel includes a SCSI driver. In that case, use the xvdX convention, like this:

disk = ['phy:/dev/tempest/sebastian,xvda1,r']

Troubleshooting Network

In our experience, Xen's networking is one of the most reliable parts of the package. Unless you've modified the networking scripts, Xen will fairly reliably create the vif devices. However, if you have problems, here are some general guidelines.

To troubleshoot networking, you really need to understand how Xen does networking. There are a number of scripts and systems working together, and it's important to decompose problems and isolate them to the appropriate component. We'll focus on network-bridge here, although similar steps apply to network-route and network-nat.

The first thing to do is probably to run the network script with the "status" argument. For example, if you're using network-bridge, "/etc/xen/scripts/network-bridge status" will provide a helpful dump of the state of your network as seen in dom0. At this point you can use brctl to examine the network in more detail, and use the xm vnet-create and vnet-delete commands in conjunction with the rest of the userspace tools to get everything working again.

Once you've got the backend sorted, you can address the frontend. Check the logs, check dmesg from within the domU.

Let's look at these steps in a little more detail. First, make sure that the relevant devices show up in the domU. Xen creates these pretty reliably. If they aren't there, check the domU config and the logs for relevant-looking error messages.

At the lowest level (since we know that the dom0's networking works, right?) we want to check that the link is functioning. Our basic tool for that is arping from within the DomU, combined with tcpdump -i [interface] on the domU's interface in the Dom0.

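Concretely, the sequence looks something like this (a sketch -- the domain name, vif number, and gateway address are all illustrative; get the domain ID from xm list and substitute your own addresses). In the dom0:

# xm list sebastian
Name                     ID Mem(MiB) VCPUs State  Time(s)
sebastian                13      499     1 -b----    19.9
# tcpdump -n -i vif13.0

Then, from within the domU:

# arping -I eth0 192.168.1.1
# ping 192.168.1.1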


This will tell you whether ARP packets are getting from the domU to the dom0. If they aren't, you probably have a problem on the domU side: either it has some really screwy routing, or there's a problem within the domU network stack.

Now, most of the time, you will see appropriate output in tcpdump as shown. This tells you that Xen is moving packets from the domU to the dom0. Do you see a response to the ARP who-has? (It should be an ARP is-at.) If not, it's possible that your bridge in the dom0 isn't set up correctly. First, run brctl show:

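Typical output looks something like this (bridge and interface names will vary with your Xen version and configuration):

# brctl show
bridge name     bridge id               STP enabled     interfaces
xenbr0          8000.feffffffffff       no              peth0
                                                        vif0.0
                                                        vif13.0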

Note: In Xen.org releases before Xen 3.2, the bridge created by network-bridge is named xenbr0 by default. Xen 3.2 and later name the bridge eth0 (0, in this case, being the number of the related network interface). CentOS/RHEL, by default, creates another bridge, virbr0, which is part of the libvirt stuff. In practical terms, virbr0 functions like network-nat, with a DHCP server handing out private addresses on the dom0.

Now, a bridge is like a switch. Make sure the bridge (switch) your domU interface is connected to is also connected to an interface that touches the network you want the domU on, usually a pethX device. (As explained in Chapter 5, Networking, when network-bridge starts up, it renames ethX to pethX and creates a fake ethX device from vif0.X.) Can anything else on the bridge see traffic from the outside world? Run tcpdump -n -i peth0 -- are the packets flowing properly?

Check your routes. Don't forget higher-level stuff, like DNS servers.

DomU interface number increments with every reboot

When Xen creates a domain, it looks at the vif=[] statement.

Each string within the [ ] characters (it's a Python array) is another network device. If I just say vif=['',''], it creates two network devices for me, with random MAC addresses. In the domU, they are (ideally) named eth0 and eth1. In the dom0, they are named vifX.0 and vifX.1, where X is the domain number.

Now, most modern Linux distros, by default, lock ethX to a particular MAC address on the first boot. In RHEL/CentOS, the setting is

HWADDR= in /etc/sysconfig/network-scripts/ifcfg-ethX. 

Most other distros use udev to handle persistent MAC addresses, as described in Chapter 5, Networking. We circumvent the problem by specifying the MAC address on the vif= line in the xm config file:

vif=['mac=aa:aa:aa:aa:aa:ab','mac=aa:aa:aa:aa:aa:ac'] 

Setting MAC addresses as shown would do the trick, though you may want to use the Xen MAC prefix 00:16:3E.
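
For example, with the Xen OUI (the final three octets are made up; pick your own):

vif=['mac=00:16:3e:00:00:01','mac=00:16:3e:00:00:02']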

If you don't specify the MAC address, it's randomly generated every time the domU boots, which causes huge problems if your domU OS has locked ethX to a particular MAC. (On some distros, this means that on first boot you will have eth0, on second boot eth1, on the third eth2, and so on.)

TCP Offloading

If your NIC uses TCP offloading, it can sometimes cause problems with TCP connections to domUs. In our experience, this manifests as TCP connections that let the domU receive data but not send it. Ping works in both directions, and the three-way TCP handshake works -- it's just that, once TCP connections are established, they are unidirectional.

Running ethtool -K peth0 tx off rx off in the dom0 (substituting the peth device in question) fixes the problem.

We suspect this has something to do with the TCP offloading happening in the dom0, which incoming packets going through the peth can take advantage of, but which outgoing packets (from the domU) can't.

iptables

The iptables rules can also be a source of trouble with Xen. As with any iptables setup, it's easy to mess up in subtle ways and break everything.

The best way we've found to make sure that iptables rules are working is to send packets through and watch what happens to them. Run iptables -L -v to see counters -- how many packets have hit each rule, or been affected by the chain policy.
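
For example, you'll see output along these lines (the chain contents and counters here are purely illustrative):

# iptables -L FORWARD -v -n
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in      out     source      destination
  342 28514 ACCEPT all  --  *       vif2.0  0.0.0.0/0   0.0.0.0/0
    0     0 ACCEPT all  --  vif2.0  *       0.0.0.0/0   0.0.0.0/0

If a rule's counters stay at zero while the domU is sending, its packets are matching some earlier rule (or the chain policy) instead -- a good hint about where they're going missing.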

Note: vifs examined from the dom0 end will display inverted interface counters -- outgoing traffic will report as incoming, and vice-versa. See Chapter 5, Networking, for more information on why that happens.

You may also have trouble getting antispoof to work. If you enable antispoof but find you can still spoof arbitrary IP addresses in the DomU, add the following to your network startup:

echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

This will cause packets sent through the bridges to traverse the forward chain, where Xen puts the antispoof rules. We added the command to the end of /etc/xen/scripts/network-bridge.

Another problem can occur if you're using vifnames, as we suggest doing in Chapter 5, Networking. We suggest making sure that the names are short -- 8 characters or less. Above that they can get truncated, and different parts of the system truncate at different lengths (at least in CentOS 5.0.) In our particular case, we saw problems where the actual vifnames were truncated at one length, and our firewall rules (for antispoof) were truncated at another, blocking all packets from the domain in question. It is better to avoid the problem and keep the vifnames short.

Memory Issues

Xen (or rather, the Linux driver domain) can act rather strangely when memory's running low. Since Xen and the dom0 require a certain amount of contiguous, unswappable memory, it's surprisingly easy (in our experience) to find the oom-killer snacking on processes like candy. This even happens when there's plenty of swap available.

The best solution we've found -- and we freely admit that it's not perfect -- is to give dom0 more memory. We also prefer to fix its memory allocation at something like 512 MB, so that it doesn't have to cope with Xen constantly adjusting its memory size.

The basic way of tuning dom0's memory allocation is by adjusting the dom0_mem kernel parameter, which sets an upper limit, and the dom0-min-mem parameter in /etc/xen/xend-config.sxp, which sets a lower limit. Again, we usually set both of these to the same value.

To set the maximum amount of memory available to the dom0, edit menu.lst and put the option after the kernel line, like so:

kernel /xen.gz dom0_mem=512M noreboot

In the absence of units, Xen will assume that the value is in kB.

Next, edit /etc/xen/xend-config.sxp and add a line that says:

(dom0-min-mem 512)

We do this because we've seen the Dom0 have problems with ballooning. It usually works, but, like taking backups from a non-quiescent filesystem, 'usually works' is not good enough for something as important as the Dom0.

Creating domains in low-memory conditions

This is one of the most informative error messages in Xen's arsenal.

XendError: Error creating domain: I need 131072 KiB, but dom0_min_mem is 262144 and shrinking to 262144 KiB would leave only -16932 KiB free.

The error means that the system doesn't have enough memory to create the domU as requested. (The system in this case had only 384 MiB, so the error really isn't surprising.)

The solution is to adjust dom0_min_mem to compensate, or adjust the domU to require less memory. Or, as in this case, do both. (And possibly add more memory.)

Other messages

xenconsole: Could not read tty from store: No such file or directory

This message usually shows up in response to an attempt to connect to a domain's virtual console.

If this is a paravirtualized domain, first try restarting xenconsoled:

# /usr/sbin/xenconsoled

(We see this problem especially when you have a mismatched kernel and userland.)

Then reconnect with xm console.

If the problem persists, the problem is most likely that you're trying to access a domain that doesn't have the necessary Xen frontend console device configured in. There are several possibilities -- if this is a custom kernel, you may have simply forgotten to include it, for example. Check the configuration of the domain's kernel and the initrd for the xvc driver.

If you are accessing an HVM domain running a default (non-enlightened) kernel that doesn't include the console driver, try using the framebuffer, or booting a different kernel. You might also be able to set serial=pty in the domain config file and set the domU OS to use com1 as the console. See the HVM chapter for details.

VmError: (22, 'Invalid argument')

This error can mean a number of things.

Often the problem is a version mismatch between the tools and the running Xen hypervisor. Although the binaries installed in /usr/sbin may be correct, the underlying Python modules may be wrong. Check that they're correct using whatever evidence is available -- dates, comments in the files themselves, output of xm info, etc.

The error can also indicate a PAE mismatch. In this case xend-debug.log will give a succinct description of the problem:

# tail /var/log/xen/xend-debug.log
ERROR: Non PAE-kernel on PAE host.
ERROR: Error constructing guest OS

Incidentally, your dom0 -- which is, after all, just a special Xen guest domain -- can also suffer from this problem. If it happens, the hypervisor will report a PAE mismatch in a large boxed-off error message at boot time, and immediately reboot.

"no version for struct_module found: kernel tainted"

We got this error while trying to install the binary Xen distribution on a Slackware machine. The binary distro comes with a very minimal kernel, so it needs an initrd with appropriate modules. For some reason, the default script loaded modules in the wrong order, causing some loads to fail with the above message.

We fixed the problem by changing the load order in the initrd -- specific directions would depend on your distro.

A constant stream of "4gb seg fixup" messages

Sometimes, on booting a newly installed i386 domain, you'll be greeted with screens full of messages like this:

4gb seg fixup, process init (pid 1), cs:ip 73:b7ec2fc5

These are related to the /lib/tls problem: Xen is complaining because it's having to emulate a 4gb segment for the benefit of some process that's using negative offsets to access the stack. You may also see a giant message at boot, reminding you to address this issue.

This is because Xen interacts very badly with one particular optimization used by the glibc thread library. In order to provide threads with their own data (thread local storage), recent versions of glibc position the thread-local data at the top of the address space, just "before" the base pointer. Because of this, a process can access the TLS block using a negative offset, 'wrapping around' to the other side of the address space.

[DIAGRAM: x86 segment growdown]

Unfortunately, Xen reserves part of the address space for itself, meaning that this trick -- using a negative offset to access the end of memory -- no longer works. Instead, Xen must trap the access, invert the sense of the segment address, and then allow the access to continue. Because the current behavior is near worst-case -- alternating positive and negative accesses -- this makes the machine kind of slow. Note that this only affects i386; x86_64 doesn't even use segments.

To solve this problem, you want to use a glibc that does not do this. You can compile glibc with the -mno-tls-direct-seg-refs option, or install the appropriate libc6-xen package for your distribution (both RedHat-like and Debian-like distros have created packages to address this problem).

With RedHat (and its derived distros,) you can also run these commands:

# echo 'hwcap 0 nosegneg' > /etc/ld.so.conf.d/libc6-xen.conf
# ldconfig

This will instruct the dynamic loader to avoid that particular optimization.

If all else fails (or if you are just too lazy to find a version of glibc built with -mno-tls-direct-seg-refs), you can do as the error message advises, and move the TLS library out of the way:

# mv /lib/tls /lib/tls.disabled

In our experience, there isn't any problem with moving the library. Everything will continue to function as expected.

The importance of disk drivers (initrd problems)

Often, when using a distro kernel, the Xen domain will boot but be unable to locate its root device.

For example:

VFS: Cannot open root device "sda1" or unknown-block(0,0)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

The underlying problem here -- at least in this case -- is that the domU kernel hasn't got the necessary drivers compiled in, and the ramdisk was not specified. A look at the boot output confirms this, with the messages:

XENBUS: Device with no driver: device/vbd/769
XENBUS: Device with no driver: device/vbd/770
XENBUS: Device with no driver: device/vif/0

Nearly all distros ship a minimal kernel and require an initrd with the disk driver to finish booting.

If the kernel managed to load its initrd correctly but failed to switch to its real root, then you'll find yourself stuck in the initrd with a very limited selection of files. In this case, make sure that your devices exist (/dev/sda1 in this example) and that you've got the xenblk kernel module.

We also commonly see this within pygrub domUs after a kernel upgrade (and new initrd) if the modules config (/etc/modules on Debian, /etc/modprobe.conf on Red Hat) didn't specify xenblk.

For CentOS/RHEL domUs, you can solve this problem by running the mkinitrd script with the --preload xenblk switch.
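
For example, from within the domU (the kernel version is illustrative):

# mkinitrd --preload=xenblk /boot/initrd-2.6.18-92.el5xen.img 2.6.18-92.el5xen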

If you boot the domU from an external distro kernel, you must specify a ramdisk= line in the domain config file, pointing at a ramdisk that includes the xenblk (and xennet, if you want network before boot) drivers.
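
A minimal sketch of the relevant config lines (paths and versions illustrative):

kernel = "/boot/vmlinuz-2.6.18-92.el5xen"
ramdisk = "/boot/initrd-2.6.18-92.el5xen-domU.img"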

Another solution to this problem would be to compile Xen from source and build a sufficiently generic domU kernel, with the xenblk and xennet drivers already compiled in. Even if you continue to boot the dom0 from the distro kernel (probably a good idea), this will sidestep the distro-specific issues found with both RedHat and Debian kernels.

This may cause problems with some domU distros, because the expected initrd won't be there. Sometimes it can be difficult to build an initrd against a kernel with disk drivers built in. However, the generic kernel will usually at least boot.

We often find it useful to keep these generic kernels as a secondary 'rescue' boot option within the DomU pygrub config, as they work no matter how badly the initrd is messed up.

Xenstore

Sometimes the Xenstore gets corrupted, or xenstored dies, or for various other reasons the Xenstore ceases to store and report information.

The most obvious symptom is that xm list will report domain names incorrectly, e.g.:

# xm list

Name                ID Mem(MiB) VCPUs State   Time(s)
Domain-0             0     2554     2 r-----  16511.2
Domain-10           10      127     1 -b----   1671.5
Domain-11           11      255     1 -b----    442.0
Domain-14           14       63     1 -b----   1758.2
Domain-15           15       62     1 -b----   7507.7
Domain-16           16      127     1 -b----  11194.9
Domain-6             6       94     1 -b----   5454.2
Domain-7             7       62     1 -b----    270.8
Domain-9             9      127     1 -b----   1715.7

Obviously this is problematic. For one thing, it means that all commands that can take a name or id, such as xm console, will no longer recognize names.

We haven't come up with a great solution. Removing the .tdb file and rebooting seems to work. Thankfully we haven't seen the problem since some early versions of Xen.
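
On most systems the store lives at /var/lib/xenstored/tdb, though the exact path can vary by distro and Xen version. A minimal recovery sketch:

# rm /var/lib/xenstored/tdb
# reboot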

Xen's logs

These error messages make a good start on Xen troubleshooting, but sometimes they're not helpful enough to solve the problem. In these cases, we need to dig deeper.

dmesg and xm dmesg

While the output of xm dmesg isn't a log in the usual sense of "log file," it's an important source of diagnostic output.

If you've got a problem whose source isn't obvious from the error message, begin by looking at the Xen kernel message buffer. As you probably know, the Linux dmesg command prints out the Linux kernel's message buffer, which ordinarily contains all kernel messages since the system's last boot. (Or, if the system's been up for a while, it displays a succession of boring status messages.)

Because Xen could be said to act as a kernel in its own right, it includes an equivalent tool to print out messages from the hypervisor boot. (The lines that begin with (XEN) in the startup messages.) For example:

# xm dmesg | tail -3
(XEN) (file=platform_hypercall.c, line=129) Domain 0 says that IO-APIC 
REGSEL is good
(XEN) microcode: error! Bad data in microcode data file
(XEN) microcode: Error in the microcode data

(In this case, the errors are harmless. The processor simply runs on its factory-installed microcode.)

Note: Like the kernel, Xen retains only a fixed-size message buffer. Older messages go off into oblivion.

Logs and what Xen writes to them

If xm dmesg isn't enlightening, Xen's next line of communication is its extensive logging. So, let's look at the various logs that Xen uses, and what we can do with them.

We can summarize Xen's logs as follows, in rough order of importance.

  • /var/log/xen/xend.log
  • /var/log/xen/xend-debug.log
  • /var/log/xen/xen-hotplug.log
  • /var/log/syslog
  • /var/log/debug

Most of your Xen troubleshooting will involve the first two logs. Xend.log is the main Xend log, as you might suppose. It records domain startups, shutdowns, device creation, debugging whatnot, and occasionally includes giant incomprehensible Python dumps. It's the first thing to check.

Xend-debug.log has information relating to more experimental features of Xen, such as the framebuffer. It'll also have verbose tracebacks when Xen runs into trouble.

Because xend uses the syslog facility, messages from Xen also show up in the system-wide /var/log/syslog and /var/log/debug. [note: We hasten to add that syslog is almost /humorously/ configurable. Even the term "system-wide" only applies to the default configuration -- syslog can consolidate logs across multiple hosts, categorize messages into various channels, write to arbitrary files... But we're going to assume that, if you've configured syslog, you can translate what we say about Xen's use of it to apply to your configuration.]

Finally, if you're using HVM, qemu-dm will write its own logs. By and large, you can safely ignore these -- in our experience, problems with HVM domains haven't been the fault of QEMU's device emulation.

If the kernel messages prove unenlightening, it's time to take a look at the log files. First, let's configure Xen to ensure that they're as round, firm, and fully packed as possible.

[Textbox: The importance of a debug build

For troubleshooting (and, in fact, general use) we recommend building Xen with all of its debugging options turned on. This makes the error messages more informative and plentiful, making it easier to figure out where problems are coming from, and with any luck, eliminate them.

Although it might seem that copious debugging output would cause a performance hit, in our experience it's negligible when running Xen normally. A debug build gives you the option of running Xen with excessive debugging output, but it performs about as well as a normal build when you're not using that mode. If you find that the error messages are unhelpful, it might be a good idea to make sure that you have all the debugging knobs set to "full."

See Chapter 14, "tips" for more information on building Xen, including how to set the debugging options.

[End textbox]

Applying the debugger

If even the maximum-verbosity logging isn't enough, it's time to attack the problem at the Python level, with the debugger.

One investigation to try is to run the xend server in the foreground and watch its debug output. This will let you see somewhat more information than simply following the logs.

With current versions of Xen, the debug functionality is included in releases. [note: at one point you had to download a patch and rebuild. Thankfully, this is no longer the case.] Enable it with:

# export XEND_DEBUG=1
# export XEND_DAEMONIZE=0
# xend start

This will start xend in the foreground and tell it to print debug messages as it goes along.

You can also get copious debugging info by setting XENSTORED_TRACE=1 somewhere where xend's environment will pick it up -- perhaps at the top of /etc/init.d/xend, or in root's .bashrc.

Xen's backend architecture -- making sense of the debug info

Of course, all this debugging output is more useful with some idea how Xen is structured.

If you take a look at the actual xend executable, the first thing you'll notice is that it's really very short. There's not much to it -- all of the heavy lifting's done in external python libraries, which live in xen/xend/server under one of the python library directories. (In the case of the system I'm sitting in front of, that's /usr/lib/python2.4/site-packages/xen/xend/server.)

Likewise, xm is also a short Python script. The take-home message here is that most of the error messages that you'll see emanate from somewhere in this directory tree, and that they'll helpfully print the responsible file and line number, so that you can examine the Python more closely.

For example, look at this line from /var/log/xen/xend.log:

[2007-08-07 20:14:26 6008] WARNING (XendAPI:672) API call:
VM.get_auto_power_on not found

At the beginning is the date, time, and xend's PID. Then comes the severity of the error (in this case, WARNING, which is to say, merely irritating). After that is the file and line number where the error occurred, followed by the actual contents of the error message.

Note: WARNING is only one point along the continuum of messages. At the lowest extreme of severity, we have DEBUG, which the developers use for whatever output strikes their fancy. It's often useful, but generates a lot of data to wade through. Slightly more significant, we have INFO. Messages at this level are supposed to be interesting or useful to the administrator, but not indicative of a problem.

Then comes WARNING, which indicates a problem, but not a critical one. For example, the message above tells us that we'd have trouble if we were relying on the VM.get_auto_power_on function, but that nothing bad will happen if we don't try to use it.

Finally Xen uses ERROR for genuine, beyond-denial errors -- the sort of thing that can't be put off or ignored. Generally this means that a domain is exiting abnormally.

Armed with this information, you can do several things. To continue our earlier example, we'll open /usr/lib/python2.5/site-packages/xen/xend/XendAPI.py and add a line near the top of the file to import the debugger module, pdb.

import pdb

Having done that, you can set a breakpoint. Just add a line near line 672:

pdb.set_trace()

Then try rerunning the server (or redoing whatever other behavior you're concerned with), and note that xend starts the debugger when it hits your new breakpoint.

At this point you can do everything that you might expect in a debugger -- change the values of variables, step through a function, step into subroutines, and so forth. In this case, we might backtrace, figure out why it's trying to call VM.get_auto_power_on, and maybe wrap it in an error-handling block.
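
A handful of pdb commands cover most of what you'll need once you're at the prompt (the variable name here is illustrative):

(Pdb) bt          # backtrace: how did we get here?
(Pdb) l           # list the source around the breakpoint
(Pdb) p vm_ref    # print a variable
(Pdb) n           # execute the next line
(Pdb) c           # continue running normally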

Domain stays in blocked state

Usually, we find that this problem is related to the console -- for example:

[root@localhost ~]# xm create -c sebastian.cfg
Using config file "/etc/xen/sebastian.cfg".
Going to boot Fedora Core (2.6.18-1.2798.fc6xen)
  kernel: /vmlinuz-2.6.18-1.2798.fc6xen
  initrd: /initrd-2.6.18-1.2798.fc6xen.img
Started domain sebastian
rtc: IRQ 8 is not free.
i8042.c: No controller found.

(And then an indefinite hang.) Upon breaking out and looking at the output of xm list, we note that the domain stays in a blocked state and has consumed very little CPU time.

[root@localhost ~]# xm list
Name                                      ID Mem(MiB) VCPUs State  Time(s)
Domain-0                                   0     3476     2 r-----   407.1
sebastian                                 13      499     1 -b----    19.9

A quick look at /var/log/xen/xend-debug.log suggested an answer:

10/09/2007 20:11:48 Autoprobing TCP port 
10/09/2007 20:11:48 Autoprobing selected port 5900

Port 5900... That's VNC. Aha!

The problem was that Xen wasn't using the virtual console device that xm console connects to -- in this case, we traced it to user error. We had specified the framebuffer and forgotten about it. The kernel, as instructed, used the framebuffer as its console, rather than the emulated serial console that we were expecting. When we started a VNC client and connected to port 5900, it gave us the expected graphical console.

Note: If we had put a getty on xvc0, even though we wouldn't see boot output, we'd at least get a login prompt once the machine booted.

Debugging Hotplug

Xen makes extensive use of udev to create and destroy virtual devices, both in the dom0 and the domU. Most of its interaction with Linux's hotplug subsystem gets logged in /var/log/xen/xen-hotplug.log. [note: We're going to treat "hotplug" as synonymous with "udev," since we can't think of any system that still uses the pre-udev hotplug implementation.]

First, we examine the effects of the script. In this case, we use udevmonitor to see udev events. It should show an "add" event for each vif and vbd, and an "online" event for the vif. These go through the rules in /etc/udev/rules.d/xen-backend.rules, which executes appropriate scripts in /etc/xen/scripts.

At this point you can add some extra logging. At the top of the script for the device you're interested in (e.g., blktap), put:

set -x
exec 2>>/var/log/xen/xen-hotplug.log

This will cause the shell to expand the commands in the script and write them to xen-hotplug.log, enabling you (it is to be hoped) to trace down the source of the problem and eliminate it.

Hotplug can also act as a bit of a catchall for any virtual device problem. Some hotplug-related errors take the form of the dreaded "Hotplug scripts not working" message, like the following:

Error: Device 0 (vkbd) could not be connected. Hotplug scripts not
working.

This seemed to be associated with messages like the following:

DEBUG (DevController:148) Waiting for devices irq.
DEBUG (DevController:148) Waiting for devices vkbd.
DEBUG (DevController:153) Waiting for 0.
DEBUG (DevController:539) hotplugStatusCallback
/local/domain/0/backend/vkbd/4/0/hotplug-status

In this case, however, these messages turned out to be a red herring. The answer came out of xend-debug.log, which said:

/usr/lib/xen/bin/xen-vncfb: error while loading shared libraries:
libvncserver.so.0: cannot open shared object file: No such file or
directory

As it developed, libvncserver was installed in /usr/local, which the runtime linker had been ignoring. After adding /usr/local/lib to /etc/ld.so.conf and running ldconfig, xen-vncfb started up happily.

strace

One important generic troubleshooting technique is to use strace to look at what the Xen control tools are really doing. For example, if Xen's failing to find an external binary (like xen-vncfb), strace can reveal that problem with a command like the following:

# strace -e trace=open -f xm create prospero 2>&1 | grep ENOENT | less

Unfortunately, it'll also give you a lot of other, entirely harmless output, as python proceeds to pull in the entirety of its runtime environment based on crude guesses about filenames.

Another example of strace's usefulness comes from when we were setting up pygrub:

# strace xm create -c prospero
(snipped)
mknod("/var/lib/xen/xenbl.4961", S_IFIFO|0600) = -1 ENOENT (No such file or directory)

As it turned out, we didn't have a directory required by pygrub's backend. Thus:

# mkdir -p /var/lib/xen/ 

And everything works fine.

Python path issues

The Python path itself can be the subject of some irritation. Just as you've got your shell executable path, manpath, library path, and so forth, python has its own internal search path that it examines for modules. If the path doesn't include the Xen modules, you can wind up with errors like the following:

# xm create -c sebastian.cfg
Using config file "/etc/xen/sebastian.cfg".
Traceback (most recent call last):
  File "/usr/bin/pygrub", line 26, in ?
    import grub.fsys
ImportError: No module named fsys

Unfortunately, the mechanisms for adjusting the search path aren't exactly intuitive. In most cases, we just fall back to either creating some symlinks, or moving the Xen files into some directory that's already in Python's path.

The correct solution is to add a .pth file to a directory that's already in Python's path -- specifically, a site-packages directory, since those are the directories Python scans for .pth files. The .pth file should contain the path of a directory with Python modules. For example:

# echo "/usr/local/lib/python2.5/site-packages" >> \
  /usr/lib/python2.5/site-packages/local.pth

Confirm that the path updated correctly by starting Python:

# python
>>> import sys
>>> print sys.path
['', '/usr/lib/python25.zip', '/usr/lib/python2.5', (etc)
'/usr/local/lib/python2.5/site-packages']

Mysterious lockups

These are among the most frustrating aspects of dealing with computers -- sometimes they just don't work.

If Xen (or the dom0) hangs mysteriously, chances are you have a kernel panic in the dom0. In this case, you have two problems -- first, the crash itself; second, that your console logging isn't adequate to its task.

A serial console improves your life immensely. If you're using serial, you should see an informative panic message on the serial console. If you don't see that, you may want to try typing control-A three times on the console to switch the input to the Xen hypervisor. This will at least confirm that Xen and the hardware are still up.

If you don't have a serial console, try to keep your VGA console on tty1, as often the panic message won't go anywhere else. Sometimes a digital camera is handy for saving the output of a kernel panic, although a serial console would be better.

If the box reboots before you can see the panic message on your console, and serial isn't an option, you can try adding panic=0 to the module line that specifies your Linux kernel in the dom0's menu.lst file. This has the obvious disadvantage of hanging your computer rather than rebooting, but it's good for test setups.
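
For example (kernel name and root device illustrative):

module /boot/vmlinuz-2.6-xen root=/dev/sda1 ro panic=0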

Kernel parameters -- a "Safe Mode"

If even the hypervisor serial console doesn't work -- that is, if the machine is /really/ frozen -- there are some kernel parameters that we've had good luck with in the past.

The "ignorebiostables" option to the linux kernel (on the "module" line) may help to avoid hangs when under I/O stress on certain Intel chipsets. If your machine is crashing -- full-on ceasing to function hardware-wise -- it might be worth a shot. (I know, it's only one step removed from waving a dead chicken over the server. But you work with what you've got.)

In a similar vein, "acpi=off" and "nousb" have been reported to improve stability on some hardware. You may also want to disable hyper-threading in the BIOS. Some Xen versions have had trouble with it.

So, if you want to add all of these options at once, your /boot/grub/menu.lst entry for Xen will look something like this:

root (hd0,0)
kernel /boot/xen-3.0.gz
module /boot/vmlinuz-2.6-xen ignorebiostables acpi=off noapic nousb

Getting help

You can, of course, email us directly with Xen-related questions. No guarantee that we'll be able to help, but asking is easy enough. There's also a list of Xen consultants on the Xen wiki, at http://wiki.xensource.com/xenwiki/Consultants. (If you happen to be a Xen consultant, feel free to add yourself.)

Here are some other resources that might be some use to Xen admins.

Mailing lists

There are several popular mailing lists devoted to Xen. You can sign up and read digests at http://lists.xensource.com/.

We recommend reading the Xen-users mailing list at least. Xen-devel can be interesting, but the high volume of patches might discourage people who aren't actively involved in Xen development.

At any rate, both lists are good places to look for help, but Xen-users is a much better place to start, if you have a question that involves using Xen.

Xen.org wiki

Xen's got a fairly extensive wiki at http://xen.org/wiki. Some of it is out-of-date, but it's still a valuable starting point.

And, of course, new contributors are always welcome. Take a look, poke around, add your own experiences, tips, and cool tidbits.

Bugzilla

Xen maintains a bug database, just like all software projects above a certain size. It's publicly accessible at http://bugzilla.xensource.com.

Type keywords into the search box, press the button, read results.

Distro Vendor

Don't forget the specific documentation and support resources of your vendor. Xen is a complex piece of software, and the specifics of how it's integrated vary between distros. Although the distro documentation may not be as complete as, say, this book, it's likely to at least point in the correct direction.

Xen-bugtool

If all else fails, you can use xen-bugtool to annoy the developers directly. The purpose of xen-bugtool is to collect the relevant troubleshooting information together, so that you can conveniently attach it to a bug report or make it available to a mailing list.

Simply run xen-bugtool on the affected box (in the dom0, of course). It'll start an interactive session and ask you what data to include, and what to do with the data.

The xen-bugtool script collects:

  1. The output of xm dmesg
  2. The output of xm info
  3. /var/log/messages (if desired)
  4. /var/log/xen/xend-debug.log (if desired)
  5. /var/log/xen/xen-hotplug.log
  6. /var/log/xen/xend.log

It'll save this data as a .tar.bz2, after which it's up to you to decide what to do with it. We recommend uploading it somewhere web-accessible and sending a message to the xen-devel mailing list.

Error: Invalid Mode

[root@mares ~]# xm create -c rcs
Using config file "/etc/xen/rcs".
Error: Invalid mode

This was caused by a bad disk setting:


kernel = "/usr/lib/xen/boot/pv-grub-x86_64.gz"
extra = "(hd1,0)/boot/grub/menu.lst"
cpu_weight=128
memory = 4096
vcpus=2
cpus="1,2,3,4,5,6,7"
name = "rcs"
vif = ['vifname=rcs,ip=64.71.167.135,bridge=xenbr0,mac=aa:00:00:59:A7:87' ]
disk = [
        'phy:/dev/mares_domU/rcs,xvda,w',
        'phy:/dev/mares_domU/rescue,xvdb,r',
        'phy:/dev/mares_domU/rcs2,xvdc,w'
        'phy:/dev/mares_domU/lscscratch,xvdd,w'
]

Can you spot the error? (Hint: there's a comma missing after the xvdc line. Python concatenates adjacent string literals, so the last two disk specs merge into a single string, leaving a garbage mode field -- hence "Invalid mode".)