VMware ESX Server Common Problems Diagnosis and Resolution

In this series, I will share some solutions for commonly encountered problems in VMware ESX host servers, VirtualCenter and virtual machines. For now, let’s start by looking at how to resolve some common issues with VMware ESX servers.

Resolving Purple Screen of Death (PSOD)

VMware ESX servers can suffer a failure similar to the Windows blue screen of death, known as the Purple Screen of Death (PSOD). It is usually caused by a hardware fault or a bug in the VMware code. When you hit a PSOD, first record everything on the screen, for example by photographing it with your cell phone. This information is invaluable to VMware’s technical support team when they analyze the cause of the problem. Once the screen is recorded, you will generally need to reboot the server. After the reboot, you can find a file named vmkernel-zdump-* in the /root directory; it contains crash information that can be analyzed further with the vmkdump tool.
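As a rough sketch of the post-reboot step, the helpers below look for the newest dump file and, if vmkdump is present, ask it to extract the VMkernel log. The function names latest_zdump and extract_zdump_log are made up for illustration, and the -l option is an assumption based on ESX 3.x behavior, so check the tool's help output on your build before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the newest vmkernel-zdump-* file in a directory
# (defaults to /root, where the dump is expected after a PSOD reboot).
latest_zdump() {
    dir="${1:-/root}"
    # ls -t sorts newest first; errors are hidden when no dump exists.
    ls -t "$dir"/vmkernel-zdump-* 2>/dev/null | head -n 1
}

# Hypothetical helper: extract the log from the newest dump, if possible.
extract_zdump_log() {
    dump="$(latest_zdump "$1")"
    if [ -z "$dump" ]; then
        echo "no vmkernel-zdump file found" >&2
        return 1
    fi
    if command -v vmkdump >/dev/null 2>&1; then
        # Assumption: -l extracts the VMkernel log from the dump file.
        vmkdump -l "$dump"
    else
        echo "vmkdump not found; dump file is $dump" >&2
        return 1
    fi
}

# Usage on the service console:
#   extract_zdump_log /root
```

Keep the original dump file until VMware support confirms they no longer need it.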

Checking Server RAM

If you suspect a problem with the server’s RAM, you can test it in the background with a tool provided by VMware, without affecting the running virtual machines. Log in to the service console and type service ramcheck start to begin the check. This method can only test RAM that is not currently in use, however. For a full RAM test, shut down ESX, boot from a CD and run Memtest86+.
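A minimal sketch of checking on the background test might look like the following. The helper name ramcheck_errors is hypothetical, and the default log path /var/log/vmware/ramcheck-err.log is an assumption about where ramcheck records errors; confirm the location on your host.

```shell
#!/bin/sh
# Hypothetical helper: report whether the ramcheck error log has any content.
# Assumption: errors are written to /var/log/vmware/ramcheck-err.log.
ramcheck_errors() {
    errlog="${1:-/var/log/vmware/ramcheck-err.log}"
    if [ ! -e "$errlog" ]; then
        echo "no error log at $errlog (has ramcheck been started?)"
        return 2
    fi
    if [ -s "$errlog" ]; then
        # A non-empty error log means ramcheck reported something.
        echo "errors logged in $errlog:"
        cat "$errlog"
        return 1
    fi
    echo "no errors logged in $errlog"
    return 0
}

# Typical sequence on the service console:
#   service ramcheck start    # begin the background check
#   ramcheck_errors           # look for reported errors later
#   service ramcheck stop     # end the check when finished
```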

Using the vm-support Tool

When you ask VMware technical support for help, they will usually ask you to run the vm-support tool, which collects the logs and configuration files and packages them into a single file. Simply log in to the service console as root and type vm-support. Remember to delete the file when you’re done with it to free up disk space.
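The cleanup step is easy to forget, so here is a small sketch of it. The helper names are hypothetical, and the esx-*.tgz pattern is an assumption about how vm-support names its output in the directory it is run from; adjust the pattern to whatever file the tool reports creating.

```shell
#!/bin/sh
# Hypothetical helper: list support bundles in a directory, newest first.
# Assumption: vm-support writes a file matching esx-*.tgz.
list_bundles() {
    dir="${1:-.}"
    ls -t "$dir"/esx-*.tgz 2>/dev/null
}

# Hypothetical helper: delete the bundles once they have been sent to support.
remove_bundles() {
    dir="${1:-.}"
    for f in "$dir"/esx-*.tgz; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -f "$f" ]; then
            rm -f "$f" && echo "removed $f"
        fi
    done
    return 0
}

# Usage on the service console, as root:
#   cd /tmp && vm-support
#   list_bundles /tmp
#   remove_bundles /tmp
```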

Troubleshooting with Log Files

Log files are a great tool for problem solving. Depending on the problem encountered, you need to check different log files. The most commonly used are the VMkernel log (/var/log/vmkernel) and the host agent log (/var/log/vmware/hostd.log). These logs can help you track down the source of the problem.
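One simple way to work through these logs is to pull out only the most recent lines that mention a keyword. The helper below is a sketch; the name recent_matches and its arguments are illustrative and not part of any VMware tooling.

```shell
#!/bin/sh
# Hypothetical helper: show the last N lines of a log that contain a keyword
# (case-insensitive). N defaults to 20.
recent_matches() {
    logfile="$1"
    pattern="$2"
    count="${3:-20}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    grep -i -- "$pattern" "$logfile" | tail -n "$count"
}

# Usage:
#   recent_matches /var/log/vmkernel warning
#   recent_matches /var/log/vmware/hostd.log error 50
```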

Version and Patch Information

Knowing the specific version of your ESX environment and which patches are installed is important for resolving problems. You can obtain this information with the following commands:

To view the ESX server version: vmware -v

To view installed patches: esxupdate -l query

To view the VirtualCenter agent (vpxa) version: vpxa -v

To view the VMware Tools version: rpm -qa | grep VMware-esx-tools
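When you open a support case, it can help to capture all of this in one file. The sketch below wraps the commands listed above; the helper names report and collect_versions are hypothetical, and each command is only run if it is actually present on the host.

```shell
#!/bin/sh
# Hypothetical helper: run a command if it exists, under a labeled heading.
report() {
    label="$1"
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not found on this host)"
    fi
}

# Hypothetical helper: gather the version details listed in this section.
collect_versions() {
    report "ESX version" vmware -v
    report "Installed patches" esxupdate -l query
    report "vpxa version" vpxa -v
    report "VMware Tools package" sh -c 'rpm -qa | grep VMware-esx-tools'
}

# Usage:
#   collect_versions > /tmp/esx-versions.txt
```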

Restarting Services

For many problems, simply restarting the VMware host agent service (service mgmt-vmware restart) or the VirtualCenter agent service (service vmware-vpxa restart) will solve the problem. Neither restart affects the operation of running VMs, but in certain versions of ESX restarting mgmt-vmware may automatically start all VMs, so it is recommended to disable the autostart feature first.
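Because of the autostart caveat, it can be worth previewing the restart before running it. The wrapper below is a sketch only: restart_agents and the DRY_RUN variable are hypothetical, and by default it prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical wrapper: restart the two management agents.
# With DRY_RUN=1 (the default here) it only prints what it would run.
restart_agents() {
    for svc in mgmt-vmware vmware-vpxa; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: service $svc restart"
        else
            service "$svc" restart
        fi
    done
}

# Usage:
#   restart_agents              # preview only
#   DRY_RUN=0 restart_agents    # actually restart, after disabling autostart
```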

Coping with a Frozen Service Console

If the service console is stuck, you can still try to restart the server by logging in via SSH or using the emergency console. If possible, it is best to shut down or migrate the VMs before performing the restart operation.

Missing Network Configuration

If the network configuration is lost, you may not be able to connect to the server with the VI Client. In that case, you need to rebuild the network configuration from the local ESX service console using the esxcfg-* commands.
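As an illustration of what such a rebuild can look like, the sketch below recreates a basic service console network with esxcfg-vswitch and esxcfg-vswif. The switch name vSwitch0, the port group name "Service Console", the interface vswif0, and the NIC and address values are all assumptions for the example; the helper names run and rebuild_sc_network are hypothetical. It prints the commands by default so you can review them first.

```shell
#!/bin/sh
# Hypothetical helper: print a command (default) or execute it when DRY_RUN=0.
# Note that the printed form does not preserve quoting.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: rebuild a minimal service console network.
# Arguments: physical NIC, IP address, netmask (all placeholders).
rebuild_sc_network() {
    if [ "$#" -ne 3 ]; then
        echo "usage: rebuild_sc_network <vmnic> <ip> <netmask>" >&2
        return 1
    fi
    nic="$1"; ip="$2"; mask="$3"
    run esxcfg-vswitch -a vSwitch0                    # create the virtual switch
    run esxcfg-vswitch -L "$nic" vSwitch0             # attach the physical NIC as uplink
    run esxcfg-vswitch -A "Service Console" vSwitch0  # add the port group
    run esxcfg-vswif -a vswif0 -p "Service Console" -i "$ip" -n "$mask"  # create the console interface
}

# Usage:
#   esxcfg-nics -l; esxcfg-vswitch -l; esxcfg-vswif -l   # inspect the current state
#   rebuild_sc_network vmnic0 192.0.2.10 255.255.255.0   # preview the commands
#   DRY_RUN=0 rebuild_sc_network vmnic0 192.0.2.10 255.255.255.0
```

Listing the current state first shows which pieces are actually missing, so you only recreate what was lost.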

Hopefully, this information will help you better manage and maintain your VMware ESX server.