Interpreting ESXTOP: Know the Essential Counters and Threshold Values (Part 1)
ESXTOP is the essential command-line utility bundled with every ESXi host, and it is used for troubleshooting many VMware issues. It monitors performance statistics such as CPU, disk, memory and network, much like Perfmon in Windows. Its command switches let you collect various types of performance data, and you can also export that data for later analysis. Almost every VMware administrator needs to know how to use esxtop and what its threshold values are. In this blog I will show you the most important esxtop views, the keys that open them, and the thresholds to watch.
It is mostly used for performance troubleshooting of:
- CPU Issues
- Memory Issues
- Disks/Storage Issues
- Network Issues
How to use esxtop?
Because esxtop is a command-line utility, you run it in an SSH session to the host, for example with PuTTY. Make sure the SSH service is enabled on the host before you start.
After SSH is enabled, connect to the ESXi host through PuTTY, enter the root credentials and hit Enter. You are now connected to the host.
Type the command “esxtop” and hit Enter. By default, the screen that opens shows the CPU performance details. If you type “m” you will see memory metrics, “n” shows network, and so on. If you type “h” you will see all available commands.
You will also see “Worlds” at the top of the screen. A world is an ESX Server VMkernel schedulable entity, similar to a process or thread in other operating systems.
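Besides the interactive screen, esxtop can also run in batch mode, which is how the performance data mentioned earlier gets exported. Here is a minimal sketch in Python that only assembles the command line. The helper name `build_esxtop_batch_command`, the default values and the output file name are illustrative assumptions; `-b` (batch mode), `-d` (delay in seconds between samples) and `-n` (number of samples) are the esxtop switches it relies on.

```python
import shlex


def build_esxtop_batch_command(delay_seconds=5, iterations=12):
    """Return the argument list for an esxtop batch-mode capture.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if delay_seconds <= 0 or iterations <= 0:
        raise ValueError("delay_seconds and iterations must be positive")
    return ["esxtop", "-b", "-d", str(delay_seconds), "-n", str(iterations)]


if __name__ == "__main__":
    # Print a line you could paste into the SSH session; the redirect
    # target (esxtop_capture.csv) is just an example file name.
    print(shlex.join(build_esxtop_batch_command()) + " > esxtop_capture.csv")
```

The resulting CSV can be copied off the host and opened in a spreadsheet or in Perfmon for offline analysis.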
How to monitor CPU?
The default view shows the CPU performance counters. To check CPU performance in detail, there are a few key counters to watch, because high values can cause serious issues:
(%RDY) High Ready Time: A vCPU is in the Ready state when the virtual machine is ready to run but cannot, because the vSphere scheduler is unable to find physical host CPU resources to run it on. Ready Time above 10% could indicate CPU contention.
(%CSTP) High Co-stop Time: High co-stop time indicates that the VM has more vCPUs than necessary, and that the excess vCPUs create scheduling overhead that drags down the performance of the VM.
%PCPU USED: Indicates the percentage of CPU usage per PCPU by VMs, and its average over all PCPUs.
Press “c” to see the CPU view, where you can check all these counters and their values.
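The CPU checks above can be sketched as a small Python function. This is only an illustration: the function name and arguments are hypothetical, the 10% limit for %RDY comes from the text above, and the 3% default for %CSTP is an assumption (a commonly quoted rule of thumb, not a value given in this post), so adjust it to your environment.

```python
def check_cpu_counters(rdy_pct, cstp_pct, rdy_limit=10.0, cstp_limit=3.0):
    """Return warnings for one VM row read from the esxtop CPU view.

    rdy_limit follows the 10% guideline in this post; cstp_limit is an
    assumed rule of thumb and should be tuned per environment.
    """
    warnings = []
    if rdy_pct > rdy_limit:
        warnings.append(
            f"%RDY {rdy_pct:.1f} is above {rdy_limit:.1f}: possible CPU contention"
        )
    if cstp_pct > cstp_limit:
        warnings.append(
            f"%CSTP {cstp_pct:.1f} is above {cstp_limit:.1f}: "
            "VM may have more vCPUs than it needs"
        )
    return warnings


if __name__ == "__main__":
    # Example values typed in by hand from the esxtop screen.
    for line in check_cpu_counters(rdy_pct=12.5, cstp_pct=0.4):
        print(line)
```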
How to monitor Memory?
To see memory utilization, press “m” in the esxtop screen. There you will see various memory performance counters.
At the top you will see many counters such as PMEM/MB, VMKMEM/MB, PSHARE/MB, MEMCTL/MB, SWAP/MB, and ZIP/MB. The key counters for monitoring memory performance are:
MEM overcommit avg: Shows the average memory overcommitment over the last 1, 5 and 15 minutes. A value of “0” here is good.
MEMCTL (shown as the MCTL? column in the VM rows): Indicates the status of the memory balloon driver. If the value is Y for the VM, the driver is installed. If it is N, you need to investigate why, because the balloon driver is installed with VMware Tools by default.
MEMCTLSZ (shown as MCTLSZ): If the value of this counter is greater than zero, the host is trying to reclaim memory from the guest, which means the host is overcommitted. The balloon driver has two states:
- Inflate – Reclaiming the memory from the guest VM.
- Deflate – Releasing the reclaimed memory to the VM.
A value of “0” is good for this counter.
SWAP: ESXi creates a swap file for each VM by default and uses it when memory is excessively overcommitted. Swapping can have a huge performance impact on your VMs, because it is the last technique the ESXi host uses to reclaim memory.
SWR/s: If the value is larger than 0, the host is reading memory pages from the VM's swap file (.vswp).
SWW/s: Indicates that the host is writing memory pages to the swap file (.vswp). A value greater than 0 is a cause for concern, as it points to excessive memory overcommitment.
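The memory rules above all reduce to "anything above zero deserves a look". A minimal Python sketch of that logic follows; the function and parameter names are hypothetical, and the values are assumed to be read by hand from the esxtop memory view.

```python
def check_memory_counters(overcommit_avg, mctlsz_mb, swr_per_s, sww_per_s):
    """Return findings for one VM based on the zero-is-good rules above."""
    findings = []
    if overcommit_avg > 0:
        findings.append("host memory is overcommitted")
    if mctlsz_mb > 0:
        findings.append(f"balloon driver has reclaimed {mctlsz_mb:.0f} MB from the guest")
    if swr_per_s > 0:
        findings.append("host is reading pages from the VM swap file (.vswp)")
    if sww_per_s > 0:
        findings.append("host is writing pages to the VM swap file (.vswp)")
    return findings


if __name__ == "__main__":
    # Example: ballooning is active but no swapping has started yet.
    for finding in check_memory_counters(0.12, 512, 0.0, 0.0):
        print(finding)
```

An empty list means none of the counters discussed here is signalling memory pressure.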
How to monitor Disk?
For disks, performance relates to the datastores on attached storage and the local disks of the ESXi host. Press “d” to see the disk statistics.
Disks must perform well for the virtual infrastructure to be productive. Their performance also depends on the hardware and the storage array vendor. The latency counters in esxtop report the guest VM, ESX host kernel and device latencies, under the labels GAVG, KAVG and DAVG, respectively.
DAVG: Indicates the latency at the device driver level, that is, the time an I/O request takes between the HBA driver and the backend storage. It is a good indicator of the performance of the backend storage. Latency should be 25 ms or lower; if it is higher than 25 ms, the latency is caused by your storage array, or by FC switch misconfiguration or faults.
KAVG: The value of this counter should be between 0 and 2 ms. If it is higher than 2 ms, it indicates disk I/O latency in the host kernel storage stack; disk I/Os may be queuing at the kernel level.
GAVG: The sum of KAVG and DAVG. This is the total disk I/O latency, covering the device and the storage layer of the VMkernel. It should be <= 25 ms; if it is higher, investigate both the host kernel and the storage array.
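The relationship GAVG = KAVG + DAVG and the three limits above can be sketched in Python as follows. The class and function names are hypothetical, and the latencies are assumed to be in milliseconds as read from the esxtop disk view.

```python
from dataclasses import dataclass


@dataclass
class DiskLatency:
    """Latencies in milliseconds for one adapter or device row."""
    davg_ms: float  # device latency (storage array, fabric)
    kavg_ms: float  # VMkernel latency (queuing in the host)

    @property
    def gavg_ms(self):
        # Guest latency is the sum of kernel and device latency.
        return self.kavg_ms + self.davg_ms


def diagnose(lat, davg_limit=25.0, kavg_limit=2.0, gavg_limit=25.0):
    """Return hints on where to look, using the limits from this post."""
    hints = []
    if lat.davg_ms > davg_limit:
        hints.append("check the storage array and FC switches")
    if lat.kavg_ms > kavg_limit:
        hints.append("check for I/O queuing in the host kernel")
    if lat.gavg_ms > gavg_limit and not hints:
        hints.append("check both the host kernel and the storage array")
    return hints


if __name__ == "__main__":
    sample = DiskLatency(davg_ms=30.0, kavg_ms=0.5)
    print(sample.gavg_ms, diagnose(sample))
```

The last branch covers the case where neither component crosses its own limit but their sum still does, for example DAVG 24 ms plus KAVG 1.5 ms.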
How to monitor Network?
To monitor network throughput you must be aware of vSwitches, ports, uplinks and so on, so you can tell whether there are any packet drops or latency. Press “n” to see the network statistics. The output shows TEAM-PNIC, DNAME and various other counters; the key counters for network performance are given below. “PORT-ID” identifies the port and “DNAME” shows the virtual switch name. A port can be linked to a physical NIC as an uplink, or can be connected by a virtual NIC.
%DRPTX: Indicates the percentage of transmitted packets that are dropped. If it is high, check whether the physical NICs are running at their rated capacity. The NICs may not be fast enough for the load (a hardware limitation), in which case use faster NICs or add more NICs with load balancing.
%DRPRX: Indicates the percentage of received packets that are dropped. A high value points to high network utilization or overloaded hardware.
- The value of both counters should be <= 1.
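The drop-rate check can be sketched in Python like this. The function name, the port names and the input layout (a mapping of port name to a pair of %DRPTX and %DRPRX values) are all assumptions made for illustration.

```python
def dropped_packet_alerts(ports, limit_pct=1.0):
    """Return {port_name: [problems]} for ports above the drop limit.

    ports maps a port name to a (drptx_pct, drprx_pct) tuple.
    """
    alerts = {}
    for name, (drptx_pct, drprx_pct) in ports.items():
        problems = []
        if drptx_pct > limit_pct:
            problems.append("transmit drops")
        if drprx_pct > limit_pct:
            problems.append("receive drops")
        if problems:
            alerts[name] = problems
    return alerts


if __name__ == "__main__":
    # Hypothetical readings copied from the esxtop network view.
    readings = {"vmnic0": (0.0, 0.0), "web01.eth0": (2.3, 0.1)}
    print(dropped_packet_alerts(readings))
```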
If you are tired of staring at a black PuTTY screen to read esxtop statistics, you can also try “VisualESXTOP”, a GUI-based utility. I will share a step-by-step tutorial on interpreting it.