Thursday, February 4, 2010

VMware Swapping and Overcommit

The VMware clusters I maintain are used very aggressively. The servers are large (128 GB or 256 GB of RAM currently) and we overcommit memory pretty heavily. Usually that is not a problem, but we have found some limits in VMware ESX 3.5 (tested through Update 4). A number of cases opened with VMware support have led to no resolution other than "it's better in VMware 4.0." That remains to be seen.

The symptom is a server with plenty of free RAM deciding it needs to start hard swapping. In esxtop this shows up in the SWCUR column on the memory screen. To see it, press m after starting esxtop, then f to select fields and J to enable the swap statistics. You will get a display something like the following:

9:10:22pm up 146 days 19 min, 180 worlds; MEM overcommit avg: 0.34, 0.34, 0.34
PMEM /MB: 131066 total: 800 cos, 1162 vmk, 50218 other, 78885 free
VMKMEM/MB: 128653 managed: 7719 minfree, 15492 rsvd, 112807 ursvd, high state
COSMEM/MB: 77 free: 541 swap_t, 541 swap_f: 0.00 r/s, 0.00 w/s
PSHARE/MB: 13048 shared, 3534 common: 9514 saving
SWAP /MB: 15435 curr, 1982 target: 0.02 r/s, 0.00 w/s
MEMCTL/MB: 5489 curr, 3376 target, 91201 max

NAME               MEMSZ     SZTGT    SWCUR    SWTGT  SWR/s  SWW/s
vmware-vmkauthd     5.62      5.62     0.00     0.00   0.00   0.00
bwt-as1         16384.00   7777.88  4725.95     0.00   0.01   0.00
dmg-ci          16384.00  12628.46  6986.35  1982.34   0.00   0.00
bw1-ci          16384.00   7782.90  2090.97     0.00   0.00   0.00
crd-ci          16384.00  15843.64   356.13     0.00   0.00   0.00
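
With a few dozen VMs on a host, picking the swapped ones out of that display by eye gets tedious. Here is a quick, unofficial sketch (plain Python, nothing from VMware) that assumes you have saved a text copy of the VM table above, with the columns in the order shown (NAME MEMSZ SZTGT SWCUR SWTGT SWR/s SWW/s), and flags anything with a non-zero SWCUR:

import sys

def flag_swapped_vms(lines, threshold_mb=1.0):
    # Assumed column order, matching the display above:
    # NAME MEMSZ SZTGT SWCUR SWTGT SWR/s SWW/s
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or fields[0] == "NAME":
            continue  # skip the header and anything that isn't a VM row
        try:
            memsz = float(fields[1])
            swcur = float(fields[3])
        except ValueError:
            continue
        if swcur > threshold_mb:
            print("%s: %.0f MB of %.0f MB swapped out by the host" %
                  (fields[0], swcur, memsz))

if __name__ == "__main__":
    flag_swapped_vms(sys.stdin)

Feed it a capture of the memory screen; any VM it prints is carrying hypervisor-level swap and will stall the first time those pages are touched.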

If you have a number of VMs with very large memory sizes that are usually idle (think development/QA servers), you can have a host with a memory state of "high" and plenty of free RAM (often 60 GB+) that suddenly decides it has a desperate need to hard swap memory out. This starts happening once the "MEM overcommit" average hits about 0.5, that is, once the configured VM memory reaches roughly 150% of physical RAM. It is easy to end up with VMs that have gigabytes of RAM swapped out.
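
For what it's worth, the "MEM overcommit avg" figure at the top of the esxtop screen appears to be nothing more than total configured VM memory divided by host physical memory, minus one (that is my reading of it, not an official definition). A tiny illustration with made-up sizes:

def overcommit_ratio(vm_memsz_mb, host_pmem_mb):
    # Total configured guest memory over physical memory, minus 1.
    # 0.0 means every VM is fully backed by RAM; 0.5 means 150% of
    # physical RAM has been promised to VMs.
    return sum(vm_memsz_mb) / float(host_pmem_mb) - 1.0

# Six 32 GB VMs on a 128 GB host is right at the point where we see trouble:
print(overcommit_ratio([32768] * 6, 131072))   # 0.5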

The VMs may actually behave fairly normally until they are actively used. A VMotion will also cause them to become unresponsive until the memory is paged back in, which can take quite a while. On Linux this will often appear to system administrators as a very high load average despite a lack of real work.

Even more frustrating, the algorithm that decides DRS placement pays no attention to the 0.5 overcommit threshold and will happily make decisions that push a host into paging VMs out.

The only workarounds we currently have are being careful when placing hosts in maintenance mode, and buying more memory.
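
When we do have to evacuate a host, a back-of-the-envelope projection like the one below (made-up numbers and a hypothetical helper, not anything VMware provides) at least tells us whether the surviving hosts will be pushed past the 0.5 mark:

def projected_overcommit(host_pmem_mb, vm_memsz_mb, evacuating):
    # host_pmem_mb: host name -> physical memory in MB
    # vm_memsz_mb:  host name -> list of configured VM sizes in MB
    # All VMs stay powered on; only the evacuated host's RAM leaves the pool.
    remaining = sum(mb for host, mb in host_pmem_mb.items() if host != evacuating)
    total_vm = sum(sum(sizes) for sizes in vm_memsz_mb.values())
    return total_vm / float(remaining) - 1.0

hosts = {"esx1": 131072, "esx2": 131072, "esx3": 262144}
vms = {"esx1": [16384] * 8, "esx2": [16384] * 10, "esx3": [32768] * 10}
print(projected_overcommit(hosts, vms, "esx2"))   # about 0.58 -- past the danger zone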
