Monday, February 22, 2010

Query to map hosts to basic cluster information (VMware 3.5)

Brief Description: A quick way to show vital stats on hosts in a cluster.

Problem: The VirtualCenter database is kind of ugly in the way it maps performance statistics to hosts (it prepends host- to the ID, stores the cluster name one level up from the host, etc.).

Quick Script:

The script is intended to map the basic stats for all hosts in a cluster. It's part of a bigger Reporting Services dashboard I'm building out, but I think it's helpful on its own. As always, please feel free to post corrections.

-- Map hosts to basic cluster information

declare @ClusterToHost table
(
    HostName nvarchar(100),
    ClusterName nvarchar(50),
    ClusterId int,
    HostId int,
    HostStatId nvarchar(50),  -- 'host-<id>', the key the statistics tables use
    HostCPU bigint,
    HostMemory bigint
)

insert into @ClusterToHost (HostName, ClusterName, ClusterId, HostId, HostStatId, HostCPU, HostMemory)
select
    h.name,
    c.name,
    c.id,
    h.id,
    'host-' + convert(varchar, h.id) as StatID,
    -- total CPU across all cores; integer division by 1024^3 truncates the result
    (convert(bigint, vh.cpu_hz) * vh.cpu_core_count) / 1024 / 1024 / 1024 as CPUTotal,
    -- memory size scaled down by 1024^3, also truncated
    convert(bigint, vh.mem_size) / 1024 / 1024 / 1024 as Memory
from vpx_entity h
join vpx_entity c on h.parent_id = c.id   -- the parent entity supplies the cluster name
join vpx_host vh on vh.id = h.id
where h.type_id = 1

--select * from @ClusterToHost

Friday, February 19, 2010

Simple(r) query to get paged out systems on VMware

Brief Description: On VMware, you want to query for VMs swapping out.

Problem:
I posted a longer script that checks VMware Tools and host build versions while looking for VMs that have swapped out in the last hour. It's maybe a little clunky if you just need to know which machines are swapping right now.

Solution:

The smaller query below just lists the VMs that are swapping out right now.


select h.name, max(s.sample_time) as last_sample, avg(s.stat_value) as avg_swapped
from vpx_entity h
join vpxV_hist_stat_daily s on s.entity = ('vm-' + convert(varchar, h.id))
where s.stat_name = 'swapped' and s.stat_value > 1
group by h.name
order by h.name
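
Both queries on this page rely on the same convention: the statistics side identifies an entity by a text key made of a type prefix plus the numeric ID from vpx_entity ('host-' for hosts, 'vm-' for VMs). A minimal sketch of that mapping in Python, purely for illustration; the function name and the sample IDs are made up, and only the two prefixes seen in the queries are covered:

```python
# Prefixes taken from the two queries in these posts; other entity types
# are not covered here because the posts do not show them.
STAT_PREFIXES = {"host": "host-", "vm": "vm-"}

def stat_entity_key(kind, entity_id):
    """Build the text key used on the statistics side from a numeric entity ID."""
    if kind not in STAT_PREFIXES:
        raise ValueError(f"no known prefix for entity kind {kind!r}")
    return STAT_PREFIXES[kind] + str(entity_id)

# Hypothetical IDs, just to show the shape of the keys
print(stat_entity_key("host", 42))
print(stat_entity_key("vm", 1307))
```

This is the same string the SQL builds with 'vm-' + CONVERT(Varchar, h.ID) before joining to the statistics view.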

Thursday, February 18, 2010

SQL 2008 reporting services headache.

Brief Description: Reporting Services 2008 failing with error rsServerConfigurationError

Problem Description:
Reporting Services 2008 is stingy with error message information, and without more detail, tracking down this error is like looking for a needle in a stack of needles. More error detail is available in %SQL Program Dir%\MSRS10.MSSQLSERVER\Reporting Services\LogFiles. The long message associated with my issue was "The report server has encountered a configuration error. No DSN present in configuration file."

Solution:
This excellent post from the blog of Jessica M. Moss solved the issue for me. So, thanks, Jessica!

Thursday, February 4, 2010

You've been using VMware too long when

The other day I had an emergency request to stand up a SharePoint 2010 beta server immediately. It needed Windows 2008 R2 for various reasons, and I realized that my VMware 3.5U4 clusters wouldn't work (previous experience had demonstrated that it really didn't work). I thought about my options, which didn't include upgrading my production hosts, and finally decided that I would install VMware 3.5U5 or VMware 4.0U1 on an old VMware server (it was tiny--only 16 CPUs and 64 GB of RAM). For various reasons having to do with network and storage connectivity, I was unhappy with that solution. Then it occurred to me...

You can install Windows directly on physical hardware.

I've been using VMware too long!

Using PowerShell with NetApp to Remove Snapshots

Most VMware administrators have learned to use PowerShell for scripting. VMware has done a wonderful job with its PowerCLI to embrace this scripting language.

VMware is often implemented on NetApp due to some wonderful advantages with NFS, de-duplication, VDI and Snapshots. Thanks to work in a CodePlex project known as PoshOnTap, it is possible to manage NetApp OnTap with the same ease.

A common problem is old snapshots that need to be discovered and deleted. It is simple with PoshOnTap to find all snapshots older than one week:

import-module PoshOnTap
connect-naserver -filer toaster -Credential (get-credential)
Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)}



Want to delete those snapshots? You can append a " | remove-nasnapshot" to the above command:

Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)} | remove-nasnapshot

I like to put a delay between snapshot deletions, so I tend to use a loop that waits 10 minutes after each one:

import-module PoshOnTap
connect-naserver -filer toaster -Credential (get-credential)

$snapshots=(Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)})

# Remove each old snapshot, pausing 10 minutes (600 seconds) between deletions
foreach ($s in $snapshots) {
    remove-nasnapshot -name $s.Snapshot -volume $s.Volume
    start-sleep -Seconds 600
}
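
The selection logic in the where-object filter is just a cutoff comparison: anything created before "now minus seven days" qualifies. A small Python sketch of that comparison, for illustration only; the snapshot records, field names, and dates are made up and do not reflect the objects PoshOnTap actually returns:

```python
from datetime import datetime, timedelta

def older_than(snapshots, days, now):
    """Return the snapshots created more than `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [s for s in snapshots if s["created"] < cutoff]

# Hypothetical data; a fixed 'now' keeps the example repeatable
now = datetime(2010, 2, 22, 12, 0)
snapshots = [
    {"volume": "vol1", "name": "nightly.0", "created": datetime(2010, 2, 21)},
    {"volume": "vol1", "name": "nightly.9", "created": datetime(2010, 2, 10)},
    {"volume": "vol2", "name": "weekly.3", "created": datetime(2010, 1, 30)},
]

stale = older_than(snapshots, 7, now)
for s in stale:
    print(s["volume"], s["name"])
```

Running the filter on its own first, before piping anything into a delete, is a cheap way to confirm the list is what you expect.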



PoshOnTap has a wealth of management functions for NetApp OnTap. The work done by the coders is exceptional and greatly appreciated!

Using PowerShell to find guests with memory swapped out

I often need to determine if any VMs have memory swapped out (SWCUR in ESXTOP). I've been using PowerShell and the PowerCLI to generate a list because it is so simple.

Get-VM | where-object {$_.powerstate -eq "PoweredOn"} | get-stat -Realtime -Stat "mem.swapped.average" -MaxSamples 1 | Select Entity, Value | where-object { $_.Value -gt 0 } |sort-object -property Entity

You can pipe the results to a CSV by appending " | export-csv" followed by a file path to the end of the statement.

VMware Swapping and Overcommit

The VMware clusters I maintain are used very aggressively. The servers are large (128 GB or 256 GB currently) and we lean heavily on memory overcommit. Usually that is not a problem, but we have found some limits in VMware 3.5 (tested through update 4). A number of cases opened with VMware support have led to no resolution other than "It's better in VMware 4.0." That remains to be seen.

The symptom is a server with plenty of free RAM deciding it needs to start hard swapping. In ESXTOP this is shown by the SWCUR column on the memory page. To see it, hit m after starting ESXTOP, then f to select fields, and choose J to show swap stats. You will get a display somewhat like the following:

9:10:22pm up 146 days 19 min, 180 worlds; MEM overcommit avg: 0.34, 0.34, 0.34
PMEM /MB: 131066 total: 800 cos, 1162 vmk, 50218 other, 78885 free
VMKMEM/MB: 128653 managed: 7719 minfree, 15492 rsvd, 112807 ursvd, high state
COSMEM/MB: 77 free: 541 swap_t, 541 swap_f: 0.00 r/s, 0.00 w/s
PSHARE/MB: 13048 shared, 3534 common: 9514 saving
SWAP /MB: 15435 curr, 1982 target: 0.02 r/s, 0.00 w/s
MEMCTL/MB: 5489 curr, 3376 target, 91201 max

NAME MEMSZ SZTGT SWCUR SWTGT SWR/s SWW/s
vmware-vmkauthd 5.62 5.62 0.00 0.00 0.00 0.00
bwt-as1 16384.00 7777.88 4725.95 0.00 0.01 0.00
dmg-ci 16384.00 12628.46 6986.35 1982.34 0.00 0.00
bw1-ci 16384.00 7782.90 2090.97 0.00 0.00 0.00
crd-ci 16384.00 15843.64 356.13 0.00 0.00 0.00

If you have a number of VMs with very large memory sizes that are usually idle (think development/QA servers), you can have a host with a memory state of "high" and plenty of free RAM (often 60GB+) that will suddenly decide it has a desperate need to hard swap memory out. This will occur once the "MEM overcommit" average hits about 0.5 (150%). It is easy to end up with VMs with GBs of RAM swapped out.
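
To make the 0.5 (150%) figure concrete, here is a rough Python sketch of the arithmetic. It assumes the ESXTOP overcommit value is total configured guest memory divided by the memory the VMkernel manages, minus 1; that definition is an assumption, not something stated in this post. The managed figure is the VMKMEM value from the sample output above, and the guest counts are invented:

```python
import math

def overcommit(vm_sizes_mb, managed_mb):
    """Assumed definition: (sum of configured guest memory / managed memory) - 1."""
    return sum(vm_sizes_mb) / managed_mb - 1.0

def max_guests_within(threshold, guest_mb, managed_mb):
    """Largest number of equally sized guests whose overcommit stays at or below threshold."""
    return math.floor((1.0 + threshold) * managed_mb / guest_mb)

MANAGED_MB = 128653   # "VMKMEM/MB: 128653 managed" in the sample display
GUEST_MB = 16384      # the 16 GB guests shown in the sample display

ratio_11 = overcommit([GUEST_MB] * 11, MANAGED_MB)
ratio_12 = overcommit([GUEST_MB] * 12, MANAGED_MB)
limit = max_guests_within(0.5, GUEST_MB, MANAGED_MB)

print(f"11 guests -> overcommit {ratio_11:.2f}")
print(f"12 guests -> overcommit {ratio_12:.2f}")
print(f"16 GB guests that fit at or below 0.5: {limit}")
```

Under those assumptions, a host like the one in the sample crosses the 0.5 mark when the twelfth 16 GB guest lands on it, whether or not any of those guests are actually busy.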

The VMs may actually behave fairly normally until they are actively used. A VMotion will also cause them to become unresponsive until the memory is paged back in, which can take quite a while. On Linux this will often appear to system administrators as a very high load average despite a lack of work.

Even more frustrating is that the algorithm that decides DRS placement doesn't care about crossing the 0.5 overcommit level and will make decisions that cause the host to begin paging out.

The only workarounds we currently have are caution when placing hosts in maintenance mode, and buying more memory.