Wednesday, December 29, 2010

Configuring System Center Operations Manager 2007 (SCOM) Monitoring of SQL Server Agent Job Status

As part of the migration from Operations Manager 2005 to System Center Operations Manager 2007, we are configuring centralized monitoring of SQL Server job status.  Previously, SQL Server jobs reported their status using notifications configured within each job.  Configuring the centralized job monitoring is a multi-step process:

  1. In the Authoring section of the SCOM console, select Management Pack Objects—Object Discoveries then search for “agent jobs”.  There are separate discovery types for SQL 2000, 2005 and 2008, and each needs to have an override created to allow discovery to occur.  Right click on the object and choose “Overrides—Override the Object Discovery—For all objects of class…”  Check the box for the Enabled parameter and set the override value to True.  Repeat for each version of SQL.  You will need to wait for the discovery to run which can take up to one day by default.  To verify the jobs have been discovered use the Monitoring section and select Microsoft SQL Server—SQL Agent—SQL Agent Job State.
  2. In the Authoring section of the SCOM console select Management Pack Objects—Monitor then search for “Last Run Status”.  Again, there are separate monitors for each version of SQL server.  Right click on the Last Run Status object, choose “Overrides—Override the Monitor—For all objects of class…”.  Check Alert on State and set the Override Value to “The monitor is in a warning state”.  Check Generates Alert and set the Override Value to “True”.
  3. In the Administration section of the SCOM console select Notifications—Subscriptions, right click and choose New subscription.  Provide a subscription name, a criteria that includes the Windows server you are monitoring (for example an alert raised by an instance with a specific name or an instance in a group), a subscriber (generally resolving to an email address) and a channel (generally SMTP for email).
  4. Test the Subscription by running a job that alternately succeeds and fails.  Wait 15 minutes between each run for testing purposes.

Troubleshooting:

  1. Check that the SQL job is succeeding or failing by viewing the job history in the SQL Management console (a query sketch follows this list).
  2. Check that the state is changing by checking the SQL Agent Job State in the SCOM Monitoring console.  Right click the job and choose the Health Explorer.  Navigate to Availability—Last Run Status and choose the State Change Events tab.  You should see previous state changes for the job.
  3. Remember to refresh the Management console to get current information.
  4. Remember that the alerts are not sent immediately.  The monitors have an interval parameter that specifies the number of seconds between checks of the job status.
  5. Changing the Alert Severity to Critical and Alert on State to “The monitor is in a critical state” does not appear to generate alerts (as of SCOM 2007 R2).
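
For troubleshooting step 1, a quick way to pull recent job outcomes without opening Management Studio is to query msdb.  This is just a sketch; the server name is a placeholder and it assumes the SQL 2008/2008 R2 sqlps snap-in (Invoke-Sqlcmd) is available:

# Placeholder server name; loads Invoke-Sqlcmd if the SQL 2008/2008 R2 snap-in isn't already loaded
Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue
Invoke-Sqlcmd -ServerInstance "YourSqlServer" -Database msdb -Query "
SELECT TOP 20 j.name, h.run_date, h.run_time,
       CASE h.run_status WHEN 0 THEN 'Failed' WHEN 1 THEN 'Succeeded'
                         WHEN 2 THEN 'Retry'  WHEN 3 THEN 'Canceled' END AS outcome
FROM msdb.dbo.sysjobs j
JOIN msdb.dbo.sysjobhistory h ON h.job_id = j.job_id
WHERE h.step_id = 0          -- step 0 is the overall job outcome row
ORDER BY h.instance_id DESC"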

Friday, December 10, 2010

SharePoint 2010 Content Databases -- Auto Update Statistics Setting

I noticed that our SharePoint 2010 content databases were being set to Auto Update Statistics Off, and Auto Create Statistics Off.  Since we normally run with Auto Update Statistics on, and that was standard in WSS 3.0 and SharePoint 2007, I did some checking.

It's the default for new content databases that are added or attached, and it's part of the upgrade process for databases moved over from a WSS 3.0 or SharePoint 2007 farm.  This is only the case for content databases.  The application and config databases default to your model configuration if created by SharePoint, and are not changed if you use DBA-created databases.
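
If you want to see where your databases currently stand, the query below lists both flags for every database on the instance.  A minimal sketch, assuming Invoke-Sqlcmd from the SQL 2008 R2 sqlps snap-in and a placeholder server name:

Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue
# 1 = on, 0 = off for each statistics option
Invoke-Sqlcmd -ServerInstance "YourSqlServer" -Query "
SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases
ORDER BY name"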


I could only find one official reference to this in Microsoft's documentation, which is curious to me since their explanation indicates that performance is negatively impacted if you forcibly re-set Auto Create Statistics to on (although no mention of Auto Update is made).  This Microsoft TechNet article contains the information as a passing reference.  I've copied the relevant text below.

Do not enable auto-create statistics on a SQL Server that is supporting SharePoint Server. SharePoint Server implements specific statistics, and no additional statistics are needed. Auto-create statistics can significantly change the execution plan of a query from one instance of SQL Server to another instance of SQL Server. Therefore, to provide consistent support for all customers, SharePoint Server provides coded hints for queries as needed to provide the best performance across all scenarios.

Thanks to rjgd80 for pointing out that my original post was not clear regarding the Microsoft article.

Thursday, December 2, 2010

SharePoint 2010 upgrade error -- failoverpartner not supported

Quick Description:  When patching or upgrading SharePoint 2010 on a farm configured with failover mirroring, the upgrade fails with the error
Failed to initiate the upgrade sequence
An exception of type System.ArgumentException was thrown.  Additional exception information:  Keyword not supported 'failoverpartner'.
To diagnose the problem, review the application event log and the configuration log located at:
PathToLogFile
Problem:  SharePoint patches fail when the databases are configured with a failover partner (database mirroring).

Solution:
You do not need to break database mirroring on the SQL Server.  Removing the failover configuration on the SharePoint side is enough, and it can easily be re-configured with a PowerShell command once the upgrade is finished.

Content databases and application databases can be changed through the Central Admin console, but the configuration database has to be updated with PowerShell.  To remove the offending failoverpartner setting, run the following script.
# Clear the failover (mirror) partner from every SharePoint database so the upgrade can run
get-spserviceinstance -all | foreach-object {
    if ($_.typeName -eq "Microsoft SharePoint Foundation Database")
    {
        foreach ($Database in $_.Databases) {
            if ($Database.FailoverServer) {
                write-host "Found mirrored database" $Database.Name "-- removing failover instance"
                # Passing $null clears the failoverpartner entry from the connection string
                $Database.AddFailoverServiceInstance($Null)
                $Database.Update()
                write-host "Successfully removed failover instance from" $Database.Name
            }
            else {
                write-host $Database.Name "none configured"
            }
        }
    }
}
To check that mirroring has been removed, run the following command
get-spdatabase | select name, failoverserver
Run the upgrade; once it's complete, you can re-configure the failover partner with the script from my post on configuring failover databases.
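
A minimal sketch of that re-configure step, adapted from the failover configuration script in my June post further down this page (the mirror instance name is a placeholder):

# Placeholder mirror instance -- replace with your own partner instance
$FailoverInstance = "MIRRORSQL\SHAREPOINT"
Get-SPDatabase | ForEach-Object {
    if ($_.FailoverServer -eq $null) {
        $_.AddFailoverServiceInstance($FailoverInstance)
        $_.Update()
    }
}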

Monday, November 29, 2010

DPM 2010 SharePoint Farm Backup Errors

Quick Description:  In the DPM 2010 console, a SharePoint 2010 Farm Replica shows the errors below:
Prepare for backup operation for SQL Server 2008 database ServerName\DatabaseName on ServerName has been stopped because this operation depended on another backup operation which failed or was cancelled. (ID 30200)
1 Database(s) has been removed from the SharePoint farm ServerName.  These databases are not part of the recovery point
    OR
 One or more databases seem to have been added to the SharePoint farm  ServerName.  Recovery point for the farm has been created without these databases.
Problem:
The primary error appears to occur whenever a farm configuration change is made.  Depending on whether you've added to or removed from the farm, the secondary error will be one of the two messages above.


Solution:
The error that occurs when a database is removed from SharePoint is solved by stopping protection and then re-creating it without deleting any data.  To do this, right click the protected farm and select Stop Protection.  Do not check the box to Delete replica on disk.  This bears repeating: verify that Delete replica on disk is not checked, and then double-check it; if you check the box you will irrevocably lose your existing backups.  Once protection is stopped, you can re-create the protection group.  DPM will detect that there is inactive protection for the previously protected farm, and initiate a consistency check before bringing it online.

The case in which a content database has been added to the SharePoint farm is much more interesting.  My DPM server protects both the SharePoint farm and the back-end SQL server that hosts the additional SharePoint databases.  Since the SharePoint farm protection only backs up the config and content databases, I have the SQL server set to auto-protect new databases coming online (generally nothing should change other than new content databases).  It seems that databases are auto-discovered before new site collections, and since the new content database is already protected in DPM, adding the site collections inside that content database to the farm replica fails.  The simple answer is to stop protecting the new content database on the SQL server and re-run the recovery point on the SharePoint farm.  Since new databases shouldn't just be cropping up unannounced on the SharePoint SQL server, you may want to turn off DPM auto-discovery for that server; if you need auto-discovery on, make sure DPM is happy after any new content databases are added.


This works, and isn't disastrous unless you miss the DPM error and end up without a current backup of your SharePoint farm, but it seems pretty fragile that the replica can't cope with changes to the SharePoint farm.  If you have a great SharePoint administrator like we do, you shouldn't be surprised by databases appearing out of thin air, but even with warning you will have to rebuild the replica if a content database is removed or moved to a new farm, which is at the least a hassle, and in practice just another moving part that can break.  I wish it were more robust.  Note to self:  keep remembering how much better DPM 2010 is than the previous version.

Monday, November 22, 2010

DPM 2010 Item level restore error -- Index out of Bounds

Quick Description:  DPM 2010 item level restore against SharePoint 2010 fails with error "Index was outside the bounds of the array" .

Scenario:
Landscape:
Multiple SharePoint 2010/Windows 2008 R2 web front ends/application servers
SQL Server 2008 R2 back end
DPM 2010 is protecting a SharePoint 2010 farm with Item Level Restore configured and green.
Additional SQL 2008 R2 utility server for the unattached content database restore

Errors: 
In the DPM console, the error below is reported.
    DPM was unable to export the item YourItemName from the content database YourContentDatabase. Exception Message = Index was outside the bounds of the array.. (ID 32017 Details: Unknown error (0x80131508) (0x80131508))
In the DPM Client Logs on the target Web Front End (default location C:\Program Files\Microsoft Data Protection Manager\DPM\Temp\WssCmdletsWrapperCurr.ErrLog) the error below is reported.
06FC    1AB4    date    Time    09    AppAssert.cs(114)    WARNING    Nearest Site Url should not be null or Empty
06FC    1AB4    date    Time    31    WSSCmdlets.cs(450)    WARNING    Caught Exception while trying to export Url [ItemRestoreURL] to File [temporarylocation\DPM_GUID\cmp\].
Solution:
This error appears to occur when more than one instance of the Central Administration site is running in the farm.
In the Central Administration site, under System Settings, Services on Server, check every server in the farm for the Central Administration service, and temporarily stop it on all but the server listed in DPM.
I can reliably reproduce this error on 3 SharePoint farms by starting the Central Administration site on more than one server.  This may not be the only condition that causes the error, though, so please let me know if you run into it and multiple Central Administration sites is not the cause.


Other Things to Check:
  • Check that the account you used to register your farm with DPM is, in fact, the Farm Account.  
    • In Central Administration go to Service Accounts.  In the Credential Management drop down select Farm Account, and verify you're using the account specified here.
      • DPM will let you build the replica with the content account, and the replica will look healthy until you go to restore and it fails with an Access Denied error.
      • To fix this you need to remove the replica from protection (I believe you can retain the protected data), re-register the servers with DPM, and re-create the protection group.  If you retained the protected data, you will need to perform a consistency check.
  • Although you can register all servers with DPM (and I do, because I worry about the recovery scenario if you lose the server that has registered the replica), the DPM server may complain that the Sharepoint VSS writer is running on multiple servers.  
    • It may be necessary to temporarily stop the SharePoint VSS writer on the SharePoint servers which do not own the replica in DPM.

 

DPM and SharePoint 2010 Item Level Restore -- The Temporary Staging SQL Server

In writing up a post on an error in DPM 2010 SharePoint Item Level Recovery, I started thinking about the role of the SQL server used as a temporary staging location during the restore. 

It seems to me, that for all intents and purposes, DPM's Item Level Recovery is really just the SharePoint 2010 Granular Restore/Unattached Content Database recovery integrated into the DPM console.  Russ Maxwell has a great walk through of how to do a granular restore without DPM.  DPM locates the recovery point of the content database, restores the whole database, and exports the file from the unattached content database, then re-imports it into the live site.  It's much less painful than manually locating the recovery point that contains the file/record, restoring the content database and using the item level recovery, since often the bulk of the work is in finding the restore point.  The DPM console lets you browse the recovery points to find the item before you go to the trouble of restoring the content database.

Restoring without a recovery farm requires 3 temporary locations:  a directory on the Web Front End to hold the restored item temporarily in cmp form, a temporary SQL server where the content database can be restored and, of course, a location on that database server's file system where the database file will temporarily live.  Although the primary database server can be used for the temporary restore of the unattached content database (the database is restored under a name in the format DPM_GUID), I'm still using a non-SharePoint, non-production SQL server for two reasons:
1.  If your content database is big, you'll take the I/O hit of restoring the whole database to the same disks that are serving up your production farm, and could potentially run your disk out of space (but you'd check that first) just to delete it when the item is restored.
2.  The term "Unattached Content Database" does not mean it isn't attached to SQL Server, it's just not attached to SharePoint. 
I'll be testing whether this works using SQL Express as the intermediate restore database.

Friday, November 19, 2010

Re-adding a rebuilt server to DPM 2010

Quick Description:  If a DPM client system irretrievably crashes, and is rebuilt/replaced with a system using the same computer account, DPM will not allow the old computer to be removed from the agent inventory, or reinstalled from the console.

Problem:  The original client agent is no longer available to perform an uninstall.  Attempting to uninstall results in the error
The agent operation failed because of a communication error with the DPM Agent Coordinator service on clientservername
The RPC server is unavailable (0x800706BA)
Since the rebuilt machine with the same computer account is online, DPM does not offer the option to remove the server from its catalog.
Attempts to reinstall the agent fail as the server already exists in DPM.

Solution:  Manually install the client agent on the rebuilt server and refresh it in the console.  Firewall rules for DPM communication are automatically created.  It seems like this should be more intuitive, since one of the reasons you protect a server with DPM is so that if it completely crashes, you can rebuild it and restore the data.

The command line to install the Agent on an x64 machine is
%Path to DPM install file Share%\DPMAgentInstaller_x64.exe yourdpmservername
Install files for 2010 for an x64 client are located by default at
c:\program files\microsoft dpm\dpm\protectionagents\RA\3.0.7696.0\amd64\DPMAgentInstaller_x64.exe

Tuesday, November 16, 2010

My Very First Argument with DPM 2010

Problem Quick Description:  DPM servers previously configured and working as cross-protection/DR protection for each other, start failing to connect with the error 
DPM failed to communicate with the protection agent on yourservername because access is denied. (ID 42 Details: Access is denied (0x80070005))
Situation:
DPM Servers configured to protect each other, start denying access and refusing to connect.  Servers are in the correct local groups (DPMDRTrustedMachines, DPMRADmTrustedMachines, DPMRADCOMTrustedMachines) on both the primary DPM server and the DR DPM server.

Work Around:
I don't have root cause on this yet, but not getting a good backup of your DPM Database is such a showstopper that I'm posting a cheesy work around.

To get back up and running, add the machine account of the DR DPM server to the local administrators group of the primary DPM server.  This is obviously not a good long-term solution, but in my situation it's less dangerous than running with the DPM databases unprotected (your mileage may vary).  When I've tracked down the permanent solution, I'll post it.
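
For reference, the group change is a one-liner from an elevated PowerShell prompt on the primary DPM server; the domain and server names below are placeholders, and note the trailing $ on the computer account:

# Placeholder names -- the DR DPM server's computer account gets the trailing $
net localgroup Administrators 'YOURDOMAIN\DRDPMSERVER$' /add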

Update on Failover Mirroring Testing SharePoint 2010/SQL 2008 R2

I took the opportunity to ask the SQLCAT guys at this year's SQLPASS about an issue with storage hangs I had been testing a couple of weeks ago.   In testing our SharePoint 2010 infrastructure, I ran into one scenario in which SQL server loses access to its underlying disks.  The mirror eventually takes over, but SharePoint does not recover until either the storage comes back online, or the primary is completely offline.

Since it was hard to say whether the issue is SQL Server, SharePoint, or the .NET provider, I talked to pretty much everyone.  While there isn't a definitive answer, it's an easy scenario to reproduce, so the webapps team said they would test it with the new "Always On" in the new version.

For the moment, I'm documenting it as a known issue requiring manual intervention to bring SharePoint back online.

Friday, November 5, 2010

Quick Response to SysPrep Question

We got a question on my (perhaps slightly snarky) article on how disappointing SQL 2008 R2 SysPrep is, regarding Windows 2008 R2 SysPrep.

Question from Anonymous is:
I'm trying to use sysprepped 2008R2 Express Editions so that each of my developers can get a local copy of SQL Server to test code as part of their workstation baseline.

The issue I have with it, is that even on a properly Prepared image, the CompleteImage phase doesn't appear to work during the "cmdlines.txt" portion of Windows Mini-Setup. Instead, the setup.exe silently dies, and the event log points to errors resolving the C Runtime library (which appears to be properly installed under WinSxS).

If the Sysprep functions do not work DURING sysprep then they're useless to me. If I, as an admin, have to log on to each workstation/server after loading the image, then there's no point in providing automation functions.

I'm passing on Benj's answer since he's the expert on Windows SysPrep.

Benj suggests setting it to autologin once, and running SQL Setup/Complete Image during the guirunonce phase.

I'll post my non-SysPrep steps for creating a SQL 2008 R2 VMware template soon.

Thursday, November 4, 2010

Quick Recap of DPM 2010/SharePoint 2010 config steps:

Seems like a lot of people are seeing the DPM MetaData error when getting SharePoint 2010 set up for item level restore.  I've posted an article specifically about resolving this error, but it presupposes that the initial configuration for DPM 2010 SharePoint protection has been completed.


I've posted the initial configuration steps below:


From a command window run as administrator on your SharePoint Web Front End, run the following:

pushd [YourDPM Directory default c:\program files\microsoft Data Protection manager\DPM\bin]
ConfigureSharePoint.exe -EnableSharePointProtection
[provide your farm service account user and password when prompted]

ConfigureSharePoint.exe -EnableSPSearchProtection
[provide your farm service account user and password when prompted]
"c:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\BIN\stsadm" -o registerwsswriter
Run "vssadmin list writers" and check for any VSS writers not in "stable" state.

Restart the following services on the SharePoint Web Front End (a PowerShell sketch follows the list):
  • Volume Shadow Copy Services
  • SharePoint VSS Writer
  • DPMRA
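
A small PowerShell sketch for bouncing all three at once; the service/display names below are assumptions, so confirm them with Get-Service on your front end before relying on it:

# Names are assumptions -- verify with Get-Service first
$targets = "Volume Shadow Copy", "SharePoint 2010 VSS Writer", "DPMRA"
foreach ($t in $targets) {
    $svc = Get-Service | Where-Object { $_.Name -eq $t -or $_.DisplayName -eq $t }
    if ($svc) { $svc | Restart-Service -Force }
    else { Write-Host "Service '$t' not found -- check the name" }
}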
Build the replica in DPM

Possible causes of problems with item level restore in addition to the metadata error:
  • One of the SharePoint databases is already protected by DPM -- 
    • Protect the SharePoint Farm before configuring database protection.  The config and content databases will then become unavailable for SQL Server protection.
  •  Item level restore is configured correctly but the catalog hasn't yet populated -- Follow these instructions to force the catalog to populate.

Configuring SharePoint 2010 Farm Monitoring with SCOM 2007 when missing sharepointMP.config file

Microsoft made management pack installation much easier in System Center Operations Manager 2007, but I recently had problems with the new automated process.
Installing the management packs directly from the catalog will appear to function fine, but a critical file, “sharepointMP.config”, will be missing. There are very few references on the internet to issues with the x64 version of Windows installing the file to the wrong location—this is not the solution. A search of the entire server will not return the file.
The solution turned out to be trivial. Download the management pack the old-fashioned way as an MSI from Microsoft and install it, noting the x86/x64 distinction if appropriate. Once installed, you will find the “sharepointMP.config” file in the install directory. Follow the instructions in the “Microsoft® SharePoint® 2010 Products Management Pack Guide” on Microsoft’s site to edit the file appropriately for your service account. Basically just change these two entries:
  • Association Account="SharePointService" (an account with rights to the farms)
  • Machine Name="spoint" (or some string that is in all your sharepoint server names)
Then run the "Configure SharePoint Management Pack" command in the action pane of the SharePoint Folder in SCOM. It will take a while for the data to appear once this is done. Microsoft suggested 30 minutes in their documentation.

Wednesday, November 3, 2010

Get list of mirroring timeouts

Just a quick script to get a list of mirroring timeouts for databases configured for mirroring.  SQL 2008's failover mirroring default is very quick to fail over, and I'd like it to wait out things like very high disk latency in the middle of the night (when storage is doing disk-intensive work like deleting snapshots) rather than failing over.


To check mirroring timeouts in a more human friendly format:

select d.name as databasename, m.mirroring_connection_timeout as timeout
from sys.databases d join sys.database_mirroring m
on d.database_id = m.database_id
where m.mirroring_guid is not null


Note:  when setting the mirroring timeout, setting it on one server sets it on both.

ALTER DATABASE [DBName] SET PARTNER TIMEOUT 20
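
To apply that timeout across every mirrored database on a principal in one pass, here's a PowerShell sketch; the server name is a placeholder and it assumes the SQL 2008 R2 sqlps snap-in for Invoke-Sqlcmd:

Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue
$server = "YourPrincipalServer"   # placeholder
$query  = "SELECT d.name FROM sys.databases d " +
          "JOIN sys.database_mirroring m ON d.database_id = m.database_id " +
          "WHERE m.mirroring_guid IS NOT NULL AND m.mirroring_role = 1"   # role 1 = principal
foreach ($db in (Invoke-Sqlcmd -ServerInstance $server -Query $query)) {
    Invoke-Sqlcmd -ServerInstance $server -Query "ALTER DATABASE [$($db.name)] SET PARTNER TIMEOUT 20"
}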

SQL 2008 -- Can't connect with management studio from the local machine

Brief Description:  On a SQL 2008 R2 machine running on Windows 2008 R2, when attempting to launch Management Studio logged in as a domain account with local admin privileges, SQL server refuses connection with access denied.

More Details:  
If the default admin of the SQL server is the local administrators group, members of the local admin group will be able to connect with remote copies of Management Studio, but can't connect locally when directly launching Management Studio.

Solution:
Right click Management Studio and choose "Run as administrator".  Maybe this should be intuitive, but it wasn't for me.  Since Management Studio is an administrative tool by default, it seems slightly off to me that I'd have to tell it I want to be an administrator.
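
If you'd rather launch it elevated from a prompt, something like this works; the path is the SQL 2008 R2 default and may differ on your install:

# Launch SSMS elevated; adjust the path if your install location differs
Start-Process "C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\Ssms.exe" -Verb RunAs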

I'm still getting used to the whole concept of Microsoft deciding that "su" is the new thing.  I'm not sure it's the thing I would have taken from *nix if I was going shopping for things windows should totally have...I'd maybe have gone with a native way to mount an ISO, but that's just me.

Disclaimer:
Connecting as a user account to do any real work isn't so much a best practice, since so many things end up owned by the account that created them (mirroring endpoints being owned by user accounts was an unpleasant surprise to me), but sometimes it's practical for troubleshooting, or for actions that should be audited by username, like shutting down a service.

Wednesday, October 27, 2010

Hotpatching: Great idea Microsoft, but a terrible implementation

While doing research on minimizing planned downtime within a datacenter, I re-examined the idea of hotpatching. Microsoft introduced this idea with Windows 2003 Service Pack 1 as part of a “reboot reduction initiative.”

Microsoft reboot reduction initiative

Hotpatching is part of the Microsoft reboot reduction initiative, which seeks to help minimize the need for a full system reboot after installing updates. Reducing reboots is important because IT departments in many organizations implement a time-consuming test cycle every time an update is installed and the system is rebooted. This results in loss of productivity and revenue to the organization until their system is fully verified and operational. Hotpatching allows customers to deploy important updates and patches in a timely, transparent manner without requiring a full system shutdown and restart. This reduces their rollout time.

The following examples demonstrate possible savings from reboot reduction:

  • Of the 22 updates that shipped for Windows Server 2003 RTM between April 2005 and August 2005, 15 of them required a reboot. Eight of these could have been hotpatched. This would have reduced the number of reboots by 53%.
  • Of the 14 updates that shipped for Windows Server 2003 Service Pack 1 (SP1) prior to August 2005, ten of them required a reboot. Four of these could have been hotpatched. This would have reduced the number of reboots by 40%.

Source: http://technet.microsoft.com/en-us/library/cc781109(WS.10).aspx

So how did it work out?

I did some research by using Google to find security bulletins that contained the word hotpatching and then refined those results. I found only a few bulletins that supported the switch. Here is an example of one search:

hotpatching -does-not-support-HotPatching site:http://www.microsoft.com/technet/security/bulletin

I would estimate less than 1% of patches released support hotpatching—which leaves me pretty disappointed given Microsoft’s example.

The Linux equivalent, ksplice, seems to have a much better track record—though I have no experience with its impact on system stability.

Update:

In an effort to find additional statistics showing that hotpatching was basically unused by Microsoft, I extracted one of the few hotfixes I could find that supported hotpatching (WindowsServer2003-KB917159-x86-ENU) using the /x switch. According to Microsoft’s documentation, a hotfix that supports hotpatching will contain a file with a “.hp” extension. Within the directory structure I did indeed find the hotpatching file “svr.hp.sys”.

I then extracted and searched through a collection of 200 post-Windows 2003 Service Pack 2 hotfix files for other filenames containing the ‘hp’ string.
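
In sketch form, the search looked something like this (the folder of extracted hotfixes is a hypothetical path):

# Look for the ".hp" marker files that indicate hotpatching support
Get-ChildItem -Path "C:\ExtractedHotfixes" -Recurse -Include "*.hp.*" | Select-Object FullName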

Number found: zero.

Monday, October 25, 2010

Infrastructure Testing: When the storage hangs, SQL 2008 R2 deals, SharePoint 2010 in a tizzy.

I've been testing worst case scenarios for our new 2010 SharePoint infrastructure.  It handles crash testing so elegantly that I'm amazed.  Almost all the usual tests, from graceful shutdown, to tests that are just plain mean, work flawlessly.  Whether it's stopping the services, shutting down the server, pulling the "plug", killing the network -- almost nothing fazes it.  The failover is lightning fast, and the services keep working.  From a browser, there's time to make a couple of http requests that fail before the database and SharePoint shake it off and just work again.

So far, I've only been able to create one really, really, ugly situation.

From my testing, the worst possible thing that could happen appears to be a storage hang.  We had to create the storage hang artificially.  It's maybe hard to imagine how you get into the situation where storage just hangs without triggering failover in an HA storage environment, but it's possible, and boy is it ugly to recover from.   SQL, admirably, manages to detect it needs to failover after a while, but SharePoint just faints.  To be fair, I don't know of any system that *loves* losing its storage.  In the same storage hang testing on Oracle DataGuard in high availability mode, database failover never happened, never mind the application surviving.  It seems the ugliest situations HA environments get themselves into are the ones in which the system isn't completely crashed, but is still unusable.

Scenario:
Landscape:
SQL:  Failover mirroring (SQL 2008 R2) hosted on Windows 2008 R2 VMs on VSphere 4.1.
SharePoint:  Load balanced web front ends SharePoint 2010 on Windows 2008 R2 on VSphere 4.1.   Out of rotation application servers for indexing, metadata, office automation etc.  Configured to understand failover mirroring.

Test:
Hang the vfiler hosting the primary SQL Database...wait.

Results
It takes SQL a while to figure out it should fail over to the mirror (with a 20-second mirroring timeout), but after 3 - 5 minutes databases were failed over and online.
SharePoint hangs until the database fails over, at which point it starts generating 503 errors and never seems to recover.

Things that don't bring SharePoint back online:
Restarting the admin service (with the theory that the admin service was perhaps keeping track of the failover state by keeping a tcp connection open to the server)
Restarting a web front end (same theory, but testing whether the web front ends themselves recognize the failover)
Running the test with the admin and config databases already failed over to the mirror (to test if it just becomes paralyzed without the config database).

Things that work to bring SharePoint back online:
Taking the primary SQL server offline (this is hard if VMware can't talk to the vmdk file).
Bringing the storage back online (As soon as the storage on the primary is back online, SharePoint recognizes it's no longer the primary, starts using the secondary, and is happy).


Theories:  
Everything from questions about the .net provider itself, to wondering if the virtual disk needs to return a hard error (new drivers in VSphere), to wondering if the primary is orphaned in some way (the witness and the secondary know it's not the primary anymore, but it doesn't).  Time to get Microsoft support involved.

Update:   Ruled out new VSphere SCSI drivers and disk timeout settings.  Changing the SCSI driver, disk timeouts, and mirroring timeouts affects how quickly SQL server mirroring fails over, but doesn't change the fact that SharePoint doesn't recover until the server goes offline or the storage comes back.

Thursday, October 21, 2010

Data Protection Manager 2010 and SharePoint 2010 metadata error

Brief Description:  After creating a replica of a SharePoint 2010 farm in Data Protection Manager 2010, the replica shows the error "Backup metadata enumeration failed".

Problem:  After running both the EnableSharePointProtection and EnableSPSearchProtection options with ConfigureSharepoint.exe, and creating a protection group in Data Protection Manager, the replica begins building without error.  When it completes, it gives a warning "Backup metadata enumeration failed.".  Item level restore is, therefore, not available.  All the VSS services are running on the front end servers and the database servers.  vssadmin list writers also shows that the SharePoint writer is stable and has no errors.

Solution:

This is a deceptive error because the VSS writers all show up as happily running without error.  In my case, the solution was to run the command
"c:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\BIN\stsadm" -o registerwsswriter
on the front end servers in the farm.  That's not a typo, it's wsswriter not vsswriter.  Odd since it's SharePoint 2010, but nonetheless, replicas are happily created, and with a forced catalog populate, I have item-level restore.

I've posted a quick recap of DPM 2010/SharePoint 2010 config steps and other gotchas with item level restore.  If this is an install that has never worked, before following the steps in this article, destroy the replica with the metadata error.  Then, follow the configuration steps, and re-create the replica.

If you get this error on an existing, previously working, DPM replica, force a new recovery point; there may have just been a timeout during the last recovery point.

Wednesday, October 13, 2010

Removing aliases from PowerShell to aid in learning and script stability

When using PowerShell 2.0, it is easy to rely on aliases for functionality.  The excellent Windows PowerShell 2.0 Best Practices book suggests removing all the aliases to ensure you don’t use them accidentally.  Unfortunately, the code provided will only work if you have changed your location to the alias: drive before running it.  The following code should work without the need to use set-location first:

foreach($alias in (get-alias | where-object {$_.options -notmatch 'readonly'})){remove-item alias:\$alias}

Tuesday, October 12, 2010

DPM 2010 so much better than 2007, just so much better

Since migrating to DPM 2010, I have an hour a day back.  Seven hours a week isn't trivial (it couldn't go a weekend day without being looked at either).  With DPM 2007, there were days where I spent much more than an hour coaxing it into behaving itself for the next day, and the alerts were both not tunable and too chatty to be useful.  With DPM 2007, even using the DPM powershell extensions to run scheduled contingent consistency checks and hourly runs of pruneshadowcopies, I was forever fiddling with things that went pear-shaped.  Still, centralizing the SQL backups was worth it, I just hoped it would get better in 2010, and it really, really did.

I moved the last DPM 2007 server out of the environment last month, and my life is suddenly *way* easier.

In DPM 2007, it just plain couldn't keep up with pruning and I frequently ended up with protection groups containing *way* more recovery points than I had specified.  Even with an hourly run of pruneshadowcopies, I frequently had to brute force it by kicking off multiple concurrent prunes just to get it to catch itself up.  On 2010, with the same retention and number of databases, I'm not even running pruning as a scheduled job, it just keeps up.

In DPM 2007, replicas were forever getting inconsistent with the slightest provocation.  If creating a recovery point timed out, boom, inconsistent, and the job to run a consistency check didn't always fix it.  More than once, I had to destroy and rebuild the replica (*so* not ideal) after multiple cumbersome, disk-intensive consistency checks failed to fix the problem.  DPM 2010 self-heals nicely.

I'm also not seeing agent crashes, even on 2003/2005 servers.

I have an hour a day back from fighting DPM problems, pretty nice!

Tuesday, October 5, 2010

DPM 2010 DPM database has exceeded the threshold limit

Quick Description: DPM 2010 shows a warning alert that the DPM database has exceeded the threshold limit even if you don't have a warning threshold set.

Problem: It seems that even if you have "Alert me when DPM database size reaches x GB" unchecked, it ignores it and generates an alert when the database size exceeds the default threshold of 1 GB.

Solution:
Cheesy, but in the alert description click on "Modify Catalog Alert Threshold size" and enable the alert, set it to a bigger number, and then disable the alert again.
As far as I can tell, it just plain ignores that you don't want to be alerted, and alerts on the default size anyway.

Sunday, September 26, 2010

Zeroing, defragmenting and thin provisioning existing VM disks

Here is a procedure I used to defragment, zero out free space and thin provision hundreds of Windows VMs after a recent migration to vSphere 4.1 from 3.5. I took advantage of a host with a number of local SSD drives to perform the maintenance. The procedure was as follows:

  • Create a list of VMs (in my case all of them of type Thick)
  • Copy utility files to each VM
  • Move each VM to the host with the local SSD drives using vmotion
  • Move each VM to the SSD datastore using svmotion
  • Defrag the VM's drives
  • Zero out any free space on the VM's drives to enable the best results on the conversion to thin disks
  • Move each VM back to its original datastore

I used a couple of Microsoft Sysinternals utilities, SDelete and PsExec, to accomplish this.
I also used a simple batch file "cleanhd.cmd" to accept the EULA and iterate through all hard drives:

reg add HKCU\Software\Sysinternals\SDelete /v EulaAccepted /d 1 /t REG_DWORD /f
FOR %%I IN (B C D E F G H I J K L M N O P Q R S T U V W X Y Z) DO @if exist "%%I:\System Volume Information" defrag %%I: -f -v
FOR %%I IN (B C D E F G H I J K L M N O P Q R S T U V W X Y Z) DO @if exist "%%I:\System Volume Information" sdelete -c %%I:


Here is the PowerCLI that handles the work:

# VMs with provisioned space under 100 GB (adjust the filter to taste)
$vms=Get-VM | Get-HardDisk | where-object {$_.ProvisionedSpaceGB -lt 100} | select-object -expandproperty Parent -unique | sort
$SSDDatastore=get-datastore 'datastore-ssd'
$SSDHost=get-vmhost 'ssdhost'
Foreach($vm in $vms){
    $originalds=($vm | get-datastore)
    # Copy the utilities into the VM's admin$ share (C:\Windows)
    copy-item c:\windows\sdelete.exe -destination (convert-path ('\\' + $vm.Name + '\' + 'admin$'))
    copy-item c:\windows\cleanhd.cmd -destination (convert-path ('\\' + $vm.Name + '\' + 'admin$'))
    # vmotion to the SSD host, then svmotion the disks to the SSD datastore as thin
    move-vm -destination $SSDHost -vm $vm
    $vm | Get-HardDisk | set-harddisk -datastore $SSDDatastore -storageformat thin
    # Defrag and zero free space inside the guest, then move the thin disks back home
    psexec ('\\' + $vm.Name) cleanhd.cmd
    $vm | Get-HardDisk | set-harddisk -datastore $originalds -storageformat thin
}

Finding all VMs with thick hard drives

After a recent migration to VMware 4.1 and NFS datastores, I decided to convert all the hard disks of the VMs to thin format using storage vmotion. PowerCLI has a new parent property that makes it easy to get a list of the VMs using thick disks in a single line:

Get-VM | Get-HardDisk | where-object {$_.StorageFormat -eq 'Thick'} | select-object -expandproperty Parent -unique | sort

Monday, August 9, 2010

Creating a WinPE 3.0 bootable CDROM for extending the C partition on VMware

VMware makes it very easy to grow the size of an underlying disk. With Windows 2008 R2 you can extend the file system online. With Windows 2003 you can extend the file system of any drive except the system drive (usually the C drive) online using diskpart extend. To extend the system drive there are a number of work-around methods, such as attaching the drive to a second VM.

I decided to create a WinPE 3.0 bootable CDROM ISO to use as an alternative method--partly to play with WinPE 3.0. I wanted to include the pvscsi drivers from VMware 4.1 and use a Windows 2008 R2 base. It turned out to be pretty straight-forward.

First download the Windows Automated Installation Kit software from Microsoft and install it on a VM that has VMware 4.x tools installed. The kit is identical for Windows 7 and 2008 R2. You can use it on a 32-bit or 64-bit version of Windows. I tried out both successfully but stuck with the 64-bit version.

Here is a script that you can run that will build the ISO. Run it in an elevated "Deployment Tools Command Prompt" after you install the kit. It will copy the VMware SCSI drivers from the VM and create an ISO that upon boot will extend the C drive and shutdown the VM. Depending on the version of Windows and VMware tools you may not have both the pvscsi and scsi drivers, but if there is no specialized driver the included Windows driver should work.

  • set targetdir=c:\winpe
  • copype.cmd %PROCESSOR_ARCHITECTURE% %targetdir%
  • Copy %targetdir%\winpe.wim %targetdir%\ISO\Sources\boot.wim
  • xcopy "%ProgramFiles%\vmware\VMware Tools\Drivers\pvscsi\*.*" %targetdir%\drivers\pvscsi\*.* /e
  • xcopy "%ProgramFiles%\vmware\VMware Tools\Drivers\scsi\*.*" %targetdir%\drivers\scsi\*.* /e
  • Dism /Mount-WIM /WimFile:%targetdir%\ISO\Sources\boot.wim /index:1 /MountDir:%targetdir%\mount
  • Del %targetdir%\ISO\boot\bootfix.bin
  • echo select volume=c > %targetdir%\mount\Windows\System32\diskpart.ini
  • echo extend noerr >> %targetdir%\mount\Windows\System32\diskpart.ini
  • echo list volume >> %targetdir%\mount\Windows\System32\diskpart.ini
  • echo [LaunchApps] > %targetdir%\mount\Windows\System32\winpeshl.ini
  • echo diskpart.exe, "/s diskpart.ini" >> %targetdir%\mount\Windows\System32\winpeshl.ini
  • echo wpeutil.exe, shutdown >> %targetdir%\mount\Windows\System32\winpeshl.ini
  • Dism /image:%targetdir%\mount /Add-Driver /Driver:%targetdir%\drivers /recurse
  • dism /unmount-wim /mountdir:%targetdir%\mount /commit
  • oscdimg -n -b%targetdir%\etfsboot.com %targetdir%\ISO %targetdir%\winpe_%PROCESSOR_ARCHITECTURE%_extend.iso

Thursday, August 5, 2010

SQL 2008 R2 Sysprep slightly worse than nothing

I'm just going to say it, I don't see a use for SQL 2008 R2 SysPrep.

I'm generally so happy with SQL 2008R2, people often pretend they aren't with me when I get started. Really, if I want to look vaguely normal, I have to be seen with people with an even more fanatical dedication to their product like Mac users, or our SharePoint administrator. That's why it's slightly painful (although vaguely reassuring as regards my sanity), that I'm super disappointed with the much touted SysPrepability of SQL 2008 R2.

Really, as far as I can tell, it's slightly worse than nothing at all. You can only prep the database engine, reporting services, and the browser (not even the client tools and SSIS!). You can tell the prep install all you want that you want the base instance install to go to a location other than c:\program files, but it apparently knows better. It not only defaults back to c:\program files, but greys out the ability to change it once you run the complete install. It knows better than you what you want, even if you explicitly told it what you want. As far as I can tell, you can't even set the browser service to automatic, or set a default data and log location in the SysPrep.

The Microsoft article on SQL SysPrep is here

The only useful thing I can see is that it requires you to copy the install files locally so that they're already included in a template (but, surprisingly, I'm actually capable of copying files to a template anyway). I also find it weird that the SysPrep comes up as evaluation edition until it's fully installed (this is noted in the SysPrep article).

Just a big bummer, back to old school template unattended installs for me. If anyone has cool ways to make SysPrep fancy or, you know, useful, I'd love to hear about them.

Saturday, July 31, 2010

Simple Multistore vFiler setup script

Some of the setup commands are not well documented for creating NetApp Multistore vFilers. Here is an example setup that can be modified and pasted into an ssh/putty window to create a vFiler for use with VMware.

In the following code:
Replace AGGREGATENAME with aggregate name
Replace ROOTVOL with vfilername without dash
Replace vfilerName with vfilername with dash
Replace FLEXVOL with data volume name
Replace PASSWORD with password
Replace IPADDR with IP Address
Replace NETMASK with mask in the form a.b.c.d (use periods as separators)
Replace INTERFACE with interface (e.g. vif0-101)
Replace NISSERVER1 with IP Address
Replace NISSERVER2 with IP Address
Replace DNSSERVER1 with IP Address
Replace DNSSERVER2 with IP Address
Replace IPSPACE with IPSpaceName
Replace ADMINHOST with IP Address
  • vfiler context vfiler0
  • vol create ROOTVOL -s volume AGGREGATENAME 20g
  • vol create FLEXVOL -s none AGGREGATENAME 3t
  • vol options FLEXVOL no_atime_update on
  • snap sched FLEXVOL 1 1 0
  • snap reserve FLEXVOL 0
  • snap autodelete FLEXVOL commitment try trigger volume target_free_space 5 delete_order oldest_first
  • snap autodelete FLEXVOL on
  • vol options FLEXVOL try_first volume_grow
  • vol autosize FLEXVOL -m 4T -i 100g on
  • vfiler create vfilerName -n -s IPSPACE -i IPADDR /vol/ROOTVOL /vol/FLEXVOL
  • vfiler disallow vfilerName proto=cifs proto=rsh proto=iscsi
  • vfiler run vfilerName setup -e INTERFACE:IPADDR:NETMASK -d itwalkthru.com:DNSSERVER1:DNSSERVER2 -a ADMINHOST -p PASSWORD
  • vfiler run vfilerName secureadmin setup -q ssh 768 512 768
  • vfiler run vfilerName options nis.domainname nis.itwalkthru.com
  • vfiler run vfilerName options nis.enable on
  • vfiler run vfilerName options nis.group_update.enable on
  • vfiler run vfilerName options nis.group_update_schedule 1
  • vfiler run vfilerName options nis.netgroup.domain_search.enable on
  • vfiler run vfilerName options nis.netgroup.legacy_nisdomain_search.enable on
  • vfiler run vfilerName options nis.servers NISSERVER1,NISSERVER2
  • vfiler run vfilerName options nis.slave.enable off
  • vfiler run vfilerName exportfs -p rw=ADMINHOST,root=ADMINHOST /vol/FLEXVOL
  • vfiler run vfilerName exportfs -p rw=ADMINHOST,root=ADMINHOST /vol/ROOTVOL

Wednesday, July 28, 2010

DPM VSS error

Problem: On attempting recovery in DPM 2007, recovery fails with the error

The VSS application writer or the VSS provider is in a bad
state. Either it was already in a bad state or it entered a bad state during
the current operation. (ID 30111)


One Solution: This is just a quick one to rule out before looking at the more complicated things that can go wrong with the VSS provider.

Verify that the restore isn't trying to write to files that already exist on the target server.

Seems to me that perhaps a more descriptive error message...something like...say..."File already exists" would fit the bill here, but absent that, I'm loving how much better behaved DPM 2010 is.

Wednesday, June 9, 2010

Removing a disk from a powered on VM in VMware

Here is my procedure for removing a disk from a powered on VM on VMware 3.5. The procedure was tested on a Windows 2003 SP2 32-bit VM but should be applicable to other operating systems. This is unsupported, but I have not encountered any problems.

  1. Remove drive letter/mount point in Windows Disk Manager
  2. Right click on disk in device manager and choose uninstall (if you skip this step the plug and play manager will complain about unexpected removal)
  3. Connect to host using ssh/putty
  4. Get VMID with vimsh -ne "vmsvc/getallvms"
  5. Get SCSI Controller Number and Disk from properties of disk in VC
  6. vimsh -ne "vmsvc/device.diskremove vmid scsiControllerNumber scsiUnitNumber deleteFile"
Example:

vimsh -ne "vmsvc/device.diskremove 9999 scsi0 4 n"

Listing NetApp Snapshots using PowerShell

NetApp has recently released a supported PowerShell module for managing Data OnTap.

You need to unzip and import the module
import-module DataOnTap

I'm just getting started with it, but I've noticed a few inconveniences. One of the first tasks I would like to do is list all snapshots older than a certain date.
get-navol myvolname | get-nasnapshot

will return a list of snapshots on a given volume, but unfortunately the time is stored as an integer. A little playing around showed that this will provide the correct information:

Get-NaVol myvolume | Get-NaSnapshot | select Name, @{Name="AccessTime"; Expression = {([datetime]"1/1/1970").AddSeconds($_.AccessTime).ToLocalTime()}}

It is possible to find snapshots older than one week using a command like:
Get-NaVol | Get-NaSnapshot | select Name, @{Name="AccessTime"; Expression = {([datetime]"1/1/1970").AddSeconds($_.AccessTime).ToLocalTime()}} | where-object {$_.AccessTime -lt (Get-Date).AddDays(-7)}

Hopefully version two of the module will use native data types to make this all easier.
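
In the meantime, a small helper keeps the epoch conversion in one place; this is just a refactoring of the expression above:

# Convert Data OnTap epoch seconds to a local [datetime]
function Convert-NaEpochTime {
    param([long]$Seconds)
    ([datetime]"1/1/1970").AddSeconds($Seconds).ToLocalTime()
}
Get-NaVol | Get-NaSnapshot |
    Where-Object { (Convert-NaEpochTime $_.AccessTime) -lt (Get-Date).AddDays(-7) } |
    Select-Object Name, @{Name="AccessTime"; Expression={ Convert-NaEpochTime $_.AccessTime }}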

Monday, June 7, 2010

DPM PowerShell: Force a Sharepoint Catalog to populate

Brief Description: After creating a SharePoint 2010 protection group in DPM 2010, the item level restore capability isn't browseable for up to 24 hours.

Problem: It's difficult to verify that your SharePoint protection group is configured correctly for item level restore if you can't actually browse it in the recovery pane; you could waste a lot of days making changes and waiting 24 hours to see if they were the right ones.

Solution:
To force the catalog to populate you can run the following powershell command from the DPM PowerShell commandline (thanks calebfinley for noting the need for clarification).



Get-ProtectionGroup yourdpmservername |get-datasource |where-object {$_.type -like "*sharepoint*"} | start-createcatalog

IsSharepointFarmProtected appears to be the only property that indicates the protection group is a sharepoint farm.

This can be re-run on an existing group, and can be run on a protection group which contains both a sharepoint farm and other objects like databases. The -like match on the object type is inelegant, but I'm still working out how to get it to compare against the whole string.

Wednesday, June 2, 2010

SharePoint PowerShell: Finding and configuring databases for failover mirroring in SharePoint.

First, I need to take a second to be an over the top fangirl about how great the failover mirroring is in sharepoint 2010 w/sql 2008 (R2 in our case). Beautiful, amazing, etc. Lightning fast failover not just from the DB side, but from the app side. We're only running a small test farm (compared to our much larger production infrastructure), and we haven't done serious load testing yet (so I hope I won't come back disappointed in a couple of weeks). But seriously, just in our "I hard crash the SQL server while our SharePoint administrator messes around adding items/clicking around" we're seeing one dropped request and no data loss (obviously you can't take messing around testing to the bank). It's synchronous mirroring so I suppose it would be shocking if it did lose data but still, so nice. Way faster failover than anything I've seen on SQL and since we don't have RAC, faster than I've ever seen Oracle failover either. Just gorgeous. Really, I'm not doing it justice. It's so pretty I want to buy it things.

Configuration of failover mirroring from the sql side is standard synchronous with witness. Initial SharePoint configuration is easy with the powershell interface. There are a lot of great articles out there. This technet article publishes the script.

You can run the following command to get the list of databases and their failover servers, but if you have a whole bunch of content databases it's easy to miss or misidentify one.
get-spdatabase | select name, failoverserver

Again, this is cribbed from the technet script for configuring database mirroring the first time. My hack of the meat of that script, which checks for and configures unconfigured databases, is below.


Param([string]$FailoverInstance = $(Read-Host "Enter the Mirror (Partner) SQL Instance (like server\instance)"))
# Walk every SharePoint database and add the failover instance where one isn't already set
get-spserviceinstance -all | foreach-object {
    if ($_.typeName -eq "Microsoft SharePoint Foundation Database")
    {
        foreach ($Database in $_.Databases){
            if ($Database.FailoverServer -eq $null) {
                write-host "Found unconfigured database -- setting failover instance on" $Database.Name "to" $FailoverInstance
                $Database.AddFailoverServiceInstance($FailoverInstance)
                $Database.Update()
                write-host "Successfully updated failover partner on" $Database.Name "to" $FailoverInstance
            }
            else {
                write-host $Database.Name "already configured."
            }
        }
    }
}
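
If you save the block above as a script file (the file name below is just an example), running it from the SharePoint 2010 Management Shell and verifying the result looks like this:

# Hypothetical file name -- pass the mirror instance as the parameter
.\Set-SPFailoverPartner.ps1 -FailoverInstance "MIRRORSQL\SHAREPOINT"
# Confirm every database now lists the failover server
Get-SPDatabase | Select-Object Name, FailoverServer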



Thursday, May 6, 2010

Enabling SQL Ports on 2008R2

Quick Description: "netsh firewall" is deprecated on 2008 R2. A command line with the old syntax will work but will generate a warning that it has been deprecated.


Solution:
The new syntax is "netsh advfirewall firewall add rule name=<rule name> dir=<in|out> action=<allow|block> protocol=<TCP|UDP> localport=<port>"

This Microsoft article has a good comparison of the old syntax vs new syntax

Below are some handy command lines for enabling ports in a SQL 2008 R2 auto-install (now with added sysprep goodness)

REM Core SQL Services

REM Default Instance
netsh advfirewall firewall add rule name="SQLServer" dir=in action=allow protocol=TCP localport=1433

REM Dedicated Admin Connection
netsh advfirewall firewall add rule name="SQL DAC" dir=in action=allow protocol=TCP localport=1434

REM Browser Service
netsh advfirewall firewall add rule name="SQL Browser" dir=in action=allow protocol=UDP localport=1434
 


REM Non Core Components
REM Dedicated Admin Connection
netsh advfirewall firewall add rule name="SQL DAC" dir=in action=allow protocol=TCP localport=1434

REM Mirroring EndPoint
netsh advfirewall firewall add rule name="Mirroring EndPoint" dir=in action=allow protocol=TCP localport=5022

REM Service Broker
netsh advfirewall firewall add rule name="SQL Service Broker" dir=in action=allow protocol=TCP localport=4022

REM Enable TSQL Debugger -- note, this is the same port as RPC
netsh advfirewall firewall add rule name="T-SQL Debugger" dir=in action=allow protocol=TCP localport=135

REM Browser service for Analysis Services
netsh advfirewall firewall add rule name="SQL Browser for Analysis Services" dir=in action=allow protocol=TCP localport=2382

REM Analysis services Default Instance
netsh advfirewall firewall add rule name="Analysis Services" dir=in action=allow protocol=TCP localport=2383

REM HTTP/HTTPS for reporting services
netsh advfirewall firewall add rule name="HTTP Reporting Services" dir=in action=allow protocol=TCP localport=80
netsh advfirewall firewall add rule name="HTTPS Reporting Services" dir=in action=allow protocol=TCP localport=443

Thursday, March 18, 2010

SQL 2008 x64 Integration Services and Excel

Brief Description:
On x64 SQL 2008 Integration Services Jobs importing Excel files can be created but fail once imported into Integration Services.

Symptom:
The dtsx job can be created, and if you're creating the job with the import data wizard "Execute Immediately" will also work. Once the job is imported into Integration Services and executed as a job it will fail with the error

"One or more component failed validation. End Error Error: 2010-03-18
12:33:23.93 Code: 0xC0024107 "
This also happens on 2005 x64 with the error
"SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "SourceConnectionExcel" failed with error code 0xC0202009."
Solution:

On 2008 the solution is so simple I cursed the time I spent trying to follow some of the solutions offered on messageboards.
1. Make sure you have the complete set of management tools installed (you should have an x86 mssql home location in addition to the default x64; this is specified on install).
2. In your job when adding the SSIS package select "Use 32-bit runtime" on the "Execution Options" tab for the step.
See Microsoft's KB article on the subject; it took me a while to figure out that the above 2 steps are essentially what the article says to do.

On 2005, it's just a little bit more complicated.
1. Again, make sure you have the complete set of management tools installed
2. Instead of adding your package as an SSIS job add it as "Operating System(CmdExec)"
3. In the Command box, specify the location of the 32-bit dtexec.exe.
The default path is "<x86 Program Files>\Microsoft SQL Server\90\DTS\Binn\dtexec.exe".
For an SSIS package that lives in Integration Services the command should be similar to

"<x86 Program Files>\Microsoft SQL Server\90\DTS\Binn\dtexec.exe" /SQL "\<PackageName>" /SERVER <ServerName> /MAXCONCURRENT " -1 " /CHECKPOINTING OFF /REPORTING EW

For a file based .dtsx package the command should be similar to

"<x86 Program Files>\Microsoft SQL Server\90\DTS\Binn\dtexec.exe" /FILE "<PathToPackage.dtsx>" /SERVER <ServerName> /MAXCONCURRENT " -1 " /CHECKPOINTING OFF /REPORTING EW

Monday, February 22, 2010

Query to map basic host to cluster information VMware 3.5

Brief Description: A quick way to show vital stats on hosts in a cluster.

Problem: The virtual center database is kind of ugly in the way it maps performance statistics to hosts (it appends a host- to the beginning of the ID, stores the clustername one level up from the host etc).

Quick Script:

The script is intended to map the basic stats for all hosts in a cluster. It's part of a bigger reporting services dashboard I'm building out, but I think it's helpful on its own. As always please feel free to post corrections.

--Map hosts to basic cluster information

declare @ClustertoHost table
(
hostname nvarchar(100),
Clustername nvarchar(50),
Clusterid int,
hostid int,
hoststatid nvarchar(50),
HostCPU bigint,
HostMemory bigint
)

insert into @Clustertohost (hostname, clustername, clusterid, hostid, hoststatid, HostCpu, HostMemory)

select h.name, c.name, c.id, h.id, 'host-' + CONVERT (Varchar, h.ID) AS statID, (convert( bigint, vh.cpu_hz) * vh.cpu_core_count)/1024/1024/1024 as CPUTotal, (convert( bigint, vh.mem_size))/1024/1024/1024 as Memory
from
vpx_entity h join vpx_entity c
on h.parent_id = c.id
join vpx_host vh on
vh.id = h.id

where h.type_id = 1
select * from @ClustertoHost

Friday, February 19, 2010

Simple(r) query to get paged out systems on VMware

Brief Description: On VMware, you want to query for VMs swapping out.

Problem:
I posted a longer script for checking VMware tools and host build versions when looking for VMs that have been swapping out in the last hour. It's a little clunky if you just need to know which machines are swapping right now.

Solution:

The smaller query below just gets a list of VMs swapping out right now.


select h.name, MAX(s.sample_time) as last_sample, avg(s.stat_value) as avg_swapped
from vpx_entity h
join vpxV_hist_stat_daily s on s.entity = ('vm-'+ CONVERT (Varchar, h.ID))
where s.stat_name = 'swapped' and s.stat_value > 1
group by h.name
order by h.name

Thursday, February 18, 2010

SQL 2008 reporting services headache.

Brief Description: Reporting Services 2008 failing with error rsServerConfigurationError

Problem Description:
Reporting Services 2008 is stingy with error message information, and without more detail, tracking down this error is like looking for a needle in a stack of needles. Fuller error detail is available in %SQL Program Dir%\MSRS10.MSSQLSERVER\Reporting Services\LogFiles. The long message associated with my issue was "The report server has encountered a configuration error. No DSN present in configuration file."
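
If you just want the most recent error lines without opening each log by hand, a quick PowerShell sketch like this works (the path assumes a default 2008 instance; adjust it to match yours):

# Show lines mentioning errors from the newest Reporting Services log file
$logDir = "C:\Program Files\Microsoft SQL Server\MSRS10.MSSQLSERVER\Reporting Services\LogFiles"
Get-ChildItem $logDir -Filter *.log |
    Sort-Object LastWriteTime |
    Select-Object -Last 1 |
    Get-Content |
    Select-String "error"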

Solution:
This excellent post from the blog of Jessica M. Moss solved the issue for me. So, Thanks Jessica!

Thursday, February 4, 2010

You've been using VMware too long when

The other day I had an emergency request to stand up a SharePoint 2010 beta server immediately. It needed Windows 2008 R2 for various reasons, and I realized that my VMware 3.5U4 clusters wouldn't work (previous experience had demonstrated that it really didn't work). I thought about my options, which didn't include upgrading my production hosts, and finally decided that I would install VMware 3.5U5 or VMware 4.0U1 on an old VMware server (it was tiny: only 16 CPUs and 64 GB of RAM). For various reasons having to do with network and storage connectivity, I was unhappy with that solution. Then it occurred to me...

You can install Windows directly on physical hardware.

I've been using VMware too long!

Using PowerShell with NetApp to Remove Snapshots

Most VMware administrators have learned to use PowerShell for scripting. VMware has done a wonderful job with its PowerCLI to embrace this scripting language.

VMware is often implemented on NetApp due to some wonderful advantages with NFS, de-duplication, VDI and Snapshots. Thanks to work in a codeplex project known as PoshOnTap, it is possible to manage NetApp OnTap with the same ease.

A common problem is old snapshots that need to be discovered and deleted. It is simple with PoshOnTap to find all snapshots older than one week:

import-module PoshOnTap
connect-naserver -filer toaster -Credential (get-credential)
Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)}



Want to delete those snapshots? You can append a " | remove-nasnapshot" to the above command:

Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)} | remove-nasnapshot
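
If you'd rather see what is about to go before piping to remove-nasnapshot, a quick review pass (a sketch using the same property names as above) looks like:

# List the week-plus-old snapshots first so you can eyeball them before deleting
Get-NaVol | Get-NaSnapshot |
    where-object { $_.Created -lt (Get-Date).AddDays(-7) } |
    select-object Volume, Snapshot, Created |
    sort-object Created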

I like to put a delay between snapshot deletions, so I use a small loop that sleeps for 10 minutes between each deletion:

import-module PoshOnTap
connect-naserver -filer toaster -Credential (get-credential)

$snapshots=(Get-NaVol | Get-NaSnapshot | where-object {$_.Created -lt (Get-Date).AddDays(-7)})

# Start-Sleep takes seconds, so 600 gives a 10 minute pause between deletions
foreach ($s in $snapshots) {remove-nasnapshot -name $s.Snapshot -volume $s.Volume; start-sleep 600}



PoshOnTap has a wealth of management functions for NetApp OnTap. The work done by the coders is exceptional and greatly appreciated!

Using PowerShell to find guests with memory swapped out

I often need to determine whether any VMs have memory swapped out (SWCUR in ESXTOP). I've been using PowerShell and the PowerCLI to generate the list because it is so simple.

Get-VM | where-object {$_.powerstate -eq "PoweredOn"} | get-stat -Realtime -Stat "mem.swapped.average" -MaxSamples 1 | Select Entity, Value | where-object { $_.Value -gt 0 } |sort-object -property Entity

You can export the results to a CSV by appending " | Export-Csv" (with an output path) to the end of the statement.
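
For example (the output path here is just an illustration):

# Same query, written out to a CSV file
Get-VM | where-object {$_.powerstate -eq "PoweredOn"} |
    get-stat -Realtime -Stat "mem.swapped.average" -MaxSamples 1 |
    Select Entity, Value | where-object { $_.Value -gt 0 } |
    sort-object -property Entity |
    Export-Csv C:\temp\swapped-vms.csv -NoTypeInformation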

VMware Swapping and Overcommit

The VMware clusters I maintain are worked hard. The servers are large (128 GB or 256 GB currently) and we use memory overcommit pretty aggressively. Usually that is not a problem, but we have found some limits in VMware 3.5 (tested through Update 4). A number of cases opened with VMware support have led to no resolution other than "It's better in VMware 4.0." That remains to be seen.

The symptom is a server with plenty of free RAM deciding it needs to start hard swapping. In ESXTOP this is shown by the SWCUR column on the memory page. To see it, hit m after starting ESXTOP, then f to select fields, and choose J to show swap stats. You will get a display somewhat like the following:

9:10:22pm up 146 days 19 min, 180 worlds; MEM overcommit avg: 0.34, 0.34, 0.34
PMEM /MB: 131066 total: 800 cos, 1162 vmk, 50218 other, 78885 free
VMKMEM/MB: 128653 managed: 7719 minfree, 15492 rsvd, 112807 ursvd, high state
COSMEM/MB: 77 free: 541 swap_t, 541 swap_f: 0.00 r/s, 0.00 w/s
PSHARE/MB: 13048 shared, 3534 common: 9514 saving
SWAP /MB: 15435 curr, 1982 target: 0.02 r/s, 0.00 w/s
MEMCTL/MB: 5489 curr, 3376 target, 91201 max

NAME MEMSZ SZTGT SWCUR SWTGT SWR/s SWW/s
vmware-vmkauthd 5.62 5.62 0.00 0.00 0.00 0.00
bwt-as1 16384.00 7777.88 4725.95 0.00 0.01 0.00
dmg-ci 16384.00 12628.46 6986.35 1982.34 0.00 0.00
bw1-ci 16384.00 7782.90 2090.97 0.00 0.00 0.00
crd-ci 16384.00 15843.64 356.13 0.00 0.00 0.00

If you have a number of VMs with very large memory sizes that are usually idle (think development/QA servers), a host with a memory state of "high" and plenty of free RAM (often 60 GB+) can suddenly decide it has a desperate need to hard swap memory out. This happens once the "MEM overcommit" average hits about 0.5 (150%). It is easy to end up with VMs that have GBs of RAM swapped out.

The VMs may actually behave fairly normally until they are actively used. A VMotion will also cause them to become unresponsive until the memory is paged back in, which can take quite a while. On Linux this often shows up to system administrators as a very high load average despite a lack of actual work.

Even more frustrating, the algorithm that decides DRS placement doesn't care about crossing the 0.5 overcommit threshold and will make placement decisions that cause a host to begin paging out.

The only workarounds we currently have are caution when placing hosts in maintenance mode and buying more memory.
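
Since the 0.5 overcommit average seems to be the tipping point, it is worth keeping an eye on how far each host is overcommitted. Here is a rough PowerCLI sketch (it assumes you are already connected to VirtualCenter and simply compares the configured RAM of powered-on VMs against host physical RAM, which is not exactly ESXTOP's number but is close enough for a warning light):

# Rough per-host memory overcommit: configured RAM of powered-on VMs vs. host physical RAM
Get-VMHost | foreach-object {
    $vmMB = ($_ | Get-VM | where-object { $_.PowerState -eq "PoweredOn" } |
        Measure-Object -Property MemoryMB -Sum).Sum
    if (-not $vmMB) { $vmMB = 0 }
    New-Object PSObject -Property @{
        Host       = $_.Name
        HostMemGB  = [math]::Round($_.MemoryTotalMB / 1024, 1)
        VMMemGB    = [math]::Round($vmMB / 1024, 1)
        Overcommit = [math]::Round(($vmMB / $_.MemoryTotalMB) - 1, 2)
    }
} | sort-object Overcommit -Descending | format-table Host, VMMemGB, HostMemGB, Overcommit -AutoSize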

Thursday, January 7, 2010

Query to Get Paged Out VMs from Virtual Center Database

Brief Description: Query the Virtual Center database for guests with swapped out memory.

Solution:

Just a quick, handy script. In our implementation we've seen swap out correlate with discrepancies between the guest tools version and the host build, so it's useful to pull back tools and build info when querying for swap out. I use the (possibly overly convoluted; corrections are always welcome) query below to find VMs that have been swapping in the last hour, their tools version, and the build of their parent host.

It's written with a table variable to make it easier to change which columns are pulled back. Our VMware database records stat sample times in UTC (hence the +6 hour adjustment); I'm unclear whether that's always the case.
Use [YourVMwareDB]
go
declare @tblToolsVer table
(
strVMName nvarchar(255),
pageout int,
intToolsVer int,
strStatus nvarchar(255),
strHostName nvarchar(255),
intHostVer int
)

insert into @tblToolsVer(strVMName, pageout, intToolsVer, strStatus, strHostName, intHostVer)

select ev.name as VMName, CONVERT(decimal(12,2),ROUND(AVG(s.stat_value)/1024,2)) as swapped, vm.tools_version,
case vm.tools_status
when 0 then 'Not Installed'
when 1 then 'Not Running'
when 2 then 'Out of Date'
when 3 then 'OK'
else 'Indeterminate' -- a simple CASE never matches WHEN NULL, so handle the null/unknown case with ELSE
end as Tools_Status,
eh.name as Host_Name, ho.Product_Build as Host_Build
from vpx_vm vm
join vpx_entity ev on ev.id = vm.id
join vpx_entity eh on eh.id = vm.host_id
join vpx_host ho on ho.id = vm.host_id
join vpxV_hist_stat_daily s on s.entity = ('vm-'+ CONVERT (Varchar, vm.ID))
Where
s.stat_name = 'swapped' and ev.id in(
select distinct(h.id) 
from vpx_entity h 
join vpxV_hist_stat_daily s on s.entity = ('vm-'+ CONVERT (Varchar, h.ID))
where h.type_id = 0 and s.stat_name = 'swapped' and s.stat_value > 800
and s.sample_time >= dateadd(hh, +6, getdate())
)
group by ev.name, vm.tools_version, vm.tools_status, eh.name, ho.product_build

select * from @tblToolsVer

Tuesday, January 5, 2010

Installing SQL Reporting Services on an existing instance

Brief Description: Reporting Services can appear to be unavailable for install when running the SQL installer on a machine that has an existing SQL install but no IIS install.

Solution:

Install the .NET Framework and service packs for 2.x and 3.x.
If IIS is not installed, do a base install (if the server is external facing, follow your external lockdown procedures):


  • Add/Remove Programs

  • Add/Remove Windows Components

  • Choose Application Server

  • Leave Network COM+ access checked unless you're locking it down

  • Select ASP.NET

  • The rest of the defaults will do a basic IIS-only install




  • If the server already has IIS, check the .NET version mapped to the default site.
    Presuming your SQL Server is not also a web server for other exciting applications, verify the default site is unused and continue to map ASP.NET to IIS:


  • pushd c:\windows\Microsoft.NET\Framework\v2.xxx

  • If this is a new install, register 2.x to the default site with the scriptmap overwrite

  • aspnet_regiis -s /W3SVC/1/Root

  • Otherwise, aspnet_regiis help shows the syntax for your choice of install

  • aspnet_regiis -ir (installs ASP.NET without updating existing scriptmaps)

  • Run inetmgr

  • In Web Service Extensions, 'ASP.NET v2.x' should be present and set to Allowed

  • If you didn't install scriptmaps with aspnet_regiis, go to the properties of the default website, choose the ASP.NET tab, and change the version to 2.x




  • Run SQL Setup


  • You'll get a warning on edition change. Ignore this for the time being

  • Select the instance you're upgrading.

  • Install Reporting Services.

  • Re-run the current service pack install.
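
Once setup and the service pack finish, a quick sanity check from PowerShell (the service names below assume a default instance; a named instance will have the instance name appended):

# Confirm IIS and the Reporting Services windows service are present and running
Get-Service -Name W3SVC, ReportServer

Then browse to the Report Manager virtual directory (Reports by default) to confirm the site responds.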