Wednesday, October 27, 2010

Hotpatching: Great idea Microsoft, but a terrible implementation

While doing research on minimizing planned downtime within a datacenter, I re-examined the idea of hotpatching. Microsoft introduced this idea with Windows 2003 Service Pack 1 as part of a “reboot reduction initiative.”

Microsoft reboot reduction initiative

Hotpatching is part of the Microsoft reboot reduction initiative, which seeks to help minimize the need for a full system reboot after installing updates. Reducing reboots is important because IT departments in many organizations implement a time-consuming test cycle every time an update is installed and the system is rebooted. This results in loss of productivity and revenue to the organization until their system is fully verified and operational. Hotpatching allows customers to deploy important updates and patches in a timely, transparent manner without requiring a full system shutdown and restart. This reduces their rollout time.

The following examples demonstrate possible savings from reboot reduction:

  • Of the 22 updates that shipped for Windows Server 2003 RTM between April 2005 and August 2005, 15 of them required a reboot. Eight of these could have been hotpatched. This would have reduced the number of reboots by 53%.
  • Of the 14 updates that shipped for Windows Server 2003 Service Pack 1 (SP1) prior to August 2005, ten of them required a reboot. Four of these could have been hotpatched. This would have reduced the number of reboots by 40%.

Source: http://technet.microsoft.com/en-us/library/cc781109(WS.10).aspx

So how did it work out?

I did some research by using Google to find security bulletins that contained the word hotpatching and then refined those results. I found only a few bulletins whose updates supported the hotpatching install switch. Here is an example of one search:

hotpatching -does-not-support-HotPatching site:http://www.microsoft.com/technet/security/bulletin

I would estimate that less than 1% of released patches support hotpatching, which leaves me pretty disappointed given Microsoft's own examples above.

The Linux equivalent, Ksplice, seems to have a much better track record, though I have no experience with its impact on system stability.

Update:

In an effort to find additional statistics showing that hotpatching was basically unused by Microsoft, I extracted one of the few hotfixes I could find that supported hotpatching (WindowsServer2003-KB917159-x86-ENU) using the /x switch. According to Microsoft’s documentation, a hotfix that supports hotpatching will contain a file with a “.hp” extension. Within the directory structure I did indeed find the hotpatching file “svr.hp.sys”.

I then extracted a collection of 200 hotfixes released after Windows Server 2003 Service Pack 2 and searched them for other filenames containing the ‘hp’ string.

Number found: zero.
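To repeat the check, the extract-and-search can be scripted along these lines. This is only a rough sketch: the paths are placeholders, and it assumes the update packages accept the /x:&lt;folder&gt; form of the extraction switch.

$hotfixDir  = 'C:\hotfixes'              # folder of WindowsServer2003-KB*.exe packages (placeholder path)
$extractDir = 'C:\hotfixes\extracted'    # where each package gets unpacked (placeholder path)
Get-ChildItem -Path $hotfixDir -Filter *.exe | ForEach-Object {
    $target = Join-Path $extractDir $_.BaseName
    # /x:<folder> extracts the package contents without installing it
    Start-Process -FilePath $_.FullName -ArgumentList "/x:`"$target`"" -Wait
}
# A package that supports hotpatching should contain a *.hp.* binary, e.g. svr.hp.sys
Get-ChildItem -Path $extractDir -Recurse | Where-Object { $_.Name -match '\.hp\.' }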

Monday, October 25, 2010

Infrastructure Testing: When the storage hangs, SQL 2008 R2 deals, SharePoint 2010 in a tizzy.

I've been testing worst-case scenarios for our new 2010 SharePoint infrastructure.  It handles crash testing so elegantly that I'm amazed.  Almost all the usual tests, from graceful shutdown, to tests that are just plain mean, work flawlessly.  Whether it's stopping the services, shutting down the server, pulling the "plug", killing the network -- almost nothing fazes it.  The failover is lightning fast, and the services keep working.  From a browser, there's time to make a couple of http requests that fail before the database and SharePoint shake it off and just work again.

So far, I've only been able to create one really, really, ugly situation.

From my testing, the worst possible thing that could happen appears to be a storage hang.  We had to artificially create a storage hang to test it.  It's maybe hard to imagine how you get into a situation where storage just hangs without triggering failover in an HA storage environment, but it's possible, and boy is it ugly to recover from.   SQL, admirably, manages to detect that it needs to fail over after a while, but SharePoint just faints. To be fair, I don't know of any system that *loves* losing its storage.  In the same storage hang testing on Oracle Data Guard in high availability mode, database failover never happened, never mind the application surviving.  It seems the ugliest situations HA environments get themselves into are the ones in which the system isn't completely crashed, but is still unusable.

Scenario:
Landscape:
SQL:  Failover mirroring (SQL 2008 R2) hosted on Windows 2008 R2 VMs on vSphere 4.1.
SharePoint:  Load-balanced SharePoint 2010 web front ends on Windows 2008 R2 on vSphere 4.1.   Out-of-rotation application servers for indexing, metadata, office automation, etc.  Configured to understand failover mirroring (see the sketch below).
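For reference, the "configured to understand failover mirroring" piece is set per database; a minimal sketch of that configuration, with the content database and mirror instance names as placeholders:

# Point a SharePoint 2010 content database at its mirror partner.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$db = Get-SPDatabase | Where-Object { $_.Name -eq 'WSS_Content_Main' }   # placeholder database name
$db.AddFailoverServiceInstance('SQLMIRROR01')                            # placeholder mirror instance
$db.Update()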

Test:
Hang the vfiler hosting the primary SQL Database...wait.

Results
It takes SQL a while to figure out that it should fail over to the mirror (with a 20-second mirroring timeout), but after 3 - 5 minutes the databases were failed over and online.
SharePoint hangs until the database fails over, at which point it starts generating 503 errors and never seems to recover.
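A quick way to watch the failover from the SQL side while the storage is hung is to poll the mirroring state on the surviving partner; a minimal sketch, with the server name as a placeholder and Windows authentication assumed:

# Query sys.database_mirroring on the mirror partner; 'SQLMIRROR01' is a placeholder.
Add-Type -AssemblyName System.Data
$conn = New-Object System.Data.SqlClient.SqlConnection 'Server=SQLMIRROR01;Database=master;Integrated Security=SSPI'
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = "SELECT DB_NAME(database_id) AS db, mirroring_role_desc, mirroring_state_desc FROM sys.database_mirroring WHERE mirroring_guid IS NOT NULL"
$reader = $cmd.ExecuteReader()
while ($reader.Read()) { '{0}: {1} / {2}' -f $reader['db'], $reader['mirroring_role_desc'], $reader['mirroring_state_desc'] }
$reader.Close(); $conn.Close()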

Things that don't bring SharePoint back online:
Restarting the admin service (with the theory that the admin service was perhaps keeping track of the failover state by keeping a tcp connection open to the server)
Restarting a web front end (same theory, but testing whether the web front ends themselves recognize the failover)
Running the test with the admin and config databases already failed over to the mirror (to test if it just becomes paralyzed without the config database).

Things that work to bring SharePoint back online:
Taking the primary SQL server offline (this is hard if VMware can't talk to the vmdk file).
Bringing the storage back online (As soon as the storage on the primary is back online, SharePoint recognizes it's no longer the primary, starts using the secondary, and is happy).


Theories:  
Everything from questions about the .NET provider itself, to wondering if the virtual disk needs to return a hard error (new drivers in vSphere), to wondering if the primary is orphaned in some way (the witness and the secondary know it's not the primary anymore, but the primary itself doesn't). Time to get Microsoft support involved.
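One cheap way to poke at the .NET provider theory independently of SharePoint is to open a connection with an explicit failover partner straight from PowerShell during the hang and see which partner answers; a sketch only, with placeholder server and database names:

# SqlClient with an explicit Failover Partner, the same mechanism the provider theory is about.
Add-Type -AssemblyName System.Data
$cs = 'Data Source=SQLPRIMARY01;Failover Partner=SQLMIRROR01;Initial Catalog=WSS_Content_Main;Integrated Security=SSPI;Connect Timeout=15'
$conn = New-Object System.Data.SqlClient.SqlConnection $cs
try {
    $conn.Open()
    $cmd = $conn.CreateCommand()
    $cmd.CommandText = 'SELECT @@SERVERNAME'   # shows which partner actually answered
    'Connected to: {0}' -f $cmd.ExecuteScalar()
}
finally { $conn.Close() }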

Update:   Ruled out new vSphere SCSI drivers and disk timeout settings.  Changing the SCSI driver, disk timeouts, and mirroring timeouts affects how quickly SQL Server mirroring fails over, but doesn't change the fact that SharePoint doesn't recover until the server goes offline or the storage comes back.

Thursday, October 21, 2010

Data Protection Manager 2010 and SharePoint 2010 metadata error

Brief Description:  After creating a replica of a SharePoint 2010 farm in Data Protection Manager 2010, the replica shows the error "Backup metadata enumeration failed".

Problem:  After running both the EnableSharePointProtection and EnableSPSearchProtection options with ConfigureSharepoint.exe, and creating a protection group in Data Protection Manager, the replica begins building without error.  When it completes, it gives a warning "Backup metadata enumeration failed".  Item-level restore is, therefore, not available.  All the VSS services are running on the front end servers and the database servers.  vssadmin list writers also shows that the SharePoint writer is stable and has no errors.

Solution:

This is a deceptive error because the VSS writers all show up as happily running without error.  In my case, the solution was to run the command
"c:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\BIN\stsadm" -o registerwsswriter
on the front end servers in the farm.  That's not a typo; it's wsswriter, not vsswriter.  Odd since it's SharePoint 2010, but nonetheless, the replicas happily created and, with a force populate, I have item-level restore.
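Put together, the fix plus a quick verification on a web front end looks roughly like this (default install path assumed; after registering, the SharePoint writer should show up in the vssadmin output as stable):

# Re-register the SharePoint VSS writer, then confirm it appears in the writer list.
& "$env:CommonProgramFiles\Microsoft Shared\Web Server Extensions\14\BIN\stsadm.exe" -o registerwsswriter
vssadmin list writers | Select-String -Context 0,4 'SharePoint'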

I've posted a quick recap of DPM 2010/SharePoint 2010 config steps and other gotchas with item level restore.  If this is an install that has never worked, before following the steps in this article, destroy the replica with the metadata error.  Then, follow the configuration steps, and re-create the replica.

If you get this error on an existing, previously working DPM replica, force a new recovery point; there may have just been a timeout during the last one.
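A new recovery point can also be forced from the DPM Management Shell instead of the console; a sketch, with the DPM server and protection-group names as placeholders:

# Kick off an express full recovery point for every data source in the SharePoint protection group.
$pg = Get-ProtectionGroup -DPMServerName 'DPM01' | Where-Object { $_.FriendlyName -eq 'SharePoint Farm' }
Get-Datasource -ProtectionGroup $pg | ForEach-Object {
    New-RecoveryPoint -Datasource $_ -Disk -BackupType ExpressFull
}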

Wednesday, October 13, 2010

Removing aliases from PowerShell to aid in learning and script stability

When using PowerShell 2.0, it is easy to rely on aliases for functionality.  The excellent Windows PowerShell 2.0 Best Practices book suggests removing all the aliases to ensure you don’t use them accidentally.  Unfortunately, the code provided only works if you have changed your location to the alias: drive before running it.  The following code should work without the need to use Set-Location first:

foreach ($alias in (Get-Alias | Where-Object { $_.Options -notmatch 'ReadOnly' })) { Remove-Item -Path "alias:\$($alias.Name)" }
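Once it has run, only the read-only aliases that the filter skips should remain, which is easy to confirm:

Get-Alias | Measure-Object                                   # count what's left (the ReadOnly ones)
Get-Alias | Where-Object { $_.Options -notmatch 'ReadOnly' } # should return nothing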

Tuesday, October 12, 2010

DPM 2010 so much better than 2007, just so much better

Since migrating to DPM 2010, I have an hour a day back.  Seven hours a week isn't trivial (it couldn't go a weekend day without being looked at, either).  With DPM 2007, there were days where I spent much more than an hour coaxing it into behaving itself for the next day, and the alerts were both untunable and too chatty to be useful.  With DPM 2007, even using the DPM PowerShell extensions to run scheduled contingent consistency checks and hourly runs of pruneshadowcopies, I was forever fiddling with things that went pear-shaped.  Still, centralizing the SQL backups was worth it; I just hoped it would get better in 2010, and it really, really did.

I moved the last DPM 2007 server out of the environment last month, and my life is suddenly *way* easier.

In DPM 2007, it just plain couldn't keep up with pruning, and I frequently ended up with protection groups containing *way* more recovery points than I had specified.  Even with an hourly run of pruneshadowcopies, I frequently had to brute-force it by kicking off multiple concurrent prunes just to get it to catch itself up.  On 2010, with the same retention and number of databases, I'm not even running pruning as a scheduled job; it just keeps up.

In DPM 2007, replicas were forever getting inconsistent with the slightest provocation.  If creating a recovery point timed out, boom: inconsistent, and the scheduled consistency check job didn't always fix it.  More than once, I had to destroy and rebuild the replica (*so* not ideal) after multiple cumbersome, disk-intensive consistency checks failed to fix the problem.  DPM 2010 self-heals nicely.

I'm also not seeing agent crashes, even on 2003/2005 servers.

I have an hour a day back from fighting DPM problems, pretty nice!

Tuesday, October 5, 2010

DPM 2010 DPM database has exceeded the threshold limit

Quick Description: DPM 2010 shows a warning alert that the DPM database has exceeded the threshold limit even if you don't have a warning threshold set.

Problem: It seems that even if you have "Alert me when DPM database size reaches x GB" unchecked, it ignores it and generates an alert when the database size exceeds the default threshold of 1 GB.

Solution:
Cheesy, but in the alert description click on "Modify Catalog Alert Threshold size" and enable the alert, set it to a bigger number, and then disable the alert again.
As far as I can tell, it just plain ignores that you don't want to be alerted, and alerts on the default size anyway.