Monday, October 25, 2010

Infrastructure Testing: When the storage hangs, SQL 2008 R2 deals, SharePoint 2010 in a tizzy.

I've been testing worst-case scenarios for our new SharePoint 2010 infrastructure. It handles crash testing so elegantly that I'm amazed. Almost all the usual tests, from graceful shutdowns to tests that are just plain mean, work flawlessly. Whether it's stopping the services, shutting down the server, pulling the "plug", or killing the network, almost nothing fazes it. The failover is lightning fast, and the services keep working. From a browser, there's time to make a couple of HTTP requests that fail before the database and SharePoint shake it off and just work again.
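If you want to time that window yourself, a probe along these lines does the trick. This is a quick Python sketch, not what we actually ran, and the URL is a placeholder for your test site:

import time
import urllib.error
import urllib.request

URL = "http://sharepoint.example.local/"  # placeholder test site
last_ok = None

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            ok = (resp.status == 200)
    except (urllib.error.URLError, OSError):
        ok = False  # connection failures and 5xx responses both land here
    if ok != last_ok:
        print(time.strftime("%H:%M:%S"), "site is", "UP" if ok else "DOWN")
        last_ok = ok
    time.sleep(1)

Start it before you pull the plug and it prints a timestamped line at each transition, which makes the failover window easy to measure.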

So far, I've only been able to create one really, really ugly situation.

From my testing, the worst thing that can happen appears to be a storage hang. We had to create one artificially. It may be hard to imagine how storage just hangs without triggering failover in an HA storage environment, but it's possible, and boy is it ugly to recover from. SQL, admirably, manages to detect that it needs to fail over after a while, but SharePoint just faints. To be fair, I don't know of any system that *loves* losing its storage. In the same storage-hang testing against Oracle Data Guard in high availability mode, database failover never happened, never mind the application surviving. It seems the ugliest situations HA environments get themselves into are the ones where they're not completely crashed, but still unusable.

Scenario:
Landscape:
SQL:  Failover mirroring (SQL 2008 R2) hosted on Windows 2008 R2 VMs on vSphere 4.1.
SharePoint:  Load-balanced SharePoint 2010 web front ends on Windows 2008 R2 on vSphere 4.1.  Out-of-rotation application servers for indexing, metadata, Office automation, etc.  Configured to understand failover mirroring (see the sketch below).
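For anyone wondering what "configured to understand failover mirroring" means: SharePoint 2010 lets you register a failover server for each database, and under the hood the SQL client drivers accept a failover partner right in the connection string. Here's a rough Python/pyodbc illustration of the client-side idea; it's not how SharePoint itself connects, and the server names are made up:

import pyodbc

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 10.0};"
    "Server=sqlprincipal;"           # made-up principal server
    "Failover_Partner=sqlmirror;"    # made-up mirror, tried when the principal is unreachable
    "Database=WSS_Content;"
    "Trusted_Connection=yes;"
)

Note that the Failover_Partner keyword only matters at connect time; an already-open connection has to fail and reconnect before the client tries the mirror, which turns out to be relevant to what follows.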

Test:
Hang the vFiler hosting the primary SQL database... wait.

Results:
It takes SQL a while to figure out that it should fail over to the mirror (with a 20-second mirroring timeout), but after 3 to 5 minutes the databases were failed over and online.
SharePoint hangs until the database fails over, at which point it starts generating 503 errors and never seems to recover.
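A convenient way to watch the role flip from the SQL side is to poll sys.database_mirroring on the mirror. A rough pyodbc sketch, server name again a placeholder:

import time

import pyodbc

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 10.0};"
    "Server=sqlmirror;Database=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

while True:
    cur.execute(
        "SELECT DB_NAME(database_id), mirroring_role_desc, mirroring_state_desc "
        "FROM sys.database_mirroring WHERE mirroring_guid IS NOT NULL"
    )
    for name, role, state in cur.fetchall():
        print(time.strftime("%H:%M:%S"), name, role, state)
    time.sleep(5)

During the hang you can watch each mirrored database go from MIRROR to PRINCIPAL while SharePoint keeps throwing 503s.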

Things that don't bring SharePoint back online:
Restarting the admin service (on the theory that the admin service was keeping track of the failover state by holding a TCP connection open to the server).
Restarting a web front end (same theory, but testing whether the web front ends themselves recognize the failover).
Running the test with the admin and config databases already failed over to the mirror (to test whether SharePoint just becomes paralyzed without the config database).

Things that work to bring SharePoint back online:
Taking the primary SQL server offline (this is hard if VMware can't talk to the VMDK file).
Bringing the storage back online (as soon as the storage on the primary is back, SharePoint recognizes that server is no longer the primary, starts using the secondary, and is happy).


Theories:
Everything from questions about the .NET provider itself, to wondering if the virtual disk needs to return a hard error (new drivers in vSphere), to wondering if the primary is orphaned in some way (the witness and the secondary know it's not the primary anymore, but the primary itself doesn't). Time to get Microsoft support involved.

Update:   Ruled out the new vSphere SCSI drivers and disk timeout settings. Changing the SCSI driver, disk timeouts, and mirroring timeouts affects how quickly SQL Server mirroring fails over, but doesn't change the fact that SharePoint won't recover until the server goes offline or the storage comes back.
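For reference, the mirroring timeout we were varying is the per-database PARTNER TIMEOUT. A sketch of changing it, again in Python/pyodbc with a made-up content database name:

import pyodbc

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 10.0};"
    "Server=sqlprincipal;Database=master;Trusted_Connection=yes;",
    autocommit=True,  # ALTER DATABASE can't run inside a transaction
)
# The mirror declares the principal dead after this many seconds of
# missed pings; the minimum SQL Server allows is 5, the default is 10.
conn.cursor().execute("ALTER DATABASE WSS_Content SET PARTNER TIMEOUT 20")

Shortening it makes the database-level failover faster, but as noted above, it does nothing for the stuck SharePoint side.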
