Monday, November 29, 2010

DPM 2010 SharePoint Farm Backup Errors

Quick Description:  In the DPM 2010 console, a SharePoint 2010 Farm Replica shows the errors below:
Prepare for backup operation for SQL Server 2008 database ServerName\DatabaseName on ServerName has been stopped because this operation depended on another backup operation which failed or was cancelled. (ID 30200)
1 Database(s) has been removed from the SharePoint farm ServerName.  These databases are not part of the recovery point
    OR
 One or more databases seem to have been added to the SharePoint farm  ServerName.  Recovery point for the farm has been created without these databases.
Problem:
The primary error appears to occur whenever a farm configuration change is made.  Depending on whether you've added a database to the farm or removed one from it, the secondary error will be one of the two additional errors.


Solution:
The error that occurs when a database is removed from SharePoint is solved by stopping protection and then re-creating it without deleting any data.  To do this, right click the protected farm and select Stop Protection.  Do not check the box to Delete replica on disk.  This bears repeating.  Verify that Delete replica on disk is not checked, and then verify it again; if you check the box you will irrevocably lose your existing backups.  Once protection has been stopped, you can re-create it.  DPM will detect that there is inactive protection for the previously protected farm, and initiate a consistency check before bringing it online.

The case in which a content database has been added to the SharePoint farm is much more interesting.  My DPM server is protecting both the SharePoint farm and the backend SQL server that hosts all the additional SharePoint databases.  Since the SharePoint farm protection only backs up the config and content databases, I have the SQL servers set to auto protect new databases coming online on this server (generally there shouldn't be any change other than new content databases).  It seems that databases are autodiscovered before new site collections, and since the new content database is already in DPM, DPM fails to add the site collections inside the content database to the farm replica.  The simple answer is to stop protecting the new content database on the SQL server and re-run the recovery point on the SharePoint farm.  Since new databases shouldn't just be cropping up unannounced on the SharePoint SQL server, you may want to stop DPM auto discovery on the SharePoint SQL server, but if you need autodiscovery on, make sure DPM is happy after any new content databases are added.


This works, and isn't disastrous unless you don't catch the DPM error and end up without a current backup of your SharePoint farm, but I think it's pretty fragile to not be able to cope with changes to the SharePoint farm.  If you have a great SharePoint administrator like we do, you shouldn't be surprised by databases appearing out of thin air, but even with warning, you will have to rebuild the replica if a content database is removed or moved to a new farm, which is at the least a hassle, and in practice just another moving part that can break.  I wish it were more robust.  Note to self:  Keep remembering how much better DPM 2010 is than the previous version.

Monday, November 22, 2010

DPM 2010 Item level restore error -- Index out of Bounds

Quick Description:  DPM 2010 item level restore against SharePoint 2010 fails with error "Index was outside the bounds of the array".

Scenario:
Landscape:
Multiple SharePoint 2010/Windows 2008 R2 web front ends/application servers
SQL Server 2008 R2 back end
DPM 2010 is protecting a SharePoint 2010 farm with Item Level Restore configured and green.
Additional SQL 2008 R2 utility server for the unattached content database restore

Errors: 
In the DPM console, the error below is reported.
    DPM was unable to export the item YourItemName from the content database YourContentDatabase. Exception Message = Index was outside the bounds of the array.. (ID 32017 Details: Unknown error (0x80131508) (0x80131508))
In the DPM Client Logs on the target Web Front End (default location C:\Program Files\Microsoft Data Protection Manager\DPM\Temp\WssCmdletsWrapperCurr.ErrLog) the error below is reported.
06FC    1AB4    date    Time    09    AppAssert.cs(114)    WARNING    Nearest Site Url should not be null or Empty
06FC    1AB4    date    Time    31    WSSCmdlets.cs(450)    WARNING    Caught Exception while trying to export Url [ItemRestoreURL] to File [temporarylocation\DPM_GUID\cmp\].
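When the client log is long, a small filter can help find these warnings.  The sketch below is only an illustration: it assumes the column layout visible in the two excerpt lines above (a source file with a line number, then WARNING, then the message), and the function names are made up for this example.  Reading the actual log file is left to the reader.

```python
import re

# Assumed layout, based only on the excerpt above:
#   ... <SourceFile.cs(line)> WARNING <message>
WARNING_LINE = re.compile(
    r"(?P<source>\S+\.cs\(\d+\))\s+WARNING\s+(?P<message>.+)"
)


def find_warnings(log_text):
    """Return (source, message) pairs for every WARNING line in the text."""
    hits = []
    for line in log_text.splitlines():
        match = WARNING_LINE.search(line)
        if match:
            hits.append((match.group("source"), match.group("message").strip()))
    return hits


def has_nearest_site_url_warning(log_text):
    """True if the 'Nearest Site Url' warning quoted in this post is present."""
    return any(
        "Nearest Site Url should not be null" in message
        for _, message in find_warnings(log_text)
    )


if __name__ == "__main__":
    # Sample built from the excerpt; "date" and "Time" are placeholders.
    sample = (
        "06FC\t1AB4\tdate\tTime\t09\tAppAssert.cs(114)\tWARNING\t"
        "Nearest Site Url should not be null or Empty\n"
    )
    for source, message in find_warnings(sample):
        print(source, "->", message)
```

If the real log has extra columns between the source file and the level, the pattern would need adjusting.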
Solution:
This error appears to occur when more than one instance of the Central Administration site is running in the farm.
In the Central Administration site, under System Settings, then Services on Server, check all servers in the farm for the Central Administration site, and temporarily stop the Central Administration site on all but the server listed in DPM.
I can reliably reproduce this error on 3 SharePoint farms by starting the Central Administration site on more than one server.  This may not be the only condition that causes the error, though, so please let me know if you run into it and multiple Central Administration sites are not the cause.


Other Things to Check:
  • Check that the account you used to register your farm with DPM is, in fact, the Farm Account.  
    • In Central Administration go to Service Accounts.  In the Credential Management drop down select Farm Account, and verify you're using the account specified here.
      • DPM will let you build the replica with the content account, and the replica will look healthy until you go to restore and it fails with an Access Denied error.
      • To fix this you need to remove the replica from protection (I believe you can retain the protected data), re-register the servers with DPM, and re-create the protection group.  If you retained the protected data, you will need to perform a consistency check.
  • Although you can register all servers with DPM (and I do, because I worry about the recovery scenario if you lose the server that has registered the replica), the DPM server may complain that the SharePoint VSS writer is running on multiple servers.  
    • It may be necessary to temporarily stop the SharePoint VSS writer on the SharePoint servers which do not own the replica in DPM.

 

DPM and SharePoint 2010 Item Level Restore -- The Temporary Staging SQL Server

In writing up a post on an error in DPM 2010 SharePoint Item Level Recovery, I started thinking about the role of the SQL server used as a temporary staging location during the restore. 

It seems to me, that for all intents and purposes, DPM's Item Level Recovery is really just the SharePoint 2010 Granular Restore/Unattached Content Database recovery integrated into the DPM console.  Russ Maxwell has a great walk through of how to do a granular restore without DPM.  DPM locates the recovery point of the content database, restores the whole database, and exports the file from the unattached content database, then re-imports it into the live site.  It's much less painful than manually locating the recovery point that contains the file/record, restoring the content database and using the item level recovery, since often the bulk of the work is in finding the restore point.  The DPM console lets you browse the recovery points to find the item before you go to the trouble of restoring the content database.

Restoring without a recovery farm requires 3 temporary locations:  a directory on the Web Front End to hold the restored item temporarily in cmp form, a temporary SQL server where the content database can be restored and, of course, a location on that database server's file system where the database file will temporarily live.  Although the primary database server can be used for the temporary restore of the unattached content database (the database name is converted to the format DPM_GUID), I'm still using a Non-SharePoint, Non-Production SQL server for two reasons:
1.  If your content database is big, you'll take the I/O hit of restoring the whole database to the same disks that are serving up your production farm, and could potentially run your disk out of space (but you'd check that first) just to delete it when the item is restored.
2.  The term "Unattached Content Database" does not mean it isn't attached to SQL Server, it's just not attached to SharePoint. 
I'll be testing whether this works using SQL Express as the intermediate restore database.

Friday, November 19, 2010

Re-adding a rebuilt server to DPM 2010

Quick Description:  If a DPM client system irretrievably crashes, and is rebuilt/replaced with a system using the same computer account, DPM will not allow the old computer to be removed from the agent inventory, or reinstalled from the console.

Problem:  The original client agent is no longer available to perform an uninstall.  Attempting to uninstall results in the error
The agent operation failed because of a communication error with the DPM Agent Coordinator service on clientservername
The RPC server is unavailable (0x800706BA)
Since the rebuilt machine with the same computer account is online, it does not provide the option to remove the server from the DPM catalog.
Attempts to reinstall the agent fail as the server already exists in DPM.

Solution:  Manually install the client agent on the rebuilt server and refresh it in the console.  Firewall rules for DPM communication are automatically created.  It seems like this should be more intuitive, since one of the reasons you protect a server with DPM is that if it completely crashes, you can rebuild it and restore the data.

The command line to install the Agent on an x64 machine is
%Path to DPM install file Share%\DPMAgentInstaller_x64.exe yourdpmservername
Install files for 2010 for an x64 client are located by default at
c:\program files\microsoft dpm\dpm\protectionagents\RA\3.0.7696.0\amd64\DPMAgentInstaller_x64.exe

Tuesday, November 16, 2010

My Very First Argument with DPM 2010

Problem Quick Description:  DPM servers previously configured and working as cross-protection/DR protection for each other start failing to connect with the error 
DPM failed to communicate with the protection agent on yourservername because access is denied. (ID 42 Details: Access is denied (0x80070005))
Situation:
DPM servers configured to protect each other start denying access and refusing to connect.  Servers are in the correct local groups (DPMDRTrustedMachines, DPMRADmTrustedMachines, DPMRADCOMTrustedMachines) on both the primary DPM server and the DR DPM server.

Work Around:
I don't have root cause on this yet, but not getting a good backup of your DPM Database is such a showstopper that I'm posting a cheesy workaround.

To get back up and running, add the machine account of the DR DPM server to the local administrators group of the primary DPM server.  This is obviously not a good long-term solution, but in my situation it's less dangerous than running with the DPM databases unprotected (your mileage may vary).  When I've tracked down the permanent solution, I'll post it.

Update on Failover Mirroring Testing SharePoint 2010/SQL 2008 R2

I took the opportunity to ask the SQLCAT guys at this year's SQLPASS about an issue with storage hangs I had been testing a couple of weeks ago.   In testing our SharePoint 2010 infrastructure, I ran into one scenario in which SQL server loses access to its underlying disks.  The mirror eventually takes over, but SharePoint does not recover until either the storage comes back online, or the primary is completely offline.

Since it was hard to say whether the issue is SQL Server, SharePoint, or the .NET provider, I talked to pretty much everyone.  While there isn't a definitive answer, it's an easy scenario to reproduce, so the webapps team said they would test it with "Always On" in the new version.

For the moment, I'm documenting it as a known issue requiring manual intervention to bring SharePoint back online.

Friday, November 5, 2010

Quick Response to SysPrep Question

We got a question on my (perhaps slightly snarky) article on how disappointing SQL 2008 R2 SysPrep is, regarding Windows 2008 R2 SysPrep.

Question from Anonymous:
I'm trying to use sysprepped 2008R2 Express Editions so that each of my developers can get a local copy of SQL Server to test code as part of their workstation baseline.

The issue I have with it, is that even on a properly Prepared image, the CompleteImage phase doesn't appear to work during the "cmdlines.txt" portion of Windows Mini-Setup. Instead, the setup.exe silently dies, and the even log points to errors resolving the C Runtime library (which appears to be properly installed under WinSxS).

If the Sysprep functions do not work DURING sysprep then they're useless to me. If I, as an admin, have to log on to each workstation/server after loading the image, then there's no point in providing automation functions.

I'm passing on Benj's answer since he's the expert on Windows SysPrep.

Benj suggests setting it to autologin once, and running SQL Setup/Complete Image during the guirunonce phase.

I'll post my non-SysPrep steps for creating a SQL 2008 R2 VMware template soon.

Thursday, November 4, 2010

Quick Recap of DPM 2010/SharePoint 2010 config steps:

Seems like a lot of people are seeing the DPM MetaData error when getting SharePoint 2010 set up for item level restore.  I've posted an article specifically about resolving this error, but it presupposes that the initial configuration for DPM 2010 SharePoint protection has been completed.


I've posted the initial configuration steps below:


From a command window run as administrator on your SharePoint Web Front End:

pushd [YourDPM Directory default c:\program files\microsoft Data Protection manager\DPM\bin]
ConfigureSharePoint.exe -EnableSharePointProtection
[provide your farm service account user and password when prompted]

ConfigureSharePoint.exe -EnableSPSearchProtection
[provide your farm service account user and password when prompted]
"c:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\BIN\stsadm" -o registerwsswriter
Run "vssadmin list writers" and check for any VSS writers not in "stable" state.

Restart the following services on the SharePoint Web Front End
  • Volume Shadow Copy Services
  • SharePoint VSS Writer
  • DPMRA
Build the replica in DPM
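
The "vssadmin list writers" check in the steps above can be tedious to do by eye on a server with many writers.  The sketch below is one possible way to scan the saved output for writers that are not in the Stable state.  It assumes the usual "Writer name: '...'" and "State: [n] ..." lines in the output; the function name and the sample text are invented for illustration and the sample is not real output from a farm.

```python
import re

NAME_LINE = re.compile(r"Writer name:\s*'(.+)'")
STATE_LINE = re.compile(r"State:\s*\[(\d+)\]\s*(.+)")


def unstable_writers(vssadmin_output):
    """Return (writer_name, state) pairs for writers whose state is not Stable."""
    problems = []
    current_writer = None
    for raw_line in vssadmin_output.splitlines():
        line = raw_line.strip()
        name_match = NAME_LINE.match(line)
        if name_match:
            current_writer = name_match.group(1)
            continue
        state_match = STATE_LINE.match(line)
        if state_match and current_writer is not None:
            state = state_match.group(2).strip()
            if state.lower() != "stable":
                problems.append((current_writer, state))
            current_writer = None
    return problems


if __name__ == "__main__":
    # Illustrative sample only; the IDs are placeholders.
    sample = """\
Writer name: 'System Writer'
   Writer Id: {00000000-0000-0000-0000-000000000000}
   State: [1] Stable
   Last error: No error

Writer name: 'SharePoint Services Writer'
   Writer Id: {00000000-0000-0000-0000-000000000000}
   State: [8] Failed
   Last error: Non-retryable error
"""
    for writer, state in unstable_writers(sample):
        print(f"{writer}: {state}")
```

To use it, redirect the command's output to a text file on the Web Front End and pass the file's contents to unstable_writers.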

Possible causes of problems with item level restore in addition to the metadata error:
  • One of the SharePoint databases is already protected by DPM -- 
    • Protect the SharePoint Farm before configuring database protection.  The config and content databases will then become unavailable for protection under SQL Server.
  •  Item level restore is configured correctly but the catalog hasn't yet populated -- Follow these instructions to force the catalog to populate.

Configuring SharePoint 2010 Farm Monitoring with SCOM 2007 when missing sharepointMP.config file

Microsoft made management pack installation much easier in System Center Operations Manager 2007, but I recently had problems with the new automated process.
Directly installing the management packs from the catalog will appear to function fine, but a critical file “sharepointMP.config” will be missing. There are very few references on the internet, and they point to issues with the x64 version of Windows installing the file to the wrong location. This is not the solution. A search of the entire server will not return the file.
The solution turned out to be trivial. Download the management pack the old-fashioned way as an MSI from Microsoft and install it, keeping the x86/x64 issue in mind if it applies to you. Once installed you will find the “sharepointMP.config” file in the directory. Follow the instructions in the “Microsoft® SharePoint® 2010 Products Management Pack Guide” on Microsoft’s site to edit the file appropriately for your service account. Basically just change these two entries:
  • Association Account="SharePointService" (an account with rights to the farms)
  • Machine Name="spoint" (or some string that is in all your SharePoint server names)
Then run the "Configure SharePoint Management Pack" command in the action pane of the SharePoint Folder in SCOM. It will take a while for the data to appear once this is done. Microsoft suggested 30 minutes in their documentation.

Wednesday, November 3, 2010

Get list of mirroring timeouts

Just a quick script to get a list of mirroring timeouts for databases configured for mirroring.  SQL 2008's default mirroring timeout of 10 seconds is very quick to fail over, and I'd like it to wait out things like very high disk latency in the middle of the night, when storage does disk-intensive things like deleting snapshots, rather than failing over.


To check mirroring timeouts in a more human friendly format:

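-- mirroring_connection_timeout is reported in seconds
-- mirroring_guid is null for databases that are not mirrored
select d.name as databasename, m.mirroring_connection_timeout as timeout
from sys.databases d join sys.database_mirroring m
on d.database_id = m.database_id
where m.mirroring_guid is not null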


Note:  when setting the mirroring timeout, setting it on one server sets it on both.

ALTER DATABASE [DBName] SET PARTNER TIMEOUT 20
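
If several mirrored databases need the same timeout, the statement above can be generated per database rather than typed by hand.  This is a minimal sketch, not a deployment tool: the helper names and the database names in the usage example are hypothetical, and the generated statements still have to be reviewed and run in Management Studio.

```python
def quote_identifier(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def partner_timeout_statements(databases, seconds):
    """Build one ALTER DATABASE ... SET PARTNER TIMEOUT statement per database."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive whole number of seconds")
    return [
        f"ALTER DATABASE {quote_identifier(db)} SET PARTNER TIMEOUT {seconds};"
        for db in databases
    ]


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    for statement in partner_timeout_statements(["WSS_Content", "SP_Config"], 20):
        print(statement)
```

The database list can come from the query above, which returns only databases that are configured for mirroring.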

SQL 2008 -- Can't connect with management studio from the local machine

Brief Description:  On a SQL 2008 R2 machine running on Windows 2008 R2, when attempting to launch Management Studio logged in as a domain account with local admin privileges, SQL server refuses connection with access denied.

More Details:  
If the default admin of the SQL server is the local administrators group, members of the local admin group will be able to connect from remote copies of Management Studio, but can't connect locally when directly launching Management Studio.

Solution:
Right click management studio and "run as administrator".  Maybe this should be intuitive, but it wasn't for me.  Since Management Studio is an administrative tool by default, it seems slightly off to me that I'd have to tell it I want to be an administrator.

I'm still getting used to the whole concept of Microsoft deciding that "su" is the new thing.  I'm not sure it's the thing I would have taken from *nix if I was going shopping for things Windows should totally have...I'd maybe have gone with a native way to mount an ISO, but that's just me.

Disclaimer:
Connecting as a user account to do any real work isn't so much a best practice, since so many things end up owned by the account that created them (mirroring endpoints being owned by user accounts was an unpleasant surprise to me), but sometimes it's practical for troubleshooting, or for actions that should be audited by username, like shutting down a service.