Thursday, June 7, 2012

SSIS OLE DB Command task very, very slow updating a small number of records

Description:  an SSIS OLE DB Command task (in this case an update) with only a few thousand records to update takes an unbelievably long time to run against a large-ish (~6 million row) table.

My Scenario:  I'm writing an SSIS package for ETL against a fairly large fact table with only a few thousand changes per day.  Records newer than the latest audit date are picked up, run through a Lookup, and carved into Insert and Update paths with a Conditional Split.  Nothing fancy.  The insert (an OLE DB Destination) runs quickly, but the update ran so long in test (some unknown amount longer than 30 minutes for fewer than 3k records) that I killed it because it was making me sad (it's also not feasible for our ETL window, but mostly the sad part).

The update (in my case) is pretty standard:  match rows on the numeric unique key (in our case, we don't enforce the source system's primary key), compare the audit field (an Oracle Date data type), and if the incoming row is newer, update it.  Full Disclosure:  because there are so many records and so few changes, I'm only pulling rows more recent than the newest existing record in the destination table, but that isn't relevant to the update task problem (although skipping it would cause exciting unnecessary work on the pull-from-source side).
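
For illustration, here's roughly what the per-row statement behind the OLE DB Command looks like, assuming a SQL Server destination (table and column names here are placeholders, and the ? markers map to columns coming down the data flow):

    UPDATE dbo.FactTable
       SET SomeMeasure = ?,
           AuditDate   = ?
     WHERE SourceKey   = ?
       AND AuditDate   < ?;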

The Problem:  Everyone else might already know this, but it never really hit me: when the documentation says the OLE DB Command "Executes an SQL command for each row in a dataset." it really means it.  I hadn't put a huge amount of thought into it before today, but until I really traced it and watched performance, I sort of had the impression it relied on some of the work already done by the preceding Lookup and Conditional Split tasks, and maybe sort of cached the lookups it had already done.  On a scale of 1 to finding out the Colonial furniture appraiser from Antiques Roadshow is actually twins, it's not an earth-shattering revelation (you go your whole life believing the most shocking thing about Colonial furniture is how much it would be worth if you'd retained the original "patina," and then you find out that guy is TWINS!).  The OLE DB Command task is performing the updates 1 by 1, and thus doing a lookup against a big table also 1 by 1.  That's a whole bunch of work on 6 million records even though only 3 thousand need to be updated.
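
If you want to watch the pain yourself, run one of those single-row updates by hand with I/O statistics on (SQL Server syntax; the names and key value are made up):

    SET STATISTICS IO ON;
    BEGIN TRANSACTION;

    UPDATE dbo.FactTable
       SET AuditDate = AuditDate   -- no-op update, just to see the reads
     WHERE SourceKey = 12345;      -- hypothetical key value

    ROLLBACK TRANSACTION;
    -- Without an index on SourceKey, the logical reads amount to a full
    -- scan of the ~6 million row table, repeated once for every row the
    -- OLE DB Command processes.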

Skip My Rambling and Try This:  For truly enormous tables it may be best to use a staging table approach (sketched below), but on a medium-big table the right indexes took the update from an amount of time I gave up on measuring to less than a minute.
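
For the record, the staging table approach replaces the OLE DB Command with an OLE DB Destination pointed at a work table, followed by one set-based update in an Execute SQL Task.  A minimal sketch, again assuming a SQL Server destination and placeholder names:

    -- The data flow bulk-loads the Update rows into dbo.FactTable_Staging;
    -- afterwards, a single joined UPDATE applies them all at once:
    UPDATE f
       SET f.SomeMeasure = s.SomeMeasure,
           f.AuditDate   = s.AuditDate
      FROM dbo.FactTable AS f
      JOIN dbo.FactTable_Staging AS s
        ON s.SourceKey = f.SourceKey
     WHERE s.AuditDate > f.AuditDate;

    TRUNCATE TABLE dbo.FactTable_Staging;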

Your Mileage May Vary (especially if your primary key isn't numeric):

A basic index on the audit date field and one on the key (if you're using an actual primary key, you already have this) took the update down to less than a minute.  A combined index on the audit field first, then the key, also reduced the run time significantly (down to about 10 minutes), but it pales in comparison.
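
Concretely, the indexes that did it look something like this (placeholder names again):

    CREATE INDEX IX_FactTable_AuditDate ON dbo.FactTable (AuditDate);
    CREATE INDEX IX_FactTable_SourceKey ON dbo.FactTable (SourceKey);

    -- The combined version that only got me to ~10 minutes:
    -- CREATE INDEX IX_FactTable_AuditDate_SourceKey
    --     ON dbo.FactTable (AuditDate, SourceKey);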
