Sunday, March 25, 2012

Comparing millions of records from file A and B

I'm trying to compare about 28 million records (record length 270) from tables A and B using the Lookup task as described in this forum. The process works fine with about two million records or so on my desktop (P4, 3.39 GHz, 1.5 GB RAM), but hangs with the full amount of data I'm trying to process. I tried using both full and partial caching, but to no avail. I'm thinking this is a hardware resource problem. Does anyone have any recommendations on the hardware needed for this kind of operation, or any other suggestions? Thanks in advance...

What are you interested in?

The "failures" or the "matches"

Try "tossing away" the data that you dont need so you can free the resources. How many Rows are in the Lookup table?

|||I'm interested in the failures. I want all that is in A that is not in B. The lookup table contains 13,730,718 rows. Thanks.|||

Take a look at the TableDifference component available on the http://www.sqlbi.eu site. It works great for a large number of records and can detect rows that do not exist in one or the other of two inputs, as well as changes in the records. A few caveats to know about:

1. The inputs to the TableDifference component must be sorted. Since your data is coming from flat files, you will either have to (a) use a Sort on each source or (b) land the data in a staging table of some kind and then re-query from there using an ORDER BY on the queries (see the sketch after this list).

2. The schemas of the inputs to TableDifference must match. That means column names as well as data types.

3. There are some cases where one input arrives at the TableDifference component faster than the other. There are some additional components available on the site that address these kinds of race conditions.
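A minimal sketch of option (b) from caveat 1, assuming made-up staging tables StageA/StageB and a key column RecordKey (substitute your real file layout):

    -- Hypothetical staging tables for the two flat files; a clustered
    -- index on the comparison key makes the ORDER BY essentially free.
    CREATE TABLE dbo.StageA (RecordKey varchar(50) NOT NULL, Payload varchar(220));
    CREATE TABLE dbo.StageB (RecordKey varchar(50) NOT NULL, Payload varchar(220));
    CREATE CLUSTERED INDEX IX_StageA_Key ON dbo.StageA (RecordKey);
    CREATE CLUSTERED INDEX IX_StageB_Key ON dbo.StageB (RecordKey);

    -- After loading the files, use these as the two source queries feeding
    -- TableDifference; both must order by the same key, and the source
    -- outputs need IsSorted/SortKeyPosition set so SSIS knows they are sorted.
    SELECT RecordKey, Payload FROM dbo.StageA ORDER BY RecordKey;
    SELECT RecordKey, Payload FROM dbo.StageB ORDER BY RecordKey;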

Dave F.

|||

Natee wrote:

I'm interested in the failures. I want all that is in A that is not in B. The lookup table contains 13,730,718 rows. Thanks.

If your lookup table contains that many rows, you will have a hard time with that configuration. Just to cache all those rows you will need approximately 3.7 GB of memory (13,730,718 rows x 270 bytes), and on top of that you have to add a few bytes per row for an index. If you are not caching the data, every row has to be looked up in the database, which is very time-consuming, and for millions of rows that adds up. To execute this I would suggest a box with at least 4-5 GB of main memory (rather more), and select the smaller of the two tables for caching. There won't be any way to optimize this unless you can either get the data sorted correctly (so the server will be able to use smart nested loops, as in the component mentioned) or expand the memory so the whole lookup table fits into it.

What does the data look like? Are there many rows that are "almost" alike? If so, there could be a fancier solution (thinking along the lines of using a tree structure to compress the data)... but it would definitely be easier to find a server with enough RAM to handle this task.

Edit: Are the two tables in the same database? Do indexes exist on them that can be used to order the data? If so, you might want to write a single query to do this... You may need to use a join hint and force a merge join on the data (see the sketch below)...
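A rough sketch of that single-query approach, assuming hypothetical tables dbo.TableA / dbo.TableB keyed on RecordKey (not the poster's actual schema):

    -- All rows in A that have no match in B (the "failures" the poster wants).
    -- LEFT MERGE JOIN is a join hint asking for a merge join, which works well
    -- when both tables have indexes that deliver the rows in key order.
    SELECT a.RecordKey, a.Payload
    FROM dbo.TableA AS a
    LEFT MERGE JOIN dbo.TableB AS b
        ON b.RecordKey = a.RecordKey
    WHERE b.RecordKey IS NULL;

    -- Equivalent formulation with a query-level hint instead of a join hint.
    SELECT a.RecordKey, a.Payload
    FROM dbo.TableA AS a
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.RecordKey = a.RecordKey)
    OPTION (MERGE JOIN);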

|||I have developed and fully documented a method for doing mass record comparisons and updates using the Script Component instead of the Lookup component. We also tried to use the Lookup component for a similar task and ran into too many issues. This Script Component technique is currently working great for us! Plus, it's even easier to implement than the Lookup component! Let me know what you think.

Here is a URL for the complete documentation for this method:
http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm

Thanks,
Greg Van Mullem
|||I'm sincerely grateful to all for your responses and for pointing me in the right direction. I checked out the site Dave directed me to and successfully ran this operation on my laptop (with 1 GB of RAM) within an hour. I could not believe it after struggling with this for a week. I'll definitely check out Greg's solution too. Many, many thanks again for your help.|||

Greg Van Mullem wrote:

I have developed and fully documented a method for doing mass record comparisons and updates using the Script Component instead of the Lookup component. We also tried to use the Lookup component for a similar task and ran into too many issues. This Script Component technique is currently working great for us! Plus, it's even easier to implement than the Lookup component! Let me know what you think.
Here is a URL for the complete documentation for this method:
http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm
Thanks,
Greg Van Mullem

Hi Greg,

Fascinating stuff. You've got some really valuable code to share up there.

I'm slightly sceptical as to why this is actually necessary, though. Your justification for doing it all in code seems to be that using lookups "needlessly fills up the destination databases transaction log with hoards of update commands" and "It prevents counting the records that actually needed to be updated." Did you explore using Lookups to find out whether a row that already exists has actually changed or not? Or even a Derived Column/Conditional Split component downstream of your Lookup that compares the values in the pipeline with the values in the Lookup dataset? That is eminently possible and would solve the two problems you mention here (a sketch follows below).
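In the pipeline that comparison would be a Conditional Split expression; purely as an illustration of the idea, here is the equivalent set-based check in T-SQL, again with made-up table and column names, that isolates (and therefore lets you count) only the rows that genuinely changed:

    -- Rows that exist in both tables but whose non-key columns differ,
    -- i.e. the only rows that genuinely need an UPDATE (and can be counted).
    -- The EXISTS ... EXCEPT pattern also treats NULLs as equal, unlike <>.
    SELECT src.RecordKey
    FROM dbo.SourceTable AS src
    JOIN dbo.DestTable AS dst
        ON dst.RecordKey = src.RecordKey
    WHERE EXISTS (SELECT src.Col1, src.Col2
                  EXCEPT
                  SELECT dst.Col1, dst.Col2);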

Great work though.

-Jamie

|||I'm posting all my replies about this stuff in this other extremely similar thread.

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=719997&SiteID=1&mode=1

Please go to that thread for further info. I hate trying to maintain a conversation in two different threads at the same time, so I'm not going to post in this thread any more.

Thanks,
Greg Van Mullem
