Wednesday, March 7, 2012

Company Dimension

We deal with multiple vendors who provide us information via text/XML files. Vendor A may provide financial data, Vendor B litigation data, and Vendor C ratings data. Our current structure has a database for each vendor, each with its own company table, which leaves all this data disconnected. Of course, each vendor has its own proprietary company ID to make records unique.

All of the data is based on companies so the grain of data would be at a company level. I would like to be able to link this information together by creating a dimensional model that has a single company table (DimCompany) and has facts populated based on the type of data we receive. Would this be the right sequence of events?

1. My initial load (historical) would have to look at all these data sources and create one company record in my DimCompany table. This table would then link to all other fact tables to provide a single view of company info. I would imagine this would have to be a fuzzy lookup since one company will be in all sources.

2. On subsequent loads (incremental) I would probably have to do a lookup of companies in the dimension via the proprietary code and add if the company wasn't there.

Any advice on tackling this issue would be greatly appreciated especially if SSIS was used in the process.

Hi jrp210,

Based on my interpretation of your schema, I would design DimCompany as you describe. One row per company, unique IDs, and you may wish to add surrogate keys - they will make your data mart / warehouse easier to scale.

I would also add a DimInformationType for the different types of data you receive.

Another option is to snowflake the InformationType table off of DimCompany. This is common when the referenced lookup table is relatively small and doesn't change often.

There are performance implications, but in smaller data warehouses (roughly 1 to 2 GB) you will probably never notice. And if you do, an index will likely clear up any performance issue.

I don't follow the need for a fuzzy lookup. These are expensive in SSIS and should only be used when necessary.

You are correct: incremental loading does require a lookup (or a merge join) to detect new records.
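As a rough illustration of that lookup, here is a minimal sketch using Python's built-in sqlite3 module rather than SSIS. It assumes a single vendor for simplicity; the StagingCompany table, the VendorCode column, and the company "Contoso" are hypothetical names made up for the example. The LEFT JOIN with an IS NULL filter plays the role of the lookup's "no match" output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCompany (
        CompanySK   INTEGER PRIMARY KEY,   -- surrogate key
        VendorCode  TEXT UNIQUE,           -- vendor's proprietary company id (assumed column)
        CompanyName TEXT
    );
    -- Hypothetical staging table holding the latest vendor file.
    CREATE TABLE StagingCompany (VendorCode TEXT, CompanyName TEXT);
    INSERT INTO DimCompany (VendorCode, CompanyName) VALUES ('123456', 'Microsoft');
    INSERT INTO StagingCompany VALUES ('123456', 'Microsoft'), ('555000', 'Contoso');
""")

def insert_new_companies(conn):
    """Insert staged rows whose vendor code has no match in DimCompany."""
    cur = conn.execute("""
        INSERT INTO DimCompany (VendorCode, CompanyName)
        SELECT s.VendorCode, s.CompanyName
        FROM StagingCompany AS s
        LEFT JOIN DimCompany AS d ON d.VendorCode = s.VendorCode
        WHERE d.CompanySK IS NULL
    """)
    conn.commit()
    return cur.rowcount  # number of rows added to the dimension

new_rows = insert_new_companies(conn)
```

Running the function a second time against the same staging data adds nothing, which is the behavior you want from an incremental load.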

There's lots of good information online about how to do this. I suggest picking up one of the Kimball books for help in designing data warehouses - I particularly like the Data Warehouse Toolkit which is updated to include information on using SSIS for ETL.

Hope this helps,

Andy

|||

Here are my 2 cents:

Having a conformed 'Company' dimension is the way to 'join' all fact rows. I am not sure about using a fuzzy lookup, as it may give you unexpected results, so evaluate for yourself the margin of error your company can tolerate. In general, you may want to assess how clean your data is so you can anticipate the results.

About how to organize the fact data; I would say that depends on the way the data is going to be used during analysis and reporting, and on the nature of the data. You may want to spend some time with the end users to get that feedback; draw a data model and review it with them again; this may take a few iterations.

Andy's suggestion about Kimball’s book is also a good idea.

Good luck with that

|||

Thanks for the response. I did purchase the book you referenced. It has been very helpful but doesn't delve into the issue of trying to create one dimension table from multiple sources with different primary keys.

I will definitely be creating surrogate keys because more data sources will be introduced over the long haul.

I don't follow the DimInformationType table. Is this more of a helper table to map the surrogate company key to the proprietary key used by the vendor? For example:

The first time I run the historical load I will have to insert companies (take Microsoft for example) into my DimCompany table. This will be done by using vendor A's company info, vendor B, and so on. Microsoft's companyId in vendor A's system might be 123456, in vendor B's system it may be 789101. I want one instance of Microsoft in my DimCompany table so I will have to do a lookup to make sure it isn't already in the table before inserting. If it is not in the DimCompany table then I will add to DimCompany and then add to another table that has the surrogate key/vendor a key combination.

When it's time to access vendor B's file, most of the companies will already be in the DimCompany table. If not, follow the same procedure as above. If they are in the DimCompany table, then I will have to add a row to the vendor B helper table with the surrogate key/vendor B key.

It all stems from the fact that each vendor has its own proprietary (different) key for the same company. I don't know how I would get around not using fuzzy logic or some sort of text mapping. Text mapping could be dangerous as well since the names may be slightly different.
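To make the text-mapping risk concrete, here is a small sketch using Python's standard difflib module. It is not the SSIS Fuzzy Lookup algorithm, just a stand-in to show the idea; the suffix list and both function names are assumptions made up for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

same = similarity("Microsoft Corp.", "MICROSOFT CORPORATION")   # both normalize to "microsoft"
different = similarity("General Motors", "General Mills")       # distinct companies, similar text
```

After normalization the two Microsoft spellings compare as identical, while two unrelated companies that share a leading word still get a nonzero score. Any threshold chosen for automatic matching would need to be checked against real vendor data, with borderline pairs sent for manual review.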

|||

There are 2 other books from Kimball’s group:

The Data Warehouse ETL Toolkit

The Microsoft Data Warehouse Toolkit

The first one covers how to conform dimensions coming from different sources; the second one covers Kimball's warehouse methodology using SQL Server 2005 tools.

|||

jrp210 wrote:

The first time I run the historical load I will have to insert companies (take Microsoft for example) into my DimCompany table. This will be done by using vendor A's company info, vendor B, and so on. Microsoft's companyId in vendor A's system might be 123456, in vendor B's system it may be 789101. I want one instance of Microsoft in my DimCompany table so I will have to do a lookup to make sure it isn't already in the table before inserting. If it is not in the DimCompany table then I will add to DimCompany and then add to another table that has the surrogate key/vendor a key combination.

Hi jrp210,

That's different from what I understood previously. Don't feel bad, this happens to me a lot.

Maybe your schema looks like this:

CompanySK (surrogate key)   CompanyName (business key)
1                           Microsoft

VendorSK (surrogate key)   VendorName (business key)
1                          Vendor A
2                          Vendor B

VendorCompanySK   VendorCompany_CompanySK   VendorCompany_VendorSK   VendorCompany_ID
1                 1                         1                        123456
2                 1                         2                        789101

This is a snowflake that allows you to utilize Company and Vendor separately in facts, and also use VendorCompany in facts. There are foreign key relationships between DimCompany and DimVendorCompany, and DimVendor and DimVendorCompany.
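The three tables above could be sketched as follows, using Python's built-in sqlite3 module purely for illustration. The table and column names come from the post; the column types, the UNIQUE constraints, and the company_sk_for helper are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE DimCompany (
        CompanySK   INTEGER PRIMARY KEY,
        CompanyName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE DimVendor (
        VendorSK   INTEGER PRIMARY KEY,
        VendorName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE DimVendorCompany (
        VendorCompanySK         INTEGER PRIMARY KEY,
        VendorCompany_CompanySK INTEGER NOT NULL REFERENCES DimCompany (CompanySK),
        VendorCompany_VendorSK  INTEGER NOT NULL REFERENCES DimVendor (VendorSK),
        VendorCompany_ID        TEXT NOT NULL,
        UNIQUE (VendorCompany_VendorSK, VendorCompany_ID)  -- assumed: one row per vendor key
    );
    INSERT INTO DimCompany VALUES (1, 'Microsoft');
    INSERT INTO DimVendor VALUES (1, 'Vendor A'), (2, 'Vendor B');
    INSERT INTO DimVendorCompany VALUES (1, 1, 1, '123456'), (2, 1, 2, '789101');
""")

def company_sk_for(conn, vendor_name, vendor_company_id):
    """Resolve a vendor's proprietary id to the shared CompanySK, or None if unknown."""
    row = conn.execute("""
        SELECT vc.VendorCompany_CompanySK
        FROM DimVendorCompany AS vc
        JOIN DimVendor AS v ON v.VendorSK = vc.VendorCompany_VendorSK
        WHERE v.VendorName = ? AND vc.VendorCompany_ID = ?
    """, (vendor_name, vendor_company_id)).fetchone()
    return row[0] if row else None
```

With the sample rows, both ('Vendor A', '123456') and ('Vendor B', '789101') resolve to CompanySK 1, so facts loaded from either vendor's file land on the same Microsoft row.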

Hope this helps,

Andy

|||

Andy,

My goal would be to create one company record from multiple sources. I don't necessarily need a vendor or vendorcompany table but will probably need their proprietary company id as an attribute in my company dimension table so that one could link back using their key.

I would assume something like this:

CompanySK
VendorA CompanyId
VendorB CompanyId
VendorC CompanyId
Company Name

For the most part one row should contain an ID for vendors A, B, and C, but there are times when that is not the case. Is that why the company name would be the business/natural key?
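For comparison, the wide layout listed above might look like this in a short Python sketch. The class, the field names, the helper function, and the second sample company are all hypothetical; the point is only that each vendor ID becomes an optional attribute that stays empty until that vendor supplies the company.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimCompanyRow:
    company_sk: int
    company_name: str
    vendor_a_company_id: Optional[str] = None  # None until Vendor A supplies this company
    vendor_b_company_id: Optional[str] = None
    vendor_c_company_id: Optional[str] = None

def find_by_vendor_id(rows, vendor_column, vendor_company_id):
    """Return the first row whose given vendor column holds the id, else None."""
    for row in rows:
        if getattr(row, vendor_column) == vendor_company_id:
            return row
    return None

rows = [
    DimCompanyRow(1, "Microsoft", vendor_a_company_id="123456", vendor_b_company_id="789101"),
    DimCompanyRow(2, "Contoso", vendor_a_company_id="555000"),  # hypothetical company
]

hit = find_by_vendor_id(rows, "vendor_b_company_id", "789101")
```

One trade-off of this layout is that every additional vendor requires a new column in the dimension, whereas a separate mapping table only needs new rows.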

|||

The first thing you need to do is define the grain of your dimension. It looks to me like the grain is one row for each company (even when that company exists in several vendor data sources). So if you have more than one 'version' of a company, as in your Microsoft example, you would need some kind of auxiliary table to keep the 1:many relationship between the one row per company in your dimension and the many rows per company in the sources.

BTW, this has nothing to do with SSIS...but I hope it helps

|||

You are right about the SSIS - probably more geared toward dimensional modeling/DW. But, I am using SSIS to do that so this is where I originally posted to.

Yes, the grain of the dimension is the company. There really isn't more than one version of the company. It is the same company coming from different source (vendor) systems. The vendor systems tag a company with different unique keys, so there isn't one natural key to use across all of them. If I understand what you are saying, this auxiliary table would "create" the natural key that will be used in the dimension table?

|||

jrp210 wrote:

If I understand what you are saying, this auxiliary table would "create" the natural key that will be used in the dimension table?

That is right. You could create a surrogate key in the dimension, and then the auxiliary table will keep the relationship between the source system keys (many per company) and the dimension surrogate key (one per company).

|||

At first it makes sense, but during the initial load how would you keep each company unique in the DimCompany table without a key that links the vendor records together? Or, better put, which comes first: loading the auxiliary table or the dimension table?

I was assuming the auxiliary table would be loaded first:
VendorId
VendorCompanyCode
CompanyName
etc.

But then I would have to create a unique key that would be used in the DimCompany table. There would have to be another table that then creates this key (identity column) that would be used in the DimCompany table.
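One possible ordering, offered only as a sketch and not as an answer given in this thread: let the dimension's own identity column generate the surrogate key, and write the auxiliary row immediately afterwards. The example below uses Python's built-in sqlite3 module; the CompanyKeyMap table name, the MatchName column, and the exact-match rule on a normalized name are all assumptions. A real load would need a more careful matching step with manual review of uncertain matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCompany (
        CompanySK   INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity-style surrogate key
        CompanyName TEXT NOT NULL,
        MatchName   TEXT NOT NULL UNIQUE   -- hypothetical normalized name used for matching
    );
    CREATE TABLE CompanyKeyMap (            -- the auxiliary table (hypothetical name)
        VendorId          TEXT NOT NULL,
        VendorCompanyCode TEXT NOT NULL,
        CompanySK         INTEGER NOT NULL REFERENCES DimCompany (CompanySK),
        PRIMARY KEY (VendorId, VendorCompanyCode)
    );
""")

def resolve_company(conn, vendor_id, vendor_code, company_name):
    """Return the CompanySK for a vendor row, creating rows as needed."""
    # 1. Known vendor key: reuse the mapped surrogate key.
    row = conn.execute(
        "SELECT CompanySK FROM CompanyKeyMap WHERE VendorId = ? AND VendorCompanyCode = ?",
        (vendor_id, vendor_code)).fetchone()
    if row:
        return row[0]
    # 2. Unknown key: try to match an existing company by normalized name.
    match_name = " ".join(company_name.lower().split())
    row = conn.execute(
        "SELECT CompanySK FROM DimCompany WHERE MatchName = ?", (match_name,)).fetchone()
    if row:
        company_sk = row[0]
    else:
        # 3. No match: insert the dimension row first so it generates the key.
        cur = conn.execute(
            "INSERT INTO DimCompany (CompanyName, MatchName) VALUES (?, ?)",
            (company_name, match_name))
        company_sk = cur.lastrowid
    # 4. Record the vendor key against the surrogate key in the auxiliary table.
    conn.execute("INSERT INTO CompanyKeyMap VALUES (?, ?, ?)",
                 (vendor_id, vendor_code, company_sk))
    conn.commit()
    return company_sk

sk_a = resolve_company(conn, "A", "123456", "Microsoft")
sk_b = resolve_company(conn, "B", "789101", "microsoft")
```

In this sketch both vendor rows end up with the same surrogate key: the first call creates the dimension row and its mapping, and the second call finds the existing company by name and only adds a mapping row. No separate key-generating table is needed because the dimension supplies the key itself.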
