Showing posts with label vendor. Show all posts
Showing posts with label vendor. Show all posts

Wednesday, March 7, 2012

Company Dimension

We deal with multiple vendors who provide us information via text/xml files. Vendor A may provide financial data, vendor b provides litigation data, vendor c provides ratings data. Our current structure has databases for each vendor with its own company table which basically makes all this data disconnected. Of course each vendor has its own proprietary company id to make records unique.

All of the data is based on companies so the grain of data would be at a company level. I would like to be able to link this information together by creating a dimensional model that has a single company table (DimCompany) and has facts populated based on the type of data we receive. Would this be the right sequence of events?

1. My initial load (historical) would have to look at all these data sources and create one company record in my DimCompany table. This table would then link to all other fact tables to provide a single view of company info. I would imagine this would have to be a fuzzy lookup since one company will be in all sources.

2. On subsequent loads (incremental) I would probably have to do a lookup of companies in the dimension via the proprietary code and add if the company wasn't there.

Any advice on tackling this issue would be greatly appreciated especially if SSIS was used in the process.

Hi jrp210,

Based on my interpretation of your schema, I would design DimCompany as you describe. One row per company, unique IDs, and you may wish to add surrogate keys - they will make your data mart / warehouse easier to scale.

I would also add a DimInformationType for the different types of data you receive.

Another option is to snowflake the InformationType table off of DimCompany. This is common when the referenced lookup table is relatively small and doesn't change often.

There are performance implications but in smaller data warehouses (~1 - 2G) you will probably never notice. And if you do, an index will likely clear up any performance issue.

I don't follow the need for a fuzzy lookup. These are expensive in SSIS and should only be used when necessary.

You are correct: incremental loading does require a lookup (or a merge join) to detect new records.

There's lots of good information online about how to do this. I suggest picking up one of the Kimball books for help in designing data warehouses - I particularly like the Data Warehoue Toolkit which is updated to include information on using SSIS for ETL.

Hope this helps,

Andy

|||

Here are my 2 cents:

Having a conformed 'Company' dimension is the way to 'join' all fact rows. Not sure about using a Fuzzy lookup as it may give you non-expected results; so evaluate yourself the margin of error your company can tolerate. In general you may want to asses the level of cleanness of your data, so you can anticipate the results

About how to organize the fact data; I would say that depends on the way the data is going to be used during analysis and reporting, and on the nature of the data. You may want to spend some time with the end users to get that feedback; draw a data model and review it with them again; this may take a few iterations.

Andy suggestion about Kimball’s book is also a good idea.

Good luck with that

|||

Thanks for the response. I did purchase the book you referenced. It has been very helpful but doesn't delve into the issue of trying to create one dimension table from multiple sources with different primary keys.

I will definitely be creating surrogate keys because more data sources will be introduced over the long haul.

I don't follow the DimInformationType table. Is this more of a helper table to map the surrogate company key to the proprietary key used by the vendor? For example:

The first time I run the historical load I will have to insert companies (take Microsoft for example) into my DimCompany table. This will be done by using vendor A's company info, vendor B, and so on. Microsoft's companyId in vendor A's system might be 123456, in vendor B's system it may be 789101. I want one instance of Microsoft in my DimCompany table so I will have to do a lookup to make sure it isn't already in the table before inserting. If it is not in the DimCompany table then I will add to DimCompany and then add to another table that has the surrogate key/vendor a key combination.

When its time to access vendor B's file, most of the companies will be in the DimCompany table. If not, follow the same procedure as above. If they are in the DimCompany table then I will have to add a row to the vendor b helper table with the surrogate key/vendor b key.

It all stems from the fact that each vendor has its own proprietary (different) key for the same company. I don't know how I would get around not using fuzzy logic or some sort of text mapping. Text mapping could be dangerous as well since the names may be slightly different.

|||

There are 2 other books from Kimball’s group:

The Data Warehouse ETL tool kit

The Microsoft Data Warehouse Toolkit

The first one cover how to conform dimension (coming from different sources).; the second one covers Kimball's warehouse methodology using SQL Server 2005 tools.

|||

jrp210 wrote:

The first time I run the historical load I will have to insert companies (take Microsoft for example) into my DimCompany table. This will be done by using vendor A's company info, vendor B, and so on. Microsoft's companyId in vendor A's system might be 123456, in vendor B's system it may be 789101. I want one instance of Microsoft in my DimCompany table so I will have to do a lookup to make sure it isn't already in the table before inserting. If it is not in the DimCompany table then I will add to DimCompany and then add to another table that has the surrogate key/vendor a key combination.

Hi jrp210,

That's different from what I understood previously. Don't feel bad, this happens to me a lot.

Maybe your schema looks like this:

CompanySK (surrogate key) CompanyName (business key)
1 Microsoft

VendorSK (surrogate key) VendorName (business key)
1 Vendor A
2 Vendor B

VendorCompanySK VendorCompany_CompanySK VendorCompany_VendorSK VendorCompany_ID
1 1 1 123456
2 1 2 789101

This is a snowflake that allows you to utilize Company and Vendor separately in facts, and also use VendorCompany in facts. There are foreign key relationships between DimCompany and DimVendorCompany, and DimVendor and DimVendorCompany.

Hope this helps,

Andy

|||

Andy,

My goal would be to create one company record from multiple sources. I don't necessarily need a vendor or vendorcompany table but will probably need their proprietary company id as an attribute in my company dimension table so that one could link back using their key.

I would assume something like this:

CompanySK
VendorA CompanyId
VendorB CompanyId
VendorC CompanyId
Company Name

For the most part one row should contain an Id for vendor a,b, and c but there are times when that is not the case. Is that why the company name would be the business/natural key?

|||

The first thing you need to do if to define the grain of your dimension; it looks to me like the grain is one row for each company (even when that company exists in several vendor data sources); so if you have more than one 'version' of a company, like in your example of Microsoft company, you would need some kind of auxiliary table to keep that 1:many relationship between the many rows/company in the source and the 1 row per company n your dimension.

BTW, this has nothing to do with SSIS...but I hope it helps

|||

You are right about the SSIS - probably more geared toward dimensional modeling/DW. But, I am using SSIS to do that so this is where I originally posted to.

Yes, grain of the dimension is the company. The really isn't more than one version of the company. It is the same company coming from different source (vendor) systems. The vendor systems have different unique keys in which they tag a company. Because of this there isn't one natural key to use across all. If I understand what you are saying is that this auxillary table would "create" the natural key that will be used in the dimension table?

|||

jrp210 wrote:

If I understand what you are saying is that this auxillary table would "create" the natural key that will be used in the dimension table?

That is right. You could create a surrogate key in the dimension and then that auxiliary table will keep the relationship between source system keys(many for a company) and the Dimension surrogate key (one per company).

|||

At first it makes sense but to initially load the database how would you keep the company unique in the DimCompany table without a key that would link them together? Or better yet put, which comes first loading the auxillary table or Dimension table?

I was assuming the auxillary table would be loaded first:
VendorId
VendorCompanyCode
CompanyName
etc.

But then I would have to create a unique key that would be used in the DimCompany table. There would have to be another table that then creates this key (identity column) that would be used in the DimCompany table.

Thursday, February 16, 2012

Comments on InterSystems Cache Database Platform

Hi all,

We're looking at a vendor who uses the InterSystems Cache Database
Platform, but our IT department has zero experience with this system.
This software package will have a pivotal and mission critical roll in
our organization, so I'd like some comments on what others think of
this database platform.

Mainly I'm curious how easy/difficult it is to query a Cache Database,
and does it use standard SQL calls like Oracle and MS SQL? What about
ODBC connections into the database from Crystal, MS Access, and OLAP
tools? Any other caviets (backup, maintenance, etc)?

Thanks in advance for any suggestions or comments Cache.

SamAlex wrote:

> Hi all,
> We're looking at a vendor who uses the InterSystems Cache Database
> Platform, but our IT department has zero experience with this system.
> This software package will have a pivotal and mission critical roll in
> our organization, so I'd like some comments on what others think of
> this database platform.
> Mainly I'm curious how easy/difficult it is to query a Cache Database,
> and does it use standard SQL calls like Oracle and MS SQL? What about
> ODBC connections into the database from Crystal, MS Access, and OLAP
> tools? Any other caviets (backup, maintenance, etc)?
> Thanks in advance for any suggestions or comments Cache.
> Sam

Won't they let you test drive it?

I've never used Cache, but I have gone thru DB evaluations before. I would
only suggest you set out some clear DB requirements (testable if at all
possible). Their sales rep should be able to answer questions about its flavor
of SQL, ODBC support and such. If they cannot (or worse, will not), answer
those simple questions, then run away.

Your post here is not a bad start:
Interfaces/Languages supported - SQL, ODBC, JAVA(JDBC), Special libraries
Standards supported - like SQL, and at what level?
Tool availability - does it have custom Back up/restore programs or is it only
a data file? (
Front end support - Do you have to use their GUI development tools or can you
program using others?
Platforms supported - If next year you no longer want to pay MS for an OS
update license, can you switch to LINUX?
Programmers available? If few people know/use the DB, is training inexpensive?
Are other resources available (e.g. books, usenet discussion groups)?
Performance - run some benchmarks
Flexibility - does the DB adapt easily? (e.g., does adding a column to a table
force you to rebuild the DB? Does it force you to rebuild your applications?

Bottom line is that you need to watch your bottom line. IOW, what is the total
bang for the buck?

Good luck.
--
Ed Prochak
running http://www.faqs.org/faqs/running-faq/
netiquette http://www.psg.com/emily.html
--
"Two roads diverged in a wood and I
I took the one less travelled by
and that has made all the difference."
robert frost|||As a long time (satisfied) user of Cach I'll try to answer some of
the points raised.

First of all a few brief comments on the nature of the beast, what it
is and isn't. It's not your typical RDBMS like Oracle or SqlServer.
Cach is described by Intersystems as being "post-relational" which is
a marketing term supposed to convey modernness (I guess) but in
technical terms this is wishy washy and doesn't really tell you
anything. Cach is based on a multidimensional database engine which,
historically speaking, is derived from the M(umps) DBMS. On top of it
sits an object layer that turns Cach into an OODBMS (think classes,
properties, instance methods, class methods, relationships,
inheritance, polymorphism etc pp). Additionally, Cach will offer a
relational projection of object definitions. Every class is projected
into one or more tables. The relational projection allows for querying
via Sql. Obviously, methods will not be available via the relational
view (but class methods can be projected as stored procedures). The
projection works both ways. You can create tables via DDL and Cach
will create classes from that. They call this process "Unified Data
Architecture". You don't have to use the object features of Cach and
treat it purely as an RDBMS but in my opionon you would lose some of
the more stronger features, like encapsulation of business logic.

Interfaces / Languages
Sql, Odbc, Jdbc, Java, Ejb, C++, Com, Soap, Xml
Cach has its own server side scripting languages, Cach ObjectScript
(COS) and Cach Basic (syntactically derived from VBScript).
Particularly COS is a fully fledged and very flexible programming
language that goes way way beyond what you can do with your typical
RDBMS stored procedure languages like Transact Sql.
Also, Cach has its own web development platform, Cach Server Pages
(CSP) which, in much simplified terms, can be viewed as a cross
between ASP and ASP.Net (but the business logic resides on the
back-end not on the web server).

Standards
Sql 92 Intermediate
Jdbc Type 4
COS is backwards compatible with ISO M(umps).

Operational Tools
Online Backup and Restore
Journalling: Before Image, After Image, Shadow (i.e. near-instant
cross-system replication)
Database Repair (the physical db layer is based on B-trees), although
you are unlikely to ever use them. While Cach as a product / brand
name is around 6 years old the underlying database engine has evolved
from its pre-decessor products over the last 25 years (Intersystems
has been in the dbms market since 1978) and is rock solid.

Front-end tools
Any you like as long as they support one of the interfaces into Cach
mentioned above. If you take our company as an example, we have
developed terminal based applications, moved to C++ front-ends with a
low-level COM interface into the Cach multidimensional database,
moved to VB6 with COM interface into Cach Objects, moved to web
applications using Cach's own CSP and, most recently, ASP.Net via
Soap to Cach (i.e. Cach providing web services).

Platforms
Windows, Linux, Unix (HP, AIX, Solaris) and OpenVMS

Performance
One of the big strengths of Cach. Scales excellently while not being
very resource hungry.

Flexibility
Schema Migrations do not require db rebuild

Resources
There is a newsgroup intersystems.public.cache on
news.intersystems.com

Best regards

Wolf Koelling

Ed Prochak <ed.prochak@.magicinterface.com> wrote in message news:<OqRAc.49460$bO6.35830@.fe39.usenetserver.com>...
> Alex wrote:
> > Hi all,
> > We're looking at a vendor who uses the InterSystems Cache Database
> > Platform, but our IT department has zero experience with this system.
> > This software package will have a pivotal and mission critical roll in
> > our organization, so I'd like some comments on what others think of
> > this database platform.
> > Mainly I'm curious how easy/difficult it is to query a Cache Database,
> > and does it use standard SQL calls like Oracle and MS SQL? What about
> > ODBC connections into the database from Crystal, MS Access, and OLAP
> > tools? Any other caviets (backup, maintenance, etc)?
> > Thanks in advance for any suggestions or comments Cache.
> > Sam
> Won't they let you test drive it?
> I've never used Cache, but I have gone thru DB evaluations before. I would
> only suggest you set out some clear DB requirements (testable if at all
> possible). Their sales rep should be able to answer questions about its flavor
> of SQL, ODBC support and such. If they cannot (or worse, will not), answer
> those simple questions, then run away.
> Your post here is not a bad start:
> Interfaces/Languages supported - SQL, ODBC, JAVA(JDBC), Special libraries
> Standards supported - like SQL, and at what level?
> Tool availability - does it have custom Back up/restore programs or is it only
> a data file? (
> Front end support - Do you have to use their GUI development tools or can you
> program using others?
> Platforms supported - If next year you no longer want to pay MS for an OS
> update license, can you switch to LINUX?
> Programmers available? If few people know/use the DB, is training inexpensive?
> Are other resources available (e.g. books, usenet discussion groups)?
> Performance - run some benchmarks
> Flexibility - does the DB adapt easily? (e.g., does adding a column to a table
> force you to rebuild the DB? Does it force you to rebuild your applications?
> Bottom line is that you need to watch your bottom line. IOW, what is the total
> bang for the buck?
> Good luck.
> --
> Ed Prochak
> running http://www.faqs.org/faqs/running-faq/
> netiquette http://www.psg.com/emily.html|||Hi Sam,

Cache is really heavy in the Medical field. I worked for a Medical
Lab that was running it on several fronts, and to say the least it was
unreliable and had a VERY high overhead for maintenance and
administration. As it stands now it does not follow very many
"standards" in the industry so some things don't work as well.
Support is another issue. You may call support and get someone on the
line that doesn't understand their own product as well as waiting for
weeks to get a response to any issues you may have. The bad thing
about it is that the initial setup was setup by "Cache" consultants
and they company was still having all these issues.

If you have no knowledge of Cache, then the road will be long. If you
have a choice I would definately do a side by side comparison of other
offerings (Oracle, MS SQL, MySQL and even Filemaker Pro) before
committing to Cache.

I am currently working on a Cache roll out for a large firm that is
doing a 180 day eval and so far not one piece of completed code has
been put into testing. In a way Cache is a road block unto itself.

Don't get me wrong, if the product was that good, then I would think
other companies would be trying to re-develop or "copy" some of the
ideas that Cache uses, but you don't see that.

Please, Cache Advocates, don't be angry with my comments. These are
real experiences from someone who worked closely with the developers
at each of these locations.

samalex@.gmail.com (Alex) wrote in message news:<b8d0e42e.0406180801.dc3018e@.posting.google.com>...
> Hi all,
> We're looking at a vendor who uses the InterSystems Cache Database
> Platform, but our IT department has zero experience with this system.
> This software package will have a pivotal and mission critical roll in
> our organization, so I'd like some comments on what others think of
> this database platform.
> Mainly I'm curious how easy/difficult it is to query a Cache Database,
> and does it use standard SQL calls like Oracle and MS SQL? What about
> ODBC connections into the database from Crystal, MS Access, and OLAP
> tools? Any other caviets (backup, maintenance, etc)?
> Thanks in advance for any suggestions or comments Cache.
> Sam|||As you clearly had bad experiences with Cach I can hardly argue you
out of that. All I'd ask is whether you are sure that the problems
were due to Cach or was it the application that was badly written?

I do need to take issue with one of your statements, however, namely
that Cach has "a VERY high overhead for maintenance and
administration". This couldn't be further from the truth. Cach
requires next to no maintenance at all. As an example I can tell you
the setup of our company. We have a 24x7 mission-critical Cach server
with 250 to 300 concurrent users (incl. background processes) during
standard business hours. It has several databases for different
purposes / user groups and is networked to two off-site application
servers which use the live database. For resilience the live server
has a Shadow and a Disaster Recovery server connected to it which
replicate all database activities. We also have a development server
and a Change Management server. One of my tasks at the company is the
role of the DBA which occupies no more than 10% of my time, mostly
setting up and removing development areas and refreshing the data in
test areas. If you want to call that high maintenance then please give
me an idea of what constitutes low maintenance.

Best

Wolf Koelling

americatch@.hotmail.com (IT Man) wrote in message news:<113ad262.0406221055.32e0a4a0@.posting.google.com>...
> Hi Sam,
> Cache is really heavy in the Medical field. I worked for a Medical
> Lab that was running it on several fronts, and to say the least it was
> unreliable and had a VERY high overhead for maintenance and
> administration. As it stands now it does not follow very many
> "standards" in the industry so some things don't work as well.
> Support is another issue. You may call support and get someone on the
> line that doesn't understand their own product as well as waiting for
> weeks to get a response to any issues you may have. The bad thing
> about it is that the initial setup was setup by "Cache" consultants
> and they company was still having all these issues.
> If you have no knowledge of Cache, then the road will be long. If you
> have a choice I would definately do a side by side comparison of other
> offerings (Oracle, MS SQL, MySQL and even Filemaker Pro) before
> committing to Cache.
> I am currently working on a Cache roll out for a large firm that is
> doing a 180 day eval and so far not one piece of completed code has
> been put into testing. In a way Cache is a road block unto itself.
> Don't get me wrong, if the product was that good, then I would think
> other companies would be trying to re-develop or "copy" some of the
> ideas that Cache uses, but you don't see that.
> Please, Cache Advocates, don't be angry with my comments. These are
> real experiences from someone who worked closely with the developers
> at each of these locations.
> samalex@.gmail.com (Alex) wrote in message news:<b8d0e42e.0406180801.dc3018e@.posting.google.com>...
> > Hi all,
> > We're looking at a vendor who uses the InterSystems Cache Database
> > Platform, but our IT department has zero experience with this system.
> > This software package will have a pivotal and mission critical roll in
> > our organization, so I'd like some comments on what others think of
> > this database platform.
> > Mainly I'm curious how easy/difficult it is to query a Cache Database,
> > and does it use standard SQL calls like Oracle and MS SQL? What about
> > ODBC connections into the database from Crystal, MS Access, and OLAP
> > tools? Any other caviets (backup, maintenance, etc)?
> > Thanks in advance for any suggestions or comments Cache.
> > Sam