Thursday, March 29, 2012

Comparing two sets of data

I have the following situation. One set of data has 274 rows (set2)
and anther has 264 (set1). Both data sets are similar in structure as
well as values for both of them were extracts from the same parent
table. Hope the info would substitute DDL. I need to find the "gap"
rows between these two sets.
Attempted to run a query like
select count(*)
from set2
where not exists
(select *
from set1)
did not yield what I desired. What else to try?

TIA.NickName wrote:
> I have the following situation. One set of data has 274 rows (set2)
> and anther has 264 (set1). Both data sets are similar in structure as
> well as values for both of them were extracts from the same parent
> table.
> Hope the info would substitute DDL.

It doesn't.

> I need to find the "gap"
> rows between these two sets.
> Attempted to run a query like
> select count(*)
> from set2
> where not exists
> (select *
> from set1)
> did not yield what I desired. What else to try?

Try posting your DDL. We at least need column names and keys to help you
here.

Zach|||OK,

I've proven that the "EXISTS" keyword/function can't solve this
problem. But then what?

-- test equality between two data sets
-- desired resultset: return "gap" rows

-- DDL and DML
create table #tmp1 (col1 int, col2 char(1));

insert into #tmp1
values(1,'a')
insert into #tmp1
values(2,'b')
insert into #tmp1
values(3,'c')
insert into #tmp1
values(4,'d')
insert into #tmp1
values(5,'e');

select * into #tmp2
from #tmp1
where col1 < 5;
select *
from #tmp1
where not exists
(select *
from #tmp2)|||NickName (dadada@.rock.com) writes:
> I've proven that the "EXISTS" keyword/function can't solve this
> problem. But then what?

It certainly can, but you must specify how the NOT EXISTS is to work.
SQL is not about telepathy.

For your repro, you could do

select *
from #tmp1 t1
where not exists (Select *
from #tmp2 t2
WHERE t1.col1 = t2.col1)

or

select *
from #tmp1 t1
where not exists (Select *
from #tmp2 t
WHERE t1.col1 = t2.col1
AND t1.col2 = t2.col2)

depending on what you are looking for. The first just lists missing key
values, the second list all mismatches.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Thanks, Erland.
*) laziness is an enemy, I thought about that but the table has a large
number of columns, so, was wondering if there's another approach. Will
try it tomorrow when the db is available.
*) btw, I like the word, "telepathy".

Erland Sommarskog wrote:
> NickName (dadada@.rock.com) writes:
> > I've proven that the "EXISTS" keyword/function can't solve this
> > problem. But then what?
> It certainly can, but you must specify how the NOT EXISTS is to work.
> SQL is not about telepathy.
> For your repro, you could do
> select *
> from #tmp1 t1
> where not exists (Select *
> from #tmp2 t2
> WHERE t1.col1 = t2.col1)
> or
> select *
> from #tmp1 t1
> where not exists (Select *
> from #tmp2 t
> WHERE t1.col1 = t2.col1
> AND t1.col2 = t2.col2)
> depending on what you are looking for. The first just lists missing
key
> values, the second list all mismatches.
>
> --
> Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
> Books Online for SQL Server SP3 at
> http://www.microsoft.com/sql/techin.../2000/books.asp|||NickName (dadada@.rock.com) writes:
> Thanks, Erland.
> *) laziness is an enemy, I thought about that but the table has a large
> number of columns, so, was wondering if there's another approach. Will
> try it tomorrow when the db is available.

If your aim is to find differences in any column, you will indeed have
to write code that has all column. There is no shortcut. What you can
do, if you have many tables and columns, is to generate code by reading
metadata. But this requires that you have a clear understanding for which
columns you want to compare, and which you do not.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Erland,

Something seems odd. Here's two test cases. Both are successful.
-- test equality between two data sets (with exact meta data)
-- desired resultset: return "gap" rows

-- DDL and DML
create table #tmp1 (col1 int, col2 char(1));

insert into #tmp1
values(1,'a')
insert into #tmp1
values(2,'b')
insert into #tmp1
values(3,'c')
insert into #tmp1
values(4,'d')
insert into #tmp1
values(5,'e');

select * into #tmp2
from #tmp1
where col1 < 5;

select *
from #tmp1 t1
where not exists
(select *
from #tmp2 t2
where t1.col1 = t2.col1
and t1.col2 = t2.col2)

-- outcome = success

-- test equality between two data sets
-- with minor meta data difference (one table has 3 attributes while
the other has 2)
-- note: comparison seems still successful
-- desired resultset: return "gap" rows

drop table #tmp1
drop table #tmp2;

-- DDL and DML
create table #tmp1 (col1 int, col2 char(1),col3 int);

insert into #tmp1
values(1,'a',11)
insert into #tmp1
values(2,'b',22)
insert into #tmp1
values(3,'c',33)
insert into #tmp1
values(4,'d',44)
insert into #tmp1
values(5,'e',55);

select col1,col2 into #tmp2
from #tmp1
where col1 < 5;

select *
from #tmp1 t1
where not exists
(select *
from #tmp2 t2
where t1.col1 = t2.col1
and t1.col2 = t2.col2)

-- outcome = success

HOWEVER, when I applied the above to my tables (sorry I can't post
exact structure nor data here, PARENT table has 13 columns and the
derived table has 12 of them, comparison is between them), sql still
failed to find the "gap". What could possibly stands in the way?
Many thanks.|||Ahe, I think I've found the problem. In "my" tables, both has
duplicate rows, however, the number of duplicate rows are not the same,
(distinct rows are sure the same).|||NickName (dadada@.rock.com) writes:
> Ahe, I think I've found the problem. In "my" tables, both has
> duplicate rows, however, the number of duplicate rows are not the same,
> (distinct rows are sure the same).

I'm not sure how this should be addressed. Are you saying that in
one table you have two rows with the same value, but in another you
only have one?

If there are no disctinct keys in the data, all relational operations will
be problematic, that's for sure.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||You might find it easier to export the table contents as ordered data and
then use a text file compare utility. Command-prompt example:

BCP "SELECT * FROM MyDatabase..Table1 ORDER BY MyPK" queryout
"C:\temp\Table1.txt" /T /c /SMyServer

BCP "SELECT * FROM MyDatabase..Table2 ORDER BY MyPK" queryout
"C:\temp\Table2.txt" /T /c /SMyServer

WINDIFF "C:\temp\Table1.txt" "C:\temp\Table2.txt"

--
Happy Holidays

Dan Guzman
SQL Server MVP

"NickName" <dadada@.rock.com> wrote in message
news:1104161334.766126.171470@.f14g2000cwb.googlegr oups.com...
> Ahe, I think I've found the problem. In "my" tables, both has
> duplicate rows, however, the number of duplicate rows are not the same,
> (distinct rows are sure the same).|||Dan Guzman (guzmanda@.nospam-online.sbcglobal.net) writes:
> You might find it easier to export the table contents as ordered data and
> then use a text file compare utility. Command-prompt example:
> BCP "SELECT * FROM MyDatabase..Table1 ORDER BY MyPK" queryout
> "C:\temp\Table1.txt" /T /c /SMyServer
> BCP "SELECT * FROM MyDatabase..Table2 ORDER BY MyPK" queryout
> "C:\temp\Table2.txt" /T /c /SMyServer
> WINDIFF "C:\temp\Table1.txt" "C:\temp\Table2.txt"

However, the use of windiff will break down if there is a column which is
permitted to be different in the tables. But there is a better alternative:
Beyond Compare, from http://www.scootersoftware.com. Beyond Compare
offers comparison on character level.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Erland,

You're exactly right. Maybe they did that to test my analytic skill,
who knows :)|||Dan,

Thanks. It seems a good way to try. Will do tomorrow.

Erland, no columns are supposed to contain different data sets though
one table may have an extra column, that is supposedly the only
difference. OK, I'll try the recommended software as well when get a
chance. I appreciate it.

Donsqlsql

No comments:

Post a Comment