Hi,
I hae a DB with a table holding companies, each company has the fields :
name, address, city, state, country and zip.
I have several clients connected to my server application (ASP.Net), and
they sometimes add new companies to the companies table. When they add the
company they don't always take the time to split the different fields of the
company's address and sends it (address, city, state, country and zip) in a
single row. The pattern of the row isn't always the same, sometimes only the
city is added to the address, sometimes only the zip, sometimes none.
I'm trying to make the match between the data the clients send and the data
I already have in my DB in order not to store the same company twice in the
DB.
I've tried using Full-Text Search, I've tried several logical algorithms,
but I can't make an automatic solution to this problem.
Please add any idea you may have on how I can make this comparission as
automatic as possible.
ThanksYou're best bet would be to scrub the data when it's entered, i.e.--change
the app to prevent the concatenated addresses. The only alternative I can
think of would be to split the incoming data into tokens like a compiler, an
d
then compare the token pattern against a syntax graph to categorize the data
.
It might be easier to reverse the graph, in other words, work backward: a
zip code usually has a pattern of either ##### or #####-####, unless you're
in Canada. The state or province usually preceeds the zip code, the city
preceeds the state or province, etc.
"YN" wrote:
> Hi,
> I hae a DB with a table holding companies, each company has the fields :
> name, address, city, state, country and zip.
> I have several clients connected to my server application (ASP.Net), and
> they sometimes add new companies to the companies table. When they add the
> company they don't always take the time to split the different fields of t
he
> company's address and sends it (address, city, state, country and zip) in
a
> single row. The pattern of the row isn't always the same, sometimes only t
he
> city is added to the address, sometimes only the zip, sometimes none.
> I'm trying to make the match between the data the clients send and the dat
a
> I already have in my DB in order not to store the same company twice in th
e
> DB.
> I've tried using Full-Text Search, I've tried several logical algorithms,
> but I can't make an automatic solution to this problem.
> Please add any idea you may have on how I can make this comparission as
> automatic as possible.
> Thanks|||Thanks for your reply Brian,
I can't force the users to split the address to separate fields. I usually
don't have a problem with identifying the country and state, however this is
not the case when dealing cities and addresses.
My current algorithm is (for a1 (=address1), a2 (=address2), c1 (=city1), c2
(=city2)
1. Remove all non relevant chars (non literal or numeric chars) from all.
Remove "non relevant words" like "street","road","st." etc from addresses.
2. Check if c1 is null. If so check if c2 exists in a1 and if so remove it
from the address and set c1 = c2
3. The other way around (c2 and a1)
4. tokenize the results by spaces and try to find matches between them
(since sometimes the structure is streen name followed by building number an
d
sometimes its the other waty around)
Do you perhaps have a better suggestion ?
"Brian Selzer" wrote:
> You're best bet would be to scrub the data when it's entered, i.e.--change
> the app to prevent the concatenated addresses. The only alternative I can
> think of would be to split the incoming data into tokens like a compiler,
and
> then compare the token pattern against a syntax graph to categorize the da
ta.
> It might be easier to reverse the graph, in other words, work backward: a
> zip code usually has a pattern of either ##### or #####-####, unless you'r
e
> in Canada. The state or province usually preceeds the zip code, the city
> preceeds the state or province, etc.
> "YN" wrote:
>|||Thanks for your reply Brian,
I can't force the users to split the address to separate fields. I usually
don't have a problem with identifying the country and state, however this is
not the case when dealing cities and addresses.
My current algorithm is (for a1 (=address1), a2 (=address2), c1 (=city1), c2
(=city2)
1. Remove all non relevant chars (non literal or numeric chars) from all.
Remove "non relevant words" like "street","road","st." etc from addresses.
2. Check if c1 is null. If so check if c2 exists in a1 and if so remove it
from the address and set c1 = c2
3. The other way around (c2 and a1)
4. tokenize the results by spaces and try to find matches between them
(since sometimes the structure is streen name followed by building number an
d
sometimes its the other waty around)
Do you perhaps have a better suggestion ?
"Brian Selzer" wrote:
> You're best bet would be to scrub the data when it's entered, i.e.--change
> the app to prevent the concatenated addresses. The only alternative I can
> think of would be to split the incoming data into tokens like a compiler,
and
> then compare the token pattern against a syntax graph to categorize the da
ta.
> It might be easier to reverse the graph, in other words, work backward: a
> zip code usually has a pattern of either ##### or #####-####, unless you'r
e
> in Canada. The state or province usually preceeds the zip code, the city
> preceeds the state or province, etc.
> "YN" wrote:
>|||Look up the tools that Mellisa Data sells for cleaning up addresses.|||YN
If i understood you may want to write a trigger or better stored procedure
to catch then INSERT a new company and check all data
declare @.address VARCHAR(100)
IF EXISTS (SELECT * FROM Table WHERE address =@.address AND company=@.company
AND .....)
PRINT 'You are trying to insert a duplicate'
"YN" <YN@.discussions.microsoft.com> wrote in message
news:F8C33439-18BB-4183-98DD-BDB2F244BF0B@.microsoft.com...
> Hi,
> I hae a DB with a table holding companies, each company has the fields :
> name, address, city, state, country and zip.
> I have several clients connected to my server application (ASP.Net), and
> they sometimes add new companies to the companies table. When they add the
> company they don't always take the time to split the different fields of
the
> company's address and sends it (address, city, state, country and zip) in
a
> single row. The pattern of the row isn't always the same, sometimes only
the
> city is added to the address, sometimes only the zip, sometimes none.
> I'm trying to make the match between the data the clients send and the
data
> I already have in my DB in order not to store the same company twice in
the
> DB.
> I've tried using Full-Text Search, I've tried several logical algorithms,
> but I can't make an automatic solution to this problem.
> Please add any idea you may have on how I can make this comparission as
> automatic as possible.
> Thanks
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment