Sunday, March 25, 2012

comparing nvarchar(max) column using like to non-ASCII range

Our database defines the long_value column as nvarchar(max). I want to find out which rows actually contain non-ASCII characters in that column, but this clause also returns rows with only ASCII characters:

where long_value like (N'%[' + nchar(128) + N'-' + nchar(65535) + N']%')

What am I doing wrong?

It seems to be the fact that nchar(65535) has no length. I couldn't find any really good reference to cover this (perhaps this thread: http://groups.google.com/group/comp.lang.javascript/tree/browse_frm/month/2004-09/7d3603a75c1550f3?rnum=91&_done=%2Fgroup%2Fcomp.lang.javascript%2Fbrowse_frm%2Fmonth%2F2004-09%3F), but if you run these statements:

select N'%[' + nchar(128) + N'-' + nchar(65535) + N']%'

select N'%[' + nchar(128) + N'-' + nchar(65534) + N']%'

You will see that there is no second character after the - in the second string. This should do what you want:

declare @.long_value nvarchar(max)
set @.long_value = N'abcdefghijklmnopqrstuvwxyz' + nchar(129)
select case when @.long_value like (N'%[' + nchar(128) + N'-' + nchar(65534) + N']%') then 'yes' else 'no' end

and you could add a check for the nchar(65535) also...

|||

Louis Davidson wrote:

It seems to be the fact that nchar(65535) has no length. I couldn't find any really good reference to cover this (perhaps this thread: http://groups.google.com/group/comp.lang.javascript/tree/browse_frm/month/2004-09/7d3603a75c1550f3?rnum=91&_done=%2Fgroup%2Fcomp.lang.javascript%2Fbrowse_frm%2Fmonth%2F2004-09%3F), but if you run these statements:

select N'%[' + nchar(128) + N'-' + nchar(65535) + N']%'

select N'%[' + nchar(128) + N'-' + nchar(65534) + N']%'

You will see that there is no second character after the - in the second string.

I can't corroborate your results. I'm runing SQL Server 2005, both an unpatched Enterprise Edition and the SP1 + build 2153 hotfix Developer Edition. Connected to either server's master database, the result of both queries is displayed in Management Studio with an empty box on either side of the dash. Furthermore, the results of the following queries are 1, 2, 3, and 7 as expected:

selectlen(nchar(65535));

selectlen(nchar(128)+nchar(65535));

selectlen(nchar(128)+ N'-'+nchar(65535));

selectlen(N'%['+nchar(128)+ N'-'+nchar(65535)+ N']%');

Louis Davidson wrote:

This should do what you want:

declare @.long_value nvarchar(max)
set @.long_value = N'abcdefghijklmnopqrstuvwxyz' + nchar(129)
select case when @.long_value like (N'%[' + nchar(128) + N'-' + nchar(65534) + N']%') then 'yes' else 'no' end

and you could add a check for the nchar(65535) also...

Well, the obvious way to add a check for nchar(65535) is to include it in the range:

N'%['+nchar(128)+ N'-'+nchar(65534)+nchar(65535)+ N']%'

Which, if it worked, would indicate a bug in character range matching. But it fails in the same way as the original.

Here is a diagnostic query that suggests to me that the problem has something to do with collations (a topic I had been blissfully ignorant of). If the server's default collation or the database's collation is Latin1_General_BIN2, it returns the expected result; but if the collation is SQL_Latin1_General_CP1_CI_AS, it returns 21,230 Unicode characters between 0 and 65535 inclusive which are equal to nchar(65535). Most disturbing is the fact that 0 (NULL) and 32 (SPACE) are among them, which might explain my original problem with the character range. Try this out on different servers and databases:

declare @.code as integer

set @.code = 0

while(@.code < 65536)

begin

if(nchar(@.code)=nchar(65535))

printcast(@.code asnchar)+' equal'

set @.code = @.code + 1

end

|||

Kevin Rodgers wrote:

Here is a diagnostic query that suggests to me that the problem has something to do with collations (a topic I had been blissfully ignorant of). If the server's default collation or the database's collation is Latin1_General_BIN2, it returns the expected result; but if the collation is SQL_Latin1_General_CP1_CI_AS, it returns 21,230 Unicode characters between 0 and 65535 inclusive which are equal to nchar(65535). Most disturbing is the fact that 0 (NULL) and 32 (SPACE) are among them, which might explain my original problem with the character range. Try this out on different servers and databases:

declare @.code as integer set @.code = 0 while (@.code < 65536) begin if (nchar(@.code) = nchar(65535)) print cast(@.code as nchar) + ' equal' set @.code = @.code + 1 end

Collations are definitely the issue, since the orginal query works as intended (returning all long_value columns with non-ASCII characters) when reformulated with an explicit collation: select * from Item_Detail where (long_value collate Latin1_General_BIN2) like (N'%[' + nchar(128) + N'-' + nchar(65535) + N']%')|||

Kevin Rodgers wrote:

Kevin Rodgers wrote:

Here is a diagnostic query that suggests to me that the problem has something to do with collations (a topic I had been blissfully ignorant of). If the server's default collation or the database's collation is Latin1_General_BIN2, it returns the expected result; but if the collation is SQL_Latin1_General_CP1_CI_AS, it returns 21,230 Unicode characters between 0 and 65535 inclusive which are equal to nchar(65535). Most disturbing is the fact that 0 (NULL) and 32 (SPACE) are among them, which might explain my original problem with the character range. Try this out on different servers and databases:

declare @.code as integer set @.code = 0 while (@.code < 65536) begin if (nchar(@.code) = nchar(65535)) print cast(@.code as nchar) + ' equal' set @.code = @.code + 1 end

Collations are definitely the issue, since the orginal query works as intended (returning all long_value columns with non-ASCII characters) when reformulated with an explicit collation: select * from Item_Detail where (long_value collate Latin1_General_BIN2) like (N'%[' + nchar(128) + N'-' + nchar(65535) + N']%')

Unfortunately, the Latin1_General_CI_AS collation does not work like the Latin1_General_BIN2 collation. Can someone explain that?

No comments:

Post a Comment