Using collation features to calculate Chinese character strokes and obtain pinyin initials

Source: Internet
Author: User
Tags comparison table

SQL Server collation is not a lot of the usual use, perhaps a lot of beginners are more unfamiliar, but there are
A mistake everyone should be frequently encountered: SQL Server database, when cross-Library multi-table connection query, if two data
The library default character set is different, and the system returns an error like this:

"The collation violation for the equal to operation could not be resolved. ”

I. Error ANALYSIS:

This error is due to inconsistent collation, we do a test, such as:
CREATE TABLE #t1 (
Name varchar (COLLATE) ALBANIAN_CI_AI_WS,
Value int)

CREATE TABLE #t2 (
Name varchar (COLLATE) CHINESE_PRC_CI_AI_WS,
Value int)

After the table is built, execute the connection query:

SELECT * from #t1 A inner joins #t2 B on A.name=b.name

In this way, the error occurs:

Server: Msg 446, Level 16, State 9, line 1
Unable to resolve collation violation for equal to operation.
The simplest way to rule out this error is to specify the collation of the table when it joins, so that the error
No longer appears. Statement to write this:

SELECT *
From #t1 A INNER join #t2 B
On A.name=b.name Collate Chinese_prc_ci_ai_ws


Two. Introduction to collation Rules:

What is a collation? MS is described in this way: "In Microsoft SQL Server 2000,
The physical storage of a string is controlled by a collation. Collations specify the bit patterns that represent each character and the
The rules used to store and compare characters. "
You can get all the collations supported by SQL Server by executing the following statement inside the Query Analyzer.

SELECT * FROM:: Fn_helpcollations ()

The collation name consists of two parts, and the first half refers to the character set supported by this collation.
Such as:
Chinese_prc_cs_ai_ws
First half: Refers to the Unicode character set, chinese_prc_ the pointer to the mainland simplified Unicode collation.
The second half of the collation is the suffix meaning:
_bin binary Ordering
_ci (CS) is case sensitive, CI is not differentiated, CS is distinguished
_ai (AS) are accent-sensitive, AI not differentiated, as differentiated
_ki (KS) Whether the kana type is differentiated, KI does not differentiate, KS differentiates
_WI (WS) Whether the width of the WI is differentiated, WS-Differentiated

Case Sensitivity: Select this option if you want the comparison to treat uppercase and lowercase letters as unequal.
Accent Sensitivity: Select this option if you want the comparison to treat accented and non-accented letters as unequal. If you select this option,
Comparisons also treat letters with different accents as unequal.
Kana distinction: Select this option if you want the comparison to treat katakana and hiragana Japanese syllables as unequal.
Width difference: Select this option if you want the comparison to treat half-width characters and full-width characters as unequal


Three. Application of collation:

SQL Server provides a large number of Windows and SQL Server-specific collations, but its application is often
Ignored by the developer. In fact, it is of great use in practice.

Example 1: Make the contents of the table Name column sorted by phonetic alphabet:

CREATE TABLE #t (ID int,name varchar (20))
Insert #t Select 1, ' Medium '
UNION ALL Select 2, ' Country '
UNION ALL Select 3, ' People '
UNION ALL Select 4, ' Ah '

SELECT * from #t ORDER by name collate Chinese_prc_cs_as_ks_ws
drop table #t
/* Results:
ID Name
----------- --------------------
4 AH
2 countries
3 people
1 in
*/

Example 2: Make the contents of the table Name column sorted by last name stroke:

CREATE TABLE #t (ID int,name varchar (20))

Insert #t Select 1, ' Three '
UNION ALL Select 2, ' B '
UNION ALL Select 3, ' two '
UNION ALL Select 4, ' one '
UNION ALL Select 5, ' Ten '
SELECT * from #t ORDER by name collate Chinese_prc_stroke_cs_as_ks_ws
drop table #t
/* Results:
ID Name
----------- --------------------
41
+ Plus
32
50
13
*/

Four. Extensions to application of collation rules in practice

The SQL Server Kanji collation can be sorted by pinyin, stroke, and so on, so how do we use this feature
To deal with some of the problems of Chinese characters. I now give an example:

Calculation of Chinese character strokes with collation characteristics

To calculate the stroke of Chinese characters, we have to do the preparatory work, we know that the Windows multi-national character, Unicode currently
A total of 20,902 Chinese characters are included. Simplified GBK Code Chinese Unicode value starts from 19968.
First, we use the SQL Server method to get all the Chinese characters, without the dictionary, we simply use the statement
Can get:

Select Top 20902 code=identity (int,19968,1) to #t from syscolumns A,syscolumns b

With the following statement, we get all the kanji, which are sorted by Unicode value:

Select Code,nchar (code) as Cnword from #t

We then use the SELECT statement to sort it by stroke.

Select Code,nchar (code) as Cnword
From #t
ORDER by NCHAR (code) COLLATE Chinese_prc_stroke_cs_as_ks_ws,code

Results:
Code Cnword
----------- ------
199,681
20008 丨
20022,
20031 丿
20032 Cookbook
20033 Xiha
20057 b
20058 乚
20059 bereaved
20101 sanctioned
19969 Ding
..........

From the above results, we can clearly see that a stroke of kanji, code is from 19968 to 20101, from small to large row, but to
The first word "ding" of two Chinese characters, code 19969, did not start again in order. With this result, we can light
The first or last Chinese character of each stroke is categorized by the SQL statement.
The following sentence is used to get the last kanji:

CREATE table #t1 (id int identity,code int,cnword nvarchar (2))

Insert #t1 (Code,cnword)
Select Code,nchar (code) as Cnword from #t
ORDER by NCHAR (code) COLLATE Chinese_prc_stroke_cs_as_ks_ws,code


Select A.cnword
From #t1 A
Left join #t1 B on a.id=b.id-1 and A.codewhere b.code is null
ORDER BY a.ID

Get 36 Chinese characters, each of which is sorted by Chinese_prc_stroke_cs_as_ks_ws sorting rules
Last Kanji:

Sanctioned 阝 Ma Finglong qi turtle teeth poison ageing niches 龂 nightmares gum looser dragon Yue 龎 bond Qiuyan 龡 kotewall 龝 齹 龣 calls 齈 龞 麷 鸞 麣 龖 龗 齾 snuffle Tantrayana

As can be seen above: "Sanctioned" is the last word after all the Chinese characters are sorted, "阝" is the last of all two Chinese characters sorted
A word ... Wait a minute.
But also found that from the 33rd Chinese character "龗 (33 pen)" After the stroke is a bit messy, incorrect. But it doesn't matter, than the "龗" stroke
More than four characters, we add: 齾 35 pen, Snuffle 36 pen, 靐 39 pen, Tantrayana 64 pen

Chinese Character Stroke Chart (TAB_HZBH):
CREATE TABLE TAB_HZBH (ID int identity,cnword nchar (1))
--Insert the first 33 characters first
Insert Tab_hzbh
Select Top A.cnword
From #t1 A
Left join #t1 B on a.id=b.id-1 and A.codewhere b.code is null
ORDER BY a.ID
--Add the last four characters
SET Identity_insert TAB_HZBH on
Go
Insert TAB_HZBH (Id,cnword)
Select 35,n ' 齾 '
UNION ALL select 36,n ' Snuffle '
UNION ALL select 39,n ' 靐 '
UNION ALL select 64,n ' Tantrayana '
Go
SET Identity_insert TAB_HZBH off
Go

So far, we can get the results, like we want the strokes of the Chinese character "country":

DECLARE @a nchar (1)
Set @a= ' Country '
Select top 1 ID
From TAB_HZBH
where Cnword>[email protected] collate CHINESE_PRC_STROKE_CS_AS_KS_WS
ORDER BY ID

Id
-----------
8
(Result: Chinese characters "country" stroke number is 8)

All the preparation process above, just to write the following function, this function is set aside all the temporary tables and fixed
For general purpose and code transfer convenience, the contents of the table Tab_hzbh are written in the statement, then the user input a string
Total strokes of Chinese characters:

Create function fun_getbh (@str nvarchar (4000))
returns int
As
Begin
DECLARE @word nchar (1), @n int
Set @n=0
While Len (@str) >0
Begin
Set @word =left (@str, 1)
--If non-kanji, stroke 0
Set @[email protected]+ (case when Unicode (@word) between 19968 and 19968+20901
Then (the select top 1 ID from (
Select 1 as Id,n ' sanctioned ' as word
UNION ALL select 2,n ' 阝 '
UNION ALL select 3,n ' horse '
UNION ALL select 4,n ' Wind '
UNION ALL select 5,n ' Dragon '
UNION ALL select 6,n ' Qi '
UNION ALL select 7,n ' Turtle '
UNION ALL select 8,n ' teeth '
UNION ALL select 9,n ' poison '
UNION ALL select 10,n ' ageing '
UNION ALL select 11,n ' niches '
UNION ALL select 12,n ' 龂 '
UNION ALL select 13,n ' Nightmares '
UNION ALL select 14,n ' Gum '
UNION ALL select 15,n ' looser '
UNION ALL select 16,n ' Dragon '
UNION ALL select 17,n ' Yue '
UNION ALL select 18,n ' 龎 '
UNION ALL select 19,n ' Bond '
UNION ALL select 20,n ' Qiuyan '
UNION ALL select 21,n ' 龡 '
UNION ALL select 22,n ' Kotewall '
UNION ALL select 23,n ' 龝 '
UNION ALL select 24,n ' 齹 '
UNION ALL select 25,n ' 龣 '
UNION ALL select 26,n ' calls '
UNION ALL select 27,n ' 齈 '
UNION ALL select 28,n ' 龞 '
UNION ALL select 29,n ' 麷 '
UNION ALL select 30,n ' 鸞 '
UNION ALL select 31,n ' 麣 '
UNION ALL select 32,n ' 龖 '
UNION ALL select 33,n ' 龗 '
UNION ALL select 35,n ' 齾 '
UNION ALL select 36,n ' Snuffle '
UNION ALL select 39,n ' 靐 '
UNION ALL select 64,n ' Tantrayana '
) T
where Word>[email protected] collate CHINESE_PRC_STROKE_CS_AS_KS_WS
Order BY ID ASC) ELSE 0 end)
Set @str =right (@str, Len (@str)-1)
End
Return @n
End

--Example of a function call:
Select DBO.FUN_GETBH (' People's Republic of China '), DBO.FUN_GETBH (' Chinese People's Republic ')
 
Execution Result: The total number of strokes is 39 and 46 respectively, simple and traditional is OK.

Of course, you can also put the above "UNION all" inside the Chinese character and the stroke of the fixed table, in the Chinese character
Lie Jian clustered INDEX, the column collation is set to:
Chinese_prc_stroke_cs_as_ks_ws
This is faster. If you use the BIG5 code of the operating system, you have to generate additional characters, the same way.
But one thing to remember is that these characters are chosen by SQL statements, not by hand, and not
It is from the dictionary, because the Xinhua dictionary is different from the Unicode character set, the results of the search dictionary will be incorrect
Indeed

  
To get the initials of Chinese characters by using the character of sorting rules

With the same total number of strokes, we can also write a function to find the first letter of Chinese pinyin. As follows:

Create function fun_getpy (@str nvarchar (4000))
Returns nvarchar (4000)
As
Begin
DECLARE @word nchar (1), @PY nvarchar (4000)
Set @PY = ' '
While Len (@str) >0
Begin
Set @word =left (@str, 1)
--If non-kanji characters, return the original character
Set @[email protected]+ (case when Unicode (@word) between 19968 and 19968+20901
Then (the select top 1 PY from (
Select ' A ' as Py,n ' 驁 ' as Word
UNION ALL SELECT ' B ', N ' book '
UNION ALL SELECT ' C ', N ' wrong '
UNION ALL SELECT ' D ', N ' 鵽 '
UNION ALL SELECT ' E ', N ' 樲 '
UNION ALL SELECT ' F ', N ' 鰒 '
UNION ALL SELECT ' G ', N ' hiker '
UNION ALL SELECT ' H ', N ' inceѕt '
UNION ALL SELECT ' J ', N ' 攈 '
UNION ALL SELECT ' K ', N ' 穒 '
UNION ALL SELECT ' L ', N ' 鱳 '
UNION ALL SELECT ' M ', N ' temperature '
UNION ALL SELECT ' n ', n ' 桛 '
UNION ALL SELECT ' O ', N ' 漚 '
UNION ALL SELECT ' P ', N ' exposure '
UNION ALL SELECT ' Q ', N ' 囕 '
UNION ALL SELECT ' R ', N ' 鶸 '
UNION ALL SELECT ' S ', N ' 蜶 '
UNION ALL SELECT ' T ', N ' 籜 '
UNION ALL SELECT ' W ', N ' clamoring '
UNION ALL SELECT ' X ', N ' 鑂 '
UNION ALL SELECT ' Y ', N ' Wan Leng '
UNION ALL SELECT ' Z ', N '
) T
where Word>[email protected] collate CHINESE_PRC_CS_AS_KS_WS
ORDER BY PY ASC) Else @word end)
Set @str =right (@str, Len (@str)-1)
End
Return @PY
End

--Example of a function call:
Select Dbo.fun_getpy (' People's Republic of China '), dbo.fun_getpy (' Chinese People's Republic ')
The results are: ZHRMGHG

If you are interested, you can also use the same method, expand to get the full spelling of the function of Chinese characters, and even get the full spell read
Tone tones, but the whole puzzle is mostly sorted. Get all the best is to use the comparison table, more than 20,000 Chinese characters search speed quickly, with the control
Tables can also make full use of the table's indexes.
There are many other clever uses of collations, which are no longer described in detail here. Welcome all to discuss together.

Using collation features to calculate Chinese character strokes and obtain pinyin initials

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.