Code Page and sorting rules in SQL Server

Source: Internet
Author: User
Tags sql server books
The code page defines the bit mode of character and Unicode data. This mode represents a specific letter, number, or symbol (for example, 0x20 represents a space, 0x74 represents the character "T "). Some data types Use one byte for each character. Each byte can have one of the 256 different bit modes.
In a computer, characters are represented by different bit modes (ON or OFF. Each byte has 8 bits, which can have 256 different ON and OFF combinations. For programs that use one byte to store each character, assign a character to each bit mode to indicate a maximum of 256 different characters. Two bytes have 16 bits, which can have 65,536 unique ON and OFF combinations. A program that uses two bytes to indicate each character can contain a maximum of 65,536 characters.
A single-byte code page is a character definition. These characters are mapped to each of the 256 bit modes that each byte may have. The code page defines uppercase and lowercase characters, numbers, symbols, and! , @, #, %, And other special characters. Each European language (such as German and Spanish) has its own single-byte code page. Although the bitwise pattern used to represent letters from A to Z in the Latin alphabet is the same in all code pages, it is used to represent accent characters (such as "é" and "á ") bit mode is different in different code pages. If data is exchanged between computers running different code pages, all character data must be converted from the code page of the sending computer to the code page of the receiving computer. If the extended characters in the source data are not defined in the Code Page of the receiving computer, the data will be lost. If a database provides services for clients from many different countries, it is difficult to select such a code page for the database to include all the extended characters required by all client computers. In addition, non-stop conversion between code pages takes a lot of processing time.
It is not enough to store characters in many languages based on single-byte character sets. For example, some Asian languages contain thousands of characters, so each character must use dubyte. The dubyte character set is defined for these languages. However, these languages all have their own code pages, and it is also difficult to transmit data between computers running different two-byte code pages.
SQL Server 2000 supports the following code pages.
Code Page Description
1258 Vietnamese
1257 BORO
1256 Arabic
1255 Hebrew
1254 Turkish
1253 Greek
1252 Latin 1 character (ANSI)
1251 Spanish
1250 Central European language
950 traditional Chinese
949 Korean
Simplified Chinese 936
932 Japanese
874 Thai
850 multilingual (MS-DOS Latin1)
437 MS-DOS American English
To solve the problem of character conversion and interpretation when multiple code pages are supported in the network, the ISO Standardization Organization and a group called Unicode Consortium define Unicode standards. Unicode stores each character in two bytes. The Unicode standard is applicable to all major languages because the 65,536 characters are sufficient to cover common characters in all languages in the world. If Unicode is used by all computers and programs on the network, no character conversion is required. Each user and all other users see exactly the same character and will not lose any character.
On a computer running Microsoft Windows & reg; operating system, the code page used by the operating system and Windows applications is defined by the Windows region settings. The region settings are selected when the operating system is installed. Windows applications use the code page defined by Windows region settings to interpret data. Windows applications also support wide character data, that is, Unicode data.
SQL Server 2000 supports two types of character data:
Unicode data types: nchar, nvarchar, and ntext. These data types use Unicode Character notation. The code page does not apply to these data types.
Non-UNICODE character data types: Char, varchar, and text. These data types Use the character notation defined in the single-byte or double-byte code page.
For more information about how character data is stored and about code pages, Unicode, and sort order operations, see developing international software for Windows 95 and Windows NT 4.0 on the http://msdn.microsoft.com msdn & reg; page.
International data and Unicode
When only character data and code pages are used, it is difficult to store data in multiple languages in a database. It is difficult to find a code page for the database to store characters unique to all languages. It is also difficult to ensure correct conversion of special characters read and updated by different clients running various code pages. Databases that support international clients should always use Unicode data, rather than non-Unicode data types.
For example, a North American customer's database must process three major languages:
The Spanish name and address used in Mexico.
The French name and address used by Quebec.
The English Name and Address used in other regions of Canada and the United States.
When only character columns and code pages are used, be careful to ensure that the code pages installed in the database can process the characters in these three languages. When the characters in one of these languages are read by the client running the code page of the other language, You must be more careful to ensure proper conversion of characters.
With the development of the Internet, it is increasingly important to support a large number of client computers running in different regions. It is difficult to select such a code page so that the character data type contained in it can support all the characters required by users worldwide.
The simplest way to manage character data in an international database is to always use Unicode nchar, nvarchar, and ntext data types instead of the corresponding non-Unicode data types (char, varchar, and text ). If all applications that use international databases use Unicode variables instead of non-Unicode variables, character conversion is not required anywhere in the system. Each client has the same character data as all other clients.
For systems that can use a single-byte code page, the storage space required for Unicode data is twice that of non-UNICODE character data, but it eliminates the need to convert extended characters between code pages, therefore, at least partially make up for the above shortcomings. The system that uses the dubyte code page does not have this problem.
SQL Server 2000 stores all the written system directory data in columns that contain Unicode data types. The names of database objects (such as tables, views, and stored procedures) are stored in Unicode columns. In this way, only Unicode can be used to develop applications, thus avoiding all code page conversion problems.
Sorting order
The sorting order specifies the rules used by SQL Server to interpret, sort, compare, and display character data. For example, the sorting order definition "a" is less than, equal to, or greater than "B ". The sorting order defines whether the sorting rules are case sensitive, such as "M" and "m. In addition, it also defines whether the sorting rules are accent-sensitive, such as "á" and "& auml.
SQL Server 2000 uses two sorting orders for each sorting rule, one for Unicode data and the other for character code pages.
Many SQL Server sorting rules use the same code page, but the code page sorting order is different. This allows the site to be selected:
Whether to sort characters only by the numeric value in bit mode. Binary sorting is the fastest, because SQL server does not need to make any adjustments and can use a fast and simple sorting algorithm. Binary sorting is case sensitive. Because the bit mode in the code page may not be arranged according to the sequence defined by the dictionary rules of a specific language, binary sorting sometimes does not sort the characters according to the sequence expected by users using the language. From the above materials, we can see that the code page is the character set of each country and region, and the character bit representation. What we see in the database: the blog under collate chinese_prc_ci_as null to turn itself into a self-seeking blog

When creating table, we often encounter such a statement, for example, corpname nvarchar (10) Collate chinese_prc_ci_as null. What does it mean?

Take a look at the following:

Syntax

Collate <collation_name>

<Collation_name >::=

{Windows_collation_name} | {SQL _collation_name}

Parameters

Collation_name

Is the name of the sorting rule applied to expressions, column definitions, or database definitions. Collation_name can only be the specified windows_collation_name or SQL _collation_name.

Windows_collation_name

Is the name of a Windows sorting rule.

SQL _collation_name

Is the name of the SQL sorting rule.

Note

The collate clause can be specified at the following levels:

  1. Create or change a database.

 

You can use the create database or alter database statement's collate clause to specify the default database sorting rules. You can also use the SQL Server Enterprise Manager to specify sorting rules when creating a database. If no sorting rule is specified, the default sorting rule for the SQL server instance is assigned to the database.

  1. Create or modify a table column.

 

You can use the create table or alter table statement's collate clause to specify the sorting rules for each string column. You can also use the SQL Server Enterprise Manager to specify sorting rules when creating tables. If no sorting rule is specified, the default database sorting rule is assigned to the column.

You can also use the database_default option in the Collate clause to specify the default sorting rules for the columns in the temporary table to connect to the current user database (instead of tempdb.

  1. Sorting rules of projection expressions.

 

You can use the Collate clause to project character expressions to specific sorting rules. The default sorting rules of the current database will be assigned for the character nominal value and variable. A sorting rule is defined for the specified column for the column reference.

The sorting rule of the identifier depends on the level at which the identifier is defined. Assigns default sorting rules for instances for instance-level objects (such as login names and database names) identifiers. Assign default database sorting rules for the identifiers of database objects (such as tables, views, and column names. For example, for tables whose names are case-insensitive, you can create a table in a database that uses case-insensitive sorting rules, but not in a database that uses case-insensitive sorting rules.

When the connection context is related to the database, you can create variables, Goto labels, temporary stored procedures, and temporary tables, and reference them when the context is switched to another database. The variables, Goto labels, temporary stored procedures, and temporary table identifiers are in the default sorting rules of the instance.

The collate clause can only be applied to Char, varchar, text, nchar, nvarchar, and ntext data types.

The following describes the sorting rules:

What is a sorting rule? In Microsoft SQL Server 2000, the physical storage of strings is controlled by sorting rules. The sorting rule specifies the bit mode of each character and the rules used to store and compare characters. "

A sorting rule is generally identified by its name. The exception is that the name of the sorting rule is not specified for the Windows sorting rule during installation, but the sorting rule indicator is specified, then select the check box to specify binary or dictionary sorting for case-insensitive and accent-insensitive.

You can run the fn_helpcollations function to retrieve the list of all valid sorting rule names for Windows and SQL sorting rules:

Select * From: fn_helpcollations ()

SQL Server only supports the code pages supported by the basic operating system. When you perform an operation that depends on a sorting rule, the SQL Server sorting rule used by the referenced object must use the code page supported by the operating system running on the computer. These operations can include:

<! -- [If! Supportlists] --> · <! -- [Endif] --> when creating or changing a database, specify a default sorting rule for the database.

<! -- [If! Supportlists] --> · <! -- [Endif] --> when creating or changing a table, specify a default sorting rule for the column.

<! -- [If! Supportlists] --> · <! -- [Endif] --> when restoring or attaching a database, the operating system must support the default database sorting rules and any char, varchar, and text columns or parameters in the database.

Supports code page conversion for char and varchar data types, but does not support code page conversion for text data types. Data loss during code page conversion is not reported.

If the referenced object uses or the specified sorting rule uses a code page not supported by windows, SQL Server issues an error message. For more information, see the "sorting rules" section in the "SQL Server architecture" chapter of SQL Server books online.

The name of a sorting rule consists of two parts. The first half is the character set supported by this sorting rule.

For example: chinese_prc_cs_ai_ws

First half: the Unicode character set. The chinese_prc _ pointer sorts Unicode in simplified Chinese characters.

The second half of the sorting rule is the suffix meaning:

_ Bin binary sorting

_ Ci (CS) is case sensitive, CI is case insensitive, and CS is case sensitive

_ Whether AI (AS) distinguishes stress, AI does not distinguish,

_ KI (KS) indicates whether Kana is distinguished. Ki is not distinguished, and KS is distinguished.

_ Whether wi (WS) is differentiated by width WI and WS

Case Sensitive: select this option if you want to make the comparison between uppercase and lowercase letters different.

Accent differentiation: select this option if you want to treat the comparison as different from the accent and non-accent letters. If this option is selected, the comparison also treats letters with different accents as unequal.

Kana differentiation: select this option if you want to treat Katakana and katakana as different Japanese syllables.

Width differentiation: select this option if you want to make the comparison between halfwidth and fullwidth characters.

Syntax

<SQL _collation_name >::=

SQL _sortrules [_ Pref] _ cpcodepage _ <comparisonstyle>

<Comparisonstyle >::=

_ Casesensitivity_accentsensiti.pdf | _ Bin

Parameters

Sortrules

A string that identifies the alphabet or language. It is used to sort by dictionary. For example, latin1_general or polish.

Pref

Specifies that uppercase letters take precedence.

CodePage

Number that identifies the number of one to four digits of the code page used by the sorting rule. CP1 specifies the code page 1252 and the full code page number for all other code pages. For example, cp1251 specifies the code page 1251 and cp850 specifies the code page 850.

Casesensiti.pdf

The CI parameter is case insensitive and the CS parameter is case sensitive.

Accentsensiti.pdf

AI specifies no accent, and as specifies the accent.

Bin

Specify the binary sorting order.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.