Contents
- Unicode and encoding schemes
- Storing Unicode data
- Storing non-Unicode data
- Comparison of Unicode and non-Unicode storage and performance
- Handling dates and times across locales
- Collation considerations
Exploring Multilingual Data Processing and Integration Technology
Released on: July 15, March 7, 2005
Author:Platform-based micro-indexing and Indexing
On this page
- Preface
- Understanding Unicode and multilingual data processing issues
- SQL Server's multilingual processing capabilities
- Multilingual support in data processing tools
- Front-end application development for multilingual data
- Conclusion
Preface
As enterprises globalize and become multinational, their data processing and program development must address how to coordinate across regions, how to present several languages from the same database, and how to exchange and query multilingual data. These are major information-processing issues for enterprises based in Taiwan (China) as they enter the international market. This article describes how SQL Server 2000 handles multilingual data, including the use of collations, and covers the development of front-end applications. Step by step, it shows how to accept data from multiple language sources and store and display it correctly.
Understanding Unicode and multilingual data processing issues
The Unicode Consortium assigns a unique code point to each character, covering the characters of the world's writing systems. Unicode therefore provides a unique identifier for each character, regardless of platform, program, or language. The Unicode standard has been adopted by industry leaders such as Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, and Unisys. It is supported in database products such as SQL Server, IBM DB2, Oracle, and Sybase, and on platforms and tools such as Microsoft Windows CE, NT, 2000, and XP, Java, and Visual Studio. In addition, Unicode is the principal way of implementing ISO/IEC 10646. The emergence of the Unicode standard, and the availability of tools that support it, is one of the most important recent trends in software worldwide.
Unicode and encoding schemes
Unicode does not specify how characters are represented in memory, in databases, or on web pages. Each character must be represented through an encoding scheme. Three encoding schemes are in main use at present: UCS-2, UTF-16, and UTF-8.
UCS-2
Under this scheme, every Unicode character is stored in two bytes, which provides a total of 65,536 (2^8 * 2^8) code points. UCS-2 is the main encoding used by Microsoft Windows NT 4.0, SQL Server 7.0, and SQL Server 2000. For example, one Traditional Chinese character is represented in two bytes as U+5F37. Characters from other scripts are represented in UCS-2 as follows:
A Chinese character: U+7F51
A Japanese kana: U+3048
A Korean Hangul syllable: U+CC45
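As a rough illustration of the two-byte representation described above, the following Python sketch prints the byte form of the code points listed. Python has no codec named "ucs-2", so the sketch assumes that UTF-16 (big-endian) output is equivalent for characters in the 65,536-code-point range; the helper name ucs2_bytes is hypothetical.

```python
# Sketch: view code points from the text as two-byte UCS-2 values.
# Assumption: for code points up to U+FFFF, UTF-16 bytes equal UCS-2 bytes.
CODE_POINTS = [0x5F37, 0x7F51, 0x3048, 0xCC45]


def ucs2_bytes(code_point: int) -> bytes:
    """Return the big-endian two-byte form of a code point up to U+FFFF."""
    if code_point > 0xFFFF:
        raise ValueError("UCS-2 cannot represent code points above U+FFFF")
    return chr(code_point).encode("utf-16-be")


for cp in CODE_POINTS:
    raw = ucs2_bytes(cp)
    print(f"U+{cp:04X} -> {raw.hex()} ({len(raw)} bytes)")
```

Each listed code point maps directly to its own two bytes, which is why the code point notation and the stored value look the same.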
UTF-16
UTF-16 stores most characters in two bytes and some characters in four bytes. It is the main storage scheme used in Windows 2000. All characters in UCS-2 are a subset of UTF-16. Some additional characters, such as rare Chinese characters that appear mostly in classical literature, or characters from historical scripts, are represented as surrogate pairs, which exist as multi-byte structures within UTF-16. In this way an additional 1,048,576 characters can be represented, using the range U+D800 to U+DFFF (1,024 high surrogates and 1,024 low surrogates). SQL Server 2000 uses UCS-2 and stores a surrogate pair as two undefined Unicode characters. When the two are combined as a pair, they define the supplementary character. As a result, surrogate characters stored in SQL Server 2000 are not lost or corrupted.
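The surrogate mechanism can be sketched in a few lines of Python. This is an illustration of the standard UTF-16 arithmetic, not of SQL Server internals; the function name to_surrogate_pair and the sample code point U+20000 are chosen here for illustration only.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x20000)   # sample supplementary ideograph
encoded = chr(0x20000).encode("utf-16-be")
print(f"{high:04X} {low:04X}", encoded.hex())
print(1024 * 1024)                       # number of pairs: high x low surrogates
```

The 1,024 high surrogates combined with the 1,024 low surrogates give the 1,048,576 additional characters mentioned in the text.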
UTF-8
The third encoding scheme is UTF-8, which stores Unicode data in a variable length of one to four bytes. Many database products, such as those from Oracle and Sybase, use this encoding, and SQL Server 2000 also uses UTF-8 for XML data. Because UTF-8 characters do not have a fixed length, the following issues arise when such data is processed:
• COM components work with UCS-2/UTF-16, so UTF-8 data must be converted to UCS-2/UTF-16 before it is passed to a COM component.
• The Windows NT/2000 kernel uses UCS-2/UTF-16, so data stored as UTF-8 requires an extra conversion.
• Because UTF-8 is a variable-length format, sorting, comparison, and other string operations are slower than with UCS-2.
• Because UTF-8 is a variable-length format, additional storage space and memory are often required.
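The variable length of UTF-8 noted in the list above can be seen by encoding a few sample characters. The following Python sketch is illustrative only; the sample characters and the helper encoded_sizes are assumptions made for this example.

```python
# Sample characters: ASCII, accented Latin, CJK ideograph, supplementary ideograph.
SAMPLES = ["A", "\u00e9", "\u5f37", "\U00020000"]


def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in SAMPLES:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

A CJK ideograph takes three bytes in UTF-8 but two in UTF-16, which is one reason UTF-8 can need more space for East Asian text.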
SQL Server's multilingual processing capabilities
Storing Unicode data
When handling multilingual data, SQL Server 2000 follows the SQL-92 specification and uses the N prefix to denote national character data types. These types use two bytes per character and are not tied to the code page of a particular collation, so they can store multiple languages at the same time. The following data types are available for Unicode character data:
nchar: fixed-length character data, up to 4,000 characters
nvarchar: variable-length character data, up to 4,000 characters
ntext: variable-length character data, up to 2^30 - 1 (1,073,741,823) characters
In SQL Server 2000, the DATALENGTH function returns the number of bytes in a string, and the LEN function returns the number of characters. When working with Unicode data in SQL Server, take care to prefix string literals with N to mark them as Unicode. For example, when inserting values into nchar/nvarchar/ntext columns, write the statement as follows:
INSERT INTO my_employees VALUES (N'chinese', N'chinese')
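The difference between counting characters and counting bytes can be illustrated outside the database. The Python sketch below is only an analogy for LEN and DATALENGTH on an nvarchar value: it assumes two-byte storage (encoded here as UTF-16 little-endian), and the helper names char_count and byte_count are hypothetical. It does not reproduce details such as how LEN treats trailing spaces.

```python
def char_count(value: str) -> int:
    """Rough analogue of LEN: the number of characters."""
    return len(value)


def byte_count(value: str) -> int:
    """Rough analogue of DATALENGTH on nvarchar: two bytes per character."""
    return len(value.encode("utf-16-le"))


name = "\u8cc7\u6599\u5eab"   # a three-character Chinese string
print(char_count(name), byte_count(name))
print(char_count("abc"), byte_count("abc"))
```

Both strings have three characters and occupy six bytes, since every character in a Unicode column takes two bytes regardless of script.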
Storing non-Unicode data
When SQL Server 2000 stores non-Unicode characters, the space each character occupies depends on the code page, which is determined by the collation of the column. Each character takes one or two bytes. For example, East Asian languages generally use two bytes per character, while letters of the English alphabet use one byte. The following data types are available for storing non-Unicode strings:
char: fixed-length character data, up to 8,000 characters
varchar: variable-length character data, up to 8,000 characters
text: variable-length character data, up to 2^31 - 1 (2,147,483,647) characters
When Unicode data is inserted into a non-Unicode column, SQL Server uses the Windows API functions WideCharToMultiByte and MultiByteToWideChar to convert the characters according to the code page of the column's collation. When a character cannot be represented in that code page, a question mark (?) is substituted, which indicates that data has been lost. If unexpected characters or question marks appear in your data, the data was converted from Unicode to non-Unicode at some point and characters were lost in the conversion. Each collation also determines the code page it supports. For example, code page 950 supports Traditional Chinese characters. Some collations have code page 0, which means they support Unicode only, such as those for languages of northern India. The following table lists other related code pages:
Language | Code page
Simplified Chinese | 936
Traditional Chinese | 950
Japanese | 932
Korean | 949
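The question-mark substitution described above can be reproduced with Python's built-in code page codecs. This is an analogy for the conversion SQL Server performs, not a call into WideCharToMultiByte; the dictionary CODE_PAGES and the helper to_code_page are names assumed for this sketch.

```python
# Python codec names for the code pages in the table above.
CODE_PAGES = {
    "Simplified Chinese": "cp936",
    "Traditional Chinese": "cp950",
    "Japanese": "cp932",
    "Korean": "cp949",
}


def to_code_page(text: str, codec: str) -> bytes:
    """Convert Unicode text to a code page, replacing unsupported characters with '?'."""
    return text.encode(codec, errors="replace")


hangul = "\ud55c"   # a Korean Hangul syllable
print(to_code_page(hangul, CODE_PAGES["Korean"]))               # two bytes, no loss
print(to_code_page(hangul, CODE_PAGES["Traditional Chinese"]))  # replaced by b'?'
```

Once the character has become a question mark, converting back to Unicode cannot recover it, which is the data loss the text warns about.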
Comparison of Unicode and non-Unicode storage and performance
Unicode uses 2 bytes to store each character, and therefore has the following characteristics:
• Languages that do not need a double-byte character set (DBCS) still use 2 bytes per character, so somewhat more space is required.
• ODBC (version 3.6 or earlier) and legacy APIs cannot recognize Unicode.
• For Asian languages, Unicode's fixed two-byte encoding is processed more efficiently than a double-byte code page, mainly because data stored in a code page mixes single-byte and double-byte characters.
With non-Unicode encoding, the code page determines whether a given character takes 1 or 2 bytes:
• When non-Unicode types are used for Asian languages such as Chinese and Japanese, characters are stored using a double-byte character set (DBCS).
• There is almost no difference between non-Unicode and Unicode data in row-level access operations.
• Non-Unicode sorting is 30% faster than Unicode sorting.
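The storage trade-off in the two lists above can be checked with a small Python sketch. It assumes code page 950 for the non-Unicode case and two-byte storage (UTF-16 little-endian) for the Unicode case; the helper storage_bytes and the sample strings are illustrative.

```python
def storage_bytes(text: str, code_page: str = "cp950") -> dict:
    """Compare byte counts for a code-page (DBCS) encoding and two-byte Unicode."""
    return {
        "code page": len(text.encode(code_page)),
        "unicode": len(text.encode("utf-16-le")),
    }


english = "SQL"
chinese = "\u8cc7\u6599"          # two Traditional Chinese characters
mixed = english + chinese

for sample in (english, chinese, mixed):
    print(repr(sample), storage_bytes(sample))
```

English text doubles in size under Unicode, Chinese text stays the same size, and the mixed string shows why code-page data has characters of varying width.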
Handling dates and times across locales
SQL Server stores dates and times in two data types, datetime and smalldatetime, and the CONVERT function converts them into the output formats used in different locales. The datetime type supports dates in the Gregorian calendar from January 1, 1753 to December 31, 9999, with a resolution of 1/300 second, and is stored internally as two 4-byte integers. The smalldatetime type supports a smaller range, from January 1, 1900 to June 6, 2079, with a resolution of one minute: values of 29.999 seconds or more are rounded up to the next minute. It is stored internally as two 2-byte integers.
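The two-integer storage can be sketched in Python. The sketch assumes the commonly documented layout: a day count from the base date January 1, 1900, plus a count of 1/300-second ticks (datetime) or minutes (smalldatetime) since midnight. The function names, the rounding details, and the byte order used for packing are assumptions for illustration, and rollover past midnight is not handled.

```python
import struct
from datetime import date, datetime

BASE_DATE = date(1900, 1, 1)   # assumed base date for both types


def datetime_parts(value: datetime) -> tuple:
    """Return (days since base date, 1/300-second ticks since midnight)."""
    days = (value.date() - BASE_DATE).days
    micros = ((value.hour * 60 + value.minute) * 60 + value.second) * 1_000_000
    micros += value.microsecond
    ticks = (micros * 300 + 500_000) // 1_000_000   # round to the nearest tick
    return days, ticks


def smalldatetime_parts(value: datetime) -> tuple:
    """Return (days since base date, minutes since midnight), rounded at 29.999 s."""
    days = (value.date() - BASE_DATE).days
    minutes = value.hour * 60 + value.minute
    if value.second * 1_000_000 + value.microsecond >= 29_999_000:
        minutes += 1
    return days, minutes


packed_dt = struct.pack("<ii", *datetime_parts(datetime(2000, 1, 1, 12, 0, 1)))
packed_sdt = struct.pack("<HH", *smalldatetime_parts(datetime(2079, 6, 6, 23, 59)))
print(len(packed_dt), len(packed_sdt))
```

The upper limit of smalldatetime follows from the layout: June 6, 2079 is day 65,535, the largest value an unsigned 2-byte integer can hold.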
To convert a date and time into the representation used in a particular country, use the CONVERT() function as follows:
SELECT CONVERT(data_type, time, format)
For example, format 111 is the Japanese format, yyyy/mm/dd. CONVERT also supports Hijri (Islamic calendar) formats, but note that the output string must use a Unicode data type. Otherwise characters are lost and appear as ???, as shown in the following figure.
Collation considerations
Earlier versions such as SQL Server 6.5 do not support Unicode. The characters that can be processed are determined by the code page, and each SQL Server 6.5 server corresponds to a single code page. For example, selecting CP950 allows Traditional Chinese characters to be handled. In SQL Server 7.0, each server supports one Unicode collation and one non-Unicode collation, and the non-Unicode collation consists of a code page and a sort order. In SQL Server 2000, the code page and sort order are combined into a single unit, the collation. You can select the collation when the server is installed, when a database is created, or for specific columns when a table is created, and you can also specify it for string comparison in a T-SQL expression. A collation covers properties such as sort order, case sensitivity, accent sensitivity, and full-width/half-width sensitivity. For English letters, it determines alphabetical order and whether upper and lower case are distinguished. For non-English languages, it determines how different kinds of data are sorted, for example by stroke count, phonetic notation, full-width or half-width form, and kana type, which matters in particular when indexes are created.
SQL Server 2000 has two kinds of collations: Windows collations and SQL collations. Windows collations define ordering rules based on the Windows locale. SQL Server replicates the Windows sorting rules and applies them to the corresponding Windows collation, so string comparison under a Windows collation in SQL Server is compatible with the same locale in the Windows 2000 version of the rules. However, because versions after Windows 2000, such as Windows XP and Windows Server 2003, use different sorting tables, Windows collations in SQL Server on those operating systems may sort differently from the host operating system. SQL collations are unrelated to the Windows locale. They are provided for compatibility with the sort orders of earlier versions of SQL Server.
The built-in system function COLLATIONPROPERTY returns the code page and locale ID of each collation. A code page of 0 indicates that the collation supports Unicode strings only. The following example uses the built-in function to determine the code page and locale supported by a collation.
The four levels of collation
Collations in SQL Server 2000 are very flexible and can be set at the following four levels:
• Server collation: Query the server collation with SELECT SERVERPROPERTY('Collation'). To change the server collation, you must run the utility rebuildm.exe to rebuild with the new setting. The utility is located at C:\Program Files\Microsoft SQL Server\80\Tools\Binn\rebuildm.exe. Start rebuildm.exe and select the default collation in its dialog.
• Database collation: The second level is set when the database is created. Each database can be given its own collation. If a special collation is needed, it must be selected in the options when the database is created. After the database is created, query its collation with SELECT DATABASEPROPERTYEX('dbname', 'Collation'). The collation of an existing database cannot be modified directly in Enterprise Manager, but it can be changed with a T-SQL statement such as: ALTER DATABASE Northwind COLLATE Chinese_Taiwan_Bin
• Column collation: The third level specifies different collations for individual columns when a table is created. With this flexible setting, char, varchar, and text columns in the same table can hold different languages. For example, a multinational company with a single summary table can store Simplified Chinese data in the first column, Traditional Chinese data in the second column, and Japanese data in the third column. This flexibility makes data storage more adaptable. In a multilingual database designed this way, the first column is set to Chinese_PRC_Stroke_CI_AS, the second column to Chinese_Taiwan_Stroke_CI_AS, and the third column to Japanese_CI_AS. Through Query Analyzer, you can retrieve the data in columns of different collations directly.
• Expression collation: Use the COLLATE keyword in a T-SQL statement to specify the collation for an expression, for example to make a comparison that was originally case-insensitive distinguish case. You can also sort text by stroke count or by phonetic notation. If the original data uses Chinese_Taiwan_Stroke_CI_AS, the data is sorted by stroke count. If the ORDER BY clause is changed with COLLATE to sort by phonetic notation in ascending order, the result is noticeably different, because the text is sorted by the phonetic key.
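The effect of choosing a different collation for a comparison or sort can be approximated in Python by changing the sort key. The sketch below is only an analogy: code point order stands in for a binary (_BIN) collation and case folding stands in for a case-insensitive (_CI) collation. Stroke and phonetic ordering are not available in the Python standard library and are not shown; the word list is made up for illustration.

```python
WORDS = ["banana", "Apple", "cherry", "apple", "Banana"]

# Binary-style ordering: compare raw code points, so uppercase sorts first.
binary_order = sorted(WORDS)

# Case-insensitive-style ordering: compare case-folded keys.
ci_order = sorted(WORDS, key=str.casefold)

print(binary_order)
print(ci_order)

# Equality also depends on the rule in force.
print("Apple" == "apple", "Apple".casefold() == "apple".casefold())
```

The same data yields two different orders and two different answers to the equality test, which is what overriding the collation in an expression does in T-SQL.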
Multilingual support in data processing tools
• Data Transformation Services (DTS): When DTS imports character data, using Unicode on the server side ensures that incoming data is received and stored completely. When the source or target is not SQL Server, conversion depends on the behavior of the client-side ODBC driver or OLE DB provider. For example, if you import French OEM data into a SQL Server database that uses code page 1252, and the source machine does not convert correctly to code page 1252, data may be lost. With Unicode data, a DTS import or export may involve a provider that cannot display Unicode data correctly, but no data is lost in transit. In addition, the Copy SQL Server Objects task can handle copying between different collations and writes the data according to the original type.
• bcp utility and BULK INSERT command: When working with non-Unicode data, you can use the SQL Server bcp utility to prevent the data from being corrupted. Export the data with bcp and specify the -w or -N command-line flag, so that the data is converted to Unicode during processing. However, this method still loses any characters in the source data that do not exist in the target code page.
bcp db.owner.xxx out c:\xxx.txt -w -T -Ssvrname
bcp db.owner.xxx out c:\xxx.txt -N -T -Ssvrname
Flag | Meaning | Notes
-C xxx | Code page | xxx specifies that the data is converted to ANSI, OEM, RAW (written directly without conversion), or a specific code page.
-N | Unicode native format | All non-character data uses native data types, and character data is converted to the Unicode character format.
-w | Unicode character format | All columns are converted to the Unicode character data format.
With the T-SQL BULK INSERT command, you can move multilingual strings by specifying CODEPAGE and DATAFILETYPE. The relevant part of the syntax is:
BULK INSERT [['database_name'.]['owner'].]{'table_name' FROM 'data_file'}
[WITH (
[BATCHSIZE [= batch_size]]
[[,] CHECK_CONSTRAINTS]
[[,] CODEPAGE [= 'ACP' | 'OEM' | 'RAW' | 'code_page']]
[[,] DATAFILETYPE [= {'char' | 'native' | 'widechar' | 'widenative'}]]
CODEPAGE value | Description
OEM (default) | When data is imported into SQL Server, columns of char, varchar, or text type are converted from the system OEM code page to the SQL Server code page. The reverse applies when data is exported from SQL Server.
ACP | When data is imported into SQL Server, columns of char, varchar, or text type are converted from the ANSI/Windows code page (ISO 1252) to the SQL Server code page. The reverse applies when data is exported from SQL Server.
RAW | No code page conversion takes place, so this is the fastest option.
<value> | A specific code page number (for example, 850).
• Replication: When replication is used to publish data, you can specify how collations are handled for the published data. With the collation-compatible option, the data can be transferred intact when it is imported. The setting is made as follows.
Front-end application development for multilingual data
• Win32 applications: In Microsoft Visual Basic applications, strings are handled in UCS-2, so you do not need to specify an encoding conversion explicitly between these applications and SQL Server. VB.NET itself supports the presentation of multilingual data without customization.
• Web applications: For web-based applications, specify the charset in the META tag of the HTML page. For example, if the Unicode encoding on the client side is UTF-8, specify charset=UTF-8. On the server side, use the Session.CodePage property or the @CodePage directive to specify the client-side encoding. For example, CodePage=65001 specifies UTF-8. If you follow these steps, Internet Information Services (IIS) 5.0 or later correctly converts UTF-8 to UCS-2 and back, and you do not need to do anything else. For more information on this topic, see http://support.microsoft.com/?kbid=232580.
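The round trip described for web applications, UTF-8 on the wire and two-byte Unicode inside the server, can be sketched in Python. This only illustrates the conversion itself, not IIS or ASP behavior; the function names are hypothetical, and UTF-16 little-endian is assumed as the internal form.

```python
UTF8_CODE_PAGE = 65001   # Windows code page number for UTF-8, as cited in the text


def request_to_internal(body: bytes) -> bytes:
    """Convert a UTF-8 request body to the two-byte internal form."""
    return body.decode("utf-8").encode("utf-16-le")


def internal_to_response(data: bytes) -> bytes:
    """Convert the two-byte internal form back to UTF-8 for the response."""
    return data.decode("utf-16-le").encode("utf-8")


original = "multilingual: \u591a\u570b\u8a9e\u8a00".encode("utf-8")
internal = request_to_internal(original)
print(len(original), len(internal))
print(internal_to_response(internal) == original)
```

Because both encodings cover the same characters, the conversion is lossless in both directions, unlike a conversion through a single code page.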
Conclusion
The key to multilingual data processing is understanding collations, which can be set at several levels. The server-level collation is set at installation, and the master database must be rebuilt to change it. The database-level collation is set when the database is created and can be changed afterward with ALTER DATABASE. Collations can also be applied to individual columns or within SQL expressions to change how data is displayed and queried. Using Unicode strings lets you store and process multilingual data in a single database. When you use Unicode, pay attention to the storage space that each character occupies. On the application side, front-end programs should be developed with these multilingual data processing features in mind.