Summary of the turn Oracle character set issues

Source: Internet
Author: User
Tags number sign

Summary of Oracle character set issuesCategory: Oracle2006-06-04 13:48 1298 people read comments (3) favorite reports Oracle Database Sqlcharacter Storage Insert Vstonet Source: Chrysanthemum Forum often see some friends ask questions about Oracle character set, I would like to introduce an iterative way. First iteration: Mastering the basic concepts of character sets.
Some friends may think that this is superfluous, but in fact it is because of the basic concept of unclear, it leads to a lot of questions and questions.
The first is the concept of a character set.
We know that computers were originally used for scientific calculations (so called "Computers"), but as technology developed, other applications were needed for processing. This requires that the computer can not only deal with numerical values, but also processing such as text, special symbols and other information, and the computer itself can directly deal with only numerical information, so it is required to encode the text, symbolic information, the original character set is we are very familiar with the ASCII, It is 7 bits to represent 128 characters, and later with the needs of different countries, organizations, there appear a lot of character sets, such as the ISO8859 series of Western European characters of the character set, representing the character set of gb2312-80, GBK, and so on.
The essence of a character set is to assign different numerical codes to a specific set of symbols, so that the computer can be processed.
The conversion between character sets. A character set is a problem, such as a character that is encoded as a number in one character set, and another in another, such as I create two character sets Demo_charset1 and Demo_charset2, in Demo_ Charset1, I have specified three symbols encoded as: A (0001), B (0010),? (1111) While in Demo_charset2, I have also specified three symbols encoded as: A (1001), C (1011),? (1111), when I received a task, to write a program, responsible for the conversion between Demo_charset1 and Demo_charset2. Because you know the encoding rules for two character sets, for 0001 in Demo_charset1, when converting to Demo_charset2, you want to change its encoding to 1001, and for Demo_charset1 in 1111, convert to Demo_charset2, The value is the same, but for Demo_charset1 0010, the corresponding character is B, but there is no corresponding character in Demo_charset2, so theoretically can not be converted, for all such a situation can not be converted, We can convert them uniformly to a special character in the target character set (called the "substitution character"), for example, where we can put the? As the replacement character, so B is converted to? , there is a loss of information, the same reason, when the Demo_charset2 C character is converted to Demo_charset1, information loss will also occur.
Therefore, in the character set conversion process, if a character in the source character set is not defined in the target character set, information loss will occur.
The selection of the database character set.
One of the issues we need to consider when creating a database is the choice of what character set is selected with the national character set (specified by the CHARACTER set in CREATE DATABASE and the Nation CHARACTER set clause). Considering this problem, we have to know what data is stored in the database, if only need to store English information, then choose Us7ascii as a character set, but if you want to store Chinese, then we need to choose a character set that supports Chinese (such as ZHS16GBK) If you need to store multi-lingual text, you should choose UTF8.
The determination of the database character set, in fact, shows that the database can handle the set of characters and encoding method, because the character set selected after the changes will have a lot of restrictions, so when the database is created must be considered clearly and then select.
And many of our friends when creating a database, do not consider clearly, often choose a default character set, such as WE8ISO8859P1 or US7ASCII, and these two character sets are not Chinese character coding, so using this character set to store Chinese character information is in principle wrong. Although it may seem normal to use this character set in some cases, it will cause a series of problems in the use and maintenance of the database, which we will delve into later in the iterative process.
The character set of the client.
With some Oracle experience friends, most will know to set up the client through Nls_lang, Nls_lang consists of the following parts:nls_lang=<language>_<territory>.< Clients Characterset&gt, where the third part of <clients characterset> is intended to indicate the character set used by the client operating system by default. Therefore, according to the normal usage, Nls_lang should be configured according to the actual situation of the client machine, especially for the character set, so that Oracle can realize the automatic conversion of the database character set and the client character set (if the conversion is necessary, of course).
Summarize the focus of the first iteration:
Character set: Encodes a specific set of symbols into numeric values that the computer can handle;
Conversion between character sets: for symbols that exist in both the source character set and the target character set, theoretical conversions will not result in information loss, whereas in the case of symbols that exist in the source character set and do not exist in the target character set, the conversion will result in the loss of information;
Database Character Set: Select the character set that can contain all the information symbols that will be stored;
Client Character Set setting: Indicates the character set used by the client operating system by default. Second iteration: Deepen understanding of basic concepts through examples
Below I will cite Netizen tellin in itpub published "CHARACTER Set Research and questions" post, the friend in the post listed his related experiments, and put forward some questions on the experimental results, I will analyze his experimental results and answer his questions.
The results of the experiment
Quote
--------------------------------------------------------------------------------
Originally released by Tellin
Set the client character set to Us7ascii
D:/>set Nls_lang=american_america. Us7ascii
View the server character set as Us7ascii
Sql> SELECT * from Nls_database_parameters;
PARAMETER VALUE
------------------------------ ----------------------------------------
Nls_characterset Us7ascii set up a test table
Sql> CREATE TABLE TEST (R1 VARCHAR2 (10)); Table created. Inserting data
Sql> INSERT into TEST VALUES (' northeast '); 1 row created. Sql> SELECT * from TEST; R1
----------
Tohoku Sql> EXIT
--------------------------------------------------------------------------------This part of the experimental data access and display are correct, as if there is no problem, But in fact there is a great hidden danger.
First of all, it is inappropriate to set the character of the database to Us7ascii if it is to be stored in the database. The Us7ascii character set defines only 128 symbols and does not support kanji. In addition, due to the ability to input Chinese in sql*plus, the operating system default should be in Chinese, but the character set in Nls_lang is Us7ascii, it is obviously not correct, it does not reflect the actual situation of the client.
But the actual display is correct, mainly because the Oracle check database and the client's character set is the same, then the data between the client and the database in the access process will not occur any conversion. Specifically, in the client input "Northeast", "East" of the Chinese character encoding is 182 (10110110), 171 (10101011), "North" Chinese characters are encoded 177 (10110001), 177 (10110001), they will not make any changes in the database, But this actually causes the character set of the database identity to be inconsistent with what is actually deposited, and in a sense it is an inconsistency and a mistake. In the process of select, Oracle also checks that the discovery database is the same as the client's character set, so it also transfers the contents of the deposit to the client intact, and the client operating system recognizes that it is Chinese character coding so that it is displayed correctly.
In this example, the database and client settings are problematic, but it seems to have a "negative negative positive" effect, from the perspective of application to look like no problem. But there are great pitfalls, such as the use of string functions such as length or substr, which can result in unexpected results. Also, if you encounter import/export (Import/export), you will experience more trouble. Some friends have done a lot of testing in this area, such as Eygle study "the source database character set is us7ascii, export file character set to Us7ascii or ZHS16GBK, the target database character set is ZHS16GBK", he concluded that " If it is in the Oracle92, we found that in this case, regardless of how the processing, the export file will not be imported into the Oracle9i database "," for this case, we can use the Oracle8i Export tool, the export character set as US7ASCII, After exporting, modify the second to third character and modify 0001 to 0354 so that the data for the US7ASCII character set can be imported correctly into the ZHS16GBK database. " I think it might be more appropriate to understand these conclusions: since the ZHS16GBK character set is Us7ascii super, this kind of conversion should be fine if normal operation, but the essence of the problem is that we let the US7ASCII database that should only store English characters, The non-conventional storage of Chinese information, then in the conversion process error or trouble there is no surprise, no trouble is a bit strange.
Therefore, to avoid this situation, it is necessary to select the appropriate character set when setting up the database, so that the label (the character set of the database) does not match the actual (the information actually stored in the database).
Experimental results Analysis Two quote:
--------------------------------------------------------------------------------
[Change the client character set to ZHS16GBK
D:/>set Nls_lang=american_america. Zhs16gbkd:/>sqlplus "/As SYSDBA" does not display data sql> SELECT * from TEST; R1
--------------------
6+11 question 1:ZHS16GBK is a superset of us7ascii, why in the ZHS16GBK environment can not display properly
--------------------------------------------------------------------------------This is primarily because the Oracle check discovers that the character set of the database is different from the client configuration character set , which transforms the character set for the data. The actual data stored in the database is 182 (10110110), 171 (10101011), 177 (10110001), 177 (10110001), because the database character set is Us7ascii, which is a 7bit character set. stored in 8bit bytes, Oracle ignores the highest bit of each byte, then 182 (10110110) becomes 54 (0110110), and in ZHS16GBK represents the number sign "6" (of course, in other character sets is also "6"). The same process occurs in the other 3 bytes, so that the "Northeast" becomes "6+11".
Three quote of experimental results:
--------------------------------------------------------------------------------
Originally released by Tellin
Inserting data with ZHS16GBK
Sql> INSERT into TEST VALUES (' northeast '); 1 row created. Sql> SELECT * from TEST; R1
--------------------
6+11
?? Sql> EXIT--------------------------------------------------------------------------------
When the client character set to ZHS16GBK is inserted into the database after "Northeast", Oracle checks that the character set of the database setting is US7ASCII inconsistent with the client and needs to be converted, but the word "Northeast" in the character set ZHS16GBK does not have a corresponding character in Us7ascii , so Oracle inserts the database with a uniform "substitution character", where is the "? ", encoded at 63 (00111111), the input information is actually lost, no matter how the character set changes (such as the experimental results referenced below), the second row select out of the results are two"? "Number (note is 2, not 4). Quote
--------------------------------------------------------------------------------changing the client character set to Us7ascii
D:/>set Nls_lang=american_america. Us7asciid:/>sqlplus "/As SYSDBA" cannot display a character set inserted with ZHS16GBK, but can display a character set inserted with Us7ascii
Sql> SELECT * from TEST; R1
----------
Northeast
??
Change the server character set to ZHS16GBK
sql> Update props$ set value$= ' ZHS16GBK ' WHERE name= ' nls_characterset '; 1 row updated. sql> COMMIT; Change the client character set to ZHS16GBK
D:/>set Nls_lang=american_america. Zhs16gbkd:/>sqlplus "/As SYSDBA" can display the previously us7ascii character set, but cannot display data inserted with ZHS16GBK, indicating that data inserted with ZHS16GBK is garbled. Sql> SELECT * from TEST; R1
--------------------
Northeast
?? --------------------------------------------------------------------------------
It should be noted that through "update props$ set value$= ' ZHS16GBK ' WHERE name= ' nls_characterset ';" To modify the database character set is a non-conventional approach, it is likely to cause problems, here is only the original text quoted the experimental results of netizens.
Experimental results Analysis Four quote:
--------------------------------------------------------------------------------sql> INSERT into TEST VALUES (' Northeast '); 1 row created. Sql> SELECT * from TEST; R1
--------------------
Northeast
??
Tohoku Sql> EXIT--------------------------------------------------------------------------------
Because the character set of the database and the client is ZHS16GBK at this time, the character set conversion does not occur, the first row and the third row of data are displayed correctly, and the second row is displayed because the stored data is 63 (00111111). No. Quote
--------------------------------------------------------------------------------changing the client character set to Us7asciid:/>set NLS _lang=american_america. Us7asciid:/>sqlplus "/As SYSDBA" cannot display data sql> SELECT * from TEST; R1
----------
??
??
?? Question 2: The first row of data is inserted in the US7ASCII environment, why not display properly? --------------------------------------------------------------------------------
After changing the client character set to Us7ascii, Select,oracle checks that the character set of the database setting is ZHS16GBK, the data needs character set conversion, and the first and third lines of the Chinese character "East" and "North" have no corresponding characters in the client character set Us7ascii. , so convert to "replace character" ("? "), and the second row of data stored in the database would have been two"? "Number, so although the three rows shown on the client are two"? "Number, but the content stored in the database is different. Experimental results Analysis Five quote:
--------------------------------------------------------------------------------
Sql> INSERT into TEST VALUES (' northeast '); 1 row created. Sql> EXIT
Change the client character set to ZHS16GBK
D:/>set Nls_lang=american_america. Zhs16gbkd:/>sqlplus "/As SYSDBA" cannot display a character set inserted with Us7ascii, but can display a character set inserted with ZHS16GBK
Sql> SELECT * from TEST; R1
--------------------
Northeast
??
Northeast
6+11sql>
Question 3:us7ascii is a subset of ZHS16GBK, why is the data inserted in the US7ASCII environment not displayed? [/b]
--------------------------------------------------------------------------------
When the client charset is set to Us7ascii, the "Northeast" is inserted into the ZHS16GBK database, which requires character conversion, and the ZHS16GBK encoding for "Tohoku" is 182 (10110110), 171 (10101011) and 177 (10110001), 177 (10110001), because US7ASCII is 7bit encoded, Oracle treats the two kanji as four characters and ignores the highest bits of each byte, Thus the encoding of the database becomes 54 (00110110), 43 (00101011) and 49 (00110001), 49 (00110001), or "6+11", and the original information is changed. At this point, set the client character set to ZHS16GBK and select, the information in the database does not need to change to the client, the first to third line because the information is not changed to display the "Northeast", and the second to fourth line because the data inserted when the information changes, so the original information can not be displayed.
Analyze so much of the content, but in fact it is very simple to sum up
The analysis of so much content, but in fact summed up is very simple, in order to reduce errors and problems in the character set, need to adhere to two basic principles:
On the database side: Select the desired character set (specified by the CHARACTER set in CREATE DATABASE and the National CHARACTER set clause);
On client: Sets the character set that is actually used by the operating system (through the environment variable Nls_lang setting).

Summary of the turn Oracle character set issues

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.