Oracle Database Character Set Problem Resolution (ZZ)

I often see people asking questions about Oracle character sets, so I would like to explain the topic in an iterative way.

First iteration: Mastering the basic concepts of character sets.
Some readers may think this is superfluous, but in fact many of the questions people ask stem from a poor understanding of the basic concepts.
The first is the concept of a character set.
Electronic computers were originally used for scientific calculation (hence the name "computer"), but as technology developed they were put to other uses as well. That required computers to handle not only numeric values but also text, special symbols and other information. Since a computer can directly process only numeric data, text and symbols have to be encoded as numbers. The earliest character set, and the one we are most familiar with, is ASCII, which uses 7 bits to represent 128 characters. Later, to meet the needs of different countries and organizations, many more character sets appeared, such as the ISO8859 series for Western European characters, and GB2312-80, GBK and others for Chinese characters.
The essence of a character set is to assign distinct numeric codes to a specific set of symbols so that a computer can process them.
Conversion between character sets. With more than one character set, a problem arises: the same character may be encoded as one value in one character set and as a different value in another. Suppose I create two character sets, Demo_charset1 and Demo_charset2. In Demo_charset1 I assign codes to three symbols: A (0001), B (0010), ? (1111). In Demo_charset2 I also assign codes to three symbols: A (1001), C (1011), ? (1111). Now suppose I am asked to write a program that converts between Demo_charset1 and Demo_charset2. Because I know the encoding rules of both character sets, 0001 in Demo_charset1 must be changed to 1001 when converting to Demo_charset2, while 1111 stays unchanged. The code 0010 in Demo_charset1 stands for B, but Demo_charset2 has no such character, so in theory it cannot be converted. All such unconvertible characters can be mapped uniformly to one special character in the target character set (called a "substitution character"). Here we can choose ? as the substitution character, so B is converted to ? and information is lost. In the same way, information is lost when the character C in Demo_charset2 is converted to Demo_charset1.
Therefore, in the character set conversion process, if a character in the source character set is not defined in the target character set, information loss will occur.
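The conversion between the two demo character sets can be sketched in a few lines of Python. This is only an illustration of the idea described above: the dictionaries reproduce the codes given in the text, while the function name convert and the constant SUBSTITUTION_CHAR are made up for this example.

```python
# Toy character sets from the text: symbol -> 4-bit code.
DEMO_CHARSET1 = {"A": 0b0001, "B": 0b0010, "?": 0b1111}
DEMO_CHARSET2 = {"A": 0b1001, "C": 0b1011, "?": 0b1111}

SUBSTITUTION_CHAR = "?"  # hypothetical name for the substitution character


def convert(codes, source, target):
    """Convert a list of codes from the source charset to the target charset."""
    decode = {code: symbol for symbol, code in source.items()}
    result = []
    for code in codes:
        symbol = decode[code]
        # Symbols missing from the target become the substitution character.
        result.append(target.get(symbol, target[SUBSTITUTION_CHAR]))
    return result


# A, B, ? in Demo_charset1 converted to Demo_charset2.
converted = convert([0b0001, 0b0010, 0b1111], DEMO_CHARSET1, DEMO_CHARSET2)
print([format(code, "04b") for code in converted])  # ['1001', '1111', '1111']
```

After the conversion, B and ? share the same code, so the original B can no longer be recovered. That is the information loss the text describes.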
The selection of the database character set.
One of the things we need to decide when creating a database is which character set and national character set to use (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE). To decide, we have to know what data the database will store. If it only needs to store English, US7ASCII is enough; if it needs to store Chinese, we must choose a character set that supports Chinese (such as ZHS16GBK); if it needs to store more than one language, we should choose UTF8.
The database character set determines which characters the database can handle and how they are encoded. Because changing the character set after it has been chosen is subject to restrictions, think it through carefully when the database is created.
Many people do not think this through when creating a database and simply pick a default such as WE8ISO8859P1 or US7ASCII. Neither of these character sets has encodings for Chinese characters, so storing Chinese text in a database that uses them is wrong. In some cases such a database appears to work normally, but it causes a series of problems in use and maintenance, which we analyze in depth in the next iteration.
The character set of the client.
Most people with some Oracle experience know that the client is configured through NLS_LANG, which consists of the following parts: NLS_LANG=<language>_<territory>.<client character set>. The third part, <client character set>, is meant to state the character set that the client operating system uses by default. In normal use, NLS_LANG should therefore be set to match the actual situation of the client machine, especially the character set, so that Oracle can automatically convert between the database character set and the client character set as far as possible (when conversion is needed at all).
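To make the three parts of NLS_LANG concrete, here is a small Python sketch that splits a value into language, territory and client character set. The names NlsLang and parse_nls_lang are hypothetical, and the sketch assumes that all three parts are present in the value.

```python
from typing import NamedTuple


class NlsLang(NamedTuple):
    language: str
    territory: str
    charset: str


def parse_nls_lang(value: str) -> NlsLang:
    """Split '<language>_<territory>.<client character set>' into its parts."""
    locale_part, _, charset = value.partition(".")
    language, _, territory = locale_part.partition("_")
    return NlsLang(language, territory, charset)


setting = parse_nls_lang("AMERICAN_AMERICA.ZHS16GBK")
print(setting.charset)  # ZHS16GBK
```

Only the third field affects character set conversion; the first two control things such as message language and date formats.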
To summarize the key points of the first iteration:
Character set: assigns numeric codes that a computer can process to a particular set of symbols;
Conversion between character sets: for symbols that exist in both the source and the target character set, conversion in theory loses no information; for symbols in the source character set that do not exist in the target character set, conversion in theory loses information;
Database character set: choose a character set that contains every symbol that will be stored;
Client character set setting: states the character set that the client operating system uses by default.

Second iteration: Deepening the understanding of basic concepts through examples

Below I quote the post "CHARACTER SET research and questions" published by the user Tellin on ITPUB. In it he lists his experiments and raises some questions about the results. I will analyze his results and answer his questions.
Analysis of experimental results, part one
Quote: Originally posted by Tellin
Set the client character set to US7ASCII
D:\>set NLS_LANG=AMERICAN_AMERICA.US7ASCII
Check that the server character set is US7ASCII
SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;
PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_CHARACTERSET               US7ASCII

Create a test table
SQL> CREATE TABLE TEST (R1 VARCHAR2(10));

Table created.

Insert data
SQL> INSERT INTO TEST VALUES ('东北');

1 row created.

SQL> SELECT * FROM TEST;

R1
----------
东北

SQL> EXIT

Data is stored and displayed correctly in this part of the experiment, so there seems to be no problem, but in fact there are many hidden dangers.
First, it is inappropriate to store Chinese characters in a database whose character set is US7ASCII. The US7ASCII character set defines only 128 symbols and does not support Chinese characters. In addition, since Chinese can be entered in SQL*Plus, the client operating system evidently uses a character set that supports Chinese by default, yet the character set in NLS_LANG is US7ASCII, which is clearly wrong and does not reflect the client's actual situation.
The display is nevertheless correct, mainly because Oracle finds that the database and client character sets are the same, so no conversion takes place when data moves between the client and the database. Specifically, the client enters "东北" ("Northeast"): the character "东" (East) is encoded as 182 (10110110), 171 (10101011), and "北" (North) as 177 (10110001), 177 (10110001). These bytes are stored in the database unchanged. As a result, the character set the database declares no longer matches what it actually stores; in a sense this is an inconsistency, and an error. On SELECT, Oracle again finds that the database and client character sets are the same, so it sends the stored bytes to the client intact, and the client operating system recognizes them as Chinese character encodings and displays them correctly.
In this example both the database and the client settings are wrong, but two negatives seem to make a positive, and from the application's point of view nothing is wrong. The hidden dangers are real, however: string functions such as LENGTH or SUBSTR can return unexpected results, and import/export causes even more trouble. Some people have done a lot of testing in this area, Eygle for example. For the case where "the source database character set is US7ASCII, the export file character set is US7ASCII or ZHS16GBK, and the target database character set is ZHS16GBK", he concludes: "With Oracle92, we find that in this case, however the file is handled, the export file cannot be imported correctly into the Oracle9i database", and "in this case, we can use the Oracle8i export tool with the export character set set to US7ASCII and, after exporting, modify the second and third bytes, changing 0001 to 0354, so that the US7ASCII data can be imported correctly into the ZHS16GBK database". I think these conclusions are better understood as follows: because ZHS16GBK is a superset of US7ASCII, the conversion should be fine if done properly. The root of the problem is that a US7ASCII database, which should store only English characters, was made to store Chinese data in an irregular way. It is no surprise that errors occur during conversion; it would be stranger if they did not.
So to avoid this situation, choose the right character set when creating the database, and never let the label (the database character set) and the reality (the data actually stored) disagree.
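The kind of surprise that string functions can produce is easy to see outside the database. The Python sketch below is not Oracle code: it only shows that the four bytes sent for the two Chinese characters 东北 ("Northeast") count as two characters under GBK but as four when every byte is treated as one character. The latin-1 codec is used purely as a stand-in for "one byte equals one character".

```python
stored = "东北".encode("gbk")  # the four bytes the client sent

# Interpreted as GBK, the bytes form two characters.
as_gbk = stored.decode("gbk")

# Treated as single-byte data, the same bytes count as four characters.
as_single_byte = stored.decode("latin-1")

print(list(stored))         # [182, 171, 177, 177]
print(len(as_gbk))          # 2
print(len(as_single_byte))  # 4

# Taking "the first character" of the single-byte view cuts a Chinese
# character in half: only the byte 182 is returned.
print(list(as_single_byte[:1].encode("latin-1")))  # [182]
```

A length of 4 for a two-character string, or a substring that contains half a character, is the sort of unexpected result the text warns about.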

Analysis of experimental results, part two

Quote: Change the client character set to ZHS16GBK
D:\>set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK

D:\>sqlplus "/ as sysdba"

Unable to display the data correctly

SQL> SELECT * FROM TEST;

R1
--------------------
6+11

Question 1: ZHS16GBK is a superset of US7ASCII, so why is the data not displayed correctly in a ZHS16GBK environment?

This is mainly because Oracle finds that the database character set differs from the character set configured on the client, so it performs character set conversion on the data. The data actually stored in the database is 182 (10110110), 171 (10101011), 177 (10110001), 177 (10110001). Because the database character set is US7ASCII, a 7-bit character set stored in 8-bit bytes, Oracle ignores the highest bit of each byte. Thus 182 (10110110) becomes 54 (0110110), which is the digit "6" in ZHS16GBK (and in other character sets too). The same happens to the other 3 bytes, so "东北" ("Northeast") becomes "6+11".
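The arithmetic can be checked with a short Python sketch. It models only the rule stated above (drop the highest bit of every byte); it is not Oracle's conversion code, and the function name strip_high_bit is made up for this example.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, as a 7-bit character set would see it."""
    return bytes(b & 0x7F for b in data)


stored = bytes([182, 171, 177, 177])  # the bytes stored for the two characters
seen = strip_high_bit(stored)

print(list(seen))            # [54, 43, 49, 49]
print(seen.decode("ascii"))  # 6+11
```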

Analysis of experimental results, part three

Quote: Originally posted by Tellin
Insert data with ZHS16GBK
SQL> INSERT INTO TEST VALUES ('东北');

1 row created.

SQL> SELECT * FROM TEST;

R1
--------------------
6+11
??

SQL> EXIT

When the client character set is ZHS16GBK and "东北" ("Northeast") is inserted, Oracle finds that the database character set, US7ASCII, differs from the client's, so conversion is required. But the two characters of "东北" in ZHS16GBK have no corresponding characters in US7ASCII, so Oracle stores the uniform substitution character "?", encoded as 63 (00111111), in their place. At this point the input is actually lost; however the character set is changed afterwards (as in the experimental results quoted below), selecting the second row returns two "?" characters (note: 2, not 4).
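Python's own codecs show the same effect and make the "2, not 4" point easy to verify. In this sketch the ascii codec with errors="replace" stands in for the conversion to US7ASCII. It substitutes one "?" per character, not per byte, which matches the result reported in the experiment.

```python
text = "东北"  # two characters, four bytes in GBK

# Convert to a 7-bit target; unrepresentable characters become "?".
stored = text.encode("ascii", errors="replace")

print(len(text.encode("gbk")))  # 4
print(list(stored))             # [63, 63]

# Reading the stored bytes back as GBK cannot restore the original text.
print(stored.decode("gbk"))     # ??
```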

Quote
Change the client character set to US7ASCII
D:\>set NLS_LANG=AMERICAN_AMERICA.US7ASCII

D:\>sqlplus "/ as sysdba"

The data inserted with ZHS16GBK cannot be displayed, but the data inserted with US7ASCII can
SQL> SELECT * FROM TEST;

R1
----------
东北
??

Change the server character set to ZHS16GBK
SQL> UPDATE PROPS$ SET VALUE$='ZHS16GBK' WHERE NAME='NLS_CHARACTERSET';

1 row updated.

SQL> COMMIT;

Change the client character set to ZHS16GBK
D:\>set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK

D:\>sqlplus "/ as sysdba"

The data inserted earlier with US7ASCII can be displayed, but the data inserted with ZHS16GBK cannot, which shows that the data inserted with ZHS16GBK was stored garbled.

SQL> SELECT * FROM TEST;

R1
--------------------
东北
??

Note that modifying the database character set with "UPDATE PROPS$ SET VALUE$='ZHS16GBK' WHERE NAME='NLS_CHARACTERSET';" is not a normal practice and is likely to cause problems. It appears here only because the original experiment is quoted as-is.

Analysis of experimental results, part four

Quote
SQL> INSERT INTO TEST VALUES ('东北');

1 row created.

SQL> SELECT * FROM TEST;

R1
--------------------
东北
??
东北

SQL> EXIT

Since both the database and the client character sets are ZHS16GBK, no character set conversion takes place. The first and third rows are displayed correctly, and the second row is displayed as "??" because what is stored there is the substitution character 63 (00111111).

Quote
Change the client character set to US7ASCII

D:\>set NLS_LANG=AMERICAN_AMERICA.US7ASCII

D:\>sqlplus "/ as sysdba"

Unable to display the data

SQL> SELECT * FROM TEST;

R1
----------
??
??
??

Question 2: The first row was inserted in the US7ASCII environment, so why is it not displayed correctly?

When the client character set is changed to US7ASCII and a SELECT is issued, Oracle finds that the database character set is ZHS16GBK, so the data requires character set conversion. The Chinese characters "东" (East) and "北" (North) in the first and third rows have no corresponding characters in the client character set US7ASCII, so they are converted to the substitution character "?". The data stored in the second row is itself two "?" characters. So although all three rows show two "?" characters on the client, the content stored in the database differs.
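The point that identical output can hide different stored data can be illustrated with the following Python sketch. It is a simplified model, not Oracle's implementation: the gbk and ascii codecs stand in for ZHS16GBK and US7ASCII, and fetch_as_ascii is a hypothetical helper.

```python
# Stored bytes of the three rows once the database is labelled ZHS16GBK.
rows = [
    bytes([182, 171, 177, 177]),  # row 1: the original GBK bytes
    b"??",                        # row 2: substitution characters
    "东北".encode("gbk"),          # row 3: inserted with matching character sets
]


def fetch_as_ascii(raw: bytes) -> str:
    """Model a SELECT from a GBK database by a 7-bit ASCII client."""
    return raw.decode("gbk").encode("ascii", errors="replace").decode("ascii")


shown = [fetch_as_ascii(raw) for raw in rows]

print(shown)           # ['??', '??', '??']
print(len(set(rows)))  # 2
```

All three rows look the same on the client, yet the table holds two distinct byte sequences.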

Analysis of experimental results, part five

Quote

SQL> INSERT INTO TEST VALUES ('东北');

1 row created.

SQL> EXIT
Change the client character set to ZHS16GBK
D:\>set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK

D:\>sqlplus "/ as sysdba"

The data inserted with US7ASCII cannot be displayed, but the data inserted with ZHS16GBK can
SQL> SELECT * FROM TEST;

R1
--------------------
东北
??
东北
6+11

SQL>
Question 3: US7ASCII is a subset of ZHS16GBK, so why can the data inserted in the US7ASCII environment not be displayed?

When the client character set is US7ASCII and "东北" ("Northeast") is inserted into a database whose character set is ZHS16GBK, character conversion is required. The ZHS16GBK encoding of "东北" is 182 (10110110), 171 (10101011) and 177 (10110001), 177 (10110001). Because US7ASCII is a 7-bit encoding, Oracle treats these two characters as four and ignores the highest bit of each byte. The bytes stored in the database therefore become 54 (00110110), 43 (00101011) and 49 (00110001), 49 (00110001), that is "6+11", and the original information is altered. If the client character set is then set to ZHS16GBK and a SELECT is issued, the data needs no conversion on its way to the client. The first and third rows display "东北" because their data was not altered, while the second and fourth rows cannot display the original text because their data was altered on insertion.
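The four inserts of the experiment can be replayed with a toy model in Python. This is a sketch of the rules as this article describes them, not of Oracle's actual conversion code. The PY_CODEC mapping and the convert function are assumptions made for illustration, with the gbk and ascii codecs standing in for ZHS16GBK and US7ASCII.

```python
# Python codec names standing in for the Oracle character sets (an assumption).
PY_CODEC = {"US7ASCII": "ascii", "ZHS16GBK": "gbk"}


def convert(data: bytes, source: str, target: str) -> bytes:
    """Model the conversion rules described in the text; not Oracle's real code."""
    if source == target:
        return data  # same label on both sides: bytes pass through untouched
    if source == "US7ASCII":
        # 7-bit source character set: the highest bit of each byte is ignored.
        data = bytes(b & 0x7F for b in data)
    text = data.decode(PY_CODEC[source])
    # Characters missing from the target become the substitution character "?".
    return text.encode(PY_CODEC[target], errors="replace")


typed = "东北".encode("gbk")  # what the Chinese client machine actually sends

table = [
    convert(typed, "US7ASCII", "US7ASCII"),  # row 1: client and database US7ASCII
    convert(typed, "ZHS16GBK", "US7ASCII"),  # row 2: client ZHS16GBK, database US7ASCII
    convert(typed, "ZHS16GBK", "ZHS16GBK"),  # row 3: client and database ZHS16GBK
    convert(typed, "US7ASCII", "ZHS16GBK"),  # row 4: client US7ASCII, database ZHS16GBK
]

# Final SELECT: database and client are both ZHS16GBK, so nothing is converted.
for raw in table:
    print(convert(raw, "ZHS16GBK", "ZHS16GBK").decode("gbk"))
```

Under these assumptions the loop prints 东北, ??, 东北 and 6+11, the same four rows as the final SELECT in the experiment. Only the rows whose insert involved no conversion keep the original bytes.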

That was a lot of analysis, but the conclusion is simple. To minimize character set errors and trouble, stick to two basic principles:
On the database side: choose the character set you actually need (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE);
On the client: set the character set that the operating system actually uses (through the environment variable NLS_LANG).

For example:
CHARACTER SET ZHS16GBK
NATIONAL CHARACTER SET AL16UTF16

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
