Analysis of Character Set Problems in Oracle Databases

Source: Internet
Author: User

Questions about Oracle character sets come up often. I would like to work through the topic iteratively, in two passes.

First iteration: master the basic concepts of character sets.
Some may consider this step optional, but in fact many problems and questions arise precisely because the basic concepts are not clearly understood.
The first concept is the character set itself.
Electronic computers were originally used for scientific computation (hence the name "computer"), but as technology developed they were put to other uses. This required computers to handle not only numeric values but also other information such as text and special symbols. Because a computer can directly process only numbers, text and symbols must be given a numeric encoding. The earliest character set, familiar to all of us, is ASCII, which uses seven binary bits to represent 128 characters. Later, to meet the needs of different countries and organizations, many more character sets appeared, such as the ISO 8859 series used for Western European languages, and GB2312-80 and GBK for Chinese characters.
In essence, a character set assigns a distinct numeric code to each symbol in a specific set so that computers can process it.
Conversion between character sets. Once there are many character sets, a problem arises: the same character may be encoded as one value in one character set and as a different value in another. For example, suppose I define two character sets, demo_charset1 and demo_charset2. In demo_charset1 I define three symbols: A (0001), B (0010), and ? (1111). In demo_charset2 I also define three symbols: A (1001), C (1011), and ? (1111). Now suppose I am asked to write a program that converts between demo_charset1 and demo_charset2. Because the encoding rules of both character sets are known, 0001 in demo_charset1 must become 1001 when converted to demo_charset2, while 1111 stays unchanged. The code 0010 in demo_charset1 represents B, but demo_charset2 has no such character, so in principle it cannot be converted. All such cases can be mapped to a special character (a "replacement character") in the target character set. Here we can use ? as the replacement character, so B is converted to ? and information is lost. Likewise, information is lost when the character C of demo_charset2 is converted to demo_charset1.
Therefore, if a character in the source character set is not defined in the target character set, information is lost during character set conversion.
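The conversion just described can be sketched in a few lines of Python. This is only an illustration of the toy example: the two dictionaries reproduce demo_charset1 and demo_charset2 from the text, and the function name convert is a hypothetical helper, not part of any Oracle tool.

```python
# Toy character sets from the text: symbol -> 4-bit code.
DEMO_CHARSET1 = {"A": 0b0001, "B": 0b0010, "?": 0b1111}
DEMO_CHARSET2 = {"A": 0b1001, "C": 0b1011, "?": 0b1111}


def convert(codes, source, target, replacement="?"):
    """Convert a list of codes from the source to the target character set.

    Symbols missing from the target become the replacement character,
    which is exactly where information is lost.
    """
    decode = {code: symbol for symbol, code in source.items()}
    result = []
    for code in codes:
        symbol = decode[code]
        result.append(target.get(symbol, target[replacement]))
    return result


# "A", "B", "?" in demo_charset1, converted to demo_charset2.
converted = convert([0b0001, 0b0010, 0b1111], DEMO_CHARSET1, DEMO_CHARSET2)
print([format(c, "04b") for c in converted])  # ['1001', '1111', '1111']
```

After the conversion, the original B and the original ? are both 1111, so the result can no longer tell them apart.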
Database character set selection.
When creating a database, we need to decide on the character set and the national character set (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE). To decide, we must know what data the database will store. If it only needs to store English text, US7ASCII is sufficient. If it needs to store Chinese characters, we must choose a character set that supports them, such as ZHS16GBK. If it needs to store text in multiple languages, choose UTF8.
The database character set determines which characters the database can process and how they are encoded. Changing the character set after the database has been created is subject to many restrictions, so think the choice through before creating the database.
Many people do not think this through when creating a database and simply pick a default character set such as WE8ISO8859P1 or US7ASCII. Neither defines any encoding for Chinese characters, so storing Chinese characters in such a database is wrong in principle. Although it can appear to work in some cases, it causes a series of problems for the use and maintenance of the database. We analyze this in depth in the second iteration.
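As a rough way to reason about this choice, the following Python sketch checks whether sample text is representable in a candidate character set. It rests on an assumption: Python's codecs (ascii, latin-1, gbk, utf-8) are used as approximate stand-ins for the Oracle character sets named above, and they are not guaranteed to match Oracle's definitions exactly. The names CODEC_FOR and can_store are hypothetical.

```python
# Assumption: Python codecs as approximate stand-ins for Oracle character sets.
CODEC_FOR = {
    "US7ASCII": "ascii",
    "WE8ISO8859P1": "latin-1",
    "ZHS16GBK": "gbk",
    "UTF8": "utf-8",
}


def can_store(text, oracle_charset):
    """Return True if every character of text has a code in the character set."""
    try:
        text.encode(CODEC_FOR[oracle_charset])
        return True
    except UnicodeEncodeError:
        return False


for sample in ["hello", "东北", "东北 한국어"]:
    usable = [name for name in CODEC_FOR if can_store(sample, name)]
    print(sample, "->", usable)
```

English text fits every candidate, Chinese text rules out the two Western character sets, and mixed Chinese and Korean text leaves only the Unicode-based choice.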
Client character set.
Most people with Oracle experience know that the client is configured through NLS_LANG, which has the following form: NLS_LANG = <language>_<territory>.<client character set>. The third part, <client character set>, tells Oracle which character set the client operating system uses by default. NLS_LANG should therefore reflect the actual situation of the client machine, especially its character set, so that Oracle can convert automatically between the database character set and the client character set whenever necessary.
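To make the three-part structure concrete, here is a minimal Python sketch that splits an NLS_LANG value into its parts. The function parse_nls_lang is a hypothetical helper written for this illustration, and it assumes that the language name contains no underscore and the territory contains no dot.

```python
def parse_nls_lang(value):
    """Split '<language>_<territory>.<character set>' into its three parts."""
    locale_part, _, charset = value.partition(".")
    language, _, territory = locale_part.partition("_")
    return {"language": language, "territory": territory, "charset": charset}


print(parse_nls_lang("american_america.zhs16gbk"))
# {'language': 'american', 'territory': 'america', 'charset': 'zhs16gbk'}
```

Only the last part, the character set, matters for the conversion behavior discussed in the rest of this article.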
Key points of the first iteration:
Character set: encodes a specific set of symbols as numeric values that a computer can process;
Conversion between character sets: symbols that exist in both the source and target character sets can in principle be converted without loss; symbols that exist in the source character set but not in the target character set are lost in conversion;
Database character set: choose a character set that contains every symbol to be stored;
Client character set: set it to the default character set actually used by the client operating system.

Second iteration: deepen understanding of the basic concepts through examples

Next I will refer to the post "character set research and question" published on itpub by the user tellin. In that post he lists his experiments and raises some questions about the results. I will analyze his results and answer his questions.
Experiment Result Analysis 1

Quote:
--------------------------------------------------------------------------------
Originally posted by tellin
Set the client character set to us7ascii
D:\> set nls_lang=american_america.us7ascii
Check that the server character set is us7ascii
SQL> select * from nls_database_parameters;
PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_CHARACTERSET               US7ASCII

Create a test table
SQL> create table test (r1 varchar2(10));

Table created.

Insert data
SQL> insert into test values ('东北');

1 row created.

SQL> select * from test;

R1
----------
东北

SQL> exit

--------------------------------------------------------------------------------

This part of the experiment stores and displays the data correctly. There seems to be no problem, but it actually conceals a serious hazard.
First, storing Chinese characters in a database whose character set is US7ASCII is inappropriate: US7ASCII defines only 128 characters and does not support Chinese. In addition, since Chinese characters can be typed into SQL*Plus, the client operating system evidently supports Chinese by default, yet the character set in NLS_LANG is set to us7ascii. That is clearly wrong, because it does not reflect the actual situation of the client.
Yet the display is correct. The main reason is that Oracle checks the character set settings of the database and the client, finds them identical, and therefore performs no conversion when data passes between the client and the database. Specifically, the client enters 东北 ("Northeast"): 东 ("East") is encoded as 182 (10110110), 171 (10101011) and 北 ("North") as 177 (10110001), 177 (10110001). These bytes go into the database unchanged. As a result, the character set the database is labeled with no longer matches the content actually stored, which is itself an inconsistency and an error. On select, Oracle again finds that the database and client character set settings are the same, so it returns the stored content to the client untouched; the client operating system recognizes the bytes as Chinese character codes and displays them correctly.
In this example both the database and the client are configured incorrectly, but the two errors cancel out, so from the application's point of view nothing seems wrong. The hidden risks are considerable, however: string functions such as length or substr may return unexpected results, and import/export becomes even more troublesome. Several people have tested this extensively. For example, eygle studied the case where the source database character set is us7ascii, the export file character set is us7ascii or zhs16gbk, and the target database character set is zhs16gbk. He concluded that "in oracle92, this exported file cannot be correctly imported to the Oracle9i database" and that "in this case, you can use the Oracle8i export tool with the export character set set to us7ascii and, after the export, modify the second and third bytes, changing 0001 to 0354. In this way, you can correctly import data from the us7ascii character set to the database of zhs16gbk". I think these conclusions are better understood as follows: because ZHS16GBK is a superset of US7ASCII, such a conversion should cause no problem under normal operation. The root of the problem is that a US7ASCII database, which should store only English characters, was made to store Chinese data. Given that, it would be strange if the conversion did not produce odd errors or trouble.
To avoid this situation, choose an appropriate character set when creating the database, so that the label (the database's character set setting) never disagrees with the actual content (what is really stored in the database).
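The byte values quoted above can be checked with a short Python sketch. It assumes that Python's gbk codec is an adequate stand-in for ZHS16GBK for these two characters; the variable names are illustrative only.

```python
text = "东北"  # "Northeast": 东 ("East") followed by 北 ("North")
stored = text.encode("gbk")  # the bytes a Chinese client sends unchanged

print(list(stored))                         # [182, 171, 177, 177]
print([format(b, "08b") for b in stored])   # ['10110110', '10101011', '10110001', '10110001']

# Two characters, but four bytes: a function that counts single-byte
# characters would see four "characters", none of them valid 7-bit ASCII.
print(len(text), len(stored))               # 2 4
print(all(b > 127 for b in stored))         # True
```

This mismatch between two characters and four bytes is one reason string functions can return surprising results when the database is labeled US7ASCII but actually holds Chinese text.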
Experiment Result Analysis 2

Quote:
--------------------------------------------------------------------------------
Change the client character set to zhs16gbk
D:\> set nls_lang=american_america.zhs16gbk

D:\> sqlplus "/ as sysdba"

Unable to display data normally

SQL> select * from test;

R1
--------------------
6+11

Question 1: Since zhs16gbk is a superset of us7ascii, why can't the data be displayed normally in the zhs16gbk environment?
--------------------------------------------------------------------------------

The main reason is that Oracle finds that the database character set differs from the character set configured on the client, so it converts the data. The data actually stored in the database is 182 (10110110), 171 (10101011), 177 (10110001), 177 (10110001). Because the database character set is US7ASCII, a 7-bit character set stored in 8-bit bytes, Oracle ignores the highest bit of each byte. Thus 182 (10110110) becomes 54 (0110110), which is the digit "6" in ZHS16GBK (and in other character sets as well). The same happens to the other three bytes, so "东北" ("Northeast") becomes "6+11".
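The arithmetic behind this explanation can be reproduced in Python. The sketch only models the "ignore the highest bit" step as the article describes it; it is not a claim about how Oracle implements the conversion internally.

```python
stored = bytes([182, 171, 177, 177])  # bytes held in the US7ASCII database

# Clear the highest bit of every byte, keeping the low seven bits.
stripped = bytes(b & 0x7F for b in stored)

print(list(stripped))            # [54, 43, 49, 49]
print(stripped.decode("ascii"))  # 6+11
```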
Experiment Result Analysis 3

Quote:
--------------------------------------------------------------------------------
Originally posted by tellin
Insert data using zhs16gbk
SQL> insert into test values ('东北');

1 row created.

SQL> select * from test;

R1
--------------------
6+11
??

SQL> exit
--------------------------------------------------------------------------------

When the client character set is zhs16gbk and "东北" ("Northeast") is inserted, Oracle finds that the database character set, us7ascii, differs from the client's, so the data must be converted. However, the characters of "东北" in ZHS16GBK have no counterparts in US7ASCII, so Oracle stores a uniform "replacement character" for each of them. Here it is "?", encoded as 63 (00111111). The input is truly lost at this point: no matter how the character set settings are changed later (as the following results show), the second row returned by select always consists of "?" characters (note: two of them, not four).
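Python's codecs show the same effect when a conversion to a 7-bit character set is requested with replacement enabled. This is an analogy under the assumption that the ascii codec stands in for US7ASCII; the variable name inserted is illustrative.

```python
# Convert the two Chinese characters to a 7-bit character set,
# substituting "?" for anything that has no code there.
inserted = "东北".encode("ascii", errors="replace")

print(inserted)        # b'??'
print(list(inserted))  # [63, 63]

# One "?" per character (two), not one per original byte (four),
# and nothing in the stored bytes allows the original text to be recovered.
print(inserted.decode("gbk"))  # ??
```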

Quote:
--------------------------------------------------------------------------------
Change the client character set to us7ascii
D:\> set nls_lang=american_america.us7ascii

D:\> sqlplus "/ as sysdba"

The data inserted with zhs16gbk cannot be displayed, but the data inserted with us7ascii can be displayed.
SQL> select * from test;

R1
----------
东北
??

Change the server character set to zhs16gbk
SQL> update props$ set value$ = 'ZHS16GBK' where name = 'NLS_CHARACTERSET';

1 row updated.

SQL> commit;

Change the client character set to zhs16gbk
D:\> set nls_lang=american_america.zhs16gbk

D:\> sqlplus "/ as sysdba"

The data inserted earlier with us7ascii can be displayed, but the data inserted with zhs16gbk cannot. This shows that the data inserted with zhs16gbk is garbled.

SQL> select * from test;

R1
--------------------
东北
??
--------------------------------------------------------------------------------

Note that changing the database character set with "update props$ set value$ = 'ZHS16GBK' where name = 'NLS_CHARACTERSET';" is not a standard practice and may cause problems. It appears here only because it is part of the quoted experiment.

Experiment Result Analysis 4

Quote:
--------------------------------------------------------------------------------
SQL> insert into test values ('东北');

1 row created.

SQL> select * from test;

R1
--------------------
东北
??
东北

SQL> exit
--------------------------------------------------------------------------------

Because the database and client character sets are now both zhs16gbk, no character set conversion takes place. The first and third rows display correctly; the second row displays "??" because the data stored is 63 (00111111), the code of "?".

Quote:
--------------------------------------------------------------------------------
Change the client character set to us7ascii

D:\> set nls_lang=american_america.us7ascii

D:\> sqlplus "/ as sysdba"

Unable to display data

SQL> select * from test;

R1
----------
??
??
??

Question 2: The first row was inserted in the us7ascii environment. Why can't it be displayed normally?
--------------------------------------------------------------------------------

After the client character set is changed to us7ascii and a select is run, Oracle finds that the database character set is zhs16gbk, so the data must be converted. The Chinese characters 东 ("East") and 北 ("North") in the first and third rows have no counterparts in the client character set us7ascii, so each is converted to the replacement character "?". The second row, by contrast, really is stored in the database as two "?" characters. So although all three rows appear on the client as "??", the content stored in the database differs.
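A small Python sketch makes the difference between what is stored and what is displayed visible. It assumes the stored bytes described in this article and again uses Python's gbk and ascii codecs as stand-ins for ZHS16GBK and US7ASCII; the names rows and shown are illustrative.

```python
# Stored bytes of the three rows, as described in the text.
rows = [
    b"\xb6\xab\xb1\xb1",  # row 1: the original bytes of the two Chinese characters
    b"??",                # row 2: two real "?" characters (63, 63)
    b"\xb6\xab\xb1\xb1",  # row 3: same bytes as row 1
]

# Select with a 7-bit client: database character set -> client character set.
shown = [
    row.decode("gbk").encode("ascii", errors="replace").decode("ascii")
    for row in rows
]

print(shown)               # ['??', '??', '??']
print(rows[0] == rows[1])  # False
```

The client sees three identical rows, but only the second row has actually lost its content in the database.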

Experiment Result Analysis 5

Quote:
--------------------------------------------------------------------------------
SQL> insert into test values ('东北');

1 row created.

SQL> exit
Change the client character set to zhs16gbk
D:\> set nls_lang=american_america.zhs16gbk

D:\> sqlplus "/ as sysdba"

The data inserted with us7ascii cannot be displayed, but the data inserted with zhs16gbk can be displayed.
SQL> select * from test;

R1
--------------------
东北
??
东北
6+11

SQL>
Question 3: Since us7ascii is a subset of zhs16gbk, why can't the data inserted in the us7ascii environment be displayed?
--------------------------------------------------------------------------------

When the client character set is us7ascii and "东北" ("Northeast") is inserted into a database whose character set is zhs16gbk, character conversion is required. In ZHS16GBK, "东北" is encoded as 182 (10110110), 171 (10101011), 177 (10110001), 177 (10110001). Because US7ASCII is a 7-bit encoding, Oracle treats the two Chinese characters as four characters and ignores the highest bit of each byte, so the codes stored in the database are 54 (00110110), 43 (00101011), 49 (00110001), and 49 (00110001), that is, "6+11". The original information has been altered. When the client character set is then set to zhs16gbk and a select is run, no conversion is needed on the way to the client. The first and third rows display "东北" because the stored information was never changed; the second and fourth rows cannot show the original information because it was altered at insert time.
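The four rows of the final result can be reproduced with a toy model of the insert path. This is only a sketch of the behavior as this article describes it, not Oracle's actual conversion algorithm; the function to_db is hypothetical, and Python's ascii and gbk codecs stand in for US7ASCII and ZHS16GBK.

```python
def to_db(client_bytes, client_cs, db_cs):
    """Toy model of what gets stored when a client inserts raw bytes."""
    if client_cs == db_cs:
        # Same setting on both sides: bytes pass through unchanged.
        return client_bytes
    if client_cs == "ascii":
        # 7-bit client: the highest bit of each byte is ignored.
        return bytes(b & 0x7F for b in client_bytes)
    # Otherwise convert, substituting "?" for characters with no counterpart.
    return client_bytes.decode(client_cs).encode(db_cs, errors="replace")


typed = "东北".encode("gbk")  # bytes produced by the Chinese client OS

table = [
    to_db(typed, "ascii", "ascii"),  # row 1: both sides us7ascii
    to_db(typed, "gbk", "ascii"),    # row 2: client zhs16gbk, database us7ascii
    to_db(typed, "gbk", "gbk"),      # row 3: both sides zhs16gbk
    to_db(typed, "ascii", "gbk"),    # row 4: client us7ascii, database zhs16gbk
]

# Final select: client and database are both zhs16gbk, so no conversion.
shown = [row.decode("gbk") for row in table]
print(shown)  # ['东北', '??', '东北', '6+11']
```

Rows 2 and 4 were damaged at insert time, which is why no later change of settings brings the original text back.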
After all this analysis, the conclusion is actually very simple. To reduce character set errors and trouble, adhere to two basic principles:
On the database side: choose a character set that fits the data to be stored (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE);
On the client side: set the character set that the operating system actually uses (through the NLS_LANG environment variable).

 

 
