Welcome to the Oracle community forum and interact with 2 million technical staff. I often see some questions about ORACLE character sets. I would like to introduce them in an iterative manner. First iteration: master the basic concepts of character sets. Some friends may think this is an extra move, but it is because they are not sure about the basic concepts.
Welcome to the Oracle community forum, and interact with 2 million technical staff> I often see some questions about ORACLE character sets. I would like to introduce them in an iterative manner. First iteration: master the basic concepts of character sets. Some friends may think this is an extra move, but it is because they are not sure about the basic concepts.
Welcome to the Oracle community forum and interact with 2 million technical staff> enter
I often see some questions about ORACLE character sets. I 'd like to introduce them in an iterative way.
First iteration: master the basic concepts of character sets.
Some may think that this is an option, but in fact it is precisely because of the unclear understanding of the basic concepts that lead to many problems and questions.
The first is the concept of character sets.
We know that electronic computers were originally used for scientific computing (so called "computers"), but with the development of technology, computers need to be used for other applications. This requires that the computer not only can process numeric values, but also other information such as text and special symbols. The computer itself can only process numerical information directly, therefore, we need to perform numerical encoding on these text and symbol information. The initial character set is ASCII which we are all very familiar with. It uses seven binary bits to represent 128 characters, and later with the needs of different countries, organizations, there are a lot of character sets, such as the ISO8859 series of the western European character set, representing Chinese characters such as GB2312-80, GBK character set.
The essence of character set is to assign different numerical codes to a specific set of symbols for computer processing.
Conversion between character sets. If there are more character sets, a problem occurs. For example, a character is encoded as a value in one character set, and another character set is encoded as another value, for example, I want to create two character sets: demo_charset1 and demo_charset2. In demo_charset1, I specify three symbols encoded as A (0001), B (0010 ),? (1111); In demo_charset2, I also specify three symbols encoded as: A (1001), C (1011 ),? (1111), then I received a task and wrote a program to convert between demo_charset1 and demo_charset2. Because we know the encoding rules of the two character sets, for demo_charset1's 0001, we need to change the encoding to 1001 when converting to demo_charset2; For demo_charset1's 1111, the value remains unchanged when converting to demo_charset2; for demo_charset1's 0010, the corresponding character is B, but there is no corresponding character in demo_charset2, so it cannot be converted theoretically. For all such cases, we can convert them to a special character ("replacement character") in the target character set. For example, here we can convert? As a replacement character, B is converted ?, Information loss occurs. Similarly, information loss occurs when the C character of demo_charset2 is converted to demo_charset1.
Therefore, if a character in the source character set is not defined in the target character set during character set conversion, information will be lost.
Database character set selection.
When creating a database, we need to consider the character set and the national character set (specified by the character set and national character set clauses in create database ). To solve this problem, we must know what data needs to be stored in the database. If you only need to store English information, you can select US7ASCII as the character set. However, if you want to store Chinese characters, then we need to select a character set that supports Chinese characters (such as ZHS16GBK). If you need to store multi-language texts, You need to select UTF8.
The determination of the character set in the database actually indicates the character set that the database can process and the encoding method. There are many restrictions on changing the character set after selection, therefore, when creating a database, you must consider it before selecting it.
Many of our friends do not think clearly when creating a database. They often choose a default character set, such as WE8ISO8859P1 or US7ASCII, which has no Chinese character encoding, therefore, using this character set to store Chinese characters is in principle incorrect. Although this character set can be used normally in some cases, it will bring a series of troubles to the use and maintenance of the database. We will analyze it in depth in the subsequent iteration process.
Character Set of the client.
If you have experience using Oracle, most users will know how to set the client through NLS_LANG. NLS_LANG is composed of the following parts: NLS_LANG = _ . , The third part Is used to specify the default character set used by the client operating system. Therefore, according to the regular usage, NLS_LANG should be configured according to the actual situation of the client machine, especially for character sets, in this way, Oracle can achieve automatic conversion between the database character set and the client character set to the maximum extent (if necessary ).
Summarize the key points of the first iteration:
Character Set: encode a specific symbolic set as a numerical value that can be processed by a computer;
Conversion between character sets: For symbols that exist in both the source and target character sets, theoretically conversion will not produce information loss; for symbols that exist in the source character set but do not exist in the target character set, theoretically conversion will lead to loss of information;
Database Character Set: select a character set that can contain all information symbols to be stored;
Client Character Set setting: Specifies the default character set used by the client operating system.
Second iteration: deepen understanding of basic concepts through instances
Next I will reference the "CHARACTER SET research and question" post posted by the netizen tellin on ITPUB. This friend will list his experiments in this post, I also raised some questions about the experiment results. I will analyze his experiment results and answer his questions.
Experiment Result Analysis 1
Quote:
--------------------------------------------------------------------------------
Initially released by tellin
Set the client character set to US7ASCII
D:> SET NLS_LANG = AMERICAN_AMERICA.US7ASCII
Check that the server character set is US7ASCII
SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
----------------------------------------------------------------------
NLS_CHARACTERSET US7ASCII
Create a test table
SQL> CREATE TABLE TEST (R1 VARCHAR2 (10 ));
Table created.
Insert data
SQL> INSERT INTO TEST VALUES ('northeast ');
1 row created.
SQL> SELECT * FROM TEST;
R1
----------
Northeast China
SQL> EXIT
--------------------------------------------------------------------------------
The access and display of this part of the experiment data are correct. It seems that there is no problem, but it actually hides a great hidden danger.
First, it is not appropriate to store Chinese characters in the database and set the database character set to US7ASCII. The US7ASCII character set defines only 128 characters and does not support Chinese characters. In addition, because SQL * PLUS can input Chinese characters, the operating system should support Chinese characters by default, but the character set in NLS_LANG is set to US7ASCII, which is obviously incorrect, it does not reflect the actual situation of the client.
But the actual display is correct. This is mainly because Oracle checks the character set settings of the database and the client, and no conversion will occur during the data access process between the customer and the database. Specifically, on the client, enter "Northeast" and "East" Chinese characters encoded as 182 (10110110), 171 (10101011), and "North" Chinese characters encoded as 177 (10110001), 177 (10110001), they will not be changed into the database, but this actually causes the character set of the database logo to be inconsistent with the actually saved content, in a sense, this is also an inconsistency and a mistake. In the SELECT process, Oracle also checks and finds that the character set settings of the database and the client are the same, so it also transfers the stored content to the client intact, the client operating system identifies that this is a Chinese character code, so it can be correctly displayed.
In this example, the settings of the database and the client are incorrect, but it seems to have a "negative Positive" effect. From the perspective of the application, it seems that there is no problem. However, there are great hidden risks, such as unexpected results when applying string functions such as length or substr. In addition, if you encounter import/export (import/export), it will be more troublesome. Some friends have done a lot of tests in this regard. For example, eygle has studied the situation that "the source database character set is US7ASCII, the exported file character set is US7ASCII or ZHS16GBK, and the target database character set is ZHS16GBK, he concluded that "if we find that in Oracle92, this exported file cannot be correctly imported to the Oracle9i database. "In this case, you can use the Oracle8i export tool to set the export character set to US7ASCII, after the export, modify the second and third characters and 0001 to 0354. In this way, you can correctly import data from the US7ASCII character set to the database of ZHS16GBK ". I think this understanding of these conclusions may be more appropriate: Because the ZHS16GBK character set is the super of US7ASCII, if you follow the normal operation, this conversion should be no problem; however, the essence of the problem is that we have made the US7ASCII database, which should only store English characters, not to mention Chinese information, so there will be no strange errors or troubles in the conversion process, it's a bit strange to be out of trouble.