Thorough understanding of Oracle character Set ____oracle

Source: Internet
Author: User
The following is a detailed analysis of the character set in Oracle, the need for friends can refer to the following

Basic concepts
Character Set (Character set): Is a collection of all abstract characters supported by a system. Characters are the general name of various words and symbols, including the national characters, punctuation marks, graphic symbols, numbers and so on. Common character sets are ASCII,ZHS16GB231280,ZHS16GBK and so on.

Character encoding (Character Encoding): A set of rules that can be used to pair a set of characters of a natural language (such as an alphabet or syllable table) with one of its collections (such as computer coding). The corresponding relationship between the symbol set and the digital system is established. As opposed to the character set, common character encodings are: ascii,zhs16gbk,zht16big5,zhs32gb18030, and so on.
The definition of a character set is actually a collection of characters, while character encoding refers to how these characters are converted into bytes for saving, reading, and transmitting.

Universal Code (UNICODE): Contains almost all the available characters of human, and is increasing every year, can be considered as a common character set. It will unify all the characters in the world, unified coding, no longer appear character incompatibility and character conversion problems.
It has the following three ways of encoding:
1.UTF-32 encoding: Fixed use of 4 bytes to represent a character, there is a problem of space utilization efficiency.
2.UTF-16 encoding: Uses two bytes for more than 60,000 characters that are commonly used, and uses 4 bytes for the remainder.
3.UTF-8 encoding: Compatible with ASCII encoding; Latin, Greek, etc. use two bytes; other commonly used characters, including Chinese characters, use three bytes; the remaining few characters use four bytes.

Oracle Character Set rationale
Before you understand the fundamentals of Oracle character sets, be sure to distinguish the following three concepts:
1. Oracle database server Character set: That is, Oracle in which character encoding storage characters, you can use the following statement to detect the setting of the database character set.
Copy code code as follows:
Sql> SELECT * from v$nls_parameters where parameter= ' nls_characterset ';
PARAMETER VALUE
------------------------------ -----------------
Nls_characterset Al32utf8


2. Client operating system Character set: that is, which character encoding the client operating system stores characters.
If it is windows, you can use the CHCP command to get the code page: Copy the code as follows:
C:usersxianzhu>chcp
Active Code page:936


According to the code page, to Microsoft's Official document National Language Support (NLS) API Reference found its corresponding character set.
If it is Linux, the character set is set in/etc/sysconfig/i18n: Copy code code as follows:
Lang= "ZH_CN. GB2312 "(Specifies the character set of the current operating system)
Supported= "ZH_CN. GB2312 "(Specifies the character set supported by the current operating system)
Sysfont= "Lat0-sun16" (Specifies the font for the current operating system)


3. Client Nls_lang parameter: This parameter is used to instruct Oracle on the character set of the client operating system.
With the above 3 basic concepts, let me elaborate on the basic principles of Oracle character set translation:
1. Set the client's Nls_lang as the character set of the client operating system
2. If the database character set equals Nls_lang, the database and the client transmit characters without any conversion
3. If they are unequal, they need to be converted between different character sets, and only the client operating system character set is based on a subset of the database character set to be converted correctly, otherwise garbled characters will appear.
Analysis of several common situations
Let's look at an example, and then look at the nature of the phenomenon, we will analyze this example.
This example is as follows: The copy code code is as follows:
1. Database character set is Unicode (UTF-8 encoding)
Our database version is 10.2.0.4.0, and the database character set is:
Sql> SELECT * from v$nls_parameters where parameter= ' nls_characterset ';
PARAMETER VALUE
---------------------------------------- ------------------------------
Nls_characterset Al32utf8
2. Client operating system character set is code page 936 (character set is ZHS16GBK)
You can use CHCP to get the code page for Windows
C:documents and Settingsa105024desktop>chcp
Active Code page:936
3. Create a test table
Sql> CREATE TABLE Test (ID number,var varchar2 (30));
Table created.
4. Inserting data
Here, the same operating system to start two Session,session1 Nls_lang set to the same as the database character set (that is, Al32utf8):
C:documents and Settingsa105024desktop>set nls_lang=simplified Chinese_china.al32utf8
To connect to a database and insert a piece of data:
Session_1>insert into test values (1, ' China ');
1 row created.
session_1>commit;
Commit complete.
The Session2 Nls_lang is set to be the same as the client operating system (i.e. ZHS16GBK):
C:documents and Settingsa105024desktop>set nls_lang=simplified CHINESE_CHINA.ZHS16GBK
To connect to a database and insert a piece of data:
Session_2>insert into test values (2, ' China ');
1 row created.
session_2>commit;
Commit complete.
5. Execute Query
To execute a query on session 1:
Session_1>select * from Test;
ID VAR
---------- ---------------------
1 China
2 Juan Ricoh
To execute a query on session 2:
Session_2>select * from Test;
ID VAR
---------- --------------------
1.??
2 China


The example above looks very strange, session1 and 2 can display their own inserted strings, and can not properly display the other inserted strings. To find out, we first need to know how the database is stored in the two strings. We can use the dump function to get the character's encoding in the database: Copy code code as follows:
Sql> Select Id,dump (var,1016) from test;
ID DUMP (var,1016)
-- ------------------------------------------------------------
1 typ=1 len=4 Characterset=al32utf8:d6,d0,b9,fa
2 Typ=1 len=6 CHARACTERSET=AL32UTF8:E4,B8,AD,E5,9B,BD


According to Al32utf8 's code, the word "China" is encoded correctly (all 3 bytes):
Middle--e4,b8,ad
National--E5,9B,BD
So the string inserted in session 1 is encoded incorrectly in the database, session 2 is correct. This is also why it is important to set the Nls_lang as the client operating system's character set.
But according to the above experiments, we can know that the database is stored correctly, does not represent the client can be normal display, the same time, the real-time database is not properly stored, sometimes the client can also be normal display, this is why. Don't worry, please listen to me slowly:

Scene 1:session 1 Insert, Session 1 query, store errors in the database, but display correctly.
Insert procedure:
"China" in the client operating system character Set ZHS16GBK encoding is "D6,D0,B9,FA", because the Nls_lang and the database character set, the database end of the client-transmitted character encoding without any conversion directly into the database, so the database stored in the encoding is also " D6,d0,b9,fa ",
Read process:
Database-side read encoding is "D6,D0,B9,FA", because the Nls_lang and database character set is the same, the client to the database end of the character encoding does not make any conversion direct display, encoding "D6,D0,B9,FA" In the client operating system character set zhs16gbk the corresponding Chinese character is "China".

From the above analysis, although the reading is correct, but that is because negative negative positive, in fact, storage in the database is wrong, so be particularly careful in this case, in the build library to avoid. In fact, as long as the length of the operation can easily uncover its mask: Copy code is as follows:
Session_1>select Length (Var) from Test where id=1;
LENGTH (VAR)
-----------
3


The length of the result is 3. The actual length is only 2, which can cause a lot of trouble.

Scene 2:session 1 Insert, Session 2 query, store errors in the database, and display errors.
The insertion process, like Scene 1, is no longer described here.
Read process:
Database-side reads the encoding is "D6,D0,B9,FA", because Nls_lang and database character set different, the client to the database end of the character encoding conversion, database end character Set Al32utf8 in "D6,d0,b9,fa" The corresponding encoding cannot be found in the client operating system character set ZHS16GBK, so it has to be replaced with a.

Scene 3:session 2 Insert, Session 1 query, stored correctly in the database, but displays an error.
Insert procedure:
"China" in the client operating system character Set ZHS16GBK encoding is "D6,D0,B9,FA", because Nls_lang and the database character set is different, Oracle will do character encoding conversion, that is, the character set ZHS16GBK "China" code "D6,D0, B9,fa "Al32utf8" to "China" encoding "E4,B8,AD,E5,9B,BD" in the character set "".
Read process:
Database-side read encoding is "E4,B8,AD,E5,9B,BD", because the Nls_lang and database character set is the same, the client to the database end of the character encoding does not make any conversion direct display, encoding "E4,B8,AD,E5,9B,BD" In the client operating system character set ZHS16GBK the corresponding kanji is "Ricoh" (originally 2 characters, now 3 characters, since the ZHS16GBK characters are encoded in 2 bytes).

Scene 4:session 2 Insert, Session 2 query, stored correctly in the database, the display is correct.
The insert process is similar to Scenario 3.
Read process:
Database-side reads the encoding is "E4,B8,AD,E5,9B,BD", because Nls_lang and database character set different, the client to the database end of the character encoding conversion, database end character set Al32utf8 "China" Two words encoding "e4,b8,ad,e5, 9B,BD "translated into the client operating system character set ZHS16GBK" China "code" D6,D0,B9,FA ", and normal display.
Although this situation has undergone two conversions, it is indeed the most correct and recommended way.

Appendix: The corresponding relationship of Oracle character set superset and subset can be viewed: http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#sthref1988

Conclusion: Nls_lang is only relevant to the character set of the client operating system, and if the character set of the client operating system and the database character set cannot be converted correctly, the character set of the client terminal should be changed first instead of simply setting the Nls_lang to be the same as the database character set.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.