Thoroughly understand the Oracle Character Set

Source: Internet
Author: User

The following is a detailed analysis of the character set in Oracle. For more information, see

Basic Concepts
Character set ):
Is a collection of all abstract characters supported by the system. A character is a general term for all types of texts and symbols, including Chinese characters, punctuation marks, graphical symbols, and numbers. Common Character sets include ASCII, ZHS16GB231280, and ZHS16GBK.

Character Encoding (Character Encoding ):
Is a set of rules that can be used to pair a set of natural language characters (such as an alphabet or syllable table) with other sets (such as computer code. That is, the correspondence between the Symbol Set and the digital system is established. AndCharacter sets correspond to common character encodings:ASCii, ZHS16GBK, ZHT16BIG5, ZHS32GB18030, etc.
Character sets are actually character sets. character encoding means how to convert these characters into bytes for saving, reading, and transmitting.

Universal Code (Unicode ):It contains almost all the characters available to humans and is constantly increasing every year. It can be seen as a general character set. It unifies all the characters in the world and encodes them in a unified manner, so that character incompatibility and character conversion will no longer occur.
It has three encoding methods:
1. UTF-32 Encoding:
4 bytes are fixed to indicate a character, which causes space utilization efficiency.
2. UTF-16 Encoding:Encode the commonly used 60000 + characters in two bytes, and use the remaining 4 bytes.
3. UTF-8 encoding:It is compatible with ASCII encoding. It uses two bytes for Latin and Greek characters. Other common characters including Chinese characters use three bytes. The remaining few characters use four bytes.

Basic Principles of Oracle Character Set
Before understanding the basic principles of the Oracle character set, you must first distinguish the following three concepts:
1. Oracle Database Server Character Set: Which character encoding is used to store characters in Oracle? You can use the following statement to check the settings of the database character set.

Copy codeThe Code is as follows:
SQL> select * from v $ nls_parameters where parameter = 'nls _ CHARACTERSET ';
PARAMETER VALUE
-----------------------------------------------
NLS_CHARACTERSET AL32UTF8


2. client operating system character set: the character encoding used by the client operating system to store characters.
For Windows, you can use the chcp command to obtain the code page ):

Copy codeThe Code is as follows:
C: Usersxianzhu> chcp
Active code page: 936


According to the code page, find the character set corresponding to the official Microsoft documentation "National Language Support (NLS) API Reference.
For Linux, set the character set in/etc/sysconfig/i18n:

Copy codeThe Code is as follows:
LANG = "zh_CN.GB2312" (specify the character set of the current operating system)
SUPPORTED = "zh_CN.GB2312" (specify the character set SUPPORTED by the current operating system)
SYSFONT = "lat0-sun16" (specifies the font of the current operating system)


3. Client NLS_LANG parameter: this parameter is used to indicate to Oracle the character set of the client operating system.
With the above three basic concepts, I will explain the basic principles of Oracle Character Set conversion:
1. Set the client's NLS_LANG to the character set of the client's Operating System
2. If the database character set is equal to NLS_LANG, no conversion is made when the database and client transmit characters.
3. If they are not the same, they need to be converted between different character sets. Only when the character set of the client operating system is a subset of the database character set can the conversion be correct. Otherwise, garbled characters may occur.
Analysis of several common situations
Let's take a look at an example first, and then look at the nature of the phenomenon. We will analyze this example.
This example is as follows:

Copy codeThe Code is as follows:
1. database character set is Unicode (UTF-8 encoding)
Our database version is 10.2.0.4.0, and the database character set is:
SQL> select * from v $ nls_parameters where parameter = 'nls _ CHARACTERSET ';
PARAMETER VALUE
----------------------------------------------------------------------
NLS_CHARACTERSET AL32UTF8
2. The client operating system character set is code page 936 (character set is ZHS16GBK)
You can use chcp to obtain the windows code page)
C: Documents and Settingsa105024Desktop> chcp
Active code page: 936
3. Create a test table
SQL> create table test (id number, var varchar2 (30 ));
Table created.
4. insert data
Here we start two sessions in the same operating system. The NLS_LANG of session1 is set to the same as the database character set (that is, AL32UTF8 ):
C: Documents and Settingsa105024Desktop> set nls_lang = Simplified Chinese_China.AL32UTF8
Connect to the database and insert a piece of data:
Session_1> insert into test values (1, 'China ');
1 row created.
Session_1> commit;
Commit complete.
Set the NLS_LANG of session2 to the same as the client operating system (ZHS16GBK ):
C: Documents and Settingsa105024Desktop> set nls_lang = Simplified Chinese_China.ZHS16GBK
Connect to the database and insert a piece of data:
Session_2> insert into test values (2, 'China ');
1 row created.
Session_2> commit;
Commit complete.
5. Execute the query
Execute the query on session 1:
Session_1> select * from test;
ID VAR
-------------------------------
1 China
2 Juan
Execute the query on session 2:
Session_2> select * from test;
ID VAR
------------------------------
1 ???
2 China


The above example looks very strange. session1 and 2 can display their inserted strings normally, and neither can display the strings inserted by the other party. First, we need to know how the two strings are stored in the database. We can use the dump function to get the encoding of Characters in the database:

Copy codeThe Code is as follows:
SQL> select id and dump (var, 1016) from test;
Id dump (VAR, 1016)
--------------------------------------------------------------
1 Typ = 1 Len = 4 CharacterSet = AL32UTF8: d6, d0, b9, fa
2 Typ = 1 Len = 6 CharacterSet = AL32UTF8: e4, b8, ad, e5, 9b, bd


According to the AL32UTF8 encoding, the correct encoding of the word "China" is (all three bytes ):
Medium -- e4, b8, ad
Country-e5, 9b, bd
Therefore, the encoding of the string inserted by session 1 in the database is incorrect, and session 2 is correct. This is why you must set NLS_LANG as the character set of the client operating system.
However, based on the experiment above, we can know that the database is properly stored, which does not mean that the client can be properly displayed; similarly, the real-time database is not properly stored, and sometimes the client can be properly displayed, why? Don't worry, please let me know:

Scenario 1: session 1 is inserted, session 1 is queried, and an error is stored in the database, but it is displayed correctly.
Insert Process:
The Chinese character encoding in the client operating system Character Set ZHS16GBK is "d6, d0, b9, fa", because NLS_LANG is the same as the database character set, the database directly saves the character encoding passed by the client to the database without any conversion. Therefore, the encoding stored in the database is also "d6, d0, b9, fa ",
Read process:
The encoding read from the database is "d6, d0, b9, fa". Because NLS_LANG is the same as the database character set, the client directly displays the character encoding transmitted from the database without any conversion, encode "d6, d0, b9, fa" the Chinese character corresponding to the character set ZHS16GBK in the client operating system is "China ".

From the above analysis, we can see that although the read is correct, it is because the negative is positive. In fact, the storage in the database is incorrect. Therefore, we must be very careful about this situation and avoid it in the generated database. In fact, as long as you perform the length operation on it, you can easily uncover its mask:

Copy codeThe Code is as follows:
Session_1> select length (var) from test where id = 1;
LENGTH (VAR)
-----------
3


The length is 3! The actual length is only 2, which causes a lot of trouble.

Scenario 2: session 1 insertion, session 2 query, storage errors in the database, and display errors.
The insert process is the same as that in scenario 1.
Read process:
The data database reads "d6, d0, b9, fa" encoding. Because the NLS_LANG character set is different from the database character set, the client converts the character encoding transmitted from the database, the database character set AL32UTF8 is encoded as "d6, d0, b9, fa" and cannot be found in the ZHS16GBK Character Set of the client operating system? .

Scenario 3: session 2 insertion, session 1 query, Correct storage in the database, but an error is displayed.
Insert Process:
"China" the encoding of the two characters in the ZHS16GBK Character Set of the client operating system is "d6, d0, b9, fa". Since the NLS_LANG character set is different from the database character set, Oracle will perform character encoding conversion, that is, convert the "China" encoding "d6, d0, b9, and fa" in the character set ZHS16GBK to the "e4, b8, ad" encoding "of" China "in the character set" AL32UTF8, e5, 9b, bd ".
Read process:
The database reads the encoding "e4, b8, ad, e5, 9b, bd". Because the NLS_LANG is the same as the database character set, the client directly displays the character codes transmitted from the database without any conversion. The encoding formats include e4, b8, ad, e5, and 9b, bd "the character set ZHS16GBK corresponding to the operating system character set on the client is" Juan "(originally two characters, but now it is three characters, because the Chinese character of ZHS16GBK is encoded in two bytes ).

Scenario 4: session 2 insertion, session 2 query, Correct storage in the database, and displayed correctly.
The Insert Process is similar to scenario 3.
Read process:
The encoding for reading data from the database is "e4, b8, ad, e5, 9b, bd". Because NLS_LANG is different from the database character set, the client converts the character encoding transmitted from the database. The data library Character Set AL32UTF8 contains "Chinese" two-character encoding "e4, b8, ad, e5, 9b, bd "converts it to the" China "encoding code" d6, d0, b9, fa "in the ZHS16GBK Character Set of the client operating system and is displayed normally.
Although this case has been converted twice, it is indeed the most correct and recommended method.

Appendix: for the relationship between Oracle Character Set supersets and subsets, see:Http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#sthref1988

Conclusion:
NLS_LANG is only related to the character set of the client operating system. If the character set of the client operating system cannot be correctly converted from the database character set, you should first change the character set of the client terminal, instead of simply setting NLS_LANG to be the same as the database character set.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.