[Oracle] Understand Oracle Character Set

Last Update:2013-11-24 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

[Oracle] A Character set is a collection of all abstract characters supported by the system. A character is a general term for all types of texts and symbols, including Chinese characters, punctuation marks, graphical symbols, and numbers. Common Character sets include ASCII, ZHS16GB231280, and ZHS16GBK. Character Encoding (Character Encoding): a set of rules that can be used for a set of natural language characters (such as the alphabet or syllable table ), it is paired with another set (such as computer code. That is, the correspondence between the Symbol Set and the digital system is established. Similar to character sets, common character encodings include ASCii, ZHS16GBK, ZHT16BIG5, and ZHS32GB18030. Character sets are actually character sets. character encoding means how to convert these characters into bytes for saving, reading, and transmitting. Unicode: contains almost all the characters available to humans. It is constantly increasing every year and can be seen as a general character set. It unifies all the characters in the world and encodes them in a unified manner, so that character incompatibility and character conversion will no longer occur. It has three encoding methods: UTF-32 encoding: fixed to use 4 bytes to represent a character, there is a space utilization efficiency problem. UTF-16 encoding: two bytes are used for the more than 60000 characters that are commonly used, and four bytes are used for the rest. UTF-8 encoding: compatible with ASCII encoding. Two bytes are used for Latin and Greek. Other common characters including Chinese characters use three bytes. The remaining rarely used characters use four bytes. Before understanding the basic principles of the Oracle character set, you must first distinguish the following three concepts: 1. oracle Database Server character set: the character encoding used by Oracle to store characters. You can use the following statement to check the character set settings of the database.

[sql] SQL> select * from v$nls_parameters where parameter='NLS_CHARACTERSET';    PARAMETER                      VALUE  ------------------------------ -----------------  NLS_CHARACTERSET               AL32UTF8

2. client operating system character set: the character encoding used by the client operating system to store characters. For Windows, you can use the chcp command to obtain the code page ):

[sql] C:\Users\xianzhu>chcp  Active code page: 936

According to the code page, find the character set corresponding to the official Microsoft documentation "National Language Support (NLS) API Reference. For Linux, set the character set in/etc/sysconfig/i18n:

[Plain] LANG = "zh_CN.GB2312" (specify the character set of the current operating system) SUPPORTED = "zh_CN.GB2312" (specify the character set SUPPORTED by the current operating system) SYSFONT = "lat0-sun16" (specifies the font of the current operating system)

3. Client NLS_LANG parameter: this parameter is used to indicate to Oracle the character set of the client operating system. With the above three basic concepts, I will elaborate on the basic principles of Oracle Character Set conversion: Set the client's NLS_LANG to the character set of the client's operating system. If the database character set is equal to NLS_LANG, if the character sets of the database and the client are not converted during character transfer, the character sets must be converted between different character sets. Only when the character set of the client operating system is a subset of the database character set can the character set be correctly converted, otherwise, garbled characters may occur. The following is an example of the analysis of several common situations. We will analyze this example by looking at the nature of phenomena. This example is as follows:

[SQL] 1. database character set is Unicode (UTF-8 encoding) Our database version is 10.2.0.4.0, database character set is: SQL> select * from v $ nls_parameters where parameter = 'nls _ CHARACTERSET '; parameter value ---------------------------------------- -------------------------------- NLS_CHARACTERSET AL32UTF8

2. the client operating system character set is code page 936 (the character set is ZHS16GBK). You can use chcp to obtain the windows code page C: \ Documents ents and Settings \ a105024 \ Desktop> chcp Active code page: 936 3. create test table

SQL> create table test(id number,var varchar2(30));

Table created.

4. insert data here two sessions are started in the same operating system. The NLS_LANG of session1 is set to the same as the database character set (that is, AL32UTF8): C: \ Documents and Settings \ a105024 \ Desktop> set nls_lang = Simplified Chinese_China.AL32UTF8 connect to the database and insert a data entry:

Session_1> insert into test values (1, 'China'); 1 row created. Session_1> commit; Commit complete.

Set NLS_LANG of session2 to the same as the client operating system (ZHS16GBK): C: \ Documents and Settings \ a105024 \ Desktop> set nls_lang = Simplified Chinese_China.ZHS16GBK to connect to the database and insert a data entry:

Session_2> insert into test values (2, 'China'); 1 row created. Session_2> commit; Commit complete.

5. Execute the query on session 1:

Session_1> select * from test; id var ---------- --------------------- 1 China 2 queries on session 2: Session_2> select * from test; id var ---------- -------------------- 1 ??? 2 China

The above example looks very strange. session1 and 2 can display their inserted strings normally, and neither can display the strings inserted by the other party. First, we need to know how the two strings are stored in the database. We can use the dump function to get the encoding of Characters in the database:

[sql] SQL> select id,dump(var,1016) from test;  ID DUMP(VAR,1016)  -- ------------------------------------------------------------   1 Typ=1 Len=4 CharacterSet=AL32UTF8: d6,d0,b9,fa   2 Typ=1 Len=6 CharacterSet=AL32UTF8: e4,b8,ad,e5,9b,bd

According to the AL32UTF8 encoding, the correct encoding of the word "China" is (all three bytes): Medium -- e4, b8, ad country -- e5, 9b, bd therefore, the encoding of the string inserted by session 1 in the database is incorrect, and session 2 is correct. This is why you must set NLS_LANG as the character set of the client operating system. However, based on the experiment above, we can know that the database is properly stored, which does not mean that the client can be properly displayed; similarly, the real-time database is not properly stored, and sometimes the client can be properly displayed, why? Don't worry, please let me know: Scenario 1: session 1 insertion, session 1 query, storage errors in the database, but the display is correct. Insert Process: the encoding of the Chinese characters in the ZHS16GBK Character Set of the client operating system is "d6, d0, b9, fa", because the NLS_LANG and database character set are the same, the database directly saves the character encoding passed by the client to the database without any conversion. Therefore, the encoding stored in the database is also "d6, d0, b9, fa". The read process is as follows: the encoding read from the database is "d6, d0, b9, fa". Because NLS_LANG is the same as the database character set, the client directly displays the character encoding transmitted from the database without any conversion, encode "d6, d0, b9, fa" the Chinese character corresponding to the character set ZHS16GBK in the client operating system is "China ". From the above analysis, we can see that although the read is correct, it is because the negative is positive. In fact, the storage in the database is incorrect. Therefore, we must be very careful about this situation and avoid it in the generated database. In fact, as long as you perform the length operation on it, you can easily uncover its mask:

[sql] Session_1>select length(var) from test where id=1;    LENGTH(VAR)  -----------            3

The length is 3! The actual length is only 2, which causes a lot of trouble. Scenario 2: session 1 insertion, session 2 query, storage errors in the database, and display errors. The insert process is the same as that in scenario 1. Read process: the database reads data encoded as "d6, d0, b9, fa". Because NLS_LANG is different from the database character set, the client converts the character encoding transmitted from the database, the database character set AL32UTF8 is encoded as "d6, d0, b9, fa" and cannot be found in the ZHS16GBK Character Set of the client operating system?. Scenario 3: session 2 insertion, session 1 query, Correct storage in the database, but an error is displayed. Insert Process: the encoding of the Chinese characters in the ZHS16GBK Character Set of the client operating system is "d6, d0, b9, fa". Because the NLS_LANG character set is different from the database character set, oracle will perform character encoding conversion, that is, to encode "d6, d0, b9" of "China" in the character set ZHS16GBK, fa "converted to Character Set" AL32UTF8 "in" China "encoding" e4, b8, ad, e5, 9b, bd ". Read process: the database reads the encoding "e4, b8, ad, e5, 9b, bd". Because NLS_LANG is the same as the database character set, the client directly displays the character encoding transmitted from the database without any conversion, encoding "e4, b8, ad, e5, 9b, bd "the character set ZHS16GBK corresponding to the operating system character set on the client is" Juan "(originally two characters, but now it is three characters, because the Chinese character of ZHS16GBK is encoded in two bytes ). Scenario 4: session 2 insertion, session 2 query, Correct storage in the database, and displayed correctly. The Insert Process is similar to scenario 3. Read process: the database reads the encoding "e4, b8, ad, e5, 9b, bd". Because NLS_LANG is different from the database character set, the client converts the character encoding transmitted from the database. The database character set AL32UTF8 contains "Chinese" two-character encoding "e4, b8, ad, e5, 9b, bd "converts it to the" China "encoding" d6, d0, b9, fa "in the ZHS16GBK Character Set of the client operating system and displays it normally. Although this case has been converted twice, it is indeed the most correct and recommended method. Conclusion: NLS_LANG is only related to the character set of the client operating system. If the character set of the client operating system cannot be correctly converted from the database character set, the character set of the client terminal should be changed first, instead of simply setting NLS_LANG to be the same as the database character set.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

[Oracle] Understand Oracle Character Set

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support