[Oracle] Thorough understanding of Oracle Character Set _oracle

Source: Internet
Author: User

The

basic Concept
Character set (Character set):
is a collection of all abstract characters supported by a system. Characters are the general name of various words and symbols, including the national characters, punctuation marks, graphic symbols, numbers and so on. Common character sets are ASCII,ZHS16GB231280,ZHS16GBK and so on.

Character encoding (Character Encoding):
is a set of rules that can be used to pair a set of characters in a natural language (such as an alphabet or syllable table) with one of its collections (such as computer coding). The corresponding relationship between the symbol set and the digital system is established. As opposed to the character set, common character encodings are: ascii,zhs16gbk,zht16big5,zhs32gb18030, and so on. The definition of a
character set is actually a collection of characters, whereas character encoding refers to how these characters are converted into bytes for saving, reading, and transferring.

Universal Code (UNICODE): contains almost all the available characters for human beings, and is continuously increasing each year, and can be viewed as a common character set. It will unify all the characters in the world, unified coding, no longer appear character incompatibility and character conversion problems.
It has the following three encodings:
1.utf-32 Encoding:
fixed use 4 bytes to represent a character, there is a problem of space utilization efficiency.
2.utf-16 Encoding: encodes two bytes for more than 60,000 characters that are commonly used, and uses 4 bytes for the remainder.
3.utf-8 Encoding: compatible ASCII encoding; Latin, Greek, etc. use two bytes; other commonly-used characters, including Chinese characters, use three bytes; the remaining few characters use four bytes.

Oracle Character set rationale
before you understand the fundamentals of Oracle character sets, you must first distinguish between the following three concepts:
1. Oracle database server Character set: That is, what character encoding Oracle uses to store characters, you can detect the setting of the database character set by using the following statement.

Copy Code code as follows:

Sql> SELECT * from v$nls_parameters where parameter= ' nls_characterset ';
PARAMETER VALUE
------------------------------ -----------------
Nls_characterset Al32utf8

2. Client operating system Character set: that is, which character encoding the client operating system stores characters.
If it is windows, you can use the CHCP command to get the code page:
Copy Code code as follows:

C:\users\xianzhu>chcp
Active Code page:936

According to the code page, to Microsoft's Official document National Language Support (NLS) API Reference found its corresponding character set.
If it is Linux, the character set is set in/etc/sysconfig/i18n:
Copy Code code as follows:

Lang= "ZH_CN. GB2312 "(Specifies the character set of the current operating system)
Supported= "ZH_CN. GB2312 "(Specifies the character set supported by the current operating system)
Sysfont= "Lat0-sun16" (Specifies the font for the current operating system)

3. Client Nls_lang parameter: This parameter is used to instruct Oracle on the character set of the client operating system.
With the above 3 basic concepts, let me elaborate on the basic principles of Oracle character set translation:
1. Set the client's Nls_lang as the character set of the client operating system
2. If the database character set equals Nls_lang, the database and the client transmit characters without any conversion
3. If they are unequal, they need to be converted between different character sets, and only the client operating system character set is based on a subset of the database character set to be converted correctly, otherwise garbled characters will appear.
analysis of several common situations
Let's look at an example, and then look at the nature of the phenomenon, we will analyze this example.
The example is as follows:
Copy Code code as follows:

1. Database character set is Unicode (UTF-8 encoding)
Our database version is 10.2.0.4.0, and the database character set is:
Sql> SELECT * from v$nls_parameters where parameter= ' nls_characterset ';
PARAMETER VALUE
---------------------------------------- ------------------------------
Nls_characterset Al32utf8
2. Client operating system character set is code page 936 (character set is ZHS16GBK)
You can use CHCP to get the code page for Windows
C:\Documents and Settings\a105024\desktop>chcp
Active Code page:936
3. Create a test table
Sql> CREATE TABLE Test (ID number,var varchar2 (30));
Table created.
4. Inserting data
Here, the same operating system to start two Session,session1 Nls_lang set to the same as the database character set (that is, Al32utf8):
C:\Documents and Settings\a105024\desktop>set nls_lang=simplified Chinese_china.al32utf8
To connect to a database and insert a piece of data:
Session_1>insert into test values (1, ' China ');
1 row created.
session_1>commit;
Commit complete.
The Session2 Nls_lang is set to be the same as the client operating system (i.e. ZHS16GBK):
C:\Documents and Settings\a105024\desktop>set nls_lang=simplified CHINESE_CHINA.ZHS16GBK
To connect to a database and insert a piece of data:
Session_2>insert into test values (2, ' China ');
1 row created.
session_2>commit;
Commit complete.
5. Execute Query
To execute a query on session 1:
Session_1>select * from Test;
ID VAR
---------- ---------------------
1 China
2 Juan Ricoh
To execute a query on session 2:
Session_2>select * from Test;
ID VAR
---------- --------------------
1???
2 China

The example above looks very strange, session1 and 2 can display their own inserted strings, and can not properly display the other inserted strings. To find out, we first need to know how the database is stored in the two strings. We can use the dump function to get the character's encoding in the database:
Copy Code code as follows:

Sql> Select Id,dump (var,1016) from test;
ID DUMP (var,1016)
-- ------------------------------------------------------------
1 typ=1 len=4 Characterset=al32utf8:d6,d0,b9,fa
2 Typ=1 len=6 CHARACTERSET=AL32UTF8:E4,B8,AD,E5,9B,BD

According to Al32utf8 's code, the word "China" is encoded correctly (all 3 bytes):
Middle--e4,b8,ad
National--E5,9B,BD
So the string inserted in session 1 is encoded incorrectly in the database, session 2 is correct. This is also why it is important to set the Nls_lang as the client operating system's character set.
But according to the above experiment we can know that the database is stored correctly, does not mean that the client can be normal display, the same time, the real-time database is not properly stored, sometimes the client can also be normal display, this is why? Don't worry, please listen to me slowly:

Scene 1:session 1 Insert, Session 1 query, store errors in the database, but display correctly.
Insert procedure:
"China" in the client operating system character Set ZHS16GBK encoding is "D6,D0,B9,FA", because the Nls_lang and the database character set, the database end of the client-transmitted character encoding without any conversion directly into the database, so the database stored in the encoding is also " D6,d0,b9,fa ",
Read process:
Database-side read encoding is "D6,D0,B9,FA", because the Nls_lang and database character set is the same, the client to the database end of the character encoding does not make any conversion direct display, encoding "D6,D0,B9,FA" In the client operating system character set zhs16gbk the corresponding Chinese character is "China".

From the above analysis, although the reading is correct, but that is because negative negative positive, in fact, storage in the database is wrong, so be particularly careful in this case, in the build library to avoid. In fact, as long as the length of its operation can easily uncover its mask:

Copy Code code as follows:

Session_1>select Length (Var) from Test where id=1;
LENGTH (VAR)
-----------
3

Reached the length of incredibly 3! The actual length is only 2, which can cause a lot of trouble.

Scene 2:session 1 Insert, Session 2 query, store errors in the database, and display errors.
The insertion process, like Scene 1, is no longer described here.
Read process:
Database-side reads the encoding is "D6,D0,B9,FA", because Nls_lang and database character set different, the client to the database end of the character encoding conversion, database end character Set Al32utf8 in "D6,d0,b9,fa" The corresponding encoding cannot be found in the client operating system character set ZHS16GBK, so it has to be replaced with a.

Scene 3:session 2 Insert, Session 1 query, stored correctly in the database, but displays an error.
Insert procedure:
"China" in the client operating system character Set ZHS16GBK encoding is "D6,D0,B9,FA", because Nls_lang and the database character set is different, Oracle will do character encoding conversion, that is, the character set ZHS16GBK "China" code "D6,D0, B9,fa "Al32utf8" to "China" encoding "E4,B8,AD,E5,9B,BD" in the character set "".
Read process:
Database-side read encoding is "E4,B8,AD,E5,9B,BD", because the Nls_lang and database character set is the same, the client to the database end of the character encoding does not make any conversion direct display, encoding "E4,B8,AD,E5,9B,BD" In the client operating system character set ZHS16GBK the corresponding kanji is "Ricoh" (originally 2 characters, now 3 characters, since the ZHS16GBK characters are encoded in 2 bytes).

scene 4:session 2 Insert, Session 2 query, stored correctly in the database, and displayed correctly. The
Insert procedure is similar to Scenario 3.
Read procedure:
The encoding of the database-side reads is "E4,B8,AD,E5,9B,BD", and because the Nls_lang and the database character set are different, the client converts the character encoding on the database-side. Database-side Character set Al32utf8 "China" encoded "E4,B8,AD,E5,9B,BD" into the client operating system character set ZHS16GBK "China" code "D6,D0,B9,FA", and normal display.
This situation, although it has undergone two conversions, is indeed the most correct and recommended way.

Appendix: The corresponding relationship of Oracle character set superset and subset can be viewed: http://download.oracle.com/docs/cd/B19306_01/server.102/ b14225/applocaledata.htm

Conclusion:
Nls_lang is only relevant to the character set of the client operating system. If the character set of the client operating system and the database character set cannot be converted correctly, you should first change the character set of the client terminal, rather than simply set Nls_lang to the same as the database character set.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.