Oracle Character Sets and Encoding
Morven.Huang: C#, Oracle, etc. — Oracle Handbook Series 10: Character Sets, Encoding, and Oracle
Part 1: Character Set and Encoding Basics
Character set:
A character set is simply a collection of characters gathered together for some purpose and given a name.
Encoding:
Once the collection exists, computers — which understand only numbers — need a way to store the characters, so each character in the set is assigned a number, its "ID". That is encoding. From then on, characters can be stored in a computer by storing their encodings.
In the long (pseudo-)history of character sets and encodings, a number of names have become thoroughly familiar to programmers. Some have faded into the clouds; others still live in our code.
ASCII:
- ASCII character set: 128 characters — uppercase and lowercase English letters, Arabic numerals, punctuation marks, and some invisible control characters.
- ASCII encoding: represents a character with 7 bits. The range is [0-127] (hex [00-7F]); [0-31] (hex [00-1F]) and 127 (hex 7F) are control characters, and the rest are visible characters.
GB2312:
- GB2312 character set: the ASCII character set + about 7,000 Chinese characters.
- GB2312 encoding: compatible with ASCII encoding, distinguished by byte value. A byte <= 127 is interpreted as ASCII; a byte > 127 must be combined with the following byte to represent one character. The theoretical encoding space for Chinese characters is 128 × 256, i.e. more than 30,000 code points.
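The byte-value rule above can be seen directly with Python's gb2312 codec (a quick sketch; the codec name is Python's, not part of the GB standard itself):

```python
# GB2312 keeps ASCII single-byte; a Chinese character becomes two bytes,
# both greater than 0x7F, which is how a decoder tells them apart.
ascii_bytes = "A".encode("gb2312")
zhong_bytes = "中".encode("gb2312")

print(ascii_bytes)                         # b'A' — one byte, unchanged
print(zhong_bytes.hex())                   # d6d0 — two bytes, both > 0x7F
print(all(b > 0x7F for b in zhong_bytes))  # True
```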
GBK:
- GBK character set: the GB2312 character set + about 20,000 Chinese characters.
- GBK encoding: compatible with GB2312 encoding; it uses the idle encoding space that GB2312 left unused.
GB18030:
- GB18030 character set: the GBK character set + more Chinese characters + characters of several ethnic minorities. It is currently the latest Chinese national character set.
- GB18030 encoding: compatible with GBK encoding. It continues to use GBK's idle encoding space, and uses four bytes for characters beyond that space.
BIG5:
- BIG5 character set: the ASCII character set + about 13,000 Chinese characters (Traditional Chinese).
- BIG5 encoding: compatible with ASCII encoding; the encoding scheme is similar to GB2312's.
UNICODE (the term is broad and used loosely in daily work; it can mean any of the following depending on context):
- UNICODE standard: a set of standards, put forward by certain organizations, regulating how human text is encoded and displayed.
- UNICODE character set: the latest version contains more than 100,000 characters from the world's languages.
- UNICODE encoding: narrowly, UCS-2 or UTF-16; broadly, any of the four encodings of the UNICODE standard listed below.
1. UTF-32: a fixed 4 bytes per character, which makes for poor space efficiency.
2. UTF-16: 2 bytes for the 60,000-plus most common characters, 4 bytes for the rest (the "supplementary characters").
3. UCS-2: the implementation of early UNICODE versions. Its only difference from UTF-16 is that it lacks the supplementary characters, so every character is 2 bytes. This encoding is now obsolete.
4. UTF-8: compatible with ASCII encoding. Latin and Greek characters take 2 bytes; other common characters, including Chinese, take 3 bytes; the remaining few take 4 bytes.
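The per-character byte counts of these encodings can be verified with Python's codecs (the `-be` variants are used so no byte-order mark is prepended):

```python
# Bytes per character under UTF-8 / UTF-16 / UTF-32 for four samples:
# ASCII, accented Latin, Chinese, and a supplementary character.
for ch in ("A", "é", "中", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4
          len(ch.encode("utf-16-be")),  # 2, 2, 2, 4 (surrogate pair)
          len(ch.encode("utf-32-be")))  # always 4
```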
ISO8859-1 (Oracle users may have met this as WE8ISO8859P1. That's right, that's the one.)
- ISO8859-1 character set: the ASCII character set + a number of Western European characters, such as the accented letters é and ü.
- ISO8859-1 encoding: represents a character with 8 bits, dropping the control characters ([0-31] and 127) of the original ASCII encoding.
Code page (you can regard "code page" as a synonym for "encoding"; the odd name is historical.)
- ANSI code pages: you have surely met "ANSI" in the Save As dialog of a text editor. ANSI code pages are actually a family of encodings; one of them is activated, based on the operating system's regional settings, as the default ANSI encoding. For example, the ANSI code page on my company computer (an English system) is 1252, while my Chinese system at home uses 936 — which is why the machine at home can save a text file containing Chinese as "ANSI". The current ANSI code page is recorded in the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP; in C# you can check Encoding.Default.
- OEM code pages: OEM code pages are used by console applications (such as SQLPLUS). Outside CJK (Chinese/Japanese/Korean) environments, Windows uses different ANSI and OEM code pages; for example my company's English system pairs ANSI 1252 with OEM 437. Use the CHCP command to view the current OEM code page; in C#, check Console.OutputEncoding.
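The ANSI/OEM split is easy to demonstrate: the same byte decodes to different characters under cp1252 (an ANSI code page) and cp437 (an OEM code page). A sketch using Python's codec names:

```python
# One byte, two meanings: 0xE6 is 'æ' on a cp1252 (ANSI) system but 'µ'
# in a cp437 (OEM) console — the root of many console display surprises.
b = b"\xe6"
print(b.decode("cp1252"))   # æ
print(b.decode("cp437"))    # µ
```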
Code page 1252:
- Cp1252 character set: the ASCII character set + a number of Western European characters + a few special characters such as € and ‰.
- Cp1252 encoding: represents a character with 8 bits; the range is [0-255] (hex [00-FF]). The [0-127] part is identical to ASCII; most of the additions are Western European characters (accented letters such as é and ü), plus special symbols such as € and ‰.
PS1: code page settings of two real PCs:
PC 1: Windows XP, ANSI code page = 1252, OEM code page = 437
PC 2: Chinese Windows 7, ANSI code page = 936, OEM code page = 936
PS2: The cp1252 and cp437 encoding tables are available for download. Early console applications often had to draw rough tables and boxes, which is why cp437 contains so many horizontal- and vertical-line symbols.
PS3: Comparing the encoding ranges actually used: CP1252 > ISO8859-1 > ASCII. ASCII uses [0-127]; CP1252 uses [0-255]; ISO8859-1 is CP1252 minus the invisible control characters [0-31] and 127, and minus the special symbols in [128-159] (hex [80-9F]).
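Both PS2 and PS3 can be checked in a few lines of Python (the codec names `cp1252`, `cp437`, and `latin-1` for ISO8859-1 are Python's):

```python
# PS3: the [0x80-0x9F] range is printable in cp1252 but consists of C1
# control characters in ISO8859-1 (latin-1).
assert b"\x80".decode("cp1252") == "€"
assert b"\x89".decode("cp1252") == "‰"
assert b"\x80".decode("latin-1") == "\x80"   # a control character, not text

# PS2: cp437's line-drawing symbols, used by early console programs.
print(b"\xc4\xb3\xd9".decode("cp437"))       # ─│┘
```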
Part 2: Encoding and Character Sets in Oracle
1. Why two character sets?
Oracle has two character sets:
1) the database character set
2) the national character set
Why two? If I know I only need English, I set the database character set to US7ASCII; if I know I only need Western European characters, I set it to WE8MSWIN1252 or WE8ISO8859P1, or simply to AL32UTF8. Notice that I only ever set the "database character set" — so what is the "national character set" for?
In fact, many existing databases cannot support the UNICODE character set, thanks to history and the occasional unavoidable "short-sightedness" of their creators. Suppose you need to store Chinese in a database whose character set is US7ASCII; this is where the combination of the national character set and NVARCHAR2 can save your life. Columns of type NVARCHAR2 (and NCHAR, NCLOB) use the national character set, regardless of what the database character set is set to. Since 9i, the national character set can only be AL16UTF16 or UTF8 — implementations of the UNICODE UTF-16 and UTF-8 encodings — both of which can represent almost all of the world's text.
Of course, if the database character set is itself a UNICODE character set, there is no need for the NVARCHAR2, NCHAR, and NCLOB types at all.
2. The secrets hidden in character set names
Oracle names its character sets according to certain rules. For example:
AL32UTF8
[AL] supports All Languages.
[32] each character occupies at most 32 bits (4 bytes).
[UTF8] the encoding is UTF-8.
WE8MSWIN1252
[WE] supports Western Europe.
[8] each character occupies 8 bits (a single byte).
[MSWIN1252] the encoding is CP1252.
US7ASCII
[US] stands for United States.
[7] each character occupies 7 bits.
[ASCII] the encoding is ASCII.
Other names, such as ZHS16GBK, ZHT16BIG5, and US8PC437 (encoded as OEM cp437), can be read the same way.
3. Examples matter
3.1. Prepare two databases
God said, let there be examples; and there were two databases of the same version, A and B:
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
-- Database A:
PARAMETER                VALUE
------------------------ ---------------
NLS_CHARACTERSET         WE8MSWIN1252
NLS_NCHAR_CHARACTERSET   AL16UTF16
-- Database B:
PARAMETER                VALUE
------------------------ ---------------
NLS_CHARACTERSET         AL32UTF8
NLS_NCHAR_CHARACTERSET   AL16UTF16
-- Create the same table in A and B:
CREATE TABLE charset_test
(id  NUMBER(4) PRIMARY KEY,
 vc  VARCHAR2(20),
 nvc NVARCHAR2(20));
3.2. Tools matter
Before testing, to avoid confusion caused by the tools themselves, here is how several client tools support UNICODE:
- SQLPLUS: does not support the UNICODE character set. Whether it supports Chinese depends on the current OEM code page: with cp437 you can neither enter nor display Chinese; with cp936 both work.
- PL/SQL Developer: in version 7.0, the query result window supports UNICODE but the editing window (where you type SQL) does not; version 8.0 supports UNICODE fully.
- Oracle SQL Developer: both the query result window and the editing window support UNICODE.
3.3. Garbled characters
Here Oracle SQL Developer is used to insert and query a Chinese character in A and B:
INSERT INTO charset_test VALUES (1, '中', '中');
COMMIT;
-- Database A
SELECT * FROM charset_test;
1  ¿  ¿
-- Database B
SELECT * FROM charset_test;
1  中  中
Set the VARCHAR2 column aside for the moment and look at the NVARCHAR2 column. Why does it not display properly in database A? There are only a few possibilities:
- The client operating system cannot display Chinese.
- The Oracle client tool (Oracle SQL Developer) cannot display Chinese.
- The Oracle client is configured incorrectly (e.g. NLS_LANG).
- The data stored in the database is already wrong.
First, does the client operating system's Chinese support affect the applications running on it? There are two cases: some applications depend on the operating system's language support, while others ship their own language packs and fonts (some Adobe products do this, and a .NET program can embed font files at compile time) and so do not depend on the operating system.
I guessed Oracle SQL Developer belongs to the first group, and checked that the operating system supports East Asian languages (Control Panel - Regional and Language Options - Languages tab - "Install files for East Asian languages"; if the checkbox is selected, the East Asian language pack is installed).
Second, both the query result window and the editing window support the UNICODE character set.
Third, since Oracle SQL Developer does not depend on the Oracle client's OCI, the NLS_LANG setting in the client registry has no effect on it.
Fourth, use the DUMP() function to see what is actually stored in the NVARCHAR2 column.
DUMP() syntax: DUMP(expr[, return_fmt[, start_position[, length]]])
The return_fmt parameter: 8 gives octal output; 16 gives hexadecimal; any other number from 0 to 16 gives decimal. With a number above 16, printable ASCII characters are printed directly, control characters are printed as ^x, and everything else is shown in hexadecimal. Adding 1000 to the format appends the character set information to the output.
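A rough Python mimic of DUMP(col, 1016) makes the output format concrete (the helper name and its one-to-one mapping to Oracle's server-side DUMP are illustrative assumptions, not Oracle code):

```python
def dump_1016(s: str, charset: str) -> str:
    """Mimic Oracle's DUMP(col, 1016): length, character set, hex bytes."""
    data = s.encode(charset)
    hex_bytes = ",".join(format(b, "x") for b in data)
    return f"Typ=1 Len={len(data)} CharacterSet={charset}: {hex_bytes}"

# AL16UTF16 is big-endian UTF-16, hence Python's 'utf-16-be'.
print(dump_1016("中", "utf-16-be"))  # Typ=1 Len=2 CharacterSet=utf-16-be: 4e,2d
```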
Here we use:
SELECT DUMP(nvc, 1016) FROM charset_test;
-- Database A
Typ=1 Len=2 CharacterSet=AL16UTF16: 0,bf
-- Database B
Typ=1 Len=2 CharacterSet=AL16UTF16: 4e,2d
We know the UTF-16 encoding of the character '中' is 4E2D, so the data stored in database A is plainly already wrong: 00BF is actually the inverted question mark '¿'. The raw data in the database is incorrect, so there is no hope for the client display.
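Both stored values can be double-checked in Python, since AL16UTF16 is big-endian UTF-16:

```python
# Database B stored the right bytes for '中'; database A stored 00 BF,
# which is the inverted question mark '¿'.
assert "中".encode("utf-16-be").hex() == "4e2d"
assert "¿".encode("utf-16-be").hex() == "00bf"
print("bytes verified")
```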
3.4. Spot the difference
So why do the two databases behave differently? Suspicion quickly falls on the database character set, since that is the only difference between A and B: one is WE8MSWIN1252, the other AL32UTF8. Testing leads to this conclusion:
Oracle SQL Developer ignores NLS_LANG; the client encodes the string directly in the database character set and sends it to the server. Because database A's character set does not support Chinese, the character is unfortunately replaced with the default BF and stored that way — wrong forever after. In database B, by contrast, the Chinese character survives the trip to the server, where it is converted to the national character set encoding and stored.
3.5. Making the NVARCHAR2 column work
It would seem the NVARCHAR2 column in database A can never work. Not so: with some settings, Oracle SQL Developer can insert and query NVARCHAR2 correctly.
Find {ORACLE_HOME}\sqldeveloper\bin\sqldeveloper.conf (the path depends on your Oracle SQL Developer installation) and add one configuration line:
AddVMOption -Doracle.jdbc.convertNcharLiterals=true
At the same time, add the "N" prefix to the Chinese string literal:
INSERT INTO charset_test VALUES (2, '中', N'中');
-- The Chinese character in the NVARCHAR2 column is no longer garbled.
SELECT * FROM charset_test WHERE id = 2;
2  ¿  中
The configuration works like this: before the INSERT statement is sent from the client to the server, Oracle SQL Developer (in fact, JDBC) scans the statement; when it finds the "N" prefix, it encodes that part of the string as UTF-16 and rewrites it as a hexadecimal literal. This is equivalent to executing:
INSERT INTO charset_test VALUES (2, '中', UNISTR('\4e2d'));
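The rewrite JDBC performs can be sketched in Python: turn each UTF-16 code unit of the literal into a \xxxx escape and wrap it in UNISTR() (the helper name is made up for illustration; this is not the real JDBC code):

```python
def to_unistr(s: str) -> str:
    # Build an Oracle UNISTR literal from a string: one \xxxx escape per
    # UTF-16 (big-endian) code unit. BMP characters only in this sketch.
    units = s.encode("utf-16-be")
    escapes = "".join(
        "\\%02x%02x" % (units[i], units[i + 1]) for i in range(0, len(units), 2)
    )
    return "UNISTR('%s')" % escapes

print(to_unistr("中"))   # UNISTR('\4e2d')
```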
In C#, no configuration is needed for NVARCHAR2 to work. Simply use bind parameters for the INSERT and choose the correct parameter type:
cmd.CommandText = "insert into charset_test values (3, :vc, :nvc)";
OracleParameter p1 = new OracleParameter("vc", OracleDbType.Varchar2);
OracleParameter p2 = new OracleParameter("nvc", OracleDbType.NVarchar2);
p1.Value = "中";
p2.Value = "中";
cmd.Parameters.Add(p1);
cmd.Parameters.Add(p2);
cmd.ExecuteNonQuery();
4. Client NLS_LANG settings and encoding conversion
As mentioned above, Oracle SQL Developer ignores the client's NLS_LANG. What about other tools? (We focus on character sets and encoding here, leaving aside NLS_LANG's effect on date formats, sorting, and number formats.)
- SQLPLUS: inserts and queries depend on the client's NLS_LANG, which should normally match the current OEM code page, e.g. US8PC437.
- PL/SQL Developer: inserts and queries depend on the client's NLS_LANG, which should normally match the database character set.
With SQLPLUS, Oracle's conversion process can be seen clearly:
1) When the Oracle client submits a SQL statement to the server, it converts the string bytes received from the application according to NLS_LANG and the database character set. If the NLS_LANG character set equals the database character set, no conversion is done; otherwise the string is converted to the database character set before transmission. On arrival, the server stores ordinary CHAR or VARCHAR2 values as received; NCHAR or NVARCHAR2 values are first converted to the national character set, then stored.
2) When querying, the server returns the stored data as is. The client compares the character set information in the returned metadata with its NLS_LANG and NLS_NCHAR settings (if NLS_NCHAR is unset, it defaults to the NLS_LANG character set). If they match, no conversion is done; otherwise the data is converted. National character set data is converted according to NLS_NCHAR.
Here are a few test examples with SQLPLUS: the same statement is executed against database A and database B, and a network packet capture shows exactly what the Oracle client sends to the server.
Example 1. Client NLS_LANG: WE8MSWIN1252
SQL statement: insert into charset_test values (1, 'æ', NULL);
Packet capture (database A, database character set WE8MSWIN1252): 91
Explanation: the application (SQLPLUS) uses Codepage437, in which 'æ' is encoded as 91. When 91 reaches the Oracle client, the client misjudges from NLS_LANG that the byte is Codepage1252. Since NLS_LANG matches the database character set, no conversion is done: 91 is sent to the server and stored as is. Given that the database character set is Codepage1252, 91 is clearly wrong data ('æ' in Codepage1252 is E6, not 91).
This error produces an interesting phenomenon: a SQLPLUS query on the same client still shows the correct 'æ'. The SELECT returns 91 unconverted, the Oracle client passes it straight to the application, and the application happens to interpret 91 correctly under its own code page. Query from a different client machine, or a different client tool, and you may well see something else.
Packet capture (database B, database character set AL32UTF8): E2 80 98
Explanation: again 'æ' leaves SQLPLUS as the Codepage437 byte 91, and the Oracle client misjudges it as Codepage1252, in which 91 is the left single quotation mark [']. Converting from Codepage1252 to the database character set UTF8 yields E2 80 98, which is ['] in UTF8.
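Example 1's byte journey can be replayed in Python (the codec names stand in for the Oracle character sets: cp437 for the console, cp1252 for WE8MSWIN1252, utf-8 for AL32UTF8):

```python
# SQLPLUS emits 'æ' as the cp437 byte 0x91; the Oracle client, trusting
# NLS_LANG, reads that byte as cp1252 — where 0x91 is ''' (U+2018).
raw = "æ".encode("cp437")
assert raw == b"\x91"
misread = raw.decode("cp1252")
assert misread == "\u2018"

# Database A (cp1252): no conversion, 0x91 stored as is — wrong data.
# Database B (UTF-8): the misread ''' is converted, giving E2 80 98.
assert misread.encode("utf-8") == b"\xe2\x80\x98"
print("example 1 reproduced")
```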
Example 2. Client NLS_LANG: US7ASCII
SQL statement: insert into charset_test values (1, 'æ', NULL);
Packet capture (database A): BF
Explanation: 'æ' leaves SQLPLUS as the Codepage437 byte 91. The Oracle client misjudges from NLS_LANG that the byte is ASCII, but 91 is invalid in ASCII; the conversion from ASCII to the database character set Codepage1252 therefore yields BF, the default replacement Codepage1252 uses for invalid input.
Packet capture (database B): EF BF BD
Explanation: the same invalid-in-ASCII byte 91 is converted to the database character set UTF8 and becomes EF BF BD, the default replacement UTF8 uses for invalid input.
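Example 2 is the replacement-character case; Python's `replace` error handler shows the same mechanism (Oracle's choice of BF and EF BF BD as defaults is its own; Python substitutes U+FFFD):

```python
# 0x91 is invalid in 7-bit ASCII, so conversion substitutes a replacement:
# U+FFFD, whose UTF-8 form is exactly database B's EF BF BD.
decoded = b"\x91".decode("ascii", errors="replace")
assert decoded == "\ufffd"
assert decoded.encode("utf-8") == b"\xef\xbf\xbd"

# Database A's replacement, the inverted question mark, is BF in cp1252.
assert "¿".encode("cp1252") == b"\xbf"
print("example 2 reproduced")
```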
Example 3. Client NLS_LANG: US8PC437
SQL statement: insert into charset_test values (1, 'æ', NULL);
Packet capture (database A): E6
Explanation: E6 is the correct Codepage1252 encoding of 'æ'. The application (SQLPLUS) uses Codepage437, and NLS_LANG also tells the Oracle client Codepage437, so the encoding conversion is performed correctly.
Packet capture (database B): C3 A6
Explanation: C3 A6 is 'æ' in UTF8. Again the application uses Codepage437 and NLS_LANG agrees, so the conversion is correct.
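Example 3, the correct case, checks out the same way:

```python
# With NLS_LANG=US8PC437 the client knows the byte really is cp437 'æ',
# so conversion to either database character set comes out right.
ae = "æ"
assert ae.encode("cp437") == b"\x91"      # what SQLPLUS sends
assert ae.encode("cp1252") == b"\xe6"     # stored in database A
assert ae.encode("utf-8") == b"\xc3\xa6"  # stored in database B
print("example 3 reproduced")
```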
I used to think every tool was as well-behaved as SQLPLUS, where everything looks reasonable and explicable. Then other tools arrived and the world turned messy. Oracle SQL Developer ignores the client's NLS_LANG, which at least keeps things simple; PL/SQL Developer is another story. I spent four days trying to figure out its encoding conversion process, and in the end only proved it unreasonable. The one safe conclusion seems to be: if you use PL/SQL Developer, set NLS_LANG to match the database character set. Beyond that, you are on your own.
5. The effect of NLS_LANG on ODP.NET
The only method affected by the client's NLS_LANG is OracleString.GetNonUnicodeBytes(), which depends on the character set configured on the client. For example, suppose we change NLS_LANG from AMERICAN_AMERICA.WE8MSWIN1252 to AMERICAN_AMERICA.US7ASCII.
Specifically, 230 (hex E6) is the encoding of the character 'æ', while 63 (hex 3F) is the question mark in ASCII (ASCII has no 'æ', so a question mark is substituted).
6. Other notes on VARCHAR2 and NVARCHAR2
NVARCHAR2(N): here N is a number of characters, not bytes. The maximum length, however, is counted in bytes: 4000 bytes.
VARCHAR2(N): here N may be characters or bytes. You can specify which in the declaration, e.g. VARCHAR2(10 BYTE) or VARCHAR2(10 CHAR); if you don't, the NLS_LENGTH_SEMANTICS parameter decides. Note that successfully declaring VARCHAR2(4000 CHAR) does not guarantee you can really store 4000 characters: past 4000 bytes, Oracle still raises an error.
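The CHAR-vs-BYTE pitfall is easy to quantify, assuming an AL32UTF8 database where a Chinese character takes 3 bytes:

```python
# VARCHAR2(4000 CHAR) still caps out at 4000 *bytes*: 4000 Chinese
# characters in UTF-8 need 12000 bytes and would be rejected by Oracle.
s = "中" * 4000
assert len(s) == 4000                    # 4000 characters...
assert len(s.encode("utf-8")) == 12000   # ...but 12000 bytes > 4000-byte limit
print("4000 chars =", len(s.encode("utf-8")), "bytes in UTF-8")
```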