Oracle Character Set

Part I: Character set and encoding basics

Character set:

People collect certain characters together and give the collection a name as needed; that is a character set.

Encoding:

Once the collecting work above is done, in order to let computers, which only understand numbers, store characters, people assign an "ID number" to each character in the collection. That is the encoding. From then on, characters can finally be stored in a computer by storing their encodings in memory.

In the (not so) long history of character sets and encodings, a number of names have become thunderclaps in the ears of computer workers; some have since faded into passing clouds, while others are still tossing about in our code.

ASCII:

- ASCII character set: contains 128 characters in total: uppercase and lowercase English letters, Arabic numerals, punctuation, and some invisible control characters.

- ASCII encoding: uses 7 bits per character. The encoding range is [0-127] (i.e. hex [00-7F]); [0-31] (hex [00-1F]) and 127 (hex 7F) are control characters, and the rest are visible characters.

GB2312:

- GB2312 character set: the ASCII character set plus roughly 7,000 Chinese characters.

- GB2312 encoding: compatible with ASCII encoding. Each byte is examined: a value <= 127 is treated as plain ASCII, while a value > 127 is combined with the byte that follows to represent one character. Its theoretical encoding space for Chinese characters is 128 x 256, more than 30,000 code points.

GBK:

- GBK character set: the GB2312 character set plus about 20,000 more Chinese characters.

- GBK encoding: compatible with GB2312 encoding. It uses the encoding space that GB2312 left unused.

GB18030:

- GB18030 character set: the GBK character set plus additional Chinese characters and a number of ethnic-minority characters; currently the newest national character set in China.

- GB18030 encoding: compatible with GBK encoding. It continues to use the encoding space that GBK left unused; characters beyond that space are represented with 4 bytes.

BIG5:

- BIG5 character set: the ASCII character set plus about 13,000 Chinese characters (Traditional).

- BIG5 encoding: compatible with ASCII encoding. Its encoding scheme is similar to GB2312.

Unicode: (in everyday use the term "Unicode" is broad and often confusing; in different contexts it can mean any of the following.)

- Unicode standard: a set of standards put forward by certain organizations, laying down a series of rules for how human text is displayed, encoded, and so on.

- Unicode character set: the latest version of the Unicode character set contains more than 100,000 characters from the world's languages.

- Unicode encoding: (in the narrow sense, "Unicode encoding" may refer to UCS-2 or to UTF-16; in the broad sense it can refer to any of the encoding implementations of the Unicode standard, including the following four.)

1. UTF-32 encoding: always uses 4 bytes per character, so space efficiency is a problem.

2. UTF-16 encoding: uses 2 bytes for the roughly 60,000 most common characters and 4 bytes for the rest (the "supplementary characters").

3. UCS-2 encoding: an implementation of an earlier version of Unicode; the only difference from UTF-16 is that it does not cover supplementary characters, so it encodes every character with exactly 2 bytes. This encoding is now obsolete.

4. UTF-8 encoding: compatible with ASCII encoding; Latin, Greek and similar characters use 2 bytes; other common characters, including Chinese characters, use 3 bytes; the remaining few use 4 bytes.
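As a quick illustration (a minimal sketch, assuming an Oracle database whose character set is AL32UTF8 and a client that passes the characters through intact, a point Part II comes back to), the byte counts can be checked directly:

SELECT LENGTHB('A')   AS ascii_bytes,    -- 1 byte under UTF-8
       LENGTHB('中')  AS chinese_bytes,  -- 3 bytes under UTF-8
       DUMP('中', 16) AS chinese_dump    -- Typ=96 Len=3: e4,b8,ad
  FROM dual;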

ISO 8859-1: (those who use Oracle have probably seen WE8ISO8859P1; yes, this is it.)

- ISO 8859-1 character set: the ASCII character set plus a number of Western European characters, such as the letters Â and É.

- ISO 8859-1 encoding: uses 8 bits per character, and drops the control characters (i.e. [0-31] and 127) that the original ASCII encoding contains.

Code page: (you can think of "code page" as a synonym for "encoding". Why that name? A leftover of history.)

- ANSI code page: you have surely seen "ANSI" when saving a text file. The ANSI code pages are in fact a family of encodings, one of which is activated as the default ANSI encoding depending on the operating system locale. For example, the ANSI code page on a company computer (English system) may be 1252, while on a Chinese system at home it may be 936. That is why you can save a text file containing Chinese as ANSI at home, but not at the company. The currently used ANSI code page can be seen in the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP, and in C# through Encoding.Default.

- OEM code page: OEM code pages are used by console applications such as SQL*Plus. Outside CJK (Chinese-Japanese-Korean) environments, Windows uses an ANSI code page and an OEM code page that differ from each other; for example, 437 is used on the company's English system. The OEM code page currently in use can be viewed with the CHCP command, and in C# through Console.OutputEncoding.

Code page 1252:

- cp1252 character set: the ASCII character set plus a number of Western European characters plus some special symbols, such as ™ and ‰.

- cp1252 encoding: uses 8 bits per character. The encoding range is [0-255] (i.e. hex [00-FF]); the [0-127] part is identical to ASCII, and most of the additions are Western European characters (letters with diacritics such as É) and special symbols like the ones above.


PS1: code page information from two real-world PCs

PC 1: English Windows XP, ANSI code page = 1252, OEM code page = 437

PC 2: Chinese Windows 7, ANSI code page = 936, OEM code page = 936

PS2: Encoding tables for cp1252 and cp437 can be found in the references at the end (the Wikipedia pages for Windows-1252 and Code page 437). Early console applications often had to draw crude tables and boxes, which is why the cp437 special characters include so many different line-drawing symbols.

PS3: Comparing cp1252, ISO 8859-1 and ASCII by the encoding range actually used: cp1252 > ISO 8859-1 > ASCII. ASCII uses [0-127] and cp1252 uses [0-255]; ISO 8859-1 is cp1252 minus the invisible control characters in [0-31] and 127, and also minus the special symbols in [128-159] (i.e. hex [80-9F]).

Part II: Character sets and encodings in Oracle

1. Why are there two character sets?

There are two character sets in Oracle:

1) the database character set

2) the national character set

Why two? If I know only English is needed, I set the database character set to US7ASCII; if I know only Western European characters are needed, I set it to WE8MSWIN1252 or WE8ISO8859P1, or simply use AL32UTF8. In every case I only have to set the "database character set", so what is the "national character set" for?

In fact, because of historical baggage and the "short-sightedness" that no database designer can avoid, many existing databases do not use a Unicode character set. To store Chinese in an existing database whose database character set is US7ASCII, for example, the combination of "national character set" + NVARCHAR2 will save your life. A column of type NVARCHAR2 (and likewise NCHAR and NCLOB) always uses the national character set, regardless of what the database character set is. Since 9i the national character set can only be AL16UTF16 or UTF8; UTF-16 and UTF-8 are encoding implementations of the Unicode standard and can represent almost all the text in the world.

Of course, if the database character set itself is a Unicode character set, there is no need to use the NVARCHAR2, NCHAR and NCLOB types at all.
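Here is a minimal sketch of that life-saving combination (the table name is hypothetical; the database character set is assumed to be US7ASCII and the national character set AL16UTF16):

CREATE TABLE emp_names
  (id   NUMBER(4) PRIMARY KEY,
   name NVARCHAR2(20));   -- stored in the national character set, not US7ASCII

-- UNISTR() builds the value from UTF-16 code points, so the Chinese character
-- arrives intact even though US7ASCII itself cannot represent it.
INSERT INTO emp_names VALUES (1, UNISTR('\4e2d'));   -- the character 中 (U+4E2D)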

2. The mystery of character set names

Oracle's character set names actually follow some discoverable rules. For example:

AL32UTF8

"AL" means it supports all languages (All Languages).

"32" means each character uses at most 32 bits (4 bytes).

"UTF8" means the encoding is UTF-8.

WE8MSWIN1252

"WE" means it supports Western European languages (Western Europe).

"8" means each character occupies 8 bits (a single byte).

"MSWIN1252" means the encoding is cp1252.

US7ASCII

"US" means United States.

"7" means each character needs 7 bits.

"ASCII" means the encoding is ASCII.

Others, such as ZHS16GBK, ZHT16BIG5 and US8PC437 (encoded with OEM cp437), follow the same pattern.
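To see every character set name that your Oracle release recognizes, you can query the v$nls_valid_values view (a minimal sketch):

SELECT value
  FROM v$nls_valid_values
 WHERE parameter = 'CHARACTERSET'
 ORDER BY value;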

3. Examples matter

3.1. Preparing two databases

God said, let there be an example; so here are two databases of the same version, A and B:

SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- Database A:
PARAMETER                      VALUE
------------------------------ -------------------
NLS_CHARACTERSET               WE8MSWIN1252
NLS_NCHAR_CHARACTERSET         AL16UTF16

-- Database B:
PARAMETER                      VALUE
------------------------------ -----------------
NLS_CHARACTERSET               AL32UTF8
NLS_NCHAR_CHARACTERSET         AL16UTF16

-- Create the same table in both A and B.
CREATE TABLE charset_test
  (id  NUMBER(4) PRIMARY KEY,
   vc  VARCHAR2(20),
   nvc NVARCHAR2(20));


3.2. Tools matter

Before testing, and to avoid confusion caused by the tools themselves, here is how several client tools support Unicode:

- SQL*Plus: does not support the Unicode character set. Whether it supports Chinese depends on the current OEM code page: with cp437, Chinese can be neither entered nor displayed; with cp936, Chinese input and output both work.

- PL/SQL Developer: in version 7.0 the query results window supports Unicode, but the edit window (the window where SQL statements are typed) does not. Version 8.0 fully supports Unicode.

- Oracle SQL Developer: both the query results window and the edit window support Unicode.


3.3. Here come the garbled characters

Using Oracle SQL Developer, insert and then query Chinese in A and B respectively:

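A minimal sketch of the kind of insert and query involved (the character 中 is assumed, matching the discussion below):

-- Run against both A and B from Oracle SQL Developer.
INSERT INTO charset_test VALUES (1, '中', '中');
COMMIT;

SELECT id, vc, nvc FROM charset_test;
-- Against database B the Chinese comes back correctly;
-- against database A the NVC column shows an inverted question mark.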

Let us set the VARCHAR2 column aside for now and focus on the NVARCHAR2 column: why does database A not display it properly? There are several possibilities:

- The client operating system does not support displaying Chinese.

- The Oracle client tool (here, Oracle SQL Developer) does not support displaying Chinese.

- A related setting on the Oracle client (such as NLS_LANG) is incorrect.

- The data stored in the database is already wrong.

First: does the client operating system's Chinese support affect the applications running on it? There should be two cases. Some applications rely on the operating system's Chinese support, while other software ships its own language packs and fonts (some Adobe products, for example); a .NET program can also choose to bundle font files at build time and then does not depend on the operating system.

I would guess Oracle SQL Developer belongs to the first kind, and I checked the operating system to confirm that it already supports East Asian languages (Control Panel - Regional and Language Options - Languages tab - Supplemental language support - Install files for East Asian languages; if the checkbox is already selected, the East Asian language pack is installed).

Second: as noted above, both the query results window and the edit window of Oracle SQL Developer support Unicode.

Third: the NLS_LANG setting in the client registry has no effect on Oracle SQL Developer, because it does not rely on the OCI of the Oracle client.

Fourth: we use the dump() function to find out what is actually stored in the NVARCHAR2 column.

The syntax of dump(): DUMP(<value> [, <format> [, <offset> [, <length>]]])

The format parameter: 8 means the bytes are shown in octal and 16 means hexadecimal; other values fall back to decimal. A value greater than 16 (e.g. 17) prints each byte as a character: visible ASCII characters are printed as themselves, control characters as "^x", and everything else in hexadecimal. Adding 1000 to format additionally includes, in the output, the character set of the value being dumped.

Here we dump the NVC column in both A and B.
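A minimal sketch of the dump() query on the NVC column; the byte values in the comments are the ones discussed below:

SELECT DUMP(nvc, 1016) AS nvc_dump
  FROM charset_test
 WHERE id = 1;

-- Database A: Typ=1 Len=2 CharacterSet=AL16UTF16: 0,bf
-- Database B: Typ=1 Len=2 CharacterSet=AL16UTF16: 4e,2d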

We know that the UTF-16 encoding of the character 中 is 4E2D, so the data stored in database A is obviously already wrong: 00BF is in fact the inverted question mark character. The data in the database was wrong to begin with, so it is no surprise the client cannot display it.


3.4. Spot the difference

So why do the two databases behave differently? Suspicion quickly fell on the database character set, because the only difference between A and B is the database character set: one is WE8MSWIN1252, the other AL32UTF8. After testing, the conclusion is:

Oracle SQL Developer ignores NLS_LANG; the client encodes strings directly in the database character set and transfers them to the server. Because database A's character set cannot represent Chinese characters, the character is unfortunately replaced with the default BF, and that is what ends up stored in the database; once stored wrong, it stays wrong. Database B is the opposite: the Chinese character "survives" the transfer, reaches the server intact, and is finally converted automatically to the encoding used by NVARCHAR2 and stored.


3.5. How to make the NVARCHAR2 column work

It might seem that the NVARCHAR2 column in database A can never work properly. Not so: for Oracle SQL Developer, a little configuration makes NVARCHAR2 inserts and queries behave normally.

Locate {oracle_home}\sqldeveloper\sqldeveloper\bin\sqldeveloper.conf (the exact path depends on your Oracle SQL Developer installation) and add one line of configuration:

AddVMOption -Doracle.jdbc.convertNcharLiterals=true

At the same time, add the "N" prefix to the Chinese string literal:

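A minimal sketch of such an insert with the N prefix added:

INSERT INTO charset_test VALUES (2, '中', N'中');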

What this configuration does: before the INSERT statement is sent from the client to the server, Oracle SQL Developer (in fact the JDBC driver) scans the statement, and when it finds the "N" prefix it encodes that string literal in UTF-16 and rewrites it in hexadecimal form. In effect this is equivalent to executing:

INSERT INTO charset_test VALUES (2, '中', UNISTR('\4e2d'));
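A quick way to confirm the result is to dump the column again (a sketch; the expected bytes follow from the UTF-16 code point 4E2D mentioned above):

SELECT DUMP(nvc, 1016) AS nvc_dump FROM charset_test WHERE id = 2;
-- Expected: Typ=1 Len=2 CharacterSet=AL16UTF16: 4e,2d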

In C#, no special configuration is needed for NVARCHAR2 to work properly; simply use bind parameters and choose the correct (national character set) parameter type when executing the INSERT.


4. The client NLS_LANG setting and encoding conversion

As mentioned earlier, Oracle SQL Developer ignores the client NLS_LANG setting. What about other tools? (Here we only look at character sets and encodings, not the effect of NLS_LANG on date formats, sort order, number formats, and so on.)

- SQL*Plus: both inserts and queries depend on the client NLS_LANG setting. Typically the client NLS_LANG should be kept consistent with the current OEM code page, e.g. US8PC437.

- PL/SQL Developer: both inserts and queries depend on the client NLS_LANG setting. Typically the client NLS_LANG should be kept consistent with the database character set.

Using SQL*Plus, the encoding conversions Oracle performs can be seen clearly:

1) When the Oracle client submits a SQL statement to the server, it converts the string handed to it by the application according to NLS_LANG and the database character set: the bytes are assumed to be in the NLS_LANG character set and are converted to the database character set before being transferred (if NLS_LANG is the same as the database character set, no conversion is performed). When the server receives the string, it stores it as-is for ordinary CHAR or VARCHAR2 columns; for NCHAR or NVARCHAR2 columns the server first converts it to the national character set and then stores it.

2) When data is queried, the server returns what is stored in the database as-is, and the client compares the character set information in the returned metadata with its own NLS_LANG and NLS_NCHAR settings (if NLS_NCHAR is not set, it defaults to the character set in NLS_LANG). If the character set in the metadata matches the client setting, no conversion is performed; data in the national character set is converted according to the NLS_NCHAR setting.

Below are a few test examples using SQL*Plus: the same statement is executed against databases A and B, and a network packet capture shows exactly what the Oracle client transfers to the server.

Example 1: client NLS_LANG = WE8MSWIN1252

SQL command: INSERT INTO charset_test VALUES (1, 'æ', NULL);

Network capture (database A, database character set WE8MSWIN1252): 91

Explanation: the encoding used by the application (i.e. SQL*Plus) is code page 437, under which 'æ' is encoded as 91. When 91 is handed to the Oracle client, the client misjudges its encoding according to NLS_LANG; and because NLS_LANG equals the database character set, the Oracle client performs no conversion at all. 91 is transferred to the server and stored as-is. Given that the database character set is code page 1252, 91 is clearly wrong data (under code page 1252 the character 'æ' is encoded as E6, not 91).

This mistake produces an interesting phenomenon: querying from the same client with SQL*Plus actually shows the correct character 'æ', because the SELECT likewise returns 91 unchanged; the Oracle client does no conversion and passes it straight to the application, which happens to interpret 91 correctly under its own encoding. A different client machine, or a different client tool, could well get a different result.

Network capture (database B, database character set AL32UTF8): E2 80 98

Explanation: the encoding used by the application (i.e. SQL*Plus) is code page 437, under which 'æ' is encoded as 91. When 91 is handed to the Oracle client, the client misjudges its encoding as code page 1252 according to NLS_LANG; in code page 1252, 91 is the left single quotation mark ('). The conversion from code page 1252 to the database character set UTF-8 therefore produces E2 80 98, which is that quotation mark in UTF-8.

Example 2: client NLS_LANG = US7ASCII

SQL command: INSERT INTO charset_test VALUES (1, 'æ', NULL);

Network capture (database A): BF

Explanation: the encoding used by the application (i.e. SQL*Plus) is code page 437, under which 'æ' is encoded as 91. When 91 is handed to the Oracle client, the client misjudges its encoding as ASCII according to NLS_LANG; 91 is not a valid ASCII code, so the conversion from ASCII to the database character set code page 1252 produces BF, the default replacement byte used when invalid input is encountered.

Network capture (database B): EF BF BD

Explanation: the encoding used by the application (i.e. SQL*Plus) is code page 437, under which 'æ' is encoded as 91. When 91 is handed to the Oracle client, the client misjudges its encoding as ASCII according to NLS_LANG; 91 is not a valid ASCII code, so the conversion from ASCII to the database character set UTF-8 produces EF BF BD, the UTF-8 replacement character (U+FFFD) used when invalid input is encountered.

Example 3: client NLS_LANG = US8PC437

SQL command: INSERT INTO charset_test VALUES (1, 'æ', NULL);

Network capture (database A): E6

Explanation: E6 is the correct code page 1252 encoding of the character 'æ'. This time the application (i.e. SQL*Plus) uses code page 437, and the encoding the Oracle client derives from NLS_LANG is also code page 437, so the conversion to the database character set is carried out correctly.

Network capture (database B): C3 A6

Explanation: C3 A6 is the correct UTF-8 encoding of the character 'æ'. This time the application (i.e. SQL*Plus) uses code page 437, and the encoding the Oracle client derives from NLS_LANG is also code page 437, so the conversion to the database character set is carried out correctly.
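These capture results can also be confirmed from inside the database with the dump() function from section 3.3 (a sketch; the sample output shows Example 1 against database A, where the bad byte is stored as-is):

SELECT id, DUMP(vc, 1016) AS stored_bytes
  FROM charset_test
 WHERE id = 1;

-- Example 1, database A: Typ=1 Len=1 CharacterSet=WE8MSWIN1252: 91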

If only SQL*Plus existed, the days would always be this beautiful: everything seems reasonable and everything can be explained. Once other tools appear, the world becomes a mess. Oracle SQL Developer, by completely ignoring the client NLS_LANG setting, actually makes things simpler; PL/SQL Developer is another story. I spent four days trying to work out its encoding conversion process, only to conclude that it does not behave reasonably. The only advice that seems safe right now is: if you want to use PL/SQL Developer, set NLS_LANG to match the database character set. Anything else, and you are on your own.


5. The effect of NLS_LANG on ODP.NET

In ODP.NET, the only thing affected by the client NLS_LANG is the OracleString.GetNonUnicodeBytes() method, which relies on the client-side character set setting. For example, when NLS_LANG is changed from AMERICAN_AMERICA.WE8MSWIN1252 to AMERICAN_AMERICA.US7ASCII, the byte returned for the character 'æ' changes from 230 (i.e. hex E6, its cp1252 encoding) to 63 (i.e. hex 3F, the ASCII question mark: since 'æ' does not exist in the ASCII character set, it is replaced with a question mark).


6. Other notes on VARCHAR2 and NVARCHAR2

For NVARCHAR2(n), n is a number of characters, not bytes. The maximum size, however, is counted in bytes: 4000 bytes.

For VARCHAR2(n), n may mean characters or bytes. You can specify it explicitly in the declaration, e.g. VARCHAR2(10 BYTE) or VARCHAR2(10 CHAR); when it is not specified explicitly, the NLS_LENGTH_SEMANTICS parameter decides. Note that being able to declare VARCHAR2(4000 CHAR) does not guarantee that you can really store 4,000 characters: if the value exceeds 4000 bytes, Oracle will still raise an error.
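A minimal sketch of the explicit BYTE and CHAR declarations mentioned above (the table name is hypothetical):

CREATE TABLE length_semantics_demo
  (b VARCHAR2(10 BYTE),   -- at most 10 bytes
   c VARCHAR2(10 CHAR));  -- at most 10 characters, still capped at 4000 bytes

-- When neither BYTE nor CHAR is written, NLS_LENGTH_SEMANTICS decides:
ALTER SESSION SET nls_length_semantics = CHAR;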

"References and references":

1. http://www.cnblogs.com/skynet/archive/2011/05/03/2035105.html

2. http://www.laoxiong.net/category/oracle/orainternal/page/2

3. http://en.wikipedia.org/wiki/Windows-1252

4. http://en.wikipedia.org/wiki/ASCII

5. http://en.wikipedia.org/wiki/Code_page_936

6. http://en.wikipedia.org/wiki/Code_page_437

7. http://en.wikipedia.org/wiki/UTF-8

8. http://en.wikipedia.org/wiki/UTF-16/UCS-2

9. http://en.wikipedia.org/wiki/UTF-32/UCS-4

10. http://en.wikipedia.org/wiki/GB_18030

11. http://en.wikipedia.org/wiki/Unicode

12. Oracle PL/SQL Programming, 5th Edition, O'Reilly, Steven Feuerstein and Bill Pribyl

13. http://www.laruence.com/2009/08/22/1059.html

14. http://en.wikipedia.org/wiki/Windows_code_page
