1. Character set and encoding basics
Character set:
A character set is a collection of characters gathered together for some purpose and given a name.
For example:
ASCII:
ASCII character set: contains uppercase and lowercase English letters, Arabic numerals, punctuation marks, and some invisible control characters.
ASCII encoding: 7 bits are used to represent a single character. The encoding range is [0-127] (hex [00-7F]), where [0-31] (hex [00-1F]) and 127 (hex 7F) are control characters and the rest are visible characters.
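The split between control codes and visible codes can be checked with a few lines of Python. This is only a sketch: the helper name `classify_ascii` is made up for illustration, and, like the description above, it counts the space (32) as a visible character.

```python
def classify_ascii(code: int) -> str:
    """Classify a 7-bit ASCII code as 'control' or 'visible'."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    # 0x00-0x1F and 0x7F (DEL) are the control characters.
    return "control" if code <= 0x1F or code == 0x7F else "visible"


control = [c for c in range(128) if classify_ascii(c) == "control"]
visible = [c for c in range(128) if classify_ascii(c) == "visible"]
print(len(control), len(visible))  # 33 95
```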
GB2312:
GB2312 character set: ASCII character set + about 7000 Chinese characters.
GB2312 encoding: compatible with ASCII encoding. A decoder examines each byte: if the value is <= 127, the byte is an ASCII character; if the value is > 127, it must be combined with the following byte to represent one character. The theoretical Chinese character encoding space is 128 x 256, or more than 30 thousand code positions.
GBK:
GBK character set: a superset of the GB2312 character set that brings the total to about 21000 Chinese characters.
GBK encoding: compatible with GB2312 encoding. It uses code positions that the GB2312 encoding leaves unused.
GB18030:
GB18030 character set: GBK character set + additional Chinese characters + characters of several ethnic minority scripts. It is the most recent national character set in China.
GB18030 encoding: compatible with GBK encoding. It continues to use code positions that the GBK encoding leaves unused, and uses four-byte sequences for the characters that do not fit in that space.
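The compatibility chain GB2312, GBK, GB18030 can be observed with Python's built-in codecs of the same names. The sample characters are my own choice, not from the original text: the traditional character 龍 is assumed here to be absent from GB2312 but present in GBK, and U+20000 is a rare character outside GBK that GB18030 encodes in four bytes. The helper name `encoded_length` is hypothetical.

```python
def encoded_length(ch: str, codec: str):
    """Return the encoded length in bytes, or None if the codec cannot encode ch."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


CODECS = ("gb2312", "gbk", "gb18030")
for ch in ("A", "中", "龍", "\U00020000"):
    print(f"U+{ord(ch):04X}", {c: encoded_length(ch, c) for c in CODECS})

# A character shared by all three sets has the same bytes in each encoding.
same_bytes = "中".encode("gb2312") == "中".encode("gbk") == "中".encode("gb18030")
print(same_bytes)  # True
```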
BIG5:
BIG5 character set: ASCII character set + about 13000 Chinese characters (Traditional Chinese).
BIG5 encoding: compatible with ASCII encoding. The encoding scheme is similar to that of GB2312.
UNICODE: (The word UNICODE is broad and confusing in daily use. It can have any of the following meanings depending on the context.)
UNICODE standard: a standard proposed by a group of organizations, containing a series of provisions on the display and encoding of human text.
UNICODE character set: the latest UNICODE character set contains more than 100,000 characters from various languages.
UNICODE encoding: (in the narrow sense, "UNICODE encoding" may refer to UCS-2 or UTF-16; in the broad sense, it can refer to any of the encodings defined by the UNICODE standard, including the following four.)
1. UTF-32 encoding: always uses 4 bytes to represent a character, so it uses space inefficiently.
2. UTF-16 encoding: uses two bytes for the more than 60000 most commonly used characters and four bytes for the rest (the so-called "supplementary characters").
3. UCS-2 encoding: the encoding used by earlier versions of UNICODE. Its only difference from UTF-16 is that it does not include supplementary characters, so it always uses two bytes per character. This encoding is now obsolete.
4. UTF-8 encoding: compatible with ASCII encoding. It uses two bytes for accented Latin letters and Greek letters, three bytes for most other common characters, including Chinese characters, and four bytes for the remaining few characters.
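The byte counts listed above are easy to confirm in Python. The sample characters are chosen for illustration only; the `-le` codec variants are used so that no byte order mark is added to the result.

```python
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u03b1": "Greek letter",
    "\u4e2d": "Chinese character",
    "\U0001F600": "supplementary character",
}


def byte_lengths(ch: str) -> dict:
    """Return the number of bytes ch occupies in each UNICODE encoding."""
    return {codec: len(ch.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")}


for ch, label in SAMPLES.items():
    print(f"U+{ord(ch):04X} {label}: {byte_lengths(ch)}")
```

UTF-32 always reports 4; UTF-16 reports 2 for everything except the supplementary character, which needs 4; UTF-8 ranges from 1 to 4.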
ISO8859-1: (Oracle users may have seen WE8ISO8859P1; this is the same thing.)
ISO8859-1 character set: ASCII character set + several Western European characters, such as accented letters.
ISO8859-1 encoding: represents a single character with 8 bits, and leaves out the control characters ([0-31] and 127) of the original ASCII encoding.
Code page: (You can regard "code page" as a synonym for "encoding". Why does this name exist? For historical reasons.)
ANSI code pages: You have probably met "ANSI" when choosing how to save a text file. "ANSI code pages" is actually a family of encodings; one of them is activated as the default ANSI encoding according to the operating system's region settings. For example, the ANSI code page on the company's computer (English system) may be 1252, while on the Chinese system at home it may be 936. This is why, at home, you can save a text file containing Chinese characters as ANSI. The ANSI code page currently in use can be found under the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NLS\CodePage\ACP. In C#, you can view it through Encoding.Default.
OEM code pages: OEM code pages are used by console applications (such as SQLPLUS). Except in CJK (Chinese-Japanese-Korean) environments, Windows uses different code pages for ANSI and OEM. For example, the English system at the company uses 437 as its OEM code page. You can use the CHCP command to view the OEM code page currently in use. In C#, you can view it through Console.OutputEncoding.
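The same information can also be read from Python. The sketch below is an assumption-laden illustration, not part of the original text: the function name `current_code_pages` is made up, the Windows branch calls the Win32 functions `GetACP` and `GetOEMCP` through `ctypes`, and on other platforms only the locale's preferred encoding is reported. Note that `GetOEMCP` returns the system default OEM code page, which may differ from what CHCP shows if the console code page has been switched.

```python
import locale
import sys


def current_code_pages() -> dict:
    """Report the encodings in effect for this process."""
    info = {"preferred": locale.getpreferredencoding(False)}
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        info["ansi"] = kernel32.GetACP()    # for example 1252 or 936
        info["oem"] = kernel32.GetOEMCP()   # for example 437 or 936
    return info


print(current_code_pages())
```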
Code page 1252:
Cp1252 character set: ASCII character set + several Western European characters + several special symbols, such as ‰.
Cp1252 encoding: uses 8 bits to represent a character. The encoding range is [0-255] (hex [00-FF]). The [0-127] part is the same as ASCII; most of the added content is Western European characters, such as letters with diacritical marks, plus some special symbols.
PS1: code page information from two real PCs
PC1: Windows XP, ANSI code page = 1252, OEM code page = 437
PC2: Chinese version of Windows 7, ANSI code page = 936, OEM code page = 936
PS2: The cp1252 and cp437 encoding tables are available for download. Early console applications often needed to draw simple tables and other graphics, so cp437 contains many special symbols such as horizontal and vertical lines.
PS3: Comparing CP1252, ISO8859-1, and ASCII by the range of codes actually used: CP1252 > ISO8859-1 > ASCII. ASCII uses [0-127] and CP1252 uses [0-255]; ISO8859-1 drops the invisible control characters [0-31] and 127 from that range, and also drops the special symbols in [128-159] (hex [80-9F]).
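To see how the same byte is interpreted under these code pages, a small Python comparison helps. The helper name `decode_in` and the sample bytes are chosen for illustration. One caveat: Python's `latin-1` codec maps bytes 128-159 to invisible C1 control characters rather than leaving them unassigned, so it never raises an error there.

```python
def decode_in(byte_value: int, codec: str) -> str:
    """Decode a single byte under the given codec."""
    return bytes([byte_value]).decode(codec)


CODECS = ("cp1252", "latin-1", "cp437")
for value in (0x41, 0xE9, 0x89, 0xC4):
    print(hex(value), {c: repr(decode_in(value, c)) for c in CODECS})
```

Byte 0x41 is "A" everywhere, 0x89 is the per-mille sign only in cp1252, and 0xC4 is a horizontal box-drawing line in cp437.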
2. Character Set
2.1. Encoding schemes
§ Single-byte character sets:
7-bit: ASCII 7-bit American (US7ASCII)
8-bit: ISO8859-1 West European (WE8ISO8859P1), EBCDIC Code Page 500 8-bit West European (WE8EBCDIC500), DEC 8-bit West European (WE8DEC)
§ Variable-length multi-byte character sets: for example, Japanese Extended UNIX Code (JEUC), Chinese GB2312-80 (GB2312-80), AL32UTF8 (UTF-8)
§ Fixed-length multi-byte character sets: only the national character set AL16UTF16 is a fixed-length multi-byte character set; it is Unicode encoded in 2 bytes.
§ Unicode (AL32UTF8, AL16UTF16, UTF8): Oracle uses AL32UTF8, UTF8, and UTFE as database character sets, and AL16UTF16 and UTF8 as national character sets.
2.2. Comparison between the database character set and the National Character Set
Why are there two character sets? If I know that only English is needed, I set the database character set to US7ASCII. If I know that only Western European characters are needed, I set it to WE8MSWIN1252 or WE8ISO8859P1, or simply use AL32UTF8. It seems that setting the database character set is enough, so what is the national character set for?
In fact, for historical reasons, and because of the unavoidable "short-sightedness" of the people who created them, many existing databases do not use a character set that supports UNICODE. Suppose you need to store Chinese characters in a database whose database character set is US7ASCII: the combination of the national character set and NVARCHAR2 can save your life. Columns of type NVARCHAR2 (and NCHAR, NCLOB) use the national character set, regardless of the database character set setting. Since 9i, the only choices for the national character set are AL16UTF16 and UTF8. Both implement UNICODE encoding standards (UTF-16 and UTF-8), so they can represent almost all the text in the world.
Of course, if the database character set is itself a UNICODE character set, there is no need to use the NVARCHAR2, NCHAR, and NCLOB types.
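The reasoning can be illustrated without a database. In the sketch below, Python codecs stand in for Oracle character sets; this mapping (US7ASCII to `ascii`, WE8MSWIN1252 to `cp1252`, AL16UTF16 to `utf-16-be`, AL32UTF8 to `utf-8`) is my assumption for illustration only, and the helper name `fits` is hypothetical. It shows which character sets could hold a string containing Chinese characters.

```python
def fits(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Oracle character set name -> Python codec used as a stand-in (assumed mapping).
STAND_INS = {
    "US7ASCII": "ascii",
    "WE8MSWIN1252": "cp1252",
    "AL16UTF16": "utf-16-be",
    "AL32UTF8": "utf-8",
}

text = "Oracle 数据库"
for name, codec in STAND_INS.items():
    print(name, fits(text, codec))
```

Only the two UNICODE character sets can hold the sample string, which is why a US7ASCII database needs NVARCHAR2 columns for such data.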
Comparison between the database character set and the national character set:
Database character set | National character set
Defined at creation | Defined at creation
Cannot be changed without re-creating the database | With some exceptions, cannot be changed without re-creating the database
Stores columns of type CHAR, VARCHAR2, CLOB, and LONG | Stores columns of type NCHAR, NVARCHAR2, and NCLOB
Can store variable-length character sets | Can store Unicode using AL16UTF16 or UTF8
Oracle names character sets according to certain rules. For example:
AL32UTF8
[AL] supports all languages.
[32] each character occupies at most 32 bits (4 bytes).
[UTF8] the encoding is UTF-8.
WE8MSWIN1252
[WE] supports Western European languages.
[8] each character occupies 8 bits (a single byte).
[MSWIN1252] the encoding is CP1252.
US7ASCII
[US] indicates the United States.