Oracle globalization support: an introduction to character sets, with server-side and client-side settings

Source: Internet
Author: User

1. Character set and encoding basics

Character Set:

A character set is a collection of characters gathered for some purpose and given a name.

For example:

ASCII:

ASCII character set: Contains 128 characters: uppercase and lowercase English letters, Arabic numerals, punctuation marks, and some non-printing control characters.

ASCII code: Represents a character with 7 bits. The encoding range is [0-127] (Hex [00-7F]), where [0-31] (Hex [00-1F]) and 127 (Hex 7F) are control characters and the rest are printable characters.
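As a small illustration of the ranges above, the following Python sketch splits the 128 ASCII codes into control codes and printable codes. The helper name is_ascii_control is made up for this example.

```python
def is_ascii_control(code: int) -> bool:
    """Return True for the ASCII control codes: 0-31 and 127."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    return code < 32 or code == 127


# Partition the whole 7-bit range into the two groups described above.
control = [c for c in range(128) if is_ascii_control(c)]
printable = [c for c in range(128) if not is_ascii_control(c)]

print(len(control), len(printable))  # 33 95
print(bytes(printable).decode("ascii"))
```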

GB2312:

GB2312 Character Set: ASCII character set + about 7000 Chinese characters.

GB2312 encoding: Compatible with ASCII encoding. The decoder examines each byte: if its value is <= 127, the byte is an ASCII character; if its value is > 127, it must be combined with the following byte to represent one character. The theoretical code space for Chinese characters is 128 x 256, or more than 30,000 code points.

GBK:

GBK character set: A superset of the GB2312 character set, with about 21,000 Chinese characters in total.

GBK encoding: Compatible with GB2312 encoding. It uses the code space that GB2312 leaves unused.

GB18030:

GB18030 character set: GBK character set + additional Chinese characters + characters of several ethnic minority scripts. It is currently the latest Chinese character set.

GB18030 encoding: Compatible with GBK encoding. It continues to use the code space that GBK leaves unused, and uses four bytes for characters that do not fit there.
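A quick way to see the one-, two- and four-byte forms is Python's built-in gb18030 codec. The sample characters are my own choice; the emoji stands in for any character that GBK cannot encode.

```python
samples = ["A", "中", "\U0001F600"]  # ASCII, common Chinese, emoji

for ch in samples:
    # Number of bytes GB18030 needs for each sample character.
    print(repr(ch), len(ch.encode("gb18030")))

# A character that GBK already covers keeps the same bytes in GB18030.
same_as_gbk = "中".encode("gb18030") == "中".encode("gbk")

# A character outside GBK falls into the four-byte area.
emoji_len = len("\U0001F600".encode("gb18030"))
```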

BIG5:

BIG5 character set: ASCII character set + about 13000 Chinese characters (Traditional Chinese).

BIG5 encoding: Compatible with ASCII encoding. The encoding scheme is similar to GB2312.

UNICODE (The word UNICODE is used loosely and confusingly in daily use. Depending on the context, it can have any one of the following meanings.)

UNICODE Standard: A set of standards put forward by some organizations to regulate the display and encoding of human texts.

UNICODE character set: The latest UNICODE character set contains more than 100,000 characters from various languages.

UNICODE encoding (In the narrow sense, "UNICODE encoding" may mean UCS-2 or UTF-16. In the broad sense, it can refer to any of the following four encodings defined by the UNICODE standard.)

1. UTF-32 encoding: Always uses 4 bytes for a character, which wastes space.

2. UTF-16 encoding: Uses two bytes for the more than 60000 commonly used characters and four bytes for the rest (the "supplementary characters").

3. UCS-2 encoding: The implementation used by early versions of UNICODE. Its only difference from UTF-16 is that it does not include the supplementary characters, so it always uses two bytes per character. This encoding is now obsolete.

4. UTF-8 encoding: Compatible with ASCII encoding (ASCII characters use one byte). Accented Latin letters and Greek letters use two bytes, other common characters including Chinese characters use three bytes, and the few remaining characters use four bytes.
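The byte counts listed above can be checked with Python's standard codecs. This is only an illustrative sketch; encoded_lengths is a made-up helper, and the four sample characters are my own picks (an ASCII letter, an accented Latin letter, a Chinese character and a supplementary character).

```python
def encoded_lengths(ch: str) -> dict:
    """Byte length of one character in each Unicode encoding."""
    return {
        # The "-be" variants are used so that no byte order mark is added.
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# "A", e with acute accent, the Chinese character zhong, an emoji
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(repr(ch), encoded_lengths(ch))
```

UCS-2 is not available as a Python codec; for the first three samples it would give the same two bytes as UTF-16, and it simply cannot represent the last one.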

ISO8859-1 (Oracle users may have seen it as WE8ISO8859P1. That is this character set.)

ISO8859-1 character set: ASCII character set + several Western European characters, such as accented letters.

ISO8859-1 encoding: Represents a single character with 8 bits, and leaves out the control characters ([0-31] and 127) of the original ASCII code.

Code page (You can treat "code page" as a synonym for "encoding". The separate name exists for historical reasons.)

ANSI code pages: You have probably met ANSI when saving a text file with "Save As". The ANSI code pages are actually a family of encodings, one of which is activated as the default ANSI encoding according to the operating system's region settings. For example, the ANSI code page on an office computer (English system) may be 1252, while on a Chinese system at home it may be 936. That is why you can use ANSI to save a text file containing Chinese characters at home. You can view the current ANSI code page in the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP. In C#, you can view it through Encoding.Default.

OEM code pages: OEM code pages are used by console applications (such as SQLPLUS). Except in CJK (Chinese-Japanese-Korean) environments, Windows uses different ANSI and OEM code pages. For example, the English system at the office uses OEM code page 437. You can use the CHCP command to view the current OEM code page. In C#, you can view it through Console.OutputEncoding.
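For readers who prefer a script to the registry, the sketch below prints the comparable settings from Python. Treat it as a rough equivalent only: what these calls report depends on the platform and on how Python was started (for example, UTF-8 mode changes the result), so on Windows they usually, but not always, correspond to the ANSI code page and the console code page.

```python
import locale
import sys

# Encoding Python derives from the OS locale settings. On Windows this is
# typically the ANSI code page, for example "cp1252" or "cp936".
ansi_like = locale.getpreferredencoding(False)

# Encoding of the attached console or terminal, if there is one.
console = getattr(sys.stdout, "encoding", None) or "unknown"

print("locale encoding: ", ansi_like)
print("console encoding:", console)
```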

Code page 1252:

Cp1252 character set: ASCII character set + several Western European characters + several special symbols, such as ‰.

Cp1252 encoding: Represents a single character with 8 bits. The encoding range is [0-255] (Hex [00-FF]). The [0-127] part is the same as ASCII, and most of the new content is Western European characters, for example accented letters, plus some special symbols.

PS1: Code page information from two real PCs

PC1: Windows XP, ANSI code page = 1252, OEM code page = 437

PC2: Chinese version of Windows 7, ANSI code page = 936, OEM code page = 936

PS2: The cp1252 and cp437 encoding tables are available for download. Early console applications often needed to draw simple tables and other graphics, which is why cp437 contains many special symbols such as horizontal and vertical lines.

PS3: Comparing CP1252, ISO8859-1 and ASCII by the code range actually used for characters: CP1252 > ISO8859-1 > ASCII. ASCII covers [0-127] and CP1252 covers [0-255]. Compared with CP1252, ISO8859-1 leaves out the non-printing control characters [0-31] and 127, and also the special symbols in [128-159] (Hex [80-9F]).
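The differences in PS2 and PS3 are easy to observe by decoding the same byte values with different code pages. The following is a small Python sketch; the byte values are ones I picked for illustration. Note that Python's latin-1 codec maps [128-159] to control codes rather than rejecting them, which still shows that ISO8859-1 has no printable symbols there.

```python
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")   # a special symbol (the euro sign)
as_latin1 = raw.decode("latin-1")  # a non-printing control code

# Above 159 the two encodings agree: 0xE9 is e with acute accent in both.
accent_cp1252 = bytes([0xE9]).decode("cp1252")
accent_latin1 = bytes([0xE9]).decode("latin-1")

# cp437 uses part of its upper range for line-drawing characters.
box = bytes([0xC4, 0xB3]).decode("cp437")

print(repr(as_cp1252), repr(as_latin1))
print(accent_cp1252, accent_latin1, box)
```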

2. Character Set

2.1. Encoding schemes

§ Single-byte character sets:

7-bit: ASCII 7-bit American (US7ASCII)

8-bit: ISO8859-1 West European (WE8ISO8859P1), EBCDIC Code Page 500 8-bit West European (WE8EBCDIC500), DEC 8-bit West European (WE8DEC)

§ Variable-length multi-byte character sets: for example, Japanese Extended UNIX Code (JEUC), Chinese GB2312-80 (GB2312-80), AL32UTF8 (UTF-8)

§ Fixed-length multi-byte character sets: the only one is AL16UTF16, which is used as a national character set and encodes Unicode in 2 bytes.

§ Unicode (AL32UTF8, AL16UTF16, UTF8): Oracle uses AL32UTF8, UTF8, and UTFE as database character sets, and AL16UTF16 and UTF8 as national character sets.

2.2. Comparison between the database character set and the National Character Set

Why do we have two character sets? If I know that only English is needed, I set the database character set to US7ASCII. If I know that only Western European characters are needed, I set it to WE8MSWIN1252 or WE8ISO8859P1, or I simply use AL32UTF8. In each case I only need to set the "database character set". What is the "national character set" for?

In fact, because of historical issues and the unavoidable short-sightedness of whoever created them, many existing databases use a database character set that cannot hold UNICODE characters. Suppose you need to store Chinese characters in a database whose database character set is US7ASCII: the combination of the national character set and NVARCHAR2 can save your life. Columns of type NVARCHAR2 (and NCHAR, NCLOB) use the national character set, regardless of the database character set. Since 9i, the only choices for the national character set are AL16UTF16 and UTF8. They implement the UTF-16 and UTF-8 encodings of the UNICODE standard, so they can represent almost all the text in the world.

Of course, if the database character set is itself a UNICODE character set, there is no need to use the NVARCHAR2, NCHAR, and NCLOB types.

Comparison between the database character set and the national character set:

Database Character Set | National Character Set
Defined at creation | Defined at creation
Cannot be changed without re-creating the database | With a few exceptions, cannot be changed without re-creating the database
Stores data columns of type CHAR, VARCHAR2, CLOB, and LONG | Stores data columns of type NCHAR, NVARCHAR2, and NCLOB
Can store variable-length character sets | Can store Unicode using AL16UTF16 or UTF8

Oracle names character sets according to certain rules. For example:

AL32UTF8

[AL] supports all languages.

[32] each character occupies a maximum of 32 bits (4 bytes).

[UTF8] the encoding is UTF-8.

WE8MSWIN1252

[WE] supports Western European languages.

[8] each character occupies 8 bits (a single byte).

[MSWIN1252] the encoding is CP1252.

US7ASCII

[US] indicates the United States.

[7] each character occupies 7 bits.

[ASCII] the encoding is ASCII.

Other names such as ZHS16GBK, ZHT16BIG5, and US8PC437 (encoded as OEM cp437) follow the same pattern.
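The naming pattern described above (region or language prefix, maximum bits per character, encoding name) can be captured with a small regular expression. This is a sketch for illustration only: split_charset_name is a hypothetical helper, and names that do not follow the pattern, such as UTF8, are rejected rather than guessed at.

```python
import re

# Letters (region/language), digits (bits per character), then the encoding.
NAME_PATTERN = re.compile(r"^([A-Z]+)(\d+)(.+)$")


def split_charset_name(name: str) -> tuple:
    """Split an Oracle-style name into (region, bits, encoding)."""
    match = NAME_PATTERN.match(name.upper())
    if match is None:
        raise ValueError(f"name does not follow the pattern: {name}")
    region, bits, encoding = match.groups()
    return region, int(bits), encoding


for name in ["AL32UTF8", "WE8MSWIN1252", "US7ASCII", "ZHS16GBK", "US8PC437"]:
    print(name, split_charset_name(name))
```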

2.3. Select the database Character Set

Because the database character set is used to identify and hold SQL and PL/SQL source code, it must be a superset of either EBCDIC or 7-bit ASCII, whichever is native to the platform. Therefore a fixed-length multi-byte character set cannot be used as the database character set (that is, AL16UTF16 cannot be the database character set).

ASCII-based character sets cannot be used on EBCDIC platforms, and vice versa.

2.4. Select the national character set

There are two options: AL16UTF16 and UTF8.

If the national character set is not specified when the database is created, the default is AL16UTF16. When NVARCHAR2 and the other national data types are used, the text is stored and managed in the national character set instead of the database character set.

The choice between the two depends on space and performance. AL16UTF16 is fixed-length (2 bytes per character), while UTF8 is variable-length, from 1 to 3 bytes per character. In UTF8, European characters take 1 to 2 bytes, so it saves space relative to AL16UTF16; Asian characters take 3 bytes, so it needs more space than AL16UTF16.

AL16UTF16 is a fixed-length encoding, so it is faster than UTF8.
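The space trade-off can be estimated before choosing. The sketch below uses Python's utf-16-be and utf-8 codecs as stand-ins for AL16UTF16 and UTF8; that substitution is an assumption that holds for ordinary (non-supplementary) characters like the samples here, and storage_bytes is a made-up helper name.

```python
def storage_bytes(text: str) -> dict:
    """Approximate bytes needed under each national character set."""
    return {
        "AL16UTF16": len(text.encode("utf-16-be")),  # 2 bytes per character
        "UTF8": len(text.encode("utf-8")),           # 1 to 3 bytes per character
    }


european = "caf\u00e9"    # four characters, one of them accented
asian = "\u4e2d\u6587"    # two Chinese characters

print(storage_bytes(european))  # {'AL16UTF16': 8, 'UTF8': 5}
print(storage_bytes(asian))     # {'AL16UTF16': 4, 'UTF8': 6}
```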

3. Specifying Language-Dependent Behavior

There are three ways to specify National Language Support (NLS) parameters:

§ As initialization parameters on the server, which specify the default server environment. These have no effect on the client.

§ As environment variables on the client, which specify locale-dependent behavior and override the defaults set for the server.

§ As ALTER SESSION parameters, which override the defaults set for the client or the server.

ALTER SESSION has the highest priority, environment variables come second, and initialization parameters have the lowest.
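The precedence order can be modeled as a lookup through three layers. The sketch below uses Python's ChainMap; the parameter values are invented for the example, and it models only the priority rule, not how Oracle actually derives one parameter from another.

```python
from collections import ChainMap

# Hypothetical settings at each level (values invented for illustration).
init_params = {"NLS_LANGUAGE": "AMERICAN", "NLS_DATE_FORMAT": "DD-MON-RR"}
client_env = {"NLS_DATE_FORMAT": "YYYY-MM-DD"}
alter_session = {"NLS_LANGUAGE": "FRENCH"}

# ChainMap searches its maps from left to right, so the highest priority
# level (ALTER SESSION) goes first and the lowest (server) goes last.
effective = ChainMap(alter_session, client_env, init_params)

print(effective["NLS_LANGUAGE"])     # FRENCH, from ALTER SESSION
print(effective["NLS_DATE_FORMAT"])  # YYYY-MM-DD, from the environment
```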

3.1. Server settings

NLS_LANGUAGE initialization parameter: Specifies the language of messages, the day and month names and their abbreviations, the symbols for A.D., B.C. and similar, and the default sorting mechanism.

NLS_TERRITORY initialization parameter: Specifies day and week numbering, the default date format, the decimal character and group separator, the default ISO and local currency symbols, and the first day of the week.

The NLS_LANGUAGE initialization parameter determines the default values of the following parameters: NLS_DATE_LANGUAGE, NLS_SORT.

NLS_TERRITORY determines the default values of the following parameters: NLS_CURRENCY, NLS_ISO_CURRENCY, NLS_DATE_FORMAT, NLS_NUMERIC_CHARACTERS.

Parameter | Value
NLS_LANGUAGE | AMERICAN
  NLS_DATE_LANGUAGE | AMERICAN
  NLS_SORT | BINARY
NLS_TERRITORY | AMERICA
  NLS_CURRENCY | $
  NLS_ISO_CURRENCY | AMERICA
  NLS_DATE_FORMAT | DD-MON-RR
  NLS_NUMERIC_CHARACTERS | .,

3.2. Specifying Language-Dependent Behavior for the Session

§ Environment variable:

- NLS_LANG = French_France.UTF8

§ Additional environment variables:

- NLS_DATE_FORMAT

- NLS_DATE_LANGUAGE

- NLS_SORT

- NLS_NUMERIC_CHARACTERS

- NLS_CURRENCY

- NLS_ISO_CURRENCY

- NLS_CALENDAR

NLS_LANG has three components, and each component controls a subset of NLS features:

NLS_LANG = <language>_<territory>.<charset>

Where:

Language: Overrides the value of NLS_LANGUAGE and controls the same features as NLS_LANGUAGE.

Territory: Overrides the value of NLS_TERRITORY and controls the same features as NLS_TERRITORY.

Charset: Specifies the character encoding scheme used by the client application (usually that of the user's terminal).

NLS_LANG defines a client terminal's character encoding scheme. Different clients can use different encoding schemes. Data passed between the client and the server is converted automatically between the two encoding schemes. The database encoding scheme should be a superset of, or equivalent to, all the client encoding schemes. The conversion is transparent to the client application.
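To make the three-part format concrete, here is a small Python sketch that splits an NLS_LANG value into its components. The function parse_nls_lang and its regular expression are my own illustration, not an Oracle utility, and it does not check that the components are valid Oracle names.

```python
import re

# language, then optional "_territory", then optional ".charset"
NLS_LANG_PATTERN = re.compile(
    r"^(?P<language>[^_.]*)(?:_(?P<territory>[^.]*))?(?:\.(?P<charset>.+))?$"
)


def parse_nls_lang(value: str) -> dict:
    """Split an NLS_LANG string; missing components come back as None."""
    match = NLS_LANG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"cannot parse NLS_LANG value: {value!r}")
    return {key: (part or None) for key, part in match.groupdict().items()}


print(parse_nls_lang("French_France.UTF8"))
print(parse_nls_lang("AMERICAN_AMERICA.ZHS16GBK"))
```

Language names may contain spaces (for example SIMPLIFIED CHINESE), which is why the pattern splits only on the underscore and the dot.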

The following passage explains why storing Chinese characters can cause problems when the database character set is a Western one, for example WE8MSWIN1252 or US7ASCII:

When the database character set and the client character set are the same, Oracle assumes that the data being sent or received is of the same character set, so no validations or conversions are performed. Although the benefit of this scenario is better performance, misuse can lead to possible data inconsistency problems, such as storing data from another character set that is different from the database character set.

For example, your database character set is US7ASCII and you are using Simplified Chinese Windows as your client terminal. By setting NLS_LANG to SIMPLIFIED CHINESE_HONGKONG.US7ASCII as the client character set, it is possible for you to store multibyte Simplified Chinese characters inside a single-byte database. This means that Oracle will treat these characters as single-byte US7ASCII characters, and therefore all SQL string manipulation functions such as SUBSTR or LENGTH will be based on bytes rather than characters. All of your non-ASCII characters could be lost following an export and import into another database.
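The effect described in this example can be imitated outside the database. The Python sketch below is a simplified model, not Oracle's actual code path: it assumes the client sends GBK bytes (code page 936) that are stored untouched, and it stands in for a later validating conversion with a strict ASCII decode that replaces invalid bytes.

```python
text = "中文"                   # two Chinese characters typed on the client
stored = text.encode("gbk")    # bytes stored as-is, with no conversion

char_count = len(text)         # what the user expects LENGTH to return
byte_count = len(stored)       # what a byte-based LENGTH sees

# Reading back through the same pass-through path still looks fine.
round_trip = stored.decode("gbk")

# A step that really treats the data as 7-bit ASCII cannot keep it.
converted = stored.decode("ascii", errors="replace")

print(char_count, byte_count)  # 2 4
print(round_trip == text, converted == text)
```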

3.3. Setting up PL/SQL Developer on a typical Windows client

1. Factors that affect how PL/SQL DEVELOPER displays characters

Current OS character set

Use chcp to view the code page. 936 indicates that the current character set of the operating system is Simplified Chinese, which corresponds to ZHS16GBK.

ORACLE registry settings

NLS_LANG under the ORACLE registry key, such as AMERICAN_AMERICA.ZHS16GBK

By default, this value is taken from the current operating system character set; it can be modified.

System environment variable NLS_LANG

Environment variables affect DOS windows, SQL*Plus, PL/SQL DEVELOPER, and other applications.

ORACLE database character set

SELECT * FROM NLS_DATABASE_PARAMETERS n WHERE n.PARAMETER = 'NLS_CHARACTERSET';

PARAMETER VALUE

NLS_CHARACTERSET ZHS16GBK

The 'Unicode support' option, the first option under TOOLS in the PL/SQL DEVELOPER menu

If this option is selected, Unicode data is retrieved from the Oracle server as is and displayed as Unicode text. If this option is disabled, Unicode data from the server is converted to the character set of the Oracle client, as given by the NLS_LANG key.

When writing Chinese characters, you must set the environment variable NLS_LANG (note: the environment variable, not the registry value). Once it is set, Chinese data can be written correctly; nothing else is required.
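To see why the declared client character set matters, the following Python sketch imitates a mismatch: text typed on a code page 936 system is interpreted as if it were CP1252. It is an illustration of the general failure mode under that assumption, not a reproduction of what PL/SQL DEVELOPER or Oracle does internally.

```python
import codecs

# Python treats code page 936 and GBK as the same codec, in line with
# chcp 936 corresponding to ZHS16GBK.
same_codec = codecs.lookup("cp936").name == codecs.lookup("gbk").name

typed = "中文"
client_bytes = typed.encode("gbk")       # what the Chinese client produces

# Interpreting those bytes under a Western code page gives accented
# Latin letters instead of Chinese characters.
misread = client_bytes.decode("cp1252")

# With the correct character set the original text comes back.
recovered = misread.encode("cp1252").decode("gbk")

print(same_codec, repr(misread), recovered)
```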

4. Obtaining language and character set information

Database server: NLS_DATABASE_PARAMETERS (from PROPS$):

§ PARAMETER (NLS_CHARACTERSET, NLS_NCHAR_CHARACTERSET)

§ VALUE

In the query results, NLS_CHARACTERSET is the database character set, and NLS_NCHAR_CHARACTERSET is the national character set.

Instance: NLS_INSTANCE_PARAMETERS (from V$NLS_PARAMETERS):

§ PARAMETER (initialization parameters that have been set explicitly)

§ VALUE

Session: NLS_SESSION_PARAMETERS:

§ PARAMETER (session parameters)

§ VALUE

V$NLS_VALID_VALUES (lists the contents of the NLS data boot file, including definitions that may not be supported or used externally):

§ PARAMETER (LANGUAGE, SORT, TERRITORY, CHARACTERSET)

§ VALUE
