Understanding Oracle Character Set

Last Update:2018-12-06 Source: Internet

Author: User

What is the Oracle character set?

An Oracle character set is a collection of symbols together with the rules for interpreting byte data as those symbols. Character sets differ in size, and some are supersets of others.

Oracle supports the national language architecture, allowing you to store, process, and retrieve data in a localized language. It makes database tools, error messages, sorting order, date, time, currency, numbers, and calendar automatically adapt to localization languages and platforms.

The most important parameter affecting the character set of an Oracle database is NLS_LANG. Its format is as follows:

NLS_LANG = language_territory.charset

It has three components (language, territory, and character set), each of which controls a subset of the NLS features:

language specifies the language of server messages, territory specifies the server's date and number formats, and charset specifies the character set. For example: AMERICAN_AMERICA.ZHS16GBK.

As the composition of NLS_LANG shows, only the third part actually affects the character set. Therefore, if the third part is the same for two databases, data can be exported from one and imported into the other. The first two parts only determine, for example, whether messages are displayed in Chinese or English.

Character Set of the database

The character set is specified when the database is created and normally cannot be changed afterwards. Therefore, it is important to select the correct character set during database creation.

When creating a database, we can specify both the character set and the national character set.

The character set is used to store:

CHAR, VARCHAR2, CLOB, LONG, and other data types

Identifiers such as table names, column names, and PL/SQL variables

SQL and PL/SQL program units

The national character set is used to store:

NCHAR, NVARCHAR2, NCLOB, and other data types

Once a character set is selected, the characters that can be stored in the database are limited to it, so the character set you choose must accommodate every character you will use. Different character sets assign different meanings to the same binary codes. For example, the four bytes D6 D0 B9 FA (binary 11010110 11010000 10111001 11111010), interpreted under the 16-bit GBK character set, represent the word "中国" (China). Under a single-byte character set, the same bytes represent four special characters with the code values 214, 208, 185, and 250. This is the role of a character set: it determines how stored information is understood.
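The effect can be reproduced outside the database. The following is a minimal sketch, assuming Python's `gbk` and `latin-1` codecs as stand-ins for ZHS16GBK and a single-byte set such as WE8ISO8859P1; the helper name `interpret` is hypothetical.

```python
# The four byte values quoted in the text: 214, 208, 185, 250 (hex D6 D0 B9 FA).
raw = bytes([214, 208, 185, 250])


def interpret(data: bytes, encoding: str) -> str:
    """Decode the same bytes under a given character set."""
    return data.decode(encoding)


# Double-byte interpretation (stand-in for ZHS16GBK).
as_gbk = interpret(raw, "gbk")
# Single-byte interpretation (stand-in for WE8ISO8859P1).
as_latin1 = interpret(raw, "latin-1")

print(as_gbk, len(as_gbk))        # two characters
print(as_latin1, len(as_latin1))  # four characters
```

The bytes never change; only the character set used to read them does.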

How to query Oracle character sets

Many people have encountered data import failures caused by mismatched character sets. Three character sets are involved: the character set of the Oracle server, the character set of the Oracle client, and the character set of the dmp file. During data import, all three must be consistent for the data to be imported correctly.

1. Querying the character set of the Oracle server

There are many ways to find the character set of the Oracle server. The most direct query is:

SQL> select userenv('language') from dual;

The result looks like this: AMERICAN_AMERICA.ZHS16GBK.

2. Querying the character set of a dmp file

A dmp file exported with Oracle's exp tool also contains character set information: the 2nd and 3rd bytes of the file record its character set. If the dmp file is not large, say a few MB or a few dozen MB, you can open it in UltraEdit (in hexadecimal mode) and read the 2nd and 3rd bytes, for example 0354. Then use the following SQL statement to find the corresponding character set:

SQL> select nls_charset_name(to_number('0354','xxxx')) from dual;  
ZHS16GBK

If the dmp file is large, for example 2 GB or more (this is also the most common case), opening it in a text editor is very slow or fails entirely. In that case you can use the following command (on a Unix host):

cat exp.dmp |od -x|head -1|awk '{print $2 $3}'|cut -c 3-6

Then, you can use the preceding SQL statement to obtain its character set.
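As an alternative to the od pipeline, the two bytes can be read directly, which avoids depending on the host's byte order (od -x groups bytes into 16-bit words in host order). The sketch below assumes the header layout described above (character set ID in the 2nd and 3rd bytes, most significant byte first); the function names and the fake header written by `demo` are hypothetical.

```python
import os
import tempfile


def dmp_charset_id(path: str) -> int:
    """Return the character set ID stored in bytes 2 and 3 of an export file."""
    with open(path, "rb") as f:
        header = f.read(3)  # only the first three bytes are needed
    if len(header) < 3:
        raise ValueError("file too short to contain a character set ID")
    # Bytes 2-3 (1-based), most significant byte first: 03 54 -> 0x0354.
    return int.from_bytes(header[1:3], "big")


def demo() -> int:
    # Hypothetical stand-in for a real dump: one leading byte, then 03 54.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".dmp") as tmp:
        tmp.write(b"\x03\x03\x54" + b"illustrative payload")
        name = tmp.name
    try:
        return dmp_charset_id(name)
    finally:
        os.remove(name)


if __name__ == "__main__":
    cid = demo()
    print(cid, format(cid, "04x"))
```

The decimal result can be passed to nls_charset_name, or the hexadecimal form to the to_number query shown earlier. Only three bytes are read, so file size does not matter.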

3. Querying the character set of the Oracle client

This is relatively simple. On Windows, it is the NLS_LANG value under the corresponding Oracle home in the registry. It can also be set in a DOS window, for example:

set nls_lang=AMERICAN_AMERICA.ZHS16GBK

In this way, only the environment variables in this window are affected. On Unix platforms, it is the environment variable NLS_LANG.

$ echo $NLS_LANG
AMERICAN_AMERICA.ZHS16GBK

The character set of the client must be the same as that of the server for non-ASCII characters in the database to be displayed correctly. If multiple settings exist, the order of precedence is: alter session > environment variable > registry > parameter file.
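The precedence order can be pictured as a first-match lookup. This is only an illustrative sketch of the ordering stated above, not how Oracle resolves settings internally; the dictionary keys and the function name are hypothetical.

```python
from typing import Optional, Tuple

# Highest priority first, following the order given in the text.
PRECEDENCE = ("alter session", "environment variable", "registry", "parameter file")


def effective_setting(settings: dict) -> Optional[Tuple[str, str]]:
    """Return (source, value) for the highest-priority source that is set."""
    for source in PRECEDENCE:
        value = settings.get(source)
        if value:
            return source, value
    return None


example = {
    "registry": "AMERICAN_AMERICA.WE8ISO8859P1",
    "environment variable": "AMERICAN_AMERICA.ZHS16GBK",
}
print(effective_setting(example))
```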

The character set must be consistent, but the language setting can differ. We recommend English for the language setting. If the character set is ZHS16GBK, NLS_LANG can be AMERICAN_AMERICA.ZHS16GBK.

1.1. The type of data to be stored in the database is the primary consideration in character set selection.
For databases that store only English information, a single-byte character set such as US7ASCII or WE8ISO8859P1 is generally most appropriate and is optimal in terms of performance and space.

Conversely, a single-byte character set is not suitable for databases that store Chinese information. In that case, although the database character set is declared as US7ASCII or WE8ISO8859P1, the data actually stored uses a different encoding. This inconsistency easily causes problems, and we recommend that you do not use this approach. Oracle provides a variety of character sets precisely to meet the encoding needs of different kinds of text.

1.2. Application requirements take priority in character set selection.
Software increasingly needs to handle different languages and scripts for international use. Especially when one system must hold multiple languages, Unicode is generally adopted as a universal solution; it is worthwhile even at some cost in space and processing efficiency. We recommend AL32UTF8 or UTF8 as the database character set.

When a user enters data in the application, the encoding of that data is determined by the region and language settings of the user's operating system. For example, on Simplified Chinese Windows XP, Chinese input is encoded in GBK. After input is complete, the program first determines the user's local environment, converts the data to Unicode, and sends it to the server through Oracle Net. Because both the client and the database character sets are UTF8, Oracle performs no conversion in transit and stores the data in the database as UTF8. A query is the reverse process: the application retrieves UTF8-encoded data from the database, converts it to the user's local encoding according to the user's local environment, and displays the result. The key to this solution is that the application must support Unicode. Encoding conversion is the application's responsibility; the database only stores the data.
Some programs do not support Unicode well enough and provide no encoding conversion of their own. For these, Oracle's built-in character set conversion can achieve the same result. The user's input is again encoded according to the operating system's region and language settings (GBK on Simplified Chinese Windows XP), but this time the program sends the data to the server through Oracle Net unchanged. Because the client character set and the database character set differ, Oracle converts the client encoding to UTF8 and stores the data in the database as UTF8. The advantage of this solution is that the program does not need to support Unicode; Oracle performs the conversion automatically. Because the database character set is UTF8, a superset of the other character sets, no data is lost during conversion. English characters occupy a single byte in UTF8, so their conversion cost is negligible. Characters from some Asian character sets generally need two to three bytes in UTF8, so the required database space grows and the conversion workload is larger, which costs some performance.
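The space cost mentioned above is easy to measure. The following minimal sketch assumes Python's `gbk` and `utf-8` codecs as stand-ins for ZHS16GBK and the Oracle UTF8 character sets; the function name `storage_bytes` is hypothetical.

```python
def storage_bytes(text: str) -> dict:
    """Bytes needed to store the same text in GBK and in UTF-8."""
    return {
        "gbk": len(text.encode("gbk")),
        "utf-8": len(text.encode("utf-8")),
    }


for sample in ("Oracle", "中国"):
    print(sample, storage_bytes(sample))

# Converting GBK data to UTF-8 and back loses nothing for these characters.
round_trip = "中国".encode("gbk").decode("gbk").encode("utf-8").decode("utf-8")
```

English text takes the same space in both encodings, while each of these Chinese characters grows from two bytes to three.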

Note:

On a Simplified Chinese Windows platform, the default character set is ZHS16GBK

On an English platform, the default character set is WE8ISO8859P1 (Western European)

For multi-language support: AL32UTF8

II. Character Set changes

After the database is created, changing the character set usually requires re-creating the database and converting the data through export and import.

The character set can also be changed in the following ways. The database should be in RESTRICTED mode.

Since Oracle7, changing the character set by modifying props$ directly has been an extremely dangerous method. Avoid it as much as possible.

Changing the character set through ALTER DATABASE CHARACTER SET is safe and reliable, but it enforces strict subset and superset constraints, so in practice this method is seldom usable. Oracle does, however, have another way to change the character set:

SQL> shutdown immediate

SQL> startup mount

SQL> alter system enable restricted session;

SQL> alter system set job_queue_processes = 0;

SQL> alter system set aq_tm_processes = 0;

SQL> alter database open;

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> alter database character set INTERNAL_USE ZHS16GBK;

This goes through exactly the same internal process as a normal ALTER DATABASE CHARACTER SET operation; the only thing INTERNAL_USE does is make Oracle bypass the subset and superset validation.

This method is useful in some situations, such as testing. Be very careful about applying it in a production environment. It is best to try it on a copy of the database first; no one but you is responsible for the consequences.

For a DBA, there is one very important principle: do not put your database in danger!

This means backing up the database before performing any operation that may change its structure. Many DBAs have skipped the backup and learned this lesson the hard way.

III. Export, import, and conversion

Export and import are commonly used tools for data migration and conversion. Because exported files are platform-independent, they are most often used for cross-platform migration.

During the export operation, it is very important to set the character set of the client, that is, the client's NLS_LANG settings.

The NLS_LANG parameter consists of the following parts:

NLS_LANG = <Language>_<Territory>.<Client Characterset>

Language: specifies the language of Oracle messages and of day and month names.

Territory: specifies the currency and number formats, as well as the regional conventions for calculating weeks and dates.

Characterset: controls the character set used by client applications. It is usually set to match the client's code page (for example, on Windows), or to UTF8 for Unicode applications. On Windows, you can use the chcp command to view the current system code page.

Tip: All components of NLS_LANG are optional, and the default value is used for any component that is not specified. If you specify a territory or a character set, you must include its preceding delimiter: an underscore (_) for the territory, a period (.) for the character set. Otherwise, the value is interpreted as a language name.
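The delimiter rule can be expressed as a small parser. This is a sketch of the rule as stated in the tip, not Oracle's own parsing code; `NlsLang` and `parse_nls_lang` are hypothetical names, and values such as `_JAPAN` are illustrative inputs.

```python
from typing import NamedTuple, Optional


class NlsLang(NamedTuple):
    language: Optional[str]
    territory: Optional[str]
    charset: Optional[str]


def parse_nls_lang(value: str) -> NlsLang:
    """Split language_territory.charset; any component may be absent."""
    value = value.strip()
    # The period introduces the character set.
    rest, _, charset = value.partition(".")
    # The underscore introduces the territory.
    language, _, territory = rest.partition("_")
    return NlsLang(language or None, territory or None, charset or None)


for setting in ("AMERICAN_AMERICA.ZHS16GBK", "_JAPAN", ".AL32UTF8", "FRENCH"):
    print(setting, "->", parse_nls_lang(setting))
```

A value with no delimiter at all, such as `FRENCH`, ends up entirely in the language field, which matches the last sentence of the tip.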

For example, to set only the territory part of NLS_LANG, use the following format:

NLS_LANG = _territory

To view the client's NLS_LANG settings, you can use the following methods:

On Windows: check the NLS_LANG value under the registry key [HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\KEY_OraDb10g_Home1]

To set it for the current session only, in a command window: set NLS_LANG=SIMPLIFIED CHINESE_CHINA.ZHS16GBK

On Linux: echo $NLS_LANG

To set it for the current session only, in a command window: export NLS_LANG="SIMPLIFIED CHINESE_CHINA.ZHS16GBK"

IV. Modifying the character set of an export file

We know that the character set ID used for the export is recorded in the export file. By viewing the 2nd and 3rd bytes of the file header, we can find the hexadecimal character set ID. On Windows, we can use UltraEdit or a similar tool to open the dmp file, view its character set, and modify it. If the dmp file is too large, for example larger than 500 MB, you can use AnySQL's dmp2utf8 tool to change the character set in the dmp file: http://www.anysql.net/tools/new_tool_dmp2utf8.html

On Unix, run the following command:

cat expdat.dmp | od -x | head

Oracle provides standard functions to convert Character Set names and IDs:

SQL> select nls_charset_id('zhs16gbk') from dual;

NLS_CHARSET_ID('ZHS16GBK')

--------------------------

852

SQL> select nls_charset_name(852) from dual;

NLS_CHAR

--------

ZHS16GBK

SQL> select to_char(nls_charset_id('zhs16gbk'), 'xxx') from dual; -- convert to hexadecimal

TO_CHAR ('8

----------

0354

This value matches the 2nd and 3rd bytes of the export file header, so we know that the character set of the export file is ZHS16GBK.

V. Analysis of issues related to character sets
5.1 Errors when running SQL statements in a UTF8 environment:
SQL*Plus does not provide automatic transcoding. When the database character set is UTF8 and the client's NLS_LANG is also UTF8, SQL statements that are entirely in English run in SQL*Plus without problems, but a statement containing Chinese or other special characters fails with an error. SQL*Plus also displays garbled characters for returned results that contain Chinese characters.

The cause is as follows. The Chinese characters in the statement are encoded in GBK, but because the client and server character sets match, Oracle performs no character conversion and sends the statement to the server as is. The server's character set is UTF8, so it fails when it parses the GBK-encoded characters as UTF8. If you set the client's NLS_LANG to the character set of the local environment, such as ZHS16GBK, you can type SQL statements containing Chinese characters directly in SQL*Plus: Oracle converts the statement to UTF8 when submitting it to the server, so it runs normally. English letters have the same encoded values in UTF8 as in ASCII and need no conversion, which is why all-English statements and output cause no errors.

When NLS_LANG is UTF8, the correct method is to put the SQL to be run into a script file and convert the file to UTF8 with an encoding conversion tool. (Note: Notepad in Windows XP can do this conversion. Choose Save As, select UTF-8 as the encoding in the dialog box, and save.) Open the script in IE and select "Encoding" - "UTF-8" to check whether it contains garbled characters or "?" symbols. If not, the file is encoded in UTF8 and running it in SQL*Plus will not produce errors.

If the output contains Chinese characters, SPOOL the result to a file, convert that file from UTF8 to the local encoding with the conversion tool, and open it in a text editor such as WordPad to see the Chinese characters displayed correctly. Because IE can convert encodings itself, you can also open the output file directly in IE and select UTF-8 to display the Chinese characters without a separate conversion tool.
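The failure mode in 5.1 can be imitated with plain byte strings. In this minimal sketch, Python's `utf-8` decoder stands in for the server parsing the statement and `gbk` for the local encoding; `parse_as_utf8` is a hypothetical helper, and a real server would raise an Oracle error rather than return a marker string.

```python
statement = "select '中国' from dual"

# What a GBK terminal actually sends: raw GBK bytes.
gbk_bytes = statement.encode("gbk")


def parse_as_utf8(data: bytes) -> str:
    """Stand-in for the server interpreting incoming bytes as UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return "<invalid UTF-8>"


# NLS_LANG says UTF8, so the GBK bytes arrive unconverted and cannot be parsed.
mislabelled = parse_as_utf8(gbk_bytes)

# NLS_LANG names the real local encoding, so the bytes are converted first.
converted = parse_as_utf8(gbk_bytes.decode("gbk").encode("utf-8"))

# Pure ASCII text is byte-identical in GBK and UTF-8, so it always parses.
ascii_only = parse_as_utf8("select 1 from dual".encode("gbk"))

print(mislabelled, converted, ascii_only, sep="\n")
```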

5.2 Garbled characters in the database:
Garbled characters in the database mainly involve three things: the user's local environment, the client's NLS_LANG setting, and the server-side database character set. If these three settings are inconsistent or incorrect, garbled characters can easily occur. The following situations are described:
5.2.1 Garbled characters caused by an improperly chosen database character set:
For example, suppose a database that stores Simplified Chinese uses US7ASCII as its character set. As long as its clients also set NLS_LANG to US7ASCII, the system works on its own: because the two settings match, Oracle performs no character set conversion. The GBK codes entered by the user are stored in the database unchanged, and a query returns the same GBK codes to the client, so the display is normal. The problems start when other systems need to read data from this database, or when its data must be exported with EXP and imported with IMP into other databases. Those systems generally use ZHS16GBK as their character set, or their clients set NLS_LANG to ZHS16GBK, so character set conversion becomes unavoidable. Although the database character set is US7ASCII, we know that the data actually stored is encoded in ZHS16GBK. Oracle does not know this: it treats the stored ZHS16GBK data as US7ASCII data and applies the US7ASCII-to-ZHS16GBK conversion to it. In this situation, garbled characters are inevitable.
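The scenario in 5.2.1 can be simulated as follows. This sketch uses Python's `ascii` and `gbk` codecs as stand-ins for US7ASCII and ZHS16GBK and replaces unmappable bytes with "?"; that replacement is an assumption for illustration, and Oracle's actual substitution may differ. The function name `convert` is hypothetical.

```python
# GBK bytes sitting in a database whose character set is declared as US7ASCII.
stored = "中国".encode("gbk")


def convert(data: bytes, declared: str, target: str) -> bytes:
    """Convert bytes from their declared character set to the target one.

    Bytes that are invalid in the declared set become '?', a simplification
    of what a real conversion layer would do.
    """
    text = data.decode(declared, errors="replace").replace("\ufffd", "?")
    return text.encode(target, errors="replace")


# Client and database both declare US7ASCII: no conversion, bytes pass through.
passthrough = stored
# Another system declares ZHS16GBK: the data is converted from "US7ASCII".
wrong = convert(stored, "ascii", "gbk")
# What would have happened had the database been declared as ZHS16GBK.
right = convert(stored, "gbk", "gbk")

print(passthrough.decode("gbk"), wrong, right.decode("gbk"))
```

The data is intact only while every participant repeats the same wrong declaration; the first honest conversion destroys it.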

5.2.2 Garbled characters caused by different database character set and client NLS_LANG settings:
For example, for a database that stores Simplified Chinese text, both the database character set and the client NLS_LANG can generally be set to ZHS16GBK. Using UTF8 as the database character set also works, because ZHS16GBK is a subset of UTF8. Oracle converts the encoding automatically whenever data is exchanged between the database and the client, and the data actually stored in the database is UTF8-encoded. In this case, other databases can also exchange data with this database normally, because Oracle converts the data automatically. In practice, however, when data in the ZHT16MSWIN950 character set from Traditional Chinese Windows XP was converted to the AL32UTF8 character set, some special characters and some uncommon Chinese characters became garbled. It was later confirmed that XP needed a font patch to be installed, which finally solved the problem.

5.2.3 Garbled characters caused by a client NLS_LANG that differs from the local environment:
Garbled characters generally occur when the client NLS_LANG and the local environment use different character sets, unless the local environment's character set is a subset of the character set in the client NLS_LANG. That is the case when the client NLS_LANG is set to UTF8. Because no operating system uses the Unicode character set directly, the character set of the user's local environment can only be one that supports a particular language, and it is a subset of UTF8. The following focuses on this situation.
Although the Windows kernel supports Unicode, Windows does not display Unicode characters directly and does not know which character set incoming text uses, so by default it interprets characters using the default code page. Text in any other encoding must first be converted to the character set of the system's current default code page.
The default code page in Windows is determined by the language and region selected in the Control Panel, which is part of the user's local environment. On Simplified Chinese Windows, the character encoding is GBK and the default code page is 936. On non-Windows operating systems, the default character set configured for the system serves as the user's local environment. In addition, most of the tools we use, such as WordPad and SQL*Plus, do not convert encodings. As a result, garbled characters are often encountered when data is entered or queried directly on the client; the data must be converted by the application or by a tool to be usable.

5. Experiment: for understanding character sets and garbled characters only.

Environment:

Client: Simplified Chinese Windows, Oracle 10g. The client character set is:

The server and client are on the same PC. The database character set is AL32UTF8 and the national character set is UTF8.

Note: For how to view the character sets of the client and the server, see the preceding sections.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.