This article covers the following topics:
1. How to view and query the Oracle character set
2. How to modify character sets, and the common Oracle UTF8 character sets
3. Character set problems with Oracle exp
Body:
1. Character set parameters
The most important parameter affecting the character set in an Oracle environment is NLS_LANG.
Its format is: NLS_LANG = language_territory.charset
It has three components (language, territory, and character set), each of which controls a subset of the NLS features.
Where:
Language: specifies the language of server messages, that is, whether prompts are shown in Chinese or English.
Territory: specifies the server's date and number formats.
Charset: specifies the character set.
For example: AMERICAN_AMERICA.ZHS16GBK
From the composition of NLS_LANG we can see that the part that really matters for the character set is the third one.
Therefore, if the third part is the same for two databases, data can be imported and exported between them. The first two parts only affect whether prompts are shown in Chinese or English.
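To make the three-part format concrete, here is a minimal Python sketch that splits an NLS_LANG value and compares only the character set part. The names parse_nls_lang and same_charset are hypothetical helpers invented for this illustration, and the sketch assumes a full language_territory.charset value (partial values are rejected).

```python
from typing import NamedTuple


class NlsLang(NamedTuple):
    language: str
    territory: str
    charset: str


def parse_nls_lang(value: str) -> NlsLang:
    """Split 'LANGUAGE_TERRITORY.CHARSET' into its three parts."""
    # The charset follows the last dot. The territory follows the last
    # underscore before that dot, so a language containing spaces
    # (for example 'SIMPLIFIED CHINESE') is handled correctly.
    left, _, charset = value.strip().rpartition(".")
    language, _, territory = left.rpartition("_")
    if not (language and territory and charset):
        raise ValueError(f"not a full NLS_LANG value: {value!r}")
    return NlsLang(language.upper(), territory.upper(), charset.upper())


def same_charset(a: str, b: str) -> bool:
    """True when two NLS_LANG values agree on the character set part."""
    return parse_nls_lang(a).charset == parse_nls_lang(b).charset


print(parse_nls_lang("American_America.zhs16gbk"))
print(same_charset("SIMPLIFIED CHINESE_CHINA.ZHS16GBK",
                   "AMERICAN_AMERICA.ZHS16GBK"))
```

The second call compares two values that differ in language and territory but share ZHS16GBK, which is the situation the paragraph above describes.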
How to view the database version
select * from v$version shows the version information, the core version information, and the word size (32-bit or 64-bit). On Linux/Unix platforms you can also check the word size with the file command, for example: file $ORACLE_HOME/bin/oracle
2. View database character sets
Database server character set: select * from nls_database_parameters. It is derived from props$ and represents the character set of the database.
Instance NLS environment: select * from nls_instance_parameters. It is derived from v$parameter
and shows the NLS settings of the instance as defined in the initialization parameter file. The client's own settings come from an environment variable or the registry.
Session character set environment: select * from nls_session_parameters. It is derived from v$nls_parameters and shows the session's own settings, which may come from the session's environment variables or from alter session. If the session has no special settings, it is consistent with nls_instance_parameters.
The client character set must be the same as the server's for non-ASCII characters in the database to display correctly. If multiple settings exist, the priority is: alter session > environment variable > registry > parameter file.
The character set must be consistent, but the language setting can differ. English is recommended for the language setting. For example, if the character set is ZHS16GBK, NLS_LANG can be AMERICAN_AMERICA.ZHS16GBK.
Three character sets are involved:
1. The character set of the Oracle server;
2. The character set of the Oracle client;
3. The character set of the DMP file.
During data import, these three character sets must be consistent for the data to be imported correctly.
2.1 Query the character set of the Oracle server
There are many ways to find the character set of the Oracle server. The most direct query is:
SQL> select userenv('language') from dual;
USERENV('LANGUAGE')
----------------------------------------------------
SIMPLIFIED CHINESE_CHINA.ZHS16GBK
SQL> select userenv('language') from dual;
AMERICAN_AMERICA.ZHS16GBK
2.2 Query the character set of a DMP file
A DMP file exported with Oracle's exp tool also contains character set information: the 2nd and 3rd bytes of the file record its character set. If the DMP file is small, for example a few MB or a few dozen MB, you can open it in UltraEdit (in hexadecimal mode) and read the 2nd and 3rd bytes, for example 0354. Then use the following SQL statement to find the corresponding character set:
SQL> select nls_charset_name(to_number('0354','xxxx')) from dual;
ZHS16GBK
If the DMP file is large, for example 2 GB or more (which is the most common case), opening it in a text editor is slow or impossible. In that case you can use the following command (on a Unix host):
cat exp.dmp | od -x | head -1 | awk '{print $2 $3}' | cut -c 3-6
Then use the SQL statement above to obtain the character set.
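As an alternative to the od pipeline, the same two bytes can be read with a few lines of Python. This is a sketch, not an official tool: dmp_charset_id and dmp_charset_hex are hypothetical names, and the small ID table is an assumption that should be verified with nls_charset_name() on a real database (only 0354 for ZHS16GBK comes from this article). One reason to prefer this approach is that od -x groups bytes into 16-bit words in host byte order, so the pipeline above produces the expected digits only on a big-endian host.

```python
import os
import tempfile

# Partial ID table. These values are assumptions for illustration;
# confirm them with nls_charset_name() before relying on them.
KNOWN_CHARSET_IDS = {
    0x0001: "US7ASCII",
    0x001F: "WE8ISO8859P1",
    0x0352: "ZHS16CGB231280",
    0x0354: "ZHS16GBK",
    0x0367: "UTF8",
    0x0369: "AL32UTF8",
}


def dmp_charset_id(path: str) -> int:
    """Return the character set ID stored in bytes 2 and 3 of the file."""
    with open(path, "rb") as f:
        header = f.read(3)  # only the first three bytes are needed
    if len(header) < 3:
        raise ValueError("file is too short to carry a character set ID")
    return int.from_bytes(header[1:3], "big")


def dmp_charset_hex(path: str) -> str:
    """Return the ID as four hex digits, ready for to_number(..., 'xxxx')."""
    return format(dmp_charset_id(path), "04x")


# Demo on a stand-in file (not a real export): first byte, then 03 54.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x03\x03\x54" + b"stand-in payload")
demo_path = tmp.name
print(dmp_charset_hex(demo_path),
      KNOWN_CHARSET_IDS.get(dmp_charset_id(demo_path), "unknown"))
os.remove(demo_path)
```

Because only three bytes are read, the size of the DMP file does not matter.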
2.3 Query the character set of the Oracle client
On Windows, it is the NLS_LANG value under the corresponding Oracle home in the registry. You can also set it in a DOS window,
for example: set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
This affects only the environment variables of that window.
On Unix platforms, it is the environment variable NLS_LANG.
$ echo $NLS_LANG
AMERICAN_AMERICA.ZHS16GBK
If the check shows that the server and client character sets are inconsistent, change the client's character set to match the server's.
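The point that a setting made in one window affects only that window can be illustrated with a child process. The sketch below is an illustration only: run_with_nls_lang is a hypothetical helper, and the child is a small Python one-liner standing in for a client such as sqlplus or exp.

```python
import os
import subprocess
import sys


def run_with_nls_lang(nls_lang: str, argv: list) -> str:
    """Run a command with NLS_LANG set only for that child process."""
    # Copy the current environment and override NLS_LANG in the copy,
    # so the parent process environment is left untouched.
    env = dict(os.environ, NLS_LANG=nls_lang)
    result = subprocess.run(argv, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


before = os.environ.get("NLS_LANG")
child = [sys.executable, "-c", "import os; print(os.environ['NLS_LANG'])"]
out = run_with_nls_lang("AMERICAN_AMERICA.ZHS16GBK", child)
print("child sees:", out.strip())
print("parent unchanged:", os.environ.get("NLS_LANG") == before)
```

A script that wraps exp or imp could use the same pattern to pin the client character set for one run without touching the user's profile or the registry.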
Supplement:
(1) Database server character set
select * from nls_database_parameters
It is derived from props$ and indicates the character set of the database.
(2) Instance NLS environment
select * from nls_instance_parameters
It is derived from v$parameter and shows the NLS settings of the instance as defined in the initialization parameter file. The client's own settings come from an environment variable or the registry.
(3) Session character set environment
select * from nls_session_parameters
It is derived from v$nls_parameters and shows the session's own settings, which may come from the session's environment variables or from alter session. If the session has no special settings, it is consistent with nls_instance_parameters.
(4) Non-ASCII characters in the database are displayed correctly only when the client character set is the same as the server's.
If multiple settings exist, the NLS priority is: SQL function > alter session > environment variable or registry > parameter file > database default parameters.
3. Modify the Oracle character set
From 8i onward, you can use alter database to modify the character set, but only from a subset to a superset. Modifying the props$ table directly is not recommended, as it may cause serious errors.
startup nomount;
alter database mount exclusive;
alter system enable restricted session;
alter system set job_queue_processes=0;
alter database open;
alter database character set zhs16gbk;
In principle, the database character set cannot be changed after the database is created. It is therefore important to decide which character set to use at design and installation time. On a database server, an incorrect character set change can have many unpredictable consequences and may seriously affect normal operation, so before making a change, check whether the two character sets have a subset-superset relationship. In general, we do not recommend changing the character set of an Oracle database server unless you have no other choice. In particular, the two most commonly used character sets, ZHS16GBK and ZHS16CGB231280, do not have a subset-superset relationship, so in theory conversion between them is not supported.
However, there are two ways to modify the character set.
1. Export the database data, recreate the database, and then import the data. This is the usual way to convert.
2. Use the alter database character set statement. However, there are restrictions on changing the character set after the database is created: the change is possible only when the new character set is a superset of the current one. For example, UTF8 is a superset of US7ASCII, so a US7ASCII database can be changed with alter database character set utf8.
3.1 Modify the server character set (not recommended)
1. Shut down the database
SQL> shutdown immediate
2. Start the database in mount state
SQL> startup mount;
SQL> alter system enable restricted session;
SQL> alter system set job_queue_processes=0;
SQL> alter system set aq_tm_processes=0;
SQL> alter database open;
-- Allowed directly when the change is from a subset to a superset
SQL> alter database character set zhs16gbk;
SQL> alter database national character set al16utf16;
-- Otherwise, use the internal_use keyword to skip the subset-superset check. Note that existing data is not converted.
SQL> alter database character set internal_use al32utf8;
SQL> alter database national character set internal_use al16utf16;
SQL> shutdown immediate;
SQL> startup
NOTE: If there are no large objects, the conversion has no impact during use. (Remember that the character set must be one supported by Oracle; otherwise the database cannot be started.)
If a message such as 'ORA-12717: Cannot ALTER DATABASE NATIONAL CHARACTER SET when NCLOB data exists' appears,
there are two ways to solve the problem:
1. Use the internal_use keyword to modify the national character set,
2. Recreate the database. Because recreating is more complicated, internal_use is used here.
SQL> shutdown immediate;
SQL> startup mount exclusive;
SQL> alter system enable restricted session;
SQL> alter system set job_queue_processes=0;
SQL> alter system set aq_tm_processes=0;
SQL> alter database open;
SQL> alter database national character set internal_use utf8;
SQL> shutdown immediate;
SQL> startup;
If you follow the steps above, the national character set will be set correctly.
3.2 Modify the DMP file character set
As mentioned above, the 2nd and 3rd bytes of the DMP file record the character set information, so you can modify these two bytes directly to "cheat" Oracle's check. In theory this should be done only from a subset to a superset, but in many cases it also works without a subset-superset relationship. Some commonly used character sets, such as US7ASCII, WE8ISO8859P1, ZHS16CGB231280, and ZHS16GBK, can be changed this way. Because only the DMP file is changed, the impact is small.
There are many ways to make the change. The simplest is to edit the 2nd and 3rd bytes of the DMP file directly in UltraEdit.
For example, to change the DMP character set to ZHS16GBK, use the following SQL statement to find the hexadecimal code of that character set: SQL> select to_char(nls_charset_id('ZHS16GBK'), 'xxxx') from dual;
0354
Change the 2nd and 3rd bytes of the DMP file to 0354.
If the DMP file is too large to open in UltraEdit, you need to make the change with a program.
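One possible form of such a program is sketched below in Python. It is a minimal illustration under stated assumptions: patch_dmp_charset is a hypothetical name, the function works on a copy so the original export stays intact, and it changes only the two header bytes described above without converting any data in the file.

```python
import os
import shutil
import tempfile


def patch_dmp_charset(src: str, dst: str, charset_id: int) -> None:
    """Copy src to dst and overwrite bytes 2 and 3 with charset_id."""
    if not 0 <= charset_id <= 0xFFFF:
        raise ValueError("character set ID must fit in two bytes")
    shutil.copyfile(src, dst)  # streams the file, so size is not a problem
    with open(dst, "r+b") as f:
        f.seek(1)  # offset 1 is the 2nd byte of the file
        f.write(charset_id.to_bytes(2, "big"))


# Demo on a stand-in file (not a real export).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "exp.dmp")
dst = os.path.join(workdir, "exp_patched.dmp")
with open(src, "wb") as f:
    f.write(b"\x03\x00\x01" + b"stand-in payload")
patch_dmp_charset(src, dst, 0x0354)  # 0354 is ZHS16GBK per the query above
with open(dst, "rb") as f:
    print(f.read(3).hex())
shutil.rmtree(workdir)
```

Writing in place with r+b mode touches only two bytes, which is why this works even for files too large to open in an editor.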
3.3 Setting the client character set
1) Unix environment
$ NLS_LANG="SIMPLIFIED CHINESE"_CHINA.ZHS16GBK
$ export NLS_LANG
Or edit the profile file of the oracle user.
2) Windows
Edit the registry:
regedit.exe --- HKEY_LOCAL_MACHINE --- SOFTWARE --- ORACLE-HOME
Or set it in a command window:
set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
4. Background on character sets
4.1 Character set
In essence, a character set assigns a numeric code to each symbol in a specific set of symbols, according to a certain character encoding scheme. The earliest encoding scheme supported by Oracle Database is US7ASCII.
Oracle character set names follow this rule:
<Language><bit size><encoding>
For example, ZHS16GBK denotes a Simplified Chinese character set that uses 16 bits (two bytes) and the GBK encoding.
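The naming rule can be applied mechanically. The following Python sketch splits a name into its three parts with a regular expression. parse_charset_name is a hypothetical helper, and the pattern is an assumption based only on the rule stated above, so names that do not follow the rule (UTF8, for example) simply return None.

```python
import re
from typing import Optional, Tuple

# <language letters><bit size digits><encoding name>
_NAME_PATTERN = re.compile(r"^([A-Z]+)(\d+)(.+)$")


def parse_charset_name(name: str) -> Optional[Tuple[str, int, str]]:
    """Split an Oracle-style character set name into its three parts."""
    match = _NAME_PATTERN.match(name.upper())
    if match is None:
        return None  # the name does not follow the naming rule
    language, bits, encoding = match.groups()
    return language, int(bits), encoding


for name in ("ZHS16GBK", "WE8ISO8859P1", "US7ASCII", "AL32UTF8", "UTF8"):
    print(name, "->", parse_charset_name(name))
```

The language part is matched as letters only, which is why WE8ISO8859P1 splits after WE and the digits inside ISO8859P1 stay in the encoding part.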
4.2 Character encoding schemes
4.2.1 Single-byte encoding
(1) Single-byte 7-bit character sets, which can define 128 characters. The most common is US7ASCII.
(2) Single-byte 8-bit character sets, which can define 256 characters and are suitable for most European countries.
Example: WE8ISO8859P1 (Western Europe, 8-bit, ISO 8859 Part 1 encoding)
4.2.2 Multi-byte encoding
(1) Variable-length multi-byte encoding
Some characters are represented by one byte and others by two or more bytes. Variable-length multi-byte encoding is commonly used for Asian languages such as Japanese, Chinese, and Hindi.
Examples: AL32UTF8 (where AL stands for "all", meaning it applies to all languages), ZHS16CGB231280
(2) Fixed-length multi-byte encoding
Every character is encoded with the same fixed number of bytes. Currently, the only fixed-length multi-byte encoding supported by Oracle is AL16UTF16, which is used only as a national character set.
4.2.3 Unicode encoding
Unicode is a single encoding scheme that covers all known characters in use around the world; that is, Unicode gives every character a unique code. UTF-16 is the 16-bit encoding form of Unicode. Oracle treats it as a fixed-length multi-byte encoding in which 2 bytes represent one Unicode character (characters outside the Basic Multilingual Plane take a 4-byte surrogate pair). AL16UTF16 is the UTF-16 character set.
UTF-8 is the 8-bit encoding form of Unicode and is a variable-length multi-byte encoding that uses 1, 2, or 3 bytes for a character in the Basic Multilingual Plane (AL32UTF8 uses 4 bytes for supplementary characters). AL32UTF8, UTF8, and UTFE are the UTF-8 based character sets; UTFE is the variant for EBCDIC platforms.
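The difference between variable-length and fixed-width encodings is easy to observe by counting bytes per character. The sketch below uses Python codecs as stand-ins for the Oracle character sets (utf-8 for AL32UTF8, utf-16-be for AL16UTF16, gbk for ZHS16GBK); that mapping is an assumption made for illustration, and byte_lengths is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def byte_lengths(text: str,
                 encodings: Sequence[str] = ("utf-8", "utf-16-be", "gbk"),
                 ) -> Dict[str, List[int]]:
    """Return, per encoding, the number of bytes used by each character."""
    return {enc: [len(ch.encode(enc)) for ch in text] for enc in encodings}


# 'A' is ASCII, the second character is a common Chinese character.
print(byte_lengths("A\u4e2d"))

# A supplementary character (outside the Basic Multilingual Plane).
# GBK cannot encode it, so only the Unicode encodings are checked.
print(byte_lengths("\U00020000", ("utf-8", "utf-16-be")))
```

In the first call UTF-8 and GBK use a different number of bytes for the two characters, while UTF-16 uses two bytes for both; the second call shows the 4-byte case for supplementary characters.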
4.3 Character set supersets
When the encoded values of one character set (character set A) include all the encoded values of another character set (character set B), and the same encoded value represents the same character in both, character set A is called a superset of character set B, and character set B a subset of character set A.
The official Oracle8i and Oracle9i documentation provides a table of subset-superset pairs. For example, WE8ISO8859P1 is a subset of WE8MSWIN1252. Because US7ASCII is the earliest Oracle Database encoding, many character sets are supersets of US7ASCII, for example WE8ISO8859P1, ZHS16CGB231280, and ZHS16GBK.
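The definition above can be checked by brute force when the candidate subset is a single-byte character set. The following Python sketch does that with Python codecs as stand-ins (ascii for US7ASCII, latin-1 for WE8ISO8859P1, gbk for ZHS16GBK). Both the stand-in mapping and the helper name is_superset are assumptions for illustration; Oracle's own subset-superset table remains the authority for real migrations.

```python
def is_superset(superset: str, subset: str) -> bool:
    """Check the superset definition for a single-byte subset encoding.

    Every byte value that the subset defines must decode to the same
    character in the superset.
    """
    for code in range(256):
        raw = bytes([code])
        try:
            expected = raw.decode(subset)
        except UnicodeDecodeError:
            continue  # this code is not defined in the subset
        try:
            if raw.decode(superset) != expected:
                return False  # same code, different character
        except UnicodeDecodeError:
            return False  # code defined in subset but not in superset
    return True


print(is_superset("latin-1", "ascii"))
print(is_superset("gbk", "ascii"))
print(is_superset("ascii", "latin-1"))
```

The check only walks single-byte codes, so it should not be used with a multi-byte encoding in the subset position.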
4.4 Database character set (Oracle server character set)
The database character set is specified when the database is created and normally cannot be changed afterwards. When creating a database you can specify both the character set and the national character set.
4.4.1 Character set
(1) Used to store CHAR, VARCHAR2, CLOB, LONG, and other such data types
(2) Used for identifiers such as table names, column names, and PL/SQL variables
(3) Used to store SQL and PL/SQL program units
4.4.2 National character set
(1) Used to store NCHAR, NVARCHAR2, NCLOB, and other such data types
(2) The national character set is essentially an additional character set for the Oracle database. It mainly enhances Oracle's character handling: the NCHAR data types can support the fixed-length multi-byte encodings used in Asia, which the database character set cannot. The national character set was redefined in Oracle9i and can only be chosen from the Unicode encodings AL16UTF16 and UTF8. The default is AL16UTF16.
4.4.3 Query character set parameters
You can query the following data dictionary tables or views to see the character set settings:
nls_database_parameters, props$, v$nls_parameters
In the query results, NLS_CHARACTERSET is the character set and NLS_NCHAR_CHARACTERSET is the national character set.
4.4.4 Modify the database character set
As mentioned above, the database character set cannot in principle be changed after the database is created. However, there are two feasible methods.
1. Export the database data, recreate the database, and then import the data. This is the usual way to convert.
2. Use the alter database character set statement. However, there are restrictions on changing the character set after the database is created: the change is possible only when the new character set is a superset of the current one. For example, UTF8 is a superset of US7ASCII, so a US7ASCII database can be changed with alter database character set utf8.
4.5 Client character set (NLS_LANG parameter)
4.5.1 Meaning of the client character set
The client character set defines how client character data is encoded. Any character data sent from or to the client is encoded in the character set defined by the client. A client here means any application that connects directly to the database, for example sqlplus or exp/imp. The client character set is set through the NLS_LANG parameter.
4.5.2 NLS_LANG parameter format
NLS_LANG=<language>_<territory>.<client character set>
Language: the language used for Oracle messages, sorting, and day and month names
Territory: the default date, number, currency, and other formats
Client character set: the character set the client will use
Example: NLS_LANG=AMERICAN_AMERICA.US7ASCII
AMERICAN is the language, AMERICA is the territory, and US7ASCII is the client character set.
4.5.3 Setting the client character set
1) Unix environment
$ NLS_LANG="SIMPLIFIED CHINESE"_CHINA.ZHS16GBK
$ export NLS_LANG
Or edit the profile file of the oracle user.
2) Windows
Edit the registry:
regedit.exe --- HKEY_LOCAL_MACHINE --- SOFTWARE --- ORACLE-HOME
4.5.4 Querying NLS parameters
Oracle provides a number of NLS parameters for adapting the database and the user's machine to local formats, for example NLS_LANGUAGE, NLS_DATE_FORMAT, and NLS_CALENDAR. You can query them through the following data dictionary views and V$ views.
nls_database_parameters: shows the current NLS parameter values of the database, including the database character set
nls_session_parameters: shows the parameters set by NLS_LANG or changed by alter session (not including the client character set set by NLS_LANG)
nls_instance_parameters: shows the parameters defined in the parameter file init<SID>.ora
v$nls_parameters: shows the current NLS parameter values of the database
4.5.5 Modifying NLS parameters
You can modify NLS parameters in the following ways:
(1) Modify the initialization parameter file used at instance startup
(2) Modify the environment variable NLS_LANG
(3) Use the alter session statement
(4) Use certain SQL functions
NLS priority: SQL function > alter session > environment variable or registry > parameter file > database default parameters
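The priority order can be read as a simple lookup: walk the levels from highest to lowest and take the first one that has a value. The Python sketch below models only that rule. The level names and the function effective_setting are hypothetical, and nothing here queries a real database.

```python
from typing import Mapping, Optional, Tuple

# Highest priority first, following the order stated above.
PRIORITY = (
    "sql_function",
    "alter_session",
    "environment_or_registry",
    "parameter_file",
    "database_default",
)


def effective_setting(settings: Mapping[str, Optional[str]]) -> Tuple[str, str]:
    """Return (level, value) of the highest-priority level that is set."""
    for level in PRIORITY:
        value = settings.get(level)
        if value is not None:
            return level, value
    raise LookupError("no level provides a value")


# Example with a date format set at three different levels.
date_format = {
    "database_default": "DD-MON-RR",
    "parameter_file": "DD/MM/YYYY",
    "alter_session": "YYYY-MM-DD",
}
print(effective_setting(date_format))
```

In this example the alter session value wins over the parameter file and the database default, as the priority line above says it should.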
5. exp/imp and character sets
5.1 exp/imp
Export and Import are a pair of Oracle tools for reading and writing data. Export writes data from an Oracle database to an operating system file; Import reads data from such a file into an Oracle database. Because exp/imp is used for data migration, four character sets are involved on the way from the source database to the target database. If these four are not consistent, character set conversion takes place.
EXP
__________________________________________
| DMP file | <- | environment variable NLS_LANG | <- | database character set |
------------------------------------------
IMP
__________________________________________
| DMP file | -> | environment variable NLS_LANG | -> | database character set |
------------------------------------------
The four character sets are:
(1) The source database character set
(2) The user session character set during export (set through NLS_LANG)
(3) The user session character set during import (set through NLS_LANG)
(4) The target database character set
5.2 Conversion during export
During export, if the source database character set differs from the export user session character set, a character set conversion takes place, and the ID of the export user session character set is stored in a few bytes in the header of the export file. Data may be lost in this conversion.
For example, if the source database uses ZHS16GBK and the export user session uses US7ASCII, then because ZHS16GBK is a 16-bit character set and US7ASCII is a 7-bit character set, Chinese characters have no equivalent in US7ASCII. All Chinese characters are therefore lost and become "?", and the resulting DMP file has already lost data.
Therefore, to export the source data correctly, the user session character set during export should be equal to the source database character set or a superset of it.
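The loss described above can be reproduced outside Oracle. The Python sketch below converts GBK-encoded text to ASCII with replacement characters, using Python codecs as stand-ins for ZHS16GBK and US7ASCII. It imitates the effect only and is not how exp is implemented; convert is a hypothetical helper.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings, replacing unmappable characters."""
    # errors="replace" substitutes '?' for characters that the target
    # encoding cannot represent.
    return data.decode(source).encode(target, errors="replace")


original = "abc\u4e2d\u6587".encode("gbk")  # 'abc' plus two Chinese characters
exported = convert(original, "gbk", "ascii")
restored = convert(exported, "ascii", "gbk")

print(exported)
print(restored == original)
```

Converting back does not recover the Chinese characters, because the "?" placeholders are all that was written to the output.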
5.3 Conversion during import
(1) Determine the character set of the export file
The character set of the export file can be obtained by reading the file header.
(2) Determine the character set of the import session, that is, the NLS_LANG environment variable used by the import session.
(3) imp reads the export file
It reads the character set ID of the export file and compares it with the NLS_LANG of the import process.
(4) If the export file character set is the same as the import session character set, no conversion is needed at this step. If they differ, the data must be converted to the import session character set. As a result, two character set conversions can occur while importing data into the database.
First: between the character set of the export file and the character set of the import session. If this conversion cannot be completed correctly, the import into the target database cannot be completed.
Second: between the import session character set and the target database character set.
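The import steps above amount to two comparisons. The following Python sketch lists which conversions would occur for a given combination of file, session, and database character sets. It models only the decision logic described in this section; import_conversions is a hypothetical helper and performs no actual conversion.

```python
from typing import List, Tuple


def import_conversions(file_charset: str,
                       session_charset: str,
                       database_charset: str) -> List[Tuple[str, str]]:
    """Return the (from, to) conversions that occur during an import."""
    steps = []
    # First conversion: export file -> import session (NLS_LANG).
    if file_charset.upper() != session_charset.upper():
        steps.append((file_charset, session_charset))
    # Second conversion: import session -> target database.
    if session_charset.upper() != database_charset.upper():
        steps.append((session_charset, database_charset))
    return steps


print(import_conversions("ZHS16GBK", "ZHS16GBK", "ZHS16GBK"))
print(import_conversions("ZHS16GBK", "ZHS16GBK", "AL32UTF8"))
print(import_conversions("US7ASCII", "ZHS16GBK", "AL32UTF8"))
```

Setting NLS_LANG for the import session to the character set of the export file removes the first conversion, which leaves at most one conversion during the import.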
About the Oracle character set (copied from other sources, purely for self-study)