Oracle Character Set
1, Introduction
ORACLE The database character set, which is Oracle globalization support (globalization), or national language Support (NLS), is used to store, process, and retrieve data in its native language and format. With globalization support, Oracle provides users with a familiar native environment for their databases, such as date formats, number formats, and storage sequences. Oracle can support multiple languages and character sets, where Oracle8i supports 48 languages, 76 national territories, 229 character sets, while Oracle9i supports 57 languages, 88 national territories, 235 character sets.
2, overview2.1Character Set Overview
In essence, according to a certain character coding scheme, a set of specific symbols are assigned to different numerical codes. The earliest supported encoding scheme for Oracle databases is US7ASCII.
Oracle the character set naming follows the following naming conventions:
That is: < language >< bits >< encoding >
Example: nls_lang= American_america. ZHS16GBK
For example: ZHS16GBK in GBK encoded format, 16-bit (two-byte) Simplified Chinese character set
2.2Character Set Super
When the encoded value of a character set (character set a) contains the encoded value of all another character set (character set B), and the same encoded value for both character sets represents the same character, the character set A is the super of character set B, or character set B is a subset of character set A.
oracle8i and Oracle9i are available in the official documentation for the Subset-super table (Subset-superset pairs), for example: WE8ISO8859P1 is a subset of we8mswin1252. Because US7ASCII is the oldest Oracle database encoding format, there are many character sets that are us7ascii, such as WE8ISO8859P1, zhs16cgb231280, and ZHS16GBK are superset of US7ASCII.
2.3oracleDatabase Character Set
The database character set is specified when the database is created and cannot normally be changed after it is created. When you create a database, you can specify the character set (CHARACTER set) and the national character set (Nation CHARACTERSET).
2.3.1Character Set
-
to save Stored char, Varchar2,clob, long, and other types of data
-
used to label such as table name, column names and PL/SQL variables, such as
-
national character Set
-
The national character set is essentially the additional character set chosen for Oracle, primarily to enhance the character processing capabilities of Oracle, because the NCHAR data type provides support for the use of fixed-length multibyte encodings in Asia, while the database character set does not. The national character set is redefined in oracle9i and can only be selected in Af16utf16 and UTF8 in Unicode encoding, and the default value is Af16utf16
-
You can query the following data dictionary or view to view the set of character sets
nls_database_parameters , props$, V$nls_parameters
Nls_characterset represents the character set in the query results, Nls_nchar_characterset represents the national character set
2.3.4Modifying the database character set
As stated above, the database character set cannot be changed in principle after it is created. If you need to modify the character set, you typically need to export the database data, rebuild the database, import the database data in a way to convert, or modify the character set through the ALTER DATABASE CHARACTER SET statement, but there is a limit to modifying the character set after the database is created. You can modify the database character set only if the new character set is a superset of the current character set, for example UTF8 is a superset of us7ascii, and modifying the database character set can be used to alter the CHARACTER set UTF8.
2.3.5Common Character Set garbled problem
This process has 3 character set settings:
(1) client application character Set
(2) client Nls_lang parameter setting
(3) server-side database character set (Character set) settings
What characters can be displayed in the client app Sqlplus depends on the client operating system locale (the client applies the character set), but when these characters are entered in the application, they can be stored properly in the database and are closely related to the other two character set settings. The client Nls_lang parameter is mainly used for conversion judgment during character data transmission.
garbled production is due to the conversion mismatch between several character sets, in the following cases:
(Note: If there is no subset or superset correspondence between the character sets, this is not considered because the conversion between character sets will have garbled characters in this case)
1 The server-side database character set is the same as the client app character set, unlike the client Nls_lang parameter settings
If the client Nls_lang character set is a subset of the other two character sets, the conversion process will be garbled.
Workaround: Set three character sets to the same character set, or the Nls_lang character set is a superset of the other two character sets
2 The server-side database character set is the same as the client Nls_lang parameter setting and differs from the client application character set
If the client application character set is a superset of the other two character sets, the conversion process will appear garbled
3 ) client application character set, client Nls_lang parameter setting, server side database character set are different
This situation is more complex, but there are characters that cannot be converted between three character sets, which will be garbled
There are generally two common garbled situations:
(1) Chinese characters become question marks "? ";
When converting from character set A to character set B, if there is no correspondence between the converted characters, Nls_lang uses the substitution character "? "Replaces characters that cannot be mapped;
(2) Chinese characters become unknown characters (although some are Chinese, but different from the original characters)
The conversion has a corresponding relationship, but the character encoding in character set a represents a different meaning than character encoding in character set B.
This process has 4 character set settings:
(1) Source database character Set
(2) Nls_lang parameters in exp process
(3) Nls_lang parameters in imp process
(4) Target database character set
garbled reason
1 when the source database character set is not equal to the Nls_lang parameter in the EXP process, and the source database character set is a subset of Nls_lang during the EXP process, the export file is guaranteed to be correct, and other cases the exported file characters are garbled
2 The Nls_lang character set in the EXP process is not equal to the Nls_lang character set in the IMP process, and the Nls_lang character set in the exp process is a child of the Nls_lang character set in the IMP process to ensure that the first conversion is normal or garbled in the first conversion.
3 If the first conversion is normal, the Nls_lang character set in the IMP process is a subset of the target database character set or the same, in order to ensure that the second conversion is normal, otherwise garbled in the second conversion.
3, current situation analysis
Character set statistics of several pilot network provinces currently implementing database resource pool
character set |
|
|
Al32utf8 |
15 |
17.86% |
UTF8 |
20 |
23.81% |
we8mswin1252 |
1 |
1.19% |
ZHS16GBK |
45 |
53.57% |
blank |
3 |
3.57% |
Total |
84 |
|
The above is the pilot network province currently running Oracle database sampling of a total of 84 sets, including the character set ZHS16GBK and UTF8 two share ratio of about 77.38%. Because of the diversity of character sets, there are problems that can not be written when the state network company sends data to the network Province company, and there are garbled problems in different database migrations in the network province.
4, Risk consequences
-
probably out Database data is now confusing.
-
-
may cause database tables char, VARCHAR2, clob long
-
can corrupt data files and lose data.
-
The operating system and database server character set are incompatible and are not easily detectable by data errors in the server.
-
can't deposit some special symbols.
5, Solution suggestions
database Character set cannot be changed in principle after creation . Therefore, it is important to consider which character set to use at the beginning of design and installation. For database server, the incorrect modification of the character set will result in a lot of unpredictable consequences that can seriously affect the normal functioning of the database, so be sure to verify that there are subsets and superset relationships between the two character sets before you modify them. In general, we do not recommend modifying the character set of the Oracle database server side unless it is a last resort. In particular, there is no subset and superset relationship between the two character sets ZHS16GBK and Al32utf8 that we use most often, so it is theoretically not supported to convert between the two character sets.
To avoid a character set that is not uniform, make the following recommendations:
For the new system, strictly in accordance with the National Network uniform standard character set;
For the old system, if the character set is the unified standard of the national network character set or subset, can be modified to maintain unity;
For the old system, not the national Network unified standard character set and subset, ZHS16GBK can be converted to Al32utf8, during the migration process, the client arbitrarily for one of the encoding, migration data will not appear garbled, but the column length is not enough, and vice versa;
For other character sets, in order to avoid data loss due to different character sets during database migration, Oracle provides the character Set scan Tool (character set scanner), which allows us to test the problems that can be caused by character set conversion during data migration Then, based on the test results, determine the best character set solution in the data migration process.
There are several ways to modify the database character set:
5.1modify from subset to Hyper-set
The character set is modified by the ALTER Databasecharacter SET statement, but it is limited to modify the character set after the database is created, and only if the new character set is a superset of the current character set can the database character set be modified, for example, UTF8 is a superset of us7ascii. Modifying the database character set can be used with the ALTER DB CHARACTER set UTF8.
5.2ModifyDmpfile
DMP the 2nd 3rd byte of the file records the character set information, so directly modifying the contents of the 2nd 3rd byte of the dmp file can ' cheat ' the Oracle check. This is done theoretically only from subset to superset can be modified, but in many cases there is no subset and superset of the situation can also be modified, some of our commonly used character sets, such as US7ASCII,WE8ISO8859P1,ZHS16CGB231280,ZHS16GBK basic can be changed. Because the change is only the DMP file, so the impact is not small.
The specific modification method is more, the simplest is to modify the DMP file's 2nd and 3rd bytes directly with UltraEdit.
For example, to change the DMP file character set to ZHS16GBK, you can use the following SQL to isolate the character set corresponding to the 16 code:sql> Select To_char (nls_charset_id (' ZHS16GBK '), ' xxxx ') from Dual
0354
Then change the DMP file 2, 3 bytes to 0354.
5.3 Epx/impModifying character sets
You need to export the database data, rebuild the database, and then import the database data in a way that transforms
Back up data for all users in the database
Assess the risk of conversion (using the Csscan tool)
Change the character set of a database or create a new database
Import the backed up user data
Note: Before converting, to use Oracle's Csscan tool to scan the database, evaluate the data for possible corruption before and after the character set conversion, if the evaluation is optimistic, you can make a conversion attempt, but no expert recommends this, even on the test database, no outside the "security" word If the assessment is bad, you must give up.
Oracle Character Set