Brief introduction
Today's applications are often designed for international use. These applications may need to handle strings in different languages. Unicode is a language-independent character representation standard.
Because the Java programming language already uses Unicode internally to represent characters, developing internationalized applications is much easier. However, the application side is not the only consideration: the back-end database must also be able to handle Unicode characters. This article discusses several topics that help developers implement internationalized DB2 UDB applications.
What Unicode standards are supported in DB2?
There is only one Unicode standard, but there are different Unicode character encoding schemes. The most common Unicode character encodings are UTF-8 and UCS-2:
UTF-8: Uses 1 to 4 bytes to represent the encoding of each character. This encoding scheme encodes ASCII characters in one byte and encodes non-ASCII characters with multiple bytes (up to 4 bytes).
UCS-2: Each Unicode character is encoded in 2 bytes. This encoding scheme can represent more than 65,000 characters, which covers most characters of the world's major languages. Java also represents characters internally as 16-bit units (the char type).
The complete Unicode standard defines more characters than the roughly 65,000 that UCS-2 can represent. The characters outside this range are mainly ones that are no longer in use or have limited use; for example, some Asian characters are used only in names. These additional characters, known as supplementary characters, can be represented in UTF-8 with 4 bytes. Another encoding scheme, UTF-16, can also represent supplementary characters: it encodes each one as a pair of 2-byte units (a surrogate pair), 4 bytes in total.
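The byte counts described above can be checked directly in Java. The following is a minimal sketch, not part of the original article; the class name EncodingLengths and its helper methods are chosen here for illustration only. It encodes one ASCII character, two non-ASCII characters, and one supplementary character (U+20000) and prints how many bytes each occupies in UTF-8 and UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): measures encoded sizes of strings.
public class EncodingLengths {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16
    // (big-endian, without a byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";          // U+0041, an ASCII character
        String latin = "\u00E9";     // U+00E9, a non-ASCII Latin character
        String cjk = "\u4E2D";       // U+4E2D, a character within the 16-bit range
        // U+20000 lies outside the 16-bit range, so it is a supplementary character.
        String supplementary = new String(Character.toChars(0x20000));

        for (String s : new String[] {ascii, latin, cjk, supplementary}) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The supplementary character takes 4 bytes in both encodings, while the other three characters each fit in a single 2-byte UTF-16 unit.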
DB2 UDB supports only the UTF-8 and UCS-2 encodings. Although DB2 does not support supplementary characters, they can still be stored in DB2 UDB. Note that the 65,000-plus characters covered by UCS-2 are sufficient for most applications. Processing supplementary characters in a Java application requires a special mechanism, so handling them requires application support as well as database support.
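As an illustration of the application-side handling mentioned above, the sketch below uses the code point methods of the standard Java library (String.codePointCount, CharSequence.codePoints, and Character.isSupplementaryCodePoint, available in Java 8 and later). The class name SupplementaryChars and its methods are hypothetical names introduced for this example; the article does not prescribe a particular mechanism.

```java
// Illustrative helper (hypothetical name): works with code points instead of char units.
public class SupplementaryChars {

    // Counts Unicode characters (code points) rather than 16-bit char units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if the string contains at least one supplementary character,
    // that is, a character that needs two 16-bit units in a Java String.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // "A" followed by the supplementary character U+20000.
        String text = "A" + new String(Character.toChars(0x20000));

        System.out.println("char units:        " + text.length());
        System.out.println("code points:       " + countCodePoints(text));
        System.out.println("has supplementary: " + hasSupplementary(text));
    }
}
```

An application could run a check like hasSupplementary before writing a string to the database, so that it can decide how to treat characters that DB2 stores but does not otherwise support.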
How to use Unicode encoding schemes to store data in DB2
You can store Unicode data in a DB2 Unicode database. In a Unicode database, all tables store character data using Unicode encoding schemes. DB2 also allows character data to be stored in Unicode format in Unicode tables within non-Unicode databases.
Defining a Unicode Database
A Unicode database stores character data in Unicode format; this concerns how the data is stored, not how it is displayed. When you create a database, DB2 determines the code page for the database. The code page of a database can be determined implicitly or explicitly.
DB2 UDB for Linux, UNIX, and Windows can implicitly determine the code page from the environment. When a database is created with the CREATE DATABASE statement, the code page of the database is derived from the locale setting of the operating system. On Windows operating systems, the code page is derived from the ANSI code page setting in the registry. On UNIX operating systems, it is derived from the locale environment, which includes the language, territory, and code set. Note that the registry variable DB2CODEPAGE can be used to override the code page. However, if DB2CODEPAGE is set to an incorrect value, it can cause unpredictable results and potential data loss.
You can also explicitly specify a code page in the CREATE DATABASE statement with the USING CODESET clause. A CREATE DATABASE statement with a USING CODESET UTF-8 clause indicates that the database can contain UTF-8 or UCS-2 encoded character data. For clarity, it is recommended that you define a Unicode database explicitly:
CREATE DATABASE db_name USING CODESET codeset
TERRITORY territory_name COLLATE USING collating sequence
The codeset should be specified in uppercase characters, such as UTF-8. territory_name is the territory code that provides region-specific support, and collating sequence specifies the method used to compare character data.
For example:
CREATE DATABASE UCSAMPLE USING CODESET UTF-8 TERRITORY US
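If such statements are generated by a tool or script, a small helper can keep the code set and territory in uppercase, as recommended above. The following Java sketch is only an illustration: the class CreateDatabaseCommand, its build method, and the simple name check (letters, digits, and underscores) are assumptions made for this example and are not part of DB2 or of the original article. It only assembles the statement text and does not execute it.

```java
import java.util.Locale;

// Illustrative helper (hypothetical name): builds the text of a CREATE DATABASE statement.
public class CreateDatabaseCommand {

    // Assembles a statement of the form shown in the article:
    // CREATE DATABASE db_name USING CODESET codeset TERRITORY territory_name
    static String build(String dbName, String codeset, String territory) {
        // Assumption for this sketch: accept only simple identifiers as database names.
        if (dbName == null || !dbName.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Unexpected database name: " + dbName);
        }
        if (codeset == null || codeset.isEmpty() || territory == null || territory.isEmpty()) {
            throw new IllegalArgumentException("Code set and territory are required");
        }
        // Normalize code set and territory to uppercase, independent of the default locale.
        String cs = codeset.toUpperCase(Locale.ROOT);
        String terr = territory.toUpperCase(Locale.ROOT);
        return "CREATE DATABASE " + dbName + " USING CODESET " + cs + " TERRITORY " + terr;
    }

    public static void main(String[] args) {
        // Reproduces the example statement from the article.
        System.out.println(build("UCSAMPLE", "utf-8", "us"));
    }
}
```

The resulting text can then be passed to whatever mechanism is used to run DB2 commands in a given environment.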