A practical way to provide Unicode support for DB2 UDB for Linux, UNIX, and Windows

Brief introduction

Today's applications are often designed for international use. These applications may need to handle strings in different languages. Unicode is a language-independent character representation standard.

Because the Java programming language already uses Unicode internally to represent characters, developing internationalized applications in Java is much easier. However, you cannot consider only the application side. The back-end database must also be able to handle Unicode characters. This article discusses several topics that help developers build internationalized DB2 UDB applications.

What Unicode standards are supported in DB2?

There is only one Unicode standard, but there are different Unicode character encoding schemes. The most common Unicode character encodings are UTF-8 and UCS-2:

UTF-8: Uses 1 to 4 bytes to represent the encoding of each character. This encoding scheme encodes ASCII characters in one byte and encodes non-ASCII characters with multiple bytes (up to 4 bytes).

UCS-2: Each Unicode character is encoded in 2 bytes. This encoding scheme can represent more than 65,000 characters, which covers most characters of the world's major languages. Java's char type is likewise a 16-bit unit: early Java versions used UCS-2 internally, and since J2SE 5.0 Java strings are UTF-16.

The complete Unicode standard contains more characters than the roughly 65,000 that UCS-2 can represent. The characters outside this range are mostly ones that are no longer in use or that have limited use. For example, some Asian characters are used only in names. These extra characters, known as supplementary characters, can be represented in UTF-8 with 4 bytes. There is also an encoding scheme called UTF-16 that can represent supplementary characters. To do this, UTF-16 uses a pair of 2-byte units (4 bytes in total) for each supplementary character.
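The byte counts described above can be checked directly from Java. The following is a minimal sketch, not part of the original article: the class name EncodingSizes, its helper methods, and the four sample characters are chosen here purely for illustration. It encodes one ASCII character, one accented Latin character, one CJK ideograph, and one supplementary character, and prints how many bytes each needs in UTF-8 and in UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only for this illustration.
final class EncodingSizes {
    // Number of bytes needed to encode the text in UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, ASCII
            "\u00E9",                               // U+00E9, Latin small letter e with acute
            "\u4E2D",                               // U+4E2D, a CJK ideograph
            new String(Character.toChars(0x20000))  // U+20000, a supplementary character
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s));
        }
    }
}
```

For the first three samples a 2-byte unit is enough, which is why UCS-2 can hold them. Only the supplementary character needs 4 bytes in both encodings.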

DB2 UDB supports only the UTF-8 and UCS-2 encodings. Although DB2 does not provide specific support for supplementary characters, they can still be stored in DB2 UDB. Note that the more than 65,000 characters available without them are sufficient for most applications. Processing supplementary characters in a Java application requires a special mechanism. Handling them therefore requires application support as well as database support.
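As a rough illustration of what that special mechanism can look like on the Java side, the sketch below uses the standard code point methods of String and Character that are available from J2SE 5.0 onward (the codePoints() stream requires Java 8 or later). The class name SupplementaryChars and the sample string are assumptions made for this example only. It shows that String.length() counts 16-bit units rather than characters, and that iterating by code point avoids splitting a supplementary character in two.

```java
// Hypothetical demo class, used only for this illustration.
final class SupplementaryChars {
    // Counts Unicode characters (code points) rather than 16-bit char units.
    static int characterCount(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+20000, which occupies two char units in a Java String.
        String text = "A" + new String(Character.toChars(0x20000));

        System.out.println("char units:  " + text.length());
        System.out.println("code points: " + characterCount(text));

        // Iterate by code point so the two-unit character is never split.
        text.codePoints().forEach(cp ->
                System.out.printf("U+%04X supplementary=%b%n",
                        cp, Character.isSupplementaryCodePoint(cp)));
    }
}
```

Code that indexes strings with charAt or truncates them by length is where supplementary characters usually break, so those are the places an application would need to review.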

How to use Unicode encoding schemes to store data in DB2

You can store Unicode data in a DB2 Unicode database. In a Unicode database, all tables use Unicode encoding schemes to store character data. DB2 also allows character data to be stored in Unicode format in Unicode tables within non-Unicode databases.

Defining a Unicode Database

Unicode databases store character data in Unicode format but do not display data in Unicode format. When you create a database, DB2 determines the code page for the database. The code page of a database can be determined implicitly or explicitly.

DB2 UDB for Linux, UNIX, and Windows can implicitly determine the code page from the environment. When a database is created with the CREATE DATABASE statement, the code page of the database is derived from the locale setting of the operating system. On Windows operating systems, the code page is derived from the ANSI code page setting in the registry. On UNIX operating systems, it is derived from the locale, which includes the language, territory, and code set. Note that the DB2CODEPAGE registry variable can be used to override the code page. However, setting DB2CODEPAGE to an incorrect value can cause unpredictable results and potential data loss.

You can also explicitly specify a code page in the CREATE DATABASE statement with the USING CODESET clause. A CREATE DATABASE statement with a USING CODESET UTF-8 clause indicates that the database can contain UTF-8 or UCS-2 encoded character data. For clarity, it is recommended that you define a Unicode database explicitly:

CREATE DATABASE db_name USING CODESET codeset
TERRITORY territory_name COLLATE USING collating_sequence

Here, codeset should be specified in uppercase characters, such as UTF-8; territory_name is the territory code that provides region-specific support; and collating_sequence specifies the method used to compare character data.

For example:

CREATE DATABASE UCSAMPLE USING CODESET UTF-8 TERRITORY US

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
