How to Build a GBK-to-Unicode Mapping Table

Last Update:2017-02-28 Source: Internet

Author: User


Some time ago, a project I was working on ran into a transcoding failure between Unicode and GB encodings: some rarely used characters were converted to "??" and the Chinese text did not display. I did some research on the problem and eventually solved it. Building on the two earlier articles on Unicode and GB fundamentals, this article describes a method for building a GBK-to-Unicode mapping table.

Java's String class is powerful: besides basic string operations, it can construct a string from bytes in a specified character set. The method described in this article relies on that ability, and its basic idea is:

1. Traverse all the Chinese characters in the GBK code table, using each character's GB encoding to construct a string. The Chinese character blocks in the GBK code table are laid out regularly, so they are easy to traverse.

2. Use the getBytes() method to obtain the byte array of that string. Because Java represents characters in Unicode, the Unicode value of the Chinese character is contained in it.
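
The two steps above can be sketched for a single character as follows. This is a minimal illustration, not the article's own code: the class and method names are hypothetical, it assumes the runtime provides the GBK charset, and it reads the Unicode value with codePointAt() instead of going through getBytes().

```java
import java.nio.charset.Charset;

final class GbkToUnicodeSketch {
    // Assumption: the runtime ships the GBK charset (standard JDK builds do).
    static final Charset GBK = Charset.forName("GBK");

    /** Decodes one two-byte GBK code and returns its Unicode code point. */
    static int toCodePoint(int lead, int trail) {
        byte[] gbkBytes = {(byte) lead, (byte) trail};
        String s = new String(gbkBytes, GBK); // step 1: build the string from GBK bytes
        return s.codePointAt(0);              // step 2: read the Unicode value back
    }

    public static void main(String[] args) {
        // 0xB0A1 is the first code position of the GBK/2 area.
        System.out.printf("GBK B0A1 -> U+%04X%n", toCodePoint(0xB0, 0xA1));
    }
}
```

Reading the code point directly avoids any byte order question; the byte array route used in the rest of the article yields the same value once the bytes are put in the right order.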

Here is the sample code:

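{
    // osw is assumed to be an already opened Writer (for example an
    // OutputStreamWriter) declared outside this fragment.
    int count = 0; // number of characters written to the table
    for (int segIndex = 0xb0; segIndex <= 0xf7; segIndex++) {        // lead byte
        for (int charIndex = 0xa1; charIndex <= 0xfe; charIndex++) { // trail byte
            byte[] gbkBytes = new byte[] {(byte) segIndex, (byte) charIndex};
            byte[] unicodeBytes;
            String str = new String(gbkBytes, "GBK");

            unicodeBytes = str.getBytes("Unicode");
            // Expect 2 bytes of byte order mark followed by 2 bytes of character data.
            if (unicodeBytes.length == 4) {
                count++;
                String buffer = "";
                // GB code: lead byte, then trail byte, as unsigned decimal values.
                // Fields are separated by a space here (assumed separator).
                for (int i = 0; i < gbkBytes.length; i++)
                    buffer += (int) (0x00ff & gbkBytes[i]) + " ";
                // Unicode value, read from the last byte backwards; verify the
                // order against the mark in bytes 0 and 1 on your JDK.
                for (int i = 3; i > 1; i--)
                    buffer += (int) (0x00ff & unicodeBytes[i]) + " ";
                buffer += "\n"; // one table entry per line (assumed line ending)
                osw.write(buffer);
            }
        }
    }
}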

This code traverses and processes the Chinese characters in the GBK/2 area. In that area the first (lead) byte lies in [0xB0, 0xF7] and the second (trail) byte lies in [0xA1, 0xFE]. The character set used when constructing the string is GBK:

String str = new String(gbkBytes, "GBK");
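
To get a feel for the size of this region, the sketch below counts its code positions and how many of them decode to a real character. The class and method names are hypothetical, and the exact number of decodable positions depends on the GBK mapping shipped with the runtime, since GB 2312 leaves a few positions in this range unassigned. It relies on the documented behaviour that new String(bytes, charset) substitutes U+FFFD for bytes it cannot map.

```java
import java.nio.charset.Charset;

final class Gbk2RegionSketch {
    // Assumption: the runtime provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Number of code positions in the GBK/2 area: 72 lead bytes x 94 trail bytes. */
    static int positionCount() {
        return (0xF7 - 0xB0 + 1) * (0xFE - 0xA1 + 1);
    }

    /** Counts positions that decode to exactly one character other than U+FFFD. */
    static int countDecodable() {
        int count = 0;
        for (int lead = 0xB0; lead <= 0xF7; lead++) {
            for (int trail = 0xA1; trail <= 0xFE; trail++) {
                String s = new String(new byte[] {(byte) lead, (byte) trail}, GBK);
                if (s.length() == 1 && s.charAt(0) != '\uFFFD') {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("positions: " + positionCount());
        System.out.println("decodable: " + countDecodable());
    }
}
```

Checking for U+FFFD explicitly is one way to keep unmapped positions out of the table instead of relying only on the length of the encoded byte array.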

The byte array returned by getBytes() has four elements. The first two are the byte order mark (BOM) that the encoder writes in front of the data to indicate the byte order (FE FF for big-endian, FF FE for little-endian); the last two are the actual Unicode value. In the output I obtained, these two bytes were in reverse order, so the code reads them starting from the last byte. This has to do with big-endian versus little-endian byte order, which I will not go into here.
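
The effect of byte order can be seen by encoding the same character with Java's three standard UTF-16 charsets. This is a small sketch with a hypothetical class name; it uses the StandardCharsets constants rather than the "Unicode" charset name from the sample code, so that the byte order is stated explicitly instead of being left to the runtime.

```java
import java.nio.charset.StandardCharsets;

final class Utf16ByteOrderSketch {
    /** Formats bytes as two-digit hex values separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u554A"; // the character at GBK code 0xB0A1
        // UTF_16: byte order mark followed by the two value bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));
        // UTF_16BE: high byte first, no byte order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));
        // UTF_16LE: low byte first, no byte order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Choosing UTF_16BE or UTF_16LE removes the two extra bytes and fixes the order, so a table generator built this way would not need to skip the mark or guess which byte comes first.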

At the end of each inner loop iteration, the first two numbers in the buffer string are the GB code and the next two are the Unicode value; the string is then written to the file.

Once this file has been produced, another program can load it, storing the Unicode values in an array indexed by GB code, which makes it easy to look up the Unicode value for a given GB code.
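
The loading step might look like the sketch below. It assumes each line of the file holds four decimal byte values separated by whitespace (GB lead byte, GB trail byte, Unicode high byte, Unicode low byte); the class name, method names, and the choice of a 65536-entry array are illustrative assumptions, not taken from the original program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class GbkLookupTableSketch {
    // One slot per possible two-byte code; 0 marks "no entry".
    private final int[] unicodeByGb = new int[0x10000];

    /** Loads lines of four decimal byte values: gbLead gbTrail uniHigh uniLow. */
    void load(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] f = line.trim().split("\\s+");
            if (f.length != 4) continue; // skip blank or malformed lines
            int gb = (Integer.parseInt(f[0]) << 8) | Integer.parseInt(f[1]);
            int unicode = (Integer.parseInt(f[2]) << 8) | Integer.parseInt(f[3]);
            unicodeByGb[gb] = unicode;
        }
    }

    /** Returns the Unicode value for a GB code, or 0 if the table has no entry. */
    int lookup(int gbCode) {
        return unicodeByGb[gbCode & 0xFFFF];
    }

    public static void main(String[] args) throws IOException {
        // Two sample lines in the assumed format: B0 A1 -> U+554A, D6 D0 -> U+4E2D.
        String sample = "176 161 85 74\n214 208 78 45\n";
        GbkLookupTableSketch table = new GbkLookupTableSketch();
        table.load(new BufferedReader(new StringReader(sample)));
        System.out.printf("U+%04X%n", table.lookup(0xB0A1));
    }
}
```

Indexing directly by the 16-bit GB code wastes some slots but keeps the lookup to a single array access; a denser index such as (lead - 0xB0) * 94 + (trail - 0xA1) would shrink the array if only the GBK/2 area is needed.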

This article is an English version of an article originally published in Chinese on aliyun.com.
