How to build a GBK-to-Unicode mapping table

Source: Internet
Author: User

Some time ago, a project I was working on ran into a transcoding failure between Unicode and GB: some rarely used characters were converted to "??" and the Chinese text did not display. I researched the problem and eventually solved it. Building on the two earlier articles on Unicode and GB fundamentals, this article describes how to build a GBK-to-Unicode mapping table.

Java's String class is powerful: besides basic string operations, it can construct a string from bytes in a specified character set. The method described in this article relies on that, and its basic idea is:

1. Traverse all the Chinese characters in the GBK code table, using each character's GB encoding to construct a string. The Chinese character blocks in the GBK code table are laid out regularly, so they are easy to traverse.

2. Call getBytes() to obtain the byte array for the character. Java represents characters in Unicode, so the character's Unicode code is contained in that array.
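The two steps above can be tried on a single character first. This is a minimal sketch, assuming the running JDK provides the GBK charset (full JDKs normally do); the class and method names are illustrative and not from the original project.

```java
import java.nio.charset.Charset;

// Illustrative helper; the names here are not from the original project.
class GbkDecodeSketch {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Step 1: build a String from the two GBK bytes.
    // Step 2: read the UTF-16 code unit Java stores for that character.
    static int unicodeOf(int lead, int trail) {
        byte[] gbkBytes = {(byte) lead, (byte) trail};
        String str = new String(gbkBytes, GBK);
        return str.charAt(0);
    }

    public static void main(String[] args) {
        // 0xB0 0xA1 is the first position the traversal visits.
        System.out.printf("B0A1 -> U+%04X%n", unicodeOf(0xB0, 0xA1)); // prints B0A1 -> U+554A
    }
}
```

Reading the value with charAt() avoids any question about byte order, which only appears once the character is turned back into bytes.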

Here is the sample code:

{
    // osw is assumed to be an already-open java.io.Writer (for example an
    // OutputStreamWriter); the enclosing method must handle IOException.
    int count = 0;
    for (int segindex = 0xb0; segindex <= 0xf7; segindex++) {
        for (int charindex = 0xa1; charindex <= 0xfe; charindex++) {
            byte[] gbkbytes = new byte[] {(byte) segindex, (byte) charindex};
            byte[] unicodebytes;
            String str = new String(gbkbytes, "GBK");

            unicodebytes = str.getBytes("Unicode");
            if (unicodebytes.length == 4) {
                count++;
                // The separators (" " and "\n") are assumed; the original spacing was lost.
                String buffer = "";
                for (int i = 0; i < gbkbytes.length; i++)
                    buffer += (int) (0x00ff & gbkbytes[i]) + " ";
                for (int i = 3; i > 1; i--)
                    buffer += (int) (0x00ff & unicodebytes[i]) + " ";
                buffer += "\n";
                osw.write(buffer);
            }
        }
    }
}

This code traverses and processes the Chinese characters in the GBK/2 area. In the GBK/2 area the first byte lies in [0xb0, 0xf7] and the second byte lies in [0xa1, 0xfe]. The character set used when constructing the string is GBK:

String str = new String(gbkbytes, "GBK");
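The byte ranges given above can be written down as constants, which also makes it easy to see how many positions the double loop visits. This is only a sketch; the class, constants, and method names are hypothetical.

```java
// Illustrative helper; the names here are not from the original project.
class Gbk2Range {
    static final int LEAD_MIN = 0xB0, LEAD_MAX = 0xF7;   // first-byte range
    static final int TRAIL_MIN = 0xA1, TRAIL_MAX = 0xFE; // second-byte range

    // True when the byte pair falls inside the GBK/2 rectangle.
    static boolean inGbk2(int lead, int trail) {
        return lead >= LEAD_MIN && lead <= LEAD_MAX
            && trail >= TRAIL_MIN && trail <= TRAIL_MAX;
    }

    // Number of (lead, trail) pairs the double loop visits.
    static int positions() {
        return (LEAD_MAX - LEAD_MIN + 1) * (TRAIL_MAX - TRAIL_MIN + 1);
    }

    public static void main(String[] args) {
        System.out.println(positions()); // prints 6768 (72 lead bytes x 94 trail bytes)
    }
}
```

The count is an upper bound on the number of table rows, since a position inside the rectangle need not be assigned a character.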

The byte array returned by getBytes("Unicode") has four elements. The first two are the byte order mark (0xFE 0xFF) that Java's UTF-16 encoder writes at the start of its output; the last two are the actual Unicode code of the character. The sample reads these two bytes starting from the last one, so the low byte is written before the high byte. This ordering is the big-endian versus little-endian question, which is not covered further here; any program that reads the table must assume the same order.
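To inspect the four bytes directly, the sketch below prints the encoded form of one character in hex. It uses StandardCharsets.UTF_16 rather than the name "Unicode", on the assumption that the JDK treats "Unicode" as another name for UTF-16; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; the names here are not from the original project.
class Utf16BytesSketch {
    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u554A"; // the character at GBK position B0A1
        // UTF_16 writes a big-endian byte order mark before the data.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // prints FE FF 55 4A
        // The explicit-order variants write no byte order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // prints 55 4A
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // prints 4A 55
    }
}
```

Encoding with UTF_16BE or UTF_16LE yields exactly two bytes per character in this range, which removes the need to skip the first two bytes.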

At the end of each inner loop iteration, the first two numbers in buffer are the GB code and the next two are the Unicode code; the line is then written to the file.

Once the file has been generated, another program can load it, storing the Unicode values in an array indexed by GB code, so that the Unicode code for a given GB code can be looked up directly.
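The loading side might look like the sketch below. It assumes each line holds four space-separated decimal numbers (GB first byte, GB second byte, Unicode low byte, Unicode high byte); that layout is an assumption about the generated file, and the class and method names are hypothetical. Lines are passed in as strings so the sketch stays independent of any particular file path.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative loader; the names and the line format are assumptions.
class GbkTableLoader {
    // Returns a table indexed by the 16-bit GB code; 0 means "no entry".
    static char[] load(Iterable<String> lines) {
        char[] table = new char[0x10000];
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length != 4) continue; // skip blank or malformed lines
            int gb = (Integer.parseInt(f[0]) << 8) | Integer.parseInt(f[1]);
            // The low byte comes first in the assumed format.
            int uni = (Integer.parseInt(f[3]) << 8) | Integer.parseInt(f[2]);
            table[gb] = (char) uni;
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("176 161 74 85"); // B0 A1 -> 4A 55 (low, high)
        char[] table = load(lines);
        System.out.printf("U+%04X%n", (int) table[0xB0A1]); // prints U+554A
    }
}
```

A flat 65,536-entry array wastes some space but keeps the lookup to a single index operation, which matches the goal of searching by GB code.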


