Introduction to Character Set Encoding in Java (III): GB2312, GBK, and Chinese Web Pages

Last Update:2017-02-27 Source: Internet

Author: User

Tags character set html page


GB2312 is a very familiar term to developers in China. Its history does not need repeating here; if you want the full and accurate story, Google it. I just want to point out that, as the previous section explained, a coded character set and a character encoding are not the same thing, and that some character encodings do almost no work of their own. GB2312 is exactly such a case!

GB2312 originally refers to a coded character set that contains the English characters included in ASCII, plus 6,763 Simplified Chinese characters and a number of symbols outside ASCII. Just as Unicode has the encodings UTF-8 and UTF-16 (of course, nothing restricts UTF-8 and UTF-16 to encoding Unicode; you could use them to encode video if you liked, though no editor or player would support the resulting file, haha), GB2312 has its own encoding scheme. That scheme simply stores a value derived directly from the character's number in GB2312 (similar to the UTF-32 approach), so it is rarely called by a name of its own (formally it is EUC-CN). In everyday use, "GB2312" therefore refers both to the character set and to this encoding scheme.
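To make "the character's number is the stored value" concrete, here is a minimal sketch. It assumes the JDK's "GB2312" charset (the EUC-CN byte form), in which each of the two stored bytes is the row or cell number plus 0xA0. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

final class Gb2312RowCell {
    // The JDK exposes the EUC-CN byte encoding of GB2312 under the name "GB2312".
    private static final Charset GB2312 = Charset.forName("GB2312");

    /** Returns {row, cell} for a character that GB2312 encodes as two bytes. */
    static int[] rowAndCell(char c) {
        byte[] bytes = String.valueOf(c).getBytes(GB2312);
        if (bytes.length != 2) {
            // ASCII and unmappable characters come back as a single byte.
            throw new IllegalArgumentException("not a two-byte GB2312 character");
        }
        // Each stored byte is the row or cell number plus 0xA0.
        int row = (bytes[0] & 0xFF) - 0xA0;
        int cell = (bytes[1] & 0xFF) - 0xA0;
        return new int[] {row, cell};
    }

    public static void main(String[] args) {
        int[] rc = rowAndCell('\u4e2d'); // U+4E2D, stored as bytes D6 D0
        System.out.println("row=" + rc[0] + " cell=" + rc[1]); // prints "row=54 cell=48"
    }
}
```

Apart from the fixed 0xA0 offset, the stored bytes are just the position of the character in the GB2312 table, which is why the encoding step does so little work.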

GBK is the successor to GB2312 and adds more Chinese characters and special symbols. Likewise, "GBK" refers both to the character set and to its encoding.

GBK is still the system default encoding of today's Simplified Chinese Windows (which is the root of almost all garbled-text problems in web pages and files).

We can verify this by using the following Java code:

String encoding = System.getProperty("file.encoding");

System.out.println(encoding);

The output is:

GBK

(What? That's not what you got? How can that be? Now I'm done for, my reputation is ruined. Wait, are you running the Traditional Chinese edition of XP? Then what are you doing here, comrade? Shoo! Shoo!)
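Why a GBK default leads to garbled text can be sketched in a few lines: bytes written in one encoding are read back in another. This is only an illustration; the class and method names are hypothetical, and it assumes the JDK you run it on ships the GBK charset (standard desktop and server JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    private static final Charset GBK = Charset.forName("GBK");

    /** Encodes text as GBK, then decodes the bytes with the given charset. */
    static String roundTrip(String text, Charset decodeWith) {
        byte[] gbkBytes = text.getBytes(GBK);
        return new String(gbkBytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the word "Chinese", written in Chinese
        // Same encoding on both sides: the text survives.
        System.out.println(original.equals(roundTrip(original, GBK)));
        // GBK bytes read as UTF-8: the text is garbled.
        System.out.println(original.equals(roundTrip(original, StandardCharsets.UTF_8)));
    }
}
```

Passing the charset explicitly on both the writing and the reading side, instead of relying on the platform default, is what avoids the problem.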

Any discussion of GB2312 and GBK has to cover the encoding of Chinese web pages. Although many newly developed web systems and newly launched sites have started to use UTF-8, a considerable share of Chinese media sites still insist on GB2312 and GBK, Sina's pages for example. Two points here are noteworthy.

First, in the META tag of such pages you will often see:

charset=gb2312

Unfortunately, this "charset" actually specifies which character encoding the page uses, not which character set. For example, you have seen people write "charset=utf-8" and "charset=iso-8859-1", but have you ever seen anyone write "charset=unicode"? Of course not, because Unicode is a character set, not an encoding.
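As a rough sketch of how a program might read this declaration, the snippet below pulls the value after "charset=" out of a page's markup with a regular expression. The class name, method name, and pattern are assumptions made for illustration; a real crawler would use an HTML parser and also look at the HTTP Content-Type header.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Matches charset=VALUE, with or without quotes, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the encoding label the page declares, if it declares one. */
    static Optional<String> declaredEncoding(String html) {
        Matcher m = CHARSET.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String head = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=gb2312\">";
        System.out.println(declaredEncoding(head).orElse("(none)")); // prints "gb2312"
    }
}
```

The extracted label is the name of an encoding, so it can be passed to `Charset.forName` to obtain the decoder for the page's bytes.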

Yet the name "charset" has misled many programmers into believing that a character set is specified here, and from there into the further mistake of thinking that UTF-8 and UTF-16 are character sets! Fortunately, XML corrected this and gave the attribute the right name: encoding.

Second, a page that says GB2312 is often not really GB2312 (surprised?). Let's do an experiment. Find a Chinese character that is not in GB2312 but is in GBK, for example "亸" (it really is not in GB2312; look through the GB2312 code table and I guarantee you will not find it). Put it in an HTML page, open the page in a browser, and set the browser's encoding to "GB2312". What do you see? It displays perfectly normally!

I don't need to spell out the conclusion: the browser is actually using GBK to display the page.
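The claim that "亸" (U+4EB8) is in GBK but not in GB2312 can also be checked from Java, without a browser. This is a small sketch that assumes the JDK's "GB2312" and "GBK" charsets are available; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;

final class Gb2312VersusGbk {
    /** Reports whether the named charset has a code for the character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rare = '\u4eb8';   // the character used in the experiment above
        char common = '\u4e2d'; // a common character present in both sets
        System.out.println("GB2312 rare:   " + canEncode("GB2312", rare));
        System.out.println("GBK rare:      " + canEncode("GBK", rare));
        System.out.println("GB2312 common: " + canEncode("GB2312", common));
    }
}
```

If the first line prints false and the second prints true, a page that displays this character correctly cannot really be GB2312, whatever its META tag says.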

Sina's pages contain many such examples: charset=gb2312 is written everywhere, yet the pages use countless characters that do not exist in GB2312. This causes no problem when a browser displays the page, but it causes trouble when a program needs to crawl the page and save it: the program cannot read and save the page according to the encoding the page claims, and can only try to guess the correct one.
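One common way around this, sketched below, is to do what browsers do and decode anything declared as gb2312 with GBK, since GBK keeps the same codes for every GB2312 character. The WHATWG Encoding Standard lists "gb2312" as a label for GBK for this reason. The class and method names are hypothetical, and the sketch handles only this one mapping rather than guessing encodings in general.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class DeclaredCharsetResolver {
    /** Maps the label a page declares to the charset actually used to decode it. */
    static Charset resolve(String declared) {
        String label = declared.trim().toLowerCase(Locale.ROOT);
        if (label.equals("gb2312")) {
            // Pages that claim gb2312 often contain GBK-only characters.
            return Charset.forName("GBK");
        }
        // Throws an unchecked exception if the label is unknown or malformed.
        return Charset.forName(label);
    }

    /** Decodes raw page bytes using the resolved charset. */
    static String decode(byte[] pageBytes, String declared) {
        return new String(pageBytes, resolve(declared));
    }

    public static void main(String[] args) {
        String text = "\u4eb8\u4e2d"; // one GBK-only character, one common character
        byte[] pageBytes = text.getBytes(Charset.forName("GBK"));
        System.out.println(text.equals(decode(pageBytes, "gb2312"))); // prints "true"
    }
}
```

Decoding the same bytes with the strict "GB2312" charset would turn the GBK-only character into a replacement character, which is the damage a crawler does when it trusts the declaration literally.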

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
