Introduction to Character Set Encoding in Java (III): GB2312, GBK, and Chinese Web Pages

Source: Internet
Author: User
Tags character set html page

GB2312 is a very important term for developers in China. I don't need to repeat its history here; for a clear and accurate account, search Google. I only want to point out that, as the previous section said, a coded character set and a character encoding are not the same thing, and that some character encodings do not actually do any real work. GB2312 is just such a case!

GB2312 originally refers to a coded character set that contains the English characters included in ASCII, plus 6,763 Simplified Chinese characters and some other non-ASCII symbols. Just as Unicode has UTF-8 and UTF-16 (of course, UTF-8 and UTF-16 are not restricted to encoding Unicode; in principle you could use them to encode video, although no editor or player would support the resulting file, haha), GB2312 also has its own encoding scheme. This scheme, however, essentially stores a character's number in the GB2312 table directly (similar to the UTF-32 approach), so it is rarely referred to by a name of its own (strictly speaking, the byte form used in practice is called EUC-CN). In everyday usage, "GB2312" refers both to the character set and to this encoding scheme.
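To make the "store the character's number" idea concrete, here is a minimal sketch. It assumes the JDK in use ships the GB2312 charset (full JDK distributions normally do) and that, in the byte form used in practice (EUC-CN), each byte is the character's row or cell number in the GB2312 table plus 0xA0. The class and method names are made up for illustration and are not from the original article.

```java
import java.nio.charset.Charset;

// Illustrative only: class and method names are hypothetical.
final class Gb2312RowCell {

    // Returns {row, cell} of a character in the GB2312 table,
    // recovered from its two-byte encoded form.
    static int[] rowAndCell(char c) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName("GB2312"));
        if (bytes.length != 2) {
            throw new IllegalArgumentException("not a two-byte GB2312 character");
        }
        // Assumption: each byte is the row or cell number offset by 0xA0.
        int row = (bytes[0] & 0xFF) - 0xA0;
        int cell = (bytes[1] & 0xFF) - 0xA0;
        return new int[] {row, cell};
    }

    public static void main(String[] args) {
        // U+554A is the first Chinese character in the GB2312 table.
        int[] rc = rowAndCell('\u554A');
        System.out.println("row " + rc[0] + ", cell " + rc[1]);
    }
}
```

The stored bytes carry nothing beyond the table position plus a fixed offset, which is why the encoding step "does not actually do anything" in the sense described above.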

GBK is a follow-up standard to GB2312 that adds more Chinese characters and special symbols. Similarly, "GBK" refers both to its character set and to its encoding.

GBK is still the system default encoding of today's Chinese Windows operating system (which is the root of the garbled-text problem in almost all web pages and files).

We can verify this by using the following Java code:

String encoding = System.getProperty("file.encoding");
System.out.println(encoding);

The output is:

GBK

(What? That isn't your output? How can that be? That's it, my reputation is ruined... Wait, are you on the Traditional Chinese version of XP? Comrade, what are you doing here? Off you go!)
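A slightly fuller sketch of the same check follows. One caveat that is not in the original text: on Java 18 and later, file.encoding defaults to UTF-8 regardless of the operating system, and the system's own encoding is exposed through the native.encoding property (available since Java 17), so a modern JDK may not print GBK even on Chinese Windows. The class name is illustrative.

```java
import java.nio.charset.Charset;

// Illustrative only: the class name is hypothetical.
final class DefaultEncodingCheck {

    // Collects the encoding-related settings of the running JVM in one line.
    static String describe() {
        String fileEncoding = System.getProperty("file.encoding");
        // null on JDKs older than Java 17, where the property does not exist
        String nativeEncoding = System.getProperty("native.encoding");
        return "file.encoding=" + fileEncoding
                + ", native.encoding=" + nativeEncoding
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```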

When it comes to GB2312 and GBK, the encoding of Chinese web pages has to be mentioned. While many newly developed web systems and newly launched websites are starting to use UTF-8, a significant portion of the Chinese media still insists on GB2312 and GBK, for example the Sina pages. Two points here are noteworthy.

First, in the META tag of a page you can often see:

charset=gb2312

Unfortunately, this "charset" actually specifies which character encoding the page uses, not which character set. For example, you have seen people write "charset=utf-8" and "charset=iso-8859-1", but have you ever met anyone who wrote "charset=unicode"? Of course not, because Unicode is a character set, not an encoding.

However, the name charset has misled many programmers into believing that a character set is specified here, and from there into mistaking UTF-8 and UTF-16 for character sets! Fortunately, XML corrected this: there the attribute carries the right name, encoding.
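The point that the label names an encoding can be seen in code: the declared value is what a program hands to Java's Charset.forName, which returns an encoder/decoder pair rather than a table of characters. The sketch below is hedged accordingly: the regular expression is deliberately simplistic (a real crawler should use an HTML parser), and the class and method names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical.
final class MetaCharset {

    // Deliberately simple pattern; it only looks for "charset=<label>".
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared label in lower case, or null if none is found.
    static String declaredLabel(String html) {
        Matcher m = CHARSET.matcher(html);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        String html =
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        String label = declaredLabel(html);
        // The label selects an encoding (a java.nio.charset.Charset),
        // not a coded character set.
        Charset encoding = Charset.forName(label);
        System.out.println(label + " -> " + encoding.name());
    }
}
```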

Second, the GB2312 declared on a page is actually not really GB2312 (surprised?). Let's do an experiment: find a Chinese character that is not in GB2312, such as "亸" (this character really is not in GB2312; look through the GB2312 code table and you are guaranteed not to find it), but that is in GBK. Put it in an HTML page, open the page in a browser, and set the browser's encoding to "GB2312". What do you see? It displays perfectly normally!

I don't need to spell out the conclusion: the browser is actually using GBK to display the page.
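The same experiment can be approximated without a browser. The sketch below asks Java's own GB2312 and GBK charsets whether they can encode the character; it assumes both charsets are available in the running JDK, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

// Illustrative only: class and method names are hypothetical.
final class Gb2312VersusGbk {

    // True if the named charset has an encoding for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char ch = '\u4EB8'; // the character used in the experiment above
        System.out.println("GB2312 can encode it: " + canEncode("GB2312", ch));
        System.out.println("GBK can encode it:    " + canEncode("GBK", ch));
    }
}
```

If the claim above holds for your JDK's tables, the first line reports false and the second reports true.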

Sina's pages contain many such examples: charset=gb2312 is written everywhere, yet countless characters that do not exist in GB2312 are used. This causes no problem when a browser displays the page, but it causes trouble when a program needs to crawl and save the page: the program cannot rely on the encoding the page "claims" in order to read and save it, and can only try to guess the correct encoding.
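One simple guess a crawler can make, sketched below, is to decode anything declared as gb2312 with GBK instead, mirroring what browsers effectively do (the WHATWG Encoding Standard lists "gb2312" as a label for GBK). This is only one heuristic under that assumption, not a general encoding detector; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Illustrative only: class and method names are hypothetical.
final class LenientPageDecoder {

    // Assumption: a declared "gb2312" is treated as GBK, as browsers effectively do.
    static Charset effectiveCharset(String declaredLabel) {
        String label = declaredLabel.trim().toLowerCase(Locale.ROOT);
        if (label.equals("gb2312")) {
            return Charset.forName("GBK");
        }
        return Charset.forName(label);
    }

    // Decodes raw page bytes using the effective rather than the declared encoding.
    static String decode(byte[] pageBytes, String declaredLabel) {
        return new String(pageBytes, effectiveCharset(declaredLabel));
    }

    public static void main(String[] args) {
        String original = "\u4EB8"; // a GBK character that is not in GB2312
        byte[] pageBytes = original.getBytes(Charset.forName("GBK"));

        String strict = new String(pageBytes, Charset.forName("GB2312"));
        String lenient = decode(pageBytes, "gb2312");

        System.out.println("strict GB2312 round-trips: " + strict.equals(original));
        System.out.println("lenient GBK round-trips:   " + lenient.equals(original));
    }
}
```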
