Notes on character encoding from jiasber's Java hut (2): GB2312, GBK, and Chinese web pages

GB2312 is a very important term for Chinese developers. There is no need to describe it in detail here; a quick Google search will tell you exactly what it means. I just want to recall a point from the previous part: a coded character set and a character encoding are not the same thing, and some character encodings do almost no work of their own. GB2312 is exactly such a case!
GB2312 originally refers to a coded character set. It contains the English characters found in ASCII and adds 6763 simplified Chinese characters plus other non-ASCII symbols. Just as Unicode has UTF-8 and UTF-16 (of course, UTF-8 and UTF-16 are not limited to encoding Unicode; in fact, you could use them to encode a video, only no player would be able to play the resulting file. Haha), GB2312 also has its own encoding scheme. However, this scheme essentially uses a character's number in GB2312 as the stored value (similar to UTF-32), so it is hardly ever referred to by a name of its own (formally, the common byte form is EUC-CN). When we talk about GB2312, we usually mean both the character set and this encoding scheme.
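To make the "character number as stored value" idea concrete, here is a minimal sketch. It assumes the JDK in use ships the GB2312 charset (Java's name for the EUC-CN byte form), and the class and method names are made up for illustration. Each byte of a two-byte character is simply 0xA0 plus the character's row or cell number in the code table.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of any library.
class Gb2312RowCell {
    /** Returns {row, cell} of a two-byte GB2312 character, or null if it is not one. */
    static int[] rowCell(char c) {
        Charset gb2312 = Charset.forName("GB2312");
        if (!gb2312.newEncoder().canEncode(c)) {
            return null; // the character is not in GB2312 at all
        }
        byte[] b = String.valueOf(c).getBytes(gb2312);
        if (b.length != 2) {
            return null; // ASCII characters are stored as a single byte
        }
        // Stored byte = 0xA0 + table index, so subtracting 0xA0 recovers the index.
        return new int[] { (b[0] & 0xFF) - 0xA0, (b[1] & 0xFF) - 0xA0 };
    }

    public static void main(String[] args) {
        int[] rc = rowCell('\u4e2d'); // U+4E2D, the character for "middle"
        System.out.println("row " + rc[0] + ", cell " + rc[1]);
    }
}
```

For U+4E2D the stored bytes are D6 D0, which map back to row 54, cell 48. The encoding step adds almost nothing beyond the fixed offset.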
GBK is a later extension of GB2312 that adds more Chinese characters and special symbols. Likewise, "GBK" refers to both its character set and its encoding.
GBK is now the default encoding of the Chinese edition of Windows (which is the root cause of nearly every case of garbled text in web pages and files).
We can verify this with the following Java code:
    String encoding = System.getProperty("file.encoding");
    System.out.println(encoding);

The output is
GBK
(What? That's not your output? How can that be? My reputation is about to be ruined. Wait, what kind of XP are you using? Comrade, what are you stirring up trouble here for? Off you go!)
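If your output differs, the Java version is one likely reason: since JDK 18 the default charset is UTF-8 regardless of the operating system's locale, so `file.encoding` no longer reflects the Windows setting. The sketch below (class name chosen for illustration) prints the related values side by side. The `native.encoding` property is only available on JDK 17 and later.

```java
import java.nio.charset.Charset;

// Hypothetical class name; the properties and Charset calls are standard JDK.
class DefaultEncodingCheck {
    /** Name of the charset the JVM uses when none is given explicitly. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharsetName());
        // native.encoding exists on JDK 17 and later; older JVMs print null here.
        System.out.println("native.encoding = " + System.getProperty("native.encoding"));
    }
}
```

On a Chinese edition of Windows with a recent JDK, one would expect the platform value to show up under `native.encoding` rather than `file.encoding`.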
When it comes to GB2312 and GBK, we have to mention the encoding of Chinese web pages. While many newly developed web systems and newly launched internationally oriented websites have started to use UTF-8, a considerable share of Chinese media sites still stick to GB2312 and GBK, Sina's pages for example. Two points are worth noting.
First, the meta tag on such pages is often written in the form charset=gb2312. Unfortunately, this "charset" actually specifies the character encoding used by the page, not the character set. For example, you have seen people write "charset=UTF-8" and "charset=ISO-8859-1", but have you ever seen anyone write "charset=Unicode"? Of course not, because Unicode is a character set, not an encoding.
However, the name "charset" misleads many programmers into believing that a character set really must be specified here, and from there into the further mistaken belief that UTF-8 and UTF-16 are character sets! (The root of all evil.) Fortunately, XML corrected this and uses the proper name: encoding.
Second, the gb2312 declared on such pages is not really GB2312 (surprised?). Let's run an experiment. Find a Chinese character that does not exist in GB2312 but does exist in GBK, for example "Taobao" (this character really is not in GB2312; look through the GB2312 code table and you will not find it). Put it in an HTML page, open the page in a browser, and set the browser's encoding to "gb2312". What do you see? It displays perfectly normally!
The conclusion goes without saying: the browser actually uses GBK to display the page.
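The character-set side of this experiment can also be checked in code. The following is a minimal sketch, assuming the JDK provides both the GB2312 and GBK charsets. The traditional character U+9AD4 is my own choice of test character, not the one used in the article, and the class name is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration.
class GbCoverage {
    /** True if the named charset has a byte sequence for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char common = '\u4e2d';      // U+4E2D, a common simplified character
        char traditional = '\u9ad4'; // U+9AD4, a traditional character (assumed test case)

        System.out.println("GB2312 has U+4E2D: " + canEncode("GB2312", common));
        System.out.println("GB2312 has U+9AD4: " + canEncode("GB2312", traditional));
        System.out.println("GBK    has U+9AD4: " + canEncode("GBK", traditional));
    }
}
```

A character that GB2312 cannot encode but GBK can is exactly the kind that still displays correctly on a page labelled gb2312.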
Sina's pages are full of such examples. charset=gb2312 is everywhere, yet the pages use countless characters that do not exist in GB2312. This is no problem for a browser displaying the page, but it is troublesome for a program that fetches and saves the page. Such a program cannot read and save the page according to the encoding the page "claims"; instead, it can only guess the correct encoding as best it can.
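One common way to make that guess is to do what browsers do and treat a declared "gb2312" as GBK, which is a superset (the WHATWG Encoding Standard also maps the label "gb2312" to GBK). The sketch below illustrates the idea; the class, the regular expression, and the fallback to GBK when no label is found are my own assumptions, not something prescribed by the article.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for a page-fetching program.
class DeclaredCharset {
    // Simplified pattern: finds the first "charset=..." in a snippet of HTML.
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts the declared charset label, or null if none is found. */
    static String declaredLabel(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        return m.find() ? m.group(1) : null;
    }

    /** Maps the declared label to the charset actually used for decoding. */
    static Charset effectiveCharset(String label) {
        if (label == null) {
            return Charset.forName("GBK"); // assumption: the crawler targets Chinese sites
        }
        if (label.toLowerCase(Locale.ROOT).equals("gb2312")) {
            return Charset.forName("GBK"); // decode with the superset, as browsers do
        }
        return Charset.forName(label); // throws if the label is unknown to this JVM
    }

    /** Decodes the raw page bytes using the effective charset. */
    static String decode(byte[] body, String htmlHead) {
        return new String(body, effectiveCharset(declaredLabel(htmlHead)));
    }
}
```

A real crawler would also look at the HTTP Content-Type header and handle unknown labels gracefully; this sketch only covers the gb2312 case discussed above.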
