Confusing URL encoding

Source: Internet
Author: User
Tags character set coding standards http request resource rfc

URL stands for Uniform Resource Locator, that is, a web page address. It is meant to be reachable from any corner of the Internet, which implies that a URL should not be constrained by differences in country, language, or character encoding; in other words, it should be encoding-independent. Yet we often type a URL containing Chinese, such as "http://url/中文", into the browser, and it resolves correctly. Since the URL contains Chinese, how can computers in other countries that have no Chinese encoding installed reach the same site?

RFC 1738 is explicit on this point: a URL must consist of English letters, digits, and certain punctuation marks, and may not contain any other characters. Strictly speaking, every URL that contains Chinese is therefore illegal! In practice the browser performs a number of user-friendly hacks for us. For example, it encodes the URL typed into the address bar before using it, so a properly formed HTTP request never carries Chinese characters in its URI field. The URL that is actually sent does follow RFC 1738: non-ASCII characters are first converted into a sequence of ASCII characters. However, RFC 1738 does not say which encoding to use for that conversion; it leaves the decision to the application (the browser) and to the author of the web program. This is why "URL encoding" has become such a confusing area, and why some strange behaviour can occur.
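The ambiguity described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard `urllib.parse.quote` function stands in for what a browser does; the helper name `to_ascii_url_path` and the path "/淘宝" are hypothetical. The same text becomes two different ASCII strings depending on which character encoding is applied before the percent-escaping.

```python
from urllib.parse import quote, unquote


def to_ascii_url_path(path: str, encoding: str) -> str:
    """Percent-encode the non-ASCII characters of `path` using `encoding`."""
    # safe="/" keeps the path separator readable; every byte of a
    # non-ASCII character becomes one %XX escape.
    return quote(path, safe="/", encoding=encoding)


path = "/淘宝"  # hypothetical path containing Chinese characters
as_utf8 = to_ascii_url_path(path, "utf-8")
as_gbk = to_ascii_url_path(path, "gbk")

print(as_utf8)  # /%E6%B7%98%E5%AE%9D
print(as_gbk)   # /%CC%D4%B1%A6

# Both results are pure ASCII, but the receiver must know which encoding
# was used in order to recover the original text.
print(unquote(as_gbk, encoding="gbk"))  # /淘宝
```

Both outputs are legal under the character rules of RFC 1738, which is exactly why the sender and the receiver have to agree on the encoding by some other means.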

We search for "淘宝" (Taobao) on Baidu and on Google, in both Firefox and IE.

Searching Baidu for "淘宝" in Firefox, the address bar shows:



The URL of the request that is actually sent is:



This matches the address bar, and the search results are correct. Typing "http://www.baidu.com/s?wd=淘宝" directly into the address bar behaves the same way. Searching Google for "Taobao" in Firefox, the address bar shows:



The URL of the request that is actually sent is:



As you can see, the URL of the actual request differs from the one in the address bar, yet the search results are correct. If you now request the address-bar URL again (a new request, not a refresh), the address bar shows:



The request that is actually sent is:



Now the address bar matches the actual request, and the search results are correct. Before analysing further, first look at two operations in JavaScript:



We know that escape() works on the Unicode code point, while encodeURI(), the so-called orthodox URL encoding, works on UTF-8. (Put simply, Unicode is the pure character numbering, and UTF-8 is one implementation of it: the binary Unicode value is encoded a second time, in a more space-saving form that still covers the complete Unicode set.) The result of escape() prefixes each Unicode character with %u, whereas encodeURI() prefixes each byte with %. For the two characters of "淘宝" (Taobao), the Unicode code points of "淘" and "宝" are 6DD8 and 5B9D, and their UTF-8 encodings are "E6 B7 98" and "E5 AE 9D". In addition, their GBK codes are "CC D4" and "B1 A6" respectively.
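The difference between the two JavaScript functions can be reproduced approximately in Python. The sketch below is an imitation, not the browsers' implementation: `js_escape_like` and `encode_uri_like` are hypothetical helper names, and they only aim to match the documented behaviour of escape() (one %uXXXX per UTF-16 code unit) and encodeURI() (one %XX per UTF-8 byte).

```python
import string
import struct
from urllib.parse import quote

# Characters that JavaScript escape() leaves untouched.
ESCAPE_SAFE = set(string.ascii_letters + string.digits + "@*_+-./")
# Reserved and unreserved characters that encodeURI() leaves untouched.
ENCODE_URI_SAFE = ";/?:@&=+$,#-_.!~*'()"


def js_escape_like(text: str) -> str:
    """Imitate escape(): work on UTF-16 code units, emit %XX or %uXXXX."""
    data = text.encode("utf-16-be")
    units = struct.unpack(f">{len(data) // 2}H", data)
    out = []
    for unit in units:
        ch = chr(unit)
        if ch in ESCAPE_SAFE:
            out.append(ch)
        elif unit < 0x100:
            out.append(f"%{unit:02X}")
        else:
            out.append(f"%u{unit:04X}")
    return "".join(out)


def encode_uri_like(text: str) -> str:
    """Imitate encodeURI(): percent-encode each UTF-8 byte."""
    return quote(text, safe=ENCODE_URI_SAFE, encoding="utf-8")


def hex_bytes(text: str, encoding: str) -> str:
    """Show the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


print(js_escape_like("淘宝"))    # %u6DD8%u5B9D
print(encode_uri_like("淘宝"))   # %E6%B7%98%E5%AE%9D
print(hex_bytes("淘宝", "gbk"))  # CC D4 B1 A6
```

The three outputs correspond to the three encodings listed above: the Unicode code points, the UTF-8 bytes, and the GBK bytes.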

A preliminary conclusion: when searching Baidu in Firefox, the Chinese text submitted through the form is converted to GBK before it is packed into the HTTP request. When searching Google in Firefox, the Chinese text submitted through the form is encoded as UTF-8, but the address bar displays it as Chinese characters (if you copy the address bar, what you actually get is the transcoded URL; the Chinese text in the URL cannot be copied). If you type a Chinese URL directly into the Firefox address bar, the characters in the URL are encoded as GBK, for Baidu and Google alike.


The Chinese text in the address bar cannot be copied as-is.

So it appears that Firefox, by default, encodes Chinese in a typed URL as GBK. This has nothing to do with the page encoding (the browser cannot know the encoding of a page it has not yet fetched).
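The two form submissions can be imitated with `urllib.parse.urlencode`, which accepts an `encoding` argument. This is a rough sketch under the assumption that each search form submits only the single parameter named in the URLs quoted in this article; real search requests carry more parameters, and `search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def search_url(base: str, param: str, term: str, encoding: str) -> str:
    """Build a one-parameter query URL, encoding the term with `encoding`."""
    return f"{base}?{urlencode({param: term}, encoding=encoding)}"


# GBK, as observed for the Baidu form and for URLs typed into Firefox.
baidu = search_url("http://www.baidu.com/s", "wd", "淘宝", "gbk")
# UTF-8, as observed for the Google form.
google = search_url("http://www.google.cn/search", "q", "淘宝", "utf-8")

print(baidu)   # http://www.baidu.com/s?wd=%CC%D4%B1%A6
print(google)  # http://www.google.cn/search?q=%E6%B7%98%E5%AE%9D
```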

So how well do Baidu and Google support Unicode-encoded and UTF-8-encoded URLs?

The Unicode encoding of "淘宝" (Taobao) is "%6D%D8%5B%9D". Visiting "http://www.baidu.com/s?wd=%6D%D8%5B%9D" in Firefox gives:



The search term comes out garbled. The UTF-8 encoding of "淘宝" (the so-called authentic URL encoding) is "%E6%B7%98%E5%AE%9D". Visiting "http://www.baidu.com/s?wd=%E6%B7%98%E5%AE%9D" in Firefox gives:



This is garbled as well.

To see whether Google can parse the UTF-8 encoding, visit "http://www.google.cn/search?q=%E6%B7%98%E5%AE%9D" in Firefox:



The results are correct, so Google parses the UTF-8 encoding correctly. To see whether Google can parse the Unicode encoding, visit "http://www.google.cn/search?q=%6D%D8%5B%9D" in Firefox, which gives:



This is garbled.

Preliminary conclusion two: the so-called orthodox URL encoding, encodeURI, is not universal; it depends on how each site is implemented. Baidu search does not support this so-called orthodox encoding and uses GBK as its URL encoding throughout. Google supports the "orthodox URL encoding" and also supports GBK, which makes it more robust.
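The garbling, and one way a server could accept both encodings, can be sketched as follows. This is a guess at a possible strategy, not a description of how Google or Baidu actually decode queries: `decode_query_value` is a hypothetical helper that tries strict UTF-8 first and falls back to GBK. The heuristic is imperfect, since some GBK byte sequences also happen to be valid UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_query_value(raw: str) -> str:
    """Decode a percent-encoded value: strict UTF-8 first, then GBK."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the sender used GBK.
        return data.decode("gbk", errors="replace")


# A GBK-only decoder misreads the UTF-8 bytes and produces garbled text.
garbled = unquote("%E6%B7%98%E5%AE%9D", encoding="gbk", errors="replace")
print(garbled == "淘宝")  # False

# The fallback decoder accepts both encodings of the same search term.
print(decode_query_value("%E6%B7%98%E5%AE%9D"))  # 淘宝
print(decode_query_value("%CC%D4%B1%A6"))        # 淘宝

# The raw code-point bytes (%6D%D8%5B%9D) are neither valid UTF-8 nor the
# GBK bytes of the term, so they cannot be recovered either way.
print(decode_query_value("%6D%D8%5B%9D") == "淘宝")  # False
```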


Now look at IE. Searching Baidu and Google for "Taobao" through the form gives the same results in IE as in Firefox, but typing a Chinese URL directly into the address bar behaves a little strangely. Visiting "http://www.baidu.com/s?wd=淘宝" in IE gives:



The result is of course correct, and the actual request is:



Here you can see that the HTTP request sent by IE is not encoded at all: it bluntly puts "淘宝" (Taobao) into the request as raw GBK bytes, so an operating system using another language encoding cannot recognise this URL. The "\314\324\261\246" shown in the capture is the octal-escape form of the raw GBK bytes CC D4 B1 A6. Even Wireshark cannot render them as text, because "http://www.baidu.com/s?wd=\314\324\261\246" is plainly a malformed request.
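The "\314\324\261\246" sequence can be checked directly, because Python byte literals accept the same octal escapes. The sketch below compares those bytes with the GBK encoding of "淘宝" and applies a deliberately simplified validity test; `is_ascii_request_target` is a hypothetical helper that only checks for printable ASCII and does not implement the full URL grammar.

```python
# The bytes shown in the capture, written with octal escapes.
captured = b"\314\324\261\246"

# They are exactly the GBK encoding of the search term.
print(captured == "淘宝".encode("gbk"))  # True
print(captured.hex().upper())            # CCD4B1A6


def is_ascii_request_target(target: bytes) -> bool:
    """Simplified check: every byte must be printable ASCII (0x21-0x7E)."""
    return all(0x21 <= b <= 0x7E for b in target)


raw_target = b"/s?wd=" + captured        # what IE sent, per the capture
escaped_target = b"/s?wd=%CC%D4%B1%A6"   # the same bytes, percent-encoded

print(is_ascii_request_target(raw_target))      # False
print(is_ascii_request_target(escaped_target))  # True
```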

In addition, for the Unicode-encoded and UTF-8-encoded URLs, IE behaves the same as Firefox.

Thus, it can be concluded that:

1. RFC 1738 is very loose, which leaves URL encoding without a real standard. The encoding actually used depends on the operating system, the browser, and the web application.
2. Firefox encodes non-ASCII URLs using the operating system's default encoding.
3. Google supports the "orthodox URL encoding" (i.e. UTF-8 URL encoding: each UTF-8 byte prefixed with %); Baidu does not.
4. IE does not encode non-ASCII URLs; it sends the request directly in the operating system's default encoding. In other words, IE does not even follow RFC 1738, or else IE's URL transcoding has a bug.
5. The URL that Firefox displays in the address bar has been hacked for readability, and that hack has bugs to watch out for during development.

Based on this, in web development we should do the following:

1. Treat encoding as a problem in its own right and use one URL encoding throughout. Whether it is GBK, Unicode, or URI (UTF-8) encoding, it must be consistent. Since most people regard URI encoding as the authentic URL encoding, it is recommended to use URI encoding and decoding on both the front end and the back end.
2. Choose the web app's encoding wisely: UTF-8 is the best choice, GBK the second.
3. Test encoding behaviour for browser compatibility.
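The first recommendation can be sketched as a pair of helpers that share a single encoding constant. This is a minimal illustration, assuming UTF-8 is the agreed encoding; `build_url`, `read_params`, and the base URL "http://example.com/search" are hypothetical. Decoding with `errors="strict"` makes a request in any other encoding fail loudly instead of silently turning into garbled text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

URL_ENCODING = "utf-8"  # the one encoding agreed on by front end and back end


def build_url(base: str, params: dict) -> str:
    """Front-end side: percent-encode all parameters with URL_ENCODING."""
    return f"{base}?{urlencode(params, encoding=URL_ENCODING)}"


def read_params(url: str) -> dict:
    """Back-end side: decode with the same encoding, rejecting bad bytes."""
    query = urlsplit(url).query
    parsed = parse_qs(query, encoding=URL_ENCODING, errors="strict")
    return {name: values[0] for name, values in parsed.items()}


url = build_url("http://example.com/search", {"q": "淘宝"})
print(url)               # http://example.com/search?q=%E6%B7%98%E5%AE%9D
print(read_params(url))  # {'q': '淘宝'}

# A GBK-encoded request is rejected rather than silently garbled.
try:
    read_params("http://example.com/search?q=%CC%D4%B1%A6")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```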

That's all.

Appendix:
CJK (Chinese, Japanese, Korean) Unicode character set
GBK character set
