URL encoding summary

Source: Internet
Author: User


URL stands for Uniform Resource Locator; in everyday use it is simply called a web address. A URL usually has the form protocol://server address(:port number, when required)/path/file name?parameters. For example, in http://zh.wikipedia.org:80/w/index.php?title=Special the protocol is HTTP, the server address is zh.wikipedia.org, the port is 80, the path and file name are /w/index.php, and the parameter is title=Special. A related concept is the URI (Uniform Resource Identifier): a URI identifies an Internet resource, and a URL is a URI that also specifies how to obtain that resource.

Most URLs contain only English characters, so no encoding problem arises. But what are the rules when a URL contains Chinese characters? The RFC does not actually specify which character encoding to use in that case, so browsers may behave differently. Below is a summary, based mainly on Ruan Yifeng's article on URL encoding. My test results differ from his in some places, and yours may differ again depending on your system environment.
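As a rough illustration of the URL parts described above, the sketch below splits the example address with the standard `URL` class found in modern browsers and Node.js. The variable names are only illustrative. Note that this API reports an empty port when the port is the protocol's default (80 for HTTP).

```javascript
// Split the example URL into the parts described above.
const url = new URL("http://zh.wikipedia.org:80/w/index.php?title=Special");

const parts = {
  protocol: url.protocol,                // "http:" (the URL API keeps the trailing colon)
  host: url.hostname,                    // "zh.wikipedia.org"
  port: url.port,                        // "" because 80 is the default port for http
  path: url.pathname,                    // "/w/index.php"
  title: url.searchParams.get("title"),  // "Special"
};

console.log(parts);
```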


1. The URL path contains Chinese characters

If the URL path contains Chinese characters, both IE6.0 and Chrome encode them as UTF-8. Other browsers have not been tested yet, but they presumably behave the same way.

In the test URL, "中文" ("Chinese") in the path is sent as %E4%B8%AD%E6%96%87; E4 B8 AD E6 96 87 is the UTF-8 encoding of "中文".
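The same behaviour can be observed with the standard `URL` class, which percent-encodes non-ASCII path characters as UTF-8. This is only a small sketch; the host `example.com` and the variable name are placeholders, not the address used in the test.

```javascript
// Hypothetical address; only the non-ASCII path segment matters here.
const pageUrl = new URL("http://example.com/wiki/中文");

// The parser percent-encodes the path as UTF-8 bytes.
console.log(pageUrl.pathname); // /wiki/%E4%B8%AD%E6%96%87

// decodeURIComponent() interprets the escapes as UTF-8 and restores the text.
console.log(decodeURIComponent(pageUrl.pathname)); // /wiki/中文
```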


2. URL query parameters contain Chinese characters

If the query parameters contain Chinese characters, IE6.0 uses the operating system's encoding, while Chrome uses UTF-8.

The test URL is http://www.baidu.com/s?wd=中文 ("中文" means "Chinese"). IE6 on a Simplified Chinese (GBK) system converts "中文" to %D6%D0%CE%C4, whereas Chrome converts it to %E4%B8%AD%E6%96%87.
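The difference between the two browsers comes down to which bytes get percent-encoded. Here is a minimal sketch: each byte of the encoded text becomes `%` followed by two hexadecimal digits. The helper name `percentEncodeBytes` is hypothetical. The UTF-8 bytes come from the standard `TextEncoder`, while the GBK bytes for "中文" are written out by hand as an assumption, since `TextEncoder` only produces UTF-8.

```javascript
// Hypothetical helper: turn raw bytes into a %XX sequence.
function percentEncodeBytes(bytes) {
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

// UTF-8 bytes of "中文", produced by the standard encoder.
const utf8Bytes = new TextEncoder().encode("中文");
// GBK bytes of "中文", hard-coded here (assumption: a GBK system locale).
const gbkBytes = Uint8Array.of(0xd6, 0xd0, 0xce, 0xc4);

const utf8Query = "wd=" + percentEncodeBytes(utf8Bytes); // wd=%E4%B8%AD%E6%96%87
const gbkQuery = "wd=" + percentEncodeBytes(gbkBytes);   // wd=%D6%D0%CE%C4
console.log(utf8Query, gbkQuery);
```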


3. Chinese encoding of form parameters

When a form is submitted, both IE6 and Chrome encode Chinese characters in the parameters using the character encoding declared in the HTML code (that is, the charset specified by the meta tag). This holds only when the form does not specify accept-charset. If the attribute accept-charset="GBK" is added to the form, the form parameters are encoded with the encoding that accept-charset specifies.

The test code is as follows:

<Html> 

With charset=UTF-8, entering "中文" ("Chinese") in the input field submits it as %E4%B8%AD%E6%96%87; with charset=GBK, the input is encoded in GBK as %D6%D0%CE%C4. The same applies to POST.

If you add the accept-charset attribute and change the code as follows, the form parameter encoding is determined by accept-charset. In the code below, although the meta tag specifies UTF-8, the form parameters are GBK-encoded.

<Html>

Note that if the input contains a space, such as "中文 haha", the space is encoded as +.
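The space-to-plus behaviour can be reproduced outside a form. As a rough sketch, the standard `URLSearchParams` class serializes values in the `application/x-www-form-urlencoded` style that forms use, but always in UTF-8 (it has no counterpart to accept-charset). The field name `wd` is just an example.

```javascript
// Serialize a field the way a form submission would (UTF-8 only).
const params = new URLSearchParams();
params.append("wd", "中文 haha");

const body = params.toString();
console.log(body); // wd=%E4%B8%AD%E6%96%87+haha

// Parsing turns "+" back into a space.
const roundTrip = new URLSearchParams(body).get("wd");
console.log(roundTrip === "中文 haha"); // true
```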


4. Encoding of Chinese parameters in HTTP requests sent from JavaScript

The sections above covered HTTP requests issued directly by the browser. What happens when the request is sent from JavaScript? According to my tests, IE6 encodes Chinese parameters with the operating system's encoding, while Chrome uses UTF-8.

Test: in Chrome, send a request for "/wiki/中文" from JavaScript; the UTF-8 encoding of "中文" ("Chinese") is visible in the network connection. For IE, write a separate web page and send the request from JS to test.


5. JavaScript encoding functions

Many cases of Chinese encoding were described above, and browsers handle them differently, which makes this a messy problem. A good approach is to use JS functions to encode parameters uniformly before submitting a form.

The first function is escape, a global function. It encodes a string as hexadecimal escapes (%xx or %uxxxx) of its Unicode values: characters less than or equal to 0xFF are escaped as %xx, and characters greater than 0xFF as %uxxxx. A string encoded with escape can be decoded with unescape. escape has been deprecated by the ECMA standard; use encodeURI or encodeURIComponent instead.

escape examples: escape("中文") returns "%u4E2D%u6587", and escape("abc def") returns "abc%20def", where the space is encoded as %20. Whatever the page's encoding, escape always produces the Unicode form. As noted in section 3, a space in a form parameter is submitted as +; escape does not encode "+", so escape("abc+def") returns "abc+def".
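The escape examples above can be checked directly. This is only a quick sketch: `escape` and `unescape` are deprecated but still present in browsers and Node.js, and the variable names are illustrative.

```javascript
// escape() works on Unicode values, independent of the page encoding.
const escapedChinese = escape("中文");   // "%u4E2D%u6587"
const escapedSpace = escape("abc def");  // "abc%20def"
const escapedPlus = escape("abc+def");   // "abc+def" ("+" is left alone)

// unescape() reverses the transformation.
const restored = unescape(escapedChinese); // "中文"

console.log(escapedChinese, escapedSpace, escapedPlus, restored);
```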

The second function is encodeURI, also a global function. Its purpose is to encode a complete URI using UTF-8. ASCII letters and digits are not encoded, the marks - _ . ! ~ * ' ( ) are not encoded, and characters with special meaning in a URI are not encoded either (such as ; / ? : @ & = + $ , #). All other characters are converted to their UTF-8 bytes and replaced with hexadecimal escape sequences (%xx). The corresponding decoding function is decodeURI.

encodeURI example: encodeURI("测试 http://www.baidu.com/test?v=ab cd+@#") returns "%E6%B5%8B%E8%AF%95%20http://www.baidu.com/test?v=ab%20cd+@#" ("测试" means "test").
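A short sketch of the encodeURI example, with a round trip through decodeURI. The variable names are illustrative, and the input string is the one from the example with "测试" standing for "test".

```javascript
// encodeURI() keeps URI delimiters such as : / ? = + @ # intact.
const input = "测试 http://www.baidu.com/test?v=ab cd+@#";
const encoded = encodeURI(input);
console.log(encoded);
// %E6%B5%8B%E8%AF%95%20http://www.baidu.com/test?v=ab%20cd+@#

// decodeURI() restores the original string.
console.log(decodeURI(encoded) === input); // true
```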

The third function is encodeURIComponent. Unlike encodeURI, it also encodes the characters with special meaning in a URI, such as ; / ? : @ & = + $ , #. The corresponding decoding function is decodeURIComponent.

encodeURIComponent example: encodeURIComponent("测试 http://www.baidu.com/test?v=ab cd+@#") returns "%E6%B5%8B%E8%AF%95%20http%3A%2F%2Fwww.baidu.com%2Ftest%3Fv%3Dab%20cd%2B%40%23".
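Because encodeURIComponent escapes the delimiters too, it is the natural choice for encoding individual parameter values before they go into a URL. The sketch below shows the example call and one possible way to apply it uniformly; the helper `buildQuery` and the parameter names are hypothetical.

```javascript
// encodeURIComponent() also escapes URI delimiters, so it suits single values.
const value = "测试 http://www.baidu.com/test?v=ab cd+@#";
const encodedValue = encodeURIComponent(value);
console.log(encodedValue);
// %E6%B5%8B%E8%AF%95%20http%3A%2F%2Fwww.baidu.com%2Ftest%3Fv%3Dab%20cd%2B%40%23

// Hypothetical helper: encode every key and value before building a query string.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, val]) => encodeURIComponent(key) + "=" + encodeURIComponent(val))
    .join("&");
}

const query = buildQuery({ wd: "中文", note: "a&b=c" });
console.log(query); // wd=%E4%B8%AD%E6%96%87&note=a%26b%3Dc
```

Encoding each value separately keeps characters such as & and = inside a value from being mistaken for separators, whichever browser sends the request.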


6. References

  • URL Encoding
  • URL wiki
  • JavaScript tutorial


What is the role of URL encoding?

"In addition, once Chinese characters are encoded they turn into strange-looking characters, which helps with confidentiality."
That is not the reason.

The reason is that an encoded URL can be used anywhere in the world.
Some operating systems do not support Chinese characters.

What encoding is used for URL transmission?

URL encoding is neither UTF-8 nor GBK.

Rather, it is RFC 1738 encoding (except that encoding the space as a plus sign "+" does not conform to it).

An address like www.baidu.com/..w.b9w.fe is in fact a URL that has been through RFC 1738 encoding.

Under this scheme, every non-alphanumeric character in the URL except - _ . is replaced with a percent sign (%) followed by two hexadecimal digits, and the space is encoded as a plus sign (+).
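The rule just described can be sketched in a few lines. The function name `formStyleEncode` is hypothetical, and the sketch assumes the text is first converted to UTF-8 bytes, which the rule itself does not dictate.

```javascript
// Hypothetical helper following the rule above: keep letters, digits and -_. as is,
// turn a space into "+", and write every other byte as %XX.
function formStyleEncode(text) {
  const bytes = new TextEncoder().encode(text); // assumption: UTF-8 bytes
  let out = "";
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (/[A-Za-z0-9\-_.]/.test(ch)) {
      out += ch;
    } else if (ch === " ") {
      out += "+";
    } else {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(formStyleEncode("a b-c_d.e~!")); // a+b-c_d.e%7E%21
console.log(formStyleEncode("中文"));         // %E4%B8%AD%E6%96%87
```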

Every website we visit correctly parses URLs encoded with the RFC 1738 character set; this has already been laid down by the standards bodies.

DNS servers are generally not involved in this, because DNS is a domain name resolver. As the name suggests, it resolves only the www.baidu.com part, whereas the RFC 1738 character set mostly concerns the parameters that follow.

Whatever character set you use when typing a string into the address bar, it is eventually converted into a URL encoded with the RFC 1738 character set.

We can think of the RFC 1738 character set as an ASCII-like character set that is universal and supported under any character set.

I have read the documents from international organizations on this subject and added some of my own understanding; I am not making this up.
