Garbled Chinese paths in pseudo-static URL rewriting

Source: Internet
Author: User
Tags: character set, HTTP request, reserved, RFC, urlencode

I. The origin of the problem

A URL is a web address; anyone who uses the Internet uses them.

In general, a URL may contain only English letters, Arabic numerals, and certain punctuation marks; other characters and symbols are not allowed. For example, there is an English-alphabet URL "http://www.abc.com", but no Greek-letter URL "http://www.αβγ.com" (read alpha-beta-gamma.com). This is because the network standard RFC 1738 lays down a hard rule:

"... only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

This means that if a URL contains Chinese characters, they must be encoded before use. The trouble is that RFC 1738 does not specify how to encode them, leaving the decision to the application (the browser). This has made "URL encoding" a confusing area.
Let's look at just how confusing it is. I will analyze four different cases in turn; in each one the browser uses a different URL encoding method. After explaining the differences, I will discuss how to use JavaScript to arrive at a uniform encoding method.

II. Case 1: The URL path contains Chinese characters

Open IE (I am using version 8.0) and enter the URL "http://zh.wikipedia.org/wiki/春节". Note that the two characters 春节 ("Spring Festival") are part of the URL path here.

Looking at the HTTP request headers, you will find that the URL IE actually requests is "http://zh.wikipedia.org/wiki/%E6%98%A5%E8%8A%82". In other words, IE automatically encoded 春节 as "%E6%98%A5%E8%8A%82".

We know that the UTF-8 encodings of 春 ("Spring") and 节 ("Festival") are "E6 98 A5" and "E8 8A 82" respectively, so "%E6%98%A5%E8%8A%82" is simply those bytes in order, each prefixed with %. (For the details of the conversion, see my "Character Encoding Notes".)
Firefox gives the same result. So conclusion 1 is: the URL path is encoded in UTF-8.
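
The byte-by-byte rule described above can be reproduced by hand. The following is a minimal sketch, not the browser's actual implementation; the helper name `utf8PercentEncode` is made up for illustration, and unlike a real browser it encodes every byte, including plain ASCII.

```javascript
// Percent-encode a string by taking its UTF-8 bytes and writing each as %XX.
// Hypothetical helper: a browser would leave unreserved ASCII characters alone.
function utf8PercentEncode(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const encodedPath = utf8PercentEncode("春节"); // 春节 = "Spring Festival"
console.log(encodedPath); // %E6%98%A5%E8%8A%82
```

The output matches the path that IE and Firefox were observed to send.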

III. Case 2: The query string contains Chinese characters

Enter the URL "http://www.baidu.com/s?wd=春节" in IE. Note that the two characters 春节 ("Spring Festival") now belong to the query string, not to the URL path; do not confuse this with case 1.

Looking at the HTTP request headers, you will find that IE has turned 春节 into gibberish.

Switching to a hexadecimal view makes it clear: 春节 has become "B4 BA BD DA".

We know that the GB2312 encodings of 春 ("Spring") and 节 ("Festival") are "B4 BA" and "BD DA" respectively (GB2312 is the default encoding of my operating system, the Chinese edition of Windows XP). So IE is actually sending the query string as raw GB2312 bytes.
Firefox handles it slightly differently. The HTTP header it sends contains "wd=%B4%BA%BD%DA": the same GB2312 encoding, but with a % before each byte.

So conclusion 2 is: the query string is encoded in the operating system's default encoding.
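
To compare what the two browsers send, it helps to turn a percent-encoded value back into a hex dump. This is a small sketch under the assumption that the input is well-formed and otherwise plain ASCII; `percentEncodedToHex` is a hypothetical helper name.

```javascript
// Convert a percent-encoded string into a space-separated hex dump of its bytes.
// Assumes every "%" is followed by two hex digits and all other chars are ASCII.
function percentEncodedToHex(value) {
  const hex = [];
  for (let i = 0; i < value.length; i++) {
    if (value[i] === "%") {
      hex.push(value.slice(i + 1, i + 3).toUpperCase());
      i += 2; // skip the two hex digits just consumed
    } else {
      hex.push(value.charCodeAt(i).toString(16).toUpperCase().padStart(2, "0"));
    }
  }
  return hex.join(" ");
}

console.log(percentEncodedToHex("%B4%BA%BD%DA"));       // B4 BA BD DA (GB2312, 4 bytes)
console.log(percentEncodedToHex("%E6%98%A5%E8%8A%82")); // E6 98 A5 E8 8A 82 (UTF-8, 6 bytes)
```

The first dump is the byte sequence IE sent raw and Firefox sent percent-encoded; the second is the UTF-8 form from case 1.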

IV. Case 3: The URL generated by a GET method contains Chinese characters


The cases above involve typing the URL directly, but more often the HTTP request is issued by a GET or POST method from an already-open web page.
In that case the encoding is determined by the encoding of the web page, that is, by the character set declared in the HTML source:
<meta http-equiv="Content-Type" content="text/html;charset=xxxx">
If the charset at the end of that line is UTF-8, the URL is encoded in UTF-8; if it is GB2312, the URL is encoded in GB2312.
For example, Baidu's pages are GB2312 and Google's are UTF-8, so searching for the same word 春节 ("Spring Festival") from their search boxes produces different query strings.
Baidu generates %B4%BA%BD%DA, which is GB2312.

Google generates %E6%98%A5%E8%8A%82, which is UTF-8.

So conclusion 3 is: the encoding used by GET and POST methods is the encoding of the web page.
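
The practical consequence is that a receiver has to know which encoding was used. As a rough illustration, JavaScript's `decodeURIComponent()` always assumes UTF-8, so it decodes the Google-style string but rejects the Baidu-style one. The wrapper name `tryDecodeAsUtf8` is an assumption made for this sketch.

```javascript
// Try to decode a percent-encoded component as UTF-8.
// decodeURIComponent throws URIError when the bytes are not valid UTF-8.
function tryDecodeAsUtf8(component) {
  try {
    return { ok: true, text: decodeURIComponent(component) };
  } catch (err) {
    if (err instanceof URIError) {
      return { ok: false, text: null };
    }
    throw err; // anything else is unexpected
  }
}

const utf8Result = tryDecodeAsUtf8("%E6%98%A5%E8%8A%82"); // UTF-8 form
const gb2312Result = tryDecodeAsUtf8("%B4%BA%BD%DA");     // GB2312 form

console.log(utf8Result.text); // 春节
console.log(gb2312Result.ok); // false
```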

V. Case 4: The URL in an Ajax call contains Chinese characters


In the first three cases the browser issued the HTTP request; in this last case JavaScript generates the HTTP request, i.e. an Ajax call. Again according to Lu Ruilin's article, IE and Firefox handle this case in completely different ways.
For example, take these two lines of code:
url = url + "?q=" + document.myform.elements[0].value; // suppose the user submitted 春节 ("Spring Festival") in the form
http_request.open('GET', url, true);
Then, no matter what character set the web page uses, IE always sends "q=%B4%BA%BD%DA" to the server, while Firefox always sends "q=%E6%98%A5%E8%8A%82". In other words, in Ajax calls IE always uses GB2312 (the operating system's default encoding) and Firefox always uses UTF-8. That is conclusion 4.

VI. JavaScript function: escape()


That covers the four cases.
If you have followed so far, you probably have a headache by now, because it is all so confusing. Different operating systems, different browsers, and different web page character sets produce completely different encoding results. Having to account for every one of them would be a nightmare for programmers. Is there a way to make sure the client always sends requests to the server in a single encoding?
The answer is to encode the URL with JavaScript before submitting it to the server, leaving the browser no chance to interfere. Because JavaScript's output is always the same, the server is guaranteed to receive data in a uniform format.
JavaScript has three functions for encoding; the oldest is escape(). Although it is now discouraged, it is still used in many places for historical reasons, so it makes sense to start with it.
In fact, escape() cannot be used directly for URL encoding; what it really does is return the Unicode code value of a character. For example, for 春节 ("Spring Festival") it returns %u6625%u8282: in the Unicode character set, 春 is character number 6625 (hexadecimal) and 节 is character number 8282 (hexadecimal).

Its rule is: every character is encoded except ASCII letters, digits, and the punctuation marks "@ * _ + - . /". Characters between U+0000 and U+00FF are converted to the form %xx, and all other characters to the form %uxxxx. The corresponding decoding function is unescape().
So escape() turns "Hello World" into "Hello%20World", because the Unicode value of a space is 20 (hexadecimal).
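
These rules are easy to check in any JavaScript engine that still ships the legacy `escape()` and `unescape()` globals (Node.js and current browsers do). The variable names below are only for illustration.

```javascript
// escape() works on Unicode code values, not on UTF-8 bytes.
const escapedChinese = escape("春节");       // 春节 = "Spring Festival", above U+00FF, so %uXXXX form
const escapedSpace = escape("Hello World");  // space is U+0020, so %XX form
const escapedLatin1 = escape("é");           // U+00E9 is still within U+0000..U+00FF
const untouched = escape("@*_+-./");         // the punctuation escape() leaves alone
const restored = unescape(escapedChinese);   // the matching decoder

console.log(escapedChinese); // %u6625%u8282
console.log(escapedSpace);   // Hello%20World
console.log(escapedLatin1);  // %E9
console.log(untouched);      // @*_+-./
console.log(restored);       // 春节
```

Note that `%u6625` is not a valid percent-encoding in the URL sense, which is why escape() output is unsuitable for URLs.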

Two more points deserve attention.
First, whatever the original encoding of the web page, text becomes Unicode once JavaScript handles it. In other words, the input and output of JavaScript functions are Unicode characters by default. This also applies to the two functions below.

Second, escape() does not encode "+". But when a web page submits a form, spaces are converted to + characters, and the server converts + back to a space when it processes the data. So be careful when using it.
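
The "+" pitfall can be seen with a short experiment. In this sketch, `URLSearchParams` stands in for a server that parses form-style query strings (an assumption; real servers vary), and the parameter name `q` is made up.

```javascript
const value = "1+1";

// escape() leaves "+" alone; encodeURIComponent() turns it into %2B.
const viaEscape = escape(value);
const viaComponent = encodeURIComponent(value);

// Form-style parsing treats a literal "+" as a space.
const seenAfterEscape = new URLSearchParams("q=" + viaEscape).get("q");
const seenAfterComponent = new URLSearchParams("q=" + viaComponent).get("q");

console.log(viaEscape, "->", JSON.stringify(seenAfterEscape));       // 1+1 -> "1 1"
console.log(viaComponent, "->", JSON.stringify(seenAfterComponent)); // 1%2B1 -> "1+1"
```

With escape() the plus sign silently becomes a space on the receiving side; with encodeURIComponent() the original value survives.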

VII. JavaScript function: encodeURI()


encodeURI() is the JavaScript function that is actually meant for encoding URLs.
It encodes an entire URL, so besides the common symbols it also leaves unencoded the symbols that have a special meaning in a URL: "; / ? : @ & = + $ , #". Everything else is output in its UTF-8 form, with a % before each byte.

Its corresponding decoding function is decodeURI().

Note that it does not encode single quotes (').
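
A quick check of this behavior, using the Wikipedia URL from case 1 with a made-up query string and fragment (`from=a&to=b#top` is purely illustrative):

```javascript
// encodeURI() keeps URL structure characters and encodes the rest as UTF-8.
const fullUrl = "http://zh.wikipedia.org/wiki/春节?from=a&to=b#top"; // 春节 = "Spring Festival"
const encodedUrl = encodeURI(fullUrl);
const roundTrip = decodeURI(encodedUrl);
const withQuote = encodeURI("it's"); // the single quote is left untouched

console.log(encodedUrl);
// http://zh.wikipedia.org/wiki/%E6%98%A5%E8%8A%82?from=a&to=b#top
console.log(roundTrip === fullUrl); // true
console.log(withQuote);             // it's
```

Only the Chinese characters change; ":", "/", "?", "&", "=" and "#" pass through so the URL keeps its structure.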

VIII. JavaScript function: encodeURIComponent()


The last JavaScript encoding function is encodeURIComponent(). Unlike encodeURI(), it is meant for encoding the individual components of a URL, not the URL as a whole.
Therefore the symbols "; / ? : @ & = + $ , #", which encodeURI() leaves alone, are all encoded by encodeURIComponent(). The encoding method itself is the same for both.

Its corresponding decoding function is decodeURIComponent().
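
One way to apply this when building a request URL is to encode every key and value separately and only then join them. The sketch below is an illustration, not a prescribed method; `buildSearchUrl`, the example.com address, and the parameter names are all hypothetical.

```javascript
// Build a query string by encoding each component on its own.
function buildSearchUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, val]) => encodeURIComponent(key) + "=" + encodeURIComponent(val))
    .join("&");
  return base + "?" + query;
}

const reserved = ";/?:@&=+$,#";
const keptByEncodeURI = encodeURI(reserved);              // unchanged
const encodedByComponent = encodeURIComponent(reserved);  // every symbol encoded

const searchUrl = buildSearchUrl("http://www.example.com/search", {
  q: "春节",      // "Spring Festival"
  note: "a&b=c", // contains characters that would otherwise split the query
});

console.log(keptByEncodeURI);    // ;/?:@&=+$,#
console.log(encodedByComponent); // %3B%2F%3F%3A%40%26%3D%2B%24%2C%23
console.log(searchUrl);
// http://www.example.com/search?q=%E6%98%A5%E8%8A%82&note=a%26b%3Dc
```

Because the value is always sent as UTF-8 percent-escapes, the result no longer depends on the browser, the operating system, or the page charset.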

In practice the URL is encoded too, and Baidu and Google interpret URLs with different encodings: Google uses UTF-8 and Baidu uses GB2312. That makes life painful for anyone whose URLs contain Chinese. After some research, I found that both Apache and ISAPI_Rewrite interpret URLs as UTF-8.

PS: Character encoding conversion process: GBK, GB2312 >> Unicode >> UTF-8

First, when you type Chinese into the browser address bar (the browser converts it automatically):
1) URL path: UTF-8 format
2) URL parameters: GBK format
3) Request.QueryString: decided by the page's own meta charset, UTF-8 or GBK (for example: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)
4) Server.URLEncode: UTF-8 format

For example:
搞笑视频 ("funny video"), GBK encoding: %B8%E3%D0%A6%CA%D3%C6%B5
搞笑视频 ("funny video"), UTF-8 encoding: %e6%90%9e%e7%ac%91%e8%a7%86%e9%a2%91
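
The two strings differ in length because GBK uses two bytes per Chinese character and UTF-8 uses three. A small sketch makes the difference measurable; `countEncodedBytes` is a hypothetical helper that assumes any character outside a %XX escape is single-byte ASCII.

```javascript
// Count the bytes represented by a percent-encoded string.
function countEncodedBytes(value) {
  // Collapse each %XX escape to one placeholder character, then measure.
  return value.replace(/%[0-9A-Fa-f]{2}/g, "x").length;
}

const gbkForm = "%B8%E3%D0%A6%CA%D3%C6%B5";
const utf8Form = "%e6%90%9e%e7%ac%91%e8%a7%86%e9%a2%91";

console.log(countEncodedBytes(gbkForm));  // 8  (4 characters x 2 bytes)
console.log(countEncodedBytes(utf8Form)); // 12 (4 characters x 3 bytes)

// JavaScript can only decode the UTF-8 form directly.
const decoded = decodeURIComponent(utf8Form);
console.log(decoded); // 搞笑视频
```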

The browser converts automatically:
http://www.your-domain-name.com/tag.asp?t=搞笑视频
http://www.your-domain-name.com/tag.asp?t=%b8%e3%d0%a6%ca%d3%c6%b5
http://www.your-domain-name.com/tag/搞笑视频
http://www.your-domain-name.com/tag/%e6%90%9e%e7%ac%91%e8%a7%86%e9%a2%91

Second, when the page is encoded as UTF-8 (<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />):

1) Addresses that do not come out garbled:
http://www.your-domain-name.com/tag/搞笑视频
http://www.your-domain-name.com/tag/%e6%90%9e%e7%ac%91%e8%a7%86%e9%a2%91

2) Addresses that come out garbled:
http://www.your-domain-name.com/tag.asp?t=搞笑视频
http://www.your-domain-name.com/tag.asp?t=%b8%e3%d0%a6%ca%d3%c6%b5

PS: The garbling occurs because, when the rewrite is performed, the parameter is converted to Unicode format by default (the input parameter is expected in UTF-8 format and the output parameter is in GBK format); when it is then passed to a web page that decodes it as UTF-8, the result is of course garbled.

Solutions:
1. In the receiving page, take the GBK-format parameter and convert it from GBK to UTF-8 in code, i.e. without adding the NU flag.
2. Do the conversion inside the rewrite, so that the output parameter is UTF-8 rather than GBK, i.e. add the NU flag (the second method is obviously more convenient).
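
For the first solution, the receiving code first has to tell which encoding a parameter arrived in. One common heuristic, sketched here in JavaScript rather than in the ASP pages the article tested, is to check whether the bytes form valid UTF-8 and otherwise assume GBK. The function names are hypothetical, and the heuristic is not foolproof: some GBK byte sequences also happen to be valid UTF-8. It assumes a runtime whose `TextDecoder` supports the `fatal` option (current browsers and default Node.js builds do).

```javascript
// Turn a percent-encoded value into raw bytes.
function percentDecodeToBytes(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    const pair = value.slice(i + 1, i + 3);
    if (value[i] === "%" && /^[0-9A-Fa-f]{2}$/.test(pair)) {
      bytes.push(parseInt(pair, 16));
      i += 2; // skip the hex digits just consumed
    } else {
      bytes.push(value.charCodeAt(i) & 0xff); // assumes plain ASCII otherwise
    }
  }
  return Uint8Array.from(bytes);
}

// Guess the encoding: valid UTF-8 wins, anything else is assumed to be GBK.
function guessEncoding(value) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(percentDecodeToBytes(value));
    return "utf-8";
  } catch (err) {
    return "gbk";
  }
}

console.log(guessEncoding("%e7%ac%91%e8%af%9d")); // utf-8
console.log(guessEncoding("%d0%a6%bb%b0"));       // gbk
```

Once the encoding is known, the actual GBK-to-UTF-8 conversion would be done with whatever decoder the server platform provides.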

First, the process when Chinese is passed in and the rewrite rule has no NU flag:

(figure not preserved)

Second, the process when Chinese is passed in and the rewrite rule has the NU flag:

(figure not preserved)

Here are the test pages, using the word 笑话 ("joke"):
GBK: %d0%a6%bb%b0
UTF-8: %e7%ac%91%e8%af%9d

1. Without NU:
Page meta = UTF-8 (required length: 6)
test1.asp?q=%d0%a6%bb%b0           Garbled (ц): length is 4, too short
test1.asp?q=%e7%ac%91%e8%af%9d     Normal: length is 6

Page meta = GBK (required length: 4)
test2.asp?q=%d0%a6%bb%b0           Normal: length is 4
test2.asp?q=%e7%ac%91%e8%af%9d     Garbled (绗): length is 6, too long

Page meta = UTF-8 (required length: 6)
test1/%d0%a6%bb%b0                 Garbled (blank): length not *2
test1/%e7%ac%91%e8%af%9d           Garbled (ц): too short (the input parameter is expected in UTF-8 format, the output parameter is in GBK format)

Page meta = GBK (required length: 4)
test2/%d0%a6%bb%b0                 Garbled (ц): too short
test2/%e7%ac%91%e8%af%9d           Normal (the input parameter is expected in UTF-8 format, the output parameter is in GBK format)

2. With NU:
Page meta = UTF-8 (required length: 6)
test1.asp?q=%d0%a6%bb%b0           Garbled (ц): too short
test1.asp?q=%e7%ac%91%e8%af%9d     Normal

Page meta = GBK (required length: 4)
test2.asp?q=%d0%a6%bb%b0           Normal
test2.asp?q=%e7%ac%91%e8%af%9d     Garbled (绗): too long

Page meta = UTF-8 (required length: 6)
test1/%d0%a6%bb%b0                 Garbled (ц)
test1/%e7%ac%91%e8%af%9d           Normal

Page meta = GBK (required length: 4)
test2/%d0%a6%bb%b0                 Normal
test2/%e7%ac%91%e8%af%9d           Garbled (绗)

Conclusions:
1) Whether the .htaccess file itself is saved as UTF-8 or GBK makes no difference to the result.
2) The NU flag in a rewrite rule affects only the URL path before the question mark (before rewriting); it has no effect on the arguments after the question mark.
3) Rewrite with NU: the parameters are passed through as they are, without any conversion.
4) Rewrite without NU: the default UTF conversion is applied, turning the parameters into Unicode format, after which the page can read them directly with Request.QueryString. (PS: character encoding conversion process: GBK, GB2312 >> Unicode >> UTF-8)

1. Receiving page meta = UTF-8, arguments after the question mark: unaffected by the NU flag; raw Chinese does not work, the value must be UTF-8 encoded.
2. Receiving page meta = GBK, arguments after the question mark: unaffected by the NU flag; raw Chinese does not work, the value must be percent-encoded in GBK, as the test results above show.
3. Receiving page meta = UTF-8, URL path before the question mark: the NU flag is required; either raw Chinese (which the browser automatically encodes as UTF-8) or UTF-8 encoding works.
4. Receiving page meta = GBK, URL path before the question mark: with the NU flag, use GBK encoding; without it, use UTF-8 encoding or raw Chinese (which the browser automatically encodes as UTF-8).

Summary

In the end I simply used PHP's urlencode() function to encode the values; they are then decoded automatically when accessed via GET, which also solved the garbled Chinese path problem.
