Garbled characters when URL rewriting (pseudo-static URLs) is applied to Chinese paths

Source: Internet
Author: User

When pseudo-static (rewritten) URLs contain Chinese characters directly, search engines handle them poorly and garbled characters often appear. Sometimes a link works when followed from Google but is garbled from Baidu; sometimes, in the Firefox browser, it is garbled from every search engine. To get to the bottom of this, here is a summary of my analysis.

I. Origin of the problem.

A URL is a web address; anyone who goes online uses one.

Generally, a URL may contain only English letters, Arabic numerals, and certain punctuation marks. For example, there is a URL made of English letters, "http://www.abc.com", but none made of Greek letters such as "http://www.αβγ.com" (read as alpha-beta-gamma.com). This is because the network standard RFC 1738 lays down a hard rule:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."
In other words, only letters and digits [0-9a-zA-Z], the special symbols $-_.+!*'(), (without the quotation marks), and reserved characters may be used in a URL without encoding.

This means that if a URL contains Chinese characters, they must be encoded before use. The trouble is that RFC 1738 does not specify a particular encoding method; it leaves the decision to the application (the browser). As a result, "URL encoding" has become a chaotic area.
Let's look at how chaotic it is. I will analyze four cases in turn; in each one the browser encodes the URL differently. After explaining the differences, I will show how to use JavaScript to get a single, uniform encoding.

II. Case 1: the URL path contains Chinese characters.

Open IE (I used version 8.0) and enter the URL "http://zh.wikipedia.org/wiki/春节". Note that 春节 ("Spring Festival") is part of the URL path.

Looking at the headers of the HTTP request, you will find that the URL IE actually requests is "http://zh.wikipedia.org/wiki/%E6%98%A5%E8%8A%82". In other words, IE automatically encodes 春节 as "%E6%98%A5%E8%8A%82".

The UTF-8 encodings of 春 and 节 are "E6 98 A5" and "E8 8A 82" respectively, so "%E6%98%A5%E8%8A%82" is obtained by putting a % before each byte in order. (For the transcoding details, see my notes on character encoding.)
Testing in Firefox gives the same result. So, Conclusion 1: the URL path is encoded in UTF-8.
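As a rough illustration of Conclusion 1, the sketch below percent-encodes the UTF-8 bytes of 春节 ("Spring Festival") by hand, putting a % before each byte as described above. The function name percentEncodeUtf8 is made up for this example; TextEncoder is the standard API found in modern browsers and Node.js, which is assumed here.

```javascript
// Hypothetical helper: put "%" before every UTF-8 byte of the input.
// Unlike a browser, it also encodes ASCII characters, to keep the sketch simple.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeUtf8("春节")); // %E6%98%A5%E8%8A%82
```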

III. Case 2: the query string contains Chinese characters.

Enter the URL "http://www.baidu.com?wd=春节". Note that 春节 is now in the query string, not in the URL path; do not confuse this with Case 1.

Looking at the headers of the HTTP request, IE has turned 春节 into what looks like garbage.

Switching to hexadecimal view shows that 春节 has been converted to "B4 BA BD DA".

The GB2312 encodings of 春 and 节 are "B4 BA" and "BD DA" respectively (GB2312 is the default encoding of my operating system, the Chinese version of Windows XP). So IE actually sends the query string as raw GB2312 bytes.
Firefox behaves slightly differently. The HTTP header it sends contains "wd=%B4%BA%BD%DA". That is, it also uses GB2312, but adds a % before each byte.

So, Conclusion 2: the query string is encoded in the operating system's default encoding.

IV. Case 3: the URL generated by the GET method contains Chinese characters.

The cases above describe typing a URL directly into the address bar. More often, though, an HTTP request is issued with the GET or POST method from a page that is already open.
In that case the encoding is determined by the encoding of the web page, that is, by the character set declared in the HTML source:
<meta http-equiv="Content-Type" content="text/html; charset=xxxx">
If the charset at the end of that line is UTF-8, the URL is encoded in UTF-8; if it is GB2312, the URL is encoded in GB2312.
For example, Baidu's pages are GB2312 and Google's are UTF-8. So searching for the same word, 春节, in their search boxes produces different query strings.
Baidu generates %B4%BA%BD%DA, which is GB2312.

Google generates %E6%98%A5%E8%8A%82, which is UTF-8.

So, Conclusion 3: the GET and POST methods use the encoding of the web page.
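One practical consequence is that a receiver cannot assume every percent-encoded value is UTF-8. The sketch below is a heuristic check, not a full charset detector: it relies on decodeURIComponent() throwing a URIError when the bytes are not valid UTF-8. The helper name is hypothetical, and some GB2312 byte sequences can happen to be valid UTF-8, so a "true" result is only a hint.

```javascript
// Hypothetical helper: true if the percent-encoded text decodes as UTF-8.
function isUtf8PercentEncoded(encoded) {
  try {
    decodeURIComponent(encoded); // throws URIError on byte sequences that are not valid UTF-8
    return true;
  } catch (err) {
    if (err instanceof URIError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(isUtf8PercentEncoded("%E6%98%A5%E8%8A%82")); // true  (UTF-8 form of 春节)
console.log(isUtf8PercentEncoded("%B4%BA%BD%DA"));       // false (GB2312 form of 春节)
```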

V. Case 4: the URL called by Ajax contains Chinese characters.

In the first three cases the browser itself issues the HTTP request. In the last case the request is generated by JavaScript, that is, by an Ajax call. According to Mr. Lu Rilin's article, IE and Firefox handle this case completely differently.
For example, take these two lines of code:
url = url + "?q=" + document.myform.elements[0].value; // assume the value the user submitted in the form is 春节
http_request.open('GET', url, true);
Then, whatever character set the web page uses, IE always sends "q=%B4%BA%BD%DA" to the server, while Firefox always sends "q=%E6%98%A5%E8%8A%82". That is, in Ajax calls IE always uses GB2312 (the operating system's default encoding), while Firefox always uses UTF-8. This is Conclusion 4.

VI. JavaScript function: escape().

That covers all four cases.
If you have followed so far, you probably have a headache by now, because it is such a mess. Different operating systems, browsers, and page character sets produce completely different encodings. Having to account for every combination would be dreadful. Is there a way to make sure the client sends requests to the server using just one encoding?
Yes: encode the URL with JavaScript first, then submit it to the server, so the browser gets no chance to intervene. Because JavaScript's output is always the same, the server receives data in a consistent format.
JavaScript has three functions for encoding. The oldest is escape(). Although its use is no longer recommended, it is still found in many places for historical reasons, so it is worth starting there.
Strictly speaking, escape() cannot be used directly for URL encoding; what it really does is return the Unicode code value of a character. For 春节, for example, it returns %u6625%u8282. That is, in the Unicode character set 春 is character number 6625 (hexadecimal) and 节 is character number 8282 (hexadecimal).

The exact rule is that every character is encoded except ASCII letters, digits, and the punctuation marks @ * _ + - . /. Other characters between u0000 and u00ff are converted to the form %xx, and the rest to the form %uxxxx. The corresponding decoding function is unescape().
So the escape() encoding of "Hello World" is "Hello%20World", because the Unicode value of a space is 20 (hexadecimal).

Two more points are worth noting.
First, whatever the original encoding of the web page, once text has been encoded by JavaScript it is treated as Unicode characters. That is, the input and output of JavaScript functions are Unicode by default. This also applies to the two functions below.

Second, escape() does not encode "+". But when a web page submits a form, any space is converted to a + character, and the server treats a plus sign as a space when it processes the data. So take care when using it.
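The behaviour described in this section can be checked with a few calls. This is only a quick demonstration, assuming a browser console or Node.js, where the deprecated escape() and unescape() are still available; 春节 is the "Spring Festival" example used above.

```javascript
// escape() is deprecated; shown here only because older code still relies on it.
console.log(escape("春节"));           // %u6625%u8282 (characters above u00ff become %uxxxx)
console.log(escape("é"));              // %E9 (non-ASCII characters up to u00ff become %xx)
console.log(escape("Hello World"));    // Hello%20World
console.log(escape("a+b@c*d_e-f.g/h")); // unchanged: @ * _ + - . / are never encoded
console.log(unescape("%u6625%u8282")); // 春节
```

Because "+" passes through untouched, a literal plus sign sent this way may be read as a space by the server.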

VII. JavaScript function: encodeURI().

encodeURI() is the function JavaScript actually provides for encoding URLs.
It encodes a whole URL, so in addition to the common symbols it also leaves unencoded the symbols that have a special meaning in a URL: ; / ? : @ & = + $ , #. Everything it does encode is output as UTF-8 with a % before each byte.

The corresponding decoding function is decodeURI().

Note that it does not encode single quotation marks.
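A short demonstration of these rules, using the Wikipedia address from Case 1 with 春节 ("Spring Festival") in the path. It is a sketch of the built-in behaviour only and runs in any browser console or Node.js.

```javascript
// The URL structure (scheme, slashes) survives; only the Chinese characters are encoded.
const pageUrl = "http://zh.wikipedia.org/wiki/春节";
console.log(encodeURI(pageUrl)); // http://zh.wikipedia.org/wiki/%E6%98%A5%E8%8A%82

// Reserved symbols and the single quotation mark are left as they are.
console.log(encodeURI(";/?:@&=+$,#'")); // ;/?:@&=+$,#'

console.log(decodeURI("%E6%98%A5%E8%8A%82")); // 春节
```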

VIII. JavaScript function: encodeURIComponent().

The last JavaScript encoding function is encodeURIComponent(). It differs from encodeURI() in that it is meant for encoding an individual component of a URL rather than the whole URL.
So the symbols ; / ? : @ & = + $ , #, which encodeURI() leaves alone, are all encoded by encodeURIComponent(). The encoding method itself is the same as for encodeURI().

The corresponding decoding function is decodeURIComponent().
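To tie this back to the Ajax case, here is a minimal sketch of encoding a parameter before the request URL is built, so that the browser's own choice of encoding no longer matters. The helper withQueryParam and the address www.example.com are placeholders invented for this example, and the helper assumes the base URL has no query string yet.

```javascript
// Hypothetical helper: append one query parameter to a base URL.
// Assumes baseUrl has no query string or fragment yet.
function withQueryParam(baseUrl, name, value) {
  return baseUrl + "?" + encodeURIComponent(name) + "=" + encodeURIComponent(value);
}

// Every symbol that encodeURI() leaves alone is encoded here.
console.log(encodeURIComponent(";/?:@&=+$,#")); // %3B%2F%3F%3A%40%26%3D%2B%24%2C%23

console.log(withQueryParam("http://www.example.com/search", "q", "春节"));
// http://www.example.com/search?q=%E6%98%A5%E8%8A%82
```

The resulting string is plain ASCII, so it can be passed to an Ajax open() call unchanged, and the server always receives UTF-8.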

So URLs really are encoded, and Baidu and Google recognize URLs in different encodings: Google uses UTF-8, while Baidu uses GB2312. That makes life painful for anyone whose URLs contain Chinese characters. After reading up on it, I found that rewritten URLs are handled as UTF-8, whether under Apache or ISAPI_Rewrite.

PS: the character encoding conversion goes GBK/GB2312 > Unicode > UTF-8.

First, when Chinese characters are typed into the browser's address bar (the browser converts them automatically):
1) URL path: UTF-8
2) URL parameters: GBK
3) Request.QueryString: follows the page's meta charset, UTF-8 or GBK (for example, <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>)
4) Server.UrlEncode: UTF-8

For example:
搞笑视频 ("funny video"), GBK encoded: %B8%E3%D0%A6%CA%D3%C6%B5
搞笑视频 ("funny video"), UTF-8 encoded: %E6%90%9E%E7%AC%91%E8%A7%86%E9%A2%91
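The two encoded forms above stand for the same four characters, 搞笑视频 ("funny video"), in different byte encodings. The sketch below shows one way to decode such a value when the charset is known. The helper names are hypothetical, and the GBK branch assumes a runtime whose TextDecoder supports the "gbk" label (browsers and Node.js builds with full ICU data usually do), so it is wrapped in try/catch.

```javascript
// Hypothetical helper: turn "%B8%E3..." into the raw bytes it stands for.
// Assumes the input consists only of %XX triplets.
function percentDecodeToBytes(encoded) {
  const hexPairs = encoded.match(/%[0-9A-Fa-f]{2}/g) || [];
  return Uint8Array.from(hexPairs, (pair) => parseInt(pair.slice(1), 16));
}

// Decode the bytes with an explicit charset label. fatal: true makes invalid
// input throw instead of silently producing replacement characters.
function decodePercentEncoded(encoded, charset) {
  return new TextDecoder(charset, { fatal: true }).decode(percentDecodeToBytes(encoded));
}

console.log(decodePercentEncoded("%E6%90%9E%E7%AC%91%E8%A7%86%E9%A2%91", "utf-8")); // 搞笑视频

try {
  console.log(decodePercentEncoded("%B8%E3%D0%A6%CA%D3%C6%B5", "gbk"));
} catch (err) {
  console.log("GBK decoding is not available in this runtime:", err.message);
}
```

Decoding the GBK bytes with the "utf-8" label throws, which is the strict counterpart of the garbled output a lenient decoder would display.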

The browser converts automatically (each typed address is followed by what is actually sent):
http://www.yourdomain.com/tag.asp?t=搞笑视频
http://www.yourdomain.com/tag.asp?t=%B8%E3%D0%A6%CA%D3%C6%B5
http://www.yourdomain.com/tag/搞笑视频
http://www.yourdomain.com/tag/%E6%90%9E%E7%AC%91%E8%A7%86%E9%A2%91

Second, when the page encoding is UTF-8 (<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>):

1) Addresses that are not garbled:
http://www.yourdomain.com/tag/搞笑视频
http://www.yourdomain.com/tag/%E6%90%9E%E7%AC%91%E8%A7%86%E9%A2%91

2) Addresses that are garbled:
http://www.yourdomain.com/tag.asp?t=搞笑视频
http://www.yourdomain.com/tag.asp?t=%B8%E3%D0%A6%CA%D3%C6%B5

PS: Why the garbling occurs: during rewriting, parameters are converted to Unicode by default (the input parameter must be UTF-8 and the output parameter is GBK), so when the page then decodes them as UTF-8 the result is garbled.

Solution:
1. After the page receives a GBK parameter, convert it from GBK to UTF-8 in code; that is, do not add the NU flag.
2. During the rewrite's internal conversion, change the output parameter from GBK to UTF-8; that is, add the NU flag (the second method is obviously more convenient).

First, the process of passing Chinese when the rewrite rule has no NU flag:

Second, the process of passing Chinese when the rewrite rule has the NU flag:

The following is the test page. Test word: 笑话 ("joke"):
GBK: %D0%A6%BB%B0
UTF-8: %E7%AC%91%E8%AF%9D
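The "required length" figures in the results below refer to byte counts: the two-character test word 笑话 ("joke") takes 4 bytes in GBK and 6 bytes in UTF-8. As a small sketch, the hypothetical helper below counts the bytes that a percent-encoded value represents, assuming every character outside a %XX triplet is a single ASCII byte.

```javascript
// Hypothetical helper: count the bytes a percent-encoded value represents.
// Each %XX triplet is one byte; any other character is assumed to be one ASCII byte.
function encodedByteLength(encoded) {
  return encoded.replace(/%[0-9A-Fa-f]{2}/g, "x").length;
}

console.log(encodedByteLength("%D0%A6%BB%B0"));       // 4 (GBK form)
console.log(encodedByteLength("%E7%AC%91%E8%AF%9D")); // 6 (UTF-8 form)
console.log(new TextEncoder().encode("笑话").length);  // 6 (UTF-8 bytes of the raw text)
```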

1. Without NU:

Page meta = UTF-8 (required length: 6)
test1.asp?q=%D0%A6%BB%B0            garbled: length 4, insufficient length
test1.asp?q=%E7%AC%91%E8%AF%9D      normal: length 6

Page meta = GBK (required length: 4)
test2.asp?q=%D0%A6%BB%B0            normal: length 4
test2.asp?q=%E7%AC%91%E8%AF%9D      garbled: length 6, too long

Page meta = UTF-8 (required length: 6)
test1/%D0%A6%BB%B0                  garbled (blank): insufficient length * 2
test1/%E7%AC%91%E8%AF%9D            garbled: insufficient length (the input parameter must be UTF-8 and the output parameter is GBK)

Page meta = GBK (required length: 4)
test2/%D0%A6%BB%B0                  garbled: insufficient length
test2/%E7%AC%91%E8%AF%9D            normal (the input parameter must be UTF-8 and the output parameter is GBK)

2. With NU:

Page meta = UTF-8 (required length: 6)
test1.asp?q=%D0%A6%BB%B0            garbled: insufficient length
test1.asp?q=%E7%AC%91%E8%AF%9D      normal

Page meta = GBK (required length: 4)
test2.asp?q=%D0%A6%BB%B0            normal
test2.asp?q=%E7%AC%91%E8%AF%9D      garbled: too long

Page meta = UTF-8 (required length: 6)
test1/%D0%A6%BB%B0                  garbled
test1/%E7%AC%91%E8%AF%9D            normal

Page meta = GBK (required length: 4)
test2/%D0%A6%BB%B0                  normal
test2/%E7%AC%91%E8%AF%9D            garbled

Conclusion:
1) The .htaccess file may be in UTF-8 or GBK format.
2) The NU flag only affects the URL path before the question mark (prior to rewriting); it has no effect on the parameters after the question mark.
3) With NU: the parameter is passed through as is, with no encoding conversion along the way.
4) Without NU: the default UTF encoding function is called, the parameter is transcoded to Unicode, and the page then reads it directly with Request.QueryString. (PS: the conversion goes GBK/GB2312 > Unicode > UTF-8.)

1. Receiving page is meta = UTF-8, parameter after the question mark: unaffected by NU. Chinese may not be passed directly; it must be UTF encoded. (X: incorrect)
2. Receiving page is meta = GBK, parameter after the question mark: unaffected by NU. Chinese may not be passed directly; it must be UTF encoded. (X: incorrect)
3. Receiving page is meta = UTF-8, URL path before the question mark: NU is required; Chinese can be passed directly (the browser UTF-encodes it automatically) or UTF encoded.
4. Receiving page is meta = GBK, URL path before the question mark: with NU, GBK encoding is required; without NU, use UTF encoding or pass Chinese directly (the browser UTF-encodes it automatically).

Summary

In the end I handled this in PHP by calling the urlencode() function directly. Values read through GET are then decoded automatically, which solved the garbled Chinese path problem.

