Cause of the problem:
A customer searched for the keyword "e+h transmitter". The recommended ads on the home page are matched against the keywords the user has previously searched for. The implementation, done by the UED team in JS, reads the H_keys content from the cookie and assembles it into http://xxxxx/advert/ctp_advert.htm?num=4&keyword={keyword}, substituting the keyword taken from the cookie (which may contain Chinese), and then sends a GET request via Ajax.
So the final request is: http://xxxxxx/advert/ctp_advert.htm?num=4&keyword=e+h transmitter. When the server receives the request, the parameter value is "e h transmitter": the + sign is gone. The initial suspicion was that this is related to the URL specification and that the value needs to be URL-encoded.
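The loss of the + sign can be reproduced without a server. The sketch below uses the standard URLSearchParams parser as a stand-in for the server-side parameter decoding (an assumption, since the real server is a Java application); the variable names are illustrative only.

```javascript
// Query string as sent by the page: the + sign is not encoded.
const rawQuery = 'num=4&keyword=e+h transmitter';

// URLSearchParams follows the application/x-www-form-urlencoded rules,
// so a literal + is decoded as a space.
const receivedKeyword = new URLSearchParams(rawQuery).get('keyword');

console.log(receivedKeyword); // "e h transmitter"
```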
Problem Analysis:
I went through the JS encoding functions and eventually found the secret of the + sign.
In HTML, for historical reasons, + is treated as the equivalent of a space. When an HTML form is submitted, each form field is URL-encoded before it is sent, and the URL encoding used by forms does not conform to the latest standard: a space is encoded as +, not as %20. If the form is submitted with the POST method, the HTTP request carries a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard form of URL encoding, which also means that a literal + in a query string is decoded as a space.
I then tried the same thing in a search engine:
Keyword = e h transmitter, URL = http://www.google.cn/search?hl=zh-cn&newwindow=1&q=e+h transmitter (the space is converted to +)
Keyword = e+h transmitter, URL = http://www.google.cn/search?hl=zh-CN&newwindow=1&q=e%2Bh transmitter (the + sign is escaped to %2B, so the program can process it normally)
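The same two cases can be checked locally. This is a minimal sketch that uses URLSearchParams as the form-style encoder; the parameter name q mirrors the search URLs above, and the shortened keywords are chosen only for illustration.

```javascript
// Form-style encoding: a space becomes +, a literal + becomes %2B.
const withSpace = new URLSearchParams({ q: 'e h' }).toString();
const withPlus = new URLSearchParams({ q: 'e+h' }).toString();

console.log(withSpace); // q=e+h
console.log(withPlus);  // q=e%2Bh

// Decoding the second query string restores the original + sign.
const roundTrip = new URLSearchParams(withPlus).get('q');
console.log(roundTrip); // e+h
```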
Problem solving:
Idea 1:
1. To transmit the + sign as-is, without it being decoded as a space, it must be encoded as %2B. Of the JS encoding functions I checked, only encodeURIComponent encodes the + sign.
2. encodeURIComponent always uses the UTF-8 character set, so in theory it is enough to add _input_charset=utf-8 to the original request (parsed by SetLocaleValve in the pipeline) to get the correct "e+h transmitter" on the server.
In practice the results were not as expected. After the client encoded the keyword with JS, the server still decoded it into garbled text. Inspecting the bytes showed that the server was still parsing with GBK: the UTF-8 bytes for the Chinese word for "transmitter" were {-27, -113, -104, -23, -128, -127}, but after being parsed as GBK they became {-27, -113, -104, -23, -63, -63}. The last two bytes do not form a visible character, so both were replaced with -63. An online search shows that this problem appears in certain UTF-8 to GBK conversions (http://lingqi1818.iteye.com/blog/348953).
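A quick comparison of the three built-in JS encoding functions shows how each treats the + sign. The keyword is shortened to e+h for illustration; escape is deprecated but still available in browsers and Node.js.

```javascript
const keyword = 'e+h';

// Apply each built-in encoder to the same input.
const results = {
  escape: escape(keyword),                         // leaves + untouched
  encodeURI: encodeURI(keyword),                   // leaves + untouched
  encodeURIComponent: encodeURIComponent(keyword), // encodes + as %2B
};

console.log(results);
// { escape: 'e+h', encodeURI: 'e+h', encodeURIComponent: 'e%2Bh' }
```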
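The UTF-8 byte values quoted above can be reproduced in JS. The sketch assumes the Chinese text in question begins with the two characters U+53D8 U+9001 (变送); this is inferred from the byte values and is not stated in the original. Bytes are shown as signed values to match the Java-style output.

```javascript
// Assumed Chinese text (变送), inferred from the bytes quoted in the article.
const word = '\u53d8\u9001';

// Encode as UTF-8, then convert each byte to a signed value as Java prints it.
const utf8 = new TextEncoder().encode(word);
const signedBytes = Array.from(utf8, (b) => (b > 127 ? b - 256 : b));

console.log(signedBytes); // [ -27, -113, -104, -23, -128, -127 ]

// encodeURIComponent percent-encodes exactly these UTF-8 bytes.
const encodedWord = encodeURIComponent(word);
console.log(encodedWord); // %E5%8F%98%E9%80%81
```

If the server then decodes these bytes with a different character set such as GBK, byte sequences that are not valid in that character set cannot be mapped back, which is consistent with the garbled result described above.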
Idea 2:
Next I tracked down why _input_charset=utf-8 did not take effect. Debugging showed that SetLocaleValve did set request.setCharacterEncoding to UTF-8. The preliminary suspicion was the JBoss server configuration, so I checked the URIEncoding and useBodyEncodingForURI settings. The company currently uses JBoss 4.05, and the corresponding Tomcat configuration only specifies URIEncoding=GBK. Because of this, the _input_charset setting has no effect on a GET submission: the query string is still parsed as GBK.
1. Consider changing the request from GET to POST so that _input_charset can be used.
However, during implementation and discussion with UED, it turned out that POST would cause a cross-domain request problem, so this approach had to be dropped.
Idea 3 (successful practice):
1. UED implements a pseudo URL encode in which only the + sign is encoded as %2B. Because JS currently has no ready-made function for this, the conversion is done with replace(/\+/g, '%2B').
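A minimal sketch of this workaround follows. The helper name buildAdvertUrl is hypothetical, the host is the placeholder used in this article, and URLSearchParams again stands in for the server-side decoding.

```javascript
// Hypothetical helper: encode only the + sign, leave everything else as-is.
function buildAdvertUrl(keyword) {
  const base = 'http://xxxxx/advert/ctp_advert.htm?num=4&keyword=';
  return base + keyword.replace(/\+/g, '%2B');
}

const advertUrl = buildAdvertUrl('e+h');
console.log(advertUrl);
// http://xxxxx/advert/ctp_advert.htm?num=4&keyword=e%2Bh

// Form-style decoding now yields the keyword with its + sign intact.
const decoded = new URLSearchParams(advertUrl.split('?')[1]).get('keyword');
console.log(decoded); // e+h
```

Note that this replaces only the + sign; other special characters such as & or % in a keyword would still pass through unencoded.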
Summary:
The + sign needs different handling in different business scenarios, as described below:
1. Non-Ajax requests
Use an HTML form with GET or POST directly; the form's URL-encoding protocol converts + to %2B automatically.
2. Ajax requests
* GET request: there is no good option; only Idea 3 works, converting the + sign manually.
* POST request (same application, not cross-domain): use encodeURIComponent together with _input_charset=utf-8 to specify the encoding.
PS: The scenarios above assume that the + sign is a legitimate part of the business data. It is also worth reviewing at the business level whether the + sign needs to be handled at all; if it can be avoided in the business data itself, that is the best solution.
Background knowledge:
URIEncoding and useBodyEncodingForURI
For data submitted in the URL and for form data submitted with GET, setting request.setCharacterEncoding in the JSP that receives the data has no effect. By default, Tomcat 5.0 decodes such data as ISO-8859-1 and does not use the request.setCharacterEncoding value. To resolve this, set the useBodyEncodingForURI or URIEncoding attribute on the Connector element in the Tomcat configuration file.
useBodyEncodingForURI indicates whether URL data and GET form data are decoded using the request.setCharacterEncoding value. It is false by default (in Tomcat 4.0 this behavior is on by default). URIEncoding specifies a single encoding used to decode all GET requests, including data submitted in the URL and form data submitted with GET.
The difference between the two is that URIEncoding applies one decoding uniformly to all GET request data, while useBodyEncodingForURI decodes according to the request.setCharacterEncoding value of the page that handles the request, so different pages can use different encodings. So for URL data and GET form data, either set URIEncoding to the encoding the browser uses, or set useBodyEncodingForURI to true and set request.setCharacterEncoding in the receiving JSP page to the browser's encoding.
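As an illustration, the two attributes sit on the Connector element roughly as follows. This is a hedged config fragment, not the company's actual file: the port and protocol values are assumptions, and only URIEncoding="GBK" reflects the setup described in this article.

```xml
<!-- Option A: decode every GET query string with one fixed encoding. -->
<Connector port="8080" protocol="HTTP/1.1"
           URIEncoding="GBK" />

<!-- Option B: decode GET query strings with the encoding set through
     request.setCharacterEncoding for each request. -->
<Connector port="8080" protocol="HTTP/1.1"
           useBodyEncodingForURI="true" />
```

Only one of the two options would be used for a given Connector.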
Why URL encoding is required
1. Some characters in a URL are ambiguous, such as = and &.
2. URLs are encoded in ASCII, not Unicode, which means a URL cannot contain any non-ASCII characters, such as Chinese.
Which characters need to be encoded
RFC 3986 stipulates that only English letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters are allowed in a URL.
A URL can be divided into several components: scheme, host, path, and so on. The following are the reserved characters in RFC 3986: ! * ' ( ) ; : @ & = + $ , / ? # [ ]
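It can help to see how encodeURIComponent treats these two groups of characters. The sketch below runs it over the reserved list above and over a sample of unreserved characters; the variable names are illustrative.

```javascript
// Reserved characters from RFC 3986, as listed above.
const reserved = "!*'();:@&=+$,/?#[]";
const encodedReserved = encodeURIComponent(reserved);

// encodeURIComponent leaves ! * ' ( ) untouched and escapes the rest.
console.log(encodedReserved);
// !*'()%3B%3A%40%26%3D%2B%24%2C%2F%3F%23%5B%5D

// A sample of unreserved characters passes through unchanged.
const unreserved = 'AZaz09-_.~';
const unreservedUnchanged = encodeURIComponent(unreserved) === unreserved;
console.log(unreservedUnchanged); // true
```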
How to encode an illegal character in a URL
URL encoding is also commonly known as percent-encoding because the scheme is very simple: a percent sign (%) followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set for URL encoding is US-ASCII. For example, the byte for a in US-ASCII is 0x61, so its URL encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching for abc on Google. Likewise, the byte for the @ symbol in ASCII is 0x40, so it is URL-encoded as %40.
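The byte-to-%XX rule is small enough to sketch directly. The helper percentEncodeByte is a hypothetical name introduced here for illustration; decodeURIComponent is the standard built-in.

```javascript
// Hypothetical helper: turn one byte value (0-255) into its %XX form.
function percentEncodeByte(byte) {
  return '%' + byte.toString(16).toUpperCase().padStart(2, '0');
}

const encodedA = percentEncodeByte('a'.charCodeAt(0));  // 0x61
const encodedAt = percentEncodeByte('@'.charCodeAt(0)); // 0x40
console.log(encodedA, encodedAt); // %61 %40

// Decoding the query value from the example URL gives back plain text.
const decodedQuery = decodeURIComponent('%61%62%63');
console.log(decodedQuery); // abc
```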
[story caused by failure] Processing with a plus sign in the URL