Handling the plus sign (+) in a URL

Source: Internet
Author: User
Tags: jboss, jboss server

Cause of the problem:
A customer ordered the keyword "e+h transmitter". The recommended-ads block on the home page looks up matching ads based on the keywords the user has searched for. In the implementation, the UED team's JS reads the H_keys value from the cookie and assembles the URL http://xxxxx/advert/ctp_advert.htm?num=4&keyword={keyword}, where {keyword} is the keyword (possibly Chinese) taken from the cookie, and then issues a GET request via Ajax.

So the final request is: http://xxxxxx/advert/ctp_advert.htm?num=4&keyword=e+h transmitter. When the server receives the request, the parameter value is "e h transmitter": the + sign is gone. The initial suspicion was that this has to do with the URL specification and that the value needs to be URL-encoded.
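The behaviour described above can be reproduced with the standard URLSearchParams parser, which applies the same form-decoding rule that servers commonly apply to query strings. This is only a minimal sketch; the helper name readKeyword is made up for illustration and is not part of the original code.

```javascript
// Parse a raw query string the way form decoding does:
// "+" is turned into a space before percent-decoding.
function readKeyword(rawQuery) {
  return new URLSearchParams(rawQuery).get("keyword");
}

// The "+" in the raw query arrives as a space.
console.log(JSON.stringify(readKeyword("num=4&keyword=e+h transmitter")));

// A percent-encoded plus (%2B) survives as a literal "+".
console.log(JSON.stringify(readKeyword("num=4&keyword=e%2Bh transmitter")));
```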

Problem Analysis:

Reading up on how JS encodes URLs finally revealed the secret of the + sign.
In HTML, for historical reasons, + is treated as the equivalent of a space. When an HTML form is submitted, each form field is URL-encoded before it is sent, and the URL encoding used by forms does not follow the latest standard: a space is encoded as +, not as %20. If the form is submitted with the POST method, the HTTP headers contain a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard variant of URL encoding.
A quick experiment in a search engine:
Keyword = e h transmitter, url = http://www.google.cn/search?hl=zh-cn&newwindow=1&q=e+h transmitter (the space is converted to +)
Keyword = e+h transmitter, url = http://www.google.cn/search?hl=zh-cn&newwindow=1&q=e%2bh transmitter (the + sign is escaped to %2B and the program handles it correctly)

Problem solving:

Idea 1:
1. To transmit the + sign as-is instead of having it decoded as a space, it must be encoded as %2B. Of the available encoding functions, only encodeURIComponent encodes the + sign.
2. encodeURIComponent uses the UTF-8 character set by default, so in theory it should be enough to add _input_charset=utf-8 to the original request (it is parsed by SetLocaleValve in the pipeline) to get the correct "e+h transmitter".
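As a quick check of step 1, the three built-in encoders can be run on the keyword from this article. This is an illustrative sketch; the variable names are arbitrary.

```javascript
// Compare how the three built-in encoders treat "+" and the space.
const sample = "e+h transmitter";

const results = {
  escape: escape(sample),                         // leaves "+" untouched
  encodeURI: encodeURI(sample),                   // leaves "+" untouched
  encodeURIComponent: encodeURIComponent(sample), // encodes "+" as %2B
};

console.log(results);
```

Only the encodeURIComponent result contains %2B; the other two keep the literal +, which the server then reads as a space.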

In practice the result was not as expected. After the client encoded the keyword with JS, the server still produced garbled text when parsing it. Checking the bytes showed that the server was still parsing with GBK: the UTF-8 bytes of the Chinese for "transmitter" were {-27,-113,-104,-23,-128,-127}, but after being parsed as GBK they became {-27,-113,-104,-23,-63,-63}. The last two bytes do not form a visible character, so both were replaced by -63. A search online confirmed that a UTF-8 to GBK to UTF-8 round trip can go wrong like this in certain cases (http://lingqi1818.iteye.com/blog/348953).
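The six UTF-8 bytes quoted above are those of the two characters 变送, presumably the start of the Chinese keyword. The sketch below reproduces them as signed bytes, the way Java prints a byte[]; the helper name utf8SignedBytes is hypothetical. It does not attempt the GBK decoding step.

```javascript
// Encode text as UTF-8 and view the bytes as signed 8-bit integers,
// which is how a Java byte[] is displayed.
function utf8SignedBytes(text) {
  return Array.from(new Int8Array(new TextEncoder().encode(text)));
}

console.log(utf8SignedBytes("变送"));     // the six bytes listed in the text
console.log(encodeURIComponent("变送")); // the same bytes, percent-encoded
```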

Idea 2:
Continuing to track down why _input_charset=utf-8 did not take effect: debugging showed that SetLocaleValve did call request.setCharacterEncoding with UTF-8. The next suspect was the JBoss server configuration, so the URIEncoding and useBodyEncodingForURI settings were checked. The company currently uses JBoss 4.05, and the corresponding Tomcat configuration only specifies URIEncoding=GBK. Because of this, setting _input_charset has no effect on GET submissions; the parameters are still parsed as GBK.

1. Consider changing the request from GET to POST so that _input_charset can take effect.

However, while discussing the implementation with UED, it turned out that POST would run into a cross-domain request problem, so this approach had to be dropped.

Idea 3 (successful practice):

1. UED implements a pseudo URL encoding in which only the + sign is encoded, as %2B. Since JS currently has no ready-made function for this, it is done with replace(/\+/g, '%2B').
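A minimal sketch of this workaround follows. The function names escapePlus and buildAdvertUrl are hypothetical, and the URL is the placeholder used earlier in this article, not a real endpoint.

```javascript
// Replace every literal "+" with its percent-encoded form and leave
// everything else (including Chinese characters) exactly as it is.
function escapePlus(keyword) {
  return keyword.replace(/\+/g, "%2B");
}

// Hypothetical URL assembly, mirroring the request described above.
function buildAdvertUrl(keyword) {
  return "http://xxxxx/advert/ctp_advert.htm?num=4&keyword=" + escapePlus(keyword);
}

console.log(buildAdvertUrl("e+h transmitter"));
```

Because only + is rewritten, the rest of the keyword still goes through whatever charset handling the server already applies. A keyword that itself contains a literal % sequence remains ambiguous under this scheme.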

Summary

The + sign needs different handling in different business scenarios:
1. Non-Ajax requests
Submit an HTML form directly with GET or POST; the form's URL encoding converts + to %2B automatically.
2. Ajax requests
* GET request: there is little choice but to use Idea 3 and convert the + sign by hand.
* POST request (same application, not cross-domain): use encodeURIComponent plus _input_charset=utf-8 to specify the encoding.
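For the POST case, the request body could be assembled roughly as follows. This is a sketch under assumptions: the field names num and keyword are taken from the GET URL earlier in the article, the helper buildPostBody is hypothetical, and actually sending the body (for example with XMLHttpRequest) is left out.

```javascript
// Build an application/x-www-form-urlencoded body by hand.
// encodeURIComponent always emits UTF-8 percent-escapes, so the body
// declares that charset through the _input_charset parameter.
function buildPostBody(keyword) {
  const fields = { num: "4", keyword: keyword, _input_charset: "utf-8" };
  return Object.keys(fields)
    .map((name) => encodeURIComponent(name) + "=" + encodeURIComponent(fields[name]))
    .join("&");
}

console.log(buildPostBody("e+h transmitter"));
```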

PS: The scenarios above assume that a + sign in the data is a legitimate business case. It is also worth reviewing at the business level whether the + sign needs to be supported at all; if it can be kept out of the business data in the first place, that is the best solution.


Background knowledge:

URIEncoding and useBodyEncodingForURI

For data submitted in the URL and data submitted by a form using GET, setting request.setCharacterEncoding in the JSP that receives the data does not work. By default, Tomcat 5.0 uses ISO-8859-1 to re-encode (decode) such data and ignores that parameter.

To resolve this, set the useBodyEncodingForURI or URIEncoding attribute on the Connector element in the Tomcat configuration file. useBodyEncodingForURI indicates whether the request.setCharacterEncoding value is also used to re-encode (decode) URL data and form GET data; it defaults to false (in Tomcat 4.0 the default was true). URIEncoding specifies a single encoding used to re-encode (decode) all GET requests, including URL data and form GET data.

The difference between the two is that URIEncoding applies one encoding uniformly to the data of all GET requests, while useBodyEncodingForURI decodes the data according to the request.setCharacterEncoding value of the page that handles the request, so different pages can use different encodings.

So for URL data and form GET data, either set URIEncoding to the browser's encoding, or set useBodyEncodingForURI to true and set request.setCharacterEncoding to the browser's encoding in the JSP page that receives the data.
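To see why the decoding charset matters, the sketch below decodes the same percent-encoded bytes once as ISO-8859-1 (the default described above) and once as UTF-8. It only emulates the effect in JavaScript and is not Tomcat's code; all function names are hypothetical.

```javascript
// Turn a percent-encoded ASCII string such as "%E5%8F%98" into raw bytes.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i)); // assumes plain ASCII input
    }
  }
  return Uint8Array.from(bytes);
}

// ISO-8859-1 maps every byte to the code point with the same value.
function decodeIso88591(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

const rawBytes = percentDecodeToBytes("%E5%8F%98"); // UTF-8 bytes of 变
console.log(decodeUtf8(rawBytes));                // one Chinese character
console.log(decodeIso88591(rawBytes).length);     // three garbled characters
```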

Why URL encoding is required
1. Some characters, such as = and &, have a special meaning in a URL and would cause ambiguity if used literally.
2. URLs are encoded in ASCII, not Unicode, which means a URL cannot contain any non-ASCII characters, such as Chinese.


Which characters need to be encoded
RFC3986 stipulates that a URL may contain only English letters (a-z, A-Z), digits (0-9), the 4 special characters - _ . ~, and the reserved characters.
A URL can be divided into several components: protocol, host, path, and so on. The reserved characters in RFC3986 are: ! * ' ( ) ; : @ & = + $ , / ? # [ ]


How to encode an illegal character in a URL
URL encoding is also commonly called percent-encoding, because the scheme is very simple: a percent sign followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set used by URL encoding is US-ASCII. For example, the byte for a in US-ASCII is 0x61, so its URL-encoded form is %61. Entering http://g.cn/search?q=%61%62%63 in the address bar is therefore equivalent to searching Google for abc. Likewise, the byte for the @ symbol in ASCII is 0x40, so it becomes %40 after URL encoding.
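The rule is simple enough to write out by hand. The sketch below percent-encodes every byte of a string, including bytes that would normally be left alone. The helper name percentEncodeAll is hypothetical, and it uses UTF-8, which coincides with US-ASCII for the ASCII characters in these examples.

```javascript
// Percent-encode every byte of the UTF-8 form of a string:
// "%" followed by two uppercase hexadecimal digits per byte.
function percentEncodeAll(text) {
  return Array.from(new TextEncoder().encode(text))
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeAll("abc"));          // %61%62%63
console.log(percentEncodeAll("@"));            // %40
console.log(decodeURIComponent("%61%62%63"));  // abc
```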


The difference between escape, encodeURI and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for encoding a URL into a legal form: escape/unescape, encodeURI/decodeURI and encodeURIComponent/decodeURIComponent. Encoding and decoding are inverse operations.

Different compatibility
The escape function has existed since JavaScript 1.0, while the other two were introduced in JavaScript 1.5. Since JavaScript 1.5 is already widespread, using encodeURI and encodeURIComponent causes no compatibility problems in practice.

Unicode characters are encoded differently
All three functions encode ASCII characters the same way: a percent sign followed by two hexadecimal digits. For Unicode characters, however, escape produces %uxxxx, where xxxx is the 4-digit hexadecimal code of the character. This form has since been deprecated, although the syntax is still kept in the ECMA-262 standard. encodeURI and encodeURIComponent first encode non-ASCII characters as UTF-8 and then percent-encode the bytes, which is what the RFC recommends. It is therefore advisable to use these two functions instead of escape whenever possible.
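The difference is easy to see on a single non-ASCII character. This is a small illustration; the character 变 (U+53D8) is an arbitrary choice.

```javascript
// One Chinese character, code point U+53D8.
const ch = "变";

const encodedForms = {
  escape: escape(ch),                         // %uXXXX form, one code unit
  encodeURI: encodeURI(ch),                   // UTF-8 bytes, percent-encoded
  encodeURIComponent: encodeURIComponent(ch), // same as encodeURI here
};

console.log(encodedForms);
```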

Suited to different uses
encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI.
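Applied to the request from this article, the two functions behave as sketched below. The base URL is the article's placeholder, and the variable names are arbitrary.

```javascript
const base = "http://xxxxx/advert/ctp_advert.htm";
const searchTerm = "e+h transmitter";

// encodeURI keeps URL delimiters (: / ? & = +), so it suits a whole URL
// but leaves the "+" inside the keyword untouched.
const wholeUrl = encodeURI(base + "?num=4&keyword=" + searchTerm);

// encodeURIComponent escapes delimiters too, so it suits a single value.
const perComponent = base + "?num=4&keyword=" + encodeURIComponent(searchTerm);

console.log(wholeUrl);
console.log(perComponent);
```

Running encodeURIComponent over the whole URL instead would also escape the :, / and ? delimiters, which is why it is applied to the value only.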

Different safe characters
escape (69): * / @ + - . _ 0-9 a-z A-Z
encodeURI (82): ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ 0-9 a-z A-Z
encodeURIComponent (71): ! ' ( ) * - . _ ~ 0-9 a-z A-Z (note that the + sign is not among its safe characters)
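The counts above can be checked by running each encoder over the 128 ASCII characters and keeping those it returns unchanged. This is a verification sketch; the helper name safeAsciiChars is hypothetical.

```javascript
// Collect the ASCII characters (codes 0-127) that an encoder returns unchanged.
function safeAsciiChars(encoder) {
  let safe = "";
  for (let code = 0; code < 128; code++) {
    const c = String.fromCharCode(code);
    if (encoder(c) === c) safe += c;
  }
  return safe;
}

const escapeSafe = safeAsciiChars(escape);
const encodeURISafe = safeAsciiChars(encodeURI);
const componentSafe = safeAsciiChars(encodeURIComponent);

console.log(escapeSafe.length, encodeURISafe.length, componentSafe.length);
console.log(componentSafe.includes("+")); // false: "+" gets encoded
```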


Other issues related to URL encoding
Browsers differ in how they handle URLs that contain Chinese. In IE, for example, if the advanced setting "always send URL with UTF-8" is ticked, the Chinese in the path portion of the URL is sent to the server as UTF-8, while the Chinese in the query parameters is URL-encoded using the system default character set. To ensure maximum interoperability, it is recommended to URL-encode every component placed in a URL with an explicitly chosen character set rather than relying on the browser's default behaviour.

In addition, many HTTP monitoring tools, as well as the browser address bar, automatically decode the URL once (using UTF-8) when displaying it. This is why the URL shown in the address bar contains Chinese when you search Google for Chinese text in Firefox. The original URL actually sent to the server is still encoded; you can see it by reading location.href with JavaScript from the address bar. Do not be fooled by these illusions when investigating URL encoding and decoding.
