Encoding and decoding in the HTTP protocol

Source: Internet
Author: User

HTTP://WWW.CSDN1 2 3.com/html/itweb/20130730/29422_29378_29408.htm

******************************

I. Introduction to character sets and text encoding

1. How the computer displays text

A computer stores and processes all data in binary form. Whether we type on the keyboard or have the computer read a text file, what the computer actually receives is a sequence of binary values. To display that sequence, the computer applies a translation mechanism (an encoding) to work out which character each group of bits represents, then looks up the outline description of that character (a dot matrix or a vector shape). Once the outline is known, the computer can draw the character on the screen. The idea is the same as using a student number to refer to a student. (There is much more detail behind this; see the sections on display principles in a computer graphics text, or other introductory books on how computers display output.)

2. Character Set

A character set is a collection of the characters used in a natural language, together with a standardized code assigned to each character in the collection. For example, the ASCII character set contains all English letters and specifies how each of them is encoded. The GB2312 character set contains the commonly used simplified Chinese characters and specifies how each of them is encoded.

3. Character encoding

Character encoding establishes a mapping between the characters of a natural language and binary numbers that a computer can store and process. Within a given character set, a specific binary number stands for exactly one character, much like the mapping between student numbers and students. To ensure uniformity and compatibility, the mapping between characters and their internal codes is fixed by standards, which is how standard encodings such as ASCII and Unicode came about. For the detailed encoding schemes, consult references on character encoding.
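As a small illustration of the character-to-number mapping, the Python sketch below reports the Unicode code point of a character and the bytes that three encodings assign to it. The helper name describe and the sample characters are assumptions chosen only for this example.

```python
# Show how one character maps to different byte sequences under different encodings.

def describe(char: str, encodings=("ascii", "gbk", "utf-8")) -> dict:
    """Return the Unicode code point and the encoded bytes (as hex) per encoding."""
    result = {"code_point": f"U+{ord(char):04X}"}
    for name in encodings:
        try:
            result[name] = char.encode(name).hex(" ").upper()
        except UnicodeEncodeError:
            # The character is not part of this character set.
            result[name] = None
    return result

print(describe("A"))   # an ASCII letter: the same single byte in all three encodings
print(describe("中"))  # a Chinese character: absent from ASCII, 2 bytes in GBK, 3 in UTF-8
```

The same character therefore has one identity (its code point) but several possible byte representations, which is why sender and receiver must agree on the encoding.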

II. Encoding and decoding in the general sense

1. Why encoding is needed

Data is encoded when its current form is inconvenient to process or store. Characters are encoded because natural-language characters cannot be processed or stored by a computer directly. Image, video, and audio data are compressed, optimized, and converted between formats in order to save network bandwidth and local storage space while preserving the quality of the media as far as possible. URLs are encoded to avoid ambiguity during parsing and to keep decoding simple. For example, a URL uses "&" to separate parameters. If a parameter name or value itself contains "&" and that "&" is not encoded, URL parsing becomes more complex and parsing errors become more likely.
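The ambiguity caused by an unencoded "&" can be reproduced with Python's standard urllib.parse module. This is only a sketch; the parameter value "Tom&Jerry" is a made-up example and does not come from the original article.

```python
from urllib.parse import parse_qs, urlencode

value = "Tom&Jerry"  # hypothetical parameter value that contains the delimiter

# Without encoding, the "&" inside the value is read as a parameter separator.
raw_query = "name=" + value
print(parse_qs(raw_query))   # the value is cut off at the "&"

# With percent-encoding, "&" becomes "%26" and the value survives the round trip.
safe_query = urlencode({"name": value})
print(safe_query)
print(parse_qs(safe_query))
```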

2. How to encode and decode

Depending on the requirements, encoding and decoding algorithms can be very complex or very simple. Fundamentally, though, encoding and decoding are just translation: data in one form is translated into data in another form. The simplest encoding and decoding amounts to looking up a value in a map by its key and substituting the value for the key in the actual data. More complex examples are encodeURIComponent and decodeURIComponent in JavaScript. encodeURIComponent converts the characters in a string that are not allowed by the URL encoding rules into sequences of "%" followed by hexadecimal byte values (the UTF-8 bytes of each character), and decodeURIComponent converts such "%" sequences back into the actual characters.
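Both ideas can be sketched in Python: first a plain lookup-table codec, then a rough analogue of encodeURIComponent and decodeURIComponent built on urllib.parse. The table contents and all function names here are assumptions made for illustration; the second pair only approximates the JavaScript functions.

```python
import re
from urllib.parse import quote, unquote

# 1) The simplest kind of encoding: a lookup table applied in both directions.
ENCODE_TABLE = {"%": "%25", "&": "%26", "=": "%3D", " ": "%20"}  # illustrative subset
DECODE_TABLE = {code: ch for ch, code in ENCODE_TABLE.items()}
_CODES = re.compile("|".join(re.escape(code) for code in DECODE_TABLE))

def table_encode(text: str) -> str:
    """Replace every character found in the table by its code."""
    return "".join(ENCODE_TABLE.get(ch, ch) for ch in text)

def table_decode(text: str) -> str:
    """Replace every code by its character in a single left-to-right pass."""
    return _CODES.sub(lambda m: DECODE_TABLE[m.group(0)], text)

# 2) A rough analogue of encodeURIComponent / decodeURIComponent.
def encode_uri_component(text: str) -> str:
    # quote() always leaves letters, digits and "-_.~" unescaped; the extra
    # safe characters are the ones encodeURIComponent also leaves alone.
    return quote(text, safe="!*'()", encoding="utf-8")

def decode_uri_component(text: str) -> str:
    return unquote(text, encoding="utf-8")

print(table_encode("a=1&b=50% off"))
print(encode_uri_component("a&b 中"))
```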

III. Encoding and decoding in the HTTP protocol

1. Encoding and decoding of URLs

First, a URL is written in the ASCII character set, so any character outside ASCII must be encoded. Second, many characters in a URL are reserved characters that have a special meaning within the URL; "&", for example, is the parameter delimiter. To use a reserved character as ordinary data in a URL, it must be encoded.

According to the percent-encoding rules of RFC 3986, published in 2005: unreserved ASCII characters in a URL are not encoded; a reserved character that is used as data is replaced by "%" followed by the hexadecimal value of its ASCII code; a non-ASCII character is first converted to bytes (UTF-8 is the encoding the RFC recommends), and each byte is replaced by "%" followed by its hexadecimal value. Because the result has the form "%" plus a character code, the scheme is called percent-encoding.
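The rules can be written out by hand in a few lines of Python. This is a minimal sketch, not a full RFC 3986 implementation: the function name percent_encode is an assumption, and it escapes every character outside the unreserved set, which is the right behavior for a single data component but not for a whole URL.

```python
import string
from urllib.parse import quote

# Unreserved characters per RFC 3986 section 2.3: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Percent-encode one URL component, byte by byte."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Convert the character to bytes, then escape every byte as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode(encoding))
    return "".join(out)

sample = "a&b/中 ~"
print(percent_encode(sample))
# The standard library gives the same result when no extra character is marked safe.
print(percent_encode(sample) == quote(sample, safe=""))
```

Passing a different encoding argument (for example "gbk") shows how the same text yields a different percent-encoded string when the byte encoding changes.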

Although percent-encoding is specified in detail, in practice browsers differ in how they encode URLs, mainly in how they encode non-ASCII characters. We first show how different browsers (Chrome and IE) encode URLs, and then summarize and analyze the differences.

1) Encoding of non-ASCII characters in the URL path. Original URL: http://test/wangfengpaopao/Feng Wang. The request is made by typing the address directly into the browser address bar.

a) Chrome

b) IE

The captured requests show that both Chrome and IE percent-encode non-ASCII characters in the path as RFC 3986 describes, using the UTF-8 bytes of each character.

2) Encoding of non-ASCII characters in URL query parameters. Original URL: http://test/wangfengpaopao/Feng Wang?Name=Feng Wang. The request is made by typing the address directly into the browser address bar.

a) Chrome and IE11

b) IE versions earlier than IE11 (decoded using GBK)

The captured requests show that, for non-ASCII characters in the query parameters, Chrome and IE11 percent-encode the UTF-8 bytes of each character as RFC 3986 describes. IE versions earlier than IE11 send the non-ASCII characters without percent-encoding, as raw bytes in the system's default encoding.
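To make the two query-parameter behaviors concrete, the sketch below builds the bytes each one would put on the wire. The function names and the sample value "中文" are assumptions for illustration (the value is not the one used in the original test), and GBK stands in for the system default encoding.

```python
from urllib.parse import quote

def query_percent_encoded_utf8(value: str) -> bytes:
    """Behavior described for Chrome and IE11: percent-encoded UTF-8 bytes."""
    return quote(value, safe="", encoding="utf-8").encode("ascii")

def query_raw_system_encoding(value: str, system_encoding: str = "gbk") -> bytes:
    """Behavior described for older IE: raw bytes in the system default encoding."""
    return value.encode(system_encoding)

value = "中文"  # placeholder text
print(query_percent_encoded_utf8(value))  # b'%E4%B8%AD%E6%96%87'
print(query_raw_system_encoding(value))   # b'\xd6\xd0\xce\xc4'

# A server that assumes UTF-8 cannot decode the raw GBK bytes.
try:
    query_raw_system_encoding(value).decode("utf-8")
    decodable_as_utf8 = True
except UnicodeDecodeError:
    decodable_as_utf8 = False
print(decodable_as_utf8)
```

The second form carries no hint about its encoding, so the server has to know (or guess) that the client used GBK.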

3) Encoding of non-ASCII characters in the value of the form field "name". Request address: http://test/wangfengpaopao/Feng Wang, request method: GET.

a) Chrome

i. Page encoded as GBK

ii. Page encoded as UTF-8

b) IE

i. Page encoded as GBK

ii. Page encoded as UTF-8

The captured requests show that when a GET request is sent through a form, both Chrome and IE percent-encode the non-ASCII characters in the form field content using the encoding of the current page.

4) Encoding of non-ASCII characters in the value of the form field "name". Request address: http://test/wangfengpaopao/Feng Wang, request method: POST, enctype: application/x-www-form-urlencoded.

a) Chrome

i. Page encoded as UTF-8

ii. Page encoded as GBK

b) IE

i. Page encoded as UTF-8

ii. Page encoded as GBK

The captured requests show that when a POST request is sent through a form, both Chrome and IE percent-encode the non-ASCII characters in the form field content using the encoding of the current page.
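The form behavior (GET and POST alike) can be imitated with urllib.parse.urlencode, which accepts the byte encoding as a parameter. The helper serialize_form and the field value are assumptions for illustration; the resulting string is what would appear after the "?" for GET or in the request body for POST.

```python
from urllib.parse import parse_qs, urlencode

def serialize_form(fields: dict, page_encoding: str) -> str:
    """Serialize form fields as application/x-www-form-urlencoded text."""
    return urlencode(fields, encoding=page_encoding)

fields = {"name": "中文"}  # placeholder field value
gbk_form = serialize_form(fields, "gbk")
utf8_form = serialize_form(fields, "utf-8")
print(gbk_form)   # name=%D6%D0%CE%C4
print(utf8_form)  # name=%E4%B8%AD%E6%96%87

# The server must decode with the encoding the page used.
print(parse_qs(gbk_form, encoding="gbk"))
```

Decoding gbk_form as UTF-8 does not recover the original text, which is why the server needs to know the page encoding.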

5) Encoding of non-ASCII characters in the URL. Original URL: http://test/wangfengpaopao/Feng Wang?Name=Feng Wang. The request is an Ajax request, method: GET.

a) Chrome

i. Page encoded as UTF-8

ii. Page encoded as GBK

b) IE

i. Page encoded as UTF-8

1. IE6 (decoded using GBK)

2. IE11

ii. Page encoded as GBK

1. IE6 (decoded using GBK)

2. IE11

The captured requests show that IE6 does not encode non-ASCII characters in the URL at all, while Chrome and IE11 encode the URL in the same way as for a form GET request.

6) Encoding of non-ASCII characters in the request data. Original URL: http://test/wangfengpaopao/Feng Wang. The request is an Ajax request, method: POST, data: Name=Feng Wang, Content-Type: application/x-www-form-urlencoded.

a) Chrome

i. Page encoded as UTF-8

ii. Page encoded as GBK

b) IE

i. Page encoded as UTF-8

1. IE6

2. IE11

ii. Page encoded as GBK

1. IE6

2. IE11

The captured requests show that no browser percent-encodes the data of an Ajax request sent with POST and a Content-Type of application/x-www-form-urlencoded.

From the experimental results above we can see:

① For the path part of the URL, IE and Chrome both percent-encode non-ASCII characters using UTF-8.

② For a page opened from the address bar, IE versions earlier than IE11 do not percent-encode the parameter part of the URL, while Chrome and IE11 percent-encode its non-ASCII characters using UTF-8.

③ For requests made through forms (POST or GET), IE and Chrome percent-encode the non-ASCII characters in the parameters using the default encoding of the current page.

④ For Ajax requests made with GET, IE11 and Chrome percent-encode the non-ASCII characters in the URL parameters using the default encoding of the current page. IE6 does not encode non-ASCII characters in either the path or the parameters.

⑤ For Ajax requests made with POST, the browser does not encode the data at all, even if the application/x-www-form-urlencoded header is set (the browser does not treat the data being sent as part of a URL).

Browsers differ in how they encode non-ASCII characters in different situations, but they are consistent in two respects: non-ASCII characters in the URL path are percent-encoded using UTF-8, and form data (including POST with an enctype of application/x-www-form-urlencoded) is percent-encoded using the page's default character set.

To deal with the differences in URL encoding for Ajax requests, we can encode the non-ASCII characters in the URL or in the data ourselves with JavaScript's encodeURIComponent. This unifies the encoding and simplifies decoding on the server.
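The same "always encode explicitly as UTF-8" convention can be sketched in Python, with one function standing in for the client and one for the server. The names build_url and parse_url, and the sample path and parameter values, are assumptions for illustration only.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

def build_url(base: str, path_segments: list, params: dict) -> str:
    """Client side: percent-encode every component as UTF-8, independent of page encoding."""
    path = "/".join(quote(seg, safe="", encoding="utf-8") for seg in path_segments)
    query = urlencode(params, quote_via=quote, encoding="utf-8")
    return f"{base}/{path}?{query}"

def parse_url(url: str):
    """Server side: decode with the one agreed encoding, no guessing needed."""
    parts = urlsplit(url)
    segments = [unquote(seg, encoding="utf-8") for seg in parts.path.split("/") if seg]
    params = parse_qs(parts.query, encoding="utf-8")
    return segments, params

url = build_url("http://test", ["wangfengpaopao", "中文"], {"name": "中 文"})
print(url)
print(parse_url(url))
```

Because both sides agree on UTF-8, the server no longer has to distinguish browsers or page encodings.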

2. Browser decoding of resources by media type (MIME type)

1) HTTP headers related to the resource type and encoding

a) Headers sent with the browser's request

b) Headers sent with the server's response

When a browser requests a resource, it sends an Accept header that lists the MIME types it would like to receive, with a preference weight q for each type. It also sends an Accept-Charset header that lists the character sets it is able to decode.

When the server returns a resource that satisfies the request, it sends a Content-Type header that identifies the media type and the character encoding of the returned resource.
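A minimal sketch of how these headers can be read, using only the Python standard library. The function names and the sample header values are assumptions; the Accept parser is deliberately simplified and ignores parameters other than q.

```python
from email.message import Message

def parse_accept(header: str) -> list:
    """Return (media_type, q) pairs sorted by descending preference."""
    items = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        q = 1.0  # q defaults to 1 when it is omitted
        for param in fields[1:]:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                q = float(val)
        items.append((fields[0], q))
    # sorted() is stable, so types with equal q keep their original order.
    return sorted(items, key=lambda item: item[1], reverse=True)

def parse_content_type(header: str):
    """Return (media_type, charset); charset is None when the header omits it."""
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_type(), msg.get_content_charset()

print(parse_accept("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(parse_content_type("text/html; charset=GBK"))
print(parse_content_type("text/css"))
```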

2) The browser's parsing process for resources of different media resource types

In any case, the MIME type information that the browser sends with a request is only an expectation about the resource the server will return. The actual MIME type of the resource is the one stated in the server's response.

In general, for text resources (HTML, CSS, JS, XML, and so on), the browser first uses the charset of the resource to convert the byte stream into the encoding that the processor for that MIME type works with. Taking a JavaScript file as an example: if the text stream is GBK encoded, it must first be converted to Unicode before it is handed to the JavaScript engine to be parsed and executed. The decoding process differs by resource type; the general decoding steps the browser follows for HTML, CSS, and JavaScript are outlined below.
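The "convert to Unicode first" step can be sketched as follows. This is a simplification under stated assumptions: the function name and the UTF-8 fallback are hypothetical, and a real browser considers more signals (such as a byte order mark or declarations inside the document) than the Content-Type header alone.

```python
from email.message import Message

def decode_text_resource(body: bytes, content_type: str, fallback: str = "utf-8") -> str:
    """Decode a text response body to a Unicode string using the declared charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback  # fallback is an assumption
    return body.decode(charset)

# A GBK-encoded script as it might arrive over the network (sample content).
js_bytes = 'var greeting = "中文";'.encode("gbk")
source = decode_text_resource(js_bytes, "application/javascript; charset=GBK")
print(source)  # now ordinary Unicode text, ready to be handed to a parser
```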

① Browser decoding process for HTML documents

② Browser decoding process for CSS/JS documents

3) Browser decoding of data in Ajax requests

An Ajax request differs from a request for a resource referenced from the page (for example through a src attribute). In essence, the browser treats the data returned by an Ajax request as an ordinary text stream, regardless of its MIME type. Even responseXML is simply the result of the browser parsing responseText as XML, which is the same as parsing responseText as XML yourself in JavaScript.
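As a loose analogy in Python (not browser code), the XML view of a response is just the text view run through an XML parser. The variable names mirror responseText and responseXML for readability, and the sample payload is made up.

```python
import xml.etree.ElementTree as ET

raw = "<user><name>中文</name></user>".encode("utf-8")  # bytes as received
response_text = raw.decode("utf-8")          # the text view of the data
response_xml = ET.fromstring(response_text)  # the XML view, parsed from the text

print(response_xml.find("name").text)
```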

The process of Ajax decoding

  

3. Encoding and decoding of resources on the server

Once the principles of encoding and decoding and the browser's encoding and decoding behavior are understood, there is little more to say about the server side: each case can be handled as it arises. To avoid the various compatibility problems in URL decoding, it is best to agree on one convention, for example that all non-ASCII characters in data submitted through an interface are URL-encoded with encodeURIComponent. To avoid browser compatibility problems when resources are decoded, it is equally important to specify the charset explicitly and correctly when the server returns a resource, through a call such as Response.setContentType("", charset).

Besides becoming fluent with the encoding and decoding methods provided by your programming language and its frameworks, you can study iconv to understand how conversion between encodings works, and UCharsetDetector to understand how character set detection works.
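The two ideas can be sketched in a few lines of Python. This is not how iconv or UCharsetDetector are implemented: the conversion simply goes through Unicode text, and the detection is a naive trial-decoding heuristic. Function names and the candidate list are assumptions.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by going through Unicode text."""
    return data.decode(source).encode(target)

def guess_encoding(data: bytes, candidates=("ascii", "utf-8", "gbk")):
    """Return the first candidate that decodes without error, else None.

    Naive heuristic: real detectors also use statistics about byte patterns.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

gbk_bytes = "中文".encode("gbk")
print(transcode(gbk_bytes, "gbk", "utf-8"))
print(guess_encoding(gbk_bytes))
print(guess_encoding("中文".encode("utf-8")))
```

The order of candidates matters: many byte sequences are valid in more than one encoding, so the strictest encodings are tried first.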
