URL Encoding Principles for Crawlers, Explained

Source: Internet
Author: User
Tags: urlencode

Anyone who writes crawlers regularly will inevitably have to deal with URLs that contain Chinese characters. Most people know about url_encode, and every language has support for it. Today I will briefly go over the principle behind it as a quick primer for everyone.

1. Characteristics:

If a URL contains non-ASCII characters, the browser url_encodes the URL before sending it to the server. The url_encode process first converts the characters of the URL into bytes using some character encoding (GBK, UTF-8, etc.), and then represents each byte as a 3-character string "%XY", where XY is the two-digit hexadecimal representation of that byte.
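The byte-by-byte process described above can be sketched in Python. This is a minimal illustration, not a full encoder: the helper name `percent_encode` is made up for this example, and it escapes every byte, whereas a real url_encode leaves unreserved ASCII characters as they are. The standard library's `urllib.parse.quote` is used only as a cross-check.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str) -> str:
    """Hypothetical helper: turn each byte of `text` (in `charset`) into %XY."""
    raw = text.encode(charset)                        # characters -> bytes
    return "".join(f"%{byte:02X}" for byte in raw)    # each byte -> "%XY"


word = "中文"
utf8_form = percent_encode(word, "utf-8")
gbk_form = percent_encode(word, "gbk")

print(utf8_form)  # %E4%B8%AD%E6%96%87  (3 bytes per character in UTF-8)
print(gbk_form)   # %D6%D0%CE%C4        (2 bytes per character in GBK)

# For purely non-ASCII input, the standard library gives the same result.
assert quote(word, encoding="utf-8") == utf8_form
assert quote(word, encoding="gbk") == gbk_form
```

The same two characters produce two different strings, which is why a crawler needs to know which character encoding a site expects.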

Which character encoding does UrlEncode actually use? That is up to the browser, and different browsers behave differently. Chinese-language versions of browsers generally default to GBK and can be configured to use UTF-8 instead; since different users have different browser settings, the resulting encodings differ as well. For this reason, many sites first url_encode the Chinese or special characters in the URL with JavaScript and then splice the result into the URL used to submit the data. In other words, the site does the UrlEncode on the browser's behalf, which has the advantage that the site gets a single, uniform encoding for data submitted via GET.
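Because the encoding behind a percent-encoded URL is not stated in the URL itself, a crawler often has to guess it. The sketch below is one possible heuristic, not a standard algorithm: the function name `decode_component` and the candidate order (UTF-8 first, then GBK) are assumptions made for illustration. Trying UTF-8 first works reasonably well because most GBK byte sequences are not valid UTF-8, but the guess can still be wrong for some inputs.

```python
from urllib.parse import unquote_to_bytes


def decode_component(encoded, candidates=("utf-8", "gbk")):
    """Hypothetical helper: return (text, charset) for a percent-encoded string.

    Tries each candidate charset in order and returns the first that decodes
    without error. Falls back to ISO-8859-1, which never fails.
    """
    raw = unquote_to_bytes(encoded)          # "%D6%D0" -> b"\xd6\xd0"
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    return raw.decode("iso-8859-1"), "iso-8859-1"


from_utf8 = decode_component("%E4%B8%AD%E6%96%87")
from_gbk = decode_component("%D6%D0%CE%C4")
print(from_utf8)  # ('中文', 'utf-8')
print(from_gbk)   # ('中文', 'gbk')
```

When the target site is known, hard-coding its encoding is more reliable than guessing.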

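The charset of a web page tells the browser which encoding to use to interpret the page, and, as the flow in the next section shows, it also determines the encoding the browser uses when it encodes submitted data.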

2. Flowchart:

The original URL
----> the browser encodes the URL with GBK or UTF-8: for GET, based on the charset in the Content-Type HTTP header; for POST, based on the page's meta tag (<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />); alternatively, the page encodes it with JavaScript (if JavaScript has already encoded it, the result contains only ASCII characters and the browser does not encode it again)
----> the URL now consists entirely of ASCII characters
----> it is converted to binary using ISO-8859-1 encoding
----> it is sent with the request header (GET has no request entity, POST does)
----> the server receives the ISO-8859-1 encoded URL
----> the server decodes it with ISO-8859-1
----> the server then decodes it using the charset that web pages generally declare in their meta header (this is the encoding a POST form submission was encoded with)
----> the correct value is obtained
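The flow above can be simulated end to end in a few lines of Python. This is only a sketch of the idea under simplifying assumptions: the function names `browser_side` and `server_side` are invented for this example, no real HTTP request is made, and the "page charset" is simply passed in as a parameter.

```python
from urllib.parse import quote, unquote


def browser_side(value: str, page_charset: str) -> bytes:
    """Hypothetical browser step: url_encode, then put ASCII on the wire."""
    encoded = quote(value, encoding=page_charset)   # now all ASCII characters
    return encoded.encode("iso-8859-1")             # bytes sent with the request


def server_side(wire: bytes, page_charset: str) -> str:
    """Hypothetical server step: ISO-8859-1 decode, then undo the url_encode."""
    ascii_text = wire.decode("iso-8859-1")
    return unquote(ascii_text, encoding=page_charset)


wire = browser_side("中文", "gbk")
print(wire)                          # b'%D6%D0%CE%C4'

recovered = server_side(wire, "gbk")
print(recovered)                     # 中文

# Decoding with a different charset than the one used for encoding
# does not raise by default, but it does not give back the original text.
mismatched = server_side(wire, "utf-8")
print(mismatched == "中文")           # False
```

The last step shows why both sides have to agree on the charset: the ISO-8859-1 stage is lossless for the ASCII string, so any garbling comes from mismatched url_encode and decode charsets.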
