"Web crawler Primer 02" HTTP Client library requests fundamentals and basic applications

Aohaoyuan, Guangdong Vocational and Technical College

1. Introduction

The first step in implementing a web crawler is to establish a network connection and send requests to network resources such as servers or web pages. urllib is the traditional way to do this, but requests is more convenient than urllib and lets you access network resources in a much simpler way.

2. What is requests?

requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 License. It is more convenient than urllib, saves us a great deal of work, and fully meets the needs of everyday HTTP work.
requests supports GET, POST, PUT, DELETE, HEAD, OPTIONS and the other HTTP methods, and is very simple to use. Web systems generally support only the GET and POST methods, and GET is the one used most often in web crawlers, so this article focuses on it; a short sketch of the method calls follows below. For detailed usage of the other methods, see the user manual:
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
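As a quick illustration of how those methods map onto the library, here is a minimal sketch (httpbin.org is used only as a convenient test service and is not part of the original article):

import requests

# Each HTTP method has a function of the same name in requests.
r = requests.get('http://httpbin.org/get')
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')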

3. Initiating a Network Request

Sending a network request with the requests get method is straightforward.
First, import the requests module:

import requests

Then make a request to a web page via its URL:

res = requests.get('http://www.gdptc.cn/')

At this point we have a Response object called res, from which we can get all the information we want, such as printing the URL of the web page:

print(res.url)
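The same object exposes other basic information about the response. A minimal sketch (the attributes shown are standard attributes of a requests Response):

import requests

res = requests.get('http://www.gdptc.cn/')
print(res.url)           # the requested URL
print(res.status_code)   # HTTP status code, e.g. 200 on success
print(res.text[:200])    # first 200 characters of the decoded page body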

4. Get Response Content

We can read the content of the server's response. requests automatically decodes content coming from the server, and most Unicode character sets can be decoded seamlessly.

After a request is issued, requests makes an educated guess about the response's encoding based on the HTTP headers. When you access res.text, requests uses that inferred text encoding. You can find out which encoding requests is using, and change it, through the res.encoding property.

If you change the encoding, requests will use the new value of res.encoding whenever you access res.text.
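For example, a minimal sketch (continuing with the site used above; the utf-8 override is only an illustration):

import requests

res = requests.get('http://www.gdptc.cn/')
print(res.encoding)     # the encoding requests guessed from the response headers
html = res.text         # body decoded with that guessed encoding

res.encoding = 'utf-8'  # override the guess if it is wrong
html = res.text         # now decoded with the new encoding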

5. The Content of the Response Header

The server's response headers are presented as a Python dictionary. This dictionary is special, though: it was created specifically for HTTP headers.
The result of res.headers is:
{
    'content-length': '39037',
    'x-powered-by': 'ASP.NET',
    'Date': 'Sat, Oct 13:58:41 GMT',
    'x-aspnet-version': '2.0.50727',
    'cache-control': 'private',
    'content-type': 'text/html; charset=utf-8',
    'Server': 'Microsoft-IIS/7.5'
}
Through this server's response headers we can learn some basic information about the server. According to RFC 2616, HTTP headers are case-insensitive, so we can access these response header fields with any capitalization. For example, suppose we want to look up the server's content encoding and the server software, as in the sketch below:
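A minimal sketch of that lookup (continuing with the example request above; any capitalization of the header names works):

import requests

res = requests.get('http://www.gdptc.cn/')
# res.headers is a case-insensitive dictionary.
print(res.headers['Content-Type'])   # e.g. 'text/html; charset=utf-8'
print(res.headers['content-type'])   # same value, different capitalization
print(res.headers.get('server'))     # e.g. 'Microsoft-IIS/7.5'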

6. Custom Request Header

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter. Servers often reject unusual-looking requests, so a network request needs a legitimate disguise, and disguising the request headers is the most commonly used method.
The user agent is part of the HTTP protocol and belongs to the request headers. It is a special string that tells the web site information such as the type and version of the browser you are using, the operating system and its version, the browser engine, and so on. By adding the UA information of a legitimate browser, a crawler's request can be disguised as a browser request.
For example, the user agent of the IE9 browser is: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
The UA strings of the common browsers can all be found on the Internet.
If you want to simulate the IE9 browser visiting the Baidu web site, you can do it as in the sketch below:
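A minimal sketch of that request (the headers dict below simply carries the IE9 UA string quoted above):

import requests

# Disguise the crawler as IE9 by supplying a browser User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'}
res = requests.get('http://www.baidu.com', headers=headers)
print(res.status_code)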

Note that requests does not change its behavior based on the specifics of the custom headers; all header information is simply passed along in the final request.

7. Summary

There is far more to requests than this, but as a starting point for web crawlers, the knowledge above is largely sufficient. There is also more than one way to send requests to a server; use whichever approach you are familiar with and find convenient. For a primer, the right path is to understand more, learn more, practice more, and apply more.

"Web crawler Primer 02" HTTP Client library requests fundamentals and basic applications

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.