"Web crawler Primer 02" HTTP Client library requests fundamentals and basic applications

Aohaoyuan, Guangdong Vocational and Technical College

1. Introduction

The first step in implementing a web crawler is to establish a network connection and send requests to network resources such as servers or web pages. urllib is the traditional way to do this, but requests is more convenient than urllib and lets you access network resources in a much simpler way.

2. What is requests?

requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 License. It is more convenient than urllib, saves us a great deal of work, and fully meets the needs of everyday HTTP work.
requests supports GET, POST, PUT, DELETE, HEAD, OPTIONS and the other HTTP methods, and is very simple to use. Web systems generally support only the GET and POST methods, and GET is the one used most often in web crawlers, so this article focuses on it; a short sketch of the method calls follows below. For detailed usage of the other methods, see the user manual:
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
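As a quick illustration of how those methods map onto the library, here is a minimal sketch (httpbin.org is used only as a convenient test service and is not part of the original article):

import requests

# Each HTTP method has a function of the same name in requests.
r = requests.get('http://httpbin.org/get')
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')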

3. Initiating a Network Request

Sending a network request with the requests get method is straightforward.
First, import the requests module:

import requests

Then make a request to a web page via its URL:

res = requests.get('http://www.gdptc.cn/')

At this point we have a Response object called res, from which we can get all the information we want, such as printing the URL of the web page:

print(res.url)
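The same object exposes other basic information about the response. A minimal sketch (the attributes shown are standard attributes of a requests Response):

import requests

res = requests.get('http://www.gdptc.cn/')
print(res.url)           # the requested URL
print(res.status_code)   # HTTP status code, e.g. 200 on success
print(res.text[:200])    # first 200 characters of the decoded page body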

4. Get Response Content

We can read the content of the server's response. requests automatically decodes content coming from the server, and most Unicode character sets can be decoded seamlessly.

After a request is issued, requests makes an educated guess about the response's encoding based on the HTTP headers. When you access res.text, requests uses that inferred text encoding. You can find out which encoding requests is using, and change it, through the res.encoding property.

If you change the encoding, requests will use the new value of res.encoding whenever you access res.text.
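For example, a minimal sketch (continuing with the site used above; the utf-8 override is only an illustration):

import requests

res = requests.get('http://www.gdptc.cn/')
print(res.encoding)     # the encoding requests guessed from the response headers
html = res.text         # body decoded with that guessed encoding

res.encoding = 'utf-8'  # override the guess if it is wrong
html = res.text         # now decoded with the new encoding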

5. The Content of the Response Header

The server's response headers are presented as a Python dictionary. This dictionary is special, though: it was created specifically for HTTP headers.
The result of res.headers is:
{
    'content-length': '39037',
    'x-powered-by': 'ASP.NET',
    'Date': 'Sat, Oct 13:58:41 GMT',
    'x-aspnet-version': '2.0.50727',
    'cache-control': 'private',
    'content-type': 'text/html; charset=utf-8',
    'Server': 'Microsoft-IIS/7.5'
}
Through this server's response headers we can learn some basic information about the server. According to RFC 2616, HTTP headers are case-insensitive, so we can access these response header fields with any capitalization. For example, suppose we want to look up the server's content encoding and the server software, as in the sketch below:
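A minimal sketch of that lookup (continuing with the example request above; any capitalization of the header names works):

import requests

res = requests.get('http://www.gdptc.cn/')
# res.headers is a case-insensitive dictionary.
print(res.headers['Content-Type'])   # e.g. 'text/html; charset=utf-8'
print(res.headers['content-type'])   # same value, different capitalization
print(res.headers.get('server'))     # e.g. 'Microsoft-IIS/7.5'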

6. Custom Request Header

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter. Servers often reject unusual-looking requests, so a network request needs a legitimate disguise, and disguising the request headers is the most commonly used method.
The user agent is part of the HTTP protocol and belongs to the request headers. It is a special string that tells the web site information such as the type and version of the browser you are using, the operating system and its version, the browser engine, and so on. By adding the UA information of a legitimate browser, a crawler's request can be disguised as a browser request.
For example, the user agent of the IE9 browser is: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
The UA strings of the common browsers can all be found on the Internet.
If you want to simulate the IE9 browser visiting the Baidu web site, you can do it as in the sketch below:
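A minimal sketch of that request (the headers dict below simply carries the IE9 UA string quoted above):

import requests

# Disguise the crawler as IE9 by supplying a browser User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'}
res = requests.get('http://www.baidu.com', headers=headers)
print(res.status_code)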

Note that requests does not change its behavior based on the specifics of the custom headers; all header information is simply passed along in the final request.

7. Summary

There is far more to requests than this, but as a starting point for web crawlers, the knowledge above is largely sufficient. There is also more than one way to send requests to a server; use whichever approach you are familiar with and find convenient. For a primer, the right path is to understand more, learn more, practice more, and apply more.

"Web crawler Primer 02" HTTP Client library requests fundamentals and basic applications

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.