Python 3: a detailed example of crawling page content with the requests module

Last Update:2017-09-25 Source: Internet

Author: User

This article walks through a hands-on example of using the requests module in Python 3 to crawl page content. It should be a useful reference for anyone interested in the topic.

1. Install pip

My desktop runs LinuxMint, which does not ship with pip by default. Since I will install the requests module with pip later, the first step is to install pip itself. The rest of this article uses Python 3, so I install the Python 3 package:

$ sudo apt install python3-pip

Once the installation succeeds, check the pip version:

$ pip3 -V

2. Installing the requests module

Here I install it with pip:

$ pip3 install requests

To verify the installation, run import requests in the Python interpreter. If no error is raised, the installation succeeded.
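The same check can be scripted instead of typed into the interpreter. The sketch below is only an illustration: the helper name is_installed is hypothetical, and it uses the standard library's importlib to look for a module without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


# Report whether requests is available to this interpreter.
status = "installed" if is_installed("requests") else "missing"
print("requests:", status)
```

Run it with the same interpreter you plan to use for crawling, since each Python installation has its own set of packages.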

3. Installing beautifulsoup4

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document tree, and it can save hours or even days of work.

$ sudo apt-get install python3-bs4

Note: This is the installation method for Python 3. If you are using Python 2, you can install it with the following command instead.

$ sudo pip install beautifulsoup4

4. requests module basics

1) Sending a request

First, import the requests module:

>>> import requests

Then fetch the target web page. For example:

>>> r = requests.get('http://www.jb51.net/article/124421.htm')

This returns a Response object named r, from which we can get all the information we want. Here get is the HTTP request method; by the same pattern you can replace it with put, delete, post, or head.
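To see how the different HTTP methods map onto the same call pattern without sending any traffic, one option is to build requests and prepare them without sending. This is a sketch: the variable names are illustrative, and nothing here contacts the server.

```python
import requests

url = "http://www.jb51.net/article/124421.htm"

# Build (but do not send) one request per HTTP method.
prepared = [
    requests.Request(method, url).prepare()
    for method in ("GET", "POST", "PUT", "DELETE", "HEAD")
]

for p in prepared:
    print(p.method, p.url)
```

Calling requests.post(url), requests.put(url), and so on would send the corresponding request for real.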

2) Passing URL parameters

Sometimes we want to pass data in the URL's query string. If you build the URL by hand, the data is appended as key/value pairs after a question mark, for example cnblogs.com/get?key=val. requests lets you supply these parameters as a dictionary of strings through the params keyword argument.

For example, when searching Google for the keywords "python crawler", parameters such as newwindow (open in a new window) and q and oq (the search keywords) could be assembled into the URL by hand, but you can use the following code instead:

>>> payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}
>>> r = requests.get("https://www.google.com/search", params=payload)
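To inspect the URL that requests builds from such a dictionary, a request can be prepared without being sent. The sketch below assumes the same payload as above and parses the resulting query string back with the standard library to confirm that the values round-trip.

```python
from urllib.parse import parse_qs, urlsplit

import requests

payload = {"newwindow": "1", "q": "python crawler", "oq": "python crawler"}

# Prepare the request locally; no network traffic is generated.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=payload
).prepare()
print(prepared.url)

# Decode the query string again to check what was encoded.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

After a real call, the same final URL is available as r.url on the Response object.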

3) Response content

Get the page response content with r.text or r.content.

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text

requests automatically decodes the content returned by the server, and most Unicode character sets are decoded seamlessly. A quick note on the difference between r.text and r.content:

r.text returns the decoded text as a Unicode string (str);

r.content returns the raw body as bytes, that is, binary data.

So use r.text when you want text, and r.content when you want an image or a file.
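The relationship between the two attributes mirrors the one between bytes and str in Python 3. The sketch below uses a made-up byte string as a stand-in for a response body, so it runs without any network access; raw_body and decoded are illustrative names, not attributes of requests.

```python
# Stand-in for r.content: the raw bytes a server might send.
raw_body = "Python 爬虫".encode("utf-8")

# Stand-in for r.text: the same bytes decoded with the page's encoding.
decoded = raw_body.decode("utf-8")

print(type(raw_body).__name__, len(raw_body))  # length counted in bytes
print(type(decoded).__name__, len(decoded))    # length counted in characters
```

The byte count and the character count differ because each Chinese character takes several bytes in UTF-8, which is why binary data such as images should be written to disk from the bytes form.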

4) Get page encoding

>>> r = requests.get('http://www.cnblogs.com/')
>>> r.encoding
'utf-8'
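The encoding matters because it decides how the raw bytes become text. One common cause of garbled output is decoding with the wrong encoding: requests falls back to ISO-8859-1 for text responses whose headers do not declare a charset. The sketch below reproduces the effect on a made-up byte string rather than a real response.

```python
# Made-up UTF-8 body, standing in for the bytes of a page.
raw = "抓取成功".encode("utf-8")

# Decoding with the wrong encoding yields mojibake, not an error.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding the page was written in recovers the text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

With a real response, assigning r.encoding = 'utf-8' before reading r.text changes the encoding that requests uses for decoding.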

5) Get the response status code

We can check the response status code:

>>> r = requests.get('http://www.cnblogs.com/')
>>> r.status_code
200
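Beyond reading the number, requests can turn error codes into exceptions through raise_for_status(). The sketch below builds empty Response objects by hand purely for illustration (a real crawler would get them from requests.get); describe is a hypothetical helper name.

```python
import requests


def describe(status_code):
    """Return 'ok' for success codes, or the error message for 4xx/5xx codes."""
    resp = requests.Response()       # hand-built response, illustration only
    resp.status_code = status_code
    try:
        resp.raise_for_status()      # raises HTTPError for 4xx and 5xx codes
    except requests.HTTPError as exc:
        return "error: " + str(exc)
    return "ok"


for code in (200, 404, 500):
    print(code, describe(code))
```

In a crawler, calling r.raise_for_status() right after the request stops the script early instead of parsing an error page.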

5. Worked example

My company recently introduced an OA system. As an example, I take a page from its official documentation and crawl only the useful information, such as the article title and body.

Demo environment

Operating system: LinuxMint

Python version: Python 3.5.2

Modules used: requests, beautifulsoup4

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
_author_ = 'gavinhsueh'

import requests
import bs4

# URL of the target page to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# Fetch the page content; returns a Response object
response = requests.get(url)

# Check the response status code
status_code = response.status_code

# Parse the page with BeautifulSoup and select the element with the target id
# (the "lxml" parser requires the lxml package to be installed)
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)

Running the program prints the response status code followed by the matched element, which shows that the crawl succeeded.
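To pull out just the title and body text rather than printing the whole element, the matched tag can be searched further. The sketch below runs on a small made-up HTML snippet, not the real documentation page, so the tag names (h1, p) are assumptions about the page layout; it uses the built-in html.parser so that lxml is not required.

```python
import bs4

# Made-up HTML standing in for the crawled page.
html = """
<html><body>
  <div id="nav">Home</div>
  <div id="book">
    <h1>About Ranzhi</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns a list of matches.
book = soup.find(id="book")
title = book.h1.get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in book.find_all("p")]

print(title)
print(paragraphs)
```

On the real page, inspect the HTML first and adjust the tag names to match its structure.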

Garbled text in the crawled results

At first I ran the script with the Python 2 that ships with the system, but I was stuck for a long time on the crawled content coming back garbled, and none of the solutions I found on Google worked. After Python 2 drove me up the wall, I gave in and switched to Python 3. If you have experience with garbled page content when crawling under Python 2, please share it so that newcomers like me can avoid some detours.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
