[Walkthrough] Using Python 3 and the Requests module to crawl page content

Source: Internet
Author: User


This article covers:

1. Installing pip
2. Installing the Requests module
3. Installing BeautifulSoup4
4. Analysis of the Requests module
   + Sending a request
   + Passing URL parameters
   + Response content
   + Getting the page encoding
   + Getting the response status code
5. Case presentation

Postscript

1. Installing pip

My desktop runs Linux Mint, which does not ship with pip by default. Since I will use pip to install the requests module later, the first step is to install pip.

$ sudo apt install python-pip

After the installation succeeds, check the pip version:

$ pip -V


2. Installing the Requests module

Here I install it via pip:

$ pip install requests


(Screenshot: installing requests)

Run import requests; if no error is raised, the installation was successful.


(Screenshot: verifying the installation succeeded)
3. Installing BeautifulSoup4

Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify a document's parse tree, and can save you hours or even days of work.

$ sudo apt-get install python3-bs4

Note: this is the Python 3 installation method. If you are using Python 2, install it with the following command instead.

$ sudo pip install beautifulsoup4
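As a quick smoke test after installing, Beautiful Soup can parse a small HTML snippet with Python's built-in html.parser. The snippet and tag names below are made up purely for illustration:

```python
import bs4

# a tiny HTML snippet to parse (invented for this check)
html = "<html><body><p class='intro'>Hello, Beautiful Soup</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# navigate to the <p> tag, then read its text and class attribute
p = soup.find("p")
print(p.get_text())   # Hello, Beautiful Soup
print(p["class"])     # ['intro'] (class is a multi-valued attribute, returned as a list)
```

If this runs without an ImportError and prints the text, the module is installed correctly.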


4. Analysis of the Requests module

1) Sending a request

First, import the requests module:

>>> import requests

Then, fetch the target web page. Here I use a cnblogs article as an example:

>>> r = requests.get('http://www.cnblogs.com/chanzhi/p/7542447.html')

This returns a response object named r, from which we can get all the information we want. The get here corresponds to the HTTP GET method; by the same logic you can substitute put, delete, post, or head.
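The other verbs follow the same pattern. A small offline sketch, using requests.Request to build each request without sending it (the URL is a placeholder):

```python
import requests

# each HTTP verb has a matching top-level helper (requests.post, requests.put, ...);
# preparing a Request shows what would be sent, without any network traffic
for method in ('GET', 'POST', 'PUT', 'DELETE', 'HEAD'):
    req = requests.Request(method, 'http://example.com/resource').prepare()
    print(req.method, req.url)
```

In real code you would simply call requests.post(url), requests.head(url), and so on.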


2) Passing URL parameters

Sometimes we want to pass data in a URL's query string. If you were building the URL by hand, the data would follow a question mark as key/value pairs, for example cnblogs.com/get?key=val. Requests lets you supply these parameters as a dictionary of strings through the params keyword argument.

For example, when searching Google for the keyword "python crawler", parameters such as newwindow (open in a new window), q, and oq (the search keywords) could be composed into a URL by hand. With Requests you can write:

>>> payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}

>>> r = requests.get("https://www.google.com/search", params=payload)
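You can check the URL that Requests built from the dictionary by inspecting the request's url attribute. A sketch that prepares the request offline, so nothing is actually sent:

```python
import requests

payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}

# .prepare() encodes the params into the query string without sending anything;
# on a real response the same value is available as r.url
req = requests.Request('GET', 'https://www.google.com/search', params=payload).prepare()
print(req.url)
```

Note that spaces in the values are URL-encoded for you (e.g. q=python+crawler).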

3) Response Content

Get the page's response content with r.text or r.content:

>>> import requests

>>> r = requests.get('https://github.com/timeline.json')

>>> r.text

Requests automatically decodes content coming from the server, and most Unicode character sets are decoded seamlessly. It is worth noting the difference between r.text and r.content. Simply put:

r.text returns str (Unicode) data;

r.content returns bytes, i.e. the raw binary data.

So if you want the text, use r.text; if you want an image or a file, use r.content.
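The difference is easy to see on a Response object. The sketch below builds one by hand via the internal _content attribute, purely for illustration; normally requests.get() returns it ready-made:

```python
import requests

# hand-built Response (no network); requests.get() normally fills these in
resp = requests.Response()
resp._content = '中文内容'.encode('utf-8')  # raw bytes, as received from a server
resp.encoding = 'utf-8'                     # tells .text how to decode them

print(type(resp.content))   # <class 'bytes'>
print(type(resp.text))      # <class 'str'>
print(resp.text)            # 中文内容
```

This is why binary payloads such as images should be saved from r.content, not r.text.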


4) Getting the page encoding

>>> r = requests.get('http://www.cnblogs.com/')

>>> r.encoding

'utf-8'

5) Getting the response status code

We can check the response status code:

>>> r = requests.get('http://www.cnblogs.com/')

>>> r.status_code

200
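Beyond reading the number, Requests offers helpers for acting on the status. A sketch on a hand-built Response (offline; a real r from requests.get behaves the same way):

```python
import requests

resp = requests.Response()
resp.status_code = 404        # pretend the server replied 404 Not Found

print(resp.ok)                                        # False (True only for codes < 400)
print(resp.status_code == requests.codes.not_found)   # True

# raise_for_status() turns 4xx/5xx codes into an exception
try:
    resp.raise_for_status()
except requests.HTTPError as err:
    print('request failed:', err)
```

Calling r.raise_for_status() right after requests.get() is a common way to fail fast in crawlers.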


5. Case presentation

Recently, our company introduced an OA system. Here I take its official documentation page as an example and crawl only the useful information on the page, such as the article title and content.

Demo Environment

Operating system: LinuxMint

Python version: Python 3.5.2

Modules used: requests, beautifulsoup4

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
_author_ = 'Gavinhsueh'

import requests
import bs4

# the destination page address to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# crawl the page content, returning a response object
response = requests.get(url)

# view the response status code
status_code = response.status_code

# use BeautifulSoup to parse the page and locate the specified tag content
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)

Running the program returns the crawl results:


(Screenshot: crawl succeeded)
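To go one step beyond printing the raw tags, get_text() strips the markup from the located element. A sketch on a static snippet standing in for the real page (the headings and text here are invented, with an id="book" container like the one the case looks for):

```python
import bs4

# stand-in for response.content.decode("utf-8") from the real page
html = """
<html><body>
  <div id="book">
    <h1>About Ranzhi</h1>
    <p>Ranzhi is an open-source OA system.</p>
  </div>
</body></html>
"""

content = bs4.BeautifulSoup(html, "html.parser")
element = content.find_all(id='book')

# find_all returns a list of matching tags; get_text() drops the markup
title = element[0].find('h1').get_text()
body = element[0].find('p').get_text()
print(title)   # About Ranzhi
print(body)    # Ranzhi is an open-source OA system.
```

The same pattern applied to the real response extracts just the article title and body text.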

About garbled characters in the crawl results

At first I ran this directly with the system's default Python 2, but I struggled for a long time with garbled encoding in the crawled content, and none of the solutions I found on Google worked. After being driven half crazy by Python 2, I gave in and used Python 3. If any veterans can share their experience with Python 2's garbled-output problem when crawling pages, it would help later readers like me avoid some detours.


Postscript

Python has many crawler-related modules besides requests, such as urllib, pycurl, and tornado. By comparison, I personally find the requests module the simplest to get started with. With this article you can quickly learn to crawl page content using Python's requests module. My ability is limited; if there are any mistakes in this article, corrections are welcome, and if you have questions about crawling page content with Python, I am happy to discuss them with everyone.

Let's learn, communicate, and progress together!


Reference:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#

http://cn.python-requests.org/zh_CN/latest/

