This article walks through a hands-on example of crawling web page content in Python 3 with the requests module. It is intended as a practical reference for anyone interested in the topic.
1. Installing pip
My desktop runs Linux Mint, which does not ship with pip by default. Since I will use pip to install the requests module later, the first step is to install pip itself.
$ sudo apt install python-pip
Once the installation succeeds, check the pip version:
$ pip -V
2. Installing the Requests module
Here I install it with pip:
$ pip install requests
To verify that the installation succeeded, run import requests in the Python interpreter. If no error is raised, the module is installed.
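The same check can be scripted. The following is a minimal sketch that uses only the standard library; the helper name is_installed is made up for illustration and is not part of requests or pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "requests" is the module installed in this step; add other names as needed.
    for name in ("requests",):
        status = "installed" if is_installed(name) else "missing"
        print(name, status)
```

Because find_spec only locates the module, this check has no side effects from module import.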
3. Installing BeautifulSoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work.
$ sudo apt-get install python3-bs4
Note: This is the installation method for Python 3. If you are using Python 2, you can install it with the following command instead.
$ sudo pip install beautifulsoup4
4. The requests module
1) Sending a request
First, import the requests module:
>>> import requests
Then fetch the target page. Here is an example:
>>> r = requests.get('http://www.jb51.net/article/124421.htm')
This returns a Response object named r, from which we can get all the information we need. The get here is the HTTP request method, so by analogy you can replace it with put, delete, post, or head.
2) Passing URL parameters
Sometimes we want to pass data in the query string of a URL. If you build the URL by hand, the data appears as key/value pairs after a question mark, for example cnblogs.com/get?key=val. Requests lets you supply these parameters as a dictionary of strings through the params keyword argument.
For example, when searching Google for the keywords "python crawler", parameters such as newwindow (open in a new window) and q and oq (the search keywords) would otherwise have to be assembled into the URL by hand. Instead, you can use the following code:
>>> payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}
>>> r = requests.get("https://www.google.com/search", params=payload)
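To see what params does to the URL without contacting the server, the request can be prepared but not sent. This is a small sketch rather than a required step; it relies on requests.Request and its prepare() method, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

import requests

payload = {"newwindow": "1", "q": "python crawler", "oq": "python crawler"}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=payload
).prepare()

# The prepared URL carries the encoded query string.
print(prepared.url)

# Parsing the query string back recovers the original key/value pairs.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

Letting requests encode the parameters avoids hand-escaping spaces and other special characters in the query string.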
3) Response content
Get the response body with r.text or r.content.
>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
Requests automatically decodes the content returned by the server, and most Unicode character sets are decoded seamlessly. A brief note on the difference between r.text and r.content:
r.text returns the data as Unicode text (str);
r.content returns the data as bytes, that is, raw binary data.
So use r.text when you want text, and r.content when you want an image or a file.
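The relationship between the two can be illustrated without any network access. In the sketch below, raw_body stands in for what r.content would hold and text_body for what r.text would hold; the sample markup and file name are invented for illustration.

```python
import os
import tempfile

# Bytes as they might arrive from a server (comparable to r.content).
raw_body = "<title>Python 爬虫</title>".encode("utf-8")

# Decoding with the right encoding yields text (comparable to r.text).
text_body = raw_body.decode("utf-8")

print(type(raw_body).__name__, len(raw_body))
print(type(text_body).__name__, len(text_body))

# Binary data such as images is written to disk as bytes, in "wb" mode.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "body.bin")
    with open(path, "wb") as handle:
        handle.write(raw_body)
    saved_size = os.path.getsize(path)

print(saved_size)
```

The byte length is larger than the character count because each of the two Chinese characters takes three bytes in UTF-8.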
4) Getting the page encoding
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.encoding
'utf-8'
5) Getting the response status code
We can check the response status code:
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.status_code
200
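Once you have a status code, a crawler usually needs to decide whether to continue. The helper below is a hypothetical sketch built on the standard library's http.HTTPStatus; describe_status is not a requests function. Requests itself also provides r.ok and r.raise_for_status() for similar checks.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, reason_phrase) for an HTTP status code."""
    is_success = 200 <= code < 300
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"
    return is_success, phrase


if __name__ == "__main__":
    for code in (200, 404, 500):
        print(code, describe_status(code))
```

In a crawler you would pass r.status_code to such a helper and skip parsing when the first element is False.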
5. A worked example
My company recently introduced an OA system. Here I use its official documentation page as an example and crawl only the useful information on the page, such as the article title and content.
Demo environment
Operating system: Linux Mint
Python version: Python 3.5.2
Modules used: requests, beautifulsoup4
The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
_author_ = 'gavinhsueh'

import requests
import bs4

# Address of the page to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# Fetch the page content; this returns a Response object
response = requests.get(url)

# Check the response status code
status_code = response.status_code

# Parse the HTML with BeautifulSoup and select the elements with the given id
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)
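The parsing step can be tried offline before pointing it at a real site. The sketch below uses an invented HTML snippet in place of the downloaded page and the built-in "html.parser" in place of lxml; the element names inside the id="book" container are assumptions, not the structure of the actual documentation page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><body>
  <div id="book">
    <h1>About the project</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div id="footer">Unrelated footer text</div>
</body></html>
"""

# "html.parser" ships with Python, so lxml is not required here.
soup = BeautifulSoup(html, "html.parser")

# Select the container by id, then pull out the title and body text.
book = soup.find(id="book")
title = book.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in book.find_all("p")]

print(title)
print(paragraphs)
```

Calling get_text() on the matched elements gives plain strings, which are easier to store than the raw tags that printing the find_all result would show.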
Running the program prints the status code and the matched content, which shows that the crawl succeeded.
A note on garbled crawl results
At first I ran the script with the system's default Python 2, but I was stuck for a long time on garbled encoding in the crawled content, and none of the solutions I found on Google worked. After being driven to distraction by Python 2, I gave in and switched to Python 3. If you have experience fixing garbled page content when crawling with Python 2, please share it so that newcomers like me can avoid a few detours.
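One common cause of garbled text, though not necessarily the one I hit, is decoding the response bytes with the wrong encoding. The sketch below, written for Python 3, tries a list of candidate encodings in order. The function decode_body and the candidate list are assumptions chosen for illustration (UTF-8 first, then GB18030, which covers many Chinese pages), not a fix confirmed for the Python 2 problem above.

```python
def decode_body(raw, candidates=("utf-8", "gb18030")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_used) tuple.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing bytes that cannot be decoded.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


# The same two characters, encoded the way two different sites might send them.
utf8_page = "爬虫".encode("utf-8")
gbk_page = "爬虫".encode("gbk")

# Decoding UTF-8 bytes as ISO-8859-1 does not fail, it just produces mojibake.
garbled = utf8_page.decode("iso-8859-1")

print(decode_body(utf8_page))
print(decode_body(gbk_page))
print(repr(garbled))
```

With requests, the bytes to pass in would be r.content. Alternatively, r.encoding can be set to the correct value before reading r.text.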