This article covers: 1. installing pip; 2. installing the requests module; 3. installing beautifulsoup4; 4. analysis of the requests module (sending requests, passing URL parameters, response content, getting the page encoding, getting the response status code); 5. a demonstration case; and a postscript.
1. Installing pip
My desktop runs Linux Mint, which does not ship with pip installed by default. Since I will install the requests module with pip later, the first step is to install pip itself.
$ sudo apt install python-pip
After the installation succeeds, check the pip version:
$ pip -V
2. Installing the Requests module
Here I install it via pip:
$ pip install requests
Then run import requests in a Python shell; if no error is raised, the installation was successful.
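The check above can be done in one step; assuming requests installed correctly, importing it and printing its version confirms the installation:

```python
# Importing without error means requests is installed;
# the version string is a quick extra confirmation.
import requests

print(requests.__version__)
```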
3. Installing beautifulsoup4
Beautiful Soup is a Python library that extracts data from HTML or XML files. It lets you navigate, search, and modify a document's parse tree, and can save you hours or even days of work.
$ sudo apt-get install python3-bs4
Note: this is the Python 3 installation method; if you are using Python 2, you can install it with the following command instead.
$ sudo pip install beautifulsoup4
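A quick way to confirm beautifulsoup4 works is to parse a tiny HTML snippet; this sketch uses the built-in 'html.parser' so it does not depend on lxml:

```python
import bs4

# Parse a minimal HTML snippet with Python's stdlib parser
soup = bs4.BeautifulSoup('<p class="msg">hello</p>', 'html.parser')
print(soup.find('p', class_='msg').text)  # hello
```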
4. requests Module Analysis
1) Send request
First of all, of course, to import the requests module:
>>> import requests
Then, fetch the target page to crawl. Here I take a cnblogs blog post as an example:
>>> r = requests.get('http://www.cnblogs.com/chanzhi/p/7542447.html')
This returns a response object named r, from which we can get all the information we want. The get here is the HTTP request method, so by analogy you can also replace it with put, delete, post, or head.
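In practice it is worth wrapping the call with a timeout and error handling. The fetch helper below is my own sketch, not part of requests; it returns None when the request fails for any reason:

```python
import requests

def fetch(url, timeout=5):
    """Fetch a URL, returning the Response, or None on any network/HTTP error."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raise on 4xx/5xx status codes
        return r
    except requests.RequestException:
        return None

resp = fetch('http://www.cnblogs.com/chanzhi/p/7542447.html')
print(resp.status_code if resp is not None else 'request failed')
```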
2) Passing URL parameters
Sometimes we want to pass data in the URL's query string. If you build the URL by hand, the data is placed after a question mark as key/value pairs, for example cnblogs.com/get?key=val. Requests lets you provide these parameters as a dictionary of strings via the params keyword argument.
For example, when we search Google for the keyword "python crawler", parameters such as newwindow (open in a new window), q and oq (the search keywords) could be composed into the URL manually, but instead you can use the following code:
>>> payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}
>>> r = requests.get("https://www.google.com/search", params=payload)
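You can inspect the URL that requests builds without actually sending anything by preparing the request first; this offline sketch shows the encoded query string:

```python
import requests

# Build (but do not send) a GET request to inspect the encoded URL
req = requests.Request('GET', 'https://www.google.com/search',
                       params={'newwindow': '1', 'q': 'python crawler'})
prepared = req.prepare()
print(prepared.url)
```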
3) Response Content
Get the page response content with r.text or r.content.
>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
Requests automatically decodes content from the server, and most Unicode character sets are decoded seamlessly. It is worth adding a note here on the difference between r.text and r.content. Briefly:
r.text returns str (Unicode) data;
r.content returns bytes, i.e. raw binary data.
So if you want text, use r.text; if you want an image or a file, use r.content.
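The difference can be demonstrated offline by constructing a Response by hand; this is an internal trick for illustration only, since requests normally fills these fields for you:

```python
import requests

# Hand-built Response for illustration; requests normally fills these fields
resp = requests.models.Response()
resp._content = '你好, world'.encode('utf-8')  # raw bytes as received
resp.encoding = 'utf-8'                        # charset used by .text

assert isinstance(resp.content, bytes)  # .content is raw bytes
assert isinstance(resp.text, str)       # .text is decoded text
print(resp.text)  # 你好, world
```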
4) Get page encoding
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.encoding
'utf-8'
5) Get the response status code
We can detect the response status code:
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.status_code
200
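requests also ships a lookup table of named status codes, so checks can use readable constants instead of bare numbers; a small offline sketch:

```python
import requests

# Named status codes avoid magic numbers in checks like r.status_code == 200
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404
```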
5. Case presentation
Recently my company introduced an OA system. Here I take its official documentation page as an example, and crawl only the useful information from the page, such as the article title and content.
Demo Environment
Operating system: LinuxMint
Python version: Python 3.5.2
Modules used: requests, beautifulsoup4
The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'GavinHsueh'

import requests
import bs4

# the destination page address to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# crawl the page content, returning a response object
response = requests.get(url)

# view the response status code
status_code = response.status_code

# use BeautifulSoup to parse the page and locate the specified tag content
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)
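The parsing step of the script can be exercised offline against a small stand-in HTML string; the id="book" element here is just a miniature mock of the real documentation page:

```python
import bs4

# A miniature stand-in for the real documentation page
html = '''
<html><body>
  <h1>About RanZhi</h1>
  <div id="book"><p>RanZhi OA documentation.</p></div>
</body></html>
'''

content = bs4.BeautifulSoup(html, 'html.parser')
element = content.find_all(id='book')
print(element[0].get_text(strip=True))  # RanZhi OA documentation.
```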
Running the program prints the crawled result, confirming the crawl succeeded.
About garbled crawl results
In fact, at first I used the Python 2 that ships with the system, but I struggled for a long time with garbled encoding in the crawled content, and every solution I found came up empty. After being driven half crazy by Python 2, I gave in and used Python 3 instead. As for the garbled page-content problem under Python 2, I welcome experienced readers to share their solutions and save those who come after some detours.
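One remedy often suggested for such garbling, which I have not verified under Python 2, is to re-detect the charset from the raw bytes via apparent_encoding before reading .text; a sketch using a hand-built Response for illustration:

```python
import requests

# Hand-built Response for illustration; requests normally fills these fields
resp = requests.models.Response()
resp._content = '编码问题的示例内容'.encode('gbk')  # simulate a GBK-encoded page
resp.encoding = 'ISO-8859-1'  # a wrong guess would garble .text

garbled = resp.text                      # decoded with the wrong charset
resp.encoding = resp.apparent_encoding   # re-detect charset from the bytes
fixed = resp.text
print(resp.apparent_encoding)
```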
Postscript
Python has many crawler-related modules besides requests, such as urllib, pycurl, and tornado. By comparison, I personally find the requests module the simplest to get started with. Through this article, you can quickly learn to crawl page content using Python's requests module. My ability is limited, so if the article contains any mistakes, corrections are welcome; and if you have any questions about crawling page content with Python, I also welcome discussion.
We learn together, communicate together and progress together!
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#
http://cn.python-requests.org/zh_CN/latest/