When reprinting, please credit the author and source: http://blog.csdn.net/c406495762
Code on GitHub: https://github.com/Jack-Cherish/python-spider
Python version: Python 3.x
Platform: Windows
IDE: Sublime Text 3
PS: This article was written for a GitChat online sharing session and was published on September 19, 2017. Activity address:
http://gitbook.cn/m/mazi/activity/59b09bbf015c905277c2cc09
Contents:
I. Preface
II. Introduction to Web Crawlers
    1 Inspecting Elements
    2 A Simple Example
        (1) Installing requests
        (2) Simple example
III. Crawlers in Practice
    1 Novel Download
        (1) Background
        (2) A first try
        (3) Beautiful Soup
        (4) Putting the code together
    2 Wallpaper Download
        (1) Background
        (2) Going further
        (3) Putting the code together
    3 iQiyi VIP Video Download
        (1) Background
        (2) Stepping up
        (3) Writing the code
IV. Summary
I. Preface
Strongly recommended: read this article with a computer at hand. It is built around hands-on practice, so if anything feels unclear as you read, work through it yourself.
The hands-on projects in this article are: novel download (a static website), wallpaper download (a dynamic website), and iQiyi VIP video download.
II. Introduction to Web Crawlers
A web crawler, also called a web spider, fetches the content of a page based on its address (URL). The URL is simply the link we type into the browser. For example, https://www.baidu.com/ is a URL.
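To make the idea of a URL concrete, here is a minimal sketch that splits the example address into its parts with Python's standard urllib.parse module. The variable names are only illustrative.

```python
from urllib.parse import urlsplit

# The example address from the text above.
url = 'https://www.baidu.com/'

parts = urlsplit(url)
print(parts.scheme)  # protocol used to reach the site
print(parts.netloc)  # host name of the server
print(parts.path)    # location of the page on that server
```

A crawler hands an address like this to the server and receives the page content in return.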
Before getting into crawlers proper, we need one essential skill for writing them: inspecting elements (skip this part if you already know it).
1 Inspecting Elements
Enter a URL in the browser's address bar, right-click the page, and choose Inspect. (The name varies by browser: Chrome calls it Inspect, Firefox calls it Inspect Element, but the function is the same.)
We can see a big pile of code on the right. This code is HTML. What is HTML? An easy analogy: our genes determine our original looks, and the HTML returned by the server determines the original appearance of a website.
Why "original" appearance? Because people can get plastic surgery. Harsh, isn't it? So can a website get "plastic surgery" too? It can. Take a look at the picture below:
Could I really have that much money? Obviously not. So how do we give a website a "facelift"? By modifying the HTML returned by the server. Each of us can be a "cosmetic surgeon" who edits page content. When we inspect an element somewhere on the page, the browser jumps to the corresponding HTML, and we can change that HTML locally.
A small example: we all know that when the browser's "remember password" feature fills in a password, it shows up as a row of small black dots and cannot be read. Can we make the password visible? Yes, with a little "minor surgery" on the page. Taking Taobao as an example, right-click the password input box and choose Inspect.
As you can see, the browser jumps to the corresponding HTML for us. Change the value of the type attribute from password to text, as shown in the figure below (edit it directly in the code on the right):
With that, the password we had the browser remember is displayed in plain text.
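The same "minor surgery" can be mimicked offline in Python. This is only a rough sketch: the input markup below is a made-up stand-in, not Taobao's actual page source, and the names server_html and local_html are hypothetical.

```python
# Hypothetical markup standing in for a login form; not Taobao's real source.
server_html = '<input type="password" name="pwd">'

# "Minor surgery": edit a local copy, much as the Inspect panel does.
local_html = server_html.replace('type="password"', 'type="text"', 1)

print(local_html)
print(server_html)  # the copy that came from the server is unchanged
```

Only the local copy changes; nothing is sent back to the server.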
What is the point of all this? The browser, acting as a client, fetches information from the server, parses it, and shows it to us. We can edit the HTML locally to give the page a "facelift", but our edits are never uploaded to the server, and the HTML stored there does not change. Refresh the page and it returns to its original appearance. It's like plastic surgery: we can change what is on the surface, but we cannot change our genes.
2 A Simple Example
The first step of a web crawler is to get a page's HTML from its URL. In Python 3, you can use urllib.request or requests to fetch web pages. urllib is part of Python's standard library, so it is available as soon as Python is installed. requests is a third-party library that we have to install ourselves.
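For comparison, here is a minimal sketch of the standard-library route. The helper name fetch_html and its timeout default are assumptions made for this illustration. The demo uses a data: URL so that it runs without network access; to crawl a real page, pass an http or https address instead.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page with urllib.request and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


if __name__ == '__main__':
    # A data: URL keeps this demo offline; swap in a real page address to crawl.
    print(fetch_html('data:text/html;charset=utf-8,Hello%20crawler'))
```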
requests is powerful and easy to use, so this article uses it to fetch page HTML. The requests repository on GitHub: https://github.com/requests/requests
(1) Installing requests
In cmd, install requests with the following command:
    pip install requests
Or, where easy_install is still available (it is deprecated in favor of pip):
    easy_install requests
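To confirm that the installation worked, one option is to ask Python whether it can locate the module. This is a small sketch; the helper name is_installed is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    if is_installed('requests'):
        print('requests is installed')
    else:
        print('requests is missing; run: pip install requests')
```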
(2) Simple example
The basic methods of the requests library are as follows:
Official Chinese tutorial: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
The developers of requests provide a detailed Chinese tutorial that is easy to consult. This article does not cover all of it; it picks out only the parts needed for the hands-on projects.
First, let's look at the requests.get() method, which sends a GET request to the server. It doesn't matter if you don't know what a GET request is. Just take "get" literally: requests.get() fetches, or grabs, data from the server. Let's look at an example (using www.gitbook.cn) to make this concrete:
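    # -*- coding: utf-8 -*-
    import requests

    if __name__ == '__main__':
        # The page whose HTML we want to fetch
        target = 'http://gitbook.cn/'
        # Send a GET request to the target URL
        req = requests.get(url=target)
        # Print the HTML text of the response
        print(req.text)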
The one parameter that requests.get() must be given is url, because we have to tell the GET request who the target is, that is, whose information we want. Run the program to see the results:
The left side is the output of our program, and the right side is what Inspect shows on the www.gitbook.cn website. We can see that we have successfully fetched the page's HTML. This is about the simplest crawler there is. You may ask: I have only fetched this page's HTML, so what use is that? Patience, dear reader. Next we move on to the hands-on projects.
III. Crawlers in Practice
1 Novel Download
(1) Background
Novel website: Biqukan. URL: http://www.biqukan.com/
Biqukan is a pirate novel site that hosts many novels from Qidian (the Qidian Chinese Network). Its updates lag slightly behind Qidian's original releases. The site only supports reading online and does not offer packaged downloads. So this project crawls the site and saves a novel called "A Will Eternal" ("Yi Nian Yong Heng"), a fantasy novel currently being serialized by the author Er Gen. PS: This example is for learning and exchange only. To support Er Gen, please subscribe on Qidian.
(2) A first try
Let's start with the first chapter of "A Will Eternal". URL: http://www.biqukan.com/1_1094/5403177.html
Let's use what we have learned so far to get its HTML. As a first attempt, write the following code: