Crawlers: a hundred-line counterattack for comics
Meow's nagging words up front: this post is about writing a crawler tool in Python. Why write it? After finishing the anime "The Dark burcirt", Meow wanted to keep reading the comic. It turned out none of the major apps had the resource, and the one website finally found was so slow that reading online took real patience. At times like this, it would be much nicer to have downloaded everything in advance.
First, the project address (GitHub): https://github.com/miaoerduo/cartoon-cat. Forks, stars, and suggestions are all welcome.
This article is original; please indicate the source when reprinting ~
Meow's blog: http://www.miaoerduo.com
Original blog article: http://www.miaoerduo.com/python/crawler-comic book 100xxx-html
The thing is, as a technical meow, no problem is allowed to stand in the way of a comic-loving heart. So the question becomes: which crawler technology is the strongest?
Searching bing for Python crawler frameworks turns up the common ones.
Scrapy looked like a good choice. As for the merits of the other frameworks, Meow did not dig in; this was simply the one heard of before. But a problem surfaced during implementation: scrapy cannot directly capture dynamic pages, and the comics on the website Meow needed to crawl are generated with Ajax. Analyzing all that data by hand would be a hassle.
Is there a tool for rendering pages? Like a browser? Yes.
Here we will introduce two tools:
PhantomJS: a browser with no interface (headless). We can drive it with JavaScript code to simulate user behavior. This requires understanding its API and some JavaScript basics.
Selenium: a browser automation testing framework. It drives a real browser (which can also be PhantomJS) and likewise simulates user behavior. Its Python interface is quite simple.
We use selenium + PhantomJS to implement this crawler.
Oh right, this crawler ought to have a resounding name... Let's call it Cartoon Meow, English name Cartoon Cat.
Next, let's walk through the implementation of this crawler bit by bit.
I. Initial stage: environment construction
Here we use Python as the development language, with selenium as the framework. Why? Python is commonly used for crawlers, and selenium can simulate user behavior. PhantomJS is optional, but since Meow will eventually run this on a server, it is effectively necessary.
In order not to affect the machine's own Python, we also use virtualenv to create an independent Python environment. The procedure is as follows:
1. Install virtualenv
Virtualenv is a common tool for creating isolated Python environments. Meow uses it for two reasons: one, to avoid polluting the machine's own environment; two, installing libraries directly on the machine runs into permission issues.
Virtualenv is easy to install and can be installed using pip.
pip install virtualenv
After the program execution is complete, you will be happy to find that you already have the virtualenv tool.
2. Create a python Environment
Virtualenv is very convenient to use.
Create a new environment: virtualenv <env-name>
Enter the independent environment: source <env-path>/bin/activate
After the first command, a new Python environment is created. After the second, the prompt at the start of the command line changes, and tools such as python and pip now point into the new environment. You can verify this with which python.
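For example, a throwaway session might look like this (the environment name cartoon-cat-env is just an illustration):
virtualenv cartoon-cat-env           # create the environment
source cartoon-cat-env/bin/activate  # activate it; the prompt gains a (cartoon-cat-env) prefix
which python                         # should now point inside cartoon-cat-env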
3. Install selenium
After entering the new environment, anything installed by pip goes into the new environment, without affecting the host's Python. Install selenium using pip:
pip install selenium
So far, our basic environment is complete.
4. Install PhantomJs
This can only be downloaded from the official website: http://phantomjs.org/download.html
Since Meow's local experiments run on a Mac, the Mac version was downloaded. After decompression it is ready to use.
II. Search for resources
The comic Meow wanted to read seemed to have no resources on any major site. After much effort, a website was finally found: http://www.tazhe.com/mh/9170/.
Every website is structured differently, so each needs a custom crawler. The crawler in this article only works for this comic site; to crawl other websites, you will need to modify it yourself.
III. Analysis: parsing the resources
We need to parse two kinds of pages. One is the comic's home page, for example: http://www.tazhe.com/mh/9170/
The other is the page of a specific chapter.
1. Home Page
To keep the screenshot small, Meow scaled the window down. The home page looks like this.
Figure 1: the comic home page
All the information is laid out clearly; what we care about is the chapter list below. Using Chrome's powerful Inspect Element feature, we can locate a chapter's position immediately. (Right-click the target location and choose Inspect to find it.)
Figure 2: the chapter nodes
As you can see, the id is play_0. Anyone who has picked up some frontend knows that an id generally identifies a node uniquely within a page. So once we get this page and find the node whose id is play_0, the search scope narrows dramatically.
Each chapter's information sits in an a tag: its href attribute is the URL of the corresponding chapter, and the tag's text is the chapter name. The relative relationship is therefore: div#play_0 > ul > li > a.
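In other words, the markup looks roughly like this (a hypothetical sketch reconstructed from the selector above; the chapter names and the second URL are illustrative):
<div id="play_0">
    <ul>
        <li><a href="/mh/9170/1187086.html">Chapter 1</a></li>
        <li><a href="/mh/9170/1187087.html">Chapter 2</a></li>
        <!-- ...one li per chapter... -->
    </ul>
</div>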
The analysis of the home page ends here.
2. Chapter Page
Open any chapter page, for example: http://www.tazhe.com/mh/9170/1187086.html
A very clean page greets the eye (truly a breath of fresh air in the comic world; many comic sites are wall-to-wall ads).
We place the cursor over the image area -> right-click -> Inspect.
Eh? Why doesn't right-click work?
This trick is actually even more common on novel sites. When we see beautiful text or a cool image, we subconsciously go select -> right-click -> save. In many cases these resources are copyrighted and should not be spread around (slapping my own face here /(ㄒoㄒ)/~~), so disabling right-click is a simple but effective protection.
So how do we get around this trap?
Easy: no right-click needed. Open the browser's developer tools and find the Elements panel. You will see a complicated structure (actually the same thing Inspect Element shows). Keep selecting tags; when a tag is selected, the corresponding region on the left side of the page lights up blue. A few tries later, the right location is found.
Figure 3: the comic image
This is an img tag whose id is qTcms_pic. Find this id and we find the img tag, and its src attribute gives the actual URL of the image.
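With selenium this lookup is a one-liner (a minimal sketch, assuming the browser object and the loaded chapter page from the full code further below):
# find the img node by its id and read the image URL from its src attribute
image_url = browser.find_element_by_css_selector('#qTcms_pic').get_attribute('src')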
Next we need the address of the next image, which means examining the "next page" button. The same method locates it easily.
Figure 4: the next-page button
Meow originally wanted to write this crawler with scrapy, and gave up the moment this came into view. The code of the selected a tag is as follows:
<a class="next" href="javascript:a_f_qTcms_Pic_nextUrl_Href();" title="next page"><span>next page</span></a>
On a simple website, "next page" would be an a tag with a plain href attribute. The upside is that this is simple to implement; the downside is that it is trivially parsed once the page source is obtained. A crawler like scrapy can only crawl such static code (dynamic Ajax content needs extra analysis, which is a hassle). The page here is obviously dynamic and implemented with Ajax, so the images cannot be obtained from the page source alone; the JS code has to actually run. That is why we need a browser, or a browser-like tool such as PhantomJS, that can execute JS.
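To make that concrete, here is a minimal sketch (Python 2, like the rest of this post). Per the analysis above, a plain static fetch executes no JS, so the real image address never materializes in what it returns:
import urllib

# fetch the raw page source; no JS runs here
html = urllib.urlopen('http://www.tazhe.com/mh/9170/1187086.html').read()

# the img's src is filled in by JS after the page loads, so the actual
# image URL cannot be recovered from this static source alone
print html[:200]  # only the static skeleton, before any JS has run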
The a tag above tells us a lot. First, it tells us the node's location: through the class name next, the node is easy to find (there are in fact two nodes with class next; the other sits below the image, but they behave the same). Second, it tells us that clicking the button calls the JS function a_f_qTcms_Pic_nextUrl_Href(). Do we need to go study that function?
No. PhantomJS is a browser; we just click the next button the way a real user would and arrive at the next page. /* Do you feel the power of this tool? */
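In selenium, that click is exactly as short as you would hope (a sketch using the browser object from the code below; find_element_* returns the first match, i.e. the upper of the two buttons):
# locate the "next page" button by its class and click it like a real user
browser.find_element_by_css_selector('a.next').click()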
3. Judge the end of the chapter
The last question is, how can we determine that this chapter is over?
Jump to the last page of a chapter and click "next page" once more: a window pops up.
Figure 5: the last page
After several tests, we find that this pop-up appears only on the last page. So each time we capture a page and click "next page", we check whether the pop-up has appeared to know whether this was the last page. In the developer tools on the right we can see that the pop-up is a div whose id is msgDiv (incidentally, it appears and disappears by having its node added to and removed from the page; another common implementation toggles the display style between none and block, in which case one would check the display attribute instead). So we can judge the end of a chapter by whether that node exists.
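As a sketch, the existence check in selenium looks like this (the helper name chapter_finished is mine; find_element_* raises NoSuchElementException when nothing matches, and the full code below performs the same check inline against #bgDiv, the pop-up's background node):
from selenium.common.exceptions import NoSuchElementException

def chapter_finished(browser):
    # the pop-up node only exists in the DOM on the last page,
    # so a successful lookup means the chapter has ended
    try:
        browser.find_element_by_css_selector('#msgDiv')
        return True
    except NoSuchElementException:
        return False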
At this point, the analysis of both pages is complete. Next, let's start writing the code.
IV. Counterattack: code implementation
1. Simple use of selenium
from selenium import webdriver

browser = webdriver.Firefox()
# browser = webdriver.Safari()
# browser = webdriver.Chrome()
# browser = webdriver.Ie()
# browser = webdriver.PhantomJS()

browser.get('http://baidu.com')
print browser.title
# do anything you want
The above is a simple example. The first step is to import the dependent library.
Step 2: obtain a browser instance. Selenium supports many browsers; for any browser other than Firefox you need to download a driver (selenium itself ships with the Firefox driver). Drivers: https://pypi.python.org/pypi/selenium. After downloading the driver, either add its path to PATH so the driver can be found, or explicitly pass the driver's address as a parameter. It is called as follows:
browser = webdriver.PhantomJS('path/to/phantomjs')
Step 3: Use get to open the webpage.
Finally, parse and process the page through the browser object.
2. Obtain the chapter link information.
From the page analysis above, we know where the chapter information lives: div#play_0 > ul > li > a. With that, the chapter information can be parsed out. The browser object supports a large number of selectors, which greatly simplifies finding nodes.
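For reference, a few of the finder families on the browser object (from the selenium 2.x-era API used throughout this post; each also has a find_elements_* plural form returning a list):
browser.find_element_by_id('play_0')                     # by id
browser.find_element_by_css_selector('#play_0 ul li a')  # by CSS selector
browser.find_element_by_xpath('//div[@id="play_0"]')     # by XPath
browser.find_element_by_tag_name('img')                  # by tag name
browser.find_element_by_class_name('next')               # by class name
The chapter list can then be collected as follows: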
from selenium import webdriver

if __name__ == "__main__":
    driver = "path/to/driver"  # driver address
    browser = webdriver.PhantomJS(driver)  # browser instance

    main_page = "http://www.tazhe.com/mh/9170/"
    browser.get(main_page)  # load the page

    # parse the chapters' element nodes with the CSS selector from the analysis
    chapter_elem_list = browser.find_elements_by_css_selector('#play_0 ul li a')
    chapter_elem_list.reverse()  # the chapters on the page are listed in reverse order

    chapter_list = []
    for chapter_elem in chapter_elem_list:
        # each element's text and href attribute are the chapter name and address
        chapter_list.append((chapter_elem.text, chapter_elem.get_attribute('href')))

    # chapter_list now holds the chapter information
3. Given a chapter's address, fetch the images in that chapter.
This step involves finding nodes, simulating mouse clicks, and downloading resources. Clicks in selenium are especially user-friendly: obtain the node, then call its click() method. As for downloading resources, there are plenty of tutorials online, with two main approaches: right-click and save, or download the URL with another tool. Considering that right-click is not always available, and the operation is a bit fiddly, Meow chose the second option.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import os
from os import path as osp
import urllib


# a simple download tool
def download(url, save_path):
    try:
        with open(save_path, 'wb') as fp:
            fp.write(urllib.urlopen(url).read())
    except Exception, et:
        print(et)


if __name__ == "__main__":
    driver = "path/to/driver"  # driver address
    browser = webdriver.PhantomJS(driver)  # browser instance

    chapter_url = "http://www.tazhe.com/mh/9170/1187061.html"
    save_folder = "./download"

    if not osp.exists(save_folder):
        os.mkdir(save_folder)

    image_idx = 1
    browser.get(chapter_url)  # load the first page

    while True:
        # based on the analysis above, find the image URI
        image_url = browser.find_element_by_css_selector('#qTcms_pic').get_attribute('src')
        save_image_name = osp.join(save_folder, ('%05d' % image_idx) + '.' + osp.basename(image_url).split('.')[-1])
        download(image_url, save_image_name)  # download the image

        # load the next page by simulating a click; on the last page this triggers a pop-up window
        browser.find_element_by_css_selector('a.next').click()

        try:
            # look for the pop-up; if it exists, this chapter is finished and the big loop ends
            browser.find_element_by_css_selector('#bgDiv')
            break
        except NoSuchElementException:
            # no pop-up yet: keep downloading
            image_idx += 1
V. Final answer: written at the end
So far, the design ideas and the main code of Cartoon Meow have all been described. The code above is for illustration only; the code Meow actually uses to download comics is a separate copy, at the GitHub address: https://github.com/miaoerduo/cartoon-cat. The project is only a hundred-odd lines, but it still took quite a while to write.
Blog post done ~ Meow's comic has finished downloading too ~
Figure 6: the downloaded comic
If you think this article helped you, treat Meow to a cup of tea ~~ O(∩_∩)O~~
Please indicate the source for reprinting ~