Python crawler framework pyspider: First Experience


Before this I had worked with Scrapy, thinking it might make my crawler faster, but perhaps I never mastered the essentials of Scrapy, because the crawler never ran as fast as I imagined (see my earlier article on using Scrapy). Then yesterday I came across pyspider. To tell the truth I only meant to have a quick look, but unexpectedly one look made me fall for pyspider.

Let me first show you pyspider's web dashboard:

pyspider is an open-source crawler framework written by a Chinese developer. I personally find this framework very convenient to use; read on to see just how convenient.
Author's blog: http://blog.binux.me/
This article follows the author's own tutorial, which crawls Douban movies. I crawl Douban movies as well, but some parts of the original article no longer apply, so I make a few corrections here to match the changes in the web pages.

Installing Pyspider

Install pyspider: pip install pyspider
It seems pyspider only supports 32-bit systems here, because installing pyspider first requires a dependency, pycurl, and the readily available pycurl builds only support 32-bit systems. I did find a pycurl that someone recompiled (on CSDN) which can be installed on 64-bit, and pyspider did then install, but it was not perfect: debugging did not work! And is debugging important? Very.
Out of love for pyspider, I decisively reinstalled my system!
If you are on a 32-bit system, install it like this:

    pip install pycurl
    pip install pyspider

If you are on a 64-bit system, are not obsessive-compulsive, and can accept imperfect things, install it like this:
Download the recompiled pycurl and install it.
Then in cmd run: pip install pyspider
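Before moving on, it is worth checking that pycurl actually imports. A quick sanity check from the Python prompt (pycurl.version is a standard attribute of the module, reporting the libcurl build it is linked against):

    # If this prints a version string, pycurl is installed and usable.
    import pycurl
    print(pycurl.version)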

First Pyspider crawler
    1. Open cmd, type pyspider, then open a browser at http://localhost:5000 to reach the pyspider dashboard. The first time you open it, the dashboard is blank. (Do not close the cmd window after opening the browser!)
    2. Click Create and enter a project name (the name, of course, should not be chosen carelessly).
    3. Click OK to enter a script editor; a script is generated automatically when the project is created, and we only need to modify it. We are going to crawl all of Douban's movies, choosing http://movie.douban.com/tag/ as the starting point, that is, crawling begins from there.
    4. First, change on_start:
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

There is not much to say here: we just changed the URL. The callback names the function that handles the fetched start page next, and @every(minutes=24 * 60) reruns on_start once a day.
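For orientation: a pyspider project script is a class inheriting BaseHandler, and every function we edit below is a method of it. A minimal sketch of the overall shape, close to what pyspider generates when the project is created (the start URL is the one we chose above; crawl_config is left empty as generated):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {
        }

        @every(minutes=24 * 60)
        def on_start(self):
            # Entry point: schedule the start page; index_page will parse it.
            self.crawl('http://movie.douban.com/tag/', callback=self.index_page)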
5. Change the index_page function
Let's first take a look at the start page.

Movies are categorized by genre, by country/region, and by year. We can choose to categorize by country or by year, but preferably not by genre, because the same movie can be both romance and action (which would feel odd). Here I choose to categorize by year.
Let's see how I changed index_page.

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('#content>div>div.article> table:nth-child(9)>tbody>tr>td>a').items():
            a = each.attr.href.replace('www', 'movie')
            self.crawl(a, callback=self.list_page)

You can see that in response.doc we select tags with the selector
#content>div>div.article> table:nth-child(9)>tbody>tr>td>a
Friends familiar with CSS selectors will recognize this sort of thing at a glance; I was encountering it for the first time, so I could not explain why it looks the way it does. In fact a CSS selector, like a regular expression or an XPath, is simply a way of selecting content, and once you see that, what this one means is easy to understand.

(Divider: friends already familiar with CSS selectors can skip the following section.)

Let's first open
http://movie.douban.com/tag/
We want to select the year links (2013, 2012, ..., all the way back to 1989), so right-click 2013 and choose Inspect Element.

Then right-click the highlighted element, choose Copy CSS Path to get one path, and paste it into a text file; do the same for a few more year tags and look for the pattern.
You can see that the copied CSS paths differ only in the places I marked with red lines. Remove the differing parts and keep only the common part, that is, the bottom line; copy it down and put it in

    for each in response.doc('#content>div>div.article> table:nth-child(9)>tbody>tr>td>a').items():

inside the parentheses, telling the crawler that the part we want to crawl lives under this path!
That is one way to obtain a CSS path; even someone touching CSS selectors for the first time, like me, can manage it. The trimming step is illustrated in the sketch below.
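To make the trimming concrete, here is a hypothetical pair of copied paths; the :nth-child() indices on tr and td are invented for illustration, and only the final common selector comes from the actual page:

    # Two paths as a browser might copy them for two different year links
    # (the tr/td :nth-child() indices here are hypothetical):
    path_a = '#content > div > div.article > table:nth-child(9) > tbody > tr:nth-child(1) > td:nth-child(2) > a'
    path_b = '#content > div > div.article > table:nth-child(9) > tbody > tr:nth-child(3) > td:nth-child(5) > a'
    # Dropping the differing :nth-child() parts leaves the common selector,
    # which matches every year link at once:
    common = '#content>div>div.article> table:nth-child(9)>tbody>tr>td>a'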

Now, back to the main topic.

Next comes:

    a = each.attr.href.replace('www', 'movie')
    self.crawl(a, callback=self.list_page)

In each link matched by the CSS path above, we replace www with movie. Why do that? Open the two versions separately:
http://www.douban.com/tag/2013/?focus=movie
and
http://movie.douban.com/tag/2013/?focus=movie
and compare them.

You can see that the www version has no pagination, while the movie version does! We want to traverse all the movies, so of course we need pagination! That is why the replacement is made.

    self.crawl(a, callback=self.list_page)

This code hands a page like http://movie.douban.com/tag/2013/?focus=movie to the next function to parse.
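Since pyspider's response.doc is a PyQuery object, the selector and the replacement can also be tried outside pyspider. A minimal sketch, assuming the requests and pyquery packages are installed and the page still has the structure described above:

    # Fetch the start page, apply the selector from index_page, and print
    # the rewritten (www -> movie) links that would be handed to list_page.
    import requests
    from pyquery import PyQuery

    html = requests.get('http://movie.douban.com/tag/').text
    doc = PyQuery(html)
    for each in doc('#content>div>div.article> table:nth-child(9)>tbody>tr>td>a').items():
        print(each.attr.href.replace('www', 'movie'))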

6. Change the list_page function
    @config(age=10 * 24 * 60 * 60)
    def list_page(self, response):
        # Get the link to each movie's detail page and hand it to the next function
        for each in response.doc('td > .pl2 > a').items():
            self.crawl(each.attr.href, callback=self.detail_page)
        # Turn the page; the next page is again handled by list_page
        for each in response.doc('.next > a').items():
            self.crawl(each.attr.href, callback=self.list_page)

The CSS paths here were obtained with pyspider's own CSS selector helper. While we are at it, let me explain how to use it (it is not all-powerful, of course, otherwise I would not have used the browser's Inspect Element to get CSS paths earlier).
Let's first click run in the middle of the script editor,
then select the follows tab, and you will see this:
Click the arrow to continue.

By the way, if after clicking the arrow there are no links under follows, then the CSS path in your previous function selected nothing! Go back and fix it!
Here, click the arrow next to one of the links to continue; this brings us back to a rendering of the web page.
 
We have two things to do in list_page: one is to select each movie's detail-page link and hand it to the next function; the other is pagination, with each next page handled by list_page again.
Click enable css selector helper and then click on one movie link, and you will find all the movie links framed in red boxes!

Move the mouse to the point marked in the picture, then click the arrow above the middle of the page, and the CSS path for the movie detail links is added for you.
The paging CSS path can be obtained the same way.
7. Change the detail_page function

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('* > * > div#wrapper > div#content > h1 > span').text(),
            "rate": response.doc('.rating_num').text(),
            "director": response.doc('#info > span:nth-child(1) > span.attrs > a').text(),
        }

This one is simple: we have already reached the movie's detail page, so just look for whatever you want and pick it out with the CSS selector helper! I return only four fields: the URL, the movie title, the rating, and the director.
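If you would rather have the returned dict go somewhere of your own in addition to pyspider's built-in result database, you can override on_result on the Handler class. A minimal sketch; the file name is mine, and pyspider calls on_result once for each dict that detail_page returns:

    # At the top of the project script:
    import json

    # Inside the Handler class:
    def on_result(self, result):
        # Append each result to a local file as one JSON object per line,
        # then fall through to the default handling, which stores the
        # result in pyspider's resultdb.
        if result:
            with open('douban_movies.jsonl', 'a') as f:
                f.write(json.dumps(result, ensure_ascii=False) + '\n')
        super(Handler, self).on_result(result)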

At this point the crawler is basically finished. Click save, return to the dashboard, change the crawler's status to running or debug, then click run on the right, and the crawler starts!
There are also many demos here: http://demo.pyspider.org/
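Once results have accumulated, they appear on the dashboard's Results page, and the webui can also dump them over HTTP. A sketch, assuming the default webui address and that the project was named douban_movie in step 2 (substitute your own project name):

    # Download everything the project has collected, one JSON object per line.
    import requests

    r = requests.get('http://localhost:5000/results/dump/douban_movie.json')
    for line in r.text.splitlines():
        print(line)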

