Python crawler (5): Scrapy framework, integrated applications and others


When analyzing and processing selectors, note that JavaScript on the page may modify the DOM tree structure.
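For example, an XPath copied from the browser's rendered DOM may match nothing in the raw HTML the crawler actually receives, because the crawler does not execute JavaScript. A minimal sketch (an assumed example, not from the original article):

from scrapy import Selector

# The raw HTML as fetched by the crawler, before any JavaScript has run
raw_html = u'<html><body><div id="root"></div></body></html>'
sel = Selector(text=raw_html)

# A path copied from the browser AFTER JS filled in #root matches nothing here
print(sel.xpath('//div[@id="root"]/span[@class="title"]/text()').extract())  # []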

(a) Use of GitHub

Because I previously worked on Windows, I haven't really used the shell; for now I only have a general understanding. A few good tutorials I found:

A very detailed illustrated GitHub guide: http://blog.csdn.net/vipzjyno1/article/details/22098621

Modifying and committing on GitHub: http://www.360doc.com/content/12/0602/16/2660674_215429880.shtml

I'll expand this section later.

(b) Use of Firebug in Firefox

I've been using Firefox's default F12 debugging tool for a while, and it's pretty good. I just gave Firebug a try, and it's even better.

Tools --> Web Developer --> Get More Tools; the first result is generally Firebug. Once installed, it is enabled by default when you press F12.

Its features are simply unbeatable: you can obtain an element's XPath or CSS path directly, and you can even modify cookies...
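As a quick illustration (an assumed example, not from the original article), a path copied out of Firebug can be pasted straight into a Scrapy Selector to confirm it matches:

from scrapy import Selector

# Made-up HTML standing in for a fragment of the Douban page
html = u'''<div class="item">
  <span class="title">The Shawshank Redemption</span>
  <strong class="ll rating_num">9.6</strong>
</div>'''

sel = Selector(text=html)
# XPath copied from Firebug
print(sel.xpath('//span[@class="title"]/text()').extract())  # [u'The Shawshank Redemption']
# A CSS path works too
print(sel.css('strong.rating_num::text').extract())          # [u'9.6']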

(c) Douban movie crawl http://www.ituring.com.cn/article/114408

(1) items.py

# -*- coding: utf-8 -*-
from scrapy import Item, Field

class MovieItem(Item):
    name = Field()
    year = Field()
    score = Field()
    director = Field()
    classification = Field()
    actor = Field()
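Note that an Item instance behaves like a dict restricted to the declared fields; assigning a key that was not declared raises a KeyError. A quick illustration (standard Scrapy behavior, not from the article):

from scrapy import Item, Field

class MovieItem(Item):
    name = Field()

item = MovieItem()
item['name'] = u'Seven Samurai'  # OK: 'name' is a declared field
# item['rating'] = 9.2           # would raise KeyError: undeclared field
print(dict(item))                # {'name': u'Seven Samurai'}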

(2) spiders/movie_spider.py

# -*- coding: utf-8 -*-
from scrapy import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from douban.items import MovieItem

class MovieSpider(CrawlSpider):
    name = "movie"
    allowed_domains = ["douban.com"]
    start_urls = (
        'http://movie.douban.com/top250',
    )
    # rules customize which URLs get crawled
    rules = (
        # This rule only follows the list pages reached from start_urls; it does not extract data
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*',))),
        # This rule matches the pages holding the actual data; callback handles the returned response
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = MovieItem()
        # css(), re() and so on also work here; Firebug can help pick the selectors
        item['name'] = sel.xpath('//span[@property="v:itemreviewed"]/text()').extract()
        item['year'] = sel.xpath('//span[@class="year"]/text()').extract()
        item['score'] = sel.xpath('//strong[@class="ll rating_num"]/text()').extract()
        item['director'] = sel.xpath('//a[@rel="v:directedby"]/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//a[@rel="v:starring"]/text()').extract()
        return item
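To run the spider and dump the scraped items to a file, the standard Scrapy command line applies (assuming the project is named douban, as the import above suggests):

scrapy crawl movie -o movies.json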

(3) pipeline.py

# Save the crawled data to the database; there are two versions:
# one saves to a MySQL database,
# the other saves to the non-relational database MongoDB.
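The post stops at the comment above, so here is a minimal sketch of what the MongoDB version might look like; the pymongo >= 3 API, a local mongod, and the 'douban'/'movies' database and collection names are assumptions, not from the article:

# -*- coding: utf-8 -*-
import pymongo

class MongoDBPipeline(object):
    def __init__(self):
        # Assumes a local mongod on the default port; names are illustrative
        client = pymongo.MongoClient('localhost', 27017)
        self.collection = client['douban']['movies']

    def process_item(self, item, spider):
        # Store each scraped item as one MongoDB document
        self.collection.insert_one(dict(item))
        return item

Either version then has to be registered in settings.py via ITEM_PIPELINES; the MySQL variant would follow the same process_item pattern with a MySQL connection in place of the MongoDB collection.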





