Python crawler (5): Scrapy framework, integrated applications and others


When analyzing and processing selectors, note that JavaScript on the page may modify the DOM tree structure.
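For example, an XPath copied from the browser's rendered DOM may match nothing in the raw HTML the crawler actually receives, because the crawler does not execute JavaScript. A minimal sketch (an assumed example, not from the original article):

from scrapy import Selector

# The raw HTML as fetched by the crawler, before any JavaScript has run
raw_html = u'<html><body><div id="root"></div></body></html>'
sel = Selector(text=raw_html)

# A path copied from the browser AFTER JS filled in #root matches nothing here
print(sel.xpath('//div[@id="root"]/span[@class="title"]/text()').extract())  # []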

(a) Use of GitHub

Because I previously worked on Windows, I haven't really used the shell; for now I only have a general understanding. A few good tutorials I found:

A very detailed illustrated GitHub guide: http://blog.csdn.net/vipzjyno1/article/details/22098621

Modifying and committing on GitHub: http://www.360doc.com/content/12/0602/16/2660674_215429880.shtml

I'll expand this section later.

(b) Use of Firebug in Firefox

I've been using Firefox's default F12 debugging tool for a while, and it's pretty good. I just gave Firebug a try, and it's even better.

Tools --> Web Developer --> Get More Tools; the first result is generally Firebug. Once installed, it is enabled by default when you press F12.

Its features are simply unbeatable: you can obtain an element's XPath or CSS path directly, and you can even modify cookies...
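As a quick illustration (an assumed example, not from the original article), a path copied out of Firebug can be pasted straight into a Scrapy Selector to confirm it matches:

from scrapy import Selector

# Made-up HTML standing in for a fragment of the Douban page
html = u'''<div class="item">
  <span class="title">The Shawshank Redemption</span>
  <strong class="ll rating_num">9.6</strong>
</div>'''

sel = Selector(text=html)
# XPath copied from Firebug
print(sel.xpath('//span[@class="title"]/text()').extract())  # [u'The Shawshank Redemption']
# A CSS path works too
print(sel.css('strong.rating_num::text').extract())          # [u'9.6']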

(c) Douban movie crawl http://www.ituring.com.cn/article/114408

(1) items.py

# -*- coding: utf-8 -*-
from scrapy import Item, Field

class MovieItem(Item):
    name = Field()
    year = Field()
    score = Field()
    director = Field()
    classification = Field()
    actor = Field()
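Note that an Item instance behaves like a dict restricted to the declared fields; assigning a key that was not declared raises a KeyError. A quick illustration (standard Scrapy behavior, not from the article):

from scrapy import Item, Field

class MovieItem(Item):
    name = Field()

item = MovieItem()
item['name'] = u'Seven Samurai'  # OK: 'name' is a declared field
# item['rating'] = 9.2           # would raise KeyError: undeclared field
print(dict(item))                # {'name': u'Seven Samurai'}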

(2) spiders/movie_spider.py

# -*- coding: utf-8 -*-
from scrapy import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from douban.items import MovieItem

class MovieSpider(CrawlSpider):
    name = "movie"
    allowed_domains = ["douban.com"]
    start_urls = (
        'http://movie.douban.com/top250',
    )
    # rules customize which URLs get crawled
    rules = (
        # This rule only follows the list pages reached from start_urls; it does not extract data
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*',))),
        # This rule matches the pages holding the actual data; callback handles the returned response
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = MovieItem()
        # css(), re() and so on also work here; Firebug can help pick the selectors
        item['name'] = sel.xpath('//span[@property="v:itemreviewed"]/text()').extract()
        item['year'] = sel.xpath('//span[@class="year"]/text()').extract()
        item['score'] = sel.xpath('//strong[@class="ll rating_num"]/text()').extract()
        item['director'] = sel.xpath('//a[@rel="v:directedby"]/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//a[@rel="v:starring"]/text()').extract()
        return item
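To run the spider and dump the scraped items to a file, the standard Scrapy command line applies (assuming the project is named douban, as the import above suggests):

scrapy crawl movie -o movies.json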

(3) pipeline.py

# Save the crawled data to the database; there are two versions:
# one saves to a MySQL database,
# the other saves to the non-relational database MongoDB.
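The post stops at the comment above, so here is a minimal sketch of what the MongoDB version might look like; the pymongo >= 3 API, a local mongod, and the 'douban'/'movies' database and collection names are assumptions, not from the article:

# -*- coding: utf-8 -*-
import pymongo

class MongoDBPipeline(object):
    def __init__(self):
        # Assumes a local mongod on the default port; names are illustrative
        client = pymongo.MongoClient('localhost', 27017)
        self.collection = client['douban']['movies']

    def process_item(self, item, spider):
        # Store each scraped item as one MongoDB document
        self.collection.insert_one(dict(item))
        return item

Either version then has to be registered in settings.py via ITEM_PIPELINES; the MySQL variant would follow the same process_item pattern with a MySQL connection in place of the MongoDB collection.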





