When analyzing and processing selectors, it is also important to note that JavaScript on the page may modify the DOM tree structure.
(a) Use of GitHub
Previously I used the Windows client rather than the shell, so for now I only have a basic understanding; details to be added later. First, a few good tutorials:
GitHub ultra-detailed text guide: http://blog.csdn.net/vipzjyno1/article/details/22098621
GitHub modify and commit: http://www.360doc.com/content/12/0602/16/2660674_215429880.shtml
More to be added later.
(b) Use of Firebug in Firefox
I had been using Firefox's default F12 debugging tools for a while, and they are quite good. I recently gave Firebug a try, and it is even better.
Tools --> Web Developer --> Get More Tools; Firebug is generally the first extension listed. Once installed, it is enabled by default when you press F12.
Its features are extremely powerful: you can obtain an element's XPath or CSS path directly, modify cookies, and more.
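An XPath copied from Firebug can be tested outside the browser before wiring it into a spider. A minimal sketch using only the Python standard library (the HTML snippet and the class name are made up for illustration; note that ElementTree supports only a subset of XPath):

```python
from xml.etree import ElementTree

# A toy fragment standing in for a real page; the "year" class mirrors
# the kind of path Firebug would hand you.
html = '<div><span class="year">1994</span></div>'

root = ElementTree.fromstring(html)
# The same predicate syntax Firebug produces: //span[@class="year"]
year = root.find('.//span[@class="year"]').text
print(year)  # 1994
```

For real pages (which are rarely well-formed XML), the same XPath would be fed to Scrapy's `Selector.xpath()` or to lxml instead.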
(c) Douban movie crawl http://www.ituring.com.cn/article/114408
(1) items.py
# -*- coding: utf-8 -*-
from scrapy import Item, Field

class MovieItem(Item):
    name = Field()
    year = Field()
    score = Field()
    director = Field()
    classification = Field()
    actor = Field()
(2) spiders/movie_spider.py
# -*- coding: utf-8 -*-
from scrapy import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from douban.items import MovieItem

class MovieSpider(CrawlSpider):
    name = "movie"
    allowed_domains = ["douban.com"]
    start_urls = (
        'http://movie.douban.com/top250',
    )
    # rules customizes the URL crawl rules
    rules = (
        # This rule only follows list-page URLs from start_urls;
        # it does not fetch item data itself
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*',))),
        # This rule matches the pages holding the actual data;
        # callback handles the returned response
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = MovieItem()
        # css() and re() also work here; Firebug can help pick the selectors
        item['name'] = sel.xpath('//span[@property="v:itemreviewed"]/text()').extract()
        item['year'] = sel.xpath('//span[@class="year"]/text()').extract()
        item['score'] = sel.xpath('//strong[@class="ll rating_num"]/text()').extract()
        item['director'] = sel.xpath('//a[@rel="v:directedby"]/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//a[@rel="v:starring"]/text()').extract()
        return item
(3) pipeline.py
# Save the crawled data to the database. There are two versions:
# one saves to the MySQL database,
# the other saves to the non-relational database MongoDB.
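Since the article only describes the pipeline in comments, here is a minimal sketch of the MongoDB version. The class name, database name ("douban"), and collection name ("movies") are my own assumptions, not the article's; the `flatten` helper reflects that Scrapy's `.extract()` returns lists:

```python
def flatten(item):
    """Scrapy's .extract() returns lists; unwrap single-element lists
    so the stored document holds plain strings where possible."""
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in item.items()}

class MongoPipeline(object):
    """Hypothetical pipeline saving items to MongoDB via pymongo."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="douban"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    def open_spider(self, spider):
        # Imported lazily so the module loads even without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["movies"].insert_one(flatten(dict(item)))
        return item
```

The pipeline would be enabled in settings.py via ITEM_PIPELINES; a MySQL version would follow the same open/close/process structure with a DB-API connection instead.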