The first spider grabbed the city ID of Shanghai and, along the way, the IDs of its next-level administrative districts.
The second spider grabbed the shopids of the top 10,000 restaurants in Shanghai.
This is the third spider: given a restaurant's shopid, it crawls all of that restaurant's reviews within a given month.
Taken together, the three spiders can crawl all reviews of the top N restaurants in any city. With small modifications, the third spider can also crawl only one day's reviews, or only one user's reviews; from the crawler's point of view these are all the same job.
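As a rough illustration of how one filter can cover the month, single-day, and single-user cases, here is a minimal sketch. The names date_window and keep_review are hypothetical helpers, not part of the spider; they only use the standard library.

```python
import calendar
import datetime


def date_window(year, month, day=None):
    """Return (first_date, last_date) for a single day or for a whole month."""
    if day is not None:
        d = datetime.date(year, month, day)
        return d, d
    # monthrange returns (weekday of the 1st, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, 1), datetime.date(year, month, last_day)


def keep_review(review_date, reviewer_id, window, only_reviewer=None):
    """True if the review is inside the window and, optionally, by one user."""
    first, last = window
    if not (first <= review_date <= last):
        return False
    return only_reviewer is None or reviewer_id == only_reviewer
```

Passing a day narrows the window to that one date, and passing only_reviewer keeps only that user's reviews; the crawling loop itself would not need to change.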
After the warm-up of the first two spiders, this one is a bit more complex.
The URL http://www.dianping.com/shop/2269717/review_more?pageno=1 is the first page of a restaurant's reviews; pageno=1 means page 1. Each review carries a user ID and a review time. If the review was written in the current year, the review time omits the year; otherwise the year is included. The bottom of the page shows that this restaurant has 24 pages of reviews in total.
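The URL scheme and page count described above can be sketched as follows. This is an illustration only: review_page_url, max_page and next_page_url are hypothetical names, and link_texts stands for the texts of the page-number links at the bottom of the page, however they were extracted.

```python
REVIEW_URL = "http://www.dianping.com/shop/%s/review_more?pageno=%s"


def review_page_url(shop_id, page_no):
    """Build the URL of one page of a shop's reviews."""
    return REVIEW_URL % (shop_id, page_no)


def max_page(link_texts):
    """Largest page number among the link texts, or -1 if there are none."""
    # Convert to int before comparing: as strings, "9" would sort above "24".
    numbers = [int(t) for t in link_texts if t.strip().isdigit()]
    return max(numbers) if numbers else -1


def next_page_url(shop_id, current_page, link_texts):
    """URL of the following page, or None when the last page was reached."""
    nxt = current_page + 1
    if nxt > max_page(link_texts):
        return None
    return review_page_url(shop_id, nxt)
```

For the shop above, next_page_url("2269717", 1, ["2", "3", "24"]) gives the pageno=2 URL, and the function returns None once current_page reaches 24.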
To crawl the reviews from July 2014, the process is:
1. July 2014 has 31 days, so the dates to crawl run from 2014-07-01 to 2014-07-31.
2. Crawl the pages from http://www.dianping.com/shop/2269717/review_more?pageno=1 to http://www.dianping.com/shop/2269717/review_more?pageno=24.
3. If pageno=1, read the bottom of the page and save the maximum page number.
4. For each review: if its creation date has no year, fill in the current year. If the date falls between 2014-07-01 and 2014-07-31, store the review. If a review on the current page is earlier than 2014-07-01, do not crawl any further pages, because the reviews on the following pages are older still.
5. If you hit a 403 error, pause for 10 minutes before continuing; if you hit a 404 error, stop crawling and return.
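The date handling in the steps above can be sketched in isolation as below. The helper names normalize_review_date and classify_review are hypothetical, and the sketch assumes that dates from earlier years appear either with a two-digit year (for example 13-07-31) or a four-digit one.

```python
import datetime


def normalize_review_date(raw, this_year):
    """Turn '07-31' or '13-07-31' into a datetime.date."""
    parts = [int(p) for p in raw.strip().split("-")]
    if len(parts) == 2:
        # No year on the page: the review is from the current year.
        month, day = parts
        year = this_year
    else:
        year, month, day = parts
        if year < 100:
            # Assumption: a two-digit year means 20xx.
            year += 2000
    return datetime.date(year, month, day)


def classify_review(review_date, min_date, max_date):
    """Return 'stop', 'store' or 'skip' for one review."""
    if review_date < min_date:
        return "stop"   # older than the window: no later page can match
    if review_date <= max_date:
        return "store"  # inside the window
    return "skip"       # newer than the window: keep paging
```

A single "stop" on a page is enough to end the crawl after that page, since reviews on later pages are older.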
The spider is as follows:
import datetime
import random
import time

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class ShopReviewSpider(BaseSpider):
    name = "shopreview_spider"
    allowed_domains = ["dianping.com"]
    start_urls = []
    # let 404 and 403 responses reach parse() instead of being dropped
    handle_httpstatus_list = [404, 403]

    def __init__(self):
        """
        year_mon = '2014-04'
        shopid = '2269717'
        """
        self._shopid = "2269717"
        self._thisyear_int = datetime.date.today().year
        # target month: July 2014, which has 31 days
        year_g = 2014
        mon_g = 7
        max_day = 31
        self._min_date = datetime.date(year_g, mon_g, 1)
        self._max_date = datetime.date(year_g, mon_g, max_day)
        # set start URL
        self.pagno = 1
        self.start_urls = ["http://www.dianping.com/shop/%s/review_more?pageno=%s" % (self._shopid, self.pagno)]

    def parse(self, response):
        if response.status == 403:
            # blocked: wait 10 minutes, then retry the same page
            time.sleep(10 * 60)
            # dont_filter=True so the retry of an already-seen URL is not
            # dropped by the duplicate filter
            yield Request(response.url, callback=self.parse, dont_filter=True,
                          headers={'Referer': 'http://www.baidu.com/s?psid=sdafwewer' + str(1)})
        elif response.status == 404:
            print("\n\nmeet 404, mark shop fetched and return\n\n")
        else:
            hxs = HtmlXPathSelector(response)
            # extract reviews
            xs = hxs.select('//div[@class="comment-list"]/ul/li')
            generate_new_request = True
            if len(xs) == 0:
                print("len(xs) == 0")
                generate_new_request = False
            for x in xs:
                reviewer = x.select('div[@class="pic"]/a/@href').extract()[0].split('/')[-1]
                review = x.select('div[@class="content"]/div[@class="comment-txt"]/div[@class="J_brief-cont"]').extract()[0].strip()
                reviewdate_t = x.select('div[@class="content"]/div[@class="misc-info"]/span[@class="time"]/text()').extract()[0].split()[0]
                reviewdate = ""
                if len(reviewdate_t) == 5:
                    # "MM-DD": no year shown, so the review is from this year
                    reviewdate = ("%s" % self._thisyear_int) + "-" + reviewdate_t
                else:
                    # assumes older reviews show a two-digit year, e.g. "13-07-31"
                    reviewdate = "20" + reviewdate_t
                rd1, rd2, rd3 = reviewdate.split('-')
                dd = datetime.date(int(rd1), int(rd2), int(rd3))
                if dd < self._min_date:
                    # older than the target month: later pages are older still
                    generate_new_request = False
                elif dd >= self._min_date and dd <= self._max_date:
                    print("----------------")
                    print(reviewdate)
                    print(reviewer)
                    print(review)
                else:
                    pass

            # read the page-number links to find the last page
            xs = hxs.select('//a[@class="PageLink"]')
            num_page_link = len(xs)
            max_pages = -1
            if num_page_link > 0:
                max_pages = max([int(x.select("text()").extract()[0]) for x in xs])

            if generate_new_request:
                self.pagno += 1
                if num_page_link == 0 or self.pagno > max_pages:
                    pass
                else:
                    new_url = "http://www.dianping.com/shop/%s/review_more?pageno=%s" % (self._shopid, self.pagno)
                    rand_int = random.randint(1, 999999)
                    yield Request(new_url, callback=self.parse,
                                  headers={'Referer': 'http://www.baidu.com/s?psid=sdafwewer' + str(rand_int)})
The spider lives at /tmp/scrapy-test/crawdp/crawdp/spiders/shopreview_spider.py.
Run "scrapy crawl shopreview_spider" in the /tmp/scrapy-test/crawdp directory; the output looks like this:
----------------
2014-07-31
1198927
<div class="J_brief-cont"> This shop for another one <br> don't know when it will stop <br> Overall business lunch business is very good <br> no special dishes <br> overall feel good </div>
----------------
2014-07-29
1942230
<div class="J_brief-cont"> for children to celebrate Children's Day, stay near hotels, after review of this shop, in the 5 spreads 1 minutes when asked to point afternoon tea set, was accepted! Hehe <br> environment is very Hong Kong-style, good afternoon tea! </div>
----------------
2014-07-29
428660
<div class="J_brief-cont"> Lotus leaf rice is very good. The iron-plate squid is also very good. </div>
----------------
2014-07-24
57148861
<div class="J_brief-cont"> Golden Orange Lemon Sweet but I prefer a lot of sour food. It looks delicious. Service attitude good environment simple clean two person morning tea ate 117 </div>
----------------
2014-07-20
5186406
<div class="J_brief-cont"> is an authentic Hong Kong-style restaurant, taste authentic, diverse dishes, but also often the innovation, often go to surprise </div>
----------------
2014-07-14
1836211
<div class="J_brief-cont"> The store's score to tell the truth, in my personal opinion, is low, this kind of shop if placed in my former live in the five factories around, the end of the Terrible in Wanda Plaza, the bad shops, The rice casserole is especially recommended. </div>
----------------
2014-07-13
10365
<div class="J_brief-cont"> is purely for the shrimp next door to buy crayfish to come, next door Lobster Shop 5 Point not to open, had to stop at this shop, ordered an afternoon tea set, milk tea + fish egg sausage powder, taste very authentic, next to a few table customers are said Cantonese should this shop full of authentic, bad place is the first floor seems to be a smoking area, sitting in the passive smoke secondhand. </div>
----------------
2014-07-08
48406752
<div class="J_brief-cont"> Fried fresh milk is very tasty, a large variety of dishes, the night to go to the ring the environment is very good, worth recommending! </div>
----------------
2014-07-03
57580045
<div class="J_brief-cont"> Xiang kee the Taste of oyster test can also, I and colleagues sometimes go eat, environment, General Service, </div>
----------------
2014-07-02
14718103
<div class="J_brief-cont"> The staff was very friendly and helpful. Seafood porridge, burned a little Hu, also is fresh, scallops a bit too salty. Slip chicken Pot, there are many crisp and crisp lotus root slices, chicken very tender. Braised beef with radish and radish stew can be done. Fried fresh milk, very strange a dish, why sweet things will have garlic smell? </div>
----------------
2014-07-01
36276873
<div class="J_brief-cont"> A Hong Kong-style tea restaurant, see a lot of people directly order a copy of the Lord food (fried river flour or fried line, etc.) for supper. The hall is hung with a large TV in Cantonese, the next table is a table of Hong Kong people in the dining. <br> very like his family of white mushroom beef slices, in addition to a little salty, bailing mushrooms eat very full and chewy; the brand pig's feet taste good but only five or six, or slightly salty; Tao vegetables cold dishes, not good, purple cabbage are cut good blockbusters, and sesame paste is not much, and before the plate and eat are not the same, the end of a very large basin is actually not so much, the material is generally used; staple food is the Ribs river powder, no work. <br> staff were very friendly and helpful. </div>
Scrapy has a small problem when handling <br> elements, so the review text is kept as raw HTML and still contains the <div> and <br> tags.
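If plain text is wanted, the raw fragment can be cleaned up after extraction. The following is a minimal sketch using only the standard library, not something the spider does; review_html_to_text is a hypothetical helper, and a regular expression is only adequate here because the fragment is a single <div> with <br> tags inside.

```python
import re
from html import unescape


def review_html_to_text(fragment):
    """Turn '<div ...>a<br>b</div>' into 'a\\nb'."""
    # Each <br> (or <br/>) becomes a line break.
    text = re.sub(r"<br\s*/?>", "\n", fragment, flags=re.IGNORECASE)
    # Drop every remaining tag, such as the wrapping <div>.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &lt; and trim each line.
    lines = [line.strip() for line in unescape(text).splitlines()]
    return "\n".join(line for line in lines if line)
```

Applied to the 2014-07-29 review by user 428660 above, this returns the two sentences without the surrounding <div>.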