Dianping offers many ways to sort restaurants. For example, http://www.dianping.com/search/category/1/10/o10 ranks Shanghai's restaurants by total number of reviews. The listing runs to 50 pages, so it shows the 750 most-reviewed restaurants in the city. But 750 is rather few. Shanghai has 18 districts, and each area's listing likewise shows its own top 750; for example, http://www.dianping.com/search/category/1/10/r802o10 lists the top 750 for the Yaohan area of Pudong New Area. Shanghai currently has about 100,000 restaurants, and working through the listings this way can yield data for at least the top 80,000.
But collecting that much data makes little sense. Most restaurants are ordinary places serving regular meals, where people eat and leave and nobody bothers to write a review. Only distinctive restaurants attract reviews, along with restaurants that care enough about their business to hire paid posters (a "water army") to write reviews for them.
Analysis of the data shows that only about 300 restaurants in Shanghai receive more than 100 reviews per month, roughly 0.3% of all restaurants. A restaurant with more than 20 reviews per month already ranks in the top 3,000, which is rather surprising. Neither the public nor the restaurants participate much yet, which suggests the review business still has plenty of room to grow!
The goal this time is to crawl roughly the 10,000 most-reviewed restaurants in Shanghai: take the top 750 from each of the 18 districts, 13,500 in total, then remove the few hundred restaurants that are listed under two or more districts. That still leaves more than 10,000, enough to cover the noteworthy restaurants in every part of the city.
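The arithmetic and the de-duplication step can be sketched in a few lines of plain Python. This is only an illustration: `merge_districts` and the sample records are hypothetical, the district code `r2` is made up, and the sketch assumes each crawled restaurant is a `(shopid, shopname)` pair whose shopid uniquely identifies it.

```python
# Hypothetical helper: merge per-district result lists, keeping one entry per shopid.
def merge_districts(per_district):
    """per_district maps a district code to a list of (shopid, shopname) pairs."""
    merged = {}
    for shops in per_district.values():
        for shopid, shopname in shops:
            # A restaurant listed under several districts is stored only once.
            merged.setdefault(shopid, shopname)
    return merged


# Upper bound before de-duplication: 18 districts x 750 restaurants each.
MAX_BEFORE_DEDUP = 18 * 750

# Made-up sample: shop "500068" shows up under two districts.
sample = {
    "r5": [("500068", "Shop A"), ("501019", "Shop B")],
    "r2": [("500068", "Shop A"), ("559844", "Shop C")],
}
unique = merge_districts(sample)
print(MAX_BEFORE_DEDUP, len(unique))
```

Keying on shopid rather than on the name avoids merging different branches that happen to share a name.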
Take Pudong New Area as an example. Its top 750 restaurants are listed at http://www.dianping.com/search/category/1/10/r5o10p1. In this URL, the 1 after category is the city code for Shanghai, r5 is the code for Pudong New Area, and p1 means the first page, which holds 15 restaurants; the other symbols can be ignored for now. Every district in Shanghai has at least several thousand restaurants, so there is no need to handle the case of a district with fewer than 750. All that remains is to vary the last number of the link from 1 to 50, crawl each HTML page, and extract the restaurant information.
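As a small illustration of that URL scheme, the 50 page links for a district can be generated like this. The helper name `page_urls` and its parameters are hypothetical; the pattern itself is the one described above, with the parts the text does not explain (`10` and `o10`) left fixed.

```python
# URL pattern from the text: /search/category/<city>/10/r<district>o10p<page>
BASE = "http://www.dianping.com/search/category/%s/10/r%so10p%s"


def page_urls(district, city=1, pages=50):
    """Return the listing-page URLs for one district, pages 1..pages."""
    return [BASE % (city, district, page) for page in range(1, pages + 1)]


urls = page_urls(5)  # 5 is the Pudong New Area code, 1 the Shanghai city code
print(urls[0])
print(len(urls))
```

With 15 restaurants per page, 50 pages give the 750 restaurants per district mentioned earlier.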
Before crawling, edit the configuration file /tmp/scrapy-test/crawdp/crawdp/settings.py and add four lines, so that it reads as follows:
------------------------------------
BOT_NAME = 'crawdp'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['crawdp.spiders']
NEWSPIDER_MODULE = 'crawdp.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla AppleWebKit/537.36 Chrome/27.0.1453.93 Safari/537.36'
COOKIES_ENABLED = False
-------------------------------------
The last four lines are the new ones. This crawl fetches 50 pages, so the settings wait 5 seconds between requests, randomize that delay, present a browser-like user agent, and disable cookies, all to avoid being blocked by the site's servers.
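To get a feel for what these delay settings mean in practice, here is a rough simulation in plain Python. It does not use Scrapy; it only mimics the behaviour described in the Scrapy documentation, where a randomized delay is drawn between 0.5 and 1.5 times the configured value. The function name `random_delay` is made up for this sketch.

```python
import random

DOWNLOAD_DELAY = 5  # seconds, as in the settings above
PAGES = 50


def random_delay(base=DOWNLOAD_DELAY):
    # Uniform draw between 0.5x and 1.5x the base delay.
    return random.uniform(0.5 * base, 1.5 * base)


# One wait between each pair of consecutive page requests.
delays = [random_delay() for _ in range(PAGES - 1)]
print("shortest wait: %.1f s, longest wait: %.1f s" % (min(delays), max(delays)))
print("total wait: about %.0f s" % sum(delays))
```

On average the whole crawl of one district spends around four minutes waiting, which is a small price for not hammering the site.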
Add the file shopids_spider.py in the /tmp/scrapy-test/crawdp/crawdp/spiders/ directory, with the following content:
------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ShopidsSpider(BaseSpider):
    name = "shopids_spider"

    # Pages 1 to 50 of the Pudong New Area listing (district code r5)
    start_urls = []
    for i in range(1, 51):
        start_urls.append("http://www.dianping.com/search/category/1/10/r5o10p%s" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each restaurant on the page sits in its own <ul class="detail">
        xs = hxs.select('//ul[@class="detail"]')
        for x in xs:
            print "---------"
            # The shopid is the last path segment of the restaurant link
            shopid = x.select('li[@class="shopname"]/a[@class="bl"]/@href').extract()[0].split('/')[-1]
            shopname = x.select('li[@class="shopname"]/a[@class="bl"]/text()').extract()[0]
            print "shopid, shopname = %s, %s" % (shopid, shopname)
------------------------------------
Then run "scrapy crawl shopids_spider" in the /tmp/scrapy-test/crawdp directory. It prints the names of the crawled restaurants along with their shopid on Dianping; the result looks like this:
---------
shopid, shopname = 5391580, Thai Princess Pavilion (shinmay Square Shop)
---------
shopid, shopname = 4043482, sai Pui oat Noodle Village (Jinqiao Branch)
---------
shopid, shopname = 2748850, Wang Xiang Garden (96 square shop)
---------
shopid, shopname = 500068, Typhoon shelter (Yaohan)
---------
shopid, shopname = 5473698, on the string of incense pot (Pudong Xin Mei store)
---------
shopid, shopname = 501019, gallery also Fang restaurant (Zhengda shop)
---------
shopid, shopname = 559844, Yu Xiang family (Lujiazui shop)
So how do you find the IDs of Shanghai's 18 districts? On the left side of http://www.dianping.com/search/category/1/10/o10, click "By district" to list links to the 18 districts. Each link contains the district's ID, so a single crawl of that page is enough to collect them.
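Once the link targets have been pulled out of that page, the IDs can be extracted with a regular expression. The following is a minimal sketch under an assumption: that each district link has the same shape as the URLs shown earlier, ending in `r<ID>o10`. The function name `district_ids` and the sample hrefs are made up for illustration.

```python
import re

# Assumed href shape for a district link: .../search/category/1/10/r<ID>o10
DISTRICT_RE = re.compile(r"/search/category/1/10/r(\d+)o10$")


def district_ids(hrefs):
    """Return the numeric IDs found in a list of link targets, in order, without repeats."""
    ids = []
    for href in hrefs:
        match = DISTRICT_RE.search(href)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


# Made-up sample of hrefs as they might appear in the sidebar.
sample_hrefs = [
    "/search/category/1/10/r5o10",
    "/search/category/1/10/r802o10",
    "/search/category/1/10/o10",  # the city-wide link carries no area code
    "/search/category/1/10/r5o10",  # repeated link, ignored the second time
]
print(district_ids(sample_hrefs))
```

The IDs are kept as strings because they are only ever substituted back into URLs.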
This is the simplest approach. You can add more features to make the crawl smarter: for example, check the status of the response and, after a 403, pause for a few seconds before continuing; or store the results in a database or a JSON file. All of this can be done by consulting the Scrapy documentation.
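The two ideas, pausing after a 403 and saving results as JSON, can be illustrated independently of Scrapy. The sketch below uses only the standard library; `fetch_with_pause` and `save_shops` are hypothetical names, and the `fetch` argument stands in for whatever actually downloads a page and returns a `(status, body)` pair.

```python
import json
import os
import tempfile
import time


def fetch_with_pause(fetch, url, pause=5, max_tries=3, sleep=time.sleep):
    """Call fetch(url) -> (status, body); after a 403, pause and try again."""
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 403:
            break
        if attempt < max_tries - 1:
            sleep(pause)  # back off before asking again
    return status, body


def save_shops(shops, path):
    """Write a list of {"shopid": ..., "shopname": ...} dicts to a JSON file."""
    with open(path, "w") as f:
        json.dump(shops, f, indent=2)


# Demo with a fake fetch function that is refused once and then succeeds.
answers = iter([(403, ""), (200, "<html>ok</html>")])
status, body = fetch_with_pause(lambda url: next(answers),
                                "http://www.dianping.com/search/category/1/10/r5o10p1",
                                sleep=lambda seconds: None)  # skip real waiting in the demo
print(status)

path = os.path.join(tempfile.mkdtemp(), "shops.json")
save_shops([{"shopid": "500068", "shopname": "Typhoon shelter (Yaohan)"}], path)
```

Scrapy ships its own retry middleware and feed exports that cover similar ground; the code above only shows the underlying idea.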