The birth of a site 03: fetching the 10,000 restaurants with the most reviews


Dianping offers several ways to sort restaurants, for example http://www.dianping.com/search/category/1/10/o10, which ranks the restaurants of Shanghai by total review count. It runs to 50 pages, that is, the top 750 most-reviewed restaurants in Shanghai. But 750 is a little thin. Shanghai has 18 districts, and each district also displays its own top 750; for example, http://www.dianping.com/search/category/1/10/r802o10 is the top 750 for the Yaohan area of Pudong New Area. Shanghai now has about 100,000 restaurants, so crawling district by district yields data on at least the top 10,000 or so.

But there is no point in grabbing that much data, because most restaurants are ordinary places serving everyday meals: you eat and go, and nobody feels like reviewing them. Only distinctive restaurants attract reviews, along with the occasional marketing-savvy restaurant that hires paid shills (the "water army") to write reviews for itself.

Data analysis shows that only about 300 restaurants in Shanghai receive more than 100 reviews per month, about 0.3% of the total. If a restaurant gets more than 20 reviews in a month, it is already in the top 3,000, which is actually quite remarkable. Neither the public nor the restaurants participate all that heavily yet, which suggests the review business still has plenty of room to grow!

This time the goal is to crawl the 10,000 or so restaurants in Shanghai with the largest number of reviews: capture the top 750 from each of Shanghai's 18 districts, which gives 13,500 records, then remove the few hundred restaurants that belong to two or more districts at the same time (a sketch of this step follows below). That leaves a little over 10,000, enough to cover every restaurant in every area that has anything worth reviewing.
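As a sketch of that dedup step: every restaurant carries a unique shopid (extracted later in this article), so filtering against a set of already-seen ids is enough. The records below are made-up samples, not real crawl output.

------------------------------------

# A minimal sketch of removing restaurants that belong to two or more
# districts at once, keyed on shopid; the sample records are hypothetical.
crawled_shops = [
    ("500068", "Typhoon Shelter (Yaohan)"),    # found in Pudong New Area
    ("559844", "Yu Xiang Family (Lujiazui)"),  # found in Pudong New Area
    ("500068", "Typhoon Shelter (Yaohan)"),    # same shop, listed again in another district
]

seen = set()
unique_shops = []
for shopid, shopname in crawled_shops:
    if shopid not in seen:
        seen.add(shopid)
        unique_shops.append((shopid, shopname))

print(unique_shops)  # the duplicate 500068 record has been dropped

------------------------------------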

Take Pudong New Area as an example. Its top 750 restaurants by review count live at the URL http://www.dianping.com/search/category/1/10/r5o10p1. Note that the 1 after category is the city code for Shanghai, r5 is the code for Pudong New Area, and p1 is the first page, which holds 15 restaurants; ignore the other symbols for now. Every district in Shanghai has at least a few thousand restaurants, so there is no need to worry about a district falling short of 750 or to handle that case specially. The plan, then, is simply to vary the last number of the link from 1 to 50, fetch the HTML pages, and extract the restaurant information from them.
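To make the URL scheme concrete, here is a small sketch that assembles the listing URLs from those parts; listing_url is a hypothetical helper, not part of the crawler below.

------------------------------------

# A minimal sketch of building the listing URLs: city 1 is Shanghai,
# district "r5" is Pudong New Area, "o10" sorts by review count,
# and "p1".."p50" are the 50 pages of 15 restaurants each.
def listing_url(city=1, district="r5", page=1):
    return "http://www.dianping.com/search/category/%d/10/%so10p%d" % (city, district, page)

pudong_urls = [listing_url(page=p) for p in range(1, 51)]
print(pudong_urls[0])   # http://www.dianping.com/search/category/1/10/r5o10p1
print(pudong_urls[-1])  # http://www.dianping.com/search/category/1/10/r5o10p50

------------------------------------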

Before crawling, edit the configuration file /tmp/scrapy-test/crawdp/crawdp/settings.py and add four lines, so that it ends up in the following form:

------------------------------------

BOT_NAME = 'crawdp'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['crawdp.spiders']
NEWSPIDER_MODULE = 'crawdp.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 AppleWebKit/537.36 Chrome/27.0.1453.93 Safari/537.36'
COOKIES_ENABLED = False

-------------------------------------

The last four lines are the new additions. This run fetches 50 pages, waits 5 seconds between requests, randomizes the download delay (Scrapy waits between 0.5 and 1.5 times DOWNLOAD_DELAY), and disables cookies; all of these measures are there to keep the server from recognizing us as a crawler and blocking us. The browser-style USER_AGENT near the end overrides the default one defined above it.

In the /tmp/scrapy-test/crawdp/crawdp/spiders/ folder, add a file shopids_spider.py with the following content:

------------------------------------

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ShopidsSpider(BaseSpider):
    name = "shopids_spider"

    # pages 1..50 of the Pudong (r5) listing, sorted by review count (o10)
    start_urls = []
    for i in range(1, 51):
        start_urls.append("http://www.dianping.com/search/category/1/10/r5o10p%s" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # each <ul class="detail"> block is one restaurant in the listing
        xs = hxs.select('//ul[@class="detail"]')
        for x in xs:
            print "---------"
            # the shop id is the last path segment of the shop's link
            shopid = x.select('li[@class="shopname"]/a[@class="bl"]/@href').extract()[0].split('/')[-1]
            shopname = x.select('li[@class="shopname"]/a[@class="bl"]/text()').extract()[0]
            print "shopid, shopname = %s, %s" % (shopid, shopname)

------------------------------------

Then run "scrapy crawl shopids_spider" from the /tmp/scrapy-test/crawdp folder, and you will see the names of the crawled restaurants along with their shopids on Dianping. The result looks something like this:

---------
shopid, shopname = 5391580, Thai Princess Pavilion (Shinmay Square shop)
---------
shopid, shopname = 4043482, Sai Pui Oat Noodle Village (Jinqiao branch)
---------
shopid, shopname = 2748850, Wang Xiang Garden (96 Square shop)
---------
shopid, shopname = 500068, Typhoon Shelter (Yaohan)
---------
shopid, shopname = 5473698, Shang String Incense Pot (Pudong Xinmei store)
---------
shopid, shopname = 501019, Gallery Ye Fang Restaurant (Zhengda shop)
---------
shopid, shopname = 559844, Yu Xiang Family (Lujiazui shop)

So how do you learn the IDs of Shanghai's 18 districts? On the left side of http://www.dianping.com/search/category/1/10/o10, click "by district": the page lists links for all 18 districts of Shanghai, and each link contains that district's ID, so one crawl of this page is enough to collect them, for instance as sketched below.
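For example, the district codes can be pulled out of that page with a plain regular expression; the href pattern below matches the r5/r802 style of links shown earlier, but the sample HTML, and the exact markup of the real page, are assumptions.

------------------------------------

# A minimal sketch of extracting district codes ("r" plus digits) from the
# HTML of the "by district" page; the sample markup is made up.
import re

def extract_district_codes(html):
    return sorted(set(re.findall(r'/search/category/1/10/(r\d+)o10', html)))

sample = ('<a href="/search/category/1/10/r5o10">Pudong New Area</a>'
          '<a href="/search/category/1/10/r802o10">Yaohan</a>')
print(extract_district_codes(sample))  # ['r5', 'r802']

------------------------------------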

This approach is the simplest one possible. Many other features could be added to make the crawl smarter, such as checking the status of the response and, after receiving a 403, pausing for several seconds before continuing, or writing the results to a database or a JSON file. Scrapy can do all of these things; one settings-level sketch follows.
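For instance, the 403 handling can be delegated to Scrapy's built-in retry middleware by pointing it at that status code in settings.py; the values below are illustrative, not tuned.

------------------------------------

# A minimal sketch of hardening settings, appended to settings.py. These
# drive Scrapy's built-in RetryMiddleware; combined with DOWNLOAD_DELAY
# above, retried requests are also spaced out.
RETRY_ENABLED = True
RETRY_HTTP_CODES = [403, 500, 503]  # re-request pages that come back with these codes
RETRY_TIMES = 3                     # give up on a page after 3 failed attempts

------------------------------------

As for the JSON file, the usual route is to have the spider yield items instead of printing them, after which Scrapy's feed exports can write them out with something like "scrapy crawl shopids_spider -o shops.json -t json".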
