The birth of a website 03 -- fetching the 10,000 restaurants with the most reviews


There are many ways to sort restaurants on Dianping. For example, http://www.dianping.com/search/category/1/10/o10 ranks Shanghai's restaurants by total number of reviews; its 50 pages give the top 750 restaurants in Shanghai by cumulative review count. But 750 is a bit too few. Shanghai has 18 districts, and each district shows its own top 750 restaurants, as does each business area within a district; for example, http://www.dianping.com/search/category/1/10/r802o10 is the top 750 for the Yaohan area of Pudong New Area. Shanghai currently has about 100,000 restaurants, so in this way one can obtain data on well over 10,000 of the most-reviewed restaurants.

But there is no point in collecting that much data, because most restaurants are ordinary places serving ordinary meals: people eat and leave, and nobody bothers to review them. Only distinctive restaurants attract reviews, or restaurants that care enough about their business to hire paid posters (a "water army") to write reviews for them.

Data analysis shows that only about 300 restaurants in Shanghai receive more than 100 reviews per month, roughly 0.3% of all restaurants. A restaurant with more than 20 reviews per month already ranks in the top 3,000, which is remarkable. Diner participation is low, and restaurant participation is low too, which suggests the review business still has plenty of room to grow!

The goal this time is to crawl the 10,000 restaurants in Shanghai with the most reviews: capture the top 750 from each of Shanghai's 18 districts, 13,500 in total, then remove the few hundred restaurants that belong to two or more districts, leaving a little over 10,000. That is more than enough material to cover restaurants in every area.
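To make the dedup step concrete, here is a minimal sketch (not from the original article) of removing duplicates by shopid once all 18 districts have been crawled; all_shops is a hypothetical name for the combined list of (shopid, shopname) pairs:

------------------------------------

# all_shops: hypothetical list of (shopid, shopname) pairs collected
# from all 18 district crawls; duplicates appear when a restaurant
# is listed under two or more districts.
seen = set()
unique_shops = []
for shopid, shopname in all_shops:
    if shopid not in seen:
        seen.add(shopid)
        unique_shops.append((shopid, shopname))
print "kept %s unique restaurants" % len(unique_shops)

------------------------------------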

Take Pudong New Area as an example. The top 750 restaurants of Pudong New Area correspond to the URL http://www.dianping.com/search/category/1/10/r5o10p1. Note that the 1 after category is the city code for Shanghai, r5 is the code for Pudong New Area, and p1 is the first page, which holds 15 restaurants (so 50 pages make 750); the meaning of the other symbols can be ignored for now. Every district in Shanghai has at least several thousand restaurants, so there is no need to worry about a district having fewer than 750, and no need to handle that case. So: just vary the last number of the link from 1 to 50, crawl each HTML page, and then extract the restaurant information.
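As a small illustration (an assumption-laden sketch, not code from the article), the 50 page URLs for any district can be generated from the parts just described, given its region code:

------------------------------------

# Build the 50 page URLs for one district. The pattern is taken from
# the example above: city 1 = Shanghai, o10 = rank by total review
# count, p<n> = page number; region codes such as "r5" are assumed.
def page_urls(region):
    return ["http://www.dianping.com/search/category/1/10/%so10p%s"
            % (region, i) for i in range(1, 51)]

# e.g. Pudong New Area:
urls = page_urls("r5")

------------------------------------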

Before crawling, modify the configuration file /tmp/scrapy-test/crawdp/crawdp/settings.py and add four lines of code, so that it looks like this:

------------------------------------

BOT_NAME = 'crawdp'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['crawdp.spiders']
NEWSPIDER_MODULE = 'crawdp.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 AppleWebKit/537.36 Chrome/27.0.1453.93 Safari/537.36'
COOKIES_ENABLED = False

-------------------------------------

The last four lines are the newly added ones. This crawl fetches 50 pages with an interval of 5 seconds between requests, randomizes the download delay, and disables cookies; these measures make it less likely that the site's server blocks the crawler.

Add the file shopids_spider.py in the /tmp/scrapy-test/crawdp/crawdp/spiders/ directory, with the following content:

------------------------------------

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ShopidsSpider(BaseSpider):
    name = "shopids_spider"

    # Pages 1-50 of the Pudong New Area (r5) ranking, ordered by
    # total review count (o10): 50 pages x 15 shops = 750 shops.
    start_urls = []
    for i in range(1, 51):
        start_urls.append(
            "http://www.dianping.com/search/category/1/10/r5o10p%s" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each <ul class="detail"> block holds one restaurant entry.
        xs = hxs.select('//ul[@class="detail"]')
        for x in xs:
            print "---------"
            # The shopid is the last path segment of the shop link's href.
            shopid = x.select(
                'li[@class="shopname"]/a[@class="bl"]/@href'
            ).extract()[0].split('/')[-1]
            shopname = x.select(
                'li[@class="shopname"]/a[@class="bl"]/text()'
            ).extract()[0]
            print "shopid, shopname = %s, %s" % (shopid, shopname)

------------------------------------

Then execute "scrapy crawl shopids_spider" in the /tmp/scrapy-test/crawdp directory. You will see the names of the crawled restaurants together with their shopids on Dianping; the result looks something like this:

---------
shopid, shopname = 5391580, Thai Princess Pavilion (Shinmay Square Shop)
---------
shopid, shopname = 4043482, Sai Pui Oat Noodle Village (Jinqiao Branch)
---------
shopid, shopname = 2748850, Wang Xiang Garden (96 Square Shop)
---------
shopid, shopname = 500068, Typhoon Shelter (Yaohan)
---------
shopid, shopname = 5473698, String Incense Pot (Pudong Xin Mei Store)
---------
shopid, shopname = 501019, Lang Yi Fang Restaurant (Zhengda Shop)
---------
shopid, shopname = 559844, Yu Xiang Family (Lujiazui Shop)

So how do you find the IDs of Shanghai's 18 districts? On the left side of http://www.dianping.com/search/category/1/10/o10, clicking "by district" lists the links for the 18 districts of Shanghai, and those links contain each district's ID, so a single crawl is enough to obtain them all.
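Here is a minimal sketch of such a crawl (not from the original article; the link pattern is an assumption and must be checked against the real page), using the same old-style Scrapy API as above:

------------------------------------

import re
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class RegionIdsSpider(BaseSpider):
    name = "region_ids_spider"
    start_urls = ["http://www.dianping.com/search/category/1/10/o10"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Assumed: the district links on the left embed the region code,
        # e.g. /search/category/1/10/r5o10 for Pudong New Area.
        for href in hxs.select('//a/@href').extract():
            m = re.search(r'/search/category/1/10/(r\d+)o10', href)
            if m:
                print "region code =", m.group(1)

------------------------------------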

This approach is the simplest one. You can of course add more features to make the crawl smarter, for example checking the response status and pausing a few seconds before continuing after receiving a 403, or writing the results to a database or a JSON file. All of this can be done by consulting the Scrapy documentation.
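For example (a sketch under assumptions, not from the article): Scrapy's built-in retry middleware can be told to retry 403 responses via settings.py, and a small item pipeline can write results to a JSON-lines file. The file name shops.jl is a hypothetical choice, and the spider would have to yield items instead of printing them:

------------------------------------

# In settings.py: retry 403 responses a few times; the existing
# DOWNLOAD_DELAY still applies between attempts (built-in settings).
RETRY_ENABLED = True
RETRY_HTTP_CODES = [403]
RETRY_TIMES = 3

# In pipelines.py: append each scraped item to a JSON-lines file.
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # 'shops.jl' is a hypothetical output file name.
        self.file = open('shops.jl', 'w')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

# Register the pipeline in settings.py (old list-style setting):
# ITEM_PIPELINES = ['crawdp.pipelines.JsonWriterPipeline']

------------------------------------

Alternatively, Scrapy versions of this era could export items straight from the command line with "scrapy crawl shopids_spider -o shops.json -t json"; in both cases the spider must yield items rather than print them.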
