I have long sold large volumes of microblog data and travel-site review data, and I provide custom crawling services for specified data; message YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768.

Preface
To obtain multi-source data, I needed to collect attraction reviews and pictures from several websites. I started with two of them, Ctrip and Mafengwo, and recorded the crawling process here.

Ctrip

Analyze the Data
First, let's look at the Ctrip pages for the scenic spots of Gulangyu that we want to crawl. There are dozens of attractions, and each attraction page should have a similar structure, so we pick the first one to work out how a specific page should be crawled.
We need the part circled in red. It is easy to see that the comments are loaded dynamically, so we cannot extract the elements directly with BS4; we need to find the page's dynamic interface. Open Chrome DevTools, switch to the Network tab to watch the traffic, clear the existing entries to avoid interference, and then click the next-page button. This captures the request we are after.
Looking at the returned data, we can tell this is the interface we want. It is a POST request, and the form data carries quite a few fields whose meanings can be roughly guessed:
- poiID: the attraction's poi id
- pagenow: the current page number
- star: the rating filter, 1-5; 0 means all ratings
- resourceId: an id specific to each resource
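As a minimal sketch of calling this interface: the post does not print the endpoint URL, so COMMENT_API below is a placeholder for the URL captured in the Network tab, and the field casing is assumed to match what DevTools shows.

import requests

# Placeholder: substitute the comment endpoint URL captured in the Network tab.
COMMENT_API = 'REPLACE_WITH_ENDPOINT_FROM_DEVTOOLS'

def fetch_ctrip_page(poi_id, resource_id, page, star=0):
    # star filters by rating 1-5; 0 returns reviews of all ratings
    form = {
        'poiID': poi_id,
        'pagenow': page,
        'star': star,
        'resourceId': resource_id,
    }
    resp = requests.post(COMMENT_API, data=form,
                         headers={'User-Agent': 'Mozilla/5.0'})
    return resp.text  # parse the returned fragment for the review fields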
When crawling, change these form values according to what you need. Note that Ctrip's pagenow can only fetch up to 100 pages, and the poiID and resourceId values follow no pattern, so they have to be looked up attraction by attraction... I have collected the values for all of the Gulangyu attractions; there is a GitHub link at the end of this article.

Build the Database
The first thing to do is to design the database structure. I chose to use MySQL; the specific structure is as follows:
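The original post showed the schema as an image. As a stand-in, here is a minimal sketch of one plausible table; the table and column names are my assumptions, inferred from the fields discussed in the pitfalls below.

import pymysql

# Hypothetical reconstruction of the schema; all names below are assumptions.
conn = pymysql.connect(host='localhost', user='root',
                       password='***', database='ctrip', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS comments (
            id         INT AUTO_INCREMENT PRIMARY KEY,
            scenery    VARCHAR(64),       -- attraction name
            score      TINYINT,           -- overall star rating, 1-5
            view_score TINYINT NULL,      -- views sub-rating; not every review has one
            cost_score TINYINT NULL,      -- price-performance sub-rating; may be absent
            trip_time  VARCHAR(32) NULL,  -- trip-time field, when present
            content    TEXT               -- review text
        )
    """)
conn.commit()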
Get Data
I will not analyze this step in detail; it is not difficult, but there are a few pitfalls to watch out for:
1. Not every review has a views sub-rating or a price-performance sub-rating, so add a check for missing fields.
2. There is a trip-time item, which now seems to be gone.
3. The review text may contain single quotes, which will break the database INSERT; escape or replace them (see the sketch after this list).
4. Do not crawl too fast; Ctrip's anti-crawling measures are fairly strong.
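A minimal sketch of an insert routine covering pitfalls 1 and 3, reusing the hypothetical schema above. Instead of escaping quotes by hand, it uses parameterized queries and lets the driver do the escaping; that is a substitute for the manual escape/replace the post mentions.

def save_comment(conn, row):
    # row is a dict parsed from one review; sub-ratings may be missing (pitfall 1)
    sql = ('INSERT INTO comments '
           '(scenery, score, view_score, cost_score, trip_time, content) '
           'VALUES (%s, %s, %s, %s, %s, %s)')
    with conn.cursor() as cur:
        # The driver escapes the values, so single quotes in the review
        # text cannot break the statement (pitfall 3).
        cur.execute(sql, (row['scenery'], row['score'],
                          row.get('view_score'), row.get('cost_score'),
                          row.get('trip_time'), row['content']))
    conn.commit()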
Mafengwo

Analyze the Data

Similarly, Mafengwo's data is loaded dynamically; use the same method to find and analyze the data interface.
We can see that the data is fetched with a GET request, so we can look for the pattern in the requested URL. Comparing the requests for different attractions and different pages shows that only two parameters change: one is the poi id, which I will write as {href}, and the other is the page number, which I will write as {num}. To get the comment data for any attraction, we only need to change these two values.
url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?callback=jQuery18105332634542482972_1511924148475&params=%7b%22poi_id%22%3a%22{href}%22%2c%22page%22%3a{num}%2c%22just_comment%22%3a1%7d'
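A minimal sketch of requesting one page of comments with this template: because of the callback parameter the response is JSONP, so the wrapper has to be stripped before the JSON can be parsed. The exact payload layout is an assumption.

import json
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
URL_TMPL = ('http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi'
            '?callback=jQuery18105332634542482972_1511924148475'
            '&params=%7b%22poi_id%22%3a%22{href}%22%2c%22page%22%3a{num}%2c%22just_comment%22%3a1%7d')

def fetch_comment_page(href, num):
    # href is the numeric poi id, num the page number
    resp = requests.get(URL_TMPL.format(href=href, num=num), headers=headers)
    # Strip the JSONP wrapper: callback({...}) -> {...}
    payload = re.search(r'\((.*)\)', resp.text, re.S).group(1)
    return json.loads(payload)  # the rendered comment HTML sits inside this JSON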
Get the poi of each attraction

Unlike Ctrip's interface, this is not a POST request, and we do not have to open every attraction's page to collect its parameters: one listing page contains all of the attractions, although its data is also loaded dynamically.
From the screenshot above we can see that we only need to pass a page number to get the poi ids of all the attractions; with those poi ids we can then fetch all of the comment data. This part is handled by a single function:
import re

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed request headers

def get_param():
    # Get the parameters (name and poi id) for all attractions
    total = []
    router_url = 'http://www.mafengwo.cn/ajax/router.php'
    for num in range(1, 6):
        params = {
            'sAct': 'KMdd_StructWebAjax|GetPoisByTag',
            'iMddid': 12522,  # destination id from the original post
            'iTagId': 0,
            'iPage': num
        }
        pos = requests.post(url=router_url, data=params, headers=headers).json()
        # The JSON payload carries an HTML fragment listing the attractions
        soup_pos = BeautifulSoup(pos['data']['list'], 'lxml')
        result = [{'scenery': p['title'],
                   'href': re.findall(re.compile(r'/poi/(\d+)\.html'), p['href'])[0]}
                  for p in soup_pos.find_all('a')]
        total.extend(result)
    return total
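Calling get_param() returns a list of dicts, each holding an attraction name under 'scenery' and the numeric id extracted from its /poi/<id>.html link under 'href'; those ids are exactly the {href} values the comment URL above expects.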
The rest is similar to the Ctrip part, so I will not explain it further.
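To tie the pieces together, a short sketch reusing the hypothetical fetch_comment_page from above; the number of pages per attraction is also an assumption.

for poi in get_param():
    for num in range(1, 11):  # pages per attraction: an assumed cap
        data = fetch_comment_page(poi['href'], num)
        # parse the comment HTML inside `data` with BeautifulSoup,
        # then store the rows just as in the Ctrip part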
Personal Blog

8aoy1.cn