Crawling Review Data from Ctrip and Mafengwo

I sell large volumes of Weibo data and travel-site review data on a long-term basis, and I also offer crawling services for any specified data; message me at YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768.

Preface

To obtain multi-source data, I needed to collect comments and pictures of attractions from several websites. I first chose Ctrip and Mafengwo, and recorded part of the crawling process below.

Ctrip: Analyze Data

First, open the Ctrip pages for the Gulangyu scenic area and look at what we want to crawl. There are dozens of attractions, and the page structure for each should be similar, so we take the first attraction and work out how to crawl its page.

What we need is the part circled in red. It is easy to tell that the comment section is loaded dynamically, so we cannot simply extract the elements with BS4; we have to analyze the page's dynamic interface. Open Chrome's developer tools and switch to the Network tab to watch the traffic; clear the existing entries first to avoid interference, then click to the next page of comments, and we can capture the request.

From the returned data, we can tell that this is the interface we want. It is a POST request, and the transmitted form data has many fields whose meanings can be roughly guessed:

- poiID: the poi id of the attraction
- pagenow: the current page number
- star: the rating filter (1-5, where 0 means all)
- resourceId: a value specific to each resource

When crawling, change these values according to the content you need. Note that Ctrip's pagenow only goes up to 100 pages, and the poiID and resourceId values follow no pattern, so you have to look them up for each attraction. I have collected the values for all the Gulangyu attractions one by one; they are shared on GitHub, linked at the end of this article.
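As a minimal sketch of how that request could be assembled (the endpoint is a placeholder to copy from the Network tab, the id values are made-up examples, and the exact field casing should be checked against the captured request):

    import requests

    # Placeholder: copy the real comment endpoint from Chrome's Network tab
    COMMENT_API = '<endpoint from the Network tab>'

    form_data = {
        'poiID': 97470,       # made-up example; look the real id up per attraction
        'pagenow': 1,         # current page; Ctrip serves at most 100 pages
        'star': 0,            # rating filter: 1-5, 0 means all
        'resourceId': 22176,  # made-up example; per-resource value, no pattern
    }
    resp = requests.post(COMMENT_API, data=form_data,
                         headers={'User-Agent': 'Mozilla/5.0'})
    print(resp.status_code)   # parse resp.text / resp.json() as needed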

Build a Library

The first thing to do is to design the database structure. I chose MySQL; the specific structure is as follows:
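The original schema screenshot is not reproduced here; as a rough sketch, a table covering the fields discussed in the next section (the table and column names are my assumptions, adjust to your own design) could be created like this:

    import pymysql

    # Assumed layout, not the article's exact schema
    CREATE_SQL = '''
    CREATE TABLE IF NOT EXISTS ctrip_comments (
        id               INT AUTO_INCREMENT PRIMARY KEY,
        scenery          VARCHAR(64),   -- attraction name
        score            FLOAT,         -- overall rating
        views            FLOAT,         -- "views" sub-rating, may be absent
        cost_performance FLOAT,         -- price-performance sub-rating, may be absent
        content          TEXT,          -- comment text
        comment_date     DATE
    )
    '''

    conn = pymysql.connect(host='localhost', user='root', password='your_password',
                           database='tourism', charset='utf8mb4')
    with conn.cursor() as cur:
        cur.execute(CREATE_SQL)
    conn.commit()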

Get Data

I will not analyze this step in detail, as it is not difficult, but there are a few pitfalls to watch out for.

First, not every review has the "views" and "price-performance" sub-ratings, so add a check for them.
Second, there used to be a "trip time" item, but it now seems to be gone.
Third, the comment text may contain single quotes, which will break the database INSERT unless you escape or replace them (see the sketch below).
Fourth, do not crawl too fast; Ctrip's anti-crawling measures are fairly strong.
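For the single-quote problem, instead of hand-escaping you can let the MySQL driver bind the parameters. A sketch following the table above (save_comment and the row keys are my assumptions; conn is the pymysql connection from the earlier sketch):

    def save_comment(conn, row):
        # row: a dict holding one parsed review, keys matching the table above
        sql = ('INSERT INTO ctrip_comments '
               '(scenery, score, views, cost_performance, content, comment_date) '
               'VALUES (%s, %s, %s, %s, %s, %s)')
        with conn.cursor() as cur:
            # parameter binding lets the driver escape single quotes in `content`
            cur.execute(sql, (row['scenery'], row.get('score'), row.get('views'),
                              row.get('cost_performance'), row['content'],
                              row.get('date')))
        conn.commit()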

Mafengwo: Analyze Data

Similarly, Mafengwo's data is loaded dynamically; use the same method to find and inspect the data interface.

We can see that the data is fetched with a GET request, so we look for a pattern in the requested URL. After comparing the requests for different attractions and different pages, we find that only two parameters change: one is the poi_id, which I refer to as href below, and the other is the page number, which I call num. To get the comment data for an attraction, just change these two values.

    url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?callback=jQuery18105332634542482972_1511924148475&params=%7b%22poi_id%22%3a%22{href}%22%2c%22page%22%3a{num}%2c%22just_comment%22%3a1%7d'
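The response comes back wrapped in the jQuery callback (JSONP), so strip the wrapper before parsing. A small helper built around the URL template above; get_comments, the regex, and the headers are my own glue, not from the article:

    import json
    import re
    import requests

    def get_comments(href, num):
        # href: the attraction's poi id; num: the page number
        api = url.format(href=href, num=num)  # `url` is the template above
        text = requests.get(api, headers={'User-Agent': 'Mozilla/5.0'}).text
        # JSONP looks like jQuery...({...}); keep only the JSON inside the parens
        payload = re.search(r'\((.*)\)', text, re.S).group(1)
        return json.loads(payload)  # the comment HTML should sit under ['data']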

Get the poi of Each Attraction

This one is a POST request. We do not need to open each attraction page to collect its parameters; we can find all of the attractions from the destination's list page, though that page's data is also loaded dynamically.

From the screenshot above, it is clear that we only need to pass a page number to get the poiid of every attraction; with those poiids we can then fetch all the comment data. One function handles this part:

    import re
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}

    def get_param():
        # Collect the name and poi id of every attraction
        total = []
        router_url = 'http://www.mafengwo.cn/ajax/router.php'
        for num in range(1, 6):
            params = {
                'sAct': 'KMdd_StructWebAjax|GetPoisByTag',
                'iMddid': 12522,
                'iTagId': 0,
                'iPage': num
            }
            pos = requests.post(url=router_url, data=params, headers=headers).json()
            soup_pos = BeautifulSoup(pos['data']['list'], 'lxml')

            result = [{'scenery': p['title'],
                       'href': re.findall(re.compile(r'/poi/(\d+)\.html'), p['href'])[0]}
                      for p in soup_pos.find_all('a')]
            total.extend(result)

        return total
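Putting the pieces together, a rough usage sketch (get_comments is the helper sketched in the URL section above; the page range and the sleep are arbitrary choices of mine):

    import time

    for poi in get_param():
        for num in range(1, 11):   # crawl as many pages as you need
            data = get_comments(poi['href'], num)
            # ... parse and store the reviews for poi['scenery'] here ...
            time.sleep(1)          # be gentle; both sites have anti-crawling measures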

    

The rest is similar, so I will not explain it further.

Personal Blog

8aoy1.cn
