I have long sold large volumes of microblog data and travel-site review data, and I provide custom crawling services for specified data; message YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768.

Preface
To obtain multi-source data, I needed to collect attraction reviews and pictures from several websites. I started with two of them, Ctrip and Mafengwo, and recorded the crawling process here.

Ctrip

Analyze the Data
First, let's look at the Ctrip pages for the scenic spots of Gulangyu that we want to crawl. There are dozens of attractions, and each attraction page should have a similar structure, so we pick the first one to work out how a specific page should be crawled.
We need the part circled in red. It is easy to see that the comments are loaded dynamically, so we cannot extract the elements directly with BS4; we need to find the page's dynamic interface. Open Chrome DevTools, switch to the Network tab to watch the traffic, clear the existing entries to avoid interference, and then click the next-page button. This captures the request we are after.
Looking at the returned data, we can tell this is the interface we want. It is a POST request, and the form data carries quite a few fields whose meanings can be roughly guessed:
- poiID: the attraction's poi id
- pagenow: the current page number
- star: the rating filter, 1-5; 0 means all ratings
- resourceId: an id specific to each resource
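As a minimal sketch of calling this interface: the post does not print the endpoint URL, so COMMENT_API below is a placeholder for the URL captured in the Network tab, and the field casing is assumed to match what DevTools shows.

import requests

# Placeholder: substitute the comment endpoint URL captured in the Network tab.
COMMENT_API = 'REPLACE_WITH_ENDPOINT_FROM_DEVTOOLS'

def fetch_ctrip_page(poi_id, resource_id, page, star=0):
    # star filters by rating 1-5; 0 returns reviews of all ratings
    form = {
        'poiID': poi_id,
        'pagenow': page,
        'star': star,
        'resourceId': resource_id,
    }
    resp = requests.post(COMMENT_API, data=form,
                         headers={'User-Agent': 'Mozilla/5.0'})
    return resp.text  # parse the returned fragment for the review fields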
When crawling, change these form values according to what you need. Note that Ctrip's pagenow can only fetch up to 100 pages, and the poiID and resourceId values follow no pattern, so they have to be looked up attraction by attraction... I have collected the values for all of the Gulangyu attractions; there is a GitHub link at the end of this article.

Build the Database
The first thing to do is to design the database structure. I chose to use MySQL; the specific structure is as follows:
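The original post showed the schema as an image. As a stand-in, here is a minimal sketch of one plausible table; the table and column names are my assumptions, inferred from the fields discussed in the pitfalls below.

import pymysql

# Hypothetical reconstruction of the schema; all names below are assumptions.
conn = pymysql.connect(host='localhost', user='root',
                       password='***', database='ctrip', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS comments (
            id         INT AUTO_INCREMENT PRIMARY KEY,
            scenery    VARCHAR(64),       -- attraction name
            score      TINYINT,           -- overall star rating, 1-5
            view_score TINYINT NULL,      -- views sub-rating; not every review has one
            cost_score TINYINT NULL,      -- price-performance sub-rating; may be absent
            trip_time  VARCHAR(32) NULL,  -- trip-time field, when present
            content    TEXT               -- review text
        )
    """)
conn.commit()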
Get Data
I will not analyze this step in detail; it is not difficult, but there are a few pitfalls to watch out for:
1. Not every review has a views sub-rating or a price-performance sub-rating, so add a check for missing fields.
2. There is a trip-time item, which now seems to be gone.
3. The review text may contain single quotes, which will break the database INSERT; escape or replace them (see the sketch after this list).
4. Do not crawl too fast; Ctrip's anti-crawling measures are fairly strong.
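A minimal sketch of an insert routine covering pitfalls 1 and 3, reusing the hypothetical schema above. Instead of escaping quotes by hand, it uses parameterized queries and lets the driver do the escaping; that is a substitute for the manual escape/replace the post mentions.

def save_comment(conn, row):
    # row is a dict parsed from one review; sub-ratings may be missing (pitfall 1)
    sql = ('INSERT INTO comments '
           '(scenery, score, view_score, cost_score, trip_time, content) '
           'VALUES (%s, %s, %s, %s, %s, %s)')
    with conn.cursor() as cur:
        # The driver escapes the values, so single quotes in the review
        # text cannot break the statement (pitfall 3).
        cur.execute(sql, (row['scenery'], row['score'],
                          row.get('view_score'), row.get('cost_score'),
                          row.get('trip_time'), row['content']))
    conn.commit()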
Mafengwo

Analyze the Data

Similarly, Mafengwo's data is loaded dynamically; use the same method to find and analyze the data interface.
We can see that the data is fetched with a GET request, so we can look for the pattern in the requested URL. Comparing the requests for different attractions and different pages shows that only two parameters change: one is the poi id, which I will write as {href}, and the other is the page number, which I will write as {num}. To get the comment data for any attraction, we only need to change these two values.
url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?callback=jQuery18105332634542482972_1511924148475&params=%7b%22poi_id%22%3a%22{href}%22%2c%22page%22%3a{num}%2c%22just_comment%22%3a1%7d'
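A minimal sketch of requesting one page of comments with this template: because of the callback parameter the response is JSONP, so the wrapper has to be stripped before the JSON can be parsed. The exact payload layout is an assumption.

import json
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
URL_TMPL = ('http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi'
            '?callback=jQuery18105332634542482972_1511924148475'
            '&params=%7b%22poi_id%22%3a%22{href}%22%2c%22page%22%3a{num}%2c%22just_comment%22%3a1%7d')

def fetch_comment_page(href, num):
    # href is the numeric poi id, num the page number
    resp = requests.get(URL_TMPL.format(href=href, num=num), headers=headers)
    # Strip the JSONP wrapper: callback({...}) -> {...}
    payload = re.search(r'\((.*)\)', resp.text, re.S).group(1)
    return json.loads(payload)  # the rendered comment HTML sits inside this JSON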
Get the poi of each attraction

Unlike Ctrip's interface, this is not a POST request, and we do not have to open every attraction's page to collect its parameters: one listing page contains all of the attractions, although its data is also loaded dynamically.
From the screenshot above we can see that we only need to pass a page number to get the poi ids of all the attractions; with those poi ids we can then fetch all of the comment data. This part is handled by a single function:
import re

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed request headers

def get_param():
    # Get the parameters (name and poi id) for all attractions
    total = []
    router_url = 'http://www.mafengwo.cn/ajax/router.php'
    for num in range(1, 6):
        params = {
            'sAct': 'KMdd_StructWebAjax|GetPoisByTag',
            'iMddid': 12522,  # destination id from the original post
            'iTagId': 0,
            'iPage': num
        }
        pos = requests.post(url=router_url, data=params, headers=headers).json()
        # The JSON payload carries an HTML fragment listing the attractions
        soup_pos = BeautifulSoup(pos['data']['list'], 'lxml')
        result = [{'scenery': p['title'],
                   'href': re.findall(re.compile(r'/poi/(\d+)\.html'), p['href'])[0]}
                  for p in soup_pos.find_all('a')]
        total.extend(result)
    return total
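Calling get_param() returns a list of dicts, each holding an attraction name under 'scenery' and the numeric id extracted from its /poi/<id>.html link under 'href'; those ids are exactly the {href} values the comment URL above expects.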
The rest is similar to the Ctrip part, so I will not explain it further.
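To tie the pieces together, a short sketch reusing the hypothetical fetch_comment_page from above; the number of pages per attraction is also an assumption.

for poi in get_param():
    for num in range(1, 11):  # pages per attraction: an assumed cap
        data = fetch_comment_page(poi['href'], num)
        # parse the comment HTML inside `data` with BeautifulSoup,
        # then store the rows just as in the Ctrip part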
Personal Blog

8aoy1.cn