Crawling Dianping ("public comment network") hotel information and hotel review pages with Python 3


Author: Mr. Ceong
Link: http://blog.csdn.net/leigaiceong/article/details/53188454

Summary

Starting from the URL of a hotel's home page on Dianping (the "public comment network"), this article automatically crawls the hotel name, pictures, latitude and longitude, price, star rating, and number of user reviews, plus each reviewer's user ID, user name, scores, and review time, and writes everything to .txt files. It builds on and improves the approach described in http://blog.csdn.net/drdairen/article/details/51146961, and I am very grateful to that author for sharing his work.

Body

I. Basic information

Programming language: Python 3.5.2
Platform: Eclipse for PyDev

Implemented functionality:
1. Crawl a hotel's basic information (name, address, pictures, latitude and longitude, price, star rating, number of user reviews)
2. Crawl a hotel's reviews (reviewer user ID, user name, room score, service score, overall score, review time, and review text)

Implementation code: http://download.csdn.net/detail/qq_22107075/9668373

Notes:
1. If Dianping is visited too frequently while crawling reviews, its anti-crawler system will temporarily block your machine's IP address (for an unspecified period).
2. To avoid this, the crawler uses Python's time module to throttle how often review pages are requested.
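As a minimal illustration of the throttling idea (the function name and delay bounds below are my own for illustration, not from the original code), each request can be followed by a randomized pause so the crawler never hits the site at a fixed, machine-like rate:

```python
import random
import time

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random interval so requests are not sent at a fixed rate.

    The 2-5 second default is an assumption; tune it to how aggressively
    the target site blocks crawlers.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# usage: call polite_pause() after each opener.open(...) request
```

The random jitter matters: a constant sleep still produces a perfectly regular request pattern, which anti-crawler systems can detect.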

This article crawls the list of popular hotels near Guangzhou's Tianhe District that have more than 100 reviews (ranked by number of reviews, as shown below).

Implementation approach

The program consists of five modules and two output folders:

dianpingspider.py   — the main program
picture.py          — crawls the hotel pictures
position.py         — converts the POI parameter into longitude and latitude
priceandscores.py   — crawls the hotel prices and star ratings
urlspider.py        — crawls the URLs of popular hotels near Guangzhou's Tianhe District
Hotel/              — folder storing the crawled hotel information and user reviews
Image/              — folder storing the crawled hotel pictures

Module explanations
(1) dianpingspider.py
This module is the main program; running it alone drives the whole crawler. Its job is to call the other four modules and implement the remaining features (see the code for details).

The crawler is built from commonly used Python 3 modules such as urllib.request, re, and time. There is plenty of learning material about them online, so they are not covered again here.

    import urllib.request
    import re
    import time
    import os
    import random
    import picture          # fetches the hotel pictures
    import urlspider        # fetches the hotels' home-page URLs
    import position         # fetches the hotels' longitude and latitude
    import priceandscores   # fetches the hotels' prices and star ratings

    # Collect the review information for every hotel and store it as text
    def getRatingAll(filein):
        count = 0            # counter, used to show progress
        websitenumber = 0
        for line in open(filein, 'r'):    # one hotel URL per line
            count = line.split('\t')[0]
            line = line.split('\t')[1]
            websitenumber += 1
            print("Crawling hotel information, page %s" % websitenumber)
            try:
                print("Fetching information for hotel no. %s, URL %s" % (count, line.strip('\n')))
                # hotel id
                hotelid = line.strip('\n').split('/')[4]
                # build the URL of the first page of this hotel's review pages
                url = line.strip('\n') + "/review_more"
                # pretend to be a browser when opening the URL
                headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; '
                           'Trident/7.0; rv:11.0) like Gecko')
                opener = urllib.request.build_opener()
                opener.addheaders = [headers]
                data = opener.open(url).read()
                # When Dianping is visited too often, its anti-crawler system blocks
                # this machine's IP; `data` is then no longer normal HTML.
                data = data.decode('utf-8', 'ignore')
                # number of user reviews for this hotel
                rate_number = re.compile(r'all reviews</a><em class="col-exp">\((.*?)\)</em></span>',
                                         re.DOTALL).findall(data)
                rate_number = int(''.join(rate_number))   # join the list into a str, then convert
                print("Hotel %d has %s reviews" % (websitenumber, rate_number))
                if rate_number < 100:    # skip hotels with fewer than 100 reviews
                    continue
                # hotel name and address
                opener1 = urllib.request.build_opener()
                opener1.addheaders = [headers]
                hotel_url = opener1.open(line).read()
                hotel_data = hotel_url.decode('utf-8', 'ignore')  # 'ignore' skips undecodable bytes
                # get the hotel name
                # ... (the name regex and the rest of this module, including the
                # except clause, are truncated in the original post; see the
                # code download above)

(2) urlspider.py

This module crawls the home-page URLs of popular hotels near Guangzhou's Tianhe District (hotels ordered by number of reviews). We start from http://www.dianping.com/guangzhou/hotel/r22c354pNo10, where N in "pN" is the page number (1-50, not a letter); by changing N we iterate over all the pages and collect every hotel on each one. Visitors can only view at most 50 pages of hotels, but that is enough, since hotels with more than 100 reviews are actually not that numerous. The crawled hotel home-page URLs are saved to "./HotelUrl.txt".

The code is as follows:

    import urllib.request
    import re
    import os

    def getHotelUrl(count):
        filename = 'HotelUrl.txt'
        if os.path.exists(filename):
            os.remove(filename)
        print("Getting the URLs of popular hotels near Guangzhou Tianhe District, saving to HotelUrl.txt")
        for k1 in range(1, 51):
            tempUrlPag = "http://www.dianping.com/guangzhou/hotel/r22c354p" + str(k1) + "o10"
            # collect every hotel URL on the current page and append it to the file
            headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; '
                       'Trident/7.0; rv:11.0) like Gecko')
            opener = urllib.request.build_opener()
            opener.addheaders = [headers]
            op = opener.open(tempUrlPag)
            data = op.read().decode(encoding='utf-8')
            linkre = re.compile(r'data-shop-url="(.*?)"\r\n\s*data-hippo', re.DOTALL).findall(data)
            for k2 in range(0, len(linkre)):
                tempUrlHotel = "http://www.dianping.com/shop/" + linkre[k2]
                fileop = open(filename, 'a', encoding="utf-8")
                fileop.write(str(count) + "\t" + tempUrlHotel + '\n')
                fileop.close()
                count += 1
                # print('Hotel %d on page %d, address: %s' % (k2 + 1, k1, tempUrlHotel))
            print('Page %d crawled.' % k1)

    # ----------------------------------
    # ---- test the hotel-URL crawler ----
    # ----------------------------------
    if __name__ == "__main__":
        count = 1
        getHotelUrl(count)

Run Result:

(3) position.py
This module crawls each hotel's longitude and latitude from Dianping. The map locations on Dianping are very accurate, but the coordinates cannot be found directly in the HTML source. Instead, this article converts the POI parameter embedded in the page's JavaScript into coordinates. See http://www.site-digger.com/html/articles/20111110/18.html for the details of the encoding if you are interested in digging deeper.

Part of the code:

    # coding: utf-8
    # Decode Dianping's map-coordinate parameter (POI)

    def to_base36(value):
        """Convert a base-10 integer to a base-36 string."""
        if not isinstance(value, int):
            raise TypeError("expected int, got %s: %r" % (value.__class__.__name__, value))
        if value == 0:
            return "0"
        if value < 0:
            sign = "-"
            value = -value
        else:
            sign = ""
        result = []
        while value:
            (value, mod) = divmod(value, 36)
            result.append("0123456789abcdefghijklmnopqrstuvwxyz"[mod])
        return sign + "".join(reversed(result))

    def getPosition(c):
        """Decode a Dianping POI parameter into (longitude, latitude)."""
        digi = 16
        add = 10
        plus = 7
        cha = 36
        I = -1
        H = 0
        B = ''
        J = len(c)
        G = ord(c[-1])       # the last character is a checksum-like offset
        c = c[:-1]
        J -= 1
        for E in range(J):
            D = int(c[E], cha) - add
            if D >= add:
                D = D - plus
            B += to_base36(D)
            if D > H:        # remember the position of the largest digit:
                I = E        # it separates the two coordinate halves
                H = D
        A = int(B[:I], digi)
        F = int(B[I + 1:], digi)
        L = (A + F - G) / 2
        latitude = float(F - L) / 100000
        longitude = float(L) / 100000
        return longitude, latitude

    # -----------------------------
    # --------- test program ------
    # -----------------------------
    if __name__ == '__main__':
        (longitude, latitude) = getPosition('IJGDHFZVIBRDHR')
        print("Longitude: %s°E, Latitude: %s°N" % (longitude, latitude))

(4) picture.py
This module crawls the hotel pictures. Each hotel's home page shows a few photos; we download the first five (some hotels have fewer) into the "./Image" folder. The idea is simply to extract the image links from the HTML and save the files to the folder.

    import urllib.request
    import re

    def getPicture(url, fileout):
        webpage = urllib.request.urlopen(url)
        data = webpage.read()
        data = data.decode('utf-8')
        # (the image-link regex and the saving code are truncated in the
        # original post; see the code download above)
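Since the regex is cut off above, here is a hedged sketch of how the image links could be extracted and saved. The `<img ... src>` pattern, the function names, and the file-naming scheme are my own assumptions for illustration, not Dianping's actual markup, which differs and changes over time:

```python
import re
import urllib.request

def extractImageUrls(html, limit=5):
    """Return up to `limit` .jpg links found in a page's HTML.

    The attribute layout in the pattern is an assumption; adapt it to
    the live page's markup before use.
    """
    urls = re.findall(r'<img[^>]+src="(http[^"]+\.jpg)"', html)
    return urls[:limit]

def savePictures(urls, folder="./Image", prefix="hotel"):
    """Download each link into the Image folder (hypothetical naming)."""
    for i, u in enumerate(urls):
        urllib.request.urlretrieve(u, "%s/%s_%d.jpg" % (folder, prefix, i))
```

Splitting extraction from downloading keeps the regex testable offline, without hitting the network.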

Run Result:

(5) priceandscores.py

This module gets each hotel's price and star rating. Note that when fetching the price I change the hotel URL from "www.dianping.com/shop/*" to "m.dianping.com/shop/*": on the www (desktop) pages the price is rendered dynamically, which is more troublesome to crawl (try it yourself if you are interested), while on the m (mobile) pages the price is static HTML and much simpler to extract.

The procedure is as follows:

    import urllib.request
    import re

    # Get a hotel's price and star rating
    def getPriceAndScores(url):
        # when crawling the price, switch from the PC site (www.) to the mobile site (m.)
        url = url.replace('www.', 'm.')
        opener = urllib.request.build_opener()
        headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; '
                   'Trident/7.0; rv:11.0) like Gecko')
        opener.addheaders = [headers]
        data = opener.open(url).read()
        data = data.decode('utf-8', 'ignore')   # 'ignore' skips undecodable bytes
        # overall star rating, e.g. class "star star-45" means 4.5 stars
        scores = re.compile(r'<span class="star star-(.*?)"></span>', re.DOTALL).findall(data)
        scores = int(scores[0]) / 10
        # price
        price = re.compile(r'<span class="price">(.*?)</span>', re.DOTALL).findall(data)
        price = int(''.join(price))
        return price, scores

    if __name__ == "__main__":
        url = "http://www.dianping.com/shop/2256811"
        (price, scores) = getPriceAndScores(url)
        print(price)
        print(scores)
Results

1. Basic information of each hotel and of the users who reviewed it

2. User reviews for each hotel

3. Pictures of each hotel

4. Contents of the Hotel folder

5. Contents of the Image folder

This article is meant only for readers who want to quickly implement a web crawler themselves; it is not aimed at experts, nor at research, because it focuses on "implementation" rather than on "explaining principles". Thank you for reading this slightly messy note.
