After an afternoon and a night of work, I finally got the crawler code working. There is still plenty I want to improve (for example, storing the data in Redis, using multithreading to speed things up, crawling pictures, breaking the data down further), and I will make those changes when I have time. The specific steps and ideas are as follows:
Tools: PyCharm, Google Chrome developer tools, Fiddler2
Platform: Python 3.4
First, when you open Dianping, the default region of the home page is Shanghai, so I simply went straight to the hotel listing page and started crawling data from there.
The task I accomplished was simple and divided into two steps:
1. Crawl the home page addresses of all Shanghai hotels on Dianping and save them.
2. For each hotel, crawl the information of every reviewer (ID, name) and the review details (room, environment, service, review time, etc.), and save them.
Second, writing the code
1. Each page of the hotel listing contains many hotels, and each page URL has the following form:
http://www.dianping.com/shanghai/hotel/pn
where n is the page number (1-50), so we can traverse all the pages and find all the hotels as follows:
# opener, pagelist, queue and fileop are defined earlier in the script; re must be imported
for k1 in range(0, 50):
    tempurlpag = "http://www.dianping.com/shanghai/hotel/p" + str(pagelist[k1])
    # queuepag.append(tempurlpag)
    # For the current page, get each hotel's URL and add it to the queue
    op = opener.open(tempurlpag)
    if 'html' not in op.getheader('Content-Type'):
        continue
    data1 = op.read().decode(encoding='UTF-8')
    # Target string: onclick="window.open('/shop/2503441')"
    linkre = re.compile(r'onclick="window.open\(\'(.*?)\'\)"', re.DOTALL).findall(data1)
    for k2 in range(0, len(linkre)):
        tempurlhotel = "http://www.dianping.com" + linkre[k2]
        queue.append(tempurlhotel)
        fileop.write(tempurlhotel + '\n')
        # print('current page: %d, hotel %d, its address is: %s' % (k1 + 1, k2 + 1, tempurlhotel))
    print('page: %d crawl complete.' % (k1 + 1))
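To check the link pattern without hitting the site, the extraction step can be tried on a saved snippet first. This is only a minimal sketch: the sample HTML and the helper name extract_hotel_urls are made up for illustration, and the exact quoting of the onclick attribute on the real page is an assumption, so the pattern accepts either quote style.

```python
import re

# Assumed form of the attribute: onclick="window.open('/shop/2503441')"
LINK_RE = re.compile(r'''onclick=["']window\.open\(["'](.*?)["']\)''', re.DOTALL)

def extract_hotel_urls(html, base="http://www.dianping.com"):
    """Return absolute hotel URLs found in one listing page (hypothetical helper)."""
    return [base + path for path in LINK_RE.findall(html)]

# Made-up sample, not real page content
sample = '''
<li onclick="window.open('/shop/2503441')">Hotel A</li>
<li onclick="window.open('/shop/1234567')">Hotel B</li>
'''
print(extract_hotel_urls(sample))
```

Testing the pattern offline like this makes it easier to tell a broken regular expression apart from a blocked request.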
The running process is as follows:
The output file is as follows:
As you can see, the information for all 750 hotels that Dianping shows to a visitor who is not logged in has been crawled. (A visitor only sees 50 pages of data; I do not know whether the same is true for a logged-in user. For logging in from a crawler, please refer to my previous blog post.)
2. I put all the hotel home page URLs into a queue, using the commonly used deque from the collections module. To keep the site from banning the crawler, the crawler is disguised as a browser (see my previous blog post).
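The queue and the browser disguise might look roughly like the sketch below. The helper name make_browser_opener, the User-Agent string, and the second shop ID are assumptions made for illustration, not values taken from the original script, and the actual network call is left commented out.

```python
from collections import deque
import urllib.request

def make_browser_opener(user_agent="Mozilla/5.0 (Windows NT 6.1; WOW64)"):
    """Build an opener that sends a browser-like User-Agent (value is an assumption)."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener

queue = deque()
queue.append("http://www.dianping.com/shop/2503441")
queue.append("http://www.dianping.com/shop/1234567")  # made-up ID

opener = make_browser_opener()
while queue:
    url = queue.popleft()  # first in, first out
    print("would fetch:", url)
    # data = opener.open(url).read().decode("utf-8")  # network call, left commented out
```

A deque gives cheap appends and pops at both ends, which suits a crawl queue better than removing items from the front of a list.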
For the data read from each hotel page, here is an example of obtaining the reviewer's user name:
# <a target="_blank" rel="nofollow" href="/member/8450012" user-id="8450012" class="J_card">
# The user's name
username = re.compile(r'
The commented line above is the string that contains the user name, found with the Chrome developer tools, with Fiddler, or by looking through the saved HTML data. Using the regular expression:
re.compile(r'
we can get username.
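The same idea applies to the other attributes of that tag. As a hedged illustration, the user-id attribute in the anchor quoted above could be pulled out as follows; the pattern is written for this sketch and is not the author's original expression.

```python
import re

# Pattern written for this sketch; it is not the author's original expression
USERID_RE = re.compile(r'<a[^>]*href="/member/\d+"[^>]*user-id="(\d+)"')

# The anchor tag quoted in the text
html = '<a target="_blank" rel="nofollow" href="/member/8450012" user-id="8450012" class="J_card">'

userid = USERID_RE.findall(html)
print(userid)
```

Because findall returns one entry per match, running it over a whole review page gives a list with one ID per reviewer.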
In the same way, get all the information you want; after processing, you can write it to the file as formatted output:
mink = min(len(userid), len(rate_total), len(rate_room))
# print('mink: ', mink)
# Open the output file and append the information
fileop = open(fileout, 'a', encoding="utf-8")
for k in range(0, mink):
    fileop.write('%12s\t%12s\t%20s\t%15s\t%15s\t%15s\t%15s\t\n' % (hotelid, userid[k], username[k], rate_room[k], rate_service[k], rate_total[k], rate_time[k]))
fileop.close()
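Note that mink above only looks at three of the lists, so a shorter username or rate_time list would still raise an IndexError. One possible alternative, sketched here with made-up values and a hypothetical helper format_rows, is to let zip stop at the shortest of all the columns.

```python
def format_rows(hotelid, *columns):
    """Format one tab-separated line per review, stopping at the shortest column."""
    rows = []
    for values in zip(*columns):  # zip stops at the shortest list
        fields = [hotelid] + list(values)
        rows.append("\t".join("%15s" % f for f in fields) + "\n")
    return rows

# Made-up values for illustration
userid = ["8450012", "1111111", "2222222"]
username = ["alice", "bob"]  # one entry short on purpose
rate_total = ["5", "4", "3"]

lines = format_rows("2503441", userid, username, rate_total)
print("".join(lines), end="")
```

The returned lines can then be passed to fileop.writelines in place of the indexed loop.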
Do not forget to catch exceptions with try...except while writing the code, for example:
except Exception:
    # Append the URL that failed so it can be retried later
    with open('D:/exception.txt', 'a') as file_except:
        file_except.write(tempurl + '\n')
    breaknumber = breaknumber + 1
We can also save the exception information to make debugging easier.
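A fuller version of this pattern, including the traceback, could look like the sketch below. The names crawl_all and fake_fetch, the example.com URLs, and the temporary log location are all invented for illustration; the real script would pass in its own download function and log path.

```python
import os
import tempfile
import traceback

def crawl_all(urls, fetch, log_path):
    """Call fetch(url) for every URL; log failures instead of stopping."""
    failed = 0
    for url in urls:
        try:
            fetch(url)
        except Exception:
            # Record the URL and the traceback so the failure can be examined later
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(url + "\n")
                log.write(traceback.format_exc() + "\n")
            failed += 1
    return failed

def fake_fetch(url):
    """Stand-in for the real download; fails on one made-up URL."""
    if url.endswith("/bad"):
        raise ValueError("simulated failure")

log_path = os.path.join(tempfile.mkdtemp(), "exception.txt")
failed = crawl_all(["http://example.com/ok", "http://example.com/bad"], fake_fetch, log_path)
print(failed, "URL(s) failed; see", log_path)
```

Catching Exception instead of using a bare except keeps Ctrl+C working during a long crawl.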
This is the running process:
(I feel like I'm writing a course report... pasting the running process out of habit...)
This is what the final crawled data looks like:
(A few rows are misaligned; I do not know why, so please ignore them.)
In this way, we can easily achieve the desired goal (comments and discussion are welcome).
------------------------
Too many people have asked for the code, so I have uploaded a copy of the V2 version for everyone's reference. CSDN downloads can no longer be set to free, so it costs 1 point.
Address: http://download.csdn.net/download/drdairen/9938689