Python3 Reptile Combat: Crawl The public comment network all hotel related information in a certain area _

Python3 Reptile Combat: Crawl The public comment network all hotel related information in a certain area __python

Last Update:2018-07-28 Source: Internet

Author: User

Tags chrome developer chrome developer tools

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

After an afternoon plus one night, finally put the reptile code to write well, behind there are many want to improve the place (for example, data storage with Redis, use multithreading to speed up, crawl pictures, subdivision data, etc.), to be free to make changes, the following is the specific steps and ideas:

tools:pycharm,Google chrome developer tools, Fiddle2

Platform: Python3.4

First , enter the public comments the default area of the home is Shanghai, so simply go directly to the hotel home page from here start crawling data

The task I accomplished was simple and divided into two steps:

1, crawl comments network all hotels in Shanghai home page address, save

2. For each hotel, crawl all the information (ID, name) of the customer's evaluation, evaluate the situation (room, environment, service, evaluation time, etc.), evaluate the situation and so on, and save

second, code writing

1, we can see that every page of the hotel homepage has many hotels, each page URL is the following form:

Http://www.dianping.com/shanghai/hotel/pn

where n is the number of pages (1-50) so that we can traverse all the pages and find all the hotels as follows:

For K1 in range (0,50):
    tempurlpag = "http://www.dianping.com/shanghai/hotel/p" +str (PAGELIST[K1))
    # Queuepag.append (TEMPURLPAG)
    #下面对于当前页拿到每个酒店网址并加入队列
    op = Opener.open (TEMPURLPAG)
    if ' HTML ' not in Op.getheader (' Content-type '):
        continue
    data1 = Op.read (). Decode (encoding= ' UTF-8 ')
    #  onclick= ' window.open ('/shop/2503441 ')
  
    Linkre = Re.compile (R ' onclick= ' window.open\ (\ ') (. *?) \ ') ', re. Dotall). FindAll (data1)
    for K2 in range (0, Len (linkre)):
        Tempurlhotel = "http://www.dianping.com" + linkre[k2 ]
        queue.append (Tempurlhotel)
        fileop.write (tempurlhotel+ ' \ n ')
        #print (' current page:  %d  :%d  Hotel, its address is:%s '% (K1+1,k2+1,tempurlhotel)
    Print (' page:  %d  crawl complete. '%k1)

The running process is as follows :

The output file is as follows:

you can see the comment network as a tourist login to the 750 hotel information all crawled to (visitor login only shows 50 pages of data, do not know user login is not purple, crawler login Please refer to the previous blog)

2, I put all the hotel home page into the queue, where the use of the common deque module, in order to prevent the site ban crawler, so in the post to the crawler disguised as a browser (see the previous blog)

for each hotel read to data, here is an example of obtaining the evaluation user name:

  #   <a target= "_blank" rel= "nofollow" href= "/member/8450012" user-id= "8450012" class= "J_card" >   

# User's name
username = re.compile (r '  

 
in the previous note, the content is to use Google or fiddle or to view the saved HTML data contains a string of user names, using the regular: 


Re.compile (R '  



We can get username.

By the same token, get all the information you want, and then you can format the output to the file after processing:


Mink = min (int (len (userid)), int (len (rate_total)), int (len (rate_room))
                # print (' Mink:  ', mink)

                # Opens the output file, Append information Enter
                fileop = open (Fileout, ' a ', encoding= "Utf-8") for
                K in range (0, mink):
                    fileop.write ('%12s\t%12s\t% 20s\t%15s\t%15s\t%15s\t%15s\t\n '% (Hotelid, userid[k], username[k], rate_room[k], rate_service[k], rate_total[k], Rate_time[k])
                fileop.close ()) 

 
do not forget to throw an exception using try...except during code writing, such as the following: 


Except:
        file_except = open (' D:/exception.txt ', ' a ')
        file_except.write (tempurl+ ' \ n ')
        Breaknumber = Breaknumber+1 
We can also save the information of anomalies to facilitate debugging . 

This is the running process:




(I feel like I'm writing a course report ...) Habitual pasting process ... ）

This is the last data to crawl to look like:





(a few crooked do not know what the situation, please ignore)

In this way, we are easy to achieve the desired goal (welcome to the message discussion)

------------------------

to code too many people, upload a copy of the V2 version for everyone's reference, now CSDN download can not be set to free, so need 1 points,

Address: http://download.csdn.net/download/drdairen/9938689

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More