Python 3 Crawler in Practice: Crawling All Hotel Information for One Area from Dianping


After an afternoon and an evening, I finally got the crawler code working. There is still a lot I want to improve (for example, storing the data in Redis, using multithreading to speed things up, crawling pictures, and breaking the data down further), and I will make those changes when I have time. The steps and the reasoning are as follows:

Tools: PyCharm, Google Chrome Developer Tools, Fiddler2

Platform: Python 3.4

First, the target: the home page of Dianping (大众点评) defaults to the Shanghai area, so I simply go straight to the hotel home page and start crawling data from there.

The task is simple and splits into two steps:

1. Crawl the home page address of every Shanghai hotel on Dianping and save it.

2. For each hotel, crawl every customer review: the reviewer's information (ID, name), the ratings (room, environment, service, review time, etc.), and so on, and save it.

Second, writing the code

1. Each listing page under the hotel home page contains many hotels, and each page's URL has the following form:


where n is the page number (1 to 50), so we can iterate over all the pages and collect every hotel as follows:

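import re
import urllib.request

# queue (a collections.deque), fileop (an open output file) and pagelist
# (the page numbers 1 to 50) are created earlier in the script.
for k1 in range(0, 50):
    # The base URL is missing from the source text, so it is left empty here.
    tempurlpag = "" + str(pagelist[k1])
    # queuepag.append(tempurlpag)
    op = urllib.request.urlopen(tempurlpag)
    # Skip anything that is not an HTML page.
    if 'html' not in op.getheader('Content-Type', ''):
        continue
    data1 = op.read().decode(encoding='UTF-8')
    # Target markup looks like: onclick="...('/shop/2503441')"
    # The call name before the parenthesis is not shown in the source, so [^"(]* accepts any.
    linkre = re.compile(r'onclick="[^"(]*\(\'(.*?)\'\)', re.DOTALL).findall(data1)
    for k2 in range(0, len(linkre)):
        # The site prefix is also missing from the source text.
        tempurlhotel = "" + linkre[k2]
        queue.append(tempurlhotel)
        fileop.write(tempurlhotel + '\n')
        # print('current page: %d, hotel %d, its address is: %s' % (k1 + 1, k2 + 1, tempurlhotel))
    print('page: %d crawl complete.' % (k1 + 1))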
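To check the link-extraction step without hitting the site, the parsing can be pulled into a small function and run against saved HTML. The following is only a sketch: the function name `extract_shop_paths`, the sample markup, and the `window.open` call name in it are assumptions made for illustration (the source only shows that the link sits inside an `onclick` attribute as `('/shop/2503441')`).

```python
import re

# Accept any call name before the parenthesis, since the source does not show it.
SHOP_LINK_RE = re.compile(r'onclick="[^"(]*\(\'(/shop/\d+)\'\)')


def extract_shop_paths(html):
    """Return the /shop/<id> paths found in html, without duplicates, in page order."""
    seen = set()
    paths = []
    for path in SHOP_LINK_RE.findall(html):
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


# Invented sample markup; the real page may differ.
SAMPLE_PAGE = (
    '<li onclick="window.open(\'/shop/2503441\')">Hotel A</li>'
    '<li onclick="window.open(\'/shop/2503442\')">Hotel B</li>'
    '<li onclick="window.open(\'/shop/2503441\')">Hotel A again</li>'
)

print(extract_shop_paths(SAMPLE_PAGE))
```

Dropping duplicates here is a design choice of this sketch, not something the original code does; it keeps the same hotel from being queued twice if it appears more than once on a page.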

The running process is as follows:

The output file is as follows:

As you can see, the information for all 750 hotels that Dianping shows to a guest visitor has been crawled. (A guest only sees 50 pages of data; I don't know whether the same holds for a logged-in user. For logging in from a crawler, please refer to my previous blog post.)

2. I put all the hotel home page addresses into a queue, using the commonly used deque from the collections module. To keep the site from banning the crawler, its requests are disguised as coming from a browser (see the previous blog post).
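A minimal sketch of these two ideas, a `deque` used as a FIFO work queue and a request that carries a browser-like `User-Agent` header, might look like this. The header value, the helper names `build_request` and `drain`, and the example URLs are all placeholders, not taken from the original code.

```python
from collections import deque
import urllib.request

# Hypothetical header value; any real browser's User-Agent string would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/45.0 Safari/537.36",
}


def build_request(url):
    """Wrap a URL in a Request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def drain(queue, fetch):
    """Pop URLs from the left of the deque (FIFO) and pass each request to fetch."""
    results = []
    while queue:
        url = queue.popleft()
        results.append(fetch(build_request(url)))
    return results


# Usage sketch with placeholder URLs. In a real run, fetch would be
# urllib.request.urlopen; here it just echoes the URL so nothing is downloaded.
queue = deque(["http://example.com/shop/1", "http://example.com/shop/2"])
visited = drain(queue, lambda req: req.full_url)
print(visited)
```

Passing `fetch` in as a parameter keeps the queue logic testable offline.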

For each hotel, the page data is read and parsed. Here is an example of obtaining the reviewer's user name:

  # <a target="_blank" rel="nofollow" href="/member/8450012" user-id="8450012" class="J_card">

# User's name
username = re.compile(r'

The comment above shows the string that contains the user name, as found with Chrome's developer tools, with Fiddler, or by viewing the saved HTML. Using the regular expression:

re.compile(r'

we obtain username.
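The regular expression itself is cut off in the text above, so the following is not the author's pattern. It is a hypothetical sketch built only from the sample `<a ... class="J_card">` tag, and it assumes that the user name is the text inside that anchor; the names `USER_RE` and `extract_users` and the sample user name are invented for illustration.

```python
import re

# Sample anchor from the article, with an invented user name as its text.
SAMPLE_HTML = (
    '<a target="_blank" rel="nofollow" href="/member/8450012" '
    'user-id="8450012" class="J_card">example_user</a>'
)

# Capture the user-id attribute and the anchor text of every J_card link.
USER_RE = re.compile(
    r'<a[^>]*?user-id="(\d+)"[^>]*?class="J_card"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def extract_users(html):
    """Return a list of (user_id, user_name) tuples found in html."""
    return [(uid, name.strip()) for uid, name in USER_RE.findall(html)]


print(extract_users(SAMPLE_HTML))
```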

In the same way, extract all the other fields you want. After processing, the results can be written to a file as formatted output:

# Number of complete rows: limited by the shortest of these lists
mink = min(len(userid), len(rate_total), len(rate_room))
# print('mink: ', mink)

# Open the output file in append mode and write one line per review
fileop = open(fileout, 'a', encoding="utf-8")
for k in range(0, mink):
    fileop.write('%12s\t%12s\t%20s\t%15s\t%15s\t%15s\t%15s\t\n' % (hotelid, userid[k], username[k], rate_room[k], rate_service[k], rate_total[k], rate_time[k]))
fileop.close()
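One possible variation on this output step, sketched under the assumption that each field has already been collected into its own list: `zip()` stops at the shortest list, so no index can run past the end even if one of the regular expressions matched fewer times than the others, and a `with` block closes the file even when a write fails. The helper names `format_rows` and `append_rows` are made up for this sketch; the format string is the one used above.

```python
ROW_FORMAT = '%12s\t%12s\t%20s\t%15s\t%15s\t%15s\t%15s\t\n'


def format_rows(hotelid, userid, username, rate_room,
                rate_service, rate_total, rate_time):
    """Build one formatted line per review, stopping at the shortest list."""
    columns = zip(userid, username, rate_room,
                  rate_service, rate_total, rate_time)
    return [ROW_FORMAT % ((hotelid,) + row) for row in columns]


def append_rows(path, rows):
    """Append the formatted lines to the output file."""
    with open(path, 'a', encoding='utf-8') as fileop:
        fileop.writelines(rows)
```

For example, `append_rows(fileout, format_rows(hotelid, userid, username, rate_room, rate_service, rate_total, rate_time))` would replace the loop above.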

Don't forget to catch exceptions with try...except while writing the code, for example:

        # Body of the except branch: record the URL that failed and count the failure
        file_except = open('D:/exception.txt', 'a')
        file_except.write(tempurl + '\n')
        file_except.close()
        breaknumber = breaknumber + 1

We can also save the details of each exception to make debugging easier.
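A self-contained sketch of this pattern is shown below. The function name `crawl_all` and its `fetch` and `log_path` parameters are hypothetical; `fetch` stands in for whatever downloads and parses one hotel page. Each failing URL is written to the log together with its traceback, and the loop moves on to the next URL instead of stopping.

```python
import traceback


def crawl_all(urls, fetch, log_path):
    """Call fetch(url) for every URL; log failures instead of stopping the run."""
    failed = 0
    for url in urls:
        try:
            fetch(url)
        except Exception:
            failed += 1
            # Record which URL failed and why, then continue with the next one.
            with open(log_path, 'a', encoding='utf-8') as log:
                log.write(url + '\n')
                log.write(traceback.format_exc() + '\n')
    return failed
```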

This is the running process:

(I feel like I'm writing a course report... pasting the run output out of habit.)

This is what the final crawled data looks like:

(A few rows are misaligned and I don't know why; please ignore them.)

In this way, we can easily achieve the desired goal. (Comments and discussion are welcome.)


Since so many people have asked for the code, I have uploaded a V2 version for everyone's reference. CSDN downloads can no longer be set to free, so it costs 1 point.

