Reading call transfer for Python crawlers (2)



 

 

On each chapter page, the link to the next page sits in a div with the id footlink. If we tried to match every link on the page directly, we would pick up a large number of unrelated links — but there is only one footlink div. So we match that div first, capture it, and then match the links inside the captured div. There are only three of them, and the last one is the URL of the next page. We use that URL to update the crawler's target, so the crawler can keep fetching page after page. The reading logic is: after a chapter is displayed, wait for user input; if it is quit, exit the program, otherwise show the next chapter.
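The two-stage match described above can be sketched like this. The HTML snippet is hypothetical — a miniature stand-in for the real chapter page, which has many stray links but only one footlink div:

```python
import re

# Hypothetical stand-in for the chapter page markup
html = '''
<a href="/spam1.html">ad</a>
<div id="footlink">
<a href="/xiaoshuo/10/10412/index.html">Contents</a>
<a href="/xiaoshuo/10/10412/2095095.html">Previous</a>
<a href="/xiaoshuo/10/10412/2095097.html">Next chapter</a>
</div>
<a href="/spam2.html">more ads</a>
'''

# Stage 1: capture just the footlink div (re.S lets "." cross newlines)
foot_link = re.search(r'<div id="footlink">(.*?)</div>', html, re.S).group(1)

# Stage 2: match links only inside the captured div
links = re.findall(r'<a href="(.*?)">(.*?)</a>', foot_link, re.S)

# The third link is the next page; group 0 of the tuple is its URL
next_url = links[2][0]
print(next_url)  # /xiaoshuo/10/10412/2095097.html
```

Matching inside the captured group instead of the whole page is what keeps the ad links out of the result.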

 

Basic knowledge:

In addition to the basics covered in the previous article, this installment uses Python's thread module.
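The thread module is used to run the chapter loader in the background while the main loop handles the reader. A minimal sketch of that producer pattern — using threading.Thread, the equivalent of Python 2's thread.start_new_thread that also works on Python 3, and a load_pages stand-in that appends placeholder text instead of fetching real chapters:

```python
import threading

pages = []  # shared buffer: the loader appends, the reader consumes

def load_pages():
    # Hypothetical stand-in for the article's LoadPage method:
    # append fake chapters instead of crawling over HTTP.
    for n in range(1, 4):
        pages.append('chapter %d' % n)

# The article writes thread.start_new_thread(self.LoadPage, ());
# the threading module spells the same thing like this:
loader = threading.Thread(target=load_pages)
loader.start()
loader.join()  # the real program polls the buffer instead of joining

print(pages)  # ['chapter 1', 'chapter 2', 'chapter 3']
```

In the real program the main loop never joins; it repeatedly checks whether the buffer already holds the chapter the reader wants next.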

 

Source code:

 

# -*- coding: utf-8 -*-

import urllib2
import re
import thread
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = 'http://www.quanben.com/xiaoshuo/10/10412/2095096.html'

    # Fetch one chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()

        # Detect the encoding and normalize the page to UTF-8
        charset = chardet.detect(myPage)['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            pass
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode('utf-8')

        # Grab the chapter title
        # (the regex patterns below were lost in transcription; they are
        # reconstructed from the div ids named in the article's prose)
        my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
        my_title = my_title.group(1)

        # Grab the chapter content from the div with id="content"
        my_content = re.search('<div id="content">(.*?)</div>', unicodePage, re.S)
        my_content = my_content.group(1)
        my_content = my_content.replace('<br />', '\n')
        my_content = my_content.replace('&nbsp;', ' ')

        # Store the title and content of the chapter
        onePage = {'title': my_title, 'content': my_content}

        # Locate the link area at the bottom of the page
        foot_link = re.search('<div id="footlink">(.*?)</div>', unicodePage, re.S)
        foot_link = foot_link.group(1)

        # The next page is the third link in that area
        nextUrl = re.findall(u'<a href="(.*?)">(.*?)</a>', foot_link, re.S)
        nextUrl = nextUrl[2][0]

        # Update the target url for the next fetch
        self.url = nextUrl
        return onePage

    # Background loader: keep a few chapters buffered ahead of the reader
    def LoadPage(self):
        while self.flag:
            if len(self.pages) - self.page < 3:
                try:
                    # Fetch a new chapter
                    myPage = self.GetPage()
                    self.pages.append(myPage)
                except:
                    print 'Cannot connect to the web page!'

    # Display one chapter, then wait for user input
    def ShowPage(self, curPage):
        print curPage['title']
        print curPage['content']
        print
        user_input = raw_input(
            'This is chapter %d. Press Enter for the next chapter, or type quit to exit: ' % self.page)
        if user_input == 'quit':
            self.flag = False
        print

    def Start(self):
        print u'Start reading......'
        # Create the loader thread
        thread.start_new_thread(self.LoadPage, ())
        # Show pages as soon as the loader has them ready
        while self.flag:
            if self.page <= len(self.pages):
                nowPage = self.pages[self.page - 1]
                self.ShowPage(nowPage)
                self.page += 1
        print u'Reading finished.'

# ----------- Program entrance -----------
print u'''
---------------------------------------
   Program: Read call transfer
   Version: 0.2
   Author: angryrookie
   Date: 2014-07-07
   Language: Python 2.7
   Function: press Enter to browse the next chapter
---------------------------------------
'''
print u'Press Enter to start:'
raw_input(' ')
myBook = Book_Spider()
myBook.Start()
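GetPage normalizes every page to UTF-8 with chardet. If installing chardet is not an option, a simple try/except fallback covers the two encodings this site actually serves. A minimal sketch in Python 3 style (to_unicode is a hypothetical helper, not part of the article's code):

```python
def to_unicode(raw):
    """Decode raw page bytes: try UTF-8 first, fall back to GB2312,
    ignoring any bytes the fallback codec cannot map."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('gb2312', 'ignore')

print(to_unicode(u'\u4e2d\u6587'.encode('gb2312')))  # 中文
```

GB2312 byte sequences for Chinese text are not valid UTF-8, so the UnicodeDecodeError reliably routes them to the fallback branch.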

 
