Reading call transfer for Python crawlers (III)


The program from the previous post lets us read chapters one after another, but do we really want to run a Python script every time we read a novel? It does not even remember where we left off. Of course not. Let's change it: instead of displaying chapters, we now capture the whole novel into a local txt file, and then open that file in whatever reader we like.


In fact, most of the logic was already finished in the last program, and only two changes are needed. First, each captured chapter is no longer displayed but saved to the txt file. Second, the program keeps crawling by following the next-page URL, so when does it stop? Notice that on the last chapter of the novel, the next-page link is identical to the back-to-directory link. So every time we capture a page we extract both links, and when the two links are equal we stop crawling. Finally, this program needs no multithreading: a single thread that keeps capturing chapter pages is enough. The stop condition is sketched below.
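Before the full program, here is a minimal sketch of just that crawl loop and its stop condition (Python 2; fetch_page and extract_links are illustrative stand-ins, not names from the original program):

# -*- coding: utf-8 -*-
# Minimal sketch of the crawl loop and its stop condition.
# fetch_page()/extract_links() are illustrative stand-ins for the
# real downloading and parsing code in the full source below.
import re
import urllib2

def fetch_page(url):
    return urllib2.urlopen(url).read()

def extract_links(html):
    # (href, text) pairs from the page; which indexes hold the
    # directory and next-page links depends on the site's footer layout
    links = re.findall(r'<a.*?href="(.*?)".*?>(.*?)</a>', html, re.S)
    return links[1][0], links[2][0]

def crawl(start_url):
    url = start_url
    while True:
        dir_url, next_url = extract_links(fetch_page(url))
        # ... save the chapter here ...
        if dir_url == next_url:  # last chapter: "next page" goes back to the directory
            break
        url = next_url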

However, when the novel has many chapters, the download may take quite a while. We will not worry about that for now; getting the basic function working is enough.


Basic knowledge: everything from the previous posts, plus multithreading and file operations.
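As a quick refresher on the file-operation side, writing text to a local file looks like this (a generic Python 2 example, not tied to any site):

# Open a text file for writing ('w+' creates the file and truncates
# any existing content), write a few strings, then close it.
f = open('example.txt', 'w+')
f.write('chapter title\n\n')
f.write('chapter body\n\n')
f.close()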


Source code:

# -*- coding: utf-8 -*-

import urllib2
import re
import chardet


class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # Grab one chapter page
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()

        # Detect the page encoding and normalize everything to UTF-8
        charset = chardet.detect(myPage)['encoding']
        if charset != 'utf-8' and charset != 'UTF-8':
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode('utf-8')

        # The patterns below assume quanben.com's markup at the time:
        # chapter title in <h1>, body in <div id="content">, page-turn
        # links in a div of class "chapter_Turnpage"; adjust them if
        # the site's layout differs.

        # Capture the title
        my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S).group(1)

        # Capture the chapter content
        my_content = re.search('<div.*?id="content">(.*?)</div>',
                               unicodePage, re.S).group(1)
        my_content = my_content.replace("<br />", "\n")
        my_content = my_content.replace("&nbsp;", " ")

        # Store the title and content of the chapter
        onePage = {'title': my_title, 'content': my_content}

        # Find the link area at the bottom of the page
        foot_link = re.search('<div.*?class="chapter_Turnpage">(.*?)</div>',
                              unicodePage, re.S).group(1)
        # Pull (href, text) pairs out of the link area; the directory
        # link is the second one and the next-page link is the third
        nextUrl = re.findall(u'<a.*?href="(.*?)".*?>(.*?)</a>', foot_link, re.S)
        dir_url = nextUrl[1][0]
        nextUrl = nextUrl[2][0]
        # Update the link to crawl next
        self.url = nextUrl

        # On the last chapter, "next page" equals the directory link
        if dir_url == nextUrl:
            self.flag = False
        return onePage

    # Download the chapters into the txt file
    def downloadPage(self):
        f_txt = open("douluo mainland.txt", 'w+')
        while self.flag:
            try:
                # Fetch a new page
                myPage = self.GetPage()
                title = myPage['title'].encode('utf-8')
                content = myPage['content'].encode('utf-8')
                f_txt.write(title + '\n\n')
                f_txt.write(content)
                f_txt.write('\n\n')
                print "downloaded", myPage['title']
            except:
                print u'cannot connect to the server!'
                self.flag = False  # stop instead of retrying forever
        f_txt.close()

    def Start(self):
        print u'downloading......\n'
        self.downloadPage()
        print u'download completed'


# ----------- program entrance -------------
print u"""
---------------------------------------
 program: Read call transfer
 version: 0.3
 author: angryrookie
 date:
 language: Python 2.7
 function: press Enter to start downloading
---------------------------------------
"""
print u'Press Enter:'
raw_input('')
myBook = Book_Spider()
myBook.Start()

The effect is shown in the following figure:

