Reading Call Transfer with a Python Crawler (III)
The program from the previous post lets us read the chapters, but do we really want to run the Python script every time we read a novel? It cannot even remember where we left off, so we would start over every time. Of course not. Let's change it: this time we capture the whole novel into a local txt file, then open that file in any reader we like.
In fact, most of the logic was already finished in the last program. The first change is that each captured chapter is no longer displayed but written to the txt file. The second question is: the program keeps crawling by following the URL of the "next page", so when does it stop? Note that on the last chapter of the novel, the "next page" link is the same as the link back to the table of contents. So every time we capture a page, we extract both links, and when the two are equal we stop crawling. Finally, this program does not need multiple threads; a single thread that keeps fetching chapter pages is enough (see the sketch right after this paragraph).
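Here is a minimal toy sketch of that stop condition. It is not the real program: fetch_links and the demo URLs are placeholders standing in for the download-and-regex step shown in the full source further down.

# -*- coding: utf-8 -*-
# Toy sketch of the stop condition (Python 2): crawling ends when the
# "next page" link of a chapter equals the link back to the directory.

def fetch_links(url):
    # Placeholder for the real "download page + regex" step in the full
    # source below; returns (directory_link, next_page_link).
    demo = {
        'chap1.html': ('index.html', 'chap2.html'),
        'chap2.html': ('index.html', 'index.html'),  # last chapter
    }
    return demo[url]

url = 'chap1.html'
flag = True
while flag:
    dir_url, next_url = fetch_links(url)
    print 'crawled', url          # the real program writes the chapter to the txt file here
    if dir_url == next_url:       # last chapter reached: stop
        flag = False
    else:
        url = next_url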
However, when the novel has many chapters, the download may take quite a while to finish. We will not worry about that for now; as long as the basic functionality works, OK...
Basic knowledge: everything from the previous posts, minus the multithreading knowledge, plus file-operation knowledge (a small file-writing sketch follows).
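If the file-operation part is new to you, here is a minimal sketch of what the program does with the txt file; the filename and strings here are made up for illustration.

# -*- coding: utf-8 -*-
# Minimal file-writing sketch (Python 2): open a txt file, write one chapter
# as UTF-8 bytes, then close it -- the same pattern the crawler uses.

f_txt = open('example_book.txt', 'w+')             # 'w+' creates/overwrites the file
title = u'Chapter 1'.encode('utf-8')               # write bytes, not unicode objects
content = u'Some chapter text...'.encode('utf-8')
f_txt.write(title + '\n\n')
f_txt.write(content)
f_txt.write('\n\n')
f_txt.close()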
Source code:
# -*- coding: utf-8 -*-

import urllib2
import re
import chardet


class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # Capture one chapter and return its title and content
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()

        # Detect the page encoding and normalize everything to UTF-8
        charset = chardet.detect(myPage)['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            pass
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")

        # NOTE: the HTML patterns below must match the site's markup;
        # adjust them if the page structure differs.
        # Capture the chapter title
        my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
        my_title = my_title.group(1)

        # Capture the chapter content from the div tag with id="content"
        my_content = re.search('<div id="content">(.*?)</div>', unicodePage, re.S)
        my_content = my_content.group(1)
        my_content = my_content.replace("<br />", "\n")
        my_content = my_content.replace("&nbsp;", "")

        # Store the title and content of the chapter
        onePage = {'title': my_title, 'content': my_content}

        # Find the link area at the bottom of the page
        foot_link = re.search('<div id="footlink">(.*?)</div>', unicodePage, re.S)
        foot_link = foot_link.group(1)

        # In that area the directory link is the second one and
        # the "next page" link is the third one
        nextUrl = re.findall(u'<a.*?href="(.*?)".*?>(.*?)</a>', foot_link, re.S)
        dir_url = nextUrl[1][0]
        nextUrl = nextUrl[2][0]

        # Update the link for the next crawl
        self.url = nextUrl

        # Last chapter: the "next page" link points back to the directory
        if dir_url == nextUrl:
            self.flag = False

        return onePage

    # Download the chapters one by one into a txt file
    def downloadPage(self):
        f_txt = open("douluo mainland.txt", 'w+')
        while self.flag:
            try:
                # Obtain the new page
                myPage = self.GetPage()
                title = myPage['title'].encode('utf-8')
                content = myPage['content'].encode('utf-8')
                f_txt.write(title + '\n\n')
                f_txt.write(content)
                f_txt.write('\n\n')
                print "downloaded", myPage['title']
            except:
                print 'Cannot connect to the server!'
        f_txt.close()

    def Start(self):
        print u'Starting download......\n'
        self.downloadPage()
        print u'Download completed'


# ----------- program entrance -------------
print u"""
---------------------------------------
   Program:  Reading Call Transfer
   Version:  0.3
   Author:   angryrookie
   Language: Python 2.7
   Function: press Enter to start downloading
---------------------------------------
"""
print u'Press Enter:'
raw_input(' ')
myBook = Book_Spider()
myBook.Start()
The effect is shown in the following figure: