In the last post we managed to read the chapters one after another, but do we really want to run our Python program every single time we feel like reading the novel? With no record of where we stopped, starting over each time? Of course not. Let's change that!
There are plenty of novel readers out there, so all we need to do is grab the novel we want into a local TXT file and then open it in whichever reader we like. That part is entirely up to you.
In fact, the last program already contains most of the logic we need. The first modification is that instead of displaying each chapter as we crawl it, we write it into a TXT file. The other question is: the program keeps crawling based on the next-page URL, so when should it stop? Note that on the last chapter of the novel, the next-page link is the same as the link back to the chapter directory. So every time we crawl a page, we extract both links, and as soon as the two links are equal, we stop crawling. Finally, this program does not need multiple threads; a single thread that keeps crawling chapter pages one after another is enough.
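The stopping rule described above is easy to isolate. Below is a minimal Python 3 sketch (the function, URLs, and chapter names are made up for illustration, not from the original program): crawling continues until the extracted next-page link equals the directory link, which only happens on the last chapter.

```python
def crawl_chapters(start_url, fetch_links):
    """Visit chapter pages until the next-page link equals the directory link.

    fetch_links(url) -> (dir_url, next_url) stands in for the real
    page download plus regex extraction.
    """
    url = start_url
    visited = []
    while True:
        visited.append(url)
        dir_url, next_url = fetch_links(url)
        if dir_url == next_url:  # last chapter: "next" points back to the directory
            break
        url = next_url
    return visited

# Simulated three-chapter novel: on the last chapter the "next" link
# is the directory link itself, so crawling stops there.
links = {
    "chapter-1": ("directory", "chapter-2"),
    "chapter-2": ("directory", "chapter-3"),
    "chapter-3": ("directory", "directory"),
}
print(crawl_chapters("chapter-1", links.get))
```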
The only downside is that with so many chapters, the download takes a while. Never mind that for now; the basic functionality works...
Prerequisites: the basics from the previous posts, plus multithreading and file-manipulation knowledge.
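Since the main change boils down to file manipulation, here is a minimal sketch of that part in Python 3 syntax (the chapter data and the filename `novel.txt` are made up for illustration): each chapter's title and content are written out, separated by blank lines.

```python
# Hypothetical chapter dicts, shaped like the ones the crawler builds.
chapters = [
    {"title": "Chapter 1", "content": "First chapter text..."},
    {"title": "Chapter 2", "content": "Second chapter text..."},
]

# Write every chapter into one TXT file, UTF-8 encoded.
with open("novel.txt", "w", encoding="utf-8") as f_txt:
    for page in chapters:
        f_txt.write(page["title"] + "\n")
        f_txt.write(page["content"])
        f_txt.write("\n\n\n")  # blank lines between chapters
```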
Source:
# -*- coding: utf-8 -*-
import urllib2
import re
import chardet


class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # Crawl a single chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()

        # Detect the page encoding and normalize everything to UTF-8
        charset = chardet.detect(myPage)['encoding']
        if charset != 'utf-8' and charset != 'UTF-8':
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")

        try:
            # Grab the chapter title
            my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
            my_title = my_title.group(1)
            # Grab the chapter body: the div with id="htmlContent"
            my_content = re.search('<div.*?id="htmlContent" class="contentbox">(.*?)<div',
                                   unicodePage, re.S)
            my_content = my_content.group(1)
        except:
            print "Content HTML has changed. Please re-analyze!"
            return False

        my_content = my_content.replace("<br/>", "\n")
        my_content = my_content.replace("&nbsp;", "")  # strip non-breaking spaces

        # Store one chapter's title and content in a dict
        onePage = {'title': my_title, 'content': my_content}

        try:
            # Find the link area at the bottom of the page
            foot_link = re.search('<div.*?class="chapter_Turnpage">(.*?)</div>',
                                  unicodePage, re.S)
            foot_link = foot_link.group(1)
            # Within that area, find the links; the next-page link is the third one
            nextUrl = re.findall(u'<a.*?href="(.*?)".*?>(.*?)</a>', foot_link, re.S)
            # The directory (folder) link
            dir_url = nextUrl[1][0]
            nextUrl = nextUrl[2][0]
            # Update the link for the next crawl
            self.url = nextUrl
            # On the last chapter the next-page link equals the directory link
            if dir_url == nextUrl:
                self.flag = False
            return onePage
        except:
            print "Bottom links have changed. Please re-analyze!"
            return False

    # Download the chapters into a TXT file
    def DownloadPage(self):
        f_txt = open(u"bucket continent.txt", 'w+')
        while self.flag:
            try:
                # Fetch a new page
                myPage = self.GetPage()
                if myPage == False:
                    print 'Crawl failed!'
                    self.flag = False
                    break
                title = myPage['title'].encode('utf-8')
                content = myPage['content'].encode('utf-8')
                f_txt.write(title + '\n')
                f_txt.write(content)
                f_txt.write('\n\n\n')
                print "Downloaded", myPage['title']
            except:
                print 'Cannot connect to the server!'
                self.flag = False
        f_txt.close()

    def Start(self):
        print u'Starting download...\n'
        self.DownloadPage()
        print u'Download finished'

# ----------- Program entry -----------
print u"""---------------------------------------
Program:  novel downloader
Version:  0.3
Author:   Angryrookie
Date:     2014-07-08
Language: Python 2.7
Function: press Enter to start the download
---------------------------------------"""

print u'Press Enter:'
raw_input(' ')
myBook = Book_Spider()
myBook.Start()
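The footer-link extraction is the most fragile step, so here is that step in isolation as a Python 3 sketch. The class name `chapter_Turnpage` and the link order (second link is the directory, third is the next page) come from the regexes in the source above; the HTML snippet and its URLs are invented for illustration.

```python
import re

# Hypothetical footer markup shaped like what the crawler's regexes expect.
html = '''
<div class="chapter_Turnpage">
  <a href="59301.html">Previous</a>
  <a href="/xiaoshuo/0/910/">Directory</a>
  <a href="59303.html">Next</a>
</div>
'''

# Isolate the link area at the bottom of the page.
foot_link = re.search(r'<div.*?class="chapter_Turnpage">(.*?)</div>', html, re.S).group(1)

# Collect (href, text) pairs for every link in that area.
links = re.findall(r'<a.*?href="(.*?)".*?>(.*?)</a>', foot_link, re.S)

dir_url = links[1][0]   # second link: back to the directory
next_url = links[2][0]  # third link: next chapter
print(dir_url, next_url)
```

When `dir_url == next_url`, the crawler knows it has reached the last chapter and stops.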
The result is shown in the figure below:
Python Crawler: Read Call Transfer (III)