Python Crawler: Read Call Transfer (III)


Although in the last post we could already read chapters one after another, do we really want to run our Python program every time we read the novel, with no record kept anywhere, starting over each time? Of course not. Let's change that!

There are plenty of novel readers out there, so we only need to grab the novel we want into a local TXT file and then pick whichever reader we like. How you read it is up to you.


In fact, the last program already contains most of the logic. Our modification only needs to write each chapter into the TXT file as it is crawled, instead of displaying it. The other question is that the program keeps following the next-page URL, so when should it stop? Note that on the last chapter of the novel, the next-page link is the same as the link back to the chapter directory. So each time we crawl a page, we extract these two links, and as soon as they are identical we stop crawling (see the sketch below). Finally, this program does not need multiple threads; a single thread that keeps crawling the novel's pages is enough.
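To make the stop condition concrete, here is a minimal sketch. It assumes the footer markup that the full program below relies on (a div with class "chapter_Turnpage" whose second link points back to the chapter directory and whose third link points to the next page); treat those details as assumptions about quanben.com's pages.

# -*- coding: utf-8 -*-
# Sketch: on the last chapter, the "next page" link equals the link back
# to the chapter directory, so comparing the two tells us when to stop.
# The footer layout (class "chapter_Turnpage", 2nd link = directory,
# 3rd link = next page) is an assumption mirrored from the program below.
import re

def is_last_chapter(page_html):
    foot = re.search('<div.*?class="chapter_Turnpage">(.*?)</div>',
                     page_html, re.S)
    if foot is None:
        return True  # no footer found: safest to stop crawling
    links = re.findall('<a.*?href="(.*?)".*?>(.*?)</a>', foot.group(1), re.S)
    if len(links) < 3:
        return True  # unexpected footer: stop rather than loop forever
    dir_url = links[1][0]   # second link: back to the chapter directory
    next_url = links[2][0]  # third link: next page
    return dir_url == next_url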

It just takes a while when the novel has many chapters, so be patient. Don't worry about that for now; the basic functionality works...


Fundamentals: the basics from the previous posts — multithreading knowledge plus file manipulation.
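If the file-manipulation part is rusty, this is the pattern the downloader uses, shown in isolation (a minimal sketch; the filename and strings are placeholders). In Python 2, unicode text must be encoded to bytes, here UTF-8, before being written:

# -*- coding: utf-8 -*-
# Minimal Python 2 file-writing pattern: open, encode unicode to bytes,
# write, close. 'example.txt' and the strings are placeholders.
title = u'Chapter 1'
content = u'Chapter text...'

f = open('example.txt', 'w+')              # 'w+' creates or truncates the file
f.write(title.encode('utf-8') + '\n')      # encode unicode before write()
f.write(content.encode('utf-8') + '\n\n')
f.close()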


Source code:

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # Crawl one chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()

        # Normalize the page to UTF-8, whatever encoding the server sent
        charset = chardet.detect(myPage)['encoding']
        if charset != 'utf-8' and charset != 'UTF-8':
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")

        try:
            # Grab the chapter title from the <h1> tag
            my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
            my_title = my_title.group(1)
            # Grab the chapter body from the div with id="htmlContent"
            my_content = re.search('<div.*?id="htmlContent" class="contentbox">(.*?)<div',
                                   unicodePage, re.S)
            my_content = my_content.group(1)
        except:
            print "Content HTML has changed. Please analyze again!"
            return False

        my_content = my_content.replace("<br/>", "\n")
        my_content = my_content.replace("&nbsp;", " ")

        # Store the chapter's title and content in a dictionary
        onePage = {'title': my_title, 'content': my_content}

        try:
            # Find the link area at the bottom of the page
            foot_link = re.search('<div.*?class="chapter_Turnpage">(.*?)</div>',
                                  unicodePage, re.S)
            foot_link = foot_link.group(1)
            # In that area, find the next-page link; by the page's layout it is the third one
            nextUrl = re.findall(u'<a.*?href="(.*?)".*?>(.*?)</a>', foot_link, re.S)
            # Folder (chapter directory) link
            dir_url = nextUrl[1][0]
            nextUrl = nextUrl[2][0]
            # Update the link for the next crawl
            self.url = nextUrl
            # On the last chapter the two links coincide: stop crawling
            if dir_url == nextUrl:
                self.flag = False
            return onePage
        except:
            print "Bottom links have changed. Please analyze again!"
            return False

    # Download the chapters
    def downloadPage(self):
        f_txt = open(u"bucket continent.txt", 'w+')  # novel title as rendered in the original post
        while self.flag:
            try:
                # Get a new page
                myPage = self.GetPage()
                if myPage == False:
                    print 'Crawl failed!'
                    self.flag = False
                    continue

                title = myPage['title'].encode('utf-8')
                content = myPage['content'].encode('utf-8')

                f_txt.write(title + '\n')
                f_txt.write(content)
                f_txt.write('\n\n\n')

                print "Downloaded", myPage['title']
            except:
                print 'Cannot connect to server!'
                self.flag = False
        f_txt.close()

    def Start(self):
        print u'Starting download...\n'
        self.downloadPage()
        print u'Download finished'

#----------- Program entry point -----------
print u"""
---------------------------------------
   Program: Read Call Transfer
   Version: 0.3
   Author: angryrookie
   Date: 2014-07-08
   Language: Python 2.7
   Function: press Enter to start the download
---------------------------------------
"""

print u'Press Enter:'
raw_input(' ')
myBook = Book_Spider()
myBook.Start()
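Two design choices in the script are worth noting. First, it uses chardet to detect the page encoding and falls back to decoding as gb2312 when the page is not UTF-8, which matches how many Chinese novel sites serve their pages. Second, the output file is opened in 'w+' mode, so rerunning the program overwrites the previous download rather than resuming it; rename the file first if you want to keep an earlier copy.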





The effect is shown in the figure below.

