This article describes how to write a Python script that scrapes an online novel so you can build your own reader, including tidying up the order of the novel's chapters. It should be of practical use to Python learners. Have you ever been frustrated by a novel that can only be read online and cannot be downloaded? Or wanted to save an article you liked, only to find no download link? Have you felt the urge to write a program that takes care of it all for you? Have you learned Python and want a project to show off, so you can tell others "I'm awesome!"? Then let's get started! Haha ~
Okay, I have been writing a lot of Yii lately and wanted a change of pace...
This project is for research purposes only. On all copyright questions we side with the authors. Readers who are only here to read pirated novels should go and reflect on that themselves!
With that said, our task is to scrape the text of a novel from a web page. Our research subject is the quanben novel site (quanben.com). I declare again that I am not responsible for any copyright issues.
The most basic starting point is to capture the content of a single chapter.
Environment: Ubuntu, Python 2.7
Basic knowledge
This program draws on several pieces of knowledge, listed here. If anything is unclear, a search on Baidu will turn up plenty of material.
1. The Request object of the urllib2 module sets up the HTTP request, including the URL to fetch and the User-Agent header used to pose as a browser. After that, the urlopen and read methods are easy to understand.
2. The chardet module detects a web page's encoding. Scraped pages easily come out garbled, so use chardet's detect function to determine whether the page is GBK or UTF-8 encoded. Windows users can download it from http://download.csdn.net/detail/jcjc918/8231371 and unzip it into Python's lib directory.
3. The decode function converts a string from a given encoding to unicode, while encode converts a unicode string to a string in the specified encoding.
4. The re module, for regular expressions. The search function finds an item that matches a regular expression, while the string method replace substitutes the matched text.
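Points 1 and 3 above can be sketched without touching the network. The article itself targets Python 2.7 and urllib2. The sketch below instead uses Python 3's urllib.request, which is an assumption made so that it runs on a current interpreter. The helper names build_request and to_utf8 are hypothetical and do not appear in the original program.

```python
import urllib.request


def build_request(url):
    """Build (but do not send) a request that poses as a browser."""
    headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
    return urllib.request.Request(url, headers=headers)


def to_utf8(raw, source_encoding):
    """Decode bytes from source_encoding to text, then encode the text as UTF-8."""
    text = raw.decode(source_encoding, "ignore")  # bytes -> unicode text
    return text.encode("utf-8")                   # unicode text -> UTF-8 bytes


if __name__ == "__main__":
    req = build_request("http://www.quanben.com/xiaoshuo/0/910/59302.html")
    # Request stores header names capitalized, hence "User-agent".
    print(req.get_header("User-agent"))

    gbk_bytes = "\u7b2c\u4e00\u7ae0".encode("gbk")  # sample GBK-encoded text
    print(to_utf8(gbk_bytes, "gbk"))
```

In a real run, the bytes returned by read() would take the place of gbk_bytes, and the encoding reported by chardet's detect function would decide whether the conversion is needed at all.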
Analysis of the approach:
The URL we chose is http://www.quanben.com/xiaoshuo/0/910/59302.html, the first chapter of Douluo Continent. Viewing the page source shows that a single content tag holds all of the chapter's text, so a regular expression can match and capture that tag. Printing this part of the page revealed a lot of <br /> and &nbsp; sequences. <br /> should be replaced with a line break. &nbsp; is a placeholder in the web page, that is, a space, so replace it with a space. With that, the chapter text comes out clean. For completeness, we also use a regular expression to capture the title.
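The extraction and clean-up steps can be tried on a small sample before pointing them at the real site. This is only a sketch: the sample HTML, the h1 and div id="content" tags, and the function name extract_chapter are all assumptions for illustration, and it takes the line-break markup to be <br /> and the space placeholder to be &nbsp;. The real page's markup has to be checked in its source.

```python
import re

# Hypothetical sample page; the real site's tags may differ.
SAMPLE = (
    '<html><body><h1>Chapter 1</h1>'
    '<div id="content">&nbsp;&nbsp;First line.<br />&nbsp;&nbsp;Second line.</div>'
    '</body></html>'
)


def extract_chapter(html):
    """Return {'title', 'content'} for one chapter page, or None if no match."""
    # re.S lets "." match newlines, so a multi-line body is still captured.
    title = re.search(r'<h1>(.*?)</h1>', html, re.S)
    body = re.search(r'<div id="content">(.*?)</div>', html, re.S)
    if title is None or body is None:
        return None
    text = body.group(1)
    text = text.replace("<br />", "\n")  # HTML line break -> newline
    text = text.replace("&nbsp;", " ")   # space placeholder -> plain space
    return {"title": title.group(1), "content": text}


if __name__ == "__main__":
    chapter = extract_chapter(SAMPLE)
    print(chapter["title"])
    print(chapter["content"])
```

Checking the search results for None before calling group keeps a layout change on the site from surfacing as an unexplained AttributeError.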
Program
# -*- coding: utf-8 -*-

import urllib2
import re
import chardet


class Book_Spider:

    def __init__(self):
        self.pages = []

    # Capture one chapter
    def GetPage(self):
        myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(request)
        myPage = myResponse.read()

        # Detect the page's character encoding, then normalize it to UTF-8
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset not in ('utf-8', 'UTF-8'):
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")

        try:
            # Capture the title.
            # NOTE: the tags in this pattern were lost from the text;
            # <h1> is an assumption, so check it against the page source.
            my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
            my_title = my_title.group(1)
        except AttributeError:
            # re.search returned None, so the pattern no longer matches
            print 'The title HTML has changed. Please analyze it again!'
            return False

        try:
            # Capture the chapter content.
            # NOTE: this pattern was also lost from the text; the tag and
            # id used here are assumptions.
            my_content = re.search('<div id="content">(.*?)</div>', unicodePage, re.S)
            my_content = my_content.group(1)
        except AttributeError:
            print 'The content HTML has changed. Please analyze it again!'
            return False

        # Replace the HTML markup left in the body text
        my_content = my_content.replace("<br />", "\n")
        my_content = my_content.replace("&nbsp;", " ")

        # Store the chapter's title and content in a dictionary
        onePage = {'title': my_title, 'content': my_content}
        return onePage

    # Load one chapter
    def LoadPage(self):
        try:
            # Get a new chapter
            myPage = self.GetPage()

            if myPage == False:
                print 'Failed to capture!'
                return False

            self.pages.append(myPage)
        except urllib2.URLError:
            print 'Cannot connect to the server!'

    # Display one chapter
    def ShowPage(self, curPage):
        print curPage['title']
        print curPage['content']

    def Start(self):
        print u'Start reading ......\n'

        # Load this page
        self.LoadPage()

        # If self.pages contains an element, show it
        if self.pages:
            nowPage = self.pages[0]
            self.ShowPage(nowPage)


# ----------- program entrance -------------
print u"""
---------------------------------------
   Program: Read call transfer
   Version: 0.1
   Author: angryrookie
   Date: 2014-07-05
   Language: Python 2.7
   Function: press Enter to browse the chapter
---------------------------------------
"""

print u'Press Enter:'
raw_input()
myBook = Book_Spider()
myBook.Start()
After running the program, the output looks good to me. If you don't believe it, take a look: ^_^
Naturally, the next step is to crawl the entire novel. First, we need to extend the program beyond a single chapter, so that after reading one chapter we can continue to the next.
Note that each chapter page has a link to the next page. Viewing the page source and tidying it up a little (not shown), we can see that this part of the HTML has the following format: