Write a Python script to scrape web novels and build your own reader.

Source: Internet
Author: User
This article describes how to write a Python script that scrapes a web novel so you can build your own reader, including how to tidy up the chapter text. It should be of practical use to Python learners! Have you ever been annoyed by novels that can only be read online and offer no download? Have you found an article you wanted to keep, but could not find a download link? Have you felt the urge to write a program yourself and get it all done? Have you learned Python and want something to show off, so you can tell others "my friend is awesome!"? Then let's get started! Haha ~
Okay, I've been writing a lot of Yii recently, and I'd like to find something different as a change of pace...

This project is for research purposes only; all copyright rests with the original authors, and readers who want to read pirated novels do so at their own risk!
With that said, what we have to do is crawl the text of the novel from the webpage. Our research target is the Quanben novel site... I declare again that I take no responsibility for any copyright issues...
The most basic step to start with is capturing the content of a single chapter.

Environment: Ubuntu, Python 2.7

Basic knowledge
This program involves a few knowledge points, listed here; if you have questions about any of them, a quick search will turn up plenty of explanations.
1. The Request object of the urllib2 module is used to build the HTTP request, including the target URL and a User-Agent header that disguises the script as a browser. The urlopen and read methods are then easy to understand.
2. The chardet module is used to detect the webpage's encoding. Scraped pages easily come out as garbled text; to determine whether a page is GBK- or UTF-8-encoded, use chardet's detect function. Windows users can download it from http://download.csdn.net/detail/jcjc918/8231371 and unzip it into Python's lib directory.
3. The decode function converts a byte string in some encoding into unicode characters, while encode converts unicode characters into a byte string in the specified encoding.
4. Regular expressions via the re module. The search function finds one match of a regular expression, while the string replace method substitutes the matched text.
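To make points 3 and 4 concrete, here is a minimal sketch of the decode/encode round trip; the two-character sample text is an illustration chosen for this article, not data taken from the site:

```python
# -*- coding: utf-8 -*-
# Sketch of point 3: decode() turns encoded bytes into unicode text,
# encode() turns unicode text back into bytes in a chosen encoding.
# The sample text below is illustrative, not taken from the site.

text = u'\u5168\u672c'              # two Chinese characters ("quan ben")
gbk_bytes = text.encode('gbk')      # unicode -> GBK bytes (2 bytes per char)
utf8_bytes = text.encode('utf-8')   # unicode -> UTF-8 bytes (3 bytes per char)

# Both byte strings decode back to the same unicode text.
assert gbk_bytes.decode('gbk') == utf8_bytes.decode('utf-8')
print(len(gbk_bytes))   # 4
print(len(utf8_bytes))  # 6
```

This is why the scraper must check the encoding first: the same characters produce different byte strings under GBK and UTF-8, and decoding with the wrong codec yields garbage.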

Train of thought analysis:
The URL we selected is http://www.quanben.com/xiaoshuo/0/910/59302.html, the first chapter of Douluo Continent. Viewing the page source, you can find that a single content tag contains the entire chapter text, so a regular expression can match and capture that tag. When I printed this part of the content, I found it full of <br /> tags and &nbsp; entities. The <br /> tags should be replaced with line breaks, and &nbsp;, which the page uses as a space placeholder, should be replaced with a space. After that, the chapter text looks clean. For completeness, we also use a regular expression to grab the title.
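The cleanup just described can be sketched as follows. The HTML snippet is a made-up miniature of the page, and the `<div id="content">` pattern is an assumption about the site's markup; check the real page source before relying on it:

```python
# -*- coding: utf-8 -*-
import re

# Illustrative snippet only; the site's real markup may differ.
page = '<div id="content">&nbsp;&nbsp;First line<br />Second line</div>'

# re.S lets "." match newlines, so one pattern spans the whole chapter.
m = re.search(r'<div id="content">(.*?)</div>', page, re.S)
content = m.group(1)

# Replace <br /> with line breaks and the &nbsp; placeholders with spaces.
content = content.replace('<br />', '\n').replace('&nbsp;', ' ')
print(content)
```

The same two replace calls appear in the full program below; they are plain string substitutions, so no extra regular expression is needed for the cleanup step.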

Program

Note: the regular-expression patterns for the title and content were garbled in this copy of the article; the <h1> and <div id="content"> patterns below are assumptions based on the description above, so verify them against the page source.

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []

    # Capture a single chapter
    def GetPage(self):
        myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(request)
        myPage = myResponse.read()

        # Check the character encoding of the page and convert it to UTF-8
        charset = chardet.detect(myPage)['encoding']
        if charset != 'utf-8' and charset != 'UTF-8':
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode('utf-8')

        try:
            # Capture the title (tag pattern assumed; see note above)
            my_title = re.search('<h1>(.*?)</h1>', unicodePage, re.S)
            my_title = my_title.group(1)
        except:
            print 'The title HTML has changed. Please analyze it again!'
            return False

        try:
            # Capture the chapter content (tag pattern assumed; see note above)
            my_content = re.search('<div id="content">(.*?)</div>', unicodePage, re.S)
            my_content = my_content.group(1)
        except:
            print 'The content HTML has changed. Please analyze it again!'
            return False

        # Replace <br /> with line breaks and &nbsp; with spaces
        my_content = my_content.replace("<br />", "\n")
        my_content = my_content.replace("&nbsp;", " ")

        # Store the title and content of the chapter
        onePage = {'title': my_title, 'content': my_content}
        return onePage

    # Load a chapter
    def LoadPage(self):
        try:
            # Get a new chapter
            myPage = self.GetPage()
            if myPage == False:
                print 'Failed to capture the chapter!'
                return False
            self.pages.append(myPage)
        except:
            print 'Cannot connect to the server!'

    # Display a chapter
    def ShowPage(self, curPage):
        print curPage['title']
        print curPage['content']

    def Start(self):
        print u'Start reading......\n'
        # Load this page
        self.LoadPage()
        # If the pages array contains an element, show it
        if self.pages:
            nowPage = self.pages[0]
            self.ShowPage(nowPage)

# ----------- Program entrance -------------
print u"""
---------------------------------------
Program: novel reader
Version: 0.1
Author: angryrookie
Date: 2014-07-05
Language: Python 2.7
Function: press Enter to browse the chapter
---------------------------------------
"""
print u'Press Enter:'
raw_input()
myBook = Book_Spider()
myBook.Start()

After running the program, the result looks pretty nice to me. If you don't believe it, try it yourself: ^_^

Naturally, the next step is to crawl the entire novel. First, we need to extend the program beyond a single chapter: after reading one chapter, we should be able to continue to the next.
Note that each chapter page of the novel contains a link to the next page. By viewing the page source and tidying it up a little (not shown here), we can see that this part of the HTML has the following format:
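Since the actual HTML format is not shown in this copy of the article, the sketch below assumes a simple `<a href="...">next page</a>`-style link; the anchor text and attribute layout are assumptions, so adjust the pattern once you have the real markup:

```python
# -*- coding: utf-8 -*-
import re

# Assumed shape of the next-page link; the real site's markup may differ.
page = '<li><a href="59303.html">next page</a></li>'

m = re.search(r'<a href="(.*?)">next page</a>', page)
if m:
    # Chapter pages live under the book's directory (see the URL above),
    # so a relative href is joined onto that base.
    next_url = 'http://www.quanben.com/xiaoshuo/0/910/' + m.group(1)
    print(next_url)
```

Looping GetPage over the URL extracted this way (and stopping when no match is found) is the basic idea for walking through every chapter of the book.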
