Write a Python script to capture online novels and build your own reader

Source: Internet
Author: User

Ever been frustrated by an online novel that can only be read in the browser, with no way to download it? Or found an article you really wanted to keep, but couldn't find a download link anywhere? Ever had the urge to just write a program yourself and take care of the whole thing? Have you learned Python and been itching for something to show off, so everyone will say "this guy is awesome!"? Then let's get started! Haha ~
Okay, I've been writing a lot of Yii lately, so I felt like finding something different to unwind with...

This project is for research purposes only; all copyright remains with the original authors, and readers who want to read pirated novels will have to face that on their own!
With that said, what we are going to do is crawl the text of a novel from its web pages. Our research subject is the Quanben novel site... and I declare once more that I take no responsibility for any copyright issues...
The most basic step at the beginning is to capture the content of a single chapter.

Environment: Ubuntu, Python 2.7

Basic knowledge
There are several knowledge points involved in this program; they are listed here, and a quick Baidu search will turn up plenty of material on any of them.
1. The Request object of the urllib2 module, used to build the HTTP request, including the URL to capture and the User-Agent header that disguises the script as a browser. The urlopen and read methods that follow are straightforward.
2. The chardet module, used to detect the webpage encoding. Scraped pages easily come back as garbled text, so we use chardet's detect function to determine whether a page is GBK-encoded or UTF-8. Windows users can download it from http://download.csdn.net/detail/jcjc918/8231371 and simply unzip it into the Python lib directory.
3. The decode function converts a string from a given encoding into unicode characters, while encode converts unicode characters into a string in the specified encoding.
4. Regular expressions with the re module. The search function finds a match for a regular expression, while replace (or re.sub) substitutes the matched text.
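Putting points 1 through 3 together, here is a small sketch of the fetch-detect-convert sequence that every program below starts with (the gb2312 fallback matches what this particular site serves; for other sites treat it as an assumption and adjust):

# -*- coding: utf-8 -*-
import urllib2
import chardet

url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # disguise the script as a browser
req = urllib2.Request(url, headers=headers)
raw = urllib2.urlopen(req).read()            # raw bytes, encoding unknown
charset = chardet.detect(raw)['encoding']    # e.g. 'GB2312' or 'utf-8'
if charset and charset.lower() != 'utf-8':
    # assumption: non-UTF-8 pages on this site are GBK-family, so gb2312 is close enough
    raw = raw.decode('gb2312', 'ignore').encode('utf-8')
text = raw.decode('utf-8')                   # now a unicode string, ready for re.search and replace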

Thought process:
The URL we picked is http://www.quanben.com/xiaoshuo/0/910/59302.html, the first chapter of Douluo Continent. If you view the page source, you will find that the entire chapter text sits inside a single content tag, so a regular expression can match and capture that tag. If you print the captured content you will see a lot of <br/> tags and &nbsp; entities: replace each <br/> with a line break, and each &nbsp; (which is just the page's placeholder for a space) with a plain space, and the chapter text comes out clean. For completeness, we also use a regular expression to grab the title.

Program

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []

    # capture a chapter
    def GetPage(self):
        myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(request)
        myPage = myResponse.read()
        # Check the character encoding of the webpage, and convert it to UTF-8
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        try:
            # capture the title
            my_title = re.search('
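The listing above is cut off at the title regex, so here is a complete minimal sketch of the same single-chapter grab. The two regular expressions (an <h1> title and a div with id="content") are assumptions about quanben.com's markup rather than the original patterns; check the page source and adjust them if they differ:

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

def get_chapter(url):
    # fetch the page and normalize it to a unicode string
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    charset = chardet.detect(raw)['encoding']
    if charset and charset.lower() != 'utf-8':
        raw = raw.decode('gb2312', 'ignore').encode('utf-8')
    page = raw.decode('utf-8')
    # assumed patterns: an <h1> title and the id="content" div holding the chapter text
    title_match = re.search(u'<h1>(.*?)</h1>', page, re.S)
    content_match = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S)
    title = title_match.group(1) if title_match else u'Untitled'
    content = content_match.group(1) if content_match else u''
    # <br/> becomes a line break, &nbsp; becomes a plain space
    content = re.sub(u'<br\s*/?>', u'\n', content)
    content = content.replace(u'&nbsp;', u' ')
    return title, content

if __name__ == '__main__':
    title, content = get_chapter("http://www.quanben.com/xiaoshuo/0/910/59302.html")
    print title.encode('utf-8')
    print content.encode('utf-8')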

The output looks pretty nice after running the program. If you don't believe me, run it and read for yourself: ^_^

Naturally, the next step is to crawl the whole novel. First we extend the one-chapter program, so that after finishing a chapter we can go on and read the next one.
Note that each chapter page carries a link to the next page. Looking at the page source and tidying it up a little (not all of it shown), this part of the HTML has the following format:

<div id="footlink">
  <script type="text/javascript" charset="UTF-8" src="/scripts/style5.js"></script>
  <a href="http://www.quanben.com/xiaoshuo/0/910/59301.html">previous page</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/">back to directory</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/59303.html">next page</a>
</div>

The "previous page", "back to directory", and "next page" links all sit in a div with the id footlink. If we tried to match every link on the page we would pick up a pile of unrelated ones, but there is only one footlink div! So we first match and capture that div, then match the <a> links inside it; that leaves exactly three, and the last one is the URL of the next page. We use that URL to update the target URL we are crawling, so we can keep capturing page after page (a sketch of this two-step match follows below). The reading logic for the user is: after a chapter is shown, wait for input; if it is "quit", exit the program, otherwise display the next chapter.
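A sketch of that two-step match, written against the sample HTML above (if the real markup differs, the patterns need adjusting):

# -*- coding: utf-8 -*-
import re

def find_next_url(unicode_page):
    # step 1: grab the one and only footlink div
    footlink = re.search(u'<div id="footlink">(.*?)</div>', unicode_page, re.S)
    if not footlink:
        return None
    # step 2: match the <a> links inside it -- previous page, back to directory, next page
    links = re.findall(u'<a href="(.*?)">', footlink.group(1))
    return links[-1] if links else None   # the last one is the next page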

Basic knowledge:
Everything from the previous part, plus the Python thread module.

Source code:

# -*- coding: utf-8 -*-
import urllib2
import re
import thread
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/10/10412/2095096.html"

    # capture a chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        # Find the div tag of id="content"
        try:
            # capture the title
            my_title = re.search('
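That listing is cut off as well, so below is a sketch of how the interactive reader can be structured: a background thread started with thread.start_new_thread keeps prefetching chapters into self.pages while the main loop displays them and waits for input. The parsing regexes are the same assumed patterns as before, and the buffering details (two chapters ahead, one-second polling) are choices made for this sketch, not necessarily the original program:

# -*- coding: utf-8 -*-
import urllib2
import re
import time
import thread
import chardet

class Book_Spider:

    def __init__(self, start_url):
        self.pages = []      # prefetched (title, content) pairs
        self.url = start_url
        self.flag = True     # set to False to stop the background thread

    def fetch(self, url):
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
        if (chardet.detect(raw)['encoding'] or '').lower() != 'utf-8':
            raw = raw.decode('gb2312', 'ignore').encode('utf-8')
        page = raw.decode('utf-8')
        # assumed patterns for the title, the content div, and the footlink div
        title = re.search(u'<h1>(.*?)</h1>', page, re.S).group(1)
        content = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S).group(1)
        content = re.sub(u'<br\s*/?>', u'\n', content).replace(u'&nbsp;', u' ')
        footlink = re.search(u'<div id="footlink">(.*?)</div>', page, re.S).group(1)
        next_url = re.findall(u'<a href="(.*?)">', footlink)[-1]
        return title, content, next_url

    def loader(self):
        # background thread: keep at most two chapters buffered ahead of the reader
        while self.flag:
            if len(self.pages) < 2:
                title, content, next_url = self.fetch(self.url)
                self.pages.append((title, content))
                self.url = next_url
            else:
                time.sleep(1)

    def start(self):
        thread.start_new_thread(self.loader, ())
        while True:
            while not self.pages:
                time.sleep(0.5)          # wait for the loader to catch up
            title, content = self.pages.pop(0)
            print title.encode('utf-8')
            print content.encode('utf-8')
            if raw_input("Press Enter for the next chapter, or type 'quit': ") == 'quit':
                self.flag = False
                break

if __name__ == '__main__':
    Book_Spider("http://www.quanben.com/xiaoshuo/10/10412/2095096.html").start()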

Now all that is left is to capture the novel we want into a local txt file and then open it in a reader of our choice.

In fact, the previous program already contains most of the logic for this last one. The main change is that each captured chapter is no longer displayed but written to a txt file. The other question is: the program keeps crawling by following the next-page URL, so when does it stop? Notice that on the last chapter of the novel, the "next page" link is the same as the "back to directory" link. So every time we capture a page we extract both links, and when the two are identical we stop crawling. Finally, this program does not need multiple threads; a single thread that keeps fetching chapter pages is enough.
Of course, when a novel has a lot of chapters this may take quite a while to finish; we won't worry about that for now. As long as the basic functionality works, that's OK...

Basic knowledge: the previous basic knowledge, minus the multithreading, plus basic file operations.

Source code:

# -*- coding: utf-8 -*-
import urllib2
import urllib
import re
import thread
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # capture a chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        # Find the div tag of id="content"
        try:
            # capture the title
            my_title = re.search('
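Since this listing is truncated too, here is a minimal single-threaded sketch of the whole-book downloader, following the logic described above: fetch a chapter, append the title and text to a txt file, and stop when the "next page" link equals the "back to directory" link. The regexes, and the assumption that the footlink div always lists its links in the order previous page / directory / next page, come from the sample HTML shown earlier:

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

def fetch(url):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    if (chardet.detect(raw)['encoding'] or '').lower() != 'utf-8':
        raw = raw.decode('gb2312', 'ignore').encode('utf-8')
    page = raw.decode('utf-8')
    title = re.search(u'<h1>(.*?)</h1>', page, re.S).group(1)
    content = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S).group(1)
    content = re.sub(u'<br\s*/?>', u'\n', content).replace(u'&nbsp;', u' ')
    footlink = re.search(u'<div id="footlink">(.*?)</div>', page, re.S).group(1)
    links = re.findall(u'<a href="(.*?)">', footlink)
    # assumed order: previous page, back to directory, next page
    return title, content, links[-2], links[-1]

def download_book(start_url, filename):
    url = start_url
    out = open(filename, 'w')
    while True:
        title, content, directory_url, next_url = fetch(url)
        print 'saved:', title.encode('utf-8')
        out.write(title.encode('utf-8') + '\n\n')
        out.write(content.encode('utf-8') + '\n\n')
        if next_url == directory_url:    # last chapter: "next page" points back to the directory
            break
        url = next_url
    out.close()

if __name__ == '__main__':
    download_book("http://www.quanben.com/xiaoshuo/0/910/59302.html", "book.txt")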
