Write a Python script to capture online novels and build your own reader

Source: Internet
Author: User

Ever been frustrated by an online novel that can only be read in the browser, with no way to download it? Or found an article you really wanted to keep, but couldn't find a download link anywhere? Ever had the urge to just write a program yourself and take care of the whole thing? Have you learned Python and been itching for something to show off, so everyone will say "this guy is awesome!"? Then let's get started! Haha ~
Okay, I've been writing a lot of Yii lately, so I felt like finding something different to unwind with...

This project is for research purposes only; all copyright remains with the original authors, and readers who want to read pirated novels will have to face that on their own!
With that said, what we are going to do is crawl the text of a novel from its web pages. Our research subject is the Quanben novel site... and I declare once more that I take no responsibility for any copyright issues...
The most basic step at the beginning is to capture the content of a single chapter.

Environment: Ubuntu, Python 2.7

Basic knowledge
There are several knowledge points involved in this program; they are listed here, and a quick Baidu search will turn up plenty of material on any of them.
1. The Request object of the urllib2 module, used to build the HTTP request, including the URL to capture and the User-Agent header that disguises the script as a browser. The urlopen and read methods that follow are straightforward.
2. The chardet module, used to detect the webpage encoding. Scraped pages easily come back as garbled text, so we use chardet's detect function to determine whether a page is GBK-encoded or UTF-8. Windows users can download it from http://download.csdn.net/detail/jcjc918/8231371 and simply unzip it into the Python lib directory.
3. The decode function converts a string from a given encoding into unicode characters, while encode converts unicode characters into a string in the specified encoding.
4. Regular expressions with the re module. The search function finds a match for a regular expression, while replace (or re.sub) substitutes the matched text.
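Putting points 1 through 3 together, here is a small sketch of the fetch-detect-convert sequence that every program below starts with (the gb2312 fallback matches what this particular site serves; for other sites treat it as an assumption and adjust):

# -*- coding: utf-8 -*-
import urllib2
import chardet

url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # disguise the script as a browser
req = urllib2.Request(url, headers=headers)
raw = urllib2.urlopen(req).read()            # raw bytes, encoding unknown
charset = chardet.detect(raw)['encoding']    # e.g. 'GB2312' or 'utf-8'
if charset and charset.lower() != 'utf-8':
    # assumption: non-UTF-8 pages on this site are GBK-family, so gb2312 is close enough
    raw = raw.decode('gb2312', 'ignore').encode('utf-8')
text = raw.decode('utf-8')                   # now a unicode string, ready for re.search and replace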

Thought process:
The URL we picked is http://www.quanben.com/xiaoshuo/0/910/59302.html, the first chapter of Douluo Continent. If you view the page source, you will find that the entire chapter text sits inside a single content tag, so a regular expression can match and capture that tag. If you print the captured content you will see a lot of <br/> tags and &nbsp; entities: replace each <br/> with a line break, and each &nbsp; (which is just the page's placeholder for a space) with a plain space, and the chapter text comes out clean. For completeness, we also use a regular expression to grab the title.

Program

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []

    # capture a chapter
    def GetPage(self):
        myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(request)
        myPage = myResponse.read()
        # Check the character encoding of the webpage, and convert it to UTF-8
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        try:
            # capture the title
            my_title = re.search('
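The listing above is cut off at the title regex, so here is a complete minimal sketch of the same single-chapter grab. The two regular expressions (an <h1> title and a div with id="content") are assumptions about quanben.com's markup rather than the original patterns; check the page source and adjust them if they differ:

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

def get_chapter(url):
    # fetch the page and normalize it to a unicode string
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    charset = chardet.detect(raw)['encoding']
    if charset and charset.lower() != 'utf-8':
        raw = raw.decode('gb2312', 'ignore').encode('utf-8')
    page = raw.decode('utf-8')
    # assumed patterns: an <h1> title and the id="content" div holding the chapter text
    title_match = re.search(u'<h1>(.*?)</h1>', page, re.S)
    content_match = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S)
    title = title_match.group(1) if title_match else u'Untitled'
    content = content_match.group(1) if content_match else u''
    # <br/> becomes a line break, &nbsp; becomes a plain space
    content = re.sub(u'<br\s*/?>', u'\n', content)
    content = content.replace(u'&nbsp;', u' ')
    return title, content

if __name__ == '__main__':
    title, content = get_chapter("http://www.quanben.com/xiaoshuo/0/910/59302.html")
    print title.encode('utf-8')
    print content.encode('utf-8')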

The output looks pretty nice after running the program. If you don't believe me, run it and read for yourself: ^_^

Naturally, the next step is to crawl the whole novel. First we extend the one-chapter program, so that after finishing a chapter we can go on and read the next one.
Note that each chapter page carries a link to the next page. Looking at the page source and tidying it up a little (not all of it shown), this part of the HTML has the following format:

<div id="footlink">
  <script type="text/javascript" charset="UTF-8" src="/scripts/style5.js"></script>
  <a href="http://www.quanben.com/xiaoshuo/0/910/59301.html">previous page</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/">back to directory</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/59303.html">next page</a>
</div>

The "previous page", "back to directory", and "next page" links all sit in a div with the id footlink. If we tried to match every link on the page we would pick up a pile of unrelated ones, but there is only one footlink div! So we first match and capture that div, then match the <a> links inside it; that leaves exactly three, and the last one is the URL of the next page. We use that URL to update the target URL we are crawling, so we can keep capturing page after page (a sketch of this two-step match follows below). The reading logic for the user is: after a chapter is shown, wait for input; if it is "quit", exit the program, otherwise display the next chapter.
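A sketch of that two-step match, written against the sample HTML above (if the real markup differs, the patterns need adjusting):

# -*- coding: utf-8 -*-
import re

def find_next_url(unicode_page):
    # step 1: grab the one and only footlink div
    footlink = re.search(u'<div id="footlink">(.*?)</div>', unicode_page, re.S)
    if not footlink:
        return None
    # step 2: match the <a> links inside it -- previous page, back to directory, next page
    links = re.findall(u'<a href="(.*?)">', footlink.group(1))
    return links[-1] if links else None   # the last one is the next page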

Basic knowledge:
Everything from the previous part, plus the Python thread module.

Source code:

# -*- coding: utf-8 -*-
import urllib2
import re
import thread
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/10/10412/2095096.html"

    # capture a chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        # Find the div tag of id="content"
        try:
            # capture the title
            my_title = re.search('
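That listing is cut off as well, so below is a sketch of how the interactive reader can be structured: a background thread started with thread.start_new_thread keeps prefetching chapters into self.pages while the main loop displays them and waits for input. The parsing regexes are the same assumed patterns as before, and the buffering details (two chapters ahead, one-second polling) are choices made for this sketch, not necessarily the original program:

# -*- coding: utf-8 -*-
import urllib2
import re
import time
import thread
import chardet

class Book_Spider:

    def __init__(self, start_url):
        self.pages = []      # prefetched (title, content) pairs
        self.url = start_url
        self.flag = True     # set to False to stop the background thread

    def fetch(self, url):
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
        if (chardet.detect(raw)['encoding'] or '').lower() != 'utf-8':
            raw = raw.decode('gb2312', 'ignore').encode('utf-8')
        page = raw.decode('utf-8')
        # assumed patterns for the title, the content div, and the footlink div
        title = re.search(u'<h1>(.*?)</h1>', page, re.S).group(1)
        content = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S).group(1)
        content = re.sub(u'<br\s*/?>', u'\n', content).replace(u'&nbsp;', u' ')
        footlink = re.search(u'<div id="footlink">(.*?)</div>', page, re.S).group(1)
        next_url = re.findall(u'<a href="(.*?)">', footlink)[-1]
        return title, content, next_url

    def loader(self):
        # background thread: keep at most two chapters buffered ahead of the reader
        while self.flag:
            if len(self.pages) < 2:
                title, content, next_url = self.fetch(self.url)
                self.pages.append((title, content))
                self.url = next_url
            else:
                time.sleep(1)

    def start(self):
        thread.start_new_thread(self.loader, ())
        while True:
            while not self.pages:
                time.sleep(0.5)          # wait for the loader to catch up
            title, content = self.pages.pop(0)
            print title.encode('utf-8')
            print content.encode('utf-8')
            if raw_input("Press Enter for the next chapter, or type 'quit': ") == 'quit':
                self.flag = False
                break

if __name__ == '__main__':
    Book_Spider("http://www.quanben.com/xiaoshuo/10/10412/2095096.html").start()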

Now all that is left is to capture the novel we want into a local txt file and then open it in a reader of our choice.

In fact, the previous program already contains most of the logic for this last one. The main change is that each captured chapter is no longer displayed but written to a txt file. The other question is: the program keeps crawling by following the next-page URL, so when does it stop? Notice that on the last chapter of the novel, the "next page" link is the same as the "back to directory" link. So every time we capture a page we extract both links, and when the two are identical we stop crawling. Finally, this program does not need multiple threads; a single thread that keeps fetching chapter pages is enough.
Of course, when a novel has a lot of chapters this may take quite a while to finish; we won't worry about that for now. As long as the basic functionality works, that's OK...

Basic knowledge: the previous basic knowledge, minus the multithreading, plus basic file operations.

Source code:

# -*- coding: utf-8 -*-
import urllib2
import urllib
import re
import thread
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []
        self.page = 1
        self.flag = True
        self.url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"

    # capture a chapter
    def GetPage(self):
        myUrl = self.url
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        charset = chardet.detect(myPage)
        charset = charset['encoding']
        if charset == 'utf-8' or charset == 'UTF-8':
            myPage = myPage
        else:
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode("utf-8")
        # Find the div tag of id="content"
        try:
            # capture the title
            my_title = re.search('
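Since this listing is truncated too, here is a minimal single-threaded sketch of the whole-book downloader, following the logic described above: fetch a chapter, append the title and text to a txt file, and stop when the "next page" link equals the "back to directory" link. The regexes, and the assumption that the footlink div always lists its links in the order previous page / directory / next page, come from the sample HTML shown earlier:

# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

def fetch(url):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    raw = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    if (chardet.detect(raw)['encoding'] or '').lower() != 'utf-8':
        raw = raw.decode('gb2312', 'ignore').encode('utf-8')
    page = raw.decode('utf-8')
    title = re.search(u'<h1>(.*?)</h1>', page, re.S).group(1)
    content = re.search(u'<div id="content"[^>]*>(.*?)</div>', page, re.S).group(1)
    content = re.sub(u'<br\s*/?>', u'\n', content).replace(u'&nbsp;', u' ')
    footlink = re.search(u'<div id="footlink">(.*?)</div>', page, re.S).group(1)
    links = re.findall(u'<a href="(.*?)">', footlink)
    # assumed order: previous page, back to directory, next page
    return title, content, links[-2], links[-1]

def download_book(start_url, filename):
    url = start_url
    out = open(filename, 'w')
    while True:
        title, content, directory_url, next_url = fetch(url)
        print 'saved:', title.encode('utf-8')
        out.write(title.encode('utf-8') + '\n\n')
        out.write(content.encode('utf-8') + '\n\n')
        if next_url == directory_url:    # last chapter: "next page" points back to the directory
            break
        url = next_url
    out.close()

if __name__ == '__main__':
    download_book("http://www.quanben.com/xiaoshuo/0/910/59302.html", "book.txt")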
