Python 3 + BeautifulSoup 4.6 crawling a novel website (II): basic function design


What this chapter covers:
1. Restoring the page encoding when reading
2. Basic function design

STEP 1: restoring the page encoding when reading

The page to fetch:
http://www.cuiweijuxs.com/jingpinxiaoshuo/

Following on from part (I), the first piece of crawling code:
# -*- coding: utf-8 -*-
from urllib import request

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)

This prints output such as:

b'<!doctype html>\r\n ...

Content like this, with the b'...' prefix, means response.read() returned raw bytes that still need decoding; it is an encoding-format problem. As already discussed in the earlier article on garbled output from ZipFile decompression, you need to look at the <head> of the HTML page, where the declared encoding turns out to be GBK.

For details, see http://www.cnblogs.com/yaoshen/p/8671344.html.
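As an aside (not from the original article), checking the declared charset can also be done in code. A minimal sketch; the helper name sniff_meta_charset is invented for this example:

import re

def sniff_meta_charset(raw_bytes):
    # look in the first kilobyte for e.g. <meta charset="gbk">
    # or <meta http-equiv="Content-Type" content="text/html; charset=gbk">
    head = raw_bytes[:1024].decode('ascii', errors='ignore').lower()
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', head)
    return match.group(1) if match else None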

Another option is to detect the encoding programmatically with chardet (not in the standard library; install it first, e.g. with pip install chardet):

charset = chardet.detect(html)
print(charset)

The detected result: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

Decoding with GB2312 turns out to be problematic, however; after some trial and error, GBK works better, since it covers a larger character set (GBK is a superset of GB2312).

Rewrite the decoding code as follows:

# try:
html = html.decode('GBK')
# except:
#     html = html.decode('utf-8')
print(html)
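The commented-out lines hint at a fallback strategy. A minimal runnable sketch of that idea (assuming html holds the raw bytes returned by response.read()):

try:
    html = html.decode('GBK')
except UnicodeDecodeError:
    # the page isn't valid GBK after all; fall back to UTF-8
    html = html.decode('utf-8')
print(html)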

The complete code is as follows:

# -*- coding: utf-8 -*-
from urllib import request
import chardet

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)

    # check the page's encoding format
    charset = chardet.detect(html)
    print(charset)

    # view the page content
    # try:
    html = html.decode('GBK')
    # except:
    #     html = html.decode('utf-8')
    print(html)

STEP 2: basic function design

Create a class named Capture and define the basic functions: initialization (__init__), reading (readhtml), saving (savehtml), and so on; then create a run method that ties them together.
Finally, call Capture().run().
(1) The __init__ method (double underscores) initializes the parameters:
    def __init__(self):
        # define the URL to crawl
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # define the headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}

(2) Wrap the page reading in a method that returns the decoded HTML:

    def readhtml(self):
        # CSDN, for example, is unreachable if you don't change the User-Agent
        # create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # pass in the created Request object
        response = request.urlopen(req)
        # read the response and decode it
        html = response.read().decode('GBK')
        # print the page content
        print(html)
        return html

(3) Write the fetched page to a file, encoded as UTF-8:

    def savehtml(self, file_name, file_content):
        file_object = open(file_name, 'w', encoding='utf-8')
        file_object.write(file_content)
        file_object.close()
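A small design note: the idiomatic way to write this in Python is a with statement, which closes the file even if write() raises. A sketch of the equivalent method:

    def savehtml(self, file_name, file_content):
        # the context manager closes the file automatically
        with open(file_name, 'w', encoding='utf-8') as file_object:
            file_object.write(file_content)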

(4) The run method reads the page and then saves it:

    def run(self):
        try:
            html = self.readhtml()
            self.savehtml('test.html', html)
        except BaseException as error:
            print(error)

Capture().run()
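One caveat worth noting: BaseException also swallows KeyboardInterrupt and SystemExit, so if the intent is only to log ordinary errors, catching Exception is usually the safer choice. A minimal variant:

    def run(self):
        try:
            html = self.readhtml()
            self.savehtml('test.html', html)
        except Exception as error:
            # Exception, unlike BaseException, lets KeyboardInterrupt/SystemExit propagate
            print(error)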

The complete code is as follows:

# -*- coding: utf-8 -*-
from urllib import request


class Capture:

    def __init__(self):
        # define the URL to crawl
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # define the headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}

    def readhtml(self):
        # CSDN, for example, is unreachable if you don't change the User-Agent
        # create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # pass in the created Request object
        response = request.urlopen(req)
        # read the response and decode it
        html = response.read().decode('GBK')
        # print the page content
        print(html)
        return html

    def savehtml(self, file_name, file_content):
        file_object = open(file_name, 'w', encoding='utf-8')
        file_object.write(file_content)
        file_object.close()

    def run(self):
        try:
            html = self.readhtml()
            self.savehtml('test.html', html)
        except BaseException as error:
            print(error)


Capture().run()
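To tie STEP 1 back in, readhtml could use chardet instead of hard-coding 'GBK'. A hedged sketch (assuming chardet is installed; the function name read_html_detected is invented for this example):

import chardet
from urllib import request

def read_html_detected(url, headers):
    # fetch the raw bytes, detect the encoding, then decode
    req = request.Request(url, headers=headers)
    raw = request.urlopen(req).read()
    detected = chardet.detect(raw)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
    encoding = detected['encoding'] or 'utf-8'  # fall back if detection fails
    return raw.decode(encoding, errors='replace')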


