What to learn in this chapter:
1. Reading a page with the correct encoding
2. Function design

Step 1: reading a page with the correct encoding
The page to crawl:
http://www.cuiweijuxs.com/jingpinxiaoshuo/
Start with a first piece of code to fetch it:
# -*- coding: utf-8 -*-
from urllib import request

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)
This prints raw bytes such as
b'<!doctype html>\r\n...
which is an encoding problem. The same issue came up in the earlier article on garbled ZipFile output: look at the <head> of the HTML page, and you can see the declared encoding is GBK.
For details see http://www.cnblogs.com/yaoshen/p/8671344.html
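Checking the <head> by eye can also be automated. Below is a minimal sketch (the function name and the 2048-byte window are my own choices) that scans the start of the raw bytes for a charset declaration:

```python
import re

def sniff_charset(raw_bytes, default='utf-8'):
    # decode only the head loosely as ASCII; charset declarations are ASCII anyway
    head = raw_bytes[:2048].decode('ascii', errors='ignore').lower()
    # matches both <meta charset="..."> and content="text/html; charset=..."
    match = re.search(r'charset\s*=\s*["\']?\s*([\w-]+)', head)
    return match.group(1) if match else default

sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk" /></head>'
print(sniff_charset(sample))  # gbk
```

Real pages can declare the charset in other ways (HTTP headers, BOM), so this is a heuristic, not a replacement for chardet.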
Another way is to detect the encoding programmatically with chardet (a third-party library, so it must be installed first):

charset = chardet.detect(html)
print(charset)

which reports: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
Decoding with GB2312 turned out to be problematic for some characters; after some trial and error, GBK worked better, since it is a superset of GB2312 and covers more characters.
Rewrite the decoding part as follows:

# try:
html = html.decode('GBK')
# except UnicodeDecodeError:
#     html = html.decode('utf-8')
print(html)
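The commented-out try/except can be generalized into a small helper (a sketch; the function name and the encoding order are my own) that tries candidate encodings in turn:

```python
def decode_with_fallback(raw_bytes, encodings=('gbk', 'utf-8')):
    # try each candidate encoding in order; GBK first, since it is a
    # superset of GB2312 and usually covers these pages
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %s could decode the page' % (encodings,))

print(decode_with_fallback('精品小说'.encode('gbk')))  # 精品小说
```

This keeps the crawl running even if a site declares one encoding but serves another.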
The complete code is as follows:
# -*- coding: utf-8 -*-
from urllib import request
import chardet

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)

    # check the page encoding
    charset = chardet.detect(html)
    print(charset)

    # decode and view the page content
    # try:
    html = html.decode('GBK')
    # except UnicodeDecodeError:
    #     html = html.decode('utf-8')
    print(html)
Step 2: basic function design
Create a class, Capture, that defines initialization (__init__), reading (readhtml), saving (savehtml), and the other basic functions, then add a run method that ties them together.
Finally, call Capture().run() to execute it.
(1) The __init__ method (double underscores) initializes the parameters
def __init__(self):
    # the URL to crawl
    self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
    # request headers
    self.head = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}
(2) Wrap page reading in a method that returns the decoded HTML
def readhtml(self):
    # CSDN, for example, is unreachable if the User-Agent is not set
    # create the Request object
    print(self.init_url)
    req = request.Request(self.init_url, headers=self.head)
    # pass in the created Request object
    response = request.urlopen(req)
    # read the response and decode it
    html = response.read().decode('GBK')
    # print the page
    print(html)
    return html
(3) Write the fetched page to a file as UTF-8
def savehtml(self, file_name, file_content):
    file_object = open(file_name, 'w', encoding='utf-8')
    file_object.write(file_content)
    file_object.close()
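One subtlety: the page was decoded from GBK but saved as UTF-8, while the HTML itself may still declare charset=gbk in its <meta> tag, so a browser opening test.html could garble it again. A possible fix (a sketch; the helper name is my own) is to patch the declaration before saving:

```python
import re

def fix_meta_charset(html_text, new_charset='utf-8'):
    # rewrite any charset=... declaration so it matches the file's actual encoding
    return re.sub(r'charset\s*=\s*["\']?[\w-]+', 'charset=' + new_charset,
                  html_text, flags=re.IGNORECASE)

print(fix_meta_charset('<meta http-equiv="Content-Type" content="text/html; charset=gbk">'))
```

Calling this on file_content just before savehtml writes it keeps the saved page self-consistent.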
(4) The run method reads the page and then saves it
def run(self):
    try:
        html = self.readhtml()
        self.savehtml('test.html', html)
    except BaseException as error:
        print(error)

Capture().run()
The complete code is as follows:
# -*- coding: utf-8 -*-
from urllib import request


class Capture:

    def __init__(self):
        # the URL to crawl
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # request headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}

    def readhtml(self):
        # CSDN, for example, is unreachable if the User-Agent is not set
        # create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # pass in the created Request object
        response = request.urlopen(req)
        # read the response and decode it
        html = response.read().decode('GBK')
        # print the page
        print(html)
        return html

    def savehtml(self, file_name, file_content):
        file_object = open(file_name, 'w', encoding='utf-8')
        file_object.write(file_content)
        file_object.close()

    def run(self):
        try:
            html = self.readhtml()
            self.savehtml('test.html', html)
        except BaseException as error:
            print(error)


Capture().run()
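The series title mentions BeautifulSoup, which later parts use for parsing. As a preview, even the standard library's html.parser can pull links out of the saved page (a sketch; the link layout of the real site is an assumption, and this is a stand-in for BeautifulSoup's find_all('a')):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # collect every href attribute seen on an <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkParser()
parser.feed('<html><body><a href="/book/1.html">Chapter 1</a>'
            '<a href="/book/2.html">Chapter 2</a></body></html>')
print(parser.links)  # ['/book/1.html', '/book/2.html']
```

Feeding it the html string returned by readhtml would list the chapter links to crawl next.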
python3 + BeautifulSoup 4.6: crawling a novel site (2) – basic function design