What to learn in this chapter:
1. Reading a page with the correct encoding
2. Function design

Step 1: reading a page with the correct encoding
The page to crawl:
http://www.cuiweijuxs.com/jingpinxiaoshuo/
Start with a first piece of code to fetch it:
# -*- coding: utf-8 -*-
from urllib import request

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)
This prints raw bytes such as
b'<!doctype html>\r\n...
which is an encoding problem. The same issue came up in the earlier article on garbled ZipFile output: look at the <head> of the HTML page, and you can see the declared encoding is GBK.
For details see http://www.cnblogs.com/yaoshen/p/8671344.html
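Checking the <head> by eye can also be automated. Below is a minimal sketch (the function name and the 2048-byte window are my own choices) that scans the start of the raw bytes for a charset declaration:

```python
import re

def sniff_charset(raw_bytes, default='utf-8'):
    # decode only the head loosely as ASCII; charset declarations are ASCII anyway
    head = raw_bytes[:2048].decode('ascii', errors='ignore').lower()
    # matches both <meta charset="..."> and content="text/html; charset=..."
    match = re.search(r'charset\s*=\s*["\']?\s*([\w-]+)', head)
    return match.group(1) if match else default

sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk" /></head>'
print(sniff_charset(sample))  # gbk
```

Real pages can declare the charset in other ways (HTTP headers, BOM), so this is a heuristic, not a replacement for chardet.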
Another way is to detect the encoding programmatically with chardet (a third-party library, so it must be installed first):

charset = chardet.detect(html)
print(charset)

which reports: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
Decoding with GB2312 turned out to be problematic for some characters; after some trial and error, GBK worked better, since it is a superset of GB2312 and covers more characters.
Rewrite the decoding part as follows:

# try:
html = html.decode('GBK')
# except UnicodeDecodeError:
#     html = html.decode('utf-8')
print(html)
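The commented-out try/except can be generalized into a small helper (a sketch; the function name and the encoding order are my own) that tries candidate encodings in turn:

```python
def decode_with_fallback(raw_bytes, encodings=('gbk', 'utf-8')):
    # try each candidate encoding in order; GBK first, since it is a
    # superset of GB2312 and usually covers these pages
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %s could decode the page' % (encodings,))

print(decode_with_fallback('精品小说'.encode('gbk')))  # 精品小说
```

This keeps the crawl running even if a site declares one encoding but serves another.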
The complete code is as follows:
# -*- coding: utf-8 -*-
from urllib import request
import chardet

if __name__ == "__main__":
    chaper_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url=chaper_url, headers=headers)
    response = request.urlopen(req)
    html = response.read()
    print(html)

    # check the page encoding
    charset = chardet.detect(html)
    print(charset)

    # decode and view the page content
    # try:
    html = html.decode('GBK')
    # except UnicodeDecodeError:
    #     html = html.decode('utf-8')
    print(html)
Step 2: basic function design
Create a class, Capture, that defines initialization (__init__), reading (readhtml), saving (savehtml), and the other basic functions, then add a run method that ties them together.
Finally, call Capture().run() to execute it.
(1) The __init__ method (double underscores) initializes the parameters
def __init__(self):
    # the URL to crawl
    self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
    # request headers
    self.head = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}
(2) Wrap page reading in a method that returns the decoded HTML
def readhtml(self):
    # CSDN, for example, is unreachable if the User-Agent is not set
    # create the Request object
    print(self.init_url)
    req = request.Request(self.init_url, headers=self.head)
    # pass in the created Request object
    response = request.urlopen(req)
    # read the response and decode it
    html = response.read().decode('GBK')
    # print the page
    print(html)
    return html
(3) Write the fetched page to a file as UTF-8
def savehtml(self, file_name, file_content):
    file_object = open(file_name, 'w', encoding='utf-8')
    file_object.write(file_content)
    file_object.close()
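One subtlety: the page was decoded from GBK but saved as UTF-8, while the HTML itself may still declare charset=gbk in its <meta> tag, so a browser opening test.html could garble it again. A possible fix (a sketch; the helper name is my own) is to patch the declaration before saving:

```python
import re

def fix_meta_charset(html_text, new_charset='utf-8'):
    # rewrite any charset=... declaration so it matches the file's actual encoding
    return re.sub(r'charset\s*=\s*["\']?[\w-]+', 'charset=' + new_charset,
                  html_text, flags=re.IGNORECASE)

print(fix_meta_charset('<meta http-equiv="Content-Type" content="text/html; charset=gbk">'))
```

Calling this on file_content just before savehtml writes it keeps the saved page self-consistent.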
(4) The run method reads the page and then saves it
def run(self):
    try:
        html = self.readhtml()
        self.savehtml('test.html', html)
    except BaseException as error:
        print(error)

Capture().run()
The complete code is as follows:
# -*- coding: utf-8 -*-
from urllib import request


class Capture:

    def __init__(self):
        # the URL to crawl
        self.init_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        # request headers
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}

    def readhtml(self):
        # CSDN, for example, is unreachable if the User-Agent is not set
        # create the Request object
        print(self.init_url)
        req = request.Request(self.init_url, headers=self.head)
        # pass in the created Request object
        response = request.urlopen(req)
        # read the response and decode it
        html = response.read().decode('GBK')
        # print the page
        print(html)
        return html

    def savehtml(self, file_name, file_content):
        file_object = open(file_name, 'w', encoding='utf-8')
        file_object.write(file_content)
        file_object.close()

    def run(self):
        try:
            html = self.readhtml()
            self.savehtml('test.html', html)
        except BaseException as error:
            print(error)


Capture().run()
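The series title mentions BeautifulSoup, which later parts use for parsing. As a preview, even the standard library's html.parser can pull links out of the saved page (a sketch; the link layout of the real site is an assumption, and this is a stand-in for BeautifulSoup's find_all('a')):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # collect every href attribute seen on an <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkParser()
parser.feed('<html><body><a href="/book/1.html">Chapter 1</a>'
            '<a href="/book/2.html">Chapter 2</a></body></html>')
print(parser.links)  # ['/book/1.html', '/book/2.html']
```

Feeding it the html string returned by readhtml would list the chapter links to crawl next.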
python3 + BeautifulSoup 4.6: crawling a novel site (2) – basic function design