The following code targets Python 3.6.*.
When learning to write crawlers, the first library to pick up is urllib, which makes it convenient to fetch the content of web pages. This chapter covers its basic usage.
Fetching a web page
# import urllib's request module
from urllib import request

# the target URL
base_url = 'http://www.baidu.com/'

# issue an HTTP request; urlopen() returns a file-like object
response = request.urlopen(base_url)

# read the page content
html = response.read().decode('utf-8')

# write the page to a file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(html)
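Besides read(), the file-like object returned by urlopen() also exposes the response status, headers, and final URL. A minimal sketch of inspecting them (these attributes come from the standard library, not from the example above):

from urllib import request

response = request.urlopen('http://www.baidu.com/')

# HTTP status code, e.g. 200
print(response.status)
# the URL that was actually retrieved, after any redirects
print(response.geturl())
# a single response header, or None if absent
print(response.getheader('Content-Type'))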
Constructing the request
Some websites inspect the browser information in the request to decide whether the client is a machine, so we need to construct the request headers ourselves.
# import the request module
from urllib import request

base_url = 'http://www.xicidaili.com/'

# build the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}

# build the Request object
req = request.Request(base_url, headers=headers)

# issue the request
response = request.urlopen(req)

# read the page content
html = response.read().decode()

# print the retrieved page source
print(html)
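When a site does reject a request that lacks browser-like headers, urlopen() raises urllib.error.HTTPError. A hedged sketch of handling that case (whether a given site actually blocks the default urllib User-Agent varies):

from urllib import request, error

try:
    # no custom headers: urllib sends its default User-Agent
    response = request.urlopen('http://www.xicidaili.com/')
    print(response.status)
except error.HTTPError as e:
    # sites that filter non-browser clients often answer 403 here
    print('request rejected:', e.code, e.reason)
except error.URLError as e:
    # network-level failure (DNS lookup, refused connection, ...)
    print('connection failed:', e.reason)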
Sending data with a GET request
Forms are usually submitted with either a POST or a GET request. The difference is that with GET the submitted content appears directly in the URL. Let's implement both.
from urllib import request, parse
import random

# the query parameters to send with GET
qs = {
    'wd': 'sister',
    'a': 1
}

# encode the parameters into a query string the server understands
qs = parse.urlencode(qs)

# splice the URL together
base_url = 'http://www.baidu.com/s?' + qs

# a list of User-Agent strings to choose from at random
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
]

# build the request headers with a randomly chosen User-Agent
headers = {
    'User-Agent': random.choice(ua_list)
}

# build the Request object
req = request.Request(base_url, headers=headers)

# issue the request
response = request.urlopen(req)

# read the response content
html = response.read().decode()
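To see what parse.urlencode() actually produces, print the encoded string by itself; non-ASCII values are percent-encoded as UTF-8 by default. A quick sketch:

from urllib import parse

print(parse.urlencode({'wd': 'sister', 'a': 1}))  # wd=sister&a=1
print(parse.quote('a sheep'))                     # a%20sheep (quote() encodes a single value)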
Sending data with a POST request
from urllib import request, parse

base_url = 'http://fanyi.baidu.com/sug'

# build the request form data
form = {
    'kw': 'a sheep'
}

# encode the form data into a string the server understands
form = parse.urlencode(form)

# build the POST request; when the data parameter is given, the request is sent as POST
req = request.Request(base_url, data=bytes(form, encoding='utf-8'))

# issue the HTTP POST request
response = request.urlopen(req)

# read the response content (JSON)
data = response.read().decode()
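Since the body is JSON, it can be decoded into a Python object with the standard json module. The exact keys depend on the API, so none are assumed here:

import json

# parse the JSON string read above into Python data structures
result = json.loads(data)
print(result)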
This simulates a simple form submission. Of course, most sites are not this easy to log in to, but this code is the core of simulated login.
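As a hedged illustration of that claim: a simulated login is, at its core, the same POST pattern with the form fields the login page expects. The URL and field names below are hypothetical placeholders, not a real site:

from urllib import request, parse

# hypothetical login endpoint and form fields
login_url = 'http://example.com/login'
form = parse.urlencode({
    'username': 'alice',   # placeholder credentials
    'password': 'secret'
})

req = request.Request(login_url, data=form.encode('utf-8'))
response = request.urlopen(req)
print(response.read().decode())

A real login usually also needs cookie handling, e.g. request.build_opener(request.HTTPCookieProcessor(...)) with an http.cookiejar.CookieJar, so the session survives across requests.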