# Learning notes on Liao Xuefeng's Python tutorial
1. Overview
urllib provides a series of functions for manipulating URLs. It contains four modules: request, error, parse, and robotparser.
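As a quick orientation, here is a minimal sketch touching each of the four submodules (the quote() call is just one illustrative use of parse):

```python
from urllib import request      # opening and reading URLs
from urllib import error        # exception classes raised by request
from urllib import parse        # splitting and encoding URLs
from urllib import robotparser  # parsing robots.txt files

print(parse.quote('a b&c'))  # URL-encode a string: prints 'a%20b%26c'
```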
1.1 urllib.request
The urllib.request module makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.
1) Fetch a URL from the Douban API and print the response:

```python
from urllib import request

# request.urlopen() opens the URL and returns a response object
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()  # the raw page content returned by the server
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
```
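If the request fails, urlopen() raises an exception from urllib.error. A minimal sketch of catching it (the book id below is just a placeholder):

```python
from urllib import request, error

try:
    with request.urlopen('https://api.douban.com/v2/book/0') as f:
        print(f.read().decode('utf-8'))
except error.HTTPError as e:
    # the server answered with an error status
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:
    # the server could not be reached at all
    print('Failed to reach the server:', e.reason)
```

HTTPError is a subclass of URLError, so it must be caught first.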
2) Simulate an iPhone 6 requesting the Douban homepage.
To impersonate a browser, send the GET request through a Request object. By adding HTTP headers to the Request object, we can disguise the request as coming from a variety of browsers:
```python
from urllib import request

req = request.Request('http://www.douban.com/')  # create a Request object
# add header information to masquerade as Safari on an iPhone
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376E Safari/8536.25')
with request.urlopen(req) as f:  # pass the Request object in place of the URL
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```
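The same headers can also be supplied when the Request object is constructed, which is handy when there are several of them. A small sketch reusing the User-Agent string from above:

```python
from urllib import request

# headers can be passed as a dict to the Request constructor
req = request.Request('http://www.douban.com/', headers={
    'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
                  'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376E Safari/8536.25',
})
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
```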
1.2 POST
To send a request as a POST, you only need to pass the data parameter in bytes form.
1) Simulate a Weibo login: first read the login email and password, then encode them in the username=xxx&password=xxx format expected by the weibo.cn login page:
```python
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([  # use parse.urlencode() to encode the form data
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')  # create the Request object
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376E Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

# the data argument carries the POST body sent to the server
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```
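Stripped of the Weibo-specific headers, the only essential point is that data must be bytes. A minimal sketch of a bare POST, using httpbin.org (a public request-echo service) as an assumed test endpoint:

```python
from urllib import request, parse

# urlencode the form fields, then encode the string to bytes for the POST body
body = parse.urlencode({'name': 'Python', 'version': '3'}).encode('utf-8')
with request.urlopen('https://httpbin.org/post', data=body) as f:
    print(f.read().decode('utf-8'))  # httpbin echoes the submitted form back as JSON
```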
1.3 Handler
1) If more complex control is needed, such as accessing a site through a proxy, we need to use a ProxyHandler:
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})  # route HTTP traffic through a proxy
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()  # handles Basic authentication for the proxy
# add_password(realm, uri, user, passwd): 'realm' is the authentication realm,
# 'host' stands for the proxy URL given above
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)  # returns an OpenerDirector instance
with opener.open('http://www.example.com/login.html') as f:  # access the URL through the proxy
    pass
```
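An opener built this way only applies when you call its own open() method. To make it the global default, so that a plain urlopen() also goes through the proxy, urllib.request.install_opener() can be used; a short sketch with the same hypothetical proxy as above:

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # install as the global default opener

# from now on, urlopen() uses the installed opener (and thus the proxy)
with urllib.request.urlopen('http://www.example.com/login.html') as f:
    pass
```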
1.4 Summary
urllib makes it possible to execute all kinds of HTTP requests from a program. If you want to simulate a browser to complete a specific task, you need to disguise the request as coming from that browser: first observe the requests the browser actually sends, then imitate its request headers. The User-Agent header is what identifies the browser.
1.5 Further reading
Python 3 web crawler: "Sending requests with urllib.request" (76067790)
An introduction to urlopen() in Python (https://www.cnblogs.com/zyq-blog/p/5606760.html)
Why Python crawlers use an opener object, and why to create a global default opener (https://www.cnblogs.com/cunyusup/p/7341829.html)
2. Examples
1) Use urllib to read JSON, then parse the JSON into a Python object:
```python
# -*- coding: utf-8 -*-
from urllib import request
import json

def fetch_data(url):
    with request.urlopen(url) as f:
        # decode the page content, then deserialize it into a Python object with json.loads()
        return json.loads(f.read().decode('utf-8'))

# test
URL = 'https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3d%202151330&format=json'
data = fetch_data(URL)
print(data)
assert data['query']['results']['channel']['location']['city'] == 'Beijing'
print('ok')
```
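The YQL endpoint above may no longer respond; the deserialization step itself can still be checked offline against a hand-written response. A minimal sketch with made-up data shaped like the one the test expects:

```python
import json

# a made-up response body matching the structure the assert above relies on
raw = '{"query": {"results": {"channel": {"location": {"city": "Beijing"}}}}}'
data = json.loads(raw)
assert data['query']['results']['channel']['location']['city'] == 'Beijing'
print('ok')
```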