Objective
Web crawling means reading the network resource specified by a URL from the network stream and saving it locally. Python has many libraries that can be used to crawl web pages; let's start by learning urllib.
Note: the development environment for this post is Python 3.
Urlopen
Let's start with a piece of code:
# urllib_urlopen.py
# import urllib.request
import urllib.request

# send a request to the specified url and return a file-like object wrapping the server's response
response = urllib.request.urlopen("http://www.baidu.com")

# the file-like object supports the usual file methods, e.g. read() returns the entire content (as bytes in Python 3)
html = response.read()

# print it
print(html)
Run the Python file, and the result will be printed:
python3 urllib_urlopen.py
In fact, if we open the Baidu homepage in a browser and right-click "View Page Source", we will find that it is exactly the same as what we just printed. In other words, those few lines of code have already crawled the entire HTML of Baidu's home page for us.
A basic URL request really is this simple in Python.
Request
In our first example, the parameter to urlopen() was a URL address. But if more complex operations are needed, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(), with the URL to be accessed supplied as a parameter of the Request instance.
We edit urllib_request.py
# urllib_request.py
import urllib.request

# the url is passed to Request(), which constructs and returns a Request object
request = urllib.request.Request("http://www.baidu.com")

# the Request object is passed to urlopen(), which sends it to the server and receives the response
response = urllib.request.urlopen(request)

html = response.read()
print(html)
The results are exactly the same.
When creating a Request instance, besides the mandatory url argument, two other arguments can be set:
1. data (empty by default): the data submitted along with the url (for example data to be POSTed); when it is supplied, the HTTP request changes from "GET" to "POST".
2. headers (empty by default): a dictionary containing the key-value pairs of the HTTP headers to send.
Both parameters are covered below; a quick sketch of the data parameter follows this list.
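The following is only a minimal sketch of the data parameter, not one of the original examples; http://httpbin.org/post is assumed here purely as a convenient test endpoint. Supplying data turns the request into a POST:

# minimal sketch: the data argument switches the request from GET to POST
# httpbin.org/post is an assumed test endpoint, used only for illustration
import urllib.request
import urllib.parse

post_data = urllib.parse.urlencode({"wd": "baidu"}).encode("utf-8")  # data must be bytes
request = urllib.request.Request("http://httpbin.org/post", data=post_data)
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))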
User-agent
But sending a request to a website directly with urllib like this is a little abrupt. It is as if every house has a door, and walking straight in as a passer-by is obviously not very polite. Besides, some sites do not like being visited by programs (non-human access) and may deny your requests.
But if we request someone else's website with a legitimate identity, we are obviously welcome, so we should add an identity to our code: the so-called User-Agent header.
The browser is the identity that is universally accepted in the Internet world. If we want our crawler to behave more like a real user, the first step is to disguise it as a recognized browser. Different browsers send different User-Agent headers with their requests.
# urllib_useragent.py
import urllib.request

url = "http://www.baidu.com"

# the User-Agent of IE 9.0, stored in ua_header
# tip: you can search online for the user-agent strings of the various browsers
ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

# build the Request from the url together with the headers; the request will carry the IE 9.0 User-Agent
request = urllib.request.Request(url, headers=ua_header)

# send the request to the server
response = urllib.request.urlopen(request)

html = response.read()
print(html)
Add more header information
A complete HTTP request message can be built by adding specific headers to the request.
You can add or modify a specific header by calling Request.add_header(), and inspect an existing header by calling Request.get_header().
# urllib_headers.py
import urllib.request

url = "http://www.baidu.com"

# the User-Agent of IE 9.0
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib.request.Request(url, headers=header)

# a specific header can also be added/modified by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# an existing header can be inspected by calling Request.get_header()
# request.get_header(header_name="Connection")

response = urllib.request.urlopen(request)

print(response.code)  # check the response status code
html = response.read()
print(html)
- Randomly add/modify the User-Agent
# urllib_useragentlist.py
import urllib.request
import random

url = "http://www.baidu.com/"

# this could be a list of User-Agents, or equally a list of proxies
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
]

# pick a random User-Agent from the list
user_agent = random.choice(ua_list)

# build a request
request = urllib.request.Request(url)

# add_header() adds/modifies an HTTP header
request.add_header("User-Agent", user_agent)

# get_header() reads the value of an existing header; note that only the first letter
# may be capitalized, the rest must be lowercase
print(request.get_header("User-agent"))
UrlEncode
import urllib.parse

word = {"wd": "百度"}

# urllib.parse.urlencode() converts the dictionary's key-value pairs into URL-encoded form,
# so that they can be accepted by a web server
a = urllib.parse.urlencode(word)
print(a)  # wd=%E7%99%BE%E5%BA%A6

# urllib.parse.unquote() does the reverse and decodes the URL-encoded string
b = urllib.parse.unquote(a)
print(b)  # wd=百度
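The encoded string is normally appended to a URL as its query string. The sketch below is an illustration rather than one of the original examples; the Baidu search path /s?wd= is assumed here:

# sketch: append the URL-encoded query string to a search URL and fetch it
# the Baidu search path /s?wd= is assumed purely for illustration
import urllib.request
import urllib.parse

base_url = "http://www.baidu.com/s?"
query = urllib.parse.urlencode({"wd": "百度"})
full_url = base_url + query  # http://www.baidu.com/s?wd=%E7%99%BE%E5%BA%A6

headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib.request.Request(full_url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())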