A web crawler reads network resources from the network stream at a specified URL and saves them locally. Python has many libraries for fetching web pages; let's start with urllib.request (urllib2 in Python 2.x).
urlopen
Let's start by reading the following code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# Send a request to the specified URL; urlopen() returns a file-like
# object representing the server's response
response = urllib.request.urlopen("http://www.baidu.com/")

# The response object supports file-object methods such as read(),
# which reads the entire response body and returns it as bytes
html = response.read()

# Print the result
print(html)
In fact, if we open the Baidu homepage in a browser, right-click, and choose "View Source", we will find that the output of the program above is exactly the same. In other words, those few lines of code have already fetched the entire source of the Baidu homepage for us.
Request
In the example above, the parameter to urlopen() was a URL string.
However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance to pass to urlopen(), with the URL you want to access passed as a parameter to the Request constructor.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# Pass the URL to Request() to construct a Request object
request = urllib.request.Request("http://www.baidu.com/")

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
The result is exactly the same as before.
Besides the url parameter, a new Request instance can be given two additional parameters:
1. data (empty by default): the data to submit along with the URL (e.g. POST data); supplying it changes the HTTP request method from GET to POST.
2. headers (empty by default): a dictionary containing the key-value pairs of the HTTP headers to send.
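As a minimal sketch of the data parameter (the URL is just a placeholder; the requests here are only constructed, not sent), supplying data is what switches the method from GET to POST:

```python
import urllib.parse
import urllib.request

# Form fields to submit; urlencode() turns them into "key=value&..." form,
# and encode() converts the string to bytes, as the data parameter requires
form_data = urllib.parse.urlencode({"kw": "python"}).encode("utf-8")

# With data supplied, the request method becomes POST
post_request = urllib.request.Request("http://www.example.com/", data=form_data)
print(post_request.get_method())  # POST

# Without data, the same constructor yields a GET request
get_request = urllib.request.Request("http://www.example.com/")
print(get_request.get_method())  # GET
```

Note that data must be bytes, which is why the urlencoded string is encoded before being passed in.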
User-Agent
If we want our crawler to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests. The default User-Agent of urllib.request is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/3.5).
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# A Chrome User-Agent, carried in the header dict
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Construct the Request from the URL together with headers; the request
# will carry the Chrome browser User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
Adding more header information
A complete HTTP request message is constructed by adding specific headers to the HTTP request.
You can add or modify a specific header by calling Request.add_header(), and inspect an existing header by calling Request.get_header().
Add a specific header
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# A Chrome User-Agent, carried in the header dict
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Construct the Request from the URL together with headers; the request
# will carry the Chrome browser User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# You can also add/modify a specific header via Request.add_header()
request.add_header("Connection", "keep-alive")

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
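To round this out, here is a small sketch of Request.get_header() (no request is actually sent). One wrinkle worth knowing: Request stores header names with only the first letter capitalized, so the matching spelling to look up is "User-agent", not "User-Agent":

```python
import urllib.request

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}
request = urllib.request.Request("http://www.baidu.com/", headers=header)
request.add_header("Connection", "keep-alive")

# Request stores header names capitalized: "User-agent", "Connection"
print(request.get_header("User-agent"))   # the Chrome UA string set above
print(request.get_header("Connection"))   # keep-alive

# get_header() takes an optional default, returned when the header is absent
print(request.get_header("Accept", "not set"))  # not set
```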