The urllib module in Python 3 is a collection of components for working with URLs. If you come from Python 2, you will notice that Python 2 had two separate modules, urllib and urllib2; their functionality is now part of the Python 3 urllib package. The current version of urllib includes the following modules:
urllib.request
urllib.error
urllib.parse
urllib.robotparser
Next we will discuss each of these parts except for urllib.error. The official documentation actually recommends that you check out the third-party requests library, a higher-level HTTP client interface. However, I still think it is useful to know how to open URLs and interact with them without relying on third-party libraries, and it may also help you understand why the requests package is so popular.
urllib.request
The urllib.request module is primarily used for opening and fetching URLs. Let's look at some of the things you can do with the urlopen function:
>>> import urllib.request
>>> url = urllib.request.urlopen('https://www.google.com/')
>>> url.geturl()
'https://www.google.com/'
>>> url.info()
<http.client.HTTPMessage object at 0x7fddc2de04e0>
>>> header = url.info()
>>> header.as_string()
('Date: Fri, 24 Jun 2016 18:21:19 GMT\n'
 'Expires: -1\n'
 'Cache-Control: private, max-age=0\n'
 'Content-Type: text/html; charset=ISO-8859-1\n'
 'P3P: CP="This is not a P3P policy! See '
 'https://www.google.com/support/accounts/answer/151657?hl=en for more info."\n'
 'Server: gws\n'
 'X-XSS-Protection: 1; mode=block\n'
 'X-Frame-Options: SAMEORIGIN\n'
 'Set-Cookie: '
 'NID=80=tyjmy0jy6flssvj7dpssznouqdvqkfkhdchspigu3xfv41lvh_jg6lrusdgkprtm2hmz3j9v76ps4k_cbg7pdwuemqfr0dfzw33swpgex5qzlkxuvuvpfe9g699qz4cx9ipcbu3hkwrrya; '
 'expires=Sat, 24-Dec-2016 18:21:19 GMT; path=/; domain=.google.com; HttpOnly\n'
 'Alternate-Protocol: 443:quic\n'
 'Alt-Svc: quic=":443"; ma=2592000; v="34,33,32,31,30,29,28,27,26,25"\n'
 'Accept-Ranges: none\n'
 'Vary: Accept-Encoding\n'
 'Connection: close\n'
 '\n')
>>> url.getcode()
200
Here we import the module we need and tell it to open Google's URL. Now we have an HTTPResponse object that we can interact with. The first thing we do is call the geturl method, which returns the URL of the resource that was actually retrieved. This is useful for finding out whether the URL we requested was redirected.
Next we call info, which returns metadata about the page, such as the headers. We assign the result to our header variable and then call its as_string method to print out the headers we received from Google. You can also get the HTTP status code of the page with getcode; in this case it is 200, which means everything worked.
If you want to see the HTML of the page, you can call the read method on our url variable. I'm not going to reproduce that here because the output is quite long.
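For example, continuing the session from above, a minimal sketch might look like the following; I decode with the ISO-8859-1 charset reported in the Content-Type header above, and the page contents themselves are elided:
>>> contents = url.read()                  # raw bytes of the page
>>> text = contents.decode('iso-8859-1')   # decode the bytes into a string
>>> len(text) > 0
True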
Note that the Request object issues a GET request by default unless you specify its data parameter; if you pass in data, the request becomes a POST.
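As a minimal sketch of that behavior (the https://httpbin.org/post test endpoint here is my own choice for illustration, not something from the urllib docs), the data just has to be url-encoded and converted to bytes:
>>> import urllib.parse
>>> import urllib.request
>>> data = urllib.parse.urlencode({'q': 'Python'}).encode('ascii')   # data must be bytes
>>> request = urllib.request.Request('https://httpbin.org/post', data=data)
>>> with urllib.request.urlopen(request) as response:
...     print(response.getcode())   # should print 200 if the POST was accepted
...
200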
Download a file
A typical use case for urllib is downloading a file. Let's look at a few ways to accomplish this task:
>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> with open('/home/mike/desktop/test.zip', 'wb') as fobj:
...     fobj.write(data)
...
In this example we open the URL of a zip archive that is hosted on my blog. We then read the data and write it out to disk. An alternative way to accomplish this is to use urlretrieve:
>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> tmp_file, header = urllib.request.urlretrieve(url)
>>> with open('/home/mike/desktop/test.zip', 'wb') as fobj:
...     with open(tmp_file, 'rb') as tmp:
...         fobj.write(tmp.read())
The urlretrieve method copies the network object to a randomly named temporary file unless you give it a second argument with the path where you want the file saved, which saves a step:
>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> urllib.request.urlretrieve(url, '/home/mike/desktop/blog.zip')
('/home/mike/desktop/blog.zip',
 <http.client.HTTPMessage object at 0x...>)
As you can see, it returns the path where the file was saved, as well as the header information that was obtained from the request.
Set up your user agent
When you use a browser to visit a web page, the browser tells the site who it is. This is called the user agent (the User-Agent header). Python's urllib identifies itself as Python-urllib/x.y, where x and y are the major and minor version numbers of the Python release you are using. Some websites do not recognize this user agent and may behave strangely or not work at all. Fortunately, you can easily set your own User-Agent string.
>>> import urllib.request
>>> user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'
>>> url = 'http://www.whatsmyua.com/'
>>> headers = {'User-Agent': user_agent}
>>> request = urllib.request.Request(url, headers=headers)
>>> with urllib.request.urlopen(request) as response:
...     with open('/home/mdriscoll/desktop/user_agent.html', 'wb') as out:
...         out.write(response.read())
Here we set our user agent to that of Mozilla Firefox and point our URL at http://www.whatsmyua.com/, which will tell us what it detects as our User-Agent string. We then create a Request instance with our URL and headers and pass it to urlopen. Finally we save the result. If you open the resulting file, you will see that we successfully changed our User-Agent string. Try a few different values with this code to see how it changes.
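If you want to double-check what a Request will send before opening it, the Request object lets you inspect its headers. Note that urllib normalizes header names internally (for example 'User-Agent' is stored as 'User-agent'), so a quick check of the session above looks something like this:
>>> request.get_header('User-agent')
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'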
urllib.parse
The urllib.parse library provides the standard interface for breaking up URL strings and putting them back together. You can use it, for example, to convert a relative URL into an absolute one. Let's try it on a URL that contains a query string:
>>> from urllib.parse import urlparse
>>> result = urlparse('https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa')
>>> result
ParseResult(scheme='https', netloc='duckduckgo.com', path='/', params='', query='q=python+stubbing&t=canonical&ia=qa', fragment='')
>>> result.netloc
'duckduckgo.com'
>>> result.geturl()
'https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa'
>>> result.port
None
Here we import the urlparse function and pass it a DuckDuckGo URL that contains a search query string. My query searches for articles about "Python stubbing". As you can see, it returns a ParseResult object that you can use to learn more about the URL. For example, you can get the port information (None in this case), the network location, the path, and much more.
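The relative-to-absolute conversion mentioned earlier is handled by the module's urljoin function. A quick sketch (the DuckDuckGo paths here are just illustrative):
>>> from urllib.parse import urljoin
>>> urljoin('https://duckduckgo.com/about', 'privacy')
'https://duckduckgo.com/privacy'
>>> urljoin('https://duckduckgo.com/about/', 'privacy')
'https://duckduckgo.com/about/privacy'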
Submit a Web form
This module also has a urlencode method, which is great for passing data into a URL. A typical use case for urllib.parse is submitting Web forms. Let's see how it works by searching for "Python" on the DuckDuckGo search engine:
>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'q': 'Python'})
>>> data
'q=Python'
>>> url = 'http://duckduckgo.com/html/'
>>> full_url = url + '?' + data
>>> response = urllib.request.urlopen(full_url)
>>> with open('/home/mike/desktop/results.html', 'wb') as f:
...     f.write(response.read())
This example is straightforward. Basically we submitted a query to DuckDuckGo using Python instead of a browser. To do that we built our query string with urlencode, joined it to the URL to form a full, correct URL, and used urllib.request to submit the form. Finally, we retrieved the result and saved it to disk.
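urlencode works on dictionaries of parameters. If you only need to escape a single string for use in a URL, urllib.parse also offers quote and quote_plus; a brief sketch:
>>> import urllib.parse
>>> urllib.parse.quote_plus('python stubbing')     # spaces become '+'
'python+stubbing'
>>> urllib.parse.quote('/path with spaces/')       # spaces become '%20', '/' is kept
'/path%20with%20spaces/'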
urllib.robotparser
The robotparser module consists of a single class, RobotFileParser. This class answers questions about whether or not a particular user agent is allowed to fetch a URL on a site that publishes a robots.txt file. The robots.txt file tells web crawlers and robots which parts of a site they are not allowed to access. Let's look at a simple example:
>>> import urllib.robotparser
>>> robot = urllib.robotparser.RobotFileParser()
>>> robot.set_url('http://arstechnica.com/robots.txt')
None
>>> robot.read()
None
>>> robot.can_fetch('*', 'http://arstechnica.com/')
True
>>> robot.can_fetch('*', 'http://arstechnica.com/cgi-bin/')
False
Here we import the robot parser class and create an instance of it. We then pass it a URL that points to the site's robots.txt file. Next we tell the parser to read the file. Once that's done, we give it a couple of different URLs to find out which ones we are allowed to crawl and which we aren't. We quickly see that we can access the main site, but not the cgi-bin path.
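Putting this together with urllib.request, a polite downloader could check robots.txt before fetching anything. Here is a minimal sketch; the function name, the default user-agent string, and any URLs you pass in are my own choices, not part of the standard library:

import urllib.request
import urllib.robotparser

def polite_fetch(url, robots_url, user_agent='my-crawler'):
    # Refuse to download anything that robots.txt disallows for this agent.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    if not parser.can_fetch(user_agent, url):
        raise PermissionError('robots.txt disallows fetching ' + url)
    # Otherwise fetch the page with our chosen user agent.
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()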
Summary
At this point you should be able to use Python's urllib package competently. In this section we learned how to download files, submit Web forms, change our user agent, and check robots.txt. urllib has many additional features not covered here, such as website authentication. However, you might want to consider switching to the requests library before doing authentication with urllib, because requests implements these features in a more user-friendly and easier-to-debug way. I would also like to point out that Python supports cookies through its http.cookies module, and that this functionality is also wrapped quite nicely in the requests package. You should probably try both to decide which one works best for you.
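As one example of that authentication support, here is a minimal sketch of HTTP basic authentication using urllib's handler classes; the URL and credentials are placeholders, not a real service:

import urllib.request

# Register placeholder credentials for the (hypothetical) site.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.com/', 'user', 'secret')
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)
with opener.open('http://example.com/protected') as response:
    print(response.getcode())   # 200 once the credentials are accepted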