Learn Python's urllib module

The urllib module in Python 3 is a collection of components for working with URLs. If you have a background in Python 2, you will recall that Python 2 had two separate modules, urllib and urllib2. These are now part of the Python 3 urllib package, which currently includes the following submodules:

    • urllib.request

    • urllib.error

    • urllib.parse

    • urllib.robotparser

Next we will cover each of these except urllib.error. The official documentation actually recommends that you look at the third-party requests library, a higher-level HTTP client interface. However, I still believe it is useful to know how to open URLs and interact with them without relying on third-party libraries, and it can also help you understand why the requests package is so popular.

urllib.request

The urllib.request module is primarily used for opening and fetching URLs. Let's see some of the things you can do with the urlopen function:

>>> import urllib.request
>>> url = urllib.request.urlopen('https://www.google.com/')
>>> url.geturl()
'https://www.google.com/'
>>> url.info()
<http.client.HTTPMessage object at 0x7fddc2de04e0>
>>> header = url.info()
>>> header.as_string()
('Date: Fri, 24 Jun 2016 18:21:19 GMT\n'
 'Expires: -1\n'
 'Cache-Control: private, max-age=0\n'
 'Content-Type: text/html; charset=ISO-8859-1\n'
 'P3P: CP="This is not a P3P policy! See '
 'https://www.google.com/support/accounts/answer/151657?hl=en for more info."\n'
 'Server: gws\n'
 'X-XSS-Protection: 1; mode=block\n'
 'X-Frame-Options: SAMEORIGIN\n'
 'Set-Cookie: '
 'NID=80=tyjmy0jy6flssvj7dpssznouqdvqkfkhdchspigu3xfv41lvh_Jg6lrusdgkprtm2hmz3j9v76ps4k_cbg7pdwuemqfr0dfzw33swpgex5qzlkxuvuvpfe9g699qz4cx9ipcbu3hkwrrya; '
 'expires=Sat, 24-Dec-2016 18:21:19 GMT; path=/; domain=.google.com; HttpOnly\n'
 'Alternate-Protocol: 443:quic\n'
 'Alt-Svc: quic=":443"; ma=2592000; v="34,33,32,31,30,29,28,27,26,25"\n'
 'Accept-Ranges: none\n'
 'Vary: Accept-Encoding\n'
 'Connection: close\n'
 '\n')
>>> url.getcode()
200

Here we import the module we need and tell it to open Google's URL. Now we have an HTTPResponse object that we can interact with. The first thing we do is call the geturl method, which returns the URL of the resource that was actually retrieved. This lets us find out whether the URL was redirected.

Next we call info, which returns metadata about the page, such as the response headers. We assign the result to our headers variable and then call its as_string method, which prints out the headers we received from Google. You can also get the HTTP response code of the page with getcode; here it is 200, which means the request succeeded.

If you want to look at the HTML of the web page, you can call the read method on the url variable we defined above. I won't reproduce that here because the output would be quite long.
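If you do want to try it, here is a minimal sketch (decoding to text and truncating the output are my additions; the character set is taken from the Content-Type header shown earlier):

>>> import urllib.request
>>> with urllib.request.urlopen('https://www.google.com/') as url:
...     raw = url.read()  # the body arrives as bytes
...
>>> text = raw.decode('iso-8859-1')  # charset from the Content-Type header
>>> print(text[:100])  # print only the first 100 characters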

Note that a Request object issues a GET request by default unless you specify its data parameter. If you do pass in data, the request becomes a POST.
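For example, here is a minimal sketch of turning a request into a POST via the data parameter (the target URL and form field are made up for illustration):

>>> import urllib.parse
>>> import urllib.request
>>> data = urllib.parse.urlencode({'name': 'Mike'}).encode('utf-8')  # data must be bytes
>>> request = urllib.request.Request('http://httpbin.org/post', data=data)
>>> request.get_method()  # the presence of data switches the method
'POST'
>>> # urllib.request.urlopen(request) would now submit the form as a POST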

Downloading a file

A typical use case for urllib is downloading a file. Let's look at a few ways to accomplish this task:

>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> with open('/home/mike/desktop/test.zip', 'wb') as fobj:
...     fobj.write(data)
...

In this example we open the URL of a zip archive hosted on my blog. Then we read out the data and write it to disk. An alternative way to accomplish this is to use urlretrieve:

>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> tmp_file, header = urllib.request.urlretrieve(url)
>>> with open('/home/mike/desktop/test.zip', 'wb') as fobj:
...     with open(tmp_file, 'rb') as tmp:
...         fobj.write(tmp.read())

Here urlretrieve downloads the file to a temporary location and returns its path, so we still have to copy the data to its final destination ourselves. You can skip that extra step by telling urlretrieve where to save the file directly:

>>> import urllib.request
>>> url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
>>> urllib.request.urlretrieve(url, '/home/mike/desktop/blog.zip')
('/home/mike/desktop/blog.zip',
 <http.client.HTTPMessage object at 0x...>)

As you can see, it returns the path where the file was saved, along with the headers that came back with the response.
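urlretrieve also accepts an optional reporthook callback that is called as blocks of data arrive, which you could use for simple progress reporting. A minimal sketch, reusing the url variable from above (the progress function is my own illustration, not part of the original example):

>>> def progress(block_num, block_size, total_size):
...     # total_size is -1 if the server did not send a Content-Length header
...     print('downloaded roughly {} of {} bytes'.format(block_num * block_size, total_size))
...
>>> urllib.request.urlretrieve(url, '/home/mike/desktop/blog.zip', reporthook=progress)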

Set up your user agent

When you visit a website with a browser, the browser tells the site who it is. This is called the user agent (User-Agent) string. Python's urllib identifies itself as Python-urllib/x.y, where x and y are the major and minor version numbers of the Python you are running. Some websites do not recognize this user agent and may behave strangely or not work at all. Fortunately, it is easy to set your own User-Agent string.

>>> import urllib.request
>>> user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'
>>> url = 'http://www.whatsmyua.com/'
>>> headers = {'User-Agent': user_agent}
>>> request = urllib.request.Request(url, headers=headers)
>>> with urllib.request.urlopen(request) as response:
...     with open('/home/mdriscoll/desktop/user_agent.html', 'wb') as out:
...         out.write(response.read())

Here we set our user agent to Mozilla Firefox and point the request at http://www.whatsmyua.com/, a site that reports what it sees in our User-Agent field. We create a Request instance from the URL and our headers, pass it to urlopen, and save the result. If you open the result file, you will see that we successfully changed our User-Agent field. Try a few different values with this code to see how it changes.
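If you want to confirm what will be sent before making the request, the Request object can report its headers. A small sketch continuing from the code above (note that urllib normalizes header names by capitalizing only the first letter, so the lookup key is 'User-agent'):

>>> request.get_header('User-agent')
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'
>>> request.header_items()
[('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0')]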

urllib.parse

The urllib.parse library provides a standard interface for breaking URL strings into their components and reassembling them. For example, you can use it to convert a relative URL into an absolute URL. Let's try it on a URL that contains a query string:

>>> from urllib.parse import urlparse
>>> result = urlparse('https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa')
>>> result
ParseResult(scheme='https', netloc='duckduckgo.com', path='/', params='', query='q=python+stubbing&t=canonical&ia=qa', fragment='')
>>> result.netloc
'duckduckgo.com'
>>> result.geturl()
'https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa'
>>> result.port
None

Here we import the urlparse function and pass it a DuckDuckGo URL containing a search query string. My query was a search for articles about "python stubbing". As you can see, urlparse returns a ParseResult object that you can use to learn more about the URL. For example, you can get the port (there is none in this case), the network location, the path, and much more.
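The relative-to-absolute conversion mentioned above is handled by urljoin from the same package. A quick sketch (the example paths are my own):

>>> from urllib.parse import urljoin
>>> urljoin('https://duckduckgo.com/help/', 'privacy')
'https://duckduckgo.com/help/privacy'
>>> urljoin('https://duckduckgo.com/help/', '/about')
'https://duckduckgo.com/about'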

Submit a Web form

This module also holds the urlencode method, which is great for passing data into a URL. A typical use case for urllib.parse is submitting Web forms. Let's search for "Python" with the DuckDuckGo search engine to see how this function works.

>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'q': 'Python'})
>>> data
'q=Python'
>>> url = 'http://duckduckgo.com/html/'
>>> full_url = url + '?' + data
>>> response = urllib.request.urlopen(full_url)
>>> with open('/home/mike/desktop/results.html', 'wb') as f:
...     f.write(response.read())

This example is straightforward. Basically we submit a query to DuckDuckGo from Python instead of a browser. To do that, we build our query string with urlencode. Then we join the string and the URL together into a full, correct URL and use urllib.request to submit the form. Finally, we fetch the result and save it to disk.
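Note that urlencode also handles multiple fields and percent-escapes characters that are not URL-safe. A small sketch (the extra field names are my own):

>>> from urllib.parse import urlencode
>>> urlencode({'q': 'python urllib', 'ia': 'web'})
'q=python+urllib&ia=web'
>>> urlencode({'q': 'a&b=c'})  # reserved characters are escaped
'q=a%26b%3Dc'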

urllib.robotparser

The robotparser module is made up of a single class, RobotFileParser. This class answers questions about whether a particular user agent may fetch a given URL on a website that publishes a robots.txt file. The robots.txt file tells web crawlers and robots which parts of the site they are not allowed to access. Let's look at a simple example:

>>> import urllib.robotparser
>>> robot = urllib.robotparser.RobotFileParser()
>>> robot.set_url('http://arstechnica.com/robots.txt')
None
>>> robot.read()
None
>>> robot.can_fetch('*', 'http://arstechnica.com/')
True
>>> robot.can_fetch('*', 'http://arstechnica.com/cgi-bin/')
False

Here we import the robot parser class and create an instance of it. Then we pass it a URL that points to the site's robots.txt file. Next we tell the parser to read the file. Once that's done, we give it a couple of different URLs to find out which ones we can crawl and which ones we can't. We quickly see that we can access the main site but not the cgi-bin path.
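In practice you would check can_fetch before issuing each request. Here is a minimal sketch combining robotparser with urlopen (the wildcard agent and the control flow are my own illustration):

>>> import urllib.request
>>> import urllib.robotparser
>>> robots = urllib.robotparser.RobotFileParser()
>>> robots.set_url('http://arstechnica.com/robots.txt')
>>> robots.read()
>>> url = 'http://arstechnica.com/'
>>> if robots.can_fetch('*', url):
...     # only fetch pages the site allows crawlers to visit
...     with urllib.request.urlopen(url) as response:
...         page = response.read()
...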

Wrapping up

Now you have the ability to use Python's urllib package competently. In this section we learned how to download files, submit Web forms, change our user agent, and query robots.txt. urllib has a lot of additional functionality that is not covered here, such as website authentication. You may want to consider switching to the requests library before attempting authentication with urllib, as requests implements these features in a more user-friendly and easier-to-debug way. I would also like to note that Python supports cookies through the http.cookies module, and that the requests package wraps this functionality nicely as well. You should probably consider trying both to decide which one works best for you.
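For reference, here is a minimal sketch of what HTTP basic authentication looks like with urllib (the URL and credentials are placeholders, not a real endpoint):

>>> import urllib.request
>>> password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
>>> password_mgr.add_password(None, 'http://example.com/', 'username', 'password')
>>> handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
>>> opener = urllib.request.build_opener(handler)
>>> response = opener.open('http://example.com/')  # credentials are sent when the server challenges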

