Python urllib module details and examples
Let's look at an example. This one fetches the HTML of the Google homepage and prints it to the console:
import urllib
print urllib.urlopen('http://www.google.com').read()
# Don't be surprised: the whole program really is just these two lines of code.
urllib.urlopen(url[, data[, proxies]]):
Creates a file-like object that represents the remote URL; you can then operate on this object just like a local file to fetch the remote data. The url parameter is the path to the remote data, normally a URL. The data parameter is the data submitted to the url via POST (web developers will be familiar with the two submission methods, POST and GET; if you are not, don't worry, this parameter is rarely needed). The proxies parameter sets a proxy (we will not go into proxy usage here; interested readers can consult the urllib section of the Python manual). urlopen returns a file-like object that provides the following methods:
read(), readline(), readlines(), fileno(), close(): these work the same way as on file objects;
info(): returns an httplib.HTTPMessage object holding the header information returned by the remote server;
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request completed successfully and 404 means the URL was not found;
geturl(): returns the requested URL;
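In Python 3 the same interface lives in urllib.request; the sketch below exercises the file-like methods above through a file:// URL, so it runs without network access (the temporary file name is generated by the example, not an assumption about your system):

```python
import os
import tempfile
from urllib.request import urlopen, pathname2url

# Write a small local file so the example needs no network connection.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'w') as f:
    f.write('<html>hello</html>')

# A file:// URL opens with urlopen just like an http:// one.
resp = urlopen('file://' + pathname2url(path))
data = resp.read().decode()  # read() works as on a local file
print(data)
print(resp.geturl())         # the URL we requested
resp.close()
os.remove(path)
```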
Below we expand the example above; running it yourself will deepen your feel for urllib:
google = urllib.urlopen('http://www.google.com')
print 'http header:\n', google.info()
print 'http status:', google.getcode()
print 'url:', google.geturl()
for line in google:  # just like reading a local file
    print line,
google.close()
urllib.urlretrieve(url[, filename[, reporthook[, data]]]):
The urlretrieve method downloads remote data straight to the local machine. The filename parameter specifies the local save path (if omitted, urllib generates a temporary file to hold the data). The reporthook parameter is a callback function, triggered when the connection to the server is established and each time a data block is transferred; we can use it to display the current download progress, as the example below does. The data parameter is the data POSTed to the server. The method returns a two-element tuple (filename, headers): filename is the local save path and headers is the server's response headers. The following example demonstrates the method: it fetches the HTML of the Sina homepage, saves it to D:\sina.html, and displays the download progress.
import urllib

def cbk(a, b, c):
    '''Callback function
    @a: number of data blocks downloaded so far
    @b: size of one data block
    @c: size of the remote file
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print '%.2f%%' % per

url = 'http://www.sina.com.cn'
local = 'd:\\sina.html'
urllib.urlretrieve(url, local, cbk)
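For readers on Python 3, urlretrieve still exists as urllib.request.urlretrieve. The sketch below mirrors the callback example above, but records the progress values in a list instead of printing them, and "downloads" from a local file:// URL so it can run without a network connection; the file paths are temporary ones generated by the example, not D:\sina.html:

```python
import os
import tempfile
from urllib.request import urlretrieve, pathname2url

progress = []

def cbk(blocks, block_size, total_size):
    '''Callback: blocks transferred so far, size of one block, remote size.'''
    per = 100.0 * blocks * block_size / total_size
    progress.append(min(per, 100.0))

# Create a small source file to stand in for the remote resource.
src = tempfile.mkstemp()[1]
with open(src, 'w') as f:
    f.write('hello urlretrieve')

dst = src + '.copy'
filename, headers = urlretrieve('file://' + pathname2url(src), dst, cbk)
copied = open(dst).read()
print(filename)
print(copied)
```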
The two methods described above are the most commonly used ones in urllib. Internally, they fetch remote data through the URLopener or FancyURLopener class. As ordinary users of urllib we rarely touch these two classes directly, so I will not say more about them here; if you are interested in urllib's implementation, or want urllib to support more protocols, they are worth studying. In the Python manual, the author of urllib also lists the module's defects and shortcomings; interested readers can open the Python manual to learn more.
urllib also provides some helper methods for encoding and decoding URLs. A URL cannot contain arbitrary special characters, and some characters have special purposes. For example, when we submit data via GET, a string like key=value is appended to the url, so '=' is not allowed inside a value and must be encoded; when the server receives the parameters, it decodes them to recover the original data. This is where these helper methods come in:
urllib.quote(string[, safe]): encodes the string; the safe parameter specifies characters that should not be encoded;
urllib.unquote(string): decodes a string;
urllib.quote_plus(string[, safe]): like urllib.quote, but replaces spaces with '+', whereas quote replaces spaces with '%20';
urllib.unquote_plus(string): decodes a string, turning '+' back into spaces;
urllib.urlencode(query[, doseq]): converts a dict, or a list of two-element tuples, into URL parameters; for example, the dictionary {'name': 'dark-bull', 'age': 200} is converted to "name=dark-bull&age=200";
urllib.pathname2url(path): converts a local path into a url path;
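These helpers moved to urllib.parse in Python 3 but behave the same way; a quick sketch:

```python
from urllib.parse import quote, quote_plus, unquote_plus, urlencode

# ' ' and '=' have special meanings in a URL, so they get percent-encoded.
print(quote('a b=c'))           # a%20b%3Dc
print(quote_plus('a b=c'))      # a+b%3Dc  (spaces become '+')
print(unquote_plus('a+b%3Dc'))  # a b=c

# urlencode turns a dict into a query string.
print(urlencode({'name': 'dark-bull', 'age': 200}))  # name=dark-bull&age=200
```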
Author: lmh12506