Python urllib module


The urllib module is a high-level web communication library. Its core function is to request a resource, imitating a web browser or other client, and return a file-like object. urllib supports several web protocols, such as HTTP, FTP, and Gopher, and it can also access local files. In practice it is most often used to write crawlers, and the following content focuses on how to use the urllib library to write a simple crawler. Note that if you want to crawl content generated dynamically by JavaScript, such as dynamically loaded images, you will need more advanced techniques; the examples here are for static HTML webpages.

The descriptions below are for Python 2.7. The API differs between versions; for details, refer to the official manual.

First, suppose I need to write a crawler that grabs the images from a website. I can take the following steps: open the target website, read the returned page source, filter out the image URLs (for example with regular expressions), and then download the corresponding resources.

Of course, you can also process and analyse the crawled data in many other ways. For example, you could build a price-comparison website that obtains quotations from various sites and then integrates them. The layout above therefore extends to crawlers in general.

Next we will follow this general arrangement to learn how to do each step.
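
As a rough illustration of this arrangement, here is a minimal sketch that combines the functions explained in the sections below (urlopen for opening, the re module for filtering, and urlretrieve for downloading). The target URL and the regular expression are placeholders for demonstration only, not taken from the original example:

import re
import urllib

# 1. open the target website (placeholder URL)
f = urllib.urlopen('http://www.example.com/')
# 2./3. read the page source and filter out image addresses (illustrative pattern)
html = f.read()
urls = re.findall(r'<img.*?src="(.*?)"', html, re.I)
# 4. download every resource that was found
for i, url in enumerate(urls):
    urllib.urlretrieve(url, '%d.jpg' % i)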

 

1. Open the target website

urllib.urlopen(url[, data[, proxies[, context]]])

Sends a request to the remote url and returns a file-like object. (Note that this actually initiates a remote request, i.e. a network operation is performed and traffic is generated.)

url: the complete path of the remote resource, usually a website URL. (Note: include the protocol prefix, for example http://www.baidu.com/; the http:// part cannot be omitted.)

If the URL specifies no protocol, or its protocol identifier is file:, the function opens a local file instead. If the remote address cannot be opened, an IOError is raised.

data: when http:// is used, this optional parameter turns the request into a POST request (GET is used by default). The data must be in the standard application/x-www-form-urlencoded format; you can use the urlencode() method to generate it quickly.

proxies: sets a proxy. For more information, see the official documentation. The following is the example from the official site:

# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)

 

context: if given, this parameter must be an ssl.SSLContext instance, used to configure SSL.

Generally, you only need to set the url parameter.

For example:

f = urllib.urlopen('http://www.baidu.com/')

 

In this way, I get a file-like object, and I can then perform various read operations on it.
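
Since urlopen raises an IOError when the remote address cannot be opened (as noted above), a cautious call can be wrapped like this minimal sketch:

import urllib

try:
    f = urllib.urlopen('http://www.baidu.com/')
    print f.read(100)   # read at most the first 100 bytes
except IOError:
    print 'could not open the target website'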

 

2. Operate on the file object

The methods below are the same as those for ordinary file operations; for details, refer to my earlier explanation of python file operations:

1. read([size]) -> read at most size bytes, returned as a string.

Reads the entire file (or at most size bytes) and returns a string object.

2. readline([size]) -> next line from the file, as a string.

Reads one line and returns a string object.

3. readlines([size]) -> list of strings, each a line from the file.

Reads the entire file, wraps each line as an element of a list, and returns that list.

4. readinto() -> undocumented. Don't use this; it may go away.

An undocumented method that can be ignored and will be removed.

5. close() -> None or (perhaps) an integer. Close the file.

Closes the file and returns None or an integer indicating the close status.

 

In addition, you can iterate over it directly, just like a file object.
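
For example, a short sketch that prints the first few lines of the response by iterating over the object, using the same URL as before:

import urllib

f = urllib.urlopen('http://www.baidu.com/')
# iterate line by line, just like a file object
for i, line in enumerate(f):
    print line.rstrip()
    if i >= 4:   # stop after the first five lines
        break
f.close()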

 

In addition to the above methods used in file operations, there are also the following special methods:

1. info ()

Returns information about the file. For http, the headers of the response message are returned.

Example:

f = urllib.urlopen('http://www.so.com/')
print f.info()

 

 

2. geturl ()

Returns the real URL of the current page. If the server has redirected the request, we can use this to obtain the URL of the page actually returned.

 

3. getcode ()

Returns the status code of the current request; for example, a successful request has status code 200. For non-http URLs, None is returned.
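
A short sketch combining the two methods, again against the Baidu homepage used earlier:

import urllib

f = urllib.urlopen('http://www.baidu.com/')
print f.geturl()    # the URL actually fetched, after any redirect
print f.getcode()   # e.g. 200 for a successful HTTP request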

 

 

3. Read the webpage source code

This part works just like ordinary file operations, but in most cases it is combined with the re module to filter the data, for example:

import re
import urllib

f = urllib.urlopen('http://www.baidu.com/')
b = f.read()
# illustrative pattern: capture the src attribute of <img> tags
p = re.compile(r'<img.*?src="(.*?)"', re.I)
result = p.findall(b)
print result

 

 

This tries to extract the image addresses from the Baidu homepage. There is of course still much room for improvement; it only serves as a demonstration here.

The use of python's re module for regular expressions will not be covered again here.

 

 

4. Download the corresponding resource

After we have filtered out the URLs of the images we want with regular expressions, we can start downloading them; urllib provides the corresponding download functions.

1. urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Downloads the given url to a local file. If the url points to a local file, or a valid cached copy of the object already exists, no download is performed. Returns a tuple (filename, headers), where filename is the name of the locally saved file and headers is the object returned by the info() method described above.

url: the target url.

filename: the local file name to download to. It can be an absolute or a relative path; if it is not provided, the file is saved to a temporary location.

reporthook: a callback function that is called when the connection is established and again after each data block is received. It is passed three parameters: 1. the number of data blocks transferred so far; 2. the size of a data block in bytes; 3. the total size of the file.

data: when http:// is used, this optional parameter turns the request into a POST request (GET is used by default). The data must be in the standard application/x-www-form-urlencoded format; we can use urlencode() to generate it quickly.
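
A minimal sketch of urlretrieve with a progress callback; the image URL and the local file name here are placeholders, not taken from the original text:

import urllib

def report(block_count, block_size, total_size):
    # progress callback: called once on connect and again after each block
    print 'received %d blocks of %d bytes (total %d bytes)' % (block_count, block_size, total_size)

# placeholder URL and file name
filename, headers = urllib.urlretrieve('http://www.example.com/logo.png', 'logo.png', report)
print filename
print headers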

 

2. urllib.urlcleanup()

Clears the cache that may have been built up by previous calls to urlretrieve().
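
For example, after a download made without an explicit filename (which is cached in a temporary location), the cache can be cleared like this; the URL is again a placeholder:

import urllib

# no filename given, so the file is cached in a temporary location
urllib.urlretrieve('http://www.example.com/logo.png')
# remove any temporary files left over from urlretrieve
urllib.urlcleanup()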

 

 

Other commonly used functions built into the module:

1. urllib.quote(string[, safe])

Encodes the string into url format; safe specifies the characters that should not be converted (by default safe='/').

URL encoding differs from ordinary ASCII. For example, the string 'scolia good' contains an ordinary space character, but in a url it becomes 'scolia%20good': the space character has been replaced by %20.

The conversion rules are as follows:

Letters, digits, and the characters underscore, period, and hyphen are never converted, nor are the characters listed in safe (the slash '/' by default). Everything else is converted: the character is replaced by a percent sign followed by two hexadecimal digits, i.e. '%xx', where xx is the hexadecimal value of the character's ASCII code.

Example:

f = urllib.quote('scolia good')
print f     # scolia%20good

 

 

2. urllib.quote_plus(string[, safe])

Almost the same as above, except that spaces are not converted to %20 but are replaced with a plus sign (+).

Example:

f = urllib.quote_plus('scolia good')
print f     # scolia+good

 

 

3. urllib.unquote(string)

Decodes the url back again; this is the inverse operation of urllib.quote(string[, safe]) above.

4. urllib.unquote_plus(string)

As above, it is the inverse operation of urllib.quote_plus(string[, safe]).
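
A quick round-trip sketch with the string from the earlier examples:

import urllib

print urllib.unquote('scolia%20good')      # scolia good
print urllib.unquote_plus('scolia+good')   # scolia good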

 

5. urllib.urlencode(query[, doseq])

Quickly converts a python dictionary into the url-encoded request format required by the data parameter described above.

Example:

aDict = {'name': 'Georgina Garcia', 'hmdir': '~ggarcia'}
print urllib.urlencode(aDict)

 

Note: once data is supplied, a POST request is made. If you want to use a GET request instead, append a '?' to the url and then attach the converted data.

For example:

GET method:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
print f.read()

POST method:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
print f.read()

 

6. urllib.pathname2url(path)

Converts a local path name into a url-style path. It does not produce a complete URL, and the result has already been processed by the quote() method.

Example:

a = r'd:\abc\def\123.txt'
print urllib.pathname2url(a)

 

7. urllib.url2pathname(path)

Converting a url-style path to a local path is the inverse operation of the above method.

a = r'www.xxx.com/12%203/321/'
print urllib.url2pathname(a)

 

The result has likewise already been decoded, in this case via the unquote() method.

 

 

 

For more information, see the official documentation: https://docs.python.org/2/library/urllib.html

 
