Crawling web pages with Python (1): urllib/urllib2/requests

Document directory
  • 1. Capture simple web pages
  • 2. Download an object
  • 3. Basic use of urllib
  • 4. Basic use of urllib2

I have recently been learning Python again. Unfortunately, I don't get to use it at work, so I can only play with it in my spare time.

1. Capture simple web pages

# coding=utf-8
import urllib2

response = urllib2.urlopen('http://www.pythonclub.org/python-network-application/observer-spider')
html = response.read()
print html

Abbreviated:

# coding=utf-8
import urllib2

content = urllib2.urlopen('http://www.pythonclub.org/python-network-application/observer-spider').read()
print content

Or, with urllib:

# coding=utf-8
import urllib

url = 'http://www.pythonclub.org/python-network-application/observer-spider'
content = urllib.urlopen(url).read()
print content

After running:

>>> ================================ RESTART ================================
>>>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh" dir="ltr">
...

If you encounter

    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden

then the target site is refusing crawler access. The fix is to add header information to the request so that it looks like a normal browser visit:

# coding=utf-8
import urllib2

# urllib2.HTTPError: HTTP Error 403: Forbidden occurs because the site blocks
# crawlers; adding a User-Agent header disguises the request as a browser visit.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'http://blog.csdn.net/julius_lee/article/details/7682909'
req = urllib2.Request(url, headers=headers)
content = urllib2.urlopen(req).read()
print content.decode('u8')

After running:

>>> ================================ RESTART ================================
>>>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...

2. Download an object

Method 1: urllib

import urllib

url = 'http://music.baidu.com/data/music/file?link=http://zhangmenshiting.baidu.com/data2/music/44799414/447992771372651261320.mp3?xcode=760fd59b6ae51718c0bb6b6933610b20a4f508beeec48fde'
urllib.urlretrieve(url, 'E:\\study\\python\\tmp\\123.mp3')

This fetches an MP3 from Baidu Music and saves it as 123.mp3 under E:\study\python\tmp\.

Method 2: Requests

# coding=utf-8
import requests

url = 'http://music.baidu.com/data/music/file?link=http://zhangmenshiting.baidu.com/data2/music/44799414/447992771372651261320.mp3?xcode=760fd59b6ae51718c0bb6b6933610b20a4f508beeec48fde'

print "downloading with requests"
r = requests.get(url)
with open("1234_requests.mp3", "wb") as code:
    code.write(r.content)

Method 3: urllib2

# coding=utf-8
import urllib2

url = 'http://music.baidu.com/data/music/file?link=http://zhangmenshiting.baidu.com/data2/music/44799414/447992771372651261320.mp3?xcode=760fd59b6ae51718c0bb6b6933610b20a4f508beeec48fde'

print "downloading with urllib2"
r = urllib2.urlopen(url)
data = r.read()
with open("1234_urllib2.mp3", "wb") as code:
    code.write(data)

The drawback is that there is no progress indication, so you have to check for yourself whether the download actually completed. The solution is a reporthook callback:

def abc(a, b, c):
    '''
    Progress callback for urllib.urlretrieve.
    a: number of blocks transferred so far
    b: size of each block, in bytes
    c: total size of the file, in bytes
    '''
    process = 100.0 * a * b / c
    if process > 100:
        process = 100
    print '%.2f%%' % process
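For reference, a minimal sketch of how the callback is wired up; the URL and local path here are placeholders, not from the original example:

# Sketch: pass the callback as the third (reporthook) argument of urlretrieve;
# it is called once per block received. abc is the callback defined above,
# the URL and local path are placeholders.
import urllib

url = 'http://www.example.com/some_file.mp3'
urllib.urlretrieve(url, 'E:\\study\\python\\tmp\\progress_demo.mp3', abc)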

Saving the fetched text to a file:

content = r.read()
print "String length:", len(content)
# open a text file
fp = open("163.txt", "w")
# write the data
fp.write(content)
fp.close()

Or:

with open('python.html','w') as code:    code.write(data)

Q: Which one should you choose, and what are the differences?

You can think of urllib2 as an extension of urllib. Its obvious advantage is that urllib2.urlopen can accept a Request object as a parameter, and can therefore control the headers of the HTTP request.
Use urllib2 whenever possible for HTTP requests, but note that a number of functions, such as urllib.urlretrieve and the quote/unquote family (urllib.quote and friends), were never added to urllib2, so urllib is still needed alongside it.
1. urllib2 can accept an instance of the Request class to set the headers of a URL request; urllib accepts only a URL. This means you cannot disguise your User-Agent string with urllib alone.
2. urllib provides the urlencode method for generating GET query strings, while urllib2 does not; this is another reason urllib and urllib2 are often used together. Data for a GET request can be encoded into the URL with urllib.urlencode(data), as sketched below.
urllib: simple and limited. It can download a file from a given URL, and encode or decode strings so that they form valid URL components.
urllib2: provides a variety of handlers and processors that can deal with more complex situations, such as network authentication, proxy servers, and cookies.
HTTP is based on a request/response mechanism: the client issues a request and the server provides a response. urllib2 maps your HTTP request onto a Request object. In its simplest usage you create a Request object for the address you want to fetch; calling urlopen and passing in that Request object returns a response object for the request. The response object behaves like a file object, so you can, for example, call response.read() on it.
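To make the urlencode point concrete, here is a small sketch (the URL and query parameters are invented for the example) that builds a GET query string with urllib and sends the request through urllib2 so the headers can still be controlled:

# coding=utf-8
# Sketch: urllib.urlencode builds the query string, urllib2.Request carries the headers.
# The URL and parameters below are illustrative only.
import urllib
import urllib2

params = urllib.urlencode({'q': 'python', 'page': 1})
req = urllib2.Request('http://www.example.com/search?' + params,
                      headers={'User-Agent': 'Mozilla/5.0'})
print urllib2.urlopen(req).read()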

3. Basic use of urllib

Introduction to Python urllib

I. Purpose of this module:
1. Get data from the specified URL
2. Format the URL string
II. The main functions and classes in the urllib module (__version__ = '1.17') are described below:
1. Functions:
(1) def urlopen(url, data=None, proxies=None)
Parameter description:
url: a URL string (http, ftp, gopher, and local file schemes are supported)
data: a string of data to send to the URL. Both GET and POST can be used, but the data must follow the standard format:
key=value&key1=value1...
proxies: a dictionary of proxy server addresses. If not specified, the setting is taken from the system (on Windows, from the IE settings).
Proxy servers that require authentication are not supported.
Example: proxies = {'http': 'http://www.someproxy.com:8080'}
This example specifies the HTTP proxy server http://www.someproxy.com:8080
Implementation notes:
This function uses the FancyURLopener class, doing the actual work through the open method inherited from URLopener.
Return value:
Returns a file-like object.
The methods of this object are:
read()
readline()
readlines()
fileno()
close()
These behave essentially the same as the corresponding methods of a file object.
info() returns the MIME headers returned by the server.
geturl() returns the real URL; it is called "real" because for redirected URLs the returned URL is the one after redirection.
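A small sketch of what the returned file-like object offers (http://www.python.org is used purely as an example URL):

# Sketch: the object returned by urllib.urlopen can be read like a file and
# also exposes info(), getcode() and geturl().
import urllib

r = urllib.urlopen('http://www.python.org')
print r.info()        # MIME headers sent by the server
print r.getcode()     # HTTP status code, e.g. 200
print r.geturl()      # final URL (after any redirect)
print len(r.read())   # body, read like a file
r.close()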
(2) def urlretrieve(url, filename=None, reporthook=None, data=None):
Parameter description:
url: a URL string
filename: a local file path; the data returned from the URL is saved in this file. If None,
a temporary file is generated.
reporthook: a function reference. You may define the function's behaviour freely; it only needs to accept three parameters,
which urlretrieve passes as:
the first parameter: the number of blocks transferred so far
the second parameter: the size of each block, in bytes
the third parameter: the total size of the file (sometimes -1)
data: a string of data to send to the URL. Both GET and POST can be used, but the data must follow the standard format:
key=value&key1=value1...
Implementation notes:
This function uses the FancyURLopener class, doing the actual work through the retrieve method inherited from URLopener.
Return value:
Returns a tuple (filename, headers)
filename is the filename from the parameter (or the generated temporary file)
headers is the MIME headers returned by the server
(3) def urlcleanup():
Parameter description: no parameters
Implementation notes: this function uses the cleanup method inherited from the URLopener class to do the actual work;
it clears the cache files generated by urlopen or urlretrieve.
Return value: none
(4) def quote(s, safe='/'):
Parameter description:
s: the string to be converted
safe: characters that should not be converted
Implementation notes:
According to RFC 2396, the reserved URL characters are
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
However, these characters are not reserved in every kind of URL,
so when converting different URLs you must set the safe characters accordingly.
Characters that are never converted: letters, digits, '_', '.', and '-'.
Chinese characters can also be used, provided the appropriate encoding is added, e.g. quote(u'python'.encode('gb2312'))

The function converts every character that is neither safe nor exempt into the form %XX, where XX is a two-digit hexadecimal number.
Return value:
The converted string
(5) def quote_plus(s, safe=''):
Parameter description:
s: the string to be converted
safe: characters that should not be converted
Implementation notes:
Basically the same as quote, except that spaces in s are converted to '+' rather than %20.
Return value:
The converted string
(6) def unquote(s):
Parameter description:
s: the string to be decoded
Implementation notes:
The inverse of quote.
Return value:
The decoded string
(7) def unquote_plus(s):
s: the string to be decoded
Implementation notes:
The inverse of quote_plus.
Return value:
The decoded string
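A quick sketch contrasting the four functions; the sample string is invented, and the values in the comments are what Python 2's urllib produces for it:

# Sketch: quote vs quote_plus and their inverses.
import urllib

s = 'a b/c'
print urllib.quote(s)                 # a%20b/c    ('/' is safe by default)
print urllib.quote(s, safe='')        # a%20b%2Fc  (nothing treated as safe)
print urllib.quote_plus(s)            # a+b%2Fc    (space becomes '+')
print urllib.unquote('a%20b/c')       # a b/c
print urllib.unquote_plus('a+b%2Fc')  # a b/c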
(8) def urlencode(query, doseq=0):
Parameter description:
query: either a sequence of two-element tuples or a dictionary
doseq: controls how sequence values are serialized into the string
Implementation notes:
Joins the key/value pairs into a parameter string in the form required by URLs.
For example:

query = (('name', 'cs'), ('age', '1'), ('height', '2'))
re = urllib.urlencode(query)
print re
query1 = {'name': 'cs', 'age': '1', 'height': '2'}
print urllib.urlencode(query1)

The two forms behave essentially the same, except that the order of the dictionary form follows the dictionary's own internal ordering.
Output:
name=cs&age=1&height=2
age=1&name=cs&height=2
As for the meaning of the doseq parameter: the documentation is terse, so I experimented, and the difference lies in how sequence values are serialized:

query1 = {'name': 'cs', 'age': ('a', 'b'), 'height': '1'}
print urllib.urlencode(query1, 1)
print urllib.urlencode(query1, 0)

Output:
age=a&age=b&name=cs&height=1
age=%28%27a%27%2C+%27b%27%29&name=cs&height=1
Return value:
The parameter string joined with '&'
(9) def url2pathname(pathname):
Parameter description:
pathname: a URL path string
Implementation notes:
This function decides, based on the operating system, how to convert the '/' separators in the URL into the local path separator (e.g. '\' on Windows);
otherwise it behaves like unquote (the string is passed through unquote).
Return value:
A string in the local file path format
(10) def pathname2url(pathname):
Parameter description:
pathname: a string in the local file path format
Implementation notes:
This function decides, based on the operating system, how to convert the local path separators into the '/' used in URLs;
otherwise it behaves like quote (the string is passed through quote).
Return value:
A URL-compliant string
(Note: functions 9 and 10 are generally not used directly; they are used inside the interface functions that address network and local resources in a uniform way.)
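A small sketch of the pair in use; the path is illustrative and the exact output depends on the operating system (the values in the comments assume Windows):

# Sketch: converting between a local path and a URL path component.
import urllib

p = urllib.pathname2url('C:\\mywork\\workspace\\Greetings.java')
print p                       # e.g. ///C:/mywork/workspace/Greetings.java
print urllib.url2pathname(p)  # back to C:\mywork\workspace\Greetings.java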

Examples:

# Proxy server
# coding=utf-8
import urllib

url = 'http://www.python.org'
proxies = {'http': 'http://127.0.0.1:8086/'}
r = urllib.urlopen(url, proxies=None)   # choose not to use the proxy server
print r.info()
print r.getcode()
print r.geturl()

Output:
Date: Sun, 07 Jul 2013 02:15:06 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Sat, 06 Jul 2013 21:01:42 GMT
ETag: "105800d-5245-4e0de1e445980"
Accept-Ranges: bytes
Content-Length: 21061
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
200
http://www.python.org

# coding=utf-8
import urllib

proxies = {'http': 'http://127.0.0.1:8087'}
url = 'http://www.python.org'
r = urllib.urlopen(url, proxies=proxies)   # choose to open through the proxy server
print r.info()
print r.getcode()
print r.geturl()

Output:
Content-Length: 21061
Via: HTTP/1.1 GWA            # via the proxy
Accept-Ranges: bytes
Vary: Accept-Encoding
Server: Apache/2.2.16 (Debian)
Last-Modified: Sat, 06 Jul 2013 21:01:42 GMT
ETag: "105800d-5245-4e0de1e445980"
Date: Sun, 07 Jul 2013 06:35:57 GMT
Content-Type: text/html
200
http://www.python.org

# Open a local file
# coding=utf-8
import urllib

url = 'file:C:\mywork\workspace\Greetings.java'
r = urllib.urlopen(url, proxies=None)
print r.info()
print r.getcode()
print r.geturl()
print r.read()

Output:
Content-Type: text/plain
Content-Length: 258
Last-Modified: Thu, 07 Jun 2012 14:50:19 GMT
None
C:%5Cmywork%5Cworkspace%5CGreetings.java
import javax.swing.*;

class Greetings {
    public static void main(String[] args) {
        String fullName;
        fullName = JOptionPane.showInputDialog(null, "What is your name?");
        JOptionPane.showMessageDialog(null, "Nice to meet you, " + fullName + ".");
        System.exit(0);
    }
}

# Show the file download progress
import urllib

def abc(a, b, c):
    '''
    a: number of blocks transferred so far
    b: size of each block
    c: total size of the file
    '''
    process = 100.0 * a * b / c
    if process > 100:
        process = 100
    print '%.2f%%' % process

url = 'http://music.baidu.com/data/music/file?link=http://zhangmenshiting.baidu.com/data2/music/44054399/34182398136800320.mp3?xcode=...'
urllib.urlretrieve(url, 'E:\\study\\python\\tmp\\wind blows the waves', abc)

Output (the download runs to completion):
0.00% 0.10% 0.21% 0.31% 0.41% ... 100.00%
# GET method
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
print f.read()

# POST method
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
print f.read()
# Use an explicitly specified HTTP proxy
>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()

# Use no proxies at all
>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()

4. Basic use of urllib2

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It provides a very simple interface in the form of the urlopen function, which makes it possible to fetch URLs over a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are handled by objects called openers and handlers.

# coding=utf-8
import urllib2

url = 'http://www.python.org'
req = urllib2.Request(url)
r = urllib2.urlopen(req)
data = r.read()
print data

1. urllib2.Request() constructs the request information; the returned req is the constructed request object.

Additional request information (metadata) can also be attached through the Request object.

2. urllib2.urlopen() sends the constructed request req and returns a file-like object r that contains everything the server returned.

3. The body can then be read with r.read(), which is the source code of the returned page.

# coding=utf-8
import urllib2, urllib

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'http://blog.csdn.net/julius_lee/article/details/7682909'
values = {'name': 'Michael Froord', 'location': 'Northampton', 'language': 'python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers=headers)
content = urllib2.urlopen(req).read()
print content.decode('u8')

This constructs a request with additional information through urllib2.Request: the HTTP headers and the data to send. The data must be encoded into the standard form first and then passed to the Request object as the data parameter; the encoding is done with urllib, not urllib2.

Core classes in urllib2:
Request: a specific URL request, containing all the information of that request.
OpenerDirector: works together with BaseHandler; it processes different requests by combining different handlers.
BaseHandler: the class that actually carries out request processing; the concrete request-handling classes inherit from it.

In urllib2, a request goes through three phases: request, open, and response.
request: builds all the information the Request object needs, such as the HTTP headers.
open: processes the specific request, wraps the Request object, calls lower-level classes to perform the request, and returns a response.
response: processes the returned response object.
There is, of course, also an error-handling phase afterwards, but it is not triggered automatically (a sketch of how errors surface follows below).
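As a hedged illustration of that error path (the URL below is a placeholder), failures raised during the open phase normally reach the caller as HTTPError or URLError:

# coding=utf-8
# Sketch: HTTPError covers 4xx/5xx responses, URLError covers failures to
# reach the server at all. The URL is a placeholder.
import urllib2
from urllib2 import HTTPError, URLError

try:
    r = urllib2.urlopen('http://www.example.com/does-not-exist')
    print r.read()
except HTTPError, e:
    print 'HTTP Error:', e.code
except URLError, e:
    print 'URL Error:', e.reason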

OpenerDirector

Because each request is actually carried out by different handlers, and one request may involve many handlers, the class that implements this coupling is OpenerDirector. It can register (add) different handlers to help process a request. Handlers generally follow the naming convention protocol_request | protocol_open | protocol_response, corresponding to the three phases for each protocol.

Handler

urllib2 provides many handlers for different kinds of requests. The common HTTPHandler and FTPHandler are easy to understand; here we single out HTTPCookieProcessor and HTTPRedirectHandler.

HTTPCookieProcessor handles cookies, which are essential for many requests that need authentication. In Python, cookie handling is done by the cookielib module; this handler simply calls its methods, adding the cookie to the request during the request phase and parsing the cookie out of the response during the response phase.
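A minimal sketch of this handler together with cookielib (the URL is a placeholder):

# Sketch: a CookieJar attached via HTTPCookieProcessor stores the cookies the
# server sets, and re-sends them automatically on later requests made through
# the same opener. The URL is a placeholder.
import cookielib
import urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.open('http://www.example.com/login')
for cookie in jar:
    print cookie.name, cookie.value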

HTTPRedirectHandler is the handler that deals with 30x redirect statuses.

Error handler

Error handling in urllib2 is special in that it is carried out by its own handler, HTTPErrorProcessor.

urlopen, install_opener, build_opener

These are module-level functions of urllib2. The urllib2 module keeps a global variable holding the current OpenerDirector instance.
The urlopen function calls the open method of that OpenerDirector instance.
install_opener installs an OpenerDirector instance as the current global opener.
The most important one is build_opener, which determines which handlers the OpenerDirector contains (a usage sketch follows).
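A short sketch of how the three fit together; the proxy address is a placeholder, and build_opener adds the default handlers alongside the one passed in:

# Sketch: build an OpenerDirector with an extra handler, install it globally,
# and subsequent urllib2.urlopen calls go through it. Proxy address is a placeholder.
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.python.org').getcode()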

