6.1 The simplest crawler
A web crawler is a program that automatically extracts web pages. It downloads pages from the World Wide Web for a search engine and is an important component of search engines. Python's urllib and urllib2 modules make this easy to implement. The following example shows how to download the Baidu homepage. The code is as follows:
import urllib2
page = urllib2.urlopen("http://www.baidu.com")
print page.read()
6.2 Submitting form data
(1) Submitting data with the GET method
Submitting a form with the GET method encodes the form data into the URL: a question mark is appended to the requested page, followed by the form elements. For example, searching Baidu for "马伊琍" (Ma Yili) gives the URL http://www.baidu.com/s?wd=%E9%A9%AC%E4%BC%8A%E7%90%8D&pn=100&rn=20&ie=utf-8&usm=4&rsv_page=1, where everything after the ? is form data. wd=%E9%A9%AC%E4%BC%8A%E7%90%8D says the search term is "马伊琍"; pn says the display starts from the page containing the 100th result (I tried it several times: with 100 it did start there, but with 10 the display started from page 1); rn=20 says 20 entries are shown per page; ie=utf-8 is the encoding format; usm=4 I do not understand (I tried 1, 2, and 3 but found no changes); rsv_page=1 is the page number. To download this page, you can simply fetch the URL above. For example:
import urllib
import urllib2
keyword = urllib.quote('马伊琍')
page = urllib2.urlopen("http://www.baidu.com/s?wd=" + keyword + "&pn=100&rn=20&ie=utf-8&usm=4&rsv_page=1")
print page.read()
(2) Submitting data with the POST method
With the GET method, the data is added to the URL, which only suits small amounts of data. If you need to exchange a large amount of data, the POST method is the way to go. The earlier post "python simulated 163 login to get mail list" serves as an example; its code is not listed here. For details, see http://www.cnblogs.com/xiaowuyi/archive/2012/05/21/2511428.html.
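As a minimal sketch of a POST with urllib2 (this is not the 163 login code from that post; the URL and field names below are placeholders):
import urllib
import urllib2
url = "http://www.example.com/login"  # placeholder URL
data = urllib.urlencode({"username": "hank", "passwd": "hjz"})  # placeholder field names
page = urllib2.urlopen(url, data)  # passing data makes urlopen send a POST request
print page.read()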
6.3 Introduction to urllib, urllib2, httplib, and mechanize
6.3.1 urllib module (reference: http://my.oschina.net/duhaizhang/blog/68893)
The urllib module provides an interface that lets us read WWW and FTP data as if it were a local file. The two most important functions in the module are urlopen() and urlretrieve().
urllib.urlopen(url[, data[, proxies]]):
This function creates a file-like object representing a remote URL, which you then operate on like a local file to obtain remote data. The parameter url is the path of the remote data, usually a URL; data is the data submitted to the url by POST; proxies sets a proxy. urlopen returns a file-like object that provides the following methods:
read(), readline(), readlines(), fileno(), close(): used the same way as on file objects;
info(): returns an httplib.HTTPMessage object representing the headers returned by the remote server;
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request completed successfully and 404 means the URL was not found;
geturl(): returns the requested URL;
The code is as follows:
#!/usr/bin/env python
# coding=utf-8
import urllib
content = urllib.urlopen("http://www.baidu.com")
print "http header:", content.info()
print "http status:", content.getcode()
print "url:", content.geturl()
print "content:"
for line in content.readlines():
    print line
urllib.urlretrieve(url[, filename[, reporthook[, data]]]):
The urlretrieve method downloads remote data directly to the local machine. The parameter filename specifies the local save path (if omitted, urllib generates a temporary file to hold the data); reporthook is a callback function triggered when the server connection is established and each time a data block arrives (i.e. it is called once per block downloaded), which we can use to display download progress or to limit speed; the example below shows the download progress. The parameter data is the data POSTed to the server. The method returns a two-element tuple (filename, headers): filename is the local save path and headers is the server's response header.
The code is as follows:
#!/usr/bin/env python
# coding: utf-8
"""Download a file and display the download progress."""
import urllib

def DownCall(count, size, total_filesize):
    """count: number of blocks downloaded, size: block size, total_filesize: total file size."""
    per = 100.0 * count * size / total_filesize
    if per > 100:
        per = 100
    print "Already downloaded %d KB (%.2f%%)" % (count * size / 1024, per)

url = "http://www.research.rutgers.edu/~rohanf/lp1"
localfilepath = r"C:\Users\Administrator\Desktop\download.pdf"
urllib.urlretrieve(url, localfilepath, DownCall)
urllib also provides helper methods for encoding and decoding URLs. A URL may not contain certain special characters, and some characters have special purposes. For example, when submitting data with GET, strings of the form key=value are appended to the URL, so '=' is not allowed inside the value and must be encoded; when the server receives the parameters, it decodes them back into the original data. This is where these helper methods come in:
urllib.quote(string[, safe]): encodes the string; the parameter safe specifies characters that should not be encoded;
urllib.unquote(string): decodes the string;
urllib.quote_plus(string[, safe]): like urllib.quote, but replaces ' ' with '+', whereas quote uses '%20';
urllib.unquote_plus(string): decodes the string;
urllib.urlencode(query[, doseq]): converts a dict or a list of two-element tuples into URL parameters; for example, the dictionary {'name': 'dark-bull', 'age': 200} is converted to "name=dark-bull&age=200";
urllib.pathname2url(path): converts a local path to a URL path;
urllib.url2pathname(path): converts a URL path to a local path;
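A minimal sketch of these helpers in action (note that key order from urlencode on a dict is not guaranteed):
import urllib
print urllib.urlencode({'name': 'dark-bull', 'age': 200})  # name=dark-bull&age=200 (key order may vary)
print urllib.quote('some value/here')   # some%20value/here ('/' is in the default safe set)
print urllib.quote_plus('some value')   # some+value
print urllib.unquote('some%20value')    # some value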
6.3.2 urllib2 module (reference: http://hankjin.blog.163.com/blog/static/3373193720105140583594)
There are three main ways to access web pages with Python: urllib, urllib2, and httplib.
urllib is fairly simple and its features are relatively weak; httplib is simple and powerful, but does not seem to support sessions.
(1) The simplest page access:
res = urllib2.urlopen(url)
print res.read()
(2) Passing data with GET or POST:
data = {"name": "hank", "passwd": "hjz"}
urllib2.urlopen(url, urllib.urlencode(data))
(3) Adding an HTTP header:
header = {"User-Agent": "Mozilla-Firefox5.0"}
req = urllib2.Request(url, urllib.urlencode(data), header)
urllib2.urlopen(req)
Using an opener and handler:
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
(4) Adding a session:
cj = cookielib.CookieJar()
cjhandler = urllib2.HTTPCookieProcessor(cj)
opener = urllib2.build_opener(cjhandler)
urllib2.install_opener(opener)
(5) Adding basic authentication:
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://www.163.com/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
(6) Using a proxy:
proxy_support = urllib2.ProxyHandler({"http": "http://1.2.3.4:3128/"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
(7) Setting a timeout:
import socket
socket.setdefaulttimeout(5)
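Putting a few of these snippets together, a minimal sketch of fetching a page through a cookie-aware opener with a custom header (the URL and User-Agent are just the placeholders used above):
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # urlopen now goes through the cookie-aware opener
req = urllib2.Request("http://www.baidu.com", headers={"User-Agent": "Mozilla-Firefox5.0"})
res = urllib2.urlopen(req)
print res.getcode()
print len(cj), "cookie(s) received"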
6.3.3 httplib module (source: http://hi.baidu.com/avengert/item/be5daec8517b12ddee183b81)
httplib is the HTTP client implementation in Python and can be used to interact with an HTTP server. There is not a lot to httplib, and it is fairly simple. Here is a simple example that uses httplib to fetch the HTML of the Google homepage:
# coding=gbk
import httplib
conn = httplib.HTTPConnection("www.google.cn")
conn.request('GET', '/')
print conn.getresponse().read()
conn.close()
The commonly used classes and methods provided by httplib are described below.
httplib.HTTPConnection(host[, port[, strict[, timeout]]])
The constructor of the HTTPConnection class; an HTTPConnection represents one interaction with the server, i.e. one request/response. The parameter host is the server host, for example www.csdn.net; port is the port number, defaulting to 80; the parameter strict defaults to False and controls whether a BadStatusLine exception is raised when the status line returned by the server cannot be parsed (a typical status line looks like HTTP/1.0 200 OK); the optional parameter timeout is the timeout period.
Methods provided by HTTPConnection:
HTTPConnection.request(method, url[, body[, headers]])
Calling the request method sends a request to the server. method is the request method, commonly GET or POST; url is the URL of the requested resource; body is the data submitted to the server and must be a string (with POST, the body can be thought of as the data of an HTML form); headers are the HTTP headers of the request.
HTTPConnection.getresponse()
Obtains the HTTP response. The returned object is an instance of HTTPResponse, which is described below.
HTTPConnection.connect()
Connects to the HTTP server.
HTTPConnection.close()
Closes the connection to the server.
HTTPConnection.set_debuglevel(level)
Sets the debug level. The parameter level defaults to 0, meaning no debugging information is output.
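As a quick sketch of these methods together, a POST request with HTTPConnection might look like this (the host, path, and form fields are placeholders):
import httplib
import urllib
params = urllib.urlencode({"name": "hank", "passwd": "hjz"})  # placeholder form fields
headers = {"Content-Type": "application/x-www-form-urlencoded"}
conn = httplib.HTTPConnection("www.example.com")  # placeholder host
conn.request("POST", "/login", params, headers)   # placeholder path
res = conn.getresponse()
print res.status, res.reason
conn.close()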
httplib.HTTPResponse
An HTTPResponse represents the server's response to a client request. It is usually created by calling HTTPConnection.getresponse() and has the following methods and attributes:
HTTPResponse.read([amt])
Obtains the response message body. If the request is for an ordinary web page, this method returns the page's HTML. The optional parameter amt reads the specified number of bytes from the response stream.
HTTPResponse.getheader(name[, default])
Obtains a response header. name is the header field name; the optional parameter default is returned as the default value if the header field does not exist.
HTTPResponse.getheaders()
Returns all the headers as a list.
HTTPResponse.msg
All the response header information.
HTTPResponse.version
The HTTP protocol version used by the server: 11 means HTTP/1.1, 10 means HTTP/1.0.
HTTPResponse.status
The status code of the response; for example, 200 means the request succeeded.
HTTPResponse.reason
The server's description of the result of handling the request, usually "OK".
The following example exercises the methods of HTTPResponse:
# coding=gbk
import httplib
conn = httplib.HTTPConnection("www.g.cn", 80, False)
conn.request('GET', '/', headers={"Host": "www.google.cn",
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5",
    "Accept": "text/plain"})
res = conn.getresponse()
print 'version:', res.version
print 'reason:', res.reason
print 'status:', res.status
print 'msg:', res.msg
print 'headers:', res.getheaders()
# html
# print '\n' + '-' * 50 + '\n'
# print res.read()
conn.close()
The httplib module also defines many constants, for example:
httplib.HTTP_PORT has the value 80, the default port number;
httplib.OK has the value 200, meaning the request succeeded;
httplib.NOT_FOUND has the value 404, meaning the requested resource does not exist;
You can use httplib.responses to look up the meaning of these status codes, for example:
print httplib.responses[httplib.NOT_FOUND]
6.3.4 mechanize module
mechanize is not introduced in full here; I just wrote a simple example, as follows.
# -*- coding: cp936 -*-
import time, string
import mechanize, urllib
from mechanize import Browser

urlname = urllib.quote('马伊琍')
br = Browser()
br.set_handle_robots(False)  # ignore robots.txt
urlhttp = r'http://www.baidu.com/s?wd=' + urlname + "&pn=10&rn=20&ie=utf-8&usm=4&rsv_page=1"
response = br.open(urlhttp)
filename = 'temp.html'
f = open(filename, 'w')
f.write(response.read())
f.close()
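mechanize can also fill in and submit forms directly instead of building the query URL by hand. A minimal sketch (the form name "f" and field name "wd" are assumptions about Baidu's search form at the time):
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots.txt, as above
br.open("http://www.baidu.com")
br.select_form(name="f")      # assumed form name
br["wd"] = "mechanize"        # assumed name of the search box field
response = br.submit()
print response.geturl()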