Python Network Programming Learning Notes (6): Web Client Access


6.1 The simplest crawler

A web crawler is a program that automatically retrieves web pages: it downloads pages from the World Wide Web and is an important component of search engines. Python's urllib and urllib2 modules make this easy to implement. The following example downloads the Baidu home page. The specific code is as follows:


import urllib2

page = urllib2.urlopen("http://www.baidu.com")
print page.read()

6.2 Submitting form data

(1) Submitting data with the GET method

The GET method submits form data by encoding it into the URL: after the path of the requested page comes a question mark, followed by the form fields. For example, searching Baidu for "马伊琍" (Ma Yili) gives the GET URL http://www.baidu.com/s?wd=%e9%a9%ac%e4%bc%8a%e7%90%8d&pn=100&rn=20&ie=utf-8&usm=4&rsv_page=1, in which everything after the question mark is form data: wd=%e9%a9%ac%e4%bc%8a%e7%90%8d is the URL-encoded search term; pn appears to be the index of the first result shown (at least that is how it felt after several tries: with pn=100 the listing started at result 100, but with pn=10 it still showed page 1); rn=20 requests 20 results per page; ie=utf-8 is the character encoding; usm=4 is unclear (changing it to 1, 2, or 3 made no visible difference); and rsv_page=1 is the page number. The simplest way to download this page is to fetch that URL directly, for example:


# -*- coding: utf-8 -*-
import urllib
import urllib2

keyword = urllib.quote('马伊琍')   # URL-encode the search term (Ma Yili)
page = urllib2.urlopen("http://www.baidu.com/s?wd=" + keyword +
                       "&pn=100&rn=20&ie=utf-8&usm=4&rsv_page=1")
print page.read()
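
The query string can also be built with urllib.urlencode (covered in 6.3.1 below) rather than by hand; a minimal sketch of the same request:

# -*- coding: utf-8 -*-
import urllib
import urllib2

# urlencode quotes each value; the parameter order may differ from the
# hand-built URL because dict ordering is arbitrary in Python 2.
params = urllib.urlencode({'wd': '马伊琍', 'pn': 100, 'rn': 20,
                           'ie': 'utf-8', 'usm': 4, 'rsv_page': 1})
page = urllib2.urlopen("http://www.baidu.com/s?" + params)
print page.read()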

(2) Submitting data with the POST method

With GET the data is appended to the URL, which limits how much can be sent; when a large amount of data needs to be exchanged, POST is the better approach. An earlier post on this blog, "Python simulates a 163 login and fetches the mailing list", is an example of this; the code is not repeated here, see: http://www.cnblogs.com/xiaowuyi/archive/2012/05/21/2511428.html.
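
The general shape of a POST with urllib2 is short enough to sketch here (the URL and field names below are placeholders, not the 163 login):

import urllib
import urllib2

data = urllib.urlencode({'username': 'user', 'passwd': 'secret'})
# Passing a data argument makes urllib2 send a POST instead of a GET.
response = urllib2.urlopen('http://www.example.com/login', data)
print response.read()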

6.3 Introduction to urllib, urllib2, httplib, and mechanize
6.3.1 The urllib module (source: http://my.oschina.net/duhaizhang/blog/68893)
The urllib module provides an interface that lets us read data from the WWW and FTP as if we were accessing local files. The two most important functions in the module are urlopen() and urlretrieve().

urllib.urlopen(url[, data[, proxies]]):
This function creates a file-like object representing the remote URL; operating on that object as you would on a local file retrieves the remote data. The url parameter is the path to the remote data, typically a URL; data is the data to POST to the URL; proxies is used to configure a proxy. The file-like object returned by urlopen provides the following methods:

read(), readline(), readlines(), fileno(), close(): used exactly like their file-object counterparts;
info(): returns an httplib.HTTPMessage object representing the headers returned by the remote server;
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request completed successfully, 404 means the URL was not found;
geturl(): returns the requested URL;

For example:

#!/usr/bin/env python
# coding=utf-8
import urllib

content = urllib.urlopen("http://www.baidu.com")
print "HTTP header:", content.info()
print "HTTP status:", content.getcode()
print "URL:", content.geturl()
print "content:"
for line in content.readlines():
    print line

urllib.urlretrieve(url[, filename[, reporthook[, data]]]):
The urlretrieve method downloads remote data directly to a local file. The filename parameter specifies the local path to save to (if omitted, urllib generates a temporary file to hold the data). The reporthook parameter is a callback function, triggered when the connection to the server is established and then once for each block of data transferred; it can be used to display the current download progress, as the following example shows, or to throttle the download. The data parameter is data to POST to the server. The method returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers is the server's response headers.


#!/usr/bin/env python
# coding: utf-8
"""Downloads a file and displays the download progress."""
import urllib

def downcall(count, size, total_filesize):
    """count is the number of blocks downloaded so far, size is the size
    of one block, and total_filesize is the total size of the file."""
    per = 100.0 * count * size / total_filesize
    if per > 100:
        per = 100
    print "Already downloaded %d KB (%.2f%%)" % (count * size / 1024, per)

url = "http://www.research.rutgers.edu/~rohanf/LP.pdf"
localfilepath = r"C:\Users\Administrator\Desktop\download.pdf"
urllib.urlretrieve(url, localfilepath, downcall)

urllib also provides a number of helper methods for encoding and decoding URLs. Some special characters cannot appear in a URL, and others have special meanings there. For instance, when data is submitted via GET, key=value strings are appended to the URL, so '=' is not allowed inside a value; the data must therefore be encoded before it is sent, and the server decodes the parameters it receives to recover the original values. This is where these helper methods come in:

urllib.quote(string[, safe]): encodes a string; the safe parameter specifies characters that should not be encoded;
urllib.unquote(string): decodes a string;
urllib.quote_plus(string[, safe]): like urllib.quote, but replaces spaces with '+', whereas quote replaces them with '%20';
urllib.unquote_plus(string): decodes a string encoded by quote_plus;
urllib.urlencode(query[, doseq]): converts a dict, or a list of two-element tuples, into a URL query string; for example, the dictionary {'name': 'dark-bull', 'age': 200} is converted to 'name=dark-bull&age=200';
urllib.pathname2url(path): converts a local path to a URL path;
urllib.url2pathname(path): converts a URL path to a local path;
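
A quick sketch of these helpers in action (expected output in the comments):

import urllib

print urllib.quote('a=b/c')             # a%3Db/c  ('/' is safe by default)
print urllib.quote_plus('hello world')  # hello+world
print urllib.unquote('hello%20world')   # hello world
print urllib.urlencode({'name': 'dark-bull', 'age': 200})
                                        # name=dark-bull&age=200 (order may vary)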

6.3.2 The urllib2 module (source: http://hankjin.blog.163.com/blog/static/3373193720105140583594/)
There are three main ways to access a web page in Python: urllib, urllib2, and httplib. urllib is the simplest, but its functionality is relatively weak; httplib is simple and powerful, but does not seem to support sessions.
(1) The simplest page access
res = urllib2.urlopen(url)
print res.read()
(2) Adding data, for GET or POST
data = {"name": "hank", "passwd": "hjz"}
urllib2.urlopen(url, urllib.urlencode(data))
(3) Adding HTTP headers (urlopen itself takes no headers argument; they are attached through a Request object)
header = {"User-Agent": "Mozilla-Firefox5.0"}
req = urllib2.Request(url, urllib.urlencode(data), header)
urllib2.urlopen(req)

Using an opener and a handler:
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
(4) Adding a session
cj = cookielib.CookieJar()
cjhandler = urllib2.HTTPCookieProcessor(cj)
opener = urllib2.build_opener(cjhandler)
urllib2.install_opener(opener)
(5) Adding basic authentication
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://www.163.com/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
(6) Using a proxy
proxy_support = urllib2.ProxyHandler({"http": "http://1.2.3.4:3128/"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
(7) Setting a timeout
socket.setdefaulttimeout(5)
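
Putting several of these pieces together, here is a minimal runnable sketch (the target URL is just an example) that installs a cookie-aware opener, sets a User-Agent header, and applies a global socket timeout:

import socket
import cookielib
import urllib2

socket.setdefaulttimeout(5)  # applies to all new socket connections (seconds)

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # later urlopen() calls share the cookie jar

req = urllib2.Request("http://www.baidu.com",
                      headers={"User-Agent": "Mozilla-Firefox5.0"})
print urllib2.urlopen(req).read()[:200]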

6.3.3 The httplib module (source: http://hi.baidu.com/avengert/item/be5daec8517b12ddee183b81)
httplib is the client-side implementation of the HTTP protocol in Python and can be used to interact with HTTP servers. There is not much to httplib, and it is fairly simple. The following is a very simple example that uses httplib to fetch the HTML of the Google home page:


# coding=gbk
import httplib

conn = httplib.HTTPConnection("www.google.cn")
conn.request('GET', '/')
print conn.getresponse().read()
conn.close()

The following is a more detailed description of the common types and methods provided by httplib.
httplib.HTTPConnection(host[, port[, strict[, timeout]]]):
The constructor of the HTTPConnection class, which represents one interaction with the server, i.e. one request/response. The host parameter is the server host, e.g. www.csdn.net; port is the port number, defaulting to 80; strict, which defaults to False, determines whether a BadStatusLine exception is raised when the status line returned by the server cannot be parsed (a typical status line looks like: HTTP/1.0 200 OK); the optional timeout parameter is the timeout in seconds.
HTTPConnection provides the following methods:
HTTPConnection.request(method, url[, body[, headers]]):
Calling the request method sends one request to the server. method is the HTTP request method, commonly GET or POST; url is the URL of the requested resource; body is the data submitted to the server and must be a string (if method is "POST", body can be thought of as the data from an HTML form); headers is a dict of the request's HTTP headers.
HTTPConnection.getresponse():
Gets the HTTP response. The returned object is an instance of HTTPResponse, explained below.
HTTPConnection.connect():
Connects to the HTTP server.
HTTPConnection.close():
Closes the connection to the server.
HTTPConnection.set_debuglevel(level):
Sets the debug level. The level parameter defaults to 0, meaning no debugging output is produced.
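
As an aside, the body and headers parameters of request are exactly what a form POST through httplib uses; a brief sketch (the host and path are placeholders):

import httplib
import urllib

params = urllib.urlencode({'name': 'dark-bull', 'age': 200})
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
conn = httplib.HTTPConnection('www.example.com')
conn.request('POST', '/form', params, headers)  # body and headers together
res = conn.getresponse()
print res.status, res.reason
conn.close()
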
httplib.HTTPResponse:
HTTPResponse represents the server's response to a client request. It is usually created by calling HTTPConnection.getresponse() and has the following methods and attributes:
HTTPResponse.read([amt]):
Gets the body of the response. If the request was for an ordinary web page, this method returns the page's HTML. The optional amt parameter reads the specified number of bytes from the response stream.
HTTPResponse.getheader(name[, default]):
Gets a response header. name is the name of the header field; the optional default parameter is returned as the default value if that header field does not exist.
HTTPResponse.getheaders():
Returns all the headers as a list.
HTTPResponse.msg:
Gets all the response headers.
HTTPResponse.version:
The version of the HTTP protocol used by the server: 11 means HTTP/1.1, 10 means HTTP/1.0.
HTTPResponse.status:
The status code of the response, e.g. 200 means the request succeeded.
HTTPResponse.reason:
A description of the outcome of the request, generally "OK".
The following example exercises the methods of HTTPResponse:


# coding=gbk
import httplib

# Port 80 is given explicitly so that False lands on the strict parameter.
conn = httplib.HTTPConnection("www.g.cn", 80, False)
conn.request('GET', '/', headers={"Host": "www.google.cn",
                                  "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5",
                                  "Accept": "text/plain"})
res = conn.getresponse()
print 'version:', res.version
print 'reason:', res.reason
print 'status:', res.status
print 'msg:', res.msg
print 'headers:', res.getheaders()
# html
# print '\n' + '-' * 50 + '\n'
# print res.read()
conn.close()

The httplib module also defines many constants, for example:
httplib.HTTP_PORT has the value 80, the default HTTP port;
httplib.OK has the value 200, indicating that the request succeeded;
httplib.NOT_FOUND has the value 404, indicating that the requested resource does not exist;
You can look up the meaning of a status code through httplib.responses, for example:
print httplib.responses[httplib.NOT_FOUND]   # prints 'Not Found'
6.3.4 mechanize
I did not find a complete introduction to mechanize, so I wrote a simple example, shown below.


# -*- coding: cp936 -*-
import time, string
import mechanize, urllib
from mechanize import Browser

urlname = urllib.quote('马伊琍')   # URL-encode the search term (Ma Yili)
br = Browser()
br.set_handle_robots(False)        # ignore robots.txt
urlhttp = r'http://www.baidu.com/s?wd=' + urlname + "&pn=10&rn=20&ie=utf-8&usm=4&rsv_page=1"
response = br.open(urlhttp)
filename = 'temp.html'
f = open(filename, 'w')
f.write(response.read())
f.close()
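
mechanize can also fill in and submit forms directly, which fits the form-submission theme of section 6.2; a minimal hedged sketch (assuming the first form on the Baidu home page is the search form, with its field named wd as observed earlier):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://www.baidu.com')
br.select_form(nr=0)   # assume the search box is the first form on the page
br['wd'] = 'test'      # 'wd' is the search field name seen in section 6.2
response = br.submit()
print response.geturl()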
