3.0 Web Clients
Most online activity happens on the Web, which people usually access through a web browser. It's useful, then, to be able to write a client that accesses the Web using the HTTP protocol. This chapter shows how to use the twisted.web.client module to work with resources on the Internet, including downloading pages, using HTTP authentication, uploading files, and working with HTTP headers.
3.1 Downloading a Web Page
The simplest and most common task for a web client is downloading a page. The client connects to the server, sends an HTTP GET request, and receives an HTTP response containing the page.
3.1.1 How Do I Do That?
You can use Twisted's built-in protocol implementations to get to work quickly. The twisted.web package includes a complete HTTP implementation, so there's no need to write the Protocol and ClientFactory classes yourself. It also provides utility functions for making HTTP requests. To fetch a page, use twisted.web.client.getPage. Example 3-1, the webcat.py script, downloads a page from a URL and prints it.
from twisted.web import client
from twisted.internet import reactor
import sys

def printPage(data):
    print data
    reactor.stop()

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()
    reactor.stop()

if len(sys.argv) == 2:
    url = sys.argv[1]
    client.getPage(url).addCallback(printPage).addErrback(printError)
    reactor.run()
else:
    print "Usage: webcat.py <URL>"
Run webcat.py with a URL as its argument, and it prints the source of that page:
$ python webcat.py http://www.oreilly.com/
<!DOCTYPE html PUBLIC "-//...
3.1.2 How Does That Work?
The printPage and printError functions are simple event handlers that print the downloaded page and any error message. The most important line is client.getPage(url). This function returns a Deferred object that will fire asynchronously to deliver the result when the download completes.
Note how the callback and errback are added to the Deferred on a single line. This works because addCallback and addErrback each return a reference to the Deferred. Therefore, the statements:
d = deferredFunction()
d.addCallback(resultHandler)
d.addErrback(errorHandler)
are equivalent to:
deferredFunction().addCallback(resultHandler).addErrback(errorHandler)
Both styles work; the second, chained form is more common in Twisted code.
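To see the chaining in isolation, here is a minimal, self-contained sketch (not part of webcat.py) using an already-fired Deferred from twisted.internet.defer; each callback's return value becomes the input to the next one:
import sys
from twisted.internet import defer

def addOne(result):
    return result + 1            # the return value feeds the next callback

d = defer.succeed(1)             # a Deferred that has already fired with the value 1
d.addCallback(addOne).addCallback(addOne).addCallback(
    lambda result: sys.stdout.write("final result: %d\n" % result))   # prints 3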
3.1.3 What About Large Files?
What if you want to write the downloaded page to disk? Keeping the entire page in memory, as the Example 3-1 script does, can be a problem when you're downloading a large file. A better approach is to write the data to a temporary file as it downloads, and then read it back from that file once the download is complete.
twisted.web.client includes the downloadPage function, which works like getPage but writes the result to a file. Call downloadPage with the URL as its first argument and a filename or file object as its second. Example 3-2 shows how:
from twisted.web import client
import tempfile, os

def downloadToTempFile(url):
    """
    Given a URL, return a Deferred that fires with the name of a
    temporary file containing the downloaded page once the download
    is complete.
    """
    tmpfd, tempfilename = tempfile.mkstemp()
    os.close(tmpfd)
    return client.downloadPage(url, tempfilename).addCallback(
        returnFilename, tempfilename)

def returnFilename(result, filename):
    return filename

if __name__ == '__main__':
    import sys
    from twisted.internet import reactor

    def printFile(filename):
        for line in file(filename, 'r+b'):
            sys.stdout.write(line)
        os.unlink(filename)  # clean up the temporary file
        reactor.stop()

    def printError(failure):
        print >> sys.stderr, "Error:", failure.getErrorMessage()
        reactor.stop()

    if len(sys.argv) == 2:
        url = sys.argv[1]
        downloadToTempFile(url).addCallback(printFile).addErrback(printError)
        reactor.run()
    else:
        print "Usage: %s <URL>" % sys.argv[0]
The downloadToTempFile function returns the Deferred it gets from calling twisted.web.client.downloadPage. Before returning it, downloadToTempFile adds returnFilename as a callback, with the temporary filename as an extra argument. This means that when the download completes, the reactor will call returnFilename with the result of downloadPage as its first argument and the filename as its second.
Example 3-2 adds two more handlers, printFile and printError, to the Deferred returned by downloadToTempFile. Remember that this Deferred already has returnFilename as its first callback handler, so when the result arrives, returnFilename is called first, and its return value, the filename, is then passed to printFile.
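The extra-argument behavior can be demonstrated on its own. The following sketch uses made-up names and is not part of Example 3-2; it simply shows that any additional positional arguments given to addCallback are passed to the callback after the result:
from twisted.internet import defer

def report(result, label):
    # 'result' is the Deferred's result; 'label' is the extra argument
    print label, result

d = defer.succeed(42)
d.addCallback(report, "the result is:")   # prints "the result is: 42"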
3.2 Accessing a Password-Protected Page
Some web pages require authentication. If you're writing an HTTP client application, you'd better be prepared to handle them by asking the user for a username and password when they're needed.
3.2.1 How Do I Do That?
If an HTTP request fails with a 401 status code, the server requires authentication. In that case, retry the request with the user's username and password passed in the Authorization header, as shown in Example 3-3:
from twisted.web import client, error as weberror
from twisted.internet import reactor
import sys, getpass, base64

def printPage(data):
    print data
    reactor.stop()

def checkHTTPError(failure, url):
    failure.trap(weberror.Error)
    if failure.value.status == '401':
        print >> sys.stderr, failure.getErrorMessage()
        # ask the user for a username and password
        username = raw_input("Username: ")
        password = getpass.getpass("Password: ")
        basicAuth = base64.encodestring("%s:%s" % (username, password))
        authHeader = "Basic " + basicAuth.strip()
        # try to fetch the page again with the authentication header
        return client.getPage(url, headers={"Authorization": authHeader})
    else:
        return failure

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()
    reactor.stop()

if len(sys.argv) == 2:
    url = sys.argv[1]
    client.getPage(url).addErrback(
        checkHTTPError, url).addCallback(
        printPage).addErrback(
        printError)
    reactor.run()
else:
    print "Usage: %s <URL>" % sys.argv[0]
When you run webcat3.py with a URL as its argument, it tries to download the page. If it gets a 401 error, it asks for a username and password and then automatically tries again:
$ python webcat3.py http://example.com/protected/page
401 Authorization Required
Username: user
Password: <type password>
<html>
...
3.2.2 How Does That Work?
This example extends Example 3-1 with a smarter error handler. checkHTTPError is added as an errback on the Deferred returned by client.getPage before printPage and printError are added. This gives checkHTTPError the chance to handle errors from client.getPage before the other handlers see them.
As an errback, checkHTTPError is called with a twisted.python.failure.Failure object. A Failure wraps the exception that was raised, records the traceback at the point where it occurred, and adds several useful methods. checkHTTPError starts by calling failure.trap to verify that the exception is of type twisted.web.error.Error. If it isn't, trap re-raises the exception, exiting the current function and letting the error pass on to the next errback, printError.
Next, checkHTTPError checks the HTTP response status code. failure.value is the exception object itself; because it is a twisted.web.error.Error, it has a status attribute containing the HTTP status code of the response. If the status code isn't 401, the original error is passed along to printError by returning the Failure.
If the status code is 401, checkHTTPError goes to work. It prompts for a username and password and encodes them into an HTTP Authorization header. Then it calls client.getPage again and returns the resulting Deferred. This triggers some pretty cool behavior: the reactor waits for the result of the second request and then calls printPage or printError with that result. In effect, checkHTTPError says, "I'm handling this error by making another request; wait for that Deferred's result and process it as if it had come from the original request." This is a powerful technique that gets used again and again in Twisted programs.
Either way, the final result ends up being printed by printPage or printError. And, of course, if the initial request succeeds because no authentication is required, checkHTTPError is never called and the result goes straight to printPage.
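The recover-by-returning-a-Deferred pattern can be shown in a minimal, self-contained sketch. It is not taken from Example 3-3 and uses made-up names; it only demonstrates that when an errback returns a Deferred, the rest of the chain waits for that Deferred's result and treats it as the new result:
import sys
from twisted.internet import defer

class TransientError(Exception):
    pass

def retryOnTransientError(failure):
    failure.trap(TransientError)         # re-raise anything we don't know how to handle
    return defer.succeed("second try")   # the chain now waits for this Deferred

d = defer.fail(TransientError("first try failed"))
d.addErrback(retryOnTransientError)
d.addCallback(lambda result: sys.stdout.write("got: %s\n" % result))   # prints "got: second try"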
3.3 Uploading Files
From the user's point of view, there's no easier way to transfer a file than uploading it through a web page: choose the file with an HTML form and press the Submit button. Because it's so convenient, many sites accept files this way. But sometimes you need to upload a file when using a browser isn't practical; you might be writing a program that uploads photos, or a client for a web-based file management system. The following example shows how to write an HTTP client with Twisted that uploads a file.
3.3.1 How Do I Do That?
First, encode the form's key/value pairs, including the file being uploaded, into a multipart/form-data MIME document. Neither Python nor Twisted provides a convenient function for this, but it doesn't take much work to do it yourself. Then pass the encoded form data as the postdata argument to client.getPage or client.downloadPage, using the HTTP POST method. getPage or downloadPage will then return the HTTP response as usual. Example 3-4, validate.py, shows how to upload an HTML file to the W3C validator, save the response to a local file, and display it in the user's browser.
from twisted.web import client
import os, tempfile, webbrowser, random

def encodeForm(inputs):
    """
    Takes a dict of form inputs and returns a multipart/form-data string
    containing the utf-8 encoded data. Keys must be strings; values can be
    either strings or file objects.
    """
    getRandomChar = lambda: chr(random.choice(range(97, 123)))
    randomChars = [getRandomChar() for x in range(20)]
    boundary = "---%s---" % ''.join(randomChars)

    lines = [boundary]
    for key, val in inputs.items():
        header = 'Content-Disposition: form-data; name="%s"' % key
        if hasattr(val, 'name'):
            header += '; filename="%s"' % os.path.split(val.name)[1]
        lines.append(header)
        if hasattr(val, 'read'):
            lines.append(val.read())
        else:
            lines.append(val.encode('utf-8'))
        lines.append('')
        lines.append(boundary)

    return "\r\n".join(lines)

def showPage(pageData):
    # write the page data to a temporary file and show it in a browser
    tmpfd, tmp = tempfile.mkstemp('.html')
    os.close(tmpfd)
    file(tmp, 'w+b').write(pageData)
    webbrowser.open('file://' + tmp)
    reactor.stop()

def handleError(failure):
    print "Error:", failure.getErrorMessage()
    reactor.stop()

if __name__ == '__main__':
    import sys
    from twisted.internet import reactor
    filename = sys.argv[1]
    fileToCheck = file(filename)
    form = encodeForm({'uploaded_file': fileToCheck})
    postRequest = client.getPage(
        'http://validator.w3.org/check',
        method="POST",
        headers={'Content-Type': 'multipart/form-data; charset=utf-8',
                 'Content-Length': str(len(form))},
        postdata=form)
    postRequest.addCallback(showPage).addErrback(handleError)
    reactor.run()
Run the validate.py script with the name of an HTML file as its first argument:
$ python validate.py test.html
After a moment, the validation report should open in your browser.
3.3.2 How Does That Work?
The W3C validator page at http://validator.w3.org contains a form like the following:
<form method="post" enctype="multipart/form-data" action="check">
  <label title="Choose a Local File to Upload and Validate"
         for="uploaded_file">Local File:
    <input type="file" name="uploaded_file" size="30" /></label>
  <label title="Submit file for validation">
    <input type="submit" value="Check" /></label>
</form>
To upload a file, then, an HTTP client has to send the same data a browser would send when submitting this form. The encodeForm function takes a dictionary of form inputs, whose values may include file objects, and returns a string containing the whole submission encoded as a multipart/form-data MIME document. When validate.py runs, it opens the file named in its first argument and passes it to encodeForm as the value of 'uploaded_file'. The result is the complete body of the POST request.
validate.py then uses client.getPage to POST the form data, passing Content-Type and Content-Length headers along with it. The showPage callback handler takes the returned data, writes it to a temporary file, and then uses the Python webbrowser module to open that file in the user's default browser.
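As a small usage sketch of encodeForm (the 'comment' field is made up purely for illustration; the validator may not actually accept it), the function can mix plain string values and file objects in one submission:
# 'comment' is a hypothetical field name used only to show a string value;
# 'uploaded_file' matches the field name the validator form expects.
form = encodeForm({
    'comment': u'please validate this page',   # encoded as an ordinary form field
    'uploaded_file': file('test.html'),        # encoded as a file upload part
})
print "%d bytes of multipart/form-data" % len(form)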
3.4 Checking Whether a Page Has Changed
One popular use of HTTP clients is RSS aggregation: automatically downloading RSS (or Atom) feeds to get the latest updates from blogs and news sites. An aggregator checks its feeds for new content on a regular schedule, typically once an hour or so. This can waste a lot of bandwidth, because feed content changes infrequently and the client ends up downloading the same data over and over.
To cut down on wasted network resources, it's considered good practice to use conditional HTTP GET requests when fetching RSS feeds (or any other page you request repeatedly). By adding conditions in the HTTP request headers, the client asks the server to return the page only if it has changed since the client last saw it, for example, only if it has been modified since the last download.
3.4.1 How Do I Do That?
Keep track of the response headers the first time you download the page. Look for the ETag header, which identifies the current revision of the page, and the Last-Modified header, which gives the time the page was last modified. The next time you request the page, send the ETag value in an If-None-Match header and the Last-Modified value in an If-Modified-Since header. If the server supports conditional GET requests and the page hasn't changed, it will return a 304 Not Modified response.
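As a small sketch with made-up header values, the conditional headers sent with the second request would look something like this:
# Made-up values: conditional headers built from the ETag and Last-Modified
# headers returned by the first response.
nextRequestHeaders = {
    'If-None-Match': '"686897696a7c876b7e"',               # the ETag value
    'If-Modified-Since': 'Sat, 29 Oct 2005 19:43:31 GMT',  # the Last-Modified value
}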
The getPage and downloadPage functions are convenient, but they don't give you the control over the request and response needed to make conditional requests. Instead, you can use the slightly lower-level HTTPClientFactory interface. Example 3-5, updatecheck.py, demonstrates using HTTPClientFactory to test whether a page has been updated.
from twisted.web import client

class HTTPStatusChecker(client.HTTPClientFactory):

    def __init__(self, url, headers=None):
        client.HTTPClientFactory.__init__(self, url, headers=headers)
        self.status = None
        self.deferred.addCallback(
            lambda data: (data, self.status, self.response_headers))

    def noPage(self, reason):  # called for non-success responses
        if self.status == '304':  # page not modified, treat as success
            client.HTTPClientFactory.page(self, '')
        else:
            client.HTTPClientFactory.noPage(self, reason)

def checkStatus(url, contextFactory=None, *args, **kwargs):
    scheme, host, port, path = client._parse(url)
    factory = HTTPStatusChecker(url, *args, **kwargs)
    if scheme == 'https':
        from twisted.internet import ssl
        if contextFactory is None:
            contextFactory = ssl.ClientContextFactory()
        reactor.connectSSL(host, port, factory, contextFactory)
    else:
        reactor.connectTCP(host, port, factory)
    return factory.deferred

def handleFirstResult(result, url):
    data, status, headers = result
    nextRequestHeaders = {}
    eTag = headers.get('etag')
    if eTag:
        nextRequestHeaders['If-None-Match'] = eTag[0]
    modified = headers.get('last-modified')
    if modified:
        nextRequestHeaders['If-Modified-Since'] = modified[0]
    return checkStatus(url, headers=nextRequestHeaders).addCallback(
        handleSecondResult)

def handleSecondResult(result):
    data, status, headers = result
    print 'Second request returned status %s:' % status,
    if status == '200':
        print 'Page changed (or server does not support conditional requests).'
    elif status == '304':
        print 'Page is unchanged.'
    else:
        print 'Unexpected response.'
    reactor.stop()

def handleError(failure):
    print 'Error:', failure.getErrorMessage()
    reactor.stop()

if __name__ == '__main__':
    import sys
    from twisted.internet import reactor
    url = sys.argv[1]
    checkStatus(url).addCallback(
        handleFirstResult, url).addErrback(
        handleError)
    reactor.run()
Run the updatecheck.py script with a URL as its first argument. It downloads the page once, and then tries to download it again using a conditional GET. If the second request returns a 304 status, the page has not been updated on the server. Conditional GET typically works for static files, such as RSS feeds, but not for dynamically generated pages, such as a site's home page.
$ python updatecheck.py http://slashdot.org/slashdot.rss
Second request returned status 304: Page is unchanged.
$ python updatecheck.py http://slashdot.org/
Second request returned status 200: Page changed
(or server does not support conditional requests).
3.4.2 How Does That Work?
The HTTPStatusChecker class, a subclass of client.HTTPClientFactory, does two important things. First, during initialization it uses a lambda to add a callback to self.deferred. This anonymous function intercepts the result of self.deferred before it is passed to any other handlers, replacing the result (the downloaded data) with a tuple containing more information: the data, the HTTP status code, and self.response_headers, a dictionary of the response headers.
HTTPStatusChecker also overrides the noPage method, which HTTPClientFactory calls when the server returns an unsuccessful response code. If the response code is 304 (Not Modified), the overridden noPage calls HTTPClientFactory.page instead, indicating a successful response; otherwise, it falls back to the inherited HTTPClientFactory.noPage behavior. The effect is that a 304 response is delivered as a result rather than as an error.
The checkStatus function takes a URL and parses it with twisted.web.client._parse, which breaks the URL into its component parts. From these it gets the host and port and decides whether to connect using TCP (for http) or SSL (for https). checkStatus then creates an HTTPStatusChecker factory object and opens the connection. This code does essentially the same thing as twisted.web.client.getPage, except that it uses the customized HTTPStatusChecker factory instead of a plain HTTPClientFactory.
When updatecheck.py runs, it calls checkStatus and sets handleFirstResult as the callback handler. handleFirstResult, in turn, makes the second request using the If-None-Match and/or If-Modified-Since conditional headers and sets handleSecondResult as that request's callback handler. handleSecondResult reports whether the server returned a 304 and then stops the reactor.
Note that handleFirstResult actually returns the Deferred produced by the second call to checkStatus (with handleSecondResult already attached). This allows handleError, the errback added to the Deferred from the first checkStatus call, to catch errors from the second request as well.
3.5 Monitoring Download Progress
None of the previous examples give any feedback while a page is downloading. The Deferred fires with a result when the download is complete, but sometimes you want to keep an eye on the download's progress as it happens.
3.5.1 How Do I Do That?
Once again, twisted.web.client doesn't provide a utility function with quite enough control, this time over the progress of the download. Instead, define a subclass of client.HTTPDownloader, the factory class used to download a web page to a file. By overriding a couple of its methods, you can keep track of how much data has been downloaded. Example 3-6, the webdownload.py script, shows how:
from twisted.web import client

class HTTPProgressDownloader(client.HTTPDownloader):

    def gotHeaders(self, headers):
        if self.status == '200':
            if headers.has_key('content-length'):
                self.totalLength = int(headers['content-length'][0])
            else:
                self.totalLength = 0
            self.currentLength = 0.0
            print ''
        return client.HTTPDownloader.gotHeaders(self, headers)

    def pagePart(self, data):
        if self.status == '200':
            self.currentLength += len(data)
            if self.totalLength:
                percent = "%i%%" % (
                    (self.currentLength / self.totalLength) * 100)
            else:
                percent = "%dK" % (self.currentLength / 1000)
            print "\033[1FProgress: " + percent
        return client.HTTPDownloader.pagePart(self, data)

def downloadWithProgress(url, file, contextFactory=None, *args, **kwargs):
    scheme, host, port, path = client._parse(url)
    factory = HTTPProgressDownloader(url, file, *args, **kwargs)
    if scheme == 'https':
        from twisted.internet import ssl
        if contextFactory is None:
            contextFactory = ssl.ClientContextFactory()
        reactor.connectSSL(host, port, factory, contextFactory)
    else:
        reactor.connectTCP(host, port, factory)
    return factory.deferred

if __name__ == '__main__':
    import sys
    from twisted.internet import reactor

    def downloadComplete(result):
        print "Download Complete."
        reactor.stop()

    def downloadError(failure):
        print "Error:", failure.getErrorMessage()
        reactor.stop()

    url, outputFile = sys.argv[1:]
    downloadWithProgress(url, outputFile).addCallback(
        downloadComplete).addErrback(
        downloadError)
    reactor.run()
Run the webdownload.py script with two arguments, a URL and the name of a file in which to save it. As the page downloads, the script displays a continuously updated progress indicator:
$ python webdownload.py http://www.oreilly.com/ oreilly.html
Progress: 100%  <- updates during the download
Download Complete.
If the web server doesn't return a Content-Length header, the download progress can't be calculated as a percentage. In that case, webdownload.py prints the number of kilobytes downloaded so far:
$ python webdownload.py http://www.slashdot.org/ slashdot.html
Progress: 60K  <- updates during the download
Download Complete.
3.5.2 How Does That Work?
HTTPProgressDownloader is a subclass of client.HTTPDownloader. It overrides gotHeaders so that it can use the Content-Length header to work out the total size of the download, and it overrides pagePart, which is called each time a chunk of data is received.
HTTPProgressDownloader prints a progress report every time data arrives. The string "\033[1F" is a terminal control sequence that moves the cursor back up to the beginning of the previous line, so each report overwrites the last one. The effect is a progress indicator that updates in place.
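The effect of the control sequence can be seen with a few lines of plain Python, independent of Twisted (a standalone sketch, not part of Example 3-6):
import time

print ''                                   # reserve a line to overwrite
for pct in range(0, 101, 20):
    print "\033[1FProgress: %i%%" % pct    # \033[1F moves the cursor up one line first
    time.sleep(0.2)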
The downloadWithProgress function contains the same kind of code as Example 3-5 for parsing the URL, creating an HTTPProgressDownloader factory object, and opening the connection. downloadComplete and downloadError are simple handlers that print a message and stop the reactor.