Web applications also follow the client-server architecture.

The browser is a basic Web client. It implements two essential functions: downloading files from the Web server, and rendering them.

Modules such as urllib and urllib2 (the latter can also open pages that require a login) provide browser-like functionality and are sufficient for simple Web clients.

There are also plenty of Web clients that do more than just download files; they perform other, more complex tasks. A typical example is the crawler.

Python also has framework modules for implementing crawlers, such as Scrapy.
Creating a simple Web client with Python

Keep in mind that the browser is just one kind of Web client, and its functionality is limited. Any application that makes requests to a Web server is a Web client, for example curl or Python's urllib. Why urllib rather than httplib? Read on.
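As a quick illustration, a minimal Web client in Python 2 might look like the sketch below; the target URL is only a placeholder:

```python
# -*- coding: utf-8 -*-
# Minimal Web client sketch (Python 2): download a page and print part of it.
# The URL below is just an example placeholder.
import urllib

f = urllib.urlopen('http://www.example.com/')   # open the URL like a file
print f.info()                                  # response headers
print f.read(200)                               # first 200 bytes of the body
f.close()
```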
What is a URL? Its composition matters

A URL is used to locate a document on the Web, or to invoke a CGI program that generates a document for the client. CGI-generated documents are comparable to what Web frameworks produce. A Python Web client is, at its core, doing file transfer: the most direct way is to use a URL to locate and fetch the file, and most clients rely on exactly that. So you should first learn how a URL is composed: http://zh.wikipedia.org/zh/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6
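A quick way to see the parts of a URL (scheme, host, path, query, and so on) is to parse one with urlparse; a small sketch, using a made-up example URL:

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2): inspect URL components with urlparse.
from urlparse import urlparse, urlunparse

u = urlparse('http://www.example.com:80/docs/index.html;type=a?x=1&y=2#part2')
print u.scheme    # 'http'
print u.netloc    # 'www.example.com:80'
print u.path      # '/docs/index.html'
print u.params    # 'type=a'
print u.query     # 'x=1&y=2'
print u.fragment  # 'part2'

# urlunparse() reassembles the six components back into a full URL
print urlunparse(u)
```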
Python's URL modules: urllib and urlparse

Python provides two modules that handle URLs with different functionality and scope. One is urlparse, the other is urllib. urlparse is used for parsing and composing URLs; working with it is also a good way to learn how URLs are put together (see its documentation for usage details). urllib is a higher-level module and provides everything you need unless you plan to write a lower-level network client. It offers a high-level Web exchange library that supports the HTTP, FTP, and Gopher protocols, as well as access to local files. Its specialty is downloading data over these protocols (from the Internet, a LAN, or the local host). Using urllib lets you avoid httplib, ftplib, and gopherlib, unless you need their lower-level functionality. urllib's main job is downloading files from a URL; to understand the module, start with the following functions:

- urllib.urlopen()
- urllib.urlretrieve()
- urllib.quote() and urllib.quote_plus()
- urllib.unquote() and urllib.unquote_plus()
- urllib.urlencode()
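The short sketch below exercises several of these functions in Python 2; the URLs, file name, and values are placeholders chosen only for illustration:

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2): the main urllib helpers listed above.
import urllib

# urlopen(): open a URL and read it like a file
f = urllib.urlopen('http://www.example.com/')
data = f.read()
f.close()

# urlretrieve(): download a URL straight to a local file
filename, headers = urllib.urlretrieve('http://www.example.com/', 'index.htm')

# quote()/quote_plus(): percent-encode unsafe characters for use in a URL
print urllib.quote('/some path/file name.html')       # spaces become %20
print urllib.quote_plus('some path/file name.html')   # spaces become +

# unquote()/unquote_plus(): the reverse operations
print urllib.unquote('%2Fsome%20path')
print urllib.unquote_plus('file+name.html')

# urlencode(): turn a dict into a query string ready to append to a URL
print urllib.urlencode({'q': 'python web client', 'page': 1})
```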
urllib2

If you plan to access more complex URLs or handle more complex situations such as basic or digest authentication, redirects, cookies, and so on, the urllib2 module is recommended. It is especially useful when you need to log in before fetching data.
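As a rough sketch of what urllib2 makes easy, the example below builds a Request with custom headers and an opener that keeps cookies across requests; the URL, header values, and form fields are hypothetical placeholders:

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2): urllib2 with a cookie-aware opener and a POST "login".
# All URLs, headers, and form fields here are placeholders.
import urllib
import urllib2
import cookielib

# An opener that stores cookies between requests (handy for login sessions)
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# A Request object lets us attach headers and POST data
form = urllib.urlencode({'username': 'alice', 'password': 'secret'})
req = urllib2.Request('http://www.example.com/login',
                      data=form,
                      headers={'User-Agent': 'my-client/0.1'})

resp = opener.open(req)        # POST the login form; cookies are remembered
print resp.getcode(), resp.geturl()

# Later requests through the same opener send the stored cookies
page = opener.open('http://www.example.com/private').read()
```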
Advanced Web clients

A browser is really just a simple Web client: a basic Web client downloads files from a server, and urllib, urllib2, and the modules described above implement similar functionality. An advanced Web client does more than simply download. One example is the Web crawler (also known as a spider or robot). These programs explore and download pages from the Internet for various purposes, including:
- Indexing for large search engines such as Google and Yahoo!
- Offline browsing: downloading documents locally, rewriting the hyperlinks, and creating a mirror that can be browsed locally. (A common case is downloading an entire set of online help documents.)
- Downloading and storing pages for historical or archival purposes
- Caching Web pages, which saves the time of downloading the site again.
Here's a crawler implementation.
```python
#!/usr/bin/env python
# crawl.py - a simple Web crawler (Python 2)

from sys import argv
from os import makedirs, unlink, sep
from os.path import isdir, exists, dirname, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO


class Retriever(object):                # download Web pages

    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url, deffile='index.htm'):
        parsedurl = urlparse(url, 'http:', 0)   # parse path
        path = parsedurl[1] + parsedurl[2]
        ext = splitext(path)
        if ext[1] == '':                        # no file name, use default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)                    # local directory
        if sep != '/':                          # os-indep. path separator
            ldir = replace(ldir, '/', sep)
        if not isdir(ldir):                     # create archive dir if nec.
            if exists(ldir): unlink(ldir)
            makedirs(ldir)
        return path

    def download(self):                         # download Web page
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        return retval

    def parseAndGetLinks(self):                 # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter(
            DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist


class Crawler(object):                  # manage entire crawling process

    count = 0                           # static downloaded page counter

    def __init__(self, url):
        self.q = [url]                  # queue of links to download
        self.seen = []                  # links already processed
        self.dom = urlparse(url)[1]     # stay within this domain

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0] == '*':            # error situation, do not parse
            print retval, '... skipping parse'
            return
        Crawler.count = Crawler.count + 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)

        links = r.parseAndGetLinks()    # get and process links
        for eachLink in links:
            if eachLink[:4] != 'http' and find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)
            print '* ', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):                       # process links in queue
        while self.q:
            url = self.q.pop()
            self.getPage(url)


def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')
        except (KeyboardInterrupt, EOFError):
            url = ''
    if not url: return
    robot = Crawler(url)
    robot.go()


if __name__ == '__main__':
    main()
```
In fact, there are also other crawler libraries; they are not covered in detail here.
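For instance, with Scrapy (mentioned earlier) the same kind of link-following crawler can be expressed declaratively; a rough sketch, with a placeholder spider name and start URL:

```python
# Sketch: a minimal Scrapy spider (the name and start URL are placeholders).
# Run with:  scrapy runspider example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # save the page title and URL as an item
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}
        # follow links on the page and parse them the same way
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
```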