Using urlparse and urllib in Python to Crawl and Parse Web Pages (I)


For applications such as search engines, file indexing, document conversion, data retrieval, site backup, or migration, parsing Web pages (that is, HTML files) is a frequent task. In fact, the various modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. This article explains in detail how to use Python to crawl and parse Web pages. First, we introduce a Python module that helps simplify opening HTML documents, whether they reside locally or on the Web; then we discuss how to use Python modules to quickly parse the data in an HTML file and handle specific content such as links, images, and cookies. Finally, we give an example of formatting tags in a structured HTML file, and you will find that processing the contents of an HTML file with Python is a very simple thing.

First, parse the URL
    import urlparse

    urlscheme = "http"
    urllocation = "www.python.org"
    urlpath = "lib/module-urlparse.html"

    # Split a URL into its six components
    parsedtuple = urlparse.urlparse(
        "http://www.google.com/search?hl=en&q=urlparse&btng=google+search")
    # Reassemble components into a URL
    unparsedurl = urlparse.urlunparse(
        (urlscheme, urllocation, urlpath, '', '', ''))
    # Join a base URL with a relative path
    newurl = urlparse.urljoin(unparsedurl,
                              "module-urllib2/request-objects.html")

The function urlparse(urlstring[, default_scheme[, allow_fragments]]) breaks a URL into its parts: it takes the URL from urlstring and returns the 6-tuple (scheme, netloc, path, parameters, query, fragment). The returned tuple is useful, for example, to determine the network protocol (HTTP, FTP, and so on), the server address, the file path, and so on.
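
As a quick illustration, here is a minimal sketch (the FTP address is hypothetical) showing how the parts of the returned tuple identify the protocol, server, and path; in Python 2.5 and later they are also available as named attributes:

    import urlparse

    parts = urlparse.urlparse("ftp://ftp.example.com/pub/file.txt;type=a?x=1#top")
    print parts.scheme    # 'ftp'             -- the network protocol
    print parts.netloc    # 'ftp.example.com' -- the server address
    print parts.path      # '/pub/file.txt'   -- the file path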

The function urlunparse(tuple) assembles URL components back into a URL: given a 6-tuple (scheme, netloc, path, parameters, query, fragment), it rebuilds a correctly formatted URL that can be passed on to other Python HTML-processing modules.
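
For example (a minimal sketch), parsing a URL and passing the resulting tuple straight back to urlunparse reproduces an equivalent URL:

    import urlparse

    url = "http://www.python.org/lib/module-urlparse.html"
    parts = urlparse.urlparse(url)
    # Re-assembling the parsed 6-tuple yields an equivalent URL.
    print urlparse.urlunparse(parts)
    # -> http://www.python.org/lib/module-urlparse.html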

The function urljoin(base, url[, allow_fragments]) concatenates URLs: it takes the first parameter as the base address and combines it with the relative address in the second parameter to form an absolute URL. urljoin is particularly handy when processing several files at the same location, since a new file name can simply be appended to the URL base address. Note that if the base address does not end with a slash (/), the rightmost component of the base URL is replaced by the relative path. For example, if the base URL is http://www.testpage.com/pub and the relative URL is test.html, the two merge into http://www.testpage.com/test.html rather than http://www.testpage.com/pub/test.html. If you want to keep the final directory in the path, make sure the base URL ends with a slash (/).
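
The following sketch demonstrates exactly this trailing-slash behaviour, using the example addresses from above:

    import urlparse

    # Without a trailing slash, 'pub' is treated as a file name and replaced.
    print urlparse.urljoin("http://www.testpage.com/pub", "test.html")
    # -> http://www.testpage.com/test.html

    # With a trailing slash, 'pub' is kept as a directory.
    print urlparse.urljoin("http://www.testpage.com/pub/", "test.html")
    # -> http://www.testpage.com/pub/test.html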

Here is a more detailed example of how the above functions are used:

    import urlparse

    urlscheme = "http"
    urllocation = "www.python.org"
    urlpath = "lib/module-urlparse.html"
    modlist = ("urllib", "urllib2", "httplib", "cgilib")

    # Parse an address into its components
    print "Parse result of the URL from the address bar when searching Google for python"
    parsedtuple = urlparse.urlparse(
        "http://www.google.com/search?hl=en&q=python&btng=google+search")
    print parsedtuple

    # Un-parse the components back into a URL
    print "\nUn-parsed URL of the Python documentation page"
    unparsedurl = urlparse.urlunparse(
        (urlscheme, urllocation, urlpath, '', '', ''))
    print "\t" + unparsedurl

    # Build new URLs from the base path and new file names
    print "\nURLs of more Python documentation pages, built by joining file names"
    for mod in modlist:
        newurl = urlparse.urljoin(unparsedurl, "module-%s.html" % (mod))
        print "\t" + newurl

    # Build a new URL by adding a sub-path
    print "\nURL of a Python documentation page, built by joining a sub-path"
    newurl = urlparse.urljoin(unparsedurl,
                              "module-urllib2/request-objects.html")
    print "\t" + newurl

The execution results of the above code are as follows:

    Parse result of the URL from the address bar when searching Google for python
    ('http', 'www.google.com', '/search', '', 'hl=en&q=python&btng=google+search', '')

    Un-parsed URL of the Python documentation page
            http://www.python.org/lib/module-urlparse.html

    URLs of more Python documentation pages, built by joining file names
            http://www.python.org/lib/module-urllib.html
            http://www.python.org/lib/module-urllib2.html
            http://www.python.org/lib/module-httplib.html
            http://www.python.org/lib/module-cgilib.html

    URL of a Python documentation page, built by joining a sub-path
            http://www.python.org/lib/module-urllib2/request-objects.html

Second, open the HTML document

The above describes how to parse a page's URL; now we turn to opening a Web page by its URL. Python's two built-in modules, urllib and urllib2, provide the ability to open and retrieve data from a URL, and that of course includes HTML documents.
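
This article's examples use urllib; as a rough preview (a minimal sketch, not covered further here), urllib2 opens a URL in much the same way but also accepts Request objects and custom handlers:

    import urllib2

    # urllib2.urlopen also returns a file-like object.
    u = urllib2.urlopen("http://www.python.org")
    print u.geturl()
    print "%d bytes read" % len(u.read())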

    import urllib

    webURL = "http://www.python.org"
    u = urllib.urlopen(webURL)
    buffer = u.read()
    print u.info()
    print "%d bytes of data were read from %s.\n" % (len(buffer), u.geturl())

To open an HTML document with the urlopen(url[, data]) function of the urllib module, you must provide the URL of the document, including the file name. urlopen can open not only a file on a remote Web server but also a local file, and it returns a file-like object from which we can read the data of the HTML document.
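
The optional data argument turns the request into an HTTP POST, for example when submitting a form (a minimal sketch with a hypothetical form URL):

    import urllib

    # urlencode turns a dict into 'application/x-www-form-urlencoded' data;
    # passing it as the second argument makes urlopen issue a POST.
    params = urllib.urlencode({"q": "python"})
    u = urllib.urlopen("http://www.example.com/search", params)
    print u.read()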

Once the HTML document is open, we can read it like a regular file, using the read([nbytes]), readline(), and readlines() functions. To read the contents of the entire HTML document at once, use read(), which returns the file contents as a single string.
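
Instead of read(), the document can also be processed line by line or in fixed-size chunks, which avoids holding the whole page in memory (a minimal sketch):

    import urllib

    u = urllib.urlopen("http://www.python.org")
    # Read the first line, then consume the rest in 4096-byte chunks.
    firstline = u.readline()
    nbytes = 0
    while True:
        chunk = u.read(4096)
        if not chunk:
            break
        nbytes += len(chunk)
    print "first line: %r (%d further bytes)" % (firstline, nbytes)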

After opening an address, you can use the geturl() function to obtain the true URL of the page that was actually fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page you got may not be the same as the URL you requested.
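
Here is a minimal sketch (whether a redirect happens depends on the server) that compares the requested URL with the one actually fetched:

    import urllib

    requested = "http://python.org"  # may redirect, e.g. to www.python.org
    u = urllib.urlopen(requested)
    # geturl() reports where we ended up after any redirects.
    if u.geturl() != requested:
        print "redirected from %s to %s" % (requested, u.geturl())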

Another commonly used function is info() on the file-like object returned by urlopen; it returns the metadata of the URL location, such as the content length, the content type, and so on. These functions are demonstrated in the more detailed example below.
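
As a quick preview (a sketch; which headers are present depends on the server), the object returned by info() behaves like a dictionary of headers:

    import urllib

    u = urllib.urlopen("http://www.python.org")
    headers = u.info()  # a mimetools.Message instance
    # Individual headers can be read like dictionary entries.
    print headers["Content-Type"]
    print headers.getheader("Content-Length", "unknown")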

    import urllib

    webURL = "http://www.python.org"
    localURL = "index.html"

    # Open a remote page via its URL
    u = urllib.urlopen(webURL)
    buffer = u.read()
    print u.info()
    print "%d bytes of data were read from %s.\n" % (len(buffer), u.geturl())

    # Open a local page via its URL
    u = urllib.urlopen(localURL)
    buffer = u.read()
    print u.info()
    print "%d bytes of data were read from %s.\n" % (len(buffer), u.geturl())

The result of running the above code is as follows:

    Date: Fri, June 10:22:11 GMT
    Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
    Last-Modified: Thu, 25 June 09:44:54 GMT
    ETag: "105800d-46e7-46d29136f7180"
    Accept-Ranges: bytes
    Content-Type: text/html

    18151 bytes of data were read from http://www.python.org.

    Content-Type: text/html
    Content-Length: 865
    Last-Modified: Fri, June 10:16:10 GMT

    865 bytes of data were read from index.html.

Third, summary

For applications such as search engines, file indexing, document conversion, data retrieval, site backup, or migration, parsing Web pages (that is, HTML files) is a frequent task. In fact, the various modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. In this article, we described a Python module that helps simplify opening HTML documents, whether they reside locally or on the Web. In the next part, we will discuss how to use Python modules to quickly parse the data in an HTML file and handle specific content such as links, images, and cookies.

(Reposted from: http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html)
