Introduction to the Python Module

Last Update:2013-12-17 Source: Internet

Author: User

Python has a large number of excellent libraries that can be called directly to meet different requirements efficiently. Here we list some common libraries and Python modules; a Python program can also call extension functions written in C++.

This article describes in detail how to capture and parse web pages using Python. First, we will introduce a Python module that can help simplify opening HTML documents locally and on the Web. Later, we will discuss how to use the Python module to quickly parse data in HTML files to process specific content, such as links, images and cookies.

Finally, we will provide an example of regularizing the tag format of an HTML file. Through this example, you will find that processing HTML file content with Python is very simple.

With the urlparse module in Python, we can easily break a URL down into its components and then reassemble those components into a URL. This is very convenient when processing HTML documents.

 
 
  
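  import urlparse  # Python 2; in Python 3 these functions live in urllib.parse

  # Break a URL into its six components
  parsedTuple = urlparse.urlparse(
      "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")

  # Take the scheme, network location, and path from the parsed result
  URLscheme, URLlocation, URLpath = parsedTuple[0:3]

  # Reassemble a URL from those components, leaving the rest empty
  unparsedURL = urlparse.urlunparse((URLscheme,
      URLlocation, URLpath, '', '', ''))

  # Resolve a path against the reassembled URL
  newURL = urlparse.urljoin(unparsedURL,
      "/module-urllib2/request-objects.html")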

The urlparse(urlstring[, default_scheme[, allow_fragments]]) function breaks a URL into its components. It takes the URL from urlstring and returns the tuple (scheme, netloc, path, parameters, query, fragment). The returned tuple is useful for determining the network protocol (HTTP, FTP, and so on), the server address, the file path, and the other parts of the URL.
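As a minimal sketch of the tuple described above, the following parses the same Google search URL used earlier and unpacks the result. The try/except import is an assumption added so the sketch runs on both Python 3 (where these functions live in urllib.parse) and the Python 2 urlparse module this article uses.

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse      # Python 2 module used in this article

url = "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search"
parsed = urlparse(url)

# The result unpacks like a plain 6-tuple.
scheme, netloc, path, params, query, fragment = parsed

print(scheme)   # http
print(netloc)   # www.google.com
print(path)     # /search
print(query)    # hl=en&q=urlparse&btnG=Google+Search
```

The params and fragment components are empty strings for this URL, because it contains neither a ";parameters" part nor a "#fragment" part.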

The urlunparse(tuple) function assembles URL components into a URL. Given the tuple (scheme, netloc, path, parameters, query, fragment), it returns a correctly formatted URL that can be passed to other HTML parsing modules in Python.
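A short sketch of reassembly, under the same import assumption (urllib.parse on Python 3, urlparse on Python 2). It rebuilds the full URL from its parsed components and then builds a trimmed URL by leaving the last three components empty.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3 location
except ImportError:
    from urlparse import urlparse, urlunparse      # Python 2 module

original = "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search"
parts = urlparse(original)

# Passing the parsed components straight back rebuilds the URL.
rebuilt = urlunparse(parts)

# Keep scheme, netloc, and path; blank out parameters, query, and fragment.
trimmed = urlunparse((parts[0], parts[1], parts[2], "", "", ""))

print(rebuilt == original)  # True
print(trimmed)              # http://www.google.com/search
```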

The urljoin(base, url[, allow_fragments]) function joins URLs. It takes the first parameter as the base address and combines it with the relative address in the second parameter to form an absolute URL. urljoin is particularly useful when you process several files at the same location by appending new file names to the base address.

Note that if the base address does not end with a slash (/) character, the rightmost part of the base address is replaced by the relative path. For example, a base address whose last segment is a file or directory name without a trailing slash loses that segment when a relative file name is joined to it.
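The trailing-slash behavior can be checked with a small sketch. The host example.com and the names docs and page.html are hypothetical, chosen only for illustration; the import fallback is the same assumption as before (urllib.parse on Python 3, urlparse on Python 2).

```python
try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin      # Python 2 module

# Hypothetical base addresses, differing only in the trailing slash.
base_without_slash = "http://example.com/docs"
base_with_slash = "http://example.com/docs/"

# Without the slash, the last segment ("docs") is replaced.
replaced = urljoin(base_without_slash, "page.html")

# With the slash, the file name is appended under "docs/".
appended = urljoin(base_with_slash, "page.html")

# A relative address starting with "/" replaces the whole path.
from_root = urljoin(base_with_slash, "/module-urllib2/request-objects.html")

print(replaced)   # http://example.com/page.html
print(appended)   # http://example.com/docs/page.html
print(from_root)  # http://example.com/module-urllib2/request-objects.html
```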

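The following is sample output from a script that uses urlopen() to read http://www.python.org and a local index.html file and prints the header information for each: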

 
 
  
  Date: Fri, 26 Jun 2009 10:22:11 GMT
  Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
  Last-Modified: Thu, 25 Jun 2009 09:44:54 GMT
  ETag: "105800d-46e7-46d29136f7180"
  Accept-Ranges: bytes
  Content-Length: 18151
  Connection: close
  Content-Type: text/html
  Read 18151 bytes of data from the http://www.python.org.

  Content-Type: text/html
  Content-Length: 865
  Last-modified: Fri, 26 Jun 2009 10:16:10 GMT
  865 bytes of data is read from index.html.

To open an HTML document with the urlopen(url[, data]) function in the urllib module, you must provide the URL of the document, including the file name. urlopen can open not only files on a remote web server but also local files, and it returns a file-like object from which we can read the HTML document's data.

Once the HTML document is open, we can read it like a regular file with the read([nbytes]), readline(), and readlines() functions. To read the entire HTML document, call read(), which returns the file content as a string.
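The reading functions can be tried without network access by opening a local file through a file:// URL. This is a sketch for Python 3, where urlopen lives in urllib.request and the data comes back as bytes rather than a string; in the Python 2 urllib module used in this article the import would be `from urllib import urlopen`. The temporary index.html file and its contents are stand-ins created only for this example.

```python
import pathlib
import tempfile
from urllib.request import urlopen  # Python 2: from urllib import urlopen

# Create a small stand-in HTML file so the sketch needs no network access.
content = b"<html>\n<body>Hello</body>\n</html>\n"
tmp_dir = pathlib.Path(tempfile.mkdtemp())
html_file = tmp_dir / "index.html"
html_file.write_bytes(content)
url = html_file.as_uri()  # a file:// URL pointing at the local file

# read() with no argument returns the whole document.
with urlopen(url) as page:
    whole = page.read()

# readline() returns one line; readlines() returns the remaining lines.
with urlopen(url) as page:
    first_line = page.readline()
    remaining = page.readlines()

print(len(whole), "bytes read")
print(first_line)
print(remaining)
```

Replacing the file:// URL with an http:// URL reads a remote page in the same way.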

After opening an address, you can call the geturl() method of the returned object to obtain the real URL of the page that was fetched. This is useful because urlopen (or the opener object used) may follow a redirect, so the URL of the page obtained may differ from the URL requested.

Another commonly used function is the info() function of the file-like object returned by urlopen. It returns metadata about the URL location, such as the content length and content type. The header lines in the sample output shown earlier (Content-Type, Content-Length, Last-Modified, and so on) are metadata of this kind.
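A minimal sketch of geturl() and info(), again written for Python 3 (urllib.request) and run against a temporary local index.html so that no network is needed. The file and its contents are assumptions made for the example; for a local file, urlopen reports only a few headers such as the content type, content length, and last-modified time.

```python
import pathlib
import tempfile
from urllib.request import urlopen  # Python 2: from urllib import urlopen

# Stand-in local document, created only for this example.
tmp_dir = pathlib.Path(tempfile.mkdtemp())
html_file = tmp_dir / "index.html"
html_file.write_bytes(b"<html><body>Hello</body></html>")

with urlopen(html_file.as_uri()) as page:
    final_url = page.geturl()  # the URL actually retrieved
    headers = page.info()      # metadata for that URL
    body = page.read()

# Print every header, much like the sample output earlier in the article.
for name, value in headers.items():
    print("%s: %s" % (name, value))

# Header lookup is case-insensitive; values are strings.
content_length = int(headers["Content-Length"])
print("Read %d bytes of data from %s" % (len(body), final_url))
```

For an http:// URL that redirects, final_url would hold the address the request ended up at rather than the one passed to urlopen.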

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
