Python has a large number of excellent libraries that can be called directly to meet different requirements efficiently. Here we cover some commonly used libraries and modules; a Python program can also call extension functions written in C++.
This article describes in detail how to fetch and parse web pages using Python. First, we introduce a Python module that simplifies opening HTML documents, both locally and on the Web. Then we discuss how to use Python modules to quickly parse the data in HTML files and process specific content such as links, images, and cookies.
Finally, we will provide an example that regularizes the format of the tags in an HTML file. Through this example, you will find that using Python to process the content of HTML files is very simple.
With the urlparse module in Python, we can easily break a URL down into its components and then reassemble those components into a URL. This is very convenient when processing HTML documents.
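- import urlparse  # Python 2 module; in Python 3 these functions are in urllib.parse
- # Split the URL into (scheme, netloc, path, parameters, query, fragment)
- parsedTuple = urlparse.urlparse("http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
- URLscheme, URLlocation, URLpath = parsedTuple[0], parsedTuple[1], parsedTuple[2]
- # Rebuild a URL from the scheme, location and path only
- unparsedURL = urlparse.urlunparse((URLscheme, URLlocation, URLpath, '', '', ''))
- # Resolve a path against the rebuilt URL
- newURL = urlparse.urljoin(unparsedURL, "/module-urllib2/request-objects.html")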
The urlparse(urlstring[, default_scheme[, allow_fragments]]) function breaks a URL into its components. It takes the URL from urlstring and returns the tuple (scheme, netloc, path, parameters, query, fragment). The returned tuple is very useful for determining, for example, the network protocol (HTTP, FTP, etc.), the server address, and the file path.
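As a minimal sketch of the splitting described above, the snippet below parses the Google search URL used in this article. It assumes Python 3, where the functions of the Python 2 urlparse module are found in urllib.parse; the variable names are only illustrative.

```python
# Python 3 sketch: urllib.parse is assumed here in place of the
# Python 2 urlparse module used in the article.
from urllib.parse import urlparse

url = "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search"
parsed = urlparse(url)

# The result unpacks like a 6-tuple and also offers named attributes.
scheme, netloc, path, params, query, fragment = parsed

print(scheme)   # http
print(netloc)   # www.google.com
print(path)     # /search
print(query)    # hl=en&q=urlparse&btnG=Google+Search
```

The parameters and fragment components are empty strings for this URL because it contains neither.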
The urlunparse(tuple) function assembles URL components into a URL. Given the tuple (scheme, netloc, path, parameters, query, fragment), it rebuilds a correctly formatted URL that can be passed to the other HTML parsing modules in Python.
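The reassembly step can be sketched as follows, again assuming Python 3 and urllib.parse rather than the article's Python 2 module. The names base_url and round_trip are hypothetical and chosen only for this illustration.

```python
# Python 3 sketch: urllib.parse is assumed in place of Python 2's urlparse.
from urllib.parse import urlparse, urlunparse

original = "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search"
parsed = urlparse(original)

# Keep scheme, host and path; blank out parameters, query and fragment.
base_url = urlunparse((parsed.scheme, parsed.netloc, parsed.path, "", "", ""))
print(base_url)    # http://www.google.com/search

# Reassembling the untouched tuple gives back the original URL.
round_trip = urlunparse(parsed)
print(round_trip == original)   # True
```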
The urljoin(base, url[, allow_fragments]) function joins URLs. It takes the first parameter as the base address and combines it with the relative address in the second parameter to form an absolute URL. urljoin is particularly useful when you need to append new file names to a base URL in order to process several files at the same location.
Note that if the base address does not end with a slash (/) character, the rightmost part of the base address is replaced by the relative path. For example, when the base address ends in a file or directory name with no trailing slash, that last segment is dropped and the relative path takes its place.
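The effect of the trailing slash is easiest to see side by side. The sketch below assumes Python 3 (urllib.parse); the www.example.com addresses are hypothetical and are not taken from the original article.

```python
# Python 3 sketch: urllib.parse is assumed in place of Python 2's urlparse.
from urllib.parse import urljoin

# Hypothetical base addresses, differing only in the trailing slash.
without_slash = urljoin("http://www.example.com/docs", "page.html")
with_slash = urljoin("http://www.example.com/docs/", "page.html")

print(without_slash)  # http://www.example.com/page.html
print(with_slash)     # http://www.example.com/docs/page.html

# A relative address that starts with "/" replaces the whole base path.
rooted = urljoin("http://www.google.com/search",
                 "/module-urllib2/request-objects.html")
print(rooted)  # http://www.google.com/module-urllib2/request-objects.html
```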
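The following sample output comes from opening http://www.python.org and a local index.html file, then printing the headers and the number of bytes read. The functions involved are described after the output: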
- Date: Fri, 26 Jun 2009 10:22:11 GMT
- Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
- Last-Modified: Thu, 25 Jun 2009 09:44:54 GMT
- ETag: "105800d-46e7-46d29136f7180"
- Accept-Ranges: bytes
- Content-Length: 18151
- Connection: close
- Content-Type: text/html
- Read 18151 bytes of data from http://www.python.org.
- Content-Type: text/html
- Content-Length: 865
- Last-modified: Fri, 26 Jun 2009 10:16:10 GMT
- Read 865 bytes of data from index.html.
To open an HTML document with the urlopen(url[, data]) function in the urllib module, you must provide the document's URL, including the file name. urlopen can open not only files on a remote web server but also local files, and it returns a file-like object through which we can read the data of the HTML document.
Once the HTML document is open, we can read it like a regular file using the read([nbytes]), readline(), and readlines() functions. To read the content of the entire HTML document, use the read() function, which returns the file content as a string.
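The reading functions can be tried without network access by opening a local file through a file:// URL. This is a sketch under some assumptions: it uses Python 3, where urlopen lives in urllib.request and read() returns bytes rather than a Python 2 string, and the sample document and temporary path are hypothetical.

```python
# Python 3 sketch: urllib.request.urlopen is assumed in place of
# the Python 2 urllib.urlopen described in the article.
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical sample document written to a temporary directory.
html = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"
doc_path = Path(tempfile.mkdtemp()) / "index.html"
doc_path.write_bytes(html.encode("utf-8"))
doc_url = doc_path.as_uri()  # a file:// URL for the local file

# read() returns the whole document at once (as bytes in Python 3).
with urlopen(doc_url) as page:
    whole = page.read()

# readline() returns one line; readlines() returns the remaining lines.
with urlopen(doc_url) as page:
    first_line = page.readline()
    rest = page.readlines()

print(len(whole), first_line, len(rest))
```

Replacing doc_url with an http:// address should read a remote page in the same way, provided the machine has network access.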
After opening an address, you can call geturl() on the returned object to obtain the real URL of the page that was fetched. This is useful because urlopen (or the opener object used) may follow a redirect, so the URL of the fetched page may differ from the URL that was requested.
Another commonly used function is info(), available on the file-like object returned by urlopen. It returns metadata about the URL location, such as the content length and content type. The sample output shown earlier in this article lists these headers for http://www.python.org and for a local index.html file.
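A minimal sketch of geturl() and info() follows. It assumes Python 3 (urllib.request) and uses a hypothetical local index.html so that it runs offline. Recent Python 3 documentation recommends the url and headers attributes instead, but the two methods are still available.

```python
# Python 3 sketch: urllib.request.urlopen is assumed in place of
# the Python 2 urllib.urlopen described in the article.
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical local document standing in for index.html.
doc_path = Path(tempfile.mkdtemp()) / "index.html"
doc_path.write_bytes(b"<html><body>Hello</body></html>\n")

with urlopen(doc_path.as_uri()) as page:
    final_url = page.geturl()   # URL actually opened (after any redirect)
    headers = page.info()       # metadata such as content type and length
    body = page.read()

print(final_url)
for name, value in headers.items():
    print("%s: %s" % (name, value))
print("Read %d bytes of data from %s." % (len(body), doc_path.name))
```

For a local file the headers are limited to the content type, content length, and last-modified time, which matches the shorter second group in the sample output.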