Python Chapter 2: Python and the World Wide Web


1. Screen scraping: You can use urllib to fetch the HTML source of a web page and then use regular expressions to extract information from it.
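A minimal sketch of the regular-expression approach, written for Python 3 (where the urllib functions live in urllib.request). The HTML snippet, the pattern, and the function name extract_links are illustrative assumptions, not the original example; a real program would obtain the text over the network instead of from an inline string.

```python
import re

# Illustrative HTML; in a real scraper this text would come from
# urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li><a href="/jobs/1">Python Developer</a></li>
  <li><a href="/jobs/2">Web Engineer</a></li>
</ul>
"""

# Hypothetical pattern: capture the href and the link text of each anchor.
LINK_PATTERN = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(html):
    """Return a list of (url, text) pairs found by the regular expression."""
    return LINK_PATTERN.findall(html)


for url, text in extract_links(SAMPLE_HTML):
    print(text, "->", url)
```

Adding any extra attribute before href in the anchors would already stop this pattern from matching, which shows how closely the expression depends on the exact markup.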

This method has at least three disadvantages. If the HTML is complicated, the regular expressions become messy and hard to maintain. The program cannot handle HTML features such as CDATA sections and character entities (such as &amp;). The regular expressions are tied to details of the HTML source rather than to a more abstract structure, so a small change in the page layout can break the program. There are two ways to address these problems. The first is to use the program Tidy together with XHTML parsing; the second is to use the Beautiful Soup library, which is designed specifically for screen scraping.
2. Tidy: Tidy is a tool for fixing ill-formed and sloppy HTML. It makes sure the document is well formed (that is, all elements are correctly nested), which makes parsing easier. The Tidy library is relatively simple to obtain and install.
Now assume that you have a messy HTML file called messy.html. A short program can run Tidy on the file and print the result.
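One way to run Tidy from Python is to call the command-line program through the subprocess module. This is a sketch under assumptions: it supposes that an executable named tidy is on the PATH, and the helper name tidy_html is hypothetical. If the program is not installed, the helper returns None instead of failing.

```python
import shutil
import subprocess


def tidy_html(html_text, tidy_cmd="tidy"):
    """Run the external tidy program on html_text and return its output.

    Returns None when the tidy executable is not installed.
    """
    if shutil.which(tidy_cmd) is None:
        return None
    # -q suppresses the summary; -asxhtml asks for well-formed XHTML.
    # Tidy exits with status 1 on mere warnings, so the return code
    # is deliberately not checked here.
    result = subprocess.run(
        [tidy_cmd, "-q", "-asxhtml"],
        input=html_text,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        with open("messy.html") as f:
            fixed = tidy_html(f.read())
    except FileNotFoundError:
        fixed = None
    print(fixed if fixed is not None else "tidy or messy.html is not available")
```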

3. Using HTMLParser: Once Tidy has produced well-formed XHTML, we can parse it with the standard library module HTMLParser. We only need to subclass HTMLParser and override event-handling methods such as handle_starttag or handle_data. The parser calls these methods automatically as it meets the corresponding construct: handle_starttag for each start tag, handle_endtag for each end tag, and handle_data for the text in between.

The HTMLParser module can then be used to extract information from the web page.
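The event-handler approach can be sketched as follows for Python 3, where the module is named html.parser. The class name LinkCollector and the sample markup are assumptions made for illustration; the original listing is not available.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._chunks = []   # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


parser = LinkCollector()
parser.feed('<p><a href="/jobs/1">Python &amp; Web Developer</a></p>')
parser.close()
print(parser.links)
```

Unlike the regular-expression version, the parser decodes the character entity &amp; before the text reaches handle_data.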

4. Beautiful Soup: Download the BeautifulSoup.py file and place it on the Python path (for example, in the site-packages directory of the Python installation). A program can then use it for screen scraping.
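A short sketch of scraping with Beautiful Soup. Note the assumptions: it uses the current bs4 package (installed with pip install beautifulsoup4) rather than the single BeautifulSoup.py file described above, and the function name and sample markup are hypothetical. The import is guarded so that the sketch returns None when the package is missing.

```python
try:
    from bs4 import BeautifulSoup  # third-party package, not in the standard library
except ImportError:
    BeautifulSoup = None

SAMPLE_HTML = """
<ul>
  <li><a href="/jobs/1">Python Developer</a></li>
  <li><a class="hot" href="/jobs/2">Web Engineer</a></li>
</ul>
"""


def extract_links(html):
    """Return (url, text) pairs for all anchors, or None if bs4 is missing."""
    if BeautifulSoup is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    return [(a["href"], a.get_text(strip=True))
            for a in soup.find_all("a", href=True)]


print(extract_links(SAMPLE_HTML))
```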

5. Using CGI to create a dynamic web page:
Step 1: The CGI program must be placed in a directory that is accessible through the web, and it must be identified as a CGI script. There are two common ways to do this: put the script in a subdirectory called cgi-bin, or give the script file the extension .cgi.
Step 2: Add the pound-bang line: Once the script is in the right place, add a pound-bang line at the very beginning of the script, that is, #!/usr/bin/env python. On Windows, give the full path to the interpreter instead, for example #!C:\python22\python.exe.
Step 3: Set the file permissions: chmod 755 somescript.cgi. The script can then be opened as a web page and executed. For security reasons, a CGI script is generally not allowed to modify any files on the computer. If you want it to modify a file, you must explicitly grant it permission. There are two options. If you have root privileges, you can create a user account for your script and change the ownership of the files it needs to modify. If you do not have root access, you can set the permissions of the file so that all users on the system can write to it: chmod 666 editable_file.txt.
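The permission changes in Step 3 can also be made from Python with os.chmod. The sketch below is only an illustration: the helper names are hypothetical, and it operates on a throwaway temporary file rather than a real script.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Equivalent of `chmod 755 path`: rwx for the owner, r-x for everyone else."""
    os.chmod(path, 0o755)


def make_world_writable(path):
    """Equivalent of `chmod 666 path`: read and write for every user."""
    os.chmod(path, 0o666)


# Demonstrate on a temporary file standing in for somescript.cgi.
with tempfile.NamedTemporaryFile(suffix=".cgi", delete=False) as tmp:
    script_path = tmp.name

make_executable(script_path)
mode = stat.S_IMODE(os.stat(script_path).st_mode)
print(oct(mode))
os.remove(script_path)
```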
6. Simple CGI Script: Example:
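A minimal CGI script in this spirit might look like the sketch below, written for Python 3. The function name build_response is an assumption; the response is built as a string first so that it can be inspected before it is written to standard output.

```python
#!/usr/bin/env python3
import sys


def build_response():
    """Return the full CGI response: headers, a blank line, then the body."""
    headers = "Content-type: text/plain\n"
    body = "Hello, world!\n"
    # The empty line separating headers from the body is required.
    return headers + "\n" + body


if __name__ == "__main__":
    sys.stdout.write(build_response())
```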

In such a script, the Content-type line tells the browser what kind of document follows; text/plain means plain text. If the page is HTML, the line should read: print 'Content-type: text/html'.
7. Using cgitb for debugging: Simply add the line import cgitb; cgitb.enable() after the #!/usr/bin/env python line. When the CGI script raises an uncaught error, a detailed traceback is then displayed on the web page.
8. Using the cgi module: Scripts usually need to receive input. Input is supplied to the CGI script as key-value pairs, or fields, typically from an HTML form. You can retrieve these fields in the CGI script with the FieldStorage class of the cgi module. When a FieldStorage instance is created (you should create only one), it fetches the input variables (or fields) from the request and presents them to the program through a dictionary-like interface. The values can be accessed through ordinary key lookup, but form['name'] returns a field object whose value attribute holds the actual value. A simpler way is the getvalue() method, which works like the get method of a dictionary except that it returns the value of the item's value attribute. For example: form = cgi.FieldStorage(); name = form.getvalue('name', 'unknown'). Here a default value is supplied; if none is given, None is used as the default. The default is used when the field has not been filled in.
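The lookup-with-default behaviour can be sketched as below. Be aware of the substitution: the cgi module was removed from the standard library in Python 3.13, so this sketch does not use FieldStorage; it reads the fields of a GET request with urllib.parse.parse_qs instead. The helper get_field is hypothetical and only imitates getvalue() for query strings, not for POST bodies or file uploads.

```python
import os
from urllib.parse import parse_qs


def get_field(query_string, key, default=None):
    """Return the first value submitted for key, or default if it is absent."""
    fields = parse_qs(query_string)  # maps each name to a list of values
    values = fields.get(key)
    return values[0] if values else default


def build_response(query_string):
    """Build a plain-text greeting from the 'name' field."""
    name = get_field(query_string, "name", "unknown")
    return "Content-type: text/plain\n\nHello, %s!\n" % name


if __name__ == "__main__":
    # A web server passes the GET query to a CGI script in QUERY_STRING.
    print(build_response(os.environ.get("QUERY_STRING", "")), end="")
```

An empty field such as name= is dropped by parse_qs, so the default is used in that case as well.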

The input to a CGI script usually comes from a submitted web form, but you can also call the CGI program directly with parameters, for example http://www.someserver.com/simple.cgi?name=a&age=1. You can build such URL queries with the urlencode function of the urllib module: urllib.urlencode({"name": "a", "age": "1"}). In Python 3 the function is urllib.parse.urlencode.
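A brief sketch of building such a URL in Python 3, using the server address from the text; the variable names are illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.someserver.com/simple.cgi"

# urlencode turns a mapping into a properly escaped query string.
params = {"name": "a", "age": "1"}
query = urlencode(params)
url = BASE_URL + "?" + query
print(url)
```

Special characters are escaped automatically, so a value containing a space is encoded with a plus sign.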
9. Creating a simple form:

The script fetches the CGI parameter name at the beginning and falls back to the default 'World'. If you open the page in a browser without submitting anything, the program uses the default value.
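A greeting page with a form could be sketched like this. As assumptions: the script name greeting.cgi and the function render_page are hypothetical, and, because the cgi module is no longer part of recent Python 3 releases, the name parameter is read with urllib.parse.parse_qs rather than FieldStorage.

```python
import html
import os
from urllib.parse import parse_qs

# Page template; greeting.cgi is a hypothetical name for this script.
PAGE = """<html>
  <head><title>Greeting Page</title></head>
  <body>
    <h1>Hello, %s!</h1>
    <form action="greeting.cgi">
      Change name: <input type="text" name="name" />
      <input type="submit" />
    </form>
  </body>
</html>
"""


def render_page(query_string):
    """Return the CGI response, greeting 'World' when no name was submitted."""
    fields = parse_qs(query_string)
    name = fields.get("name", ["World"])[0]
    # Escape user input before placing it inside HTML.
    return "Content-type: text/html\n\n" + PAGE % html.escape(name)


if __name__ == "__main__":
    print(render_page(os.environ.get("QUERY_STRING", "")), end="")
```

Submitting the form reloads the same script with the name in the query string, so the heading changes from the default to the submitted value.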
10. mod_python: Installation. On Unix, download the mod_python source code, unpack it, and enter the directory. Run mod_python's configure script: ./configure --with-apxs=/usr/local/apache/bin/apxs. If apxs is located elsewhere, adjust the path accordingly. Then compile everything with make, and install it with make install.
On Windows, download the installer and double-click it. Configuring Apache: Find the Apache configuration file that is used for specifying modules, usually called httpd.conf or apache.conf. On Unix, add LoadModule python_module libexec/mod_python.so; on Windows, add LoadModule python_module modules/mod_python.so. Now Apache knows where to find mod_python, but it still has no reason to use it: you must tell it when to do so. To do that, add a few lines to the Apache configuration, either in the main configuration file (perhaps commonapache2.conf) or in a file named .htaccess in the directory that holds the scripts for web access. The rest of this section assumes that the .htaccess file is used. If you use the main configuration file instead, wrap the directives in a <Directory /path/to/your/directory> ... </Directory> block.

To use the CGI handler, place two directives in the .htaccess file in the directory where the CGI scripts are located: SetHandler mod_python, followed by PythonHandler mod_python.cgihandler.

To get debugging information, add PythonDebug On. This directive should be removed once development is complete.
To support PSP pages, add the directives AddHandler mod_python .psp and PythonHandler mod_python.psp.

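In a PSP page, Python statements are written between <% and %>, and <%= expression %> inserts the value of an expression into the HTML, so a page can, for example, include a small amount of random data.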

Author: uohzoaix
