Use the HTMLParser module to parse HTML pages
HTMLParser is a Python module for parsing HTML and XHTML files. It can analyze the tags and data in HTML and is a simple way to process it. HTMLParser is event-driven: when the parser finds a specific construct such as a tag, it calls a user-defined function to notify the program. The main callbacks are all member functions of the HTMLParser class whose names start with handle_. To use it, derive a new class from HTMLParser and override the handle_ functions you need. Unlike the parser in htmllib, this parser is not based on the SGML parser in the sgmllib module.
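To make the event-driven model concrete, here is a minimal sketch. It assumes Python 3, where the same class lives in html.parser rather than in the Python 2 HTMLParser module; the EventLogger class and its events list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello</p>")
logger.close()
print(logger.events)
```

Each overridden handle_ method is called by the parser itself; the program never calls them directly.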
The htmllib and sgmllib modules have been deprecated since Python 2.6 and were removed in Python 3.0.
HTMLParser
class HTMLParser.HTMLParser
The HTMLParser class is instantiated without arguments.
An HTMLParser instance is fed HTML data and calls handler functions when tags begin and end. The HTMLParser class is meant to be overridden by the user to provide the desired behavior.
Unlike the parser in htmllib, this parser does not check that end tags match start tags, and it does not call the end-tag handler for elements which are closed implicitly by closing an outer element.
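The lack of tag matching can be observed with a small sketch. It assumes Python 3's html.parser; TagLogger is a hypothetical name. The b element is never closed explicitly, and the parser does not invent an end-tag event for it.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical subclass that records start and end tag events only."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


tag_logger = TagLogger()
# <b> is opened but only the outer <div> is closed.
tag_logger.feed("<div><b>bold</div>")
tag_logger.close()
print(tag_logger.events)
```

If balanced tags matter to your program, the subclass has to keep its own stack of open elements.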
An exception is defined as well:
exception HTMLParser.HTMLParseError
Exception raised by the HTMLParser class when it encounters an error while parsing. This exception provides three attributes: msg is a brief message explaining the error, lineno is the number of the line on which the broken construct was detected, and offset is the number of characters into the line at which the construct starts.
HTMLParser instances have the following methods:
HTMLParser.reset()
Reset the instance. All unprocessed data is lost. This is called implicitly at instantiation time.
HTMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.
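The buffering behavior of feed() can be sketched as follows, assuming Python 3's html.parser. TagCollector and tags_after_first_chunk are hypothetical names. The start tag is split across two calls, so nothing is reported until the second chunk completes it.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that collects the names of start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed("<a hre")                       # incomplete tag: buffered
tags_after_first_chunk = list(collector.tags)  # snapshot before more data
collector.feed('f="x">')                       # tag is now complete
collector.close()
print(tags_after_first_chunk, collector.tags)
```

This is what makes it possible to feed a page to the parser piece by piece as it arrives from the network.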
HTMLParser.close()
Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
HTMLParser.getpos()
Return the current line number and offset.
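A short sketch of getpos(), assuming Python 3's html.parser; PositionLogger is a hypothetical name. Calling getpos() inside a handler gives the position at which the construct being handled starts, with line numbers counted from 1 and offsets from 0.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple.
        self.positions.append((tag, self.getpos()))


position_logger = PositionLogger()
position_logger.feed("<html>\n  <body>\n</body></html>")
position_logger.close()
print(position_logger.positions)
```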
HTMLParser.get_starttag_text()
Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
HTMLParser.handle_starttag(tag, attrs)
This method is called to handle the start of a tag. It is intended to be overridden by a derived class; the base class implementation does nothing.
The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is converted to lower case, quotes in the value are removed, and character and entity references are replaced. For instance, for the tag <a href="http://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).
Changed in version 2.6: All entity references from htmlentitydefs are now replaced in the attribute values.
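The normalization of tag names and attributes can be checked with a small sketch, assuming Python 3's html.parser. AttrCollector is a hypothetical name, and the query string on the URL is added here only to show entity replacement.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical subclass that stores the arguments of handle_starttag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


attr_collector = AttrCollector()
# Upper-case tag and attribute names, plus an &amp; entity in the value.
attr_collector.feed('<A HREF="http://www.cwi.nl/?a=1&amp;b=2">')
attr_collector.close()
print(attr_collector.seen)
```

The tag and attribute names arrive in lower case, the quotes are gone, and &amp; has become a plain ampersand.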
HTMLParser.handle_startendtag(tag, attrs)
Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<a .../>). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
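The default behavior of handle_startendtag() can be seen in this sketch, which assumes Python 3's html.parser and uses the hypothetical name EmptyTagLogger. Because handle_startendtag() is not overridden, an XHTML-style empty tag produces one start event and one end event.

```python
from html.parser import HTMLParser


class EmptyTagLogger(HTMLParser):
    """Hypothetical subclass that leaves handle_startendtag at its default."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


empty_tag_logger = EmptyTagLogger()
empty_tag_logger.feed("<br/>")
empty_tag_logger.close()
print(empty_tag_logger.events)
```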
HTMLParser.handle_endtag(tag)
This method is called to handle the end tag of an element. It is intended to be overridden by a derived class; the base class implementation does nothing. The tag argument is the name of the tag converted to lower case.
HTMLParser.handle_data(data)
This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.
HTMLParser.handle_charref(name)
This method is called to process a character reference of the form &#ref;. It is intended to be overridden by a derived class; the base class implementation does nothing.
HTMLParser.handle_entityref(name)
This method is called to process a general entity reference of the form &name; where name is a general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
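The two reference callbacks can be sketched as below. This assumes Python 3's html.parser, where references are converted to text automatically by default, so the sketch passes convert_charrefs=False to get the callbacks described above. RefLogger is a hypothetical name.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        # convert_charrefs=False keeps the per-reference callbacks active.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("char", name))

    def handle_entityref(self, name):
        self.refs.append(("entity", name))


ref_logger = RefLogger()
ref_logger.feed("a &gt; b &#62; c")
ref_logger.close()
print(ref_logger.refs)
```

The name argument excludes the leading & (and #) and the trailing semicolon.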
HTMLParser.handle_comment(data)
This method is called when a comment is encountered. The data argument is a string containing the text between the <!-- and --> delimiters, but not the delimiters themselves. For example, the comment <!--text--> will cause this method to be called with the argument 'text'. It is intended to be overridden by a derived class; the base class implementation does nothing.
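A minimal sketch of the comment callback, assuming Python 3's html.parser; CommentCollector is a hypothetical name.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass that collects the text of HTML comments."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


comment_collector = CommentCollector()
comment_collector.feed("<p>visible</p><!--text-->")
comment_collector.close()
print(comment_collector.comments)
```

Any whitespace inside the delimiters is passed through unchanged, so a comment written with padding arrives with that padding.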
HTMLParser.handle_decl(decl)
Method called when an SGML doctype declaration is read by the parser. The decl parameter will be the entire contents of the declaration inside the <!...> markup. It is intended to be overridden by a derived class; the base class implementation does nothing.
HTMLParser.unknown_decl(data)
Method called when an unrecognized SGML declaration is read by the parser. The data parameter will be the entire contents of the declaration inside the <!...> markup. It is sometimes useful to override this method in a derived class; the base class implementation raises an HTMLParseError.
HTMLParser.handle_pi(data)
Method called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.
Note: The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
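The effect described in the note can be reproduced with this sketch, which assumes Python 3's html.parser and uses the hypothetical name PICollector. The instruction is written in XHTML style with a trailing '?', and that character ends up in the data passed to handle_pi().

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that collects processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_collector = PICollector()
# XHTML-style processing instruction with a trailing '?'.
pi_collector.feed("<?proc color='red'?>")
pi_collector.close()
print(pi_collector.instructions)
```

A handler that needs the bare instruction can strip a trailing '?' itself.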
Practical Application
As an example, take the way a web crawler collects URLs: we want to capture all the links on the Netease homepage. First, we need some background on HTML. The following content is from w3school.
What is an HTML hyperlink?
A hyperlink can be a word, a group of words, or an image. You can click the hyperlink to jump to a new document or to a part of the current document.
When you move the mouse pointer to a link on a webpage, the arrow becomes a small hand.
You can use the <a> tag to create a link in HTML.
There are two ways to use the <a> tag:
Use the href attribute to create a link to another document.
Use the name attribute to create a bookmark inside a document.
HTML link syntax
The HTML code for a link is very simple. It looks like this:
<a href="url">Link text</a>
The href attribute specifies the target of the link.
The text between the start tag and the end tag is displayed as a hyperlink.
Example
<a href="http://www.w3school.com.cn/">Visit W3School</a>
Writing the code
From the above we know that a link lives in the start tag <a>, and that its href attribute holds the URL we want to extract. Overriding handle_starttag() achieves this goal.
# -*- coding: utf-8 -*-
"""
Created on Tue Aug 30 09:46:45 2011
@author: Nupta
"""
import urllib2
import HTMLParser


class MyParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case, so compare against 'a'.
        if tag == 'a':
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name == 'href' and value and value.startswith('http'):
                    print value


if __name__ == '__main__':
    url = raw_input(u'Input address: '.encode('cp936'))
    # Download the whole page as one string.
    content = urllib2.urlopen(url).read()
    my = MyParser()
    try:
        my.feed(content)
    except HTMLParser.HTMLParseError, e:
        print e
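For readers on Python 3, the same idea can be sketched as follows. This is not the script used in this article: it assumes html.parser in place of the Python 2 HTMLParser module, and it parses an inline sample string instead of downloading a page so that it runs without network access. LinkCollector and sample_html are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical Python 3 counterpart that stores absolute http links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


# Inline sample page (assumption: stands in for a downloaded document).
sample_html = """
<a href="http://www.w3school.com.cn/">Visit W3School</a>
<a href="/relative/path">relative link, skipped</a>
<a name="bookmark">no href, skipped</a>
"""

link_collector = LinkCollector()
link_collector.feed(sample_html)
link_collector.close()
print(link_collector.links)
```

Collecting the links in a list rather than printing them makes the result easier to reuse, for example as the queue of a crawler.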
Problem Analysis
Running the script against the Netease homepage prints many links. Most of them are omitted here; look at the last few lines of output:
http://www.hd315.gov.cn/beian/view.asp?bianhao=0102000102300012
http://www.itrust.org.cn/yz/pjwx.asp?wm=2012043533
http://www.bj.cyberpolice.cn/index.htm
Malformed start tag, at line 3339, column 44
While reading the HTML, the parser hit a defective start tag at line 3339, column 44, and raised HTMLParseError. Counting from zero, that is index 3338 in the list of lines and roughly index 43 within that line. Because read() was used earlier, we now need readlines() instead, so that f holds the HTML as a list of lines:
print f[3338][34:67]
Check the result to see why:
<a href=\'http://mail.163.com/alipay/\'
The two backslash escape characters are what cause the parser to raise an exception. If you are not sure how to judge whether a piece of HTML is correct, paste it into the W3C validator to get an analysis:
Line 1, Column 9: an attribute value must be a literal unless it contains only name characters
<a href=\'http://mail.163.com/alipay/\'
You have used a character that is not considered a "name character" in an attribute value. Which characters are considered "name characters" varies between the different document types, but a good rule of thumb is that unless the value contains only lower or upper case letters in the range a-z you must put quotation marks around the value. In fact, unless you have extreme file size requirements it is a very good idea to always put quote marks around your attribute values. It is never wrong to do so, and very often it is absolutely necessary.
The check results for the login page of the 126 mailbox show the same thing: real-world HTML still has a long way to go before it complies with the XHTML specification.
Author: lmh12506