Python crawler section explained

Source: Internet
Author: User

Example: self.file is index.html, the home page saved from www.baidu.com.

    def parseAndGetLinks(self):  # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

First, here is what others have written on the subject, found by searching:

HTMLParser(AbstractFormatter(DumbWriter(StringIO()))) processes the page with the HTMLParser class from htmllib. StringIO is an in-memory file object, so data is read and written in memory rather than on disk, and DumbWriter converts the event stream into a plain text document written to that object.
The notes below are mainly about AbstractFormatter.

The formatter module
The formatter module provides formatter classes that can be used with htmllib.
These classes come in two kinds: formatters and writers. A formatter converts the HTML parser's stream of tags and data into an event stream suited to the output device, and a writer outputs that event stream to the device.
In most cases you can use the AbstractFormatter class for formatting. It calls methods of the writer object in response to the different formatting events. The AbstractWriter class prints a message each time one of its methods is called.
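The division of labor between a formatter and a writer can be sketched as follows. This is only an illustration under stated assumptions: the standard formatter module is no longer available in recent Python 3 releases, so MiniFormatter and MiniWriter are hypothetical stand-ins written for this example. They borrow a few method names from the real classes (add_flowing_data, end_paragraph, send_flowing_data, send_paragraph) but are not the library's implementation.

```python
import io


class MiniWriter:
    """Hypothetical stand-in for a writer: sends events to an output stream."""

    def __init__(self, stream):
        self.stream = stream

    def send_flowing_data(self, text):
        self.stream.write(text)

    def send_paragraph(self, blank_lines):
        self.stream.write("\n" * blank_lines)


class MiniFormatter:
    """Hypothetical stand-in for a formatter: turns parser calls into writer events."""

    def __init__(self, writer):
        self.writer = writer

    def add_flowing_data(self, text):
        # Collapse runs of whitespace before handing the text to the writer.
        self.writer.send_flowing_data(" ".join(text.split()))

    def end_paragraph(self, blank_lines):
        self.writer.send_paragraph(blank_lines)


# The output "device" here is an in-memory buffer, as with StringIO in the article.
buffer = io.StringIO()
formatter = MiniFormatter(MiniWriter(buffer))
formatter.add_flowing_data("Hello,   formatter\n world")
formatter.end_paragraph(1)
result = buffer.getvalue()
print(repr(result))
```

The parser only ever talks to the formatter, and the formatter only ever talks to the writer, which is why the writer can be swapped without touching the parsing code.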

HTMLParser is a module that Python provides for parsing HTML. It can pick out the tags, data, and other parts of an HTML document, and is a simple way to process HTML. HTMLParser is event-driven: when it finds a particular construct, it calls a user-defined method to tell the program to process it. The main callbacks are member methods of HTMLParser whose names begin with handle_. To use it, derive a new class from HTMLParser and redefine the methods that begin with handle_. These methods are:
handle_startendtag handles combined start-and-end (self-closing) tags, such as <xx/>
handle_starttag handles start tags, such as <xx>
handle_endtag handles end tags, such as </xx>
handle_charref handles numeric character references, which begin with &# and usually represent a character by its code
handle_entityref handles named entity references, which begin with &, such as &nbsp;
handle_data handles data, that is, the text in the middle of <xx>data</xx>
handle_comment handles comments
handle_decl handles declarations that begin with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
handle_pi handles processing instructions such as <?instruction>
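To see which handler fires for which construct, the list above can be exercised with a small logging subclass. This is a minimal sketch that assumes Python 3, where the module is named html.parser. The class name EventLogger and the sample markup are made up for illustration. In Python 3 the character and entity handlers are only called when convert_charrefs is turned off, so the sketch does that explicitly.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventLogger(HTMLParser):
    """Hypothetical subclass that records which handler fires for each construct."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


logger = EventLogger()
logger.feed("<!DOCTYPE html><?proc x><p>a&nbsp;b&#62;<br/><!--note--></p>")
logger.close()
for event in logger.events:
    print(event)
```

Each tuple printed pairs a handler name (without the handle_ prefix) with the argument that handler received.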
Here I use extracting the URLs from a web page as an example. To get a URL, parse the <a> tag and take the value of its href attribute. Here is the code:

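# -*- encoding: gb2312 -*-
import HTMLParser  # Python 2 module; it is named html.parser in Python 3

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # The handler for start tags is redefined here
        if tag == 'a':
            # Check the attributes of the <a> tag
            for name, value in attrs:
                if name == 'href':
                    print value

if __name__ == '__main__':
    # Sample HTML for illustration; any page source can be used here
    a = '<html><body><a href="http://www.163.com">www.163.com</a><a href="http://www.linuxqq.net">www.linuxqq.net</a></body></html>'
    my = MyParser()
    # Pass in the data to be parsed, which is HTML.
    my.feed(a)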
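For readers on Python 3, the same idea might look like the sketch below. It is not the author's code: the names LinkCollector and extract_links and the sample string are assumptions made for this example. It collects the href values in a list instead of printing them, which is closer to what a crawler needs.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class LinkCollector(HTMLParser):
    """Hypothetical Python 3 link extractor that stores hrefs instead of printing them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        if tag == "a":
            for name, value in attrs:
                # value is None when an attribute is written without a value.
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Feed a complete HTML string and return the collected href values."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<p><a href="http://www.163.com">163</a> <A HREF="http://www.linuxqq.net">linuxqq</A></p>'
links = extract_links(sample)
print(links)
```

Because the parser lowercases tag and attribute names, the uppercase <A HREF=...> in the sample is picked up by the same check.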

class HTMLParser.HTMLParser: An HTMLParser instance is fed HTML data and calls handler methods when start tags and end tags are encountered. The HTMLParser class is meant to be subclassed, with the handlers overridden to provide the desired behavior. Unlike the parser in htmllib, this parser does not check that end tags match start tags.
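The point about unmatched tags can be checked with a short experiment. The sketch below assumes Python 3 (module html.parser). TagTrace is a hypothetical helper name, and the malformed input is made up for the example.

```python
from html.parser import HTMLParser


class TagTrace(HTMLParser):
    """Hypothetical helper that records start and end tags in the order seen."""

    def __init__(self):
        super().__init__()
        self.trace = []

    def handle_starttag(self, tag, attrs):
        self.trace.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.trace.append("</" + tag + ">")


tracer = TagTrace()
# <i> is never closed and </u> was never opened; the parser raises no error.
tracer.feed("<b><i>text</b></u>")
tracer.close()
print(tracer.trace)
```

The tags are reported exactly as they appear in the input, so any nesting checks have to be done by the subclass itself, for example with its own stack of open tags.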

An HTMLParser instance has the following methods:
HTMLParser.reset() # Resets the instance. All unprocessed data is lost. Called automatically at instantiation time.
HTMLParser.feed(data) # Feeds data to the parser. The data is processed as far as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.
HTMLParser.close() # Forces processing of all buffered data. A derived class may redefine this method to do additional work at the end of the input, but the redefined version should still call the close() method of the HTMLParser base class.
HTMLParser.getpos() # Returns the current line number and offset.
HTMLParser.get_starttag_text() # Returns the text of the most recently opened start tag. Not normally needed.
HTMLParser.handle_starttag(tag, attrs) # Called to handle the start of a tag. The tag argument is the tag name converted to lowercase. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. Each name is converted to lowercase, quotes around the value are removed, and character and entity references are replaced. For example, for the tag <a href="http://www.cwi.nl/">, this method is called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).
HTMLParser.handle_startendtag(tag, attrs) # Similar to handle_starttag(), but called for XHTML-style empty tags (<a .../>). May be overridden by subclasses.
HTMLParser.handle_endtag(tag) # Called to handle the end tag of an element. May be overridden by a derived class; the base class implementation does nothing. The tag argument is the tag name converted to lowercase.
HTMLParser.handle_data(data) # Called to process arbitrary data.
HTMLParser.handle_charref(name) # Called to process a character reference of the form &#ref;.
HTMLParser.handle_entityref(name) # Called to process a general entity reference of the form &name;. The name argument is the name of the entity.
HTMLParser.handle_comment(data) # Called when a comment is encountered. The data argument is the text between the -- and -- delimiters, not the delimiters themselves. For example, the comment <!--text--> causes this method to be called with the argument 'text'.
HTMLParser.handle_decl(decl) # Called when the parser encounters an SGML declaration. The decl argument is the entire contents of the declaration inside the <!...> markup.
HTMLParser.handle_pi(data) # Called to handle a processing instruction. The data argument contains the entire processing instruction. For example, for <?proc color='red'>, this method is called as handle_pi("proc color='red'").
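The buffering behavior of feed() and the two informational methods, getpos() and get_starttag_text(), can be observed with a sketch like the one below. It assumes Python 3 (module html.parser). PositionParser is a hypothetical name, and the input is deliberately split in the middle of a tag to simulate data arriving in chunks.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical subclass that records where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line, offset); get_starttag_text() gives the raw tag text.
        self.seen.append((tag, attrs, self.getpos(), self.get_starttag_text()))


parser = PositionParser()
# The <a> tag is cut off mid-attribute, so the parser buffers it.
parser.feed('<p>\n<a href="http://')
count_after_first_feed = len(parser.seen)
# The rest of the tag arrives; only now is handle_starttag called for <a>.
parser.feed('www.cwi.nl/">CWI</a></p>')
parser.close()

print(count_after_first_feed)
for entry in parser.seen:
    print(entry)
```

Only the complete <p> tag is reported after the first feed() call. The <a> tag is reported once its closing bracket has arrived, with its position on the second line of the input.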

Usage example:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
import HTMLParser  # Python 2 module; it is named html.parser in Python 3

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # The handler for start tags is redefined here
        if tag == 'a':
            # Check the attributes of the <a> tag
            for name, value in attrs:
                if name == 'href':
                    print value

if __name__ == '__main__':
    a = '<html><body><a href="http://www.163.com">www.163.com</a><a href="http://www.linuxqq.net">www.linuxqq.net</a></body></html>'
    my = MyParser()
    # Pass in the data to be parsed, which is HTML.
    my.feed(a)

Output:

http://www.163.com
http://www.linuxqq.net

This article is from small Q brother, please indicate the source and corresponding links when reproduced.
