Python web crawlers: lxml and HTMLParser

Tags: xpath, python, web crawler

Python's lxml package parses HTML and XML files; personally I find it more flexible than BeautifulSoup.
lxml locates elements with XPath path expressions.

The table below lists some common path expressions and what they select:

nodename    selects all child elements with the given name
/           selects from the root node
//          selects matching nodes anywhere in the document, regardless of position
.           selects the current node
..          selects the parent of the current node
@           selects attributes

For example, //book selects every book element in the document, bookstore/book selects the book elements that are children of bookstore, and //@lang selects every attribute named lang.

You can also select several paths in one expression with the '|' operator, for example:
book/title | book/price selects all the title and price elements of every book element.
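To make these expressions concrete, here is a minimal runnable sketch; the bookstore XML below is invented for illustration and is not from the site used later in this article:

from lxml import etree

# A small made-up document, just to exercise the path expressions above
xml = """
<bookstore>
  <book><title>Book One</title><price>29.99</price></book>
  <book><title>Book Two</title><price>39.95</price></book>
</bookstore>
"""

root = etree.fromstring(xml)
print [t.text for t in root.xpath('//title')]                   # every title in the document
print [e.text for e in root.xpath('book/title | book/price')]   # titles and prices of child book elements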
Now let's see lxml in action. Using the same website as before, the code is as follows:
from lxml import etree
import urllib2

def parse_url_xml():
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        html = etree.HTML(fd.read())
        result = html.xpath('//*[@id="content_1"]/span[7]/a')
        print type(result)
        for r in result:
            print r.text
    except BaseException, e:
        print e

parse_url_xml()
Import etree first, initialize the document with etree.HTML(), and then search it with xpath(). The expression //*[@id="content_1"]/span[7]/a is the XPath address of the page element.

Reading the expression: //* means match a node anywhere in the document as long as its attributes fit, so we first find the tag whose id attribute is content_1. Under it we go to the 7th span tag and then to the a tag below that. The result is a list of all the elements found, and we print their contents by traversing the list. The output is as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
Section 7
As you can see, XPath expressions are actually quite easy to write, and they locate elements more precisely than BeautifulSoup. If you don't feel like writing the XPath yourself there is an even simpler way: press F12 in the browser, find the element you want in the page source, then right-click and choose Copy XPath to get the XPath path.

Here are a few more examples. To get the last span element, you can write:
result = html.xpath('//*[@id="content_1"]/span[last()]/a')
The results are as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
Section 657
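last() is just one of the positional predicates XPath supports. Assuming the same html tree as above, variations like these work as well (the index values here are only for illustration):

result = html.xpath('//*[@id="content_1"]/span[last()-1]/a')       # the second-to-last span
result = html.xpath('//*[@id="content_1"]/span[position()<=3]/a')  # the first three spans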
We can also simplify the //*[@id="content_1"]/span[7]/a expression we just used
to //*[@href="7.shtml"], which directly searches for any element whose href attribute is 7.shtml.
If you want more than one element back, the following expression returns, for example, both section 7 and section 8:
result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
If you want the attribute value of a node you found, use the get method:

result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
print type(result)
for r in result:
    print r.get('href')

The value of each node's href attribute is printed as the result:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
7.shtml
8.shtml
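get() is not the only way to read a matched element; each lxml element also carries its tag name, an attribute dictionary, and its text. A short sketch, reusing the result list above, prints all three at once:

for r in result:
    print r.tag, r.attrib, r.text   # tag name, all attributes as a dict, link text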

Now let's look at how to use HTMLParser.

HTMLParser is a web-page parsing tool that ships with Python. It is simple to use and parses HTML files easily.
Let's look at the relevant code:
from HTMLParser import HTMLParser
import urllib2

class NewParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.flag = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True

    def handle_data(self, data):
        if self.flag == True:
            print data
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False


if __name__ == "__main__":
    parser = NewParser()
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        parser.feed(fd.read())
        print parser.text
    except BaseException, e:
        print e
First we define a class that inherits from HTMLParser and set up our own attributes in __init__.
parser.feed(fd.read()) calls the feed method of the base class; its argument is the HTML source of the page. feed is essentially the driver function. Let's look at its prototype. The implementation below does two things: it appends the incoming page source to rawdata, then runs goahead to start processing.
    def feed(self, data):
        r"""Feed data to the parser.

        Call this as often as you want, with as little or as much text
        as you want (may include '\n').
        """
        self.rawdata = self.rawdata + data
        self.goahead(0)
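Since feed() only appends to rawdata before calling goahead(), it can be called repeatedly with pieces of the document, which is what the docstring means by "as often as you want". A minimal sketch, reusing the NewParser class above with made-up input:

parser = NewParser()
parser.feed('<span>Section 1</span>')   # each call is processed as soon as it arrives
parser.feed('<span>Section 2</span>')
print parser.text                       # ['Section 1', 'Section 2']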
The goahead function is too long to quote in full here. Its job is to walk through rawdata and call different handlers depending on the markers it meets. The key part: when it encounters '<' it calls parse_starttag, and when it encounters '</' it calls parse_endtag.

parse_starttag in turn calls handle_starttag, and parse_endtag calls handle_endtag.
In the base class, handle_starttag and handle_endtag are empty functions; they are simply handed the current tag and attrs. That is what gives us the chance to override them.
    def handle_starttag(self, tag, attrs):
        pass

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        pass
handle_data is where the actual data in the page source gets processed.
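As a quick illustration of this dispatch, a tiny sketch like the one below (the markup string is made up) prints which handler fires for each piece of a tag:

from HTMLParser import HTMLParser

class TraceParser(HTMLParser):
    # Print every callback so the dispatch order becomes visible
    def handle_starttag(self, tag, attrs):
        print 'start tag:', tag, attrs
    def handle_data(self, data):
        print 'data     :', repr(data)
    def handle_endtag(self, tag):
        print 'end tag  :', tag

TraceParser().feed('<span class="s">Section 7</span>')
# start tag: span [('class', 's')]
# data     : 'Section 7'
# end tag  : span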
With all that said, the way HTMLParser works should now be clear. It processes the page source piece by piece and decides, in turn, whether to call handle_starttag, handle_endtag, or handle_data. HTMLParser parses out the tag, attrs, and data for us, and we extract the information we need by overriding these functions. Back to our earlier code: what we want to do is extract the text inside the <span></span> fields.
First, __init__ defines two attributes, flag and text; flag starts out as False:

    def __init__(self):
        HTMLParser.__init__(self)
        self.flag = False
        self.text = []
handle_starttag simply sets flag to True whenever the tag is span:

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True
handle_data extracts the data and saves it in the text list as long as flag is True:

    def handle_data(self, data):
        if self.flag == True:
            print data
            self.text.append(data)
So when does the extraction stop? That is handle_endtag's job: when it meets a span end tag it sets flag back to False, so no more data is extracted until the next span tag is encountered:

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False
That is the complete workflow of HTMLParser. Isn't it more concise and clear than the earlier BeautifulSoup and lxml examples? For badly formed web pages, HTMLParser copes better than BeautifulSoup and lxml. All of the handler functions are listed below:
handle_startendtag: handles tags that open and close in one, such as <br/>
handle_starttag: handles start tags, such as <xx>
handle_endtag: handles end tags, such as </xx>
handle_charref: handles character references, i.e. entities starting with &#, usually a character given by its code point
handle_entityref: handles named entities starting with &, such as &nbsp;
handle_data: handles data, i.e. the text between <xx> and </xx>
handle_comment: handles comments
handle_decl: handles declarations starting with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi: handles processing instructions such as <?instruction>
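A small sketch, again with an invented snippet, shows a few of the less common handlers firing:

from HTMLParser import HTMLParser

class EntityParser(HTMLParser):
    def handle_comment(self, data):
        print 'comment :', data
    def handle_entityref(self, name):
        print 'entity  :', name   # e.g. 'nbsp' for &nbsp;
    def handle_charref(self, name):
        print 'charref :', name   # e.g. '65' for &#65;

EntityParser().feed('<!-- a note -->&nbsp;&#65;')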
The other handlers are used in much the same way. Starting from the next chapter we will look at Scrapy.
