Python's lxml package parses HTML and XML files; I personally find it more flexible than BeautifulSoup.
lxml locates nodes with XPath path expressions. Some common path expressions and their results:
nodename   selects all child elements of the named node
/          selects from the root node
//         selects matching nodes anywhere in the document, regardless of position
.          selects the current node
..         selects the parent of the current node
@          selects attributes
You can also select several paths in one expression with the '|' operator, like the following:
book/title | book/price selects all the title and price elements of the book elements.
Now let's look at lxml in practice, again using the website from before. The code is as follows:
from lxml import etree
import urllib2

def parse_url_xml():
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        html = etree.HTML(fd.read())
        result = html.xpath('//*[@id="content_1"]/span[7]/a')
        print type(result)
        for r in result:
            print r.text
    except BaseException, e:
        print e
First import etree and initialize the document with etree.HTML(). Then use xpath() to find elements; here //*[@id="content_1"]/span[7]/a is the XPath address of the page element.
Reading the expression: //* matches any element anywhere in the document, as long as it satisfies the attribute condition; [@id="content_1"] filters to the element whose id attribute is content_1; span[7] then selects its seventh span child; and /a selects the a element below that. The result is a list representing all the elements found, so we print the content by iterating over the list. The results of the run are as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
Section 7
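The same traversal can be reproduced offline on a small HTML snippet, without hitting the website; the div id and section names below are invented for illustration:

```python
from lxml import etree

# A minimal snippet standing in for the real page (content invented).
page = """
<div id="content_1">
  <span><a href="1.shtml">Section 1</a></span>
  <span><a href="2.shtml">Section 2</a></span>
</div>
"""

html = etree.HTML(page)
# //* matches any element anywhere; [@id="content_1"] filters by id;
# span[2] takes the second span child; /a descends to the link.
result = html.xpath('//*[@id="content_1"]/span[2]/a')
print(type(result))        # xpath() always returns a list
for r in result:
    print(r.text)
```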
As you can see, XPath is actually easy to write, and it locates elements more precisely than BeautifulSoup. And if you find it troublesome and don't want to write the XPath yourself, there is an even simpler way: press F12 in the browser, find the element you want in the page source, then right-click it and choose Copy XPath to get the XPath path.
Here are a few more examples. For instance, to get the last span element, you can use last():
result = html.xpath('//*[@id="content_1"]/span[last()]/a')
The results are as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
Section 657
We can also simplify the //*[@id="content_1"]/span[7]/a we just used
to //*[@href="7.shtml"], which directly finds the element whose href attribute is 7.shtml.
If you want to return more than one element, you can use a union, which here fetches both Section 7 and Section 8:
result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
If you want to get an attribute value from the nodes you found, you can use the get method:
result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
print type(result)
for r in result:
    print r.get('href')
The href attribute values of the nodes are displayed as the result:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
7.shtml
8.shtml
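These three tricks (last(), attribute predicates, and the | union combined with get()) can all be tried offline on a small snippet; the div id and hrefs below are invented for illustration:

```python
from lxml import etree

# A minimal snippet standing in for the real page (hrefs invented).
html = etree.HTML("""
<div id="content_1">
  <span><a href="7.shtml">Section 7</a></span>
  <span><a href="8.shtml">Section 8</a></span>
  <span><a href="9.shtml">Section 9</a></span>
</div>
""")

# last() selects the final span, however many there are.
last_link = html.xpath('//*[@id="content_1"]/span[last()]/a')[0]
print(last_link.text)

# Matching on an attribute skips the positional path entirely,
# and | unions the two result sets.
result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
for r in result:
    print(r.get('href'))   # get() reads an attribute of the matched node
```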
Next, let's look at how to use HTMLParser.
HTMLParser is an HTML parsing tool that ships with Python. It is simple to use and parses HTML files easily.
Let's look at the relevant code:
from HTMLParser import HTMLParser
import urllib2

class NewParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.flag = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True

    def handle_data(self, data):
        if self.flag == True:
            print data
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False

if __name__ == "__main__":
    parser = NewParser()
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        parser.feed(fd.read())
        print parser.text
    except BaseException, e:
        print e
First define a class that inherits from HTMLParser, and define your own attributes in the __init__ function.
parser.feed(fd.read()) calls the feed function, which the class itself provides; its parameter is the HTML code of the web page. feed acts as the driver function. Let's look at its implementation. You can see it does two things: it appends the incoming page code to rawdata, then runs goahead to start processing.
def feed(self, data):
    r"""Feed data to the parser.

    Call this as often as you want, with as little or as much text
    as you want (may include '\n').
    """
    self.rawdata = self.rawdata + data
    self.goahead(0)
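A consequence of this buffering in rawdata is that feed() can be called with arbitrary chunks, even ones that split a tag in half; the parser waits until the markup is complete. A minimal sketch (written for Python 3, where the class lives in the html.parser module; the mid-tag split is deliberate):

```python
from html.parser import HTMLParser  # Python 3 location of the class

class TitleParser(HTMLParser):
    """Collect the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

p = TitleParser()
# The <title> start tag is split across two feed() calls: the first chunk
# ends with an incomplete '<ti', which stays buffered in rawdata.
p.feed('<html><head><ti')
p.feed('tle>Hello</title></head></html>')
print(p.title)   # Hello
```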
The goahead function is too long to quote in full here. What it does is walk through the data in rawdata and dispatch to different functions according to the different markers it meets. The key part: when it encounters '<' it calls parse_starttag, and when it encounters '</' it calls parse_endtag.
parse_starttag in turn calls handle_starttag, and parse_endtag calls handle_endtag.
In the library code, handle_starttag and handle_endtag are empty functions; they merely receive the current tag and attrs. This gives us the chance to override them:
def handle_starttag(self, tag, attrs):
    pass

def handle_endtag(self, tag):
    pass

def handle_data(self, data):
    pass
Here handle_data handles the actual data in the page code.
With all that said, how HTMLParser works should now be clear. It processes the page code and decides, piece by piece, whether to enter handle_starttag, handle_endtag, or handle_data. HTMLParser parses out the tag, attrs and data for us; we extract the information we need by overriding these functions. So back to our earlier code: what we want to implement is extracting the text inside the <span></span> fields.
First, __init__ defines two attributes, flag and text; flag starts as False:
def __init__(self):
    HTMLParser.__init__(self)
    self.flag = False
    self.text = []
handle_starttag sets flag to True whenever the tag is span:
def handle_starttag(self, tag, attrs):
    if tag == 'span':
        self.flag = True
handle_data extracts the data and saves it in the text list whenever flag is True:
def handle_data(self, data):
    if self.flag == True:
        print data
        self.text.append(data)
So when does the extraction stop? That is the job of handle_endtag. Likewise, when the tag span is encountered, flag is set back to False, so no more data is extracted until the next span tag appears:
def handle_endtag(self, tag):
    if tag == 'span':
        self.flag = False
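The whole extractor can be tried offline without the website. The sketch below is a Python 3 port (the class moved to html.parser there) run against an invented snippet:

```python
from html.parser import HTMLParser  # Python 3 location of the class

class SpanParser(HTMLParser):
    """Collect the text that appears inside <span> tags."""
    def __init__(self):
        super().__init__()
        self.flag = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True       # start collecting

    def handle_data(self, data):
        if self.flag:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False      # stop collecting

parser = SpanParser()
# An invented snippet standing in for the real page.
parser.feed('<div><span>Section 7</span><b>skip</b><span>Section 8</span></div>')
print(parser.text)
```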
That is the full functionality of HTMLParser. Isn't it more concise and clear than BeautifulSoup and lxml above? And for non-standard web pages, HTMLParser copes better than BeautifulSoup and lxml. All the handler functions are listed below:
handle_startendtag handles tags that open and close in one, such as <br/>
handle_starttag handles start tags, such as <xx>
handle_endtag handles end tags, such as </xx>
handle_charref handles character references, i.e. those beginning with &#, which usually encode a character by its code point
handle_entityref handles named entity references beginning with &, such as &amp;
handle_data handles data, i.e. the text between tags: <xx>data</xx>
handle_comment handles comments
handle_decl handles declarations beginning with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi handles processing instructions such as <?instruction>
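A small logger makes it easy to see which handler fires for which piece of markup. This sketch uses Python 3's html.parser; note that convert_charrefs=False is needed there, otherwise entity and character references are converted to plain text before the handlers ever see them:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record which handler fires for each piece of markup."""
    def __init__(self):
        # Keep &amp; and &#66; as separate events instead of folding
        # them into handle_data (Python 3 defaults to converting them).
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_entityref(self, name):
        self.events.append(('entity', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_decl(self, decl):
        self.events.append(('decl', decl))

p = EventLogger()
p.feed('<!DOCTYPE html><p>A&amp;B&#66;<!-- note --></p>')
for ev in p.events:
    print(ev)
```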
The other handlers work along similar lines. Starting from the next chapter, the usage of scrapy will be introduced.