Python web crawlers: lxml and HTMLParser

Tags: xpath, python, web crawler

Python's lxml package parses HTML and XML files; personally I find it more flexible than BeautifulSoup.
lxml locates elements with XPath path expressions.

The table below lists some common path expressions and what they select:

nodename    selects all child elements with the given name
/           selects from the root node
//          selects matching nodes anywhere in the document, regardless of position
.           selects the current node
..          selects the parent of the current node
@           selects attributes

For example, //book selects every book element in the document, bookstore/book selects the book elements that are children of bookstore, and //@lang selects every attribute named lang.

You can also select several paths in one expression with the '|' operator, for example:
book/title | book/price selects all the title and price elements of every book element.
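To make these expressions concrete, here is a minimal runnable sketch; the bookstore XML below is invented for illustration and is not from the site used later in this article:

from lxml import etree

# A small made-up document, just to exercise the path expressions above
xml = """
<bookstore>
  <book><title>Book One</title><price>29.99</price></book>
  <book><title>Book Two</title><price>39.95</price></book>
</bookstore>
"""

root = etree.fromstring(xml)
print [t.text for t in root.xpath('//title')]                   # every title in the document
print [e.text for e in root.xpath('book/title | book/price')]   # titles and prices of child book elements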
Now let's see lxml in action. Using the same website as before, the code is as follows:
from lxml import etree
import urllib2

def parse_url_xml():
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        html = etree.HTML(fd.read())
        result = html.xpath('//*[@id="content_1"]/span[7]/a')
        print type(result)
        for r in result:
            print r.text
    except BaseException, e:
        print e

parse_url_xml()
Import etree first, initialize the document with etree.HTML(), and then search it with xpath(). The expression //*[@id="content_1"]/span[7]/a is the XPath address of the page element.

Reading the expression: //* means match a node anywhere in the document as long as its attributes fit, so we first find the tag whose id attribute is content_1. Under it we go to the 7th span tag and then to the a tag below that. The result is a list of all the elements found, and we print their contents by traversing the list. The output is as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
Section 7
As you can see, XPath expressions are actually quite easy to write, and they locate elements more precisely than BeautifulSoup. If you don't feel like writing the XPath yourself there is an even simpler way: press F12 in the browser, find the element you want in the page source, then right-click and choose Copy XPath to get the XPath path.

Here are a few more examples. To get the last span element, you can write:
result = html.xpath('//*[@id="content_1"]/span[last()]/a')
The results are as follows:
E:\python2.7.11\python.exe e:/py_prj/test.py
Section 657
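last() is just one of the positional predicates XPath supports. Assuming the same html tree as above, variations like these work as well (the index values here are only for illustration):

result = html.xpath('//*[@id="content_1"]/span[last()-1]/a')       # the second-to-last span
result = html.xpath('//*[@id="content_1"]/span[position()<=3]/a')  # the first three spans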
We can also simplify the //*[@id="content_1"]/span[7]/a expression we just used
to //*[@href="7.shtml"], which directly searches for any element whose href attribute is 7.shtml.
If you want more than one element back, the following expression returns, for example, both section 7 and section 8:
result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
If you want the attribute value of a node you found, use the get method:

result = html.xpath('//*[@href="7.shtml"] | //*[@href="8.shtml"]')
print type(result)
for r in result:
    print r.get('href')

The value of each node's href attribute is printed as the result:
E:\python2.7.11\python.exe e:/py_prj/test.py
<type 'list'>
7.shtml
8.shtml
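get() is not the only way to read a matched element; each lxml element also carries its tag name, an attribute dictionary, and its text. A short sketch, reusing the result list above, prints all three at once:

for r in result:
    print r.tag, r.attrib, r.text   # tag name, all attributes as a dict, link text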

Now let's look at how to use HTMLParser.

HTMLParser is a web-page parsing tool that ships with Python. It is simple to use and parses HTML files easily.
Let's look at the relevant code:
from HTMLParser import HTMLParser
import urllib2

class NewParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.flag = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True

    def handle_data(self, data):
        if self.flag == True:
            print data
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False


if __name__ == "__main__":
    parser = NewParser()
    try:
        req = urllib2.Request('http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml')
        fd = urllib2.urlopen(req)
        parser.feed(fd.read())
        print parser.text
    except BaseException, e:
        print e
First we define a class that inherits from HTMLParser and set up our own attributes in __init__.
parser.feed(fd.read()) calls the feed method of the base class; its argument is the HTML source of the page. feed is essentially the driver function. Let's look at its prototype. The implementation below does two things: it appends the incoming page source to rawdata, then runs goahead to start processing.
    def feed(self, data):
        r"""Feed data to the parser.

        Call this as often as you want, with as little or as much text
        as you want (may include '\n').
        """
        self.rawdata = self.rawdata + data
        self.goahead(0)
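Since feed() only appends to rawdata before calling goahead(), it can be called repeatedly with pieces of the document, which is what the docstring means by "as often as you want". A minimal sketch, reusing the NewParser class above with made-up input:

parser = NewParser()
parser.feed('<span>Section 1</span>')   # each call is processed as soon as it arrives
parser.feed('<span>Section 2</span>')
print parser.text                       # ['Section 1', 'Section 2']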
The goahead function is too long to quote in full here. Its job is to walk through rawdata and call different handlers depending on the markers it meets. The key part: when it encounters '<' it calls parse_starttag, and when it encounters '</' it calls parse_endtag.

parse_starttag in turn calls handle_starttag, and parse_endtag calls handle_endtag.
In the base class, handle_starttag and handle_endtag are empty functions; they are simply handed the current tag and attrs. That is what gives us the chance to override them.
    def handle_starttag(self, tag, attrs):
        pass

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        pass
handle_data is where the actual data in the page source gets processed.
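As a quick illustration of this dispatch, a tiny sketch like the one below (the markup string is made up) prints which handler fires for each piece of a tag:

from HTMLParser import HTMLParser

class TraceParser(HTMLParser):
    # Print every callback so the dispatch order becomes visible
    def handle_starttag(self, tag, attrs):
        print 'start tag:', tag, attrs
    def handle_data(self, data):
        print 'data     :', repr(data)
    def handle_endtag(self, tag):
        print 'end tag  :', tag

TraceParser().feed('<span class="s">Section 7</span>')
# start tag: span [('class', 's')]
# data     : 'Section 7'
# end tag  : span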
With all that said, the way HTMLParser works should now be clear. It processes the page source piece by piece and decides, in turn, whether to call handle_starttag, handle_endtag, or handle_data. HTMLParser parses out the tag, attrs, and data for us, and we extract the information we need by overriding these functions. Back to our earlier code: what we want to do is extract the text inside the <span></span> fields.
First, __init__ defines two attributes, flag and text; flag starts out as False:

    def __init__(self):
        HTMLParser.__init__(self)
        self.flag = False
        self.text = []
handle_starttag simply sets flag to True whenever the tag is span:

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.flag = True
handle_data extracts the data and saves it in the text list as long as flag is True:

    def handle_data(self, data):
        if self.flag == True:
            print data
            self.text.append(data)
So when does the extraction stop? That is handle_endtag's job: when it meets a span end tag it sets flag back to False, so no more data is extracted until the next span tag is encountered:

    def handle_endtag(self, tag):
        if tag == 'span':
            self.flag = False
That is the complete workflow of HTMLParser. Isn't it more concise and clear than the earlier BeautifulSoup and lxml examples? For badly formed web pages, HTMLParser copes better than BeautifulSoup and lxml. All of the handler functions are listed below:
handle_startendtag: handles tags that open and close in one, such as <br/>
handle_starttag: handles start tags, such as <xx>
handle_endtag: handles end tags, such as </xx>
handle_charref: handles character references, i.e. entities starting with &#, usually a character given by its code point
handle_entityref: handles named entities starting with &, such as &nbsp;
handle_data: handles data, i.e. the text between <xx> and </xx>
handle_comment: handles comments
handle_decl: handles declarations starting with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi: handles processing instructions such as <?instruction>
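A small sketch, again with an invented snippet, shows a few of the less common handlers firing:

from HTMLParser import HTMLParser

class EntityParser(HTMLParser):
    def handle_comment(self, data):
        print 'comment :', data
    def handle_entityref(self, name):
        print 'entity  :', name   # e.g. 'nbsp' for &nbsp;
    def handle_charref(self, name):
        print 'charref :', name   # e.g. '65' for &#65;

EntityParser().feed('<!-- a note -->&nbsp;&#65;')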
The other handlers are used in much the same way. Starting from the next chapter we will look at Scrapy.
