Analysis of two common web page parsing tools for crawlers: lxml/XPath and bs4/BeautifulSoup

Source: Internet
Author: User
Tags xpath

Readers may wonder about the title. If I had written only lxml and bs4, the two Python module names, it might not have caught anyone's attention. When people talk about web page parsing, the keywords they mention are usually BeautifulSoup and XPath, while the modules behind them (Python calls them modules; other platforms more often call them libraries) are rarely discussed in concrete terms. Below I compare XPath and BeautifulSoup from several angles, such as efficiency and complexity of use.

Efficiency

In terms of efficiency, XPath is indeed much faster than BeautifulSoup. When stepping through the code in a debugger, creating the soup object has a noticeable delay, whereas lxml.etree.HTML(html) has built an object that XPath can operate on the instant you step over it. The underlying reason is that bs4 is written in Python while lxml is implemented in C on top of libxml2. BeautifulSoup builds a tree of Python objects for the entire document, so its time and memory overhead is much larger; lxml keeps its tree in C structures and evaluates XPath expressions in C as well.

Complexity of use

In terms of ease of use, BeautifulSoup's find method is simpler than XPath. XPath not only requires knowing the XPath syntax; for element and attribute queries its xpath method also always returns a list, which is awkward for elements that are unique on the page, such as a tag looked up by id. Below I implement a method that gets a page's navigation bar in both ways (the commented-out lines are the bs4 implementation):
    def get_nav(self, response):
        # soup = BeautifulSoup(response.body_as_unicode(), 'lxml')
        # nav_list = soup.find('ul', id='nav').find_all('li')
        model = etree.HTML(response.body_as_unicode())
        nav_list = model.xpath('//ul[@id="nav"]/li')
        for nav in nav_list[1:]:
            # href = nav.find('a').get('href')
            href = nav.xpath('./a/@href')[0]
            yield Request(href, callback=self.get_url)

As you can see, apart from its special syntax, which looks a little awkward (much like regular expressions), XPath is impressive in terms of code brevity. The one catch is that every xpath call here returns a list, so when the target is a single element you have to mechanically append [0], which may bother the perfectionists among us. By contrast, BeautifulSoup's long chains of find and find_all calls look rather clumsy when the search path is convoluted, for example:

    # href = article.find('div', class_='txt').find('p', class_='tit blue').find('em').find('a').get('href')
    href = article.xpath('./div[@class="txt"]//p[@class="tit blue"]/span/em/a/@href')[0]

In a case like this the BeautifulSoup version looks rather ugly. Of course, location paths this long are rare in practice.
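To make the comparison concrete, here is a minimal, self-contained sketch that runs both styles against a made-up HTML fragment. The markup, the paths in it and the helper name xpath_first are assumptions for illustration only; they do not come from the spider above. The helper is just one possible way to hide the [0] subscript.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup, invented for this example.
HTML = """
<ul id="nav">
  <li><a href="/home">Home</a></li>
  <li><a href="/cars">Cars</a></li>
  <li><a href="/news">News</a></li>
</ul>
"""


def xpath_first(node, expr, default=None):
    """Return the first result of a node-set XPath query, or `default` if it is empty.

    Hypothetical helper; meant for expressions that select elements or attributes.
    """
    found = node.xpath(expr)
    return found[0] if found else default


# BeautifulSoup: find() returns one Tag (or None), find_all() a list-like ResultSet.
soup = BeautifulSoup(HTML, 'lxml')
bs4_hrefs = [li.find('a').get('href')
             for li in soup.find('ul', id='nav').find_all('li')[1:]]

# lxml: xpath() returns a list even when a single match is expected.
model = etree.HTML(HTML)
xpath_hrefs = [xpath_first(li, './a/@href')
               for li in model.xpath('//ul[@id="nav"]/li')[1:]]

print(bs4_hrefs)
print(xpath_hrefs)
```

Both lists skip the first li, mirroring the nav_list[1:] slice in get_nav. Returning a default instead of raising IndexError is a design choice; whether a missing element should be silent depends on the page being scraped.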

Functional Defect Summary--BeautifulSoup

A weak point of BeautifulSoup is matching elements inside nested lists. Here is an example (to see the actual page structure, open index_page in a browser and inspect it):

    from bs4 import BeautifulSoup
    from lxml import etree  # needed by the commented-out XPath variant
    from scrapy import Request, Spider


    class RankSpider(Spider):
        name = 'Pcauto_rank'
        index_page = 'http://price.pcauto.com.cn/top/hot/s1-t1.html'
        api_url = 'http://price.pcauto.com.cn%s'

        def start_requests(self):
            yield Request(self.index_page, callback=self.get_left_nav)

        # Test: BeautifulSoup with two chained find_all calls
        def get_left_nav(self, response):
            # model = etree.HTML(response.body_as_unicode())
            # nav_list = model.xpath('//div[@id="LeftNav"]/ul[@class="pb200"]/li//a[@class="dd"]')
            soup = BeautifulSoup(response.body_as_unicode(), 'lxml')
            # The second find_all is called on a ResultSet and raises AttributeError
            nav_list = soup.find('div', id='LeftNav').find('ul', class_='pb200').find_all('li').find_all('a', class_='dd')
            for sub_nav in nav_list:
                # Written for the XPath variant; with bs4 tags this would be sub_nav.get('href')
                href = self.api_url % sub_nav.xpath('./@href')[0]
                yield Request(href, callback=self.get_url)

        def get_url(self, response):
            pass

The XPath version in the comments works without problems and locates the targets precisely. Implementing the same logic with BeautifulSoup, however, would require two find_all calls in a row, which is clearly invalid: at run time it raises AttributeError: 'ResultSet' object has no attribute 'find_all'. To get this match we have to loop over each li and call find_all on it to find the a tags beneath it, which is cumbersome. In this scenario XPath saves a lot of trouble.
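The failure and the loop-based workaround can be reproduced outside the spider with a small sketch. The HTML fragment below is invented for illustration and only loosely follows the structure described above; it is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the nested list described above.
HTML = """
<div id="LeftNav">
  <ul class="pb200">
    <li><a class="dd" href="/a1">A1</a><a class="other" href="/x">X</a></li>
    <li><a class="dd" href="/a2">A2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, 'lxml')
items = soup.find('div', id='LeftNav').find('ul', class_='pb200').find_all('li')

# Chaining a second find_all() onto the ResultSet fails with AttributeError.
try:
    items.find_all('a', class_='dd')
    chained_ok = True
except AttributeError:
    chained_ok = False

# Workaround: visit each <li> and search inside it.
hrefs = [a.get('href') for li in items for a in li.find_all('a', class_='dd')]

print(chained_ok)
print(hrefs)
```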

Of course, I took the target URLs at this level only to illustrate the problem. What I actually use in development is:

        # nav_list = model.xpath('//div[@id="LeftNav"]//a[@class="dd"]')
        nav_list = soup.find('div', id='LeftNav').find_all('a', class_='dd')

But if the goal is not the a tags under every li but only those under li elements with a particular class (class="*"), then with find and find_all alone an explicit loop is unavoidable, and XPath is the way to do it in a single expression. Of course, if you like writing the traversal yourself and find the logic clearer that way, you can skip this section.
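As a sketch of that case, the snippet below filters the li elements by class and descends to the links in one XPath expression. The markup and the class names hot and cold are hypothetical stand-ins for whatever class the real page uses.

```python
from lxml import etree

# Hypothetical markup: only the links under <li class="hot"> are wanted.
HTML = """
<div id="LeftNav">
  <ul class="pb200">
    <li class="hot"><a class="dd" href="/h1">H1</a></li>
    <li class="cold"><a class="dd" href="/c1">C1</a></li>
    <li class="hot"><a class="dd" href="/h2">H2</a></li>
  </ul>
</div>
"""

model = etree.HTML(HTML)

# One expression: filter the <li> by class, then descend to the matching links.
hot_hrefs = model.xpath('//div[@id="LeftNav"]//li[@class="hot"]//a[@class="dd"]/@href')

print(hot_hrefs)
```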

Functional Defect Summary--XPath

XPath has a weak spot when selecting elements by a class name that several elements share; it barely counts as a functional flaw. For example:

    model = etree.HTML(response.body_as_unicode())
    model.xpath('//div[@class="box box-2 box-4"]')

cannot locate the two divs whose class attributes in the HTML are box box-2 box-4 mt25 and box box-2 box-4 mt17. You have to write:

    model.xpath('//div[@class="box box-2 box-4 mt25"]')
    model.xpath('//div[@class="box box-2 box-4 mt17"]')

Only the full strings match the targets, probably because XPath treats @class as a plain string and compares it exactly, so that even one extra space makes a difference:

One a tag on the page is written like this: <a href="/top/hot/s1-t1.html" class="dd">5</a>. To select it with XPath I wrote:

    model.xpath('//a[@class="dd"]')

No matter what I tried, this would not match (I was thoroughly confused at the time). It only worked once I appended a space after dd in the expression, which suggests that the attribute value in the page source ends with a space. Yet in the browser's JS console the class selector a.dd finds the target, and BeautifulSoup's find_all('a', class_='dd') has no problem either. In this scenario XPath is somewhat inflexible.
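A common XPath 1.0 workaround, not something the spider above uses, is to compare class tokens instead of the whole attribute string by combining contains, concat and normalize-space. The sketch below assumes a made-up fragment that reproduces both situations, including a class value with a trailing space; has_class is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical markup reproducing the two situations described above.
HTML = """
<div class="box box-2 box-4 mt25">first</div>
<div class="box box-2 box-4 mt17">second</div>
<a href="/top/hot/s1-t1.html" class="dd ">5</a>
"""


def has_class(name):
    """Build an XPath predicate matching `name` as one whitespace-separated class token."""
    return 'contains(concat(" ", normalize-space(@class), " "), " %s ")' % name


model = etree.HTML(HTML)

# Exact string comparison: the shared prefix alone matches nothing.
exact_div = model.xpath('//div[@class="box box-2 box-4"]')
# Token comparison: both divs carry the box-4 token.
token_div = model.xpath('//div[%s]' % has_class('box-4'))

# Exact comparison misses because the attribute value is "dd " with a trailing space.
exact_a = model.xpath('//a[@class="dd"]')
# normalize-space() trims the value, so the token comparison still matches.
token_a = model.xpath('//a[%s]' % has_class('dd'))

print(len(exact_div), len(token_div), len(exact_a), len(token_a))
```

Padding both sides with spaces keeps a token such as dd from matching inside a longer class name like add, which a bare contains(@class, "dd") would do.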

Text acquisition

  

        # place = soup.find('div', class_="guide")
        place = model.xpath('//div[@class="guide"]')
        # nav and article
        if place:
            # mark = place.find('span', class_="mark")
            mark = place[0].xpath('./span[@class="mark"]')
            if mark:
                # text = mark.get_text().strip().replace('\n', '').replace('\r', '')
                # text = mark[0].text.strip().replace('\n', '').replace('\r', '')  # false
                text = mark[0].xpath('string()')
                result['Address'] = text
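One likely reason the .text variant is marked as not working: in lxml, an element's .text holds only the text before its first child element, while string() concatenates the text of the node and all its descendants. The sketch below shows the difference on an invented breadcrumb fragment; the markup is an assumption, not the real page.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical breadcrumb markup with text split around a child element.
HTML = ('<div class="guide"><span class="mark">'
        'Home &gt; <a href="/cars">Cars</a> &gt; SUV'
        '</span></div>')

model = etree.HTML(HTML)
mark = model.xpath('//div[@class="guide"]/span[@class="mark"]')[0]

only_leading = mark.text                    # text before the first child element only
full_xpath = mark.xpath('string()')         # text of the node and all its descendants
full_itertext = ''.join(mark.itertext())    # same result through the ElementTree API

# BeautifulSoup's get_text() also walks all descendants.
soup = BeautifulSoup(HTML, 'lxml')
full_bs4 = soup.find('span', class_='mark').get_text()

print(repr(only_leading))
print(repr(full_xpath))
```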
