Performance Comparison of Three Python Web Scraping Methods


Below we introduce three methods for scraping data from web pages: first regular expressions, then the popular Beautiful Soup module, and finally the powerful lxml module.

1. Regular Expressions

If you are not familiar with regular expressions, or need a refresher, see the Regular Expression HOWTO for a complete introduction.

When using regular expressions to scrape the country's area, we first try to match the contents of the <td> element, as follows:

>>> import re
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['  ', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '<a href="/continent/EU">EU</a>', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '<div><a href="/iso/IE">IE </a></div>']

As the results above show, the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area attribute, we can select just the second matching element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

Although this approach works now, it is likely to fail if the page changes. For example, the table might be updated so that the land area is no longer in the second matching cell. If we only need to scrape the data once, we can ignore such future changes, but if we want to scrape this data again later, we need a more robust solution that is as insulated from layout changes as possible. To make the regular expression more robust, we can also include the parent <tr> element, which should be unique because it has an id attribute.

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration looks better, but there are many other ways the web page could be updated that would still break the regular expression, such as double quotes becoming single quotes, extra whitespace being added to the <td> tag, or the area label changing. Here is an improved version that tries to support these possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
['244,820 square kilometres']

Although this regular expression adapts better to future changes, it is difficult to construct and hard to read. Even then, small layout changes, such as adding a title attribute to the <td> tag, would still break it.
As this example shows, regular expressions give us a shortcut for scraping data, but the approach is too brittle and easily breaks when the web page is updated. Fortunately, there are better solutions, introduced below.

2. Beautiful Soup

Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating their content. If you have not installed the module yet, you can install the latest version with the following command (pip is required; install it first if you do not have it):

pip install beautifulsoup4

The first step in using Beautiful Soup is to parse the downloaded HTML into a soup document. Because most web pages are not well-formed HTML, Beautiful Soup has to work out what their actual structure is. For example, in the simple page fragment below, the attribute value is missing its quotes and the <li> tags are not closed.

<ul class=country>
    <li>Area
    <li>Population
</ul>

If the Population list item were parsed as a child of the Area list item, rather than as two parallel list items, we would get the wrong result when scraping. Let's see how Beautiful Soup handles it:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>

As the output above shows, Beautiful Soup adds the missing attribute quotes and closes the open tags, although with this parser the second <li> ends up nested inside the first rather than parallel to it. We can now use the find() and find_all() methods to locate the elements we need.

>>> ul = soup.find('ul', attrs={'class': 'country'})
>>> ul.find('li')  # returns just the first match
<li>Area<li>Population</li></li>
>>> ul.find_all('li')  # returns all matches
[<li>Area<li>Population</li></li>, <li>Population</li>]

Note: different parsers (and different versions of Python's built-in parser) differ in their fault tolerance, so your results may not match exactly; see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for details. For a full list of methods and parameters, consult the official Beautiful Soup documentation.
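To make that note concrete, the short sketch below (assuming the optional lxml and html5lib parser packages are installed alongside beautifulsoup4) runs the same broken fragment through three different parsers so their repairs can be compared:

from bs4 import BeautifulSoup

broken_html = '<ul class=country><li>Area<li>Population</ul>'

# each parser applies its own fault-tolerance rules, so the repaired trees
# may differ (for example, whether the two <li> tags end up nested or parallel)
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(broken_html, parser)
    print parser
    print soup.prettify()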

The following is the complete code for extracting the example country's area data with this method.

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # locate the area row
>>> tr = soup.find(attrs={'id': 'places_area__row'})
>>> # locate the area tag
>>> td = tr.find(attrs={'class': 'w2p_fw'})
>>> area = td.text  # extract the text from this tag
>>> print area
244,820 square kilometres

Although this code is more verbose than the regular expression version, it is easier to construct and to understand. We also no longer need to worry about minor layout changes such as extra whitespace or new tag attributes.

3. Lxml

lxml is a Python wrapper around the libxml2 XML parsing library, which is written in C. It parses faster than Beautiful Soup, but installation can be more involved. The latest installation instructions are available at http://lxml.de/installation.html.
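On most current platforms a prebuilt binary package is available, so a plain pip install is usually sufficient (a convenience note rather than a guarantee for every system; see the link above if it fails):

pip install lxml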

As with Beautiful Soup, the first step in using lxml is to parse potentially invalid HTML into a consistent format. Here is an example of parsing the same incomplete HTML with this module:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

Similarly, lxml adds the missing attribute quotes and closes the open tags, and here the two list items come out parallel, although the module does not add the extra <html> and <body> tags.

After parsing the input, the next step is selecting elements. lxml offers several ways to do this, such as XPath selectors and a find() method similar to Beautiful Soup's. Later, however, we will use CSS selectors, because they are more concise and can be reused when parsing dynamic content; readers with jQuery experience will also find them familiar.
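For comparison, here is a minimal sketch of the XPath route, which this article does not use further; the expression mirrors the CSS selector introduced below and assumes the same example page:

import urllib2
import lxml.html

html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
tree = lxml.html.fromstring(html)
# XPath equivalent of the CSS selector tr#places_area__row > td.w2p_fw
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print td.text_content()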

The following sample code uses an lxml CSS selector to extract the area data:

>>> import urllib2
>>> import lxml.html
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> tree = lxml.html.fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]  # (*)
>>> area = td.text_content()
>>> print area
244,820 square kilometres

The line marked (*) first finds the table row element whose ID is places_area__row, and then selects the table data child tag whose class is w2p_fw.

A CSS selector is a pattern for selecting elements. Here are some common selector examples (a short demonstration follows the list):

Select all tags: *
Select <a> tags: a
Select all elements with class="link": .link
Select <a> tags with class="link": a.link
Select <a> tags with id="home": a#home
Select all <span> tags that are children of <a> tags: a > span
Select all <span> tags anywhere within <a> tags: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]
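As a quick, self-contained demonstration (the HTML fragment below is invented purely for this example, and the cssselect helper package that lxml uses must be installed), here is how a few of those patterns behave:

import lxml.html

# a made-up fragment that exercises several of the selectors listed above
doc = lxml.html.fromstring(
    '<div><a id="home" class="link" title="Home">home<span>icon</span></a>'
    '<a class="link" href="/about">about</a></div>')

print len(doc.cssselect('a'))              # 2: every <a> tag
print len(doc.cssselect('.link'))          # 2: every element with class="link"
print len(doc.cssselect('a#home'))         # 1: the <a> tag with id="home"
print len(doc.cssselect('a > span'))       # 1: <span> children of <a> tags
print len(doc.cssselect('a[title=Home]'))  # 1: <a> tags whose title is "Home"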

The W3C has published the CSS3 Selectors specification at https://www.w3.org/TR/2011/REC-css3-selectors-20110929/

lxml implements most CSS3 selector features; the ones it does not support are listed at https://cssselect.readthedocs.io/en/latest/.

Note: internally, lxml converts CSS selectors into equivalent XPath selectors.
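You can see this translation directly with the cssselect package that lxml relies on (a small sketch; the exact XPath string produced may vary between cssselect versions):

from cssselect import GenericTranslator

# translate the CSS selector used earlier into its XPath equivalent
print GenericTranslator().css_to_xpath('tr#places_area__row > td.w2p_fw')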

4. Performance comparison

In the following code, each scraper is executed 1000 times; each run checks that the scraped result is correct, and the total time taken is printed.

# -*- coding: utf-8 -*-
import csv
import time
import urllib2
import re
import timeit
from bs4 import BeautifulSoup
import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')


def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search(
            '<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field),
            html).groups()[0]
    return results


def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find(
            'tr', id='places_{}__row'.format(field)).find(
            'td', class_='w2p_fw').text
    return results


def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results


def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    num_iterations = 1000  # number of times to test each scraper
    for name, scraper in [('Regular expressions', regex_scraper),
                          ('Beautiful Soup', beautiful_soup_scraper),
                          ('Lxml', lxml_scraper)]:
        times[name] = []
        # record the start time of the scrape
        start = time.time()
        for i in range(num_iterations):
            if scraper == regex_scraper:
                # the regular expression module will cache results,
                # so purge the cache for meaningful timings
                re.purge()  # (*)
            result = scraper(html)
            # check the scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record the end time of the scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)

    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)


if __name__ == '__main__':
    main()


Notice that we call re.purge() on the line marked (*). By default, the re module caches compiled patterns, so for a fair comparison we need to clear that cache on every iteration of the regular expression scraper.
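To see roughly what that caching is worth, here is a small standalone timing sketch (numbers will vary by machine) comparing repeated searches with a warm cache against searches where the cache is purged every time:

import re
import timeit

pattern = '<td class="w2p_fw">(.*?)</td>'
text = '<td class="w2p_fw">244,820 square kilometres</td>'

# with the cache warm, the compiled pattern is reused on every call
cached = timeit.timeit(lambda: re.search(pattern, text), number=10000)

# purging before each call forces the pattern to be recompiled every time
purged = timeit.timeit(lambda: (re.purge(), re.search(pattern, text)), number=10000)

print 'cached: {:.3f}s  purged: {:.3f}s'.format(cached, purged)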

Running this script on my computer, Beautiful Soup was more than seven times slower than the other two approaches when scraping our example web page. Exact timings will differ between machines because of hardware differences, but the relative gap between the methods should be similar. This result is expected, because lxml and the regular expression module are written in C, whereas Beautiful Soup is pure Python. An interesting finding is that lxml performs almost as well as regular expressions, even though it must parse the input into its internal format before it can search for elements, which adds overhead. When scraping many fields from the same web page, that initial parsing cost is amortised and lxml becomes even more competitive, which makes lxml a very powerful module.

5. Summary

The advantages and disadvantages of the three web scraping methods are summarised below:

Scraping method       Performance   Ease of use   Installation difficulty
Regular expressions   Fast          Hard          Easy (built-in module)
Beautiful Soup        Slow          Easy          Easy (pure Python)
lxml                  Fast          Easy          Moderately hard



If the bottleneck in your crawler is downloading pages rather than extracting data, then using a slower method such as Beautiful Soup is not a problem. Regular expressions are useful for one-off extractions and avoid the overhead of parsing the whole page, so they can be the better fit when you only need a small amount of data and want to avoid extra dependencies. In general, though, lxml is the best choice for scraping data: it is both fast and feature-rich, while regular expressions and Beautiful Soup are only preferable in particular scenarios.
