Performance comparison of three Python web scraping methods


Below we introduce three methods for scraping web data: first regular expressions, then the popular Beautiful Soup module, and finally the powerful lxml module.

1. Regular Expressions

If you are not familiar with regular expressions, or need a refresher, you can refer to the Regular Expression HOWTO for a complete introduction.

When using regular expressions to scrape the country area data, we first try to match the contents of the <td> element, as follows:

>>> import re
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['  ', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '<a href="/continent/EU">EU</a>', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '<div><a href="/iso/IE">IE </a></div>']

As can be seen from the results above, the <td class="w2p_fw"> tag is used for several country attributes. To isolate the area attribute, we can select the second matching element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

Although this solution works now, it is likely to fail if the page changes. For example, the table layout might change so that the area data is no longer in the second cell. If we only need the data today, we can ignore such future changes; but if we want to scrape this data again later, we need a more robust solution that is insulated from layout changes as much as possible. To make the regular expression more robust, we can also include the parent <tr> element, which should be unique because it has an id attribute.

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better, but there are many other ways the web page could be updated that would still break the regular expression: double quotes could become single quotes, extra space could be added inside the <td> tag, or the area label could change, and so on. Below is an improved version that tries to support these possibilities.

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
['244,820 square kilometres']

Although this regular expression copes better with future changes, it is hard to construct and quite unreadable. In addition, some minor layout changes would still break it, such as adding a title attribute to the <td> tag.
As this example shows, regular expressions provide a quick shortcut for scraping data, but the approach is too brittle and is prone to breaking whenever the web page is updated. Fortunately there are better solutions, which are introduced next.
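As a small illustration of how to contain that fragility, the sketch below wraps the area extraction in a helper that returns None instead of raising an exception when the pattern no longer matches. The function name scrape_area and the overall structure are our own additions for demonstration, not part of the original example.

>>> import re
>>> import urllib2
>>> AREA_RE = re.compile('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>')
>>> def scrape_area(html):
...     # return None when the layout has changed and the pattern no longer matches
...     match = AREA_RE.search(html)
...     return match.group(1) if match else None
...
>>> html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
>>> print scrape_area(html)
244,820 square kilometres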

2. Beautiful Soup

Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content. If you have not yet installed the module, you can install the latest version with the following command (pip itself must already be installed):

pip install beautifulsoup4

The first step in using Beautiful Soup is to parse the downloaded HTML into a soup document. Because most web pages are not well-formed HTML, Beautiful Soup has to make sense of their actual structure. For example, in the following simple page, there are problems with missing quotes around attribute values and unclosed tags.

<ul class=country>
    <li>area
    <li>population
</ul>

If the Population list item is parsed as a child of the Area list item, rather than as two parallel list items, we would get wrong results when scraping. Let's see how Beautiful Soup handles it.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>

As you can see from the result above, Beautiful Soup adds the missing attribute quotes and closes the tags. Now we can use the find() and find_all() methods to locate the elements we need.

>>> ul = soup.find('ul', attrs={'class': 'country'})
>>> ul.find('li')  # returns just the first match
<li>Area<li>Population</li></li>
>>> ul.find_all('li')  # returns all matches
[<li>Area<li>Population</li></li>, <li>Population</li>]

Note: different parsers, and different versions of Python's built-in parser, vary in how they recover from broken HTML, so the parsed result may differ from the above; see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for details. For a full list of methods and parameters, consult the official Beautiful Soup documentation.
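To see those parser differences for yourself, you can feed the same broken HTML to different parsers. The sketch below is only illustrative and assumes the optional lxml and html5lib packages have been installed alongside Beautiful Soup.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # each parser recovers from the broken markup in its own way, e.g. whether the
>>> # two <li> elements end up as siblings and whether <html>/<body> wrappers are added
>>> print BeautifulSoup(broken_html, 'html.parser').prettify()
>>> print BeautifulSoup(broken_html, 'lxml').prettify()
>>> print BeautifulSoup(broken_html, 'html5lib').prettify()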

The following is the complete code for extracting the example country's area data with this method.

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # locate the area row
>>> tr = soup.find(attrs={'id': 'places_area__row'})
>>> # locate the area tag
>>> td = tr.find(attrs={'class': 'w2p_fw'})
>>> area = td.text  # extract the text from this tag
>>> print area
244,820 square kilometres

Although this code is more verbose than the regular expression code, it is easier to construct and understand. Moreover, we no longer need to worry about small layout changes such as extra whitespace or additional tag attributes.
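To illustrate that robustness, here is a small sketch against a hypothetical variant of the area row with extra whitespace and an added title attribute. The snippet is made up for demonstration, but the same find() calls still locate the value.

>>> from bs4 import BeautifulSoup
>>> changed_html = ('<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
...                 '<td  class="w2p_fw" title="Area">244,820 square kilometres</td></tr>')
>>> soup = BeautifulSoup(changed_html, 'html.parser')
>>> soup.find(attrs={'id': 'places_area__row'}).find(attrs={'class': 'w2p_fw'}).text
u'244,820 square kilometres'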

3. Lxml

lxml is a Python package built on top of the libxml2 XML parsing library. Because it is written in C, it parses faster than Beautiful Soup, but installation can be more involved. The latest installation instructions are at http://lxml.de/installation.html.
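In practice, a binary wheel is available for most current platforms, so a plain pip install is usually enough; only if pip has to compile lxml from source do you need the installation page above.

pip install lxml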

As with Beautiful Soup, the first step in using lxml is to parse potentially malformed HTML into a consistent format. Here is an example of parsing the same broken HTML with this module:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

Similarly, lxml adds the missing attribute quotes and closes the tags, although it does not add the surrounding <html> and <body> tags.
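If you do want a complete document with those wrapper tags, lxml.html also provides document_fromstring(), which always builds a full <html>/<body> tree. A brief sketch:

>>> import lxml.html
>>> doc = lxml.html.document_fromstring(broken_html)
>>> doc.tag  # the root element is now <html>, with the <ul> placed inside <body>
'html'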

After parsing the input, the next step is selecting elements. lxml offers several ways to do this, such as XPath selectors and a find() method similar to Beautiful Soup's. Here, however, we will use CSS selectors, because they are more concise and can be reused later when parsing dynamic content. Readers with jQuery experience will also find them familiar.

The following is a sample code that uses the lxml CSS Selector to extract area data:

>>> import urllib2
>>> import lxml.html
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = urllib2.urlopen(url).read()
>>> tree = lxml.html.fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]  # (*)
>>> area = td.text_content()
>>> print area
244,820 square kilometres

The starred (*) line first finds the table row element with the ID places_area__row, and then selects its table data child element whose class is w2p_fw.
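For comparison, the same element can also be reached with lxml's XPath interface on the tree built above; this is just an illustrative alternative to the CSS selector, not a change to the example.

>>> tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()
'244,820 square kilometres'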

A CSS selector is a pattern used to select elements. Here are some common examples (a short demonstration follows the list):

Select all tags: *
Select the <a> tag: a
Select all elements with class="link": .link
Select the <a> tag with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags that are children of an <a> tag: a > span
Select all <span> tags nested anywhere within an <a> tag: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]
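Here is a short, self-contained demonstration of several of these selectors with lxml, using a made-up HTML fragment (the fragment and its values are purely illustrative):

>>> import lxml.html
>>> snippet = lxml.html.fromstring(
...     '<div><a id="home" class="link" title="Home" href="/">Home<span>icon</span></a></div>')
>>> [e.tag for e in snippet.cssselect('*')]      # all tags
['div', 'a', 'span']
>>> snippet.cssselect('a.link')[0].get('title')  # the <a> tag with class="link"
'Home'
>>> snippet.cssselect('a#home > span')[0].text   # <span> children of <a id="home">
'icon'
>>> snippet.cssselect('a[title=Home]')[0].text   # <a> tags whose title attribute is "Home"
'Home'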

The W3C's CSS Selectors Level 3 specification is available at https://www.w3.org/TR/2011/REC-css3-selectors-20110929/.

lxml implements most of the CSS3 selectors; the features it does not support are listed at https://cssselect.readthedocs.io/en/latest/.

Note: internally, lxml converts CSS selectors into equivalent XPath selectors.
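You can inspect this translation yourself via the cssselect package that lxml uses under the hood. This is only a sketch, and the exact XPath string printed may vary slightly between cssselect versions; the tree variable is the one built in the example above.

>>> from lxml.cssselect import CSSSelector
>>> sel = CSSSelector('tr#places_area__row > td.w2p_fw')
>>> sel(tree)[0].text_content()  # a compiled selector can be applied to a tree directly
'244,820 square kilometres'
>>> print sel.path  # prints the equivalent XPath expression, which tests @id and @class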

4. Performance comparison

In the following code, each scraper is run 1000 times; each run checks that the scraped result is correct, and the total time is printed.

# -*- coding: utf-8 -*-
import csv
import time
import urllib2
import re
from bs4 import BeautifulSoup
import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')


def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search(
            '<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field),
            html).groups()[0]
    return results


def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find(
            'tr', id='places_{}__row'.format(field)).find(
            'td', class_='w2p_fw').text
    return results


def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results


def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000  # number of times to test each scraper
    for name, scraper in (('Regular expressions', regex_scraper),
                          ('Beautiful Soup', beautiful_soup_scraper),
                          ('Lxml', lxml_scraper)):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module will cache results
                # so need to purge this cache for meaningful timings
                re.purge()  # (*)
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)

    # save the raw timings to a CSV file
    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)


if __name__ == '__main__':
    main()


Notice the call to re.purge() at the starred (*) line. By default the re module caches compiled regular expressions, so for a fair comparison we use this method to clear that cache on every iteration.
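An alternative way to obtain such timings is the standard timeit module, a common choice for micro-benchmarks. The sketch below is our own variation, not part of the original script, and it times only the regular expression scraper.

import re
import timeit
import urllib2

html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()

def regex_once():
    re.purge()  # clear the compiled-pattern cache each run, as in the script above
    return re.search('<td class="w2p_fw">(.*?)</td>', html).group(1)

# repeat the 1000-iteration measurement three times and keep the fastest total,
# which is less sensitive to other load on the machine than a single run
print min(timeit.repeat(regex_once, number=1000, repeat=3))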

The following are the results of running this script on my computer:


Due to differences in hardware, the results will vary from machine to machine, but the relative differences between the methods should still be considerable. As the results show, Beautiful Soup is more than seven times slower than the other two methods when scraping our example web page. This is expected, because lxml and the regular expression module are written in C, while Beautiful Soup is pure Python. An interesting observation is that lxml performs almost as well as regular expressions, even though lxml has the extra overhead of parsing the input into its internal format before it can search for elements. When scraping many features from the same web page, this initial parsing cost is amortized and lxml becomes even more competitive, which is why lxml is such a powerful module.
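The amortization point can be checked directly: parse the page once and reuse the tree for every query, instead of re-parsing it for every extraction. The following rough sketch is our own, reusing the same URL and selector as above, and simply makes the difference visible:

import time
import urllib2
import lxml.html

html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()

# re-parse the page for every extraction
start = time.time()
for i in range(1000):
    tree = lxml.html.fromstring(html)
    tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
print 'parse every time: {:.2f} seconds'.format(time.time() - start)

# parse once, then reuse the tree for every extraction
tree = lxml.html.fromstring(html)
start = time.time()
for i in range(1000):
    tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
print 'parse once: {:.2f} seconds'.format(time.time() - start)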

5. Summary

The advantages and disadvantages of the three web scraping methods:

Scraping method        Performance   Ease of use   Installation difficulty
Regular expressions    Fast          Difficult     Simple (built-in module)
Beautiful Soup         Slow          Simple        Simple (pure Python)
lxml                   Fast          Simple        Relatively difficult



If the bottleneck in your crawler is downloading web pages rather than extracting data, then using a slower method such as Beautiful Soup is not a problem. Regular expressions are useful for one-off extractions and avoid the overhead of parsing the entire web page, so they can be a better fit when you only need a small amount of data and want to avoid extra dependencies. Usually, though, lxml is the best choice for extracting data, because it is not only fast but also feature-rich, while regular expressions and Beautiful Soup are preferable only in certain scenarios.
