Python: scraping HTML pages with the lxml and requests modules


Web sites are written in HTML, which means that each Web page is a structured document. Sometimes it is useful to extract data from a page while preserving its structure, but sites do not always provide their data in an easy-to-handle format such as CSV or JSON.

This is where web scraping comes in. Web scraping is the practice of using a computer program to collect Web page data and organize it into the format you need, while preserving its structure.

lxml and requests
lxml (http://lxml.de/) is an elegant extension library for quickly parsing XML and HTML documents, even when the tags you are dealing with are very messy. We will also use the requests (http://docs.python-requests.org/en/latest/#) module instead of the built-in urllib2 module, because it is faster and more readable. You can install both modules with the commands pip install lxml and pip install requests.

Let's start with the imports:

from lxml import html
import requests

Next we'll use requests.get to fetch the Web page containing our data, parse it using the html module, and save the result in tree:

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)

tree now holds the entire HTML document in an elegant tree structure, which we can traverse in two ways: with XPath or with CSS selectors. In this example we will focus on the former (a short CSS-selector sketch follows the XPath queries below).

XPath is a way of locating information in a structured document such as HTML or XML. For a nice introduction to XPath, see W3Schools.
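For a quick flavor of the syntax, here is a tiny self-contained illustration of our own (the sample markup below is hypothetical, not from the example page):

from lxml import html

# '//tag' matches elements anywhere in the document,
# '[@attr="value"]' filters by attribute, and '/text()' extracts text nodes.
doc = html.fromstring('<div><p class="a">hello</p><p>world</p></div>')
print doc.xpath('//p/text()')               # ['hello', 'world']
print doc.xpath('//p[@class="a"]/text()')   # ['hello']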

There are many tools for obtaining the XPath of an element, such as Firefox's Firebug or Chrome's inspector. If you use Chrome, you can right-click the element, choose 'Inspect element', highlight the code, right-click again, and choose 'Copy XPath'.

After a quick analysis, we see that the data in the page is stored in two elements: a div with the title 'buyer-name', and a span with the class 'item-price':

<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>

Knowing this, we can create the correct XPath queries and use lxml's xpath function, like this:

# This creates a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This creates a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
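For comparison, the same data could be pulled with CSS selectors. A minimal sketch of our own, assuming the cssselect package (which lxml's cssselect method depends on) is installed:

# CSS-selector equivalent of the XPath queries above:
buyers_css = [e.text for e in tree.cssselect('div[title="buyer-name"]')]
prices_css = [e.text for e in tree.cssselect('span.item-price')]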

Let's see what we got:

print 'Buyers: ', buyers
print 'Prices: ', prices

Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']

Congratulations! We've managed to scrape all the data we wanted from a Web page using lxml and requests, and we now have it stored in memory as two lists. We can do all sorts of cool things with it: analyze it with Python, or save it to a file and share it with the world.
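For instance, saving the two lists to a CSV file takes only a few lines. A minimal sketch, with the filename and column names being our own choices:

import csv

# Pair each buyer with its price and write them out
# (on Python 3, open the file with 'w' and newline='' instead of 'wb').
with open('sales.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['buyer', 'price'])
    for buyer, price in zip(buyers, prices):
        writer.writerow([buyer, price])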

Some cooler ideas to consider: modify the script to iterate through the remaining pages in the sample dataset, or rewrite the application to use threads for improved speed (see the sketch below).
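As a rough sketch of the threaded idea: a thread pool from the standard library can fetch several pages at once. Note that the URL pattern below merely extends the 001.html naming scheme and is an assumption about the sample dataset:

from multiprocessing.dummy import Pool  # a thread pool, despite the module name
import requests
from lxml import html

# Assumed URL pattern; only 001.html is confirmed by the example above.
urls = ['http://econpy.pythonanywhere.com/ex/%03d.html' % n for n in range(1, 6)]

def scrape(url):
    tree = html.fromstring(requests.get(url).text)
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    prices = tree.xpath('//span[@class="item-price"]/text()')
    return zip(buyers, prices)

pool = Pool(4)  # fetch up to four pages concurrently
results = pool.map(scrape, urls)
pool.close()
pool.join()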
