Scraping HTML pages in Python with the lxml and Requests modules

Scraping web pages with the urllib or urllib2 modules that ship with Python can feel a bit dated, so let's look at something newer today: a tutorial on scraping HTML pages in Python with the lxml and Requests modules.

Web sites are written in HTML, which means that each web page is a structured document. Sometimes it is useful to retrieve data from a page while preserving that structure, because websites do not always provide their data in an easy-to-process format such as CSV or JSON.

This is exactly where web scraping comes in. Web scraping is the practice of using a computer program to collect web page data and organize it into the required format while preserving its structure.

lxml and Requests
lxml (http://lxml.de/) is an elegant extension library for quickly parsing XML and HTML documents, even when the markup being processed is messy. We will also use the Requests module (http://docs.python-requests.org/en/latest/#) in place of the built-in urllib2, because it is faster and more readable. You can install both modules with the pip install lxml and pip install requests commands.
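From a shell, that looks like this:

pip install lxml
pip install requests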

Let's start with the following imports:

from lxml import html
import requests

Next we will use requests.get to retrieve the web page containing our data, parse it using the html module, and save the result in tree:

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)
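One caveat worth knowing: page.text is decoded using the encoding Requests guessed from the HTTP response headers. If a page declares its encoding in the markup itself, passing the raw bytes instead lets lxml honor that declaration:

# Alternative: let lxml detect the encoding from the document itself
tree = html.fromstring(page.content)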

tree now contains the entire HTML file in an elegant tree structure that we can traverse in two ways: with XPath or with CSS selectors. In this example we will use the former.
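For comparison, here is a minimal sketch of the CSS-selector route. It assumes the separate cssselect package is installed (pip install cssselect), which lxml's cssselect() method depends on:

# CSS-selector equivalent of the price query used later in this tutorial
prices = [el.text_content() for el in tree.cssselect('span.item-price')]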

XPath is a way of locating information in structured documents such as HTML or XML. For a good introduction to XPath, see W3Schools.
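A few illustrative expressions give the flavor of the syntax:

//p                                    selects every <p> element in the document
//div[@title="buyer-name"]             selects every <div> whose title attribute is "buyer-name"
//span[@class="item-price"]/text()     selects the text inside each matching <span>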

There are many tools for obtaining an element's XPath, such as FireBug for Firefox or the Chrome Inspector. In Chrome, you can right-click an element, choose 'Inspect element', highlight the code, then right-click again and choose 'Copy XPath'.

After a quick analysis, we can see that the data on the page is held in two kinds of element: a div with the title 'buyer-name', and a span with the class 'item-price':

<div title="buyer-name">Carson Busses</div>

<span class="item-price">$29.95</span>

Knowing this, we can create the correct XPath queries and use lxml's xpath function, as shown below:

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

Let's see what we get:

print 'Buyers: ', buyers
print 'Prices: ', prices

Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']

Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests, and we now have it stored in memory as two lists. We can do all sorts of interesting things with it: analyze it in Python, or save it to a file and share it with the world.
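Here is a minimal sketch of both ideas; the average computed and the filename results.csv are only illustrations:

import csv

# "Analyze it": strip the dollar signs and compute a simple average.
amounts = [float(p.lstrip('$')) for p in prices]
print 'Average price: $%.2f' % (sum(amounts) / len(amounts))

# "Save it as a file": write the buyer/price pairs out as CSV.
with open('results.csv', 'wb') as f:  # 'wb' is what the csv module expects on Python 2
    writer = csv.writer(f)
    writer.writerow(['buyer', 'price'])
    writer.writerows(zip(buyers, prices))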

Some ideas to take this further: modify this script to walk through the remaining pages in this example's data set, or rewrite the application to use multiple threads and speed things up. A sketch of the first idea follows.
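This sketch assumes the example data set simply continues at 002.html, 003.html, and so on; adjust the page count to whatever the data set actually contains:

import requests
from lxml import html

all_buyers, all_prices = [], []
for n in range(1, 6):  # assumes pages 001.html through 005.html; adjust as needed
    url = 'http://econpy.pythonanywhere.com/ex/%03d.html' % n
    page = requests.get(url)
    tree = html.fromstring(page.text)
    all_buyers.extend(tree.xpath('//div[@title="buyer-name"]/text()'))
    all_prices.extend(tree.xpath('//span[@class="item-price"]/text()'))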

