Using the urllib or urllib2 module that ships with Python to fetch web pages can feel a bit tedious. Today let's look at something new: a tutorial on using the lxml module and the Requests module to scrape HTML pages in Python.
Web scraping
Websites are written in HTML, which means that each web page is a structured document. Sometimes it is useful to retrieve data from a page while preserving that structure, but sites do not always provide their data in an easy-to-process format such as CSV or JSON.
This is exactly where web scraping comes in. Web scraping is the practice of using a computer program to collect web page data and organize it into the required format while preserving its structure.
Lxml and Requests
Lxml (http://lxml.de/) is a handy extension library for quickly parsing XML and HTML documents, even when the markup it has to process is messy. We will also replace the built-in urllib2 module with the Requests (http://docs.python-requests.org/en/latest/#) module, because it is faster and more readable. You can install both modules with the pip install lxml and pip install requests commands.
Let's start with the following imports:
from lxml import html
import requests
Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the result in tree:
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)
tree now holds the entire HTML document in an elegant tree structure, which we can access in two ways: with XPath or with CSS selectors. In this example, we will use the former.
XPath is a way of locating information in structured documents such as HTML or XML. For a good introduction to XPath, see W3Schools.
There are many tools for obtaining an element's XPath, such as Firebug for Firefox or the Chrome Inspector. If you use Chrome, you can right-click the element, choose 'Inspect element', highlight the code, right-click it again, and choose 'Copy XPath'.
After a quick analysis, we can see that the data on the page is stored in two kinds of elements: p elements with the title "buyer-name" and span elements with the class "item-price":
<p title="buyer-name">Carson Busses</p>
<span class="item-price">$29.95</span>
Knowing this, we can create the correct XPath queries and use lxml's xpath function, as shown below:
# This will create a list of buyers:
buyers = tree.xpath('//p[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
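As an aside, the same two lookups can also be written with CSS selectors, the other access method mentioned above. This is only a sketch: it assumes the separate cssselect package is installed (pip install cssselect), which lxml relies on for its cssselect() method.
# Optional alternative: the same lookups with CSS selectors instead of XPath.
# Requires the cssselect package (pip install cssselect).
buyers = [el.text_content() for el in tree.cssselect('p[title="buyer-name"]')]
prices = [el.text_content() for el in tree.cssselect('span.item-price')]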
Let's see what we get:
print 'Buyers: ', buyers
print 'Prices: ', prices

Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
'$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
'$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
'$15.00', '$114.07', '$10.09']
Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests, and we now have it stored in memory as two lists. From here we can do all sorts of interesting things with it: we can analyze it with Python, or we can save it to a file and share it with the world.
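For example, here is a minimal sketch of that last idea: pairing each buyer with a price using zip and writing the rows to a CSV file. The buyers.csv filename is just an example.
import csv

# Pair each buyer with the matching price and write the rows to a CSV file.
# 'buyers.csv' is an arbitrary example filename.
with open('buyers.csv', 'wb') as f:  # 'wb' because the csv module expects binary mode on Python 2
    writer = csv.writer(f)
    writer.writerow(['buyer', 'price'])
    for buyer, price in zip(buyers, prices):
        writer.writerow([buyer, price])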
Some ideas to take this further: modify this script to walk through the remaining pages in this example's data set, or rewrite the application to use multiple threads to speed things up. A rough sketch of the first idea follows.
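This sketch assumes the other pages of the data set follow the same numbering pattern as the page used above (002.html, 003.html, and so on); the URL pattern and the count of 20 pages are assumptions, so adjust them to whatever the site actually provides.
all_buyers, all_prices = [], []

# Assumed URL pattern and page count; adjust to the site's real layout.
for i in range(1, 21):
    url = 'http://econpy.pythonanywhere.com/ex/%03d.html' % i
    page = requests.get(url)
    tree = html.fromstring(page.text)
    all_buyers.extend(tree.xpath('//p[@title="buyer-name"]/text()'))
    all_prices.extend(tree.xpath('//span[@class="item-price"]/text()'))

print 'Collected %d buyers and %d prices' % (len(all_buyers), len(all_prices))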