Web Scraping
Web sites are written in HTML, which means that each web page is a structured document. Sometimes it is useful to extract data from a page while preserving that structure, because web sites do not always provide their data in an easy-to-handle format such as CSV or JSON.
This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and collect the data you need in a useful format while preserving the structure of the data.
lxml and requests
lxml (http://lxml.de/) is an extensive library for parsing XML and HTML documents very quickly, even when the tags you are dealing with are very messy. We will also use the requests (http://docs.python-requests.org/en/latest/#) module instead of the built-in urllib2 module because it is faster and more readable. You can install both modules with pip install lxml and pip install requests.
Let's start with the imports:
from lxml import html
import requests
Next we'll use requests.get to retrieve the web page containing our data, parse it using the html module, and save the result to tree:
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)
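The same fetch can be made a little more defensive. The sketch below is not part of the original example: the timeout value and the call to raise_for_status are illustrative choices, and page.content (raw bytes) is passed to lxml so it can detect the document's own encoding.

import requests
from lxml import html

# Illustrative variant of the fetch above (not part of the original example).
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html', timeout=10)
page.raise_for_status()               # abort early on HTTP errors such as 404 or 500
tree = html.fromstring(page.content)  # raw bytes let lxml handle the encoding itself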
tree now holds the whole HTML file in a nice tree structure that we can go over in two different ways: XPath and CSS selectors. In this example, we will use the former.
XPath is a way of locating information in structured documents such as HTML or XML. A good introduction to XPath is available on W3Schools.
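As a quick illustration of the syntax (a toy example, not taken from the page we are scraping), the snippet below runs a few XPath queries against a small inline document:

from lxml import html

# Toy document used only to illustrate the XPath patterns applied later on.
doc = html.fromstring('<div><p title="x">one</p><span class="y">two</span></div>')

doc.xpath('//p')                        # every <p> element, anywhere in the tree
doc.xpath('//p[@title="x"]/text()')     # text of <p> elements whose title is "x"
doc.xpath('//span[@class="y"]/text()')  # text of <span> elements whose class is "y"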
There are also various tools for obtaining the XPath of an element, such as Firebug for Firefox or the Chrome inspector. If you use Chrome, you can right-click the element, choose 'Inspect element', highlight the code, right-click again, and choose 'Copy XPath'.
After a quick analysis, we see that the data in the page is stored in two elements: a div with the title 'buyer-name' and a span with the class 'item-price':
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
Knowing this, we can create the correct XPath query and use lxml's XPath function, like this:
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
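Since the two lists come back in document order, one simple follow-up (not part of the original example) is to pair each buyer with the matching price:

# Pair each buyer with the corresponding price (assumes both lists
# have the same length and order, which holds for this page).
for buyer, price in zip(buyers, prices):
    print(buyer, price)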
Let's see what we got:
print('Buyers: ', buyers)
print('Prices: ', prices)

Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']
Congratulations! We have managed to grab all the data we wanted from a web page using lxml and requests. We now have it stored in memory as two lists, and we can do all sorts of cool things with it: we can analyze it in Python, or we can save it to a file and share it with the world.
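As one way of carrying out the 'save it to a file' idea (a minimal sketch continuing from the two lists above; the filename purchases.csv is just an illustrative choice):

import csv

# Write the scraped lists to a CSV file so the data can be shared.
with open('purchases.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['buyer', 'price'])    # header row
    writer.writerows(zip(buyers, prices))  # one row per purchase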
Some more cool ideas to think about: modify the script to iterate through the remaining pages of this example dataset, or rewrite the application to use threads to speed it up.
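A rough sketch of the first idea follows. It assumes the sample pages are numbered 001.html, 002.html, and so on under the same /ex/ path, and it simply stops at the first URL that does not return a 200 response; the upper bound of 20 pages is an arbitrary safeguard, not a property of the dataset.

import requests
from lxml import html

all_buyers, all_prices = [], []
for n in range(1, 21):  # arbitrary upper bound; we stop earlier if a page is missing
    url = 'http://econpy.pythonanywhere.com/ex/%03d.html' % n
    page = requests.get(url)
    if page.status_code != 200:
        break  # assume a non-200 response means we have run out of pages
    tree = html.fromstring(page.content)
    all_buyers.extend(tree.xpath('//div[@title="buyer-name"]/text()'))
    all_prices.extend(tree.xpath('//span[@class="item-price"]/text()'))

print(len(all_buyers), 'buyers scraped in total')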