Requests package: a practical Python HTTP client library, commonly used when writing crawlers to fetch data from the web. It is simple and practical, with a minimal interface: requests.get(url).
lxml package: mainly used to parse the HTML fetched with requests and extract the data we need. lxml uses XPath syntax to locate and filter content in the HTML text.
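The requests.get(url) interface mentioned above can be wrapped in a small helper. This is a minimal sketch; the URL in the usage comment is a placeholder, and the timeout value and function name are choices made here, not part of the original notes.

```python
import requests  # third-party package: pip install requests


def fetch_html(url):
    """Download a page and return its HTML text (minimal sketch)."""
    response = requests.get(url, timeout=10)  # simple GET request
    response.raise_for_status()               # raise on HTTP error codes
    return response.text


# Usage (placeholder URL):
# html = fetch_html("https://example.com")
```

The returned HTML string is what gets handed to lxml for parsing in the next step.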
Using the lxml package:
lxml lets you extract the data you need from HTML code.
A web page is an HTML file.
lxml organizes the contents of an HTML file into a tree structure.
An HTML file is a tree structure, similar to the directory structure of a Linux system.
Once lxml has organized the HTML into a tree, XPath syntax is used to locate and filter its content.
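The two steps above (organize into a tree, then locate with XPath) can be sketched as follows. The HTML snippet is an illustrative stand-in for a downloaded page; it is not from the original notes.

```python
from lxml import etree  # third-party package: pip install lxml

# A small HTML snippet standing in for a downloaded page (illustrative only).
html = """
<html><body>
  <div class="link"><a href="https://example.com">Home</a></div>
</body></html>
"""

tree = etree.HTML(html)          # organize the HTML text into an element tree
print(tree.tag)                  # the root element of the tree: 'html'
links = tree.xpath('//a/@href')  # XPath locates and filters nodes in the tree
print(links)                     # ['https://example.com']
```

etree.HTML returns the root element of the parsed tree, and every xpath() call returns a list of matching nodes or values.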
XPath syntax usage:
Path expressions (XPath syntax expresses the path to a tag in an HTML/XML document):
//div selects all div tags in the document and returns a list of elements.
//div[@class="J-r-list-c-desc"]/h1/text() extracts the text data under a tag.
Appending /@href to a path extracts the value of the href attribute from the selected tags.
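The three path expressions above can be tried on a small sample. The HTML below is a hypothetical page structure that reuses the class name from the notes; the h1 text and href value are made up for illustration.

```python
from lxml import etree  # third-party package: pip install lxml

# Hypothetical page structure reusing the class name from the notes.
html = """
<html><body>
  <div class="J-r-list-c-desc"><h1>Article title</h1><a href="/detail/1">more</a></div>
</body></html>
"""
tree = etree.HTML(html)

divs = tree.xpath('//div')  # every div tag in the document, as a list
titles = tree.xpath('//div[@class="J-r-list-c-desc"]/h1/text()')  # text under the h1
hrefs = tree.xpath('//div[@class="J-r-list-c-desc"]/a/@href')     # href attribute values
print(len(divs), titles, hrefs)  # 1 ['Article title'] ['/detail/1']
```

Note that text() and @href return plain strings, while a path ending in a tag name returns element objects.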
Filter conditions (predicates):
//div[@class="link"] selects div tags whose class attribute has the value "link".
//div[li] selects all div tags that contain an li child tag.
//div[@class] selects div tags that have a class attribute.
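The three predicate forms above can be compared side by side. The HTML snippet below is made up so that each predicate matches exactly one of the three div tags.

```python
from lxml import etree  # third-party package: pip install lxml

# Illustrative snippet: each predicate below matches exactly one div.
html = """
<html><body>
  <div class="link"><a href="#">nav</a></div>
  <div><li>item</li></div>
  <div>plain</div>
</body></html>
"""
tree = etree.HTML(html)

by_value = tree.xpath('//div[@class="link"]')  # class attribute equals "link"
with_li = tree.xpath('//div[li]')              # divs with an li child tag
with_class = tree.xpath('//div[@class]')       # divs with any class attribute
print(len(by_value), len(with_li), len(with_class))
```

[@class="link"] tests an attribute value, [li] tests for a child element, and [@class] tests only that the attribute exists.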