Basic syntax
Expression | Description
nodename | Selects all child nodes with that tag name; * matches any tag
/ | Selects from the root node; between steps it selects direct children only, not deeper descendants (such as grandchildren or great-grandchildren)
// | Selects matching nodes anywhere below the current node, regardless of their position, including all descendants
. | Selects the current node
.. | Selects the parent of the current node
@ | Selects attributes
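The expressions in the table can be tried out with a short sketch like the one below. It assumes lxml is installed; the sample markup and the variable names are made up for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical sample document, made up for illustration
HTML = """
<html><body>
  <div id="main">
    <p class="intro">first</p>
    <ul><li><p>nested</p></li></ul>
  </div>
</body></html>
"""

tree = etree.HTML(HTML)

# "/" walks direct children only, so the <p> nested inside <li> is skipped
direct = tree.xpath('/html/body/div/p')

# "//" matches at any depth
anywhere = tree.xpath('//p')

# "*" matches any tag among the direct children of the <div>
children = tree.xpath('//div/*')

# "." is the current node and ".." is its parent
intro = direct[0]
same = intro.xpath('.')[0]
parent = intro.xpath('..')[0]

# "@" selects attribute values
classes = tree.xpath('//p/@class')

print([p.text for p in direct])      # ['first']
print([p.text for p in anywhere])    # ['first', 'nested']
print([c.tag for c in children])     # ['p', 'ul']
print(parent.tag, parent.get('id'))  # div main
print(classes)                       # ['intro']
```

Calling xpath() on an element rather than on the whole tree makes "." and ".." relative to that element.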
@ Attributes
XPath is a way of querying nodes in the DOM tree.
Attributes are selected with the @ symbol. Take this element as an example:
<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>
Here rel, class, and href are attributes, and the element can be selected with "//*[@class='external text']".
The = operator requires the whole attribute value to match exactly. For a partial match, use the contains() function. For example,
"//*[contains(@class, 'external')]" matches the element, while
"//*[@class='external']" does not.
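The difference between = and contains() can be checked with a minimal sketch, assuming lxml is installed. It parses a copy of the anchor element with the class value written in lowercase; the variable names are only illustrative.

```python
from lxml import etree

# A copy of the anchor element from the text, class value in lowercase
SNIPPET = ('<a rel="nofollow" class="external text" '
           'href="http://google.ac">goole<wbr/>.ac</a>')

tree = etree.HTML(SNIPPET)

# "=" compares the whole attribute value
exact = tree.xpath("//*[@class='external text']")
partial_eq = tree.xpath("//*[@class='external']")

# contains() does a substring test on the attribute value
partial_contains = tree.xpath("//*[contains(@class, 'external')]")

# "@href" returns the attribute value itself
hrefs = tree.xpath("//a/@href")

print(len(exact), len(partial_eq), len(partial_contains))  # 1 0 1
print(hrefs)  # ['http://google.ac']
```

Because contains() is a plain substring test, contains(@class, 'ext') would match this element as well, so the search string should be specific enough to avoid accidental hits.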
Operators
The and and or operators:
Select elements whose tag is p, span, or h1:
soup = tree.xpath('//td[@class="editor bbsdetailcontainer"]//*[self::p or self::span or self::h1]')
Select td elements whose class is editor or tag:
soup = tree.xpath('//td[@class="editor" or @class="tag"]')
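Both queries can be exercised on a small document. The sketch below assumes lxml is installed; the table markup is invented for illustration, and the last query adds an and example that is not taken from the original text.

```python
from lxml import etree

# Hypothetical markup, made up for illustration
HTML = """
<table><tr>
  <td class="editor"><p>para</p><span>span</span><h1>head</h1><div>skip</div></td>
  <td class="tag">t</td>
  <td class="other">o</td>
</tr></table>
"""

tree = etree.HTML(HTML)

# "or" over the self:: axis: keep descendants whose tag is p, span, or h1
by_tag = tree.xpath('//td[@class="editor"]//*[self::p or self::span or self::h1]')

# "or" over attribute tests: td elements whose class is editor or tag
by_class = tree.xpath('//td[@class="editor" or @class="tag"]')

# "and" requires both conditions: class is editor and there is a <p> child
both = tree.xpath('//td[@class="editor" and p]')

print([e.tag for e in by_tag])             # ['p', 'span', 'h1']
print([e.get('class') for e in by_class])  # ['editor', 'tag']
print(len(both))                           # 1
```

XPath operator names are case-sensitive, so they must be written in lowercase (or, and, self::p).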
Demo
import lxml.html
from lxml import etree
from bs4 import BeautifulSoup

SEP = '--------------------------------------------'


def banner(title):
    """Print a section title between two separator lines."""
    print(SEP)
    print(title)
    print(SEP)


# Read the saved product page and decode it as UTF-8
with open('jd.com_2131674.html', 'rb') as f:
    content = f.read().decode('utf-8')

tree = etree.HTML(content)

banner('# different quote //*[@class="p-price j-p-2131674"]')
print(tree.xpath("//*[@class='p-price j-p-2131674']"))
print()

banner('# partial match //*[@class="j-p-2131674"]')
# "=" needs the whole attribute value, so an element whose class is
# "p-price j-p-2131674" is not matched here
print(tree.xpath("//*[@class='j-p-2131674']"))
print()

banner('# exactly match class string //*[@class="p-price j-p-2131674"]')
print(tree.xpath('//*[@class="p-price j-p-2131674"]'))
print()

banner("# use contains //*[contains(@class, 'j-p-2131674')]")
print(tree.xpath("//*[contains(@class, 'j-p-2131674')]"))
print()

banner("# specify tag name //strong[contains(@class, 'j-p-2131674')]")
print(tree.xpath("//strong[contains(@class, 'j-p-2131674')]"))
print()

# cssselect() needs the separate cssselect package
htree = lxml.html.fromstring(content)

banner("# CSS selector with tag cssselect('strong.j-p-2131674')")
print(htree.cssselect('strong.j-p-2131674'))
print()

banner("# CSS selector without tag, partial match cssselect('.j-p-2131674')")
elements = htree.cssselect('.j-p-2131674')
print(elements)
print()

banner('# attrib and text')
for element in tree.xpath("//strong[contains(@class, 'j-p-2131674')]"):
    print(element.text)
    print(element.attrib)
print()

banner('########## use BeautifulSoup ##############')
print('# loading content to BeautifulSoup')
soup = BeautifulSoup(content, 'html.parser')
print('# loaded, show result')
# BeautifulSoup treats class as a list of tokens, so one class name is enough
print(soup.find(attrs={'class': 'j-p-2131674'}).text)
XPath for Python crawlers