Difference between Python XPath and Regex
When crawling web page information, we often need to use Regex or XPath. The differences between the two:
Regex is itself a text-matching tool. Because it has to be matched multiple times, it suits short, concentrated information, which it can match and capture precisely. However, for large, scattered content such as HTML and other text, the efficiency will be low.
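As a small illustration of the "short and concentrated" case, here is a minimal sketch using the standard re module. The sample string and the pattern are made up for this example and are not from the original article.

```python
import re

# Hypothetical short piece of text; in a crawler this might be one table cell.
text = "Price: $29.95, discounted from $35.00"

# Capture every dollar amount with a single pattern.
# \$ matches a literal dollar sign, (\d+\.\d{2}) captures the number.
prices = re.findall(r"\$(\d+\.\d{2})", text)

print(prices)  # ['29.95', '35.00']
```

The same pattern run over a whole HTML page would have to scan every character and knows nothing about tags, which is where XPath tends to be the better fit.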
XPath is a language for locating elements in an XML document. It is commonly used for parsing XML and HTML files, and is often easier to use than CSS selectors.
The smallest units (node types) that make up an XML file:
- Element (element node)
- Attribute (attribute node)
- Text (text node)
- Namespace (namespace node)
- Processing-instruction (processing instruction node)
- Comment (comment node)
- Root (root node)
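The node types listed above can be seen in practice with lxml. This is only a sketch: the XML document, the ex prefix, the namespace URI and the my-app processing instruction are all invented for illustration.

```python
from lxml import etree

# Hypothetical XML document containing each kind of node.
xml = b"""<root xmlns:ex="http://example.com/ns">
  <?my-app mode="demo"?>
  <!-- a comment node -->
  <ex:item id="1">some text</ex:item>
</root>"""

tree = etree.fromstring(xml)
ns = {"ex": "http://example.com/ns"}

document_element = tree.xpath("/*")[0]                      # element directly under the root node
elements = tree.xpath("//ex:item", namespaces=ns)           # element nodes
attributes = tree.xpath("//ex:item/@id", namespaces=ns)     # attribute values
texts = tree.xpath("//ex:item/text()", namespaces=ns)       # text nodes
comments = tree.xpath("//comment()")                        # comment nodes
instructions = tree.xpath("//processing-instruction()")     # processing instructions

print(document_element.tag)      # root
print(attributes, texts)         # ['1'] ['some text']
print(comments[0].text)          # ' a comment node ' (with surrounding spaces)
print(instructions[0].target)    # my-app
print(tree.nsmap)                # namespace declarations visible on the root element
```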
XPath syntax:
// Locate the starting tag (matches anywhere in the document)
/ Search one level down
/text() Extract the text content
/@xxx Extract the content of attribute xxx
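The four pieces of syntax above can be tried on a small page. The HTML below is a made-up example, not a page from the original article.

```python
from lxml import etree

# Hypothetical HTML used only for illustration.
html_text = """
<html><body>
  <div class="post">
    <a href="/topic/1">First topic</a>
    <a href="/topic/2">Second topic</a>
  </div>
</body></html>
"""

html = etree.HTML(html_text)

# // finds the div anywhere in the document, / steps down to its child a elements.
links = html.xpath('//div[@class="post"]/a')
# /text() extracts the text content of each matched a element.
titles = html.xpath('//div[@class="post"]/a/text()')
# /@href extracts the value of the href attribute.
hrefs = html.xpath('//div[@class="post"]/a/@href')

print(len(links))  # 2
print(titles)      # ['First topic', 'Second topic']
print(hrefs)       # ['/topic/1', '/topic/2']
```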
Sample:
import requests
from lxml import etree

for i in range(1):
    url = "http://www.xxx.com/topic/tv/page/{}".format(i)
    req = requests.get(url).content
    html = etree.HTML(req)
    # extract
    text = html.xpath(
On the second day I was busy with things at home, so I took the chance to crawl the Douban Books Top 250.
1. Construct the URL list: urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)]
2. Modules: requests gets the web page source, lxml parses the page, XPath extracts
3. Extract the information
4. This could be encapsulated into a function; here it is called without encapsulation
Python code:
#coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding
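Step 1 above, building the list of page URLs, can be checked without any network access. A minimal sketch in Python 3; the variable names are my own.

```python
# Each Top 250 page shows 25 books, so start takes the values 0, 25, ..., 225.
base = "https://book.douban.com/top250?start={}"
urls = [base.format(i) for i in range(0, 226, 25)]

print(len(urls))   # 10
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225
```

Note that format() has to be called once per value inside the list comprehension; passing the whole generator to a single format() call would produce just one string.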
1. Basic XPath locating usage
1.1 Locate by id: driver.find_element_by_xpath('//input[@id="kw"]')
1.2 Locate by class: driver.find_element_by_xpath('//input[@class="s_ipt"]')
1.3 Of course, XPath can be combined with the other usual 8 locating methods (name, tag_name, link_text, partial_link_text); only 2 common ones are listed above.
2. XPath
1. Introduction to XPath
XPath, in full XML Path Language, is a language for determining the location of a part of an XML document. XPath is based on the XML tree structure and looks for nodes in the tree. Today it is common to use XPath to find and extract information in XML, and it also supports
Common statements:
1. starts-with(@attribute name, shared leading part of the attribute value). Use case: attribute values that start with the same characters
selector = etree.HTML(html)
content = selector.xpath('//div[starts-with(@id, "test")]/text()')
2. string(.). Use case: tags nested inside tags
selector = etree.HTML(html)
data = selector.xpath('/
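Both statements can be tried on a small fragment. The HTML and the id values below are assumptions made up for this sketch; note that the XPath function is spelled starts-with.

```python
from lxml import etree

# Hypothetical fragment: two ids share the prefix "test", one paragraph has nested tags.
html_text = """
<div id="test-1">first</div>
<div id="test-2">second</div>
<div id="other">third</div>
<p id="nested">outer <span>inner <b>deep</b></span> tail</p>
"""
selector = etree.HTML(html_text)

# starts-with: match every div whose id begins with "test".
content = selector.xpath('//div[starts-with(@id, "test")]/text()')
print(content)  # ['first', 'second']

# string(.): join all text inside the element, including text of nested tags.
paragraph = selector.xpath('//p[@id="nested"]')[0]
data = paragraph.xpath('string(.)')
print(data)  # outer inner deep tail
```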
This article explains how to use XPath in Scrapy to get the various values you want, using Douban as an example:
https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start=20type=T
You can verify that your XPath is correct with the XPath Helper plugin in Chrome. Here I want to get the href and the title of the a tag, using extract_first() in
from lxml import etree
import requests

url = 'https://movie.douban.com/chart'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
response = requests.get(url, headers=headers)
html_str = response.content.decode()
# print(html_str)
# use etree to process the data
html = etree.HTML(html_str)
# get the URL addresses of the movies
url_list = html.xpath("//div[@class='indent']/div
driver.find_element_by_xpath('//input[@id="kw"]')
I believe many people learning Selenium + Python are very familiar with the code above: it locates the search box on the Baidu home page. If we want to represent "kw" with a variable, how do we do it? As far as I know there are two ways. The code below locates the Baidu search box and clicks search; in the XPath locating step, it uses the
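The text is cut off before it names its two approaches, so the following is only a guess at what putting "kw" into a variable can look like. To keep the sketch runnable without a browser it uses lxml instead of Selenium on a made-up form; with Selenium, the string built in the first approach would be passed to the find call instead.

```python
from lxml import etree

# Hypothetical stand-in for the search form; only the id "kw" comes from the text.
page = etree.HTML('<form><input id="kw" name="wd"/><input id="su" type="submit"/></form>')
element_id = "kw"

# Approach 1 (assumption): build the XPath string with str.format.
xpath_expr = '//input[@id="{}"]'.format(element_id)
by_format = page.xpath(xpath_expr)

# Approach 2 (assumption, lxml only): pass the value as an XPath variable.
# lxml binds keyword arguments of xpath() to $names in the expression.
by_variable = page.xpath('//input[@id=$target]', target=element_id)

print(xpath_expr)                  # //input[@id="kw"]
print(by_format[0].get("name"))    # wd
print(by_variable[0].get("name"))  # wd
```

XPath variables avoid quoting problems when the value itself contains quotation marks, but they are an lxml feature and are not available through Selenium's locator strings.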
1. Inside the double quotation marks ("") of xpath(), you cannot use double quotation marks ("") again; change the inner double quotation marks ("") to single quotation marks ('') and the error goes away.
2. How can you find the locating point exactly when locating an element?
Make skillful use of F12: determine the page element to be located, and check whether the element's attribute value is unique in the page source (if there is an ID value that can be u
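Point 1 is really a Python string-quoting issue, and a short sketch makes it concrete. It uses lxml on a made-up input element rather than a live browser; the s_ipt class value is only sample data.

```python
from lxml import etree

# Hypothetical element standing in for a real page.
page = etree.HTML('<input id="kw" class="s_ipt"/>')

# Double quotes outside, single quotes inside: fine.
outer_double = page.xpath("//input[@id='kw']")

# Single quotes outside, double quotes inside: also fine.
outer_single = page.xpath('//input[@id="kw"]')

# Using the same quote character inside and outside, without escaping,
# ends the Python string early and is a SyntaxError before XPath even runs.

print(len(outer_double), len(outer_single))  # 1 1
```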
This is the content of the test.html file
Here's how XPath is used
# coding: utf-8
import lxml
import lxml.etree

html = lxml.etree.parse("test.html")
print(type(html))
res = html.xpath("//li")
print(res)
print(len(res))      # length of the list
print(type(res))     # list of elements
print(type(res[0]))  # an element of the tree
res1 = html.xpath("//li/@class")  # class attribute of each li
print(res1)
res2 = html.xpath("//li/@text")
print(res2)
res3 = ht
# With contains, find the div element on the page whose style attribute value contains the keyword sp.gif; the @ can be followed by any attribute name of the element.
self.driver.find_element_by_xpath('//div[contains(@style, "sp.gif")]').click()
# With starts-with, find a div element whose style attribute starts with position; the @ can be followed by any attribute name of the element.
self.driver.find_element_by_xpath('//div[starts-with(@style, "position")]')
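The difference between contains and starts-with can be checked offline. This sketch swaps Selenium for lxml and uses invented div elements, so only the XPath expressions carry over to the Selenium calls above.

```python
from lxml import etree

# Hypothetical page fragment with three styled div elements.
html_text = """
<div style="background: url(/img/sp.gif)">banner</div>
<div style="position: absolute; top: 0">overlay</div>
<div style="color: red">plain</div>
"""
page = etree.HTML(html_text)

# contains matches the keyword anywhere inside the attribute value.
with_gif = page.xpath('//div[contains(@style, "sp.gif")]/text()')

# starts-with only matches at the beginning, so the same keyword finds nothing here.
gif_at_start = page.xpath('//div[starts-with(@style, "sp.gif")]/text()')

print(with_gif)      # ['banner']
print(gif_at_start)  # []
```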
bs4 is not as good as this; the bs4 tree is too complex. lxml is good and locates elements very well. Detailed explanations are in the comments.
#!/usr/bin/python3.4
# -*- coding: utf-8 -*-

from lxml import etree
import urllib.request

# the HTML of the target URL can be viewed
url = "http://www.1kkk.com/manhua589/"
# fetch the URL
data = urllib.request.urlopen(url).read()
# decode
html = data.decode('UTF-8', 'ignore')

page = etree.
Using the urllib or urllib2 module that comes with Python to scrape web pages can feel a bit clumsy. Today let's look at something newer: a tutorial on using the lxml module and the Requests module to scrape HTML pages in Python:
Web scraping
Web sites are described in HTML, which means that each web page is a structured document. Some
:
Carson Busses $29.95
After knowing this, we can create the correct XPath query and use the lxml xpath function, as shown below:
# This will create a list of buyers:
buyers = tree.xpath('//p[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/te
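To see the two queries end to end without fetching a real page, here is a minimal sketch on an inline document. The markup is an assumption reconstructed from the XPath expressions; only "Carson Busses" and "$29.95" appear in the text, and the second buyer and price are invented.

```python
from lxml import html

# Hypothetical page source shaped to match the two XPath queries.
page_source = """
<html><body>
  <p title="buyer-name">Carson Busses</p>
  <span class="item-price">$29.95</span>
  <p title="buyer-name">Example Buyer</p>
  <span class="item-price">$8.37</span>
</body></html>
"""

tree = html.fromstring(page_source)

# This will create a list of buyers:
buyers = tree.xpath('//p[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

# Pair each buyer with the price at the same position.
for buyer, price in zip(buyers, prices):
    print(buyer, price)
```

With a live site, page_source would instead come from requests.get(...).content, as in the tutorial this passage belongs to.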