Regular expression, BS4, XPath and JsonPath matching rules


Regular expression matching:

Rules
Single characters:
. : any character except a line break
[] : e.g. [aoe], [a-w], matches any one of the characters in the set
\d : digit [0-9]
\D : non-digit
\w : digit, letter, underscore, or Chinese character
\W : anything not matched by \w
\s : any whitespace character
\S : non-whitespace
Quantity modifiers:
* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times
{m,n} : m to n times
Boundaries:
\b \B (word boundary / non-word-boundary)
$ : ends with xxx
^ : starts with xxx
Groups:
(ab){3}
(){4} treats the parenthesized group as a whole
() creates a sub-pattern / capture group; \1 \2 refer back to the captured groups
Non-greedy (lazy) matching:
.*? .+? (a trailing ? turns off greedy matching)
re.I: ignore case
re.M: multi-line matching (^ and $ match at every line)
re.S: make . also match newlines (DOTALL)

match \ search \ findall
re.sub(regular expression, replacement, original string)
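
A minimal sketch of these functions; the sample string and patterns below are made up for illustration:

import re

text = "Order 66 was issued in 19 BBY; contact admin_01"

# re.match only matches at the very beginning of the string
print(re.match(r"\w+", text).group())        # 'Order'

# re.search finds the first occurrence anywhere in the string
print(re.search(r"\d+", text).group())       # '66'

# re.findall returns every non-overlapping match as a list
print(re.findall(r"\d+", text))              # ['66', '19', '01']

# re.sub(regular expression, replacement, original string)
print(re.sub(r"\d+", "#", text))             # 'Order # was issued in # BBY; contact admin_#'

# flags: re.I ignores case, re.M makes ^/$ match every line, re.S lets . match newlines
print(re.findall(r"^order \d+", text, re.I)) # ['Order 66']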

BS4 matching Rules


Installation: pip install bs4
BS4 also needs a third-party parser; install it with pip install lxml

Basic usage:
Note: BS4 selectors work much like jQuery selectors
from bs4 import BeautifulSoup
Usage: convert an HTML document into a BeautifulSoup object, then find the desired content through the object's methods and properties
(1) Parse a local file:
soup = BeautifulSoup(open('local file'), 'lxml')
(2) Parse a network response:
soup = BeautifulSoup('string or bytes content', 'lxml')
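
A minimal sketch of both constructor forms (the file name and the HTML string are made up):

from bs4 import BeautifulSoup

# (1) parse a local file; 'page.html' is only an illustrative name
# soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'lxml')

# (2) parse a string (or bytes) fetched from the network
html = '<html><body><a href="http://www.example.com">Example</a></body></html>'
soup = BeautifulSoup(html, 'lxml')
print(soup.a)    # <a href="http://www.example.com">Example</a>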
(1) Search by tag name
soup.a finds only the first tag that matches
(2) Get attributes
soup.a.attrs gets all the attributes and values and returns a dictionary
soup.a.attrs['href'] gets the href attribute
soup.a['href'] can also be written in this abbreviated form
(3) Get content
soup.a.string
soup.a.text
soup.a.get_text()
Note: if the tag contains another tag, .string returns None, while the other two still return the text content
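
A short sketch of the three ways to read content, using a made-up snippet in which the <a> tag contains a nested <span>:

from bs4 import BeautifulSoup

html = '<div><a href="http://example.com" title="home"><span>Example</span> site</a></div>'
soup = BeautifulSoup(html, 'lxml')

# (1) search by tag name: only the first matching tag is returned
print(soup.a)

# (2) get attributes
print(soup.a.attrs)             # {'href': 'http://example.com', 'title': 'home'}
print(soup.a.attrs['href'])     # 'http://example.com'
print(soup.a['href'])           # abbreviated form

# (3) get content: .string is None because <a> contains a nested tag,
# while .text and .get_text() still return the combined text
print(soup.a.string)            # None
print(soup.a.text)              # 'Example site'
print(soup.a.get_text())        # 'Example site'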
(4) find
soup.find('a') finds the first a tag that matches
soup.find('a', title="xxx")
soup.find('a', alt="xxx")
soup.find('a', class_="xxx")
soup.find('a', id="xxx")

Note: find can be called not only on soup; an ordinary tag object (for example a div) can also call it, and it will then search for matching nodes inside that tag
find returns the first tag that meets the requirements.
(5) find_all
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all('a', limit=2) limits the result to the first two matches
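
A sketch of find and find_all against a made-up snippet (the ids, classes and titles are illustrative only):

from bs4 import BeautifulSoup

html = '''
<div id="menu">
  <a class="nav" id="first" title="t1">one</a>
  <a class="nav" title="t2">two</a>
  <b>bold</b>
  <a class="other">three</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.find('a'))                  # first <a> only
print(soup.find('a', class_='nav'))    # first <a class="nav">
print(soup.find('a', id='first'))
print(soup.find('a', title='t2'))

# find can also be called on an ordinary tag, searching only inside it
div = soup.find('div', id='menu')
print(div.find('a'))

print(soup.find_all('a'))              # every <a> tag
print(soup.find_all(['a', 'b']))       # every <a> and <b> tag
print(soup.find_all('a', limit=2))     # only the first two <a> tags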
(6) select
Selects the specified content according to a CSS selector
Common selectors: tag selector, class selector, id selector, combination selector, descendant selector, child (hierarchy) selector, pseudo-class selector, attribute selector
a                              # tag selector
.dudu                          # class selector
#lala                          # id selector
a, .dudu, #lala, .meme         # combination selector
div .dudu #lala .meme .xixi    # descendant selector
div > p > a > .lala            # child (hierarchy) selector: each part must be a direct child of the previous one
input[name='lala']             # attribute selector

select always returns a list; extract the desired object by index, then read its attributes and nodes
This method can also be called on an ordinary tag object to find all matching nodes under that object
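
A sketch of the selector forms listed above, run against a made-up snippet:

from bs4 import BeautifulSoup

html = '''
<div class="dudu">
  <p><a class="lala" href="/x">link</a></p>
  <input name="lala" value="v">
</div>
<span id="lala">span text</span>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.select('a'))                  # tag selector
print(soup.select('.dudu'))              # class selector
print(soup.select('#lala'))              # id selector
print(soup.select('a, .dudu, #lala'))    # combination selector
print(soup.select('div.dudu a.lala'))    # descendant selector
print(soup.select('div > p > a'))        # child (hierarchy) selector
print(soup.select("input[name='lala']")) # attribute selector

# select always returns a list: index into it, then read attributes and text
first = soup.select('a.lala')[0]
print(first['href'], first.text)         # /x link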

XPath matching principle

Installing the XPath plugin
Drag the XPath plugin onto Google Chrome's extensions page to install it
Starting and stopping the plugin:
Ctrl + Shift + X
Attribute positioning
//input[@id="kw"]
//input[@class="bg s_btn"]
Level positioning
Index positioning
//div[@id="head"]/div/div[2]/a[@class="toindex"]
Note: indexes start from 1
//div[@id="head"]//a[@class="toindex"]
Note: the double slash matches all a nodes below, regardless of their position
Logical operations
//input[@class="s_ipt" and @name="wd"]
Fuzzy matching
contains
//input[contains(@class, "s_i")]
All input nodes that have a class attribute containing "s_i"
//input[contains(text(), "love")]
starts-with
//input[starts-with(@class, "s")]
All input nodes that have a class attribute starting with "s"
Fetching text
//div[@id="u1"]/a[5]/text() gets the text content of that node
//div[@id="u1"]//text() gets all the text content under the node, without tags

string(.) stitches all the text under a node together and returns it as a single string:
ret = tree.xpath('//div[@class="song"]')
string = ret[0].xpath('string(.)')
print(string.replace('\n', '').replace('\t', ''))
Fetching an attribute
//div[@id="u1"]/a[5]/@href

Using XPath in code
from lxml import etree
Usage: convert an HTML document into an etree object, then call the object's xpath() method to find the specified nodes. There are two ways to build the object:
(1) Local file
tree = etree.parse(file name)
(2) Network string
tree = etree.HTML(web page string)

ret = tree.xpath(path expression)
Note: ret is a list

JsonPath matching rules

JsonPath is used to parse JSON data.
Functions Python uses to handle the JSON format:
import json
json.dumps(): converts a dictionary or list to a JSON-formatted string
json.loads(): converts a JSON-formatted string to a Python object
json.dump(): converts a dictionary or list to a JSON-formatted string and writes it to a file
json.load(): reads a JSON-formatted string from a file into a Python object
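
A short sketch of the four functions; the dictionary and the file name are made up:

import json

data = {"name": "xiaoming", "age": 20}

# dumps: Python dict/list -> JSON-formatted string
s = json.dumps(data)
print(s)

# loads: JSON-formatted string -> Python object
print(json.loads(s)["name"])

# dump/load do the same conversions, but write to / read from a file
with open("data.json", "w") as f:      # 'data.json' is only illustrative
    json.dump(data, f)
with open("data.json") as f:
    print(json.load(f))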
Front-end processing:
To convert a JSON-formatted string into a JS object:
JSON.parse('JSON-formatted string')
eval('(' + JSON-formatted string + ')')
Installation:
pip install lxml
pip install jsonpath
http://blog.csdn.net/luxideyao/article/details/77802389
Comparison of XPath and JsonPath:
XPath    JsonPath    Meaning
/        $           root element
.        @           current element
/        .           child element
//       ..          search anywhere (recursive descent)
*        *           wildcard
[]       ?()         filter expression
XPath indexes start at 1
JsonPath indexes start at 0
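
A minimal sketch using the jsonpath package installed above (the sample JSON is made up); jsonpath() returns a list of matches, or False when nothing matches:

import json
import jsonpath   # from: pip install jsonpath

json_str = '{"store": {"book": [{"title": "book0", "price": 8.95}, {"title": "book1", "price": 12.99}]}}'
obj = json.loads(json_str)

# $ is the root, .. searches anywhere, and [] indexes from 0
print(jsonpath.jsonpath(obj, '$..title'))               # ['book0', 'book1']
print(jsonpath.jsonpath(obj, '$.store.book[0].price'))  # [8.95]
print(jsonpath.jsonpath(obj, '$..author'))              # False (no match)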
