Regular expression matching:
Rules
Single characters:
. : any character except a line break
[]: a character set; [aoe] or [a-w] matches any one character in the set
\d: digit [0-9]
\D: non-digit
\w: digits, letters, underscores (in Python 3, also other Unicode word characters such as Chinese)
\W: non-\w
\s: any whitespace character
\S: non-whitespace
Quantifiers:
*: any number of times, >= 0
+: at least once, >= 1
?: optional, 0 or 1 times
{m}: exactly m times
{m,}: at least m times
{m,n}: m to n times
Boundaries:
\b: word boundary
$: ends with xxx
^: starts with xxx
Groups:
(ab){3} repeats the group three times
(){4} treats the parenthesized part as a whole
() creates a sub-pattern (group); refer back to a group with \1, \2
Non-greedy mode:
.*? .+? (a trailing ? makes * and + match as little as possible)
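A minimal sketch of greedy vs. non-greedy matching (the sample string is made up for illustration):

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* grabs as much as possible, spanning both tags
greedy = re.findall(r"<b>(.*)</b>", html)
# Non-greedy: .*? stops at the first possible closing tag
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```

This is why `.*?` is the usual choice when extracting content between tags.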
re.I: ignore case
re.M: multi-line matching (^ and $ match at every line)
re.S: make . match line breaks too (DOTALL mode)
match / search / findall
re.sub(pattern, replacement, original string)
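A short sketch of the functions above; the sample text and patterns are made up:

```python
import re

text = "Tel: 010-1234\ntel: 021-5678"

# match only tries the very start of the string
print(re.match(r"Tel", text).group())            # Tel
# search scans the whole string for the first hit
print(re.search(r"\d{3}-\d{4}", text).group())   # 010-1234
# findall returns every non-overlapping match as a list;
# re.I ignores case, re.M lets ^ match at every line start
print(re.findall(r"^tel", text, re.I | re.M))    # ['Tel', 'tel']
# sub(pattern, replacement, string) replaces every match
print(re.sub(r"\d", "*", text))
```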
BS4 matching rules
Needs installing: pip install bs4
BS4 also needs a third-party parser; install it with pip install lxml
Simple usage:
Description: a selector-based API, similar to jQuery
from bs4 import BeautifulSoup
How to use: convert an HTML document into a BeautifulSoup object, then find the desired content through the object's methods and properties
(1) Convert a local file:
soup = BeautifulSoup(open('local file'), 'lxml')
(2) Convert network content:
soup = BeautifulSoup('string or bytes content', 'lxml')
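A minimal sketch of both construction styles, assuming bs4 and lxml are installed; the file name and HTML string are made up:

```python
from bs4 import BeautifulSoup  # needs: pip install bs4 lxml

# (1) A local file is passed as an open file object (hypothetical name):
#     soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'lxml')

# (2) A network response body is passed as a string or bytes:
html = "<html><body><a href='http://example.com'>link</a></body></html>"
soup = BeautifulSoup(html, "lxml")
print(soup.a)  # <a href="http://example.com">link</a>
```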
(1) Search by tag name
soup.a finds only the first tag that matches
(2) Get attributes
soup.a.attrs gets all attributes and values, returned as a dictionary
soup.a.attrs['href'] gets the href attribute
soup.a['href'] is the abbreviated form
(3) Get content
soup.a.string
soup.a.text
soup.a.get_text()
Note: if the tag contains nested tags, .string returns None, while the other two still return the text content
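The difference between .string and .text is easiest to see with a nested tag; the HTML here is made up:

```python
from bs4 import BeautifulSoup

html = "<div><a href='/home' title='go'>Home <span>page</span></a></div>"
soup = BeautifulSoup(html, "lxml")

print(soup.a.attrs)       # {'href': '/home', 'title': 'go'}
print(soup.a['href'])     # /home
print(soup.a.string)      # None -- the <a> contains a nested <span>
print(soup.a.text)        # Home page
print(soup.a.get_text())  # Home page
```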
(4) find
soup.find('a') finds the first <a> that matches
soup.find('a', title="xxx")
soup.find('a', alt="xxx")
soup.find('a', class_="xxx")
soup.find('a', id="xxx")
Note: find can be called not only on soup but also on any ordinary tag object (e.g. a div), which then searches inside that tag for matching nodes
find returns only the first tag that matches.
(5) find_all
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all('a', limit=2) limits the result to the first two
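A sketch of find and find_all on a made-up fragment (ids, classes, and titles are hypothetical):

```python
from bs4 import BeautifulSoup

html = """
<a id="first" class="nav" title="x">one</a>
<a id="second">two</a>
<b>bold</b>
<a id="third">three</a>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.find("a").text)                    # one  (first match only)
print(soup.find("a", id="second").text)       # two
print(soup.find("a", class_="nav")["title"])  # x
print(len(soup.find_all("a")))                # 3
print(len(soup.find_all(["a", "b"])))         # 4
print(len(soup.find_all("a", limit=2)))       # 2
```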
(6) select
Selects content by CSS selector
Common selectors: tag selector, class selector, id selector, combo selector, hierarchy selector, pseudo-class selector, attribute selector
a                            # tag selector
.dudu                        # class selector
#lala                        # id selector
a, .dudu, #lala, .meme       # combo selector
div .dudu #lala .meme .xixi  # descendant selector
div > p > a > .lala          # hierarchy selector, direct children only
input[name='lala']           # attribute selector
select always returns a list; extract the desired object by index, then read its attributes and child nodes
This method can also be called on ordinary tag objects, finding all matching nodes under that object
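A sketch of the selector styles above on a made-up fragment (class names and ids echo the examples):

```python
from bs4 import BeautifulSoup

html = """
<div class="dudu"><p><a class="lala">deep</a></p></div>
<a id="lala">by id</a>
<input name="lala" value="v"/>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.select("a"))                  # tag selector: all <a> tags
print(soup.select(".dudu"))              # class selector
print(soup.select("#lala"))              # id selector
print(soup.select("div > p > a.lala"))   # hierarchy selector
print(soup.select("input[name='lala']")) # attribute selector
print(soup.select("a")[0].text)          # select returns a list; index into it
```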
XPath matching rules
Installing the XPath plugin
Drag the XPath plugin into Chrome's extensions page to install it
Starting and stopping the plugin:
Ctrl + Shift + X
Attribute positioning
//input[@id="kw"]
//input[@class="bg s_btn"]
Hierarchy positioning
Index positioning
//div[@id="head"]/div/div[2]/a[@class="toindex"]
[Note] Indexes start from 1
//div[@id="head"]//a[@class="toindex"]
[Note] The double slash matches all <a> descendants, regardless of position
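A sketch of index and double-slash positioning with lxml (the fragment and its ids/classes are made up):

```python
from lxml import etree

page = """
<div id="head">
  <div>
    <a href="/a1">a1</a>
    <a class="toindex" href="/a2">a2</a>
  </div>
</div>
"""
tree = etree.HTML(page)

# Index positioning: XPath indexes start at 1, so a[2] is the second <a>
print(tree.xpath('//div[@id="head"]/div/a[2]/text()'))             # ['a2']
# Double slash: any <a> descendant of the div, regardless of depth
print(tree.xpath('//div[@id="head"]//a[@class="toindex"]/@href'))  # ['/a2']
```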
Logical operators
//input[@class="s_ipt" and @name="wd"]
Fuzzy matching
contains
//input[contains(@class, "s_i")]
All <input> nodes that have a class attribute containing "s_i"
//input[contains(text(), "Love")]
starts-with
//input[starts-with(@class, "s")]
All <input> nodes whose class attribute starts with "s"
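A sketch of both fuzzy-match functions; the attribute values mirror the examples above but the fragment is made up:

```python
from lxml import etree

tree = etree.HTML("""
<input class="s_ipt" name="wd"/>
<input class="bg s_btn"/>
<span>I Love XPath</span>
""")

# contains(): the class attribute contains the substring "s_i"
print(tree.xpath('//input[contains(@class, "s_i")]/@class'))    # ['s_ipt']
# starts-with(): the class attribute begins with "s"
print(tree.xpath('//input[starts-with(@class, "s")]/@class'))   # ['s_ipt']
# contains(text(), ...): filter on the node's own text
print(tree.xpath('//span[contains(text(), "Love")]/text()'))    # ['I Love XPath']
```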
Fetching text
//div[@id="u1"]/a[5]/text() gets the text content of the node
//div[@id="u1"]//text() gets all text content under the node, without tags
To stitch everything together into one string:
ret = tree.xpath('//div[@class="song"]')
string = ret[0].xpath('string(.)')
print(string.replace('\n', '').replace('\t', ''))
Fetching attributes
//div[@id="u1"]/a[5]/@href
Using XPath in code
from lxml import etree
Two ways to use it: turn an HTML document into an object, then call the object's xpath method to find the desired nodes
(1) Local file
tree = etree.parse(file name)
(2) Network content
tree = etree.HTML(page string)
ret = tree.xpath(path expression)
[Note] ret is a list
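A sketch putting the pieces above together, including the string(.) trick; the page string is made up (etree.parse('page.html') would be the local-file variant, with a hypothetical file name):

```python
from lxml import etree

# etree.HTML() parses a page string fetched from the network
page = '<div class="song">A <b>B</b>\n\tC</div>'
tree = etree.HTML(page)

ret = tree.xpath('//div[@class="song"]')  # xpath always returns a list
text = ret[0].xpath('string(.)')          # string(.) joins all nested text
print(text.replace('\n', '').replace('\t', ''))  # A BC
```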
JsonPath matching rules
JsonPath: used to parse JSON data
Functions Python uses to handle the JSON format:
import json
json.dumps(): converts a dictionary or list to a JSON-formatted string
json.loads(): converts a JSON-formatted string to a Python object
json.dump(): converts a dictionary or list to a JSON-formatted string and writes it to a file
json.load(): reads a JSON-formatted string from a file into a Python object
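A sketch of the four json functions; StringIO stands in for a real file so the example is self-contained:

```python
import io
import json

data = {"name": "tom", "age": 18}

s = json.dumps(data)   # dict -> JSON string
obj = json.loads(s)    # JSON string -> Python object
print(s)
print(obj)

# dump/load are the file-object versions of dumps/loads
buf = io.StringIO()
json.dump(data, buf)   # write JSON to a file-like object
buf.seek(0)
print(json.load(buf))  # read it back into a Python object
```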
Front-end handling:
To convert a JSON-formatted string to a JS object:
JSON.parse('JSON-formatted string')
eval('(' + JSON-formatted string + ')')
Installation:
pip install lxml
pip install jsonpath
http://blog.csdn.net/luxideyao/article/details/77802389
Comparison of XPath and JsonPath
XPath   JsonPath   Meaning
/       $          root element
.       @          current element
/       .          child element
//      ..         search anywhere in the document
*       *          wildcard
[]      ?()        filter expression
XPath index subscripts start from 1
JsonPath index subscripts start from 0