Beautiful Soup supports the most commonly used CSS selectors. Pass a selector string to the .select() method of a Tag object or of the BeautifulSoup object itself.
The HTML used in this article is:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the document; the examples below all use this soup object.
soup = BeautifulSoup(html_doc, "html.parser")
For example, you can search for tags like this:
soup.select("title")  # use the select() method
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
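Since select() is available on both the BeautifulSoup object and on individual tags, the difference in search scope can be sketched as follows. The markup and the variable names here are made up for illustration and are not the article's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shorter than the article's html_doc.
html = """
<html><head><title>Demo</title></head>
<body>
<p class="story">first <a id="link1" href="http://example.com/elsie">Elsie</a></p>
<p class="story">second <a id="link2" href="http://example.com/lacie">Lacie</a></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Called on the BeautifulSoup object, select() searches the whole document.
all_links = soup.select("a")

# Called on a Tag, select() searches only that tag's descendants.
first_paragraph = soup.select("p")[0]
links_in_first = first_paragraph.select("a")

link_ids = [a["id"] for a in all_links]
first_ids = [a["id"] for a in links_in_first]
print(link_ids)
print(first_ids)
```

In both cases the result is a list of tags, so it can be indexed and iterated like any other list.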
You can also find tags anywhere beneath other tags, that is, by descendant relationship:
soup.select("body a")  # find <a> tags anywhere inside the <body> tag
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")  # find <title> tags inside <head> inside <html>
# [<title>The Dormouse's story</title>]
Find tags directly beneath other tags (direct children only):
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
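The empty result for "body > a" comes from the difference between the descendant combinator (a space) and the child combinator (>). A minimal sketch of that difference, using a small hypothetical document with one nested link and one direct child link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "inner" is nested three levels below <body>,
# "outer" is a direct child of <body>.
html = "<body><div><p><a id='inner'>x</a></p></div><a id='outer'>y</a></body>"
soup = BeautifulSoup(html, "html.parser")

# A space matches <a> tags at any depth below <body>.
descendants = [a["id"] for a in soup.select("body a")]

# ">" matches only <a> tags whose parent is <body>.
children = [a["id"] for a in soup.select("body > a")]

print(descendants)
print(children)
```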
Find the siblings of a tag:
soup.select("#link1 ~ .sister")  # all following siblings of the tag with id link1 that have class sister
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")  # only the next sibling of the tag with id link1, if it has class sister
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
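One detail worth checking is that both sibling combinators only look forward: "~" matches siblings that come after the reference tag, and "+" matches only the element immediately after it. The sketch below uses a shortened, hypothetical version of the three-sisters paragraph to show this.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the paragraph with the three links.
html = (
    '<p><a class="sister" id="link1">Elsie</a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# "~" matches every later sibling, so link1 is not returned for link2.
after_link2 = [a["id"] for a in soup.select("#link2 ~ .sister")]

# "+" matches only the next sibling element; text between the tags is ignored.
next_to_link1 = [a["id"] for a in soup.select("#link1 + .sister")]

print(after_link2)
print(next_to_link1)
```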
Find tags by CSS class:
soup.select(".sister")  # all tags whose class is sister
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")  # same result as the previous call
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
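The two class selectors agree because "~=" matches one whitespace-separated word in the attribute value, which is also how ".sister" treats the class attribute. A small sketch with hypothetical tags, one of which carries two classes and one of which merely has a class name starting with "sister":

```python
from bs4 import BeautifulSoup

# Hypothetical tags: a2 has two classes, a3 has a different class
# that only begins with the letters "sister".
html = (
    '<a class="sister" id="a1">A</a>'
    '<a class="sister big" id="a2">B</a>'
    '<a class="sisterhood" id="a3">C</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = [a["id"] for a in soup.select(".sister")]
by_attribute = [a["id"] for a in soup.select("[class~=sister]")]

print(by_class)
print(by_attribute)
```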
Get tags by id:
soup.select("#link1")  # find the tag with the given id
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")  # unlike the id-only selector above, this also requires the tag name to be a, which makes the search more specific
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags that match any selector in a comma-separated list of selectors. A tag is returned as long as it matches at least one selector in the list.
soup.select("#link1, #link2")  # tags whose id is link1 or link2
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
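As far as I can tell, the results of a selector list come back in document order rather than in the order the selectors were written, and a tag that matches several selectors appears only once. The following sketch, on a hypothetical three-link paragraph, checks both points.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same ids and class as the article's links.
html = (
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# The selectors are listed in reverse, yet link1 is returned first.
reversed_list = [a["id"] for a in soup.select("#link2, #link1")]

# Every link matches both "a" and ".sister" but is returned a single time.
overlapping = [a["id"] for a in soup.select("a, .sister")]

print(reversed_list)
print(overlapping)
```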
Find tags by whether an attribute exists:
soup.select("a[href]")  # <a> tags that have an href attribute
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
A few notes on these selectors:
soup.select('a[href^="http://example.com/"]') finds the <a> tags whose href attribute value starts with "http://example.com/".
soup.select('a[href$="tillie"]') finds the <a> tags whose href attribute value ends with "tillie".
soup.select('a[href*=".com/el"]') finds the <a> tags whose href attribute value contains the string ".com/el", so only the tag with href="http://example.com/elsie" matches.
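The three operators can be compared side by side in a short sketch. The markup is hypothetical: it reuses the article's three URLs and adds a relative link ("/about", id link4) that none of the three selectors should match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three absolute links plus one relative link.
html = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
<a href="/about" id="link4">About</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= : the attribute value starts with the given string.
starts = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]

# $= : the attribute value ends with the given string.
ends = [a["id"] for a in soup.select('a[href$="tillie"]')]

# *= : the attribute value contains the given string anywhere.
contains = [a["id"] for a in soup.select('a[href*=".com/el"]')]

print(starts, ends, contains)
```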
To return only the first tag that matches a selector:
soup.select_one(".sister")  # return only the first matching tag
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
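When nothing matches, select_one() returns None while select() returns an empty list, so the result of select_one() should be checked before it is used. A minimal sketch, where the markup and the ".brother" class are invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no tag here has the class "brother".
html = (
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".sister")      # a single Tag, not a list
missing = soup.select_one(".brother")   # None, because nothing matches
empty = soup.select(".brother")         # an empty list

# Guard against None before reading attributes from the result.
first_id = first["id"] if first is not None else None
print(first_id, missing, empty)
```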
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.