Beautiful Soup
Beautiful Soup is an HTML/XML parsing library for Python. Because it navigates a page through its document structure and node attributes, it removes the need to write complex regular expressions to extract data.
1. Parser
| Parser | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Decent speed, reasonably lenient | Poor document fault tolerance in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, good document fault tolerance | Requires the lxml C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Very fast, the only parser that supports XML | Requires the lxml C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses documents the way a browser does, produces valid HTML5 documents | Slow; depends on the external html5lib package |
In summary, the lxml HTML parser is recommended.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello World</p>', 'lxml')
print(soup.p.string)  # Hello World
```
2. Basic usage:
```python
html = '''
<html>
<head><title>title example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
last sentence</p>
'''
```
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # repair and pretty-print the incomplete HTML
print(soup.title.string)  # print the string content of the <title> node
```
3. Node selector:
Selecting elements
Select a node by accessing it as an attribute of the soup object (for example soup.title or soup.p); this returns the first matching node.
Extracting information
(1) Getting the name
Use soup.element.name to get the tag name.
(2) Getting attributes
Use soup.element.attrs to get all attributes of a node as a dict.
Use soup.element.attrs['name'] (or the shorthand soup.element['name']) to get a single attribute value.
(3) Getting the element content
Use soup.element.string to get the text content of a node.
Nested selection
Attribute access can be chained, e.g. soup.parent_element.child_element.string, to reach a node inside another node and get its content, as shown in the sketch below.
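A minimal sketch of these node-selector operations, reusing the html string defined in the Basic usage section above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is the snippet from the Basic usage section

print(soup.p.name)           # tag name of the first <p>: "p"
print(soup.p.attrs)          # all attributes: {'class': ['title'], 'name': 'dr'}
print(soup.p.attrs['name'])  # a single attribute: "dr"
print(soup.p['name'])        # shorthand for the same lookup
print(soup.p.string)         # text content: "title example"
print(soup.body.p.string)    # nested selection through <body>
```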
Association selection
(1) Child nodes and descendant nodes
```python
html = '''
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1"><span>elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>lacie</span></a>,
<a href="http://example.com/tillie" class="sister" id="link3"><span>tillie</span></a>,
last sentence</p>
'''
```
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# children: the direct child nodes (returned as a generator)
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# descendants: all descendant nodes (children, grandchildren, ...)
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```
(2) Parent and ancestor nodes
Get the direct parent node with the parent attribute.
Get all ancestor nodes with the parents attribute (returned as a generator).
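A minimal sketch, again against the html snippet above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# parent: the direct parent of the first <a> node (the enclosing <p class="story">)
print(soup.a.parent)

# parents: a generator over all ancestors, up to the document object itself
for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)
```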
(3) Sibling nodes
next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: a generator over all following sibling nodes
previous_siblings: a generator over all preceding sibling nodes
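For example (a sketch using the same html snippet; note that text fragments and whitespace also count as sibling nodes):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.a.next_sibling)      # the node right after the first <a> (a text node)
print(soup.a.previous_sibling)  # the node right before it (the text "link")

for i, sibling in enumerate(soup.a.next_siblings):      # all following siblings
    print(i, sibling)
for i, sibling in enumerate(soup.a.previous_siblings):  # all preceding siblings
    print(i, sibling)
```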
(4) Extracting information
The nodes returned by these association selections are ordinary Tag objects, so their information can be extracted with the same name, attrs, and string properties described above.
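As a brief sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

parent = soup.a.parent        # a node reached through association selection
print(parent.name)            # "p"
print(parent.attrs['class'])  # ["story"]
print(soup.a.span.string)     # "elsie"
```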
4. Method selector:
find_all()
find_all(name, attrs, recursive, text, **kwargs) searches the tree and returns a list of all matching nodes.
(1) Name
```python
# query by node name
# (assumes an HTML document containing nested <ul>/<li> nodes,
#  not the snippet shown above)
soup.find_all(name='ul')
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='ul'))
```
```python
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
```
(2) attrs
```python
# query by attribute value
# (assumes a document whose nodes carry id="list1" and name/class="elements")
print(soup.find_all(attrs={'id': 'list1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# equivalently, pass the attribute as a keyword argument;
# "class" is a reserved word in Python, so it is written class_
print(soup.find_all(id='list1'))
print(soup.find_all(class_='elements'))
```
(3) text
The text parameter matches the text of nodes; it can be a string or a compiled regular expression object.
```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))  # all text nodes containing "link"
```
find()
find() takes the same arguments as find_all() but returns only the first matching element instead of a list.
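For instance, a sketch against the html snippet above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find(name='a'))         # only the first <a> node
print(soup.find(class_='sister'))  # only the first node with class "sister"
print(soup.find_all(name='a'))     # compare: find_all() returns all matches as a list
```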
Note: the following methods take the same arguments as find_all() and find():
find_parents() and find_parent(): all ancestor nodes / the direct parent
find_next_siblings() and find_next_sibling(): all following siblings / the first following sibling
find_previous_siblings() and find_previous_sibling(): all preceding siblings / the first preceding sibling
find_all_next() and find_next(): all matching nodes after the current node / the first such node
find_all_previous() and find_previous(): all matching nodes before the current node / the first such node
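A short sketch of a few of these, assuming the same html snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

first_a = soup.find(name='a')
print(first_a.find_parent(name='p'))        # the nearest <p> ancestor
print(first_a.find_next_sibling(name='a'))  # the next <a> sibling (the "lacie" link)
print(first_a.find_all_next(name='span'))   # all <span> nodes after this point in the document
```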
5. CSS selector:
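CSS selection uses the select() method with a standard CSS selector and returns a list of matching nodes; a minimal sketch against the html snippet above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.select('p.story a'))      # all <a> nodes inside <p class="story">
print(soup.select('#link1'))         # the node with id="link1"
print(soup.select('a.sister span'))  # <span> nodes inside the sister links
```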
Nested selection
```python
# assumes an HTML document containing <ul>/<li> nodes
for ul in soup.select('ul'):
    print(ul.select('li'))
```
Getting attributes
```python
for ul in soup.select('ul'):
    print(ul['id'])
    # equivalent to
    print(ul.attrs['id'])
```
Get text
In addition to the string property, text can be retrieved with the get_text() method.
```python
for li in soup.select('li'):
    # both produce the same result
    print(li.get_text())
    print(li.string)
```