Python Beautiful Soup Parsing Library usage

Source: Internet
Author: User

Beautiful Soup

By parsing a Web page through its structure and attributes, Beautiful Soup removes the need to write complex regular expressions to extract data.

Beautiful Soup is an HTML/XML parsing library for Python.

1. Parser

| Parser | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Moderate speed; reasonable document fault tolerance | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; strong document fault tolerance | Requires the C library (lxml) to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C library (lxml) to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; generates valid HTML5 | Very slow; requires an external Python dependency (html5lib) |

In summary, the lxml HTML parser is recommended.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello World</p>', 'lxml')
print(soup.p.string)

2. Basic usage:

html = '''
<html><head><title>title example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
last sentence</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())      # re-indents and fixes up the HTML
print(soup.title.string)    # prints the text content of the <title> node

3. Node selector:

Select element

Nodes can be selected directly as attributes of the soup object, i.e. soup.element (for example soup.title or soup.p); this returns the first matching node.

Extracting information

(1) Get the name

Use soup.element.name to get the tag name.

(2) Get attributes

Use soup.element.attrs to get all attributes as a dictionary.

Use soup.element.attrs['name'] (or the shorthand soup.element['name']) to get a single attribute.

(3) Element content

Use soup.element.string to get the text content, as shown in the sketch below.
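
A minimal sketch using the html string from Section 2 above (the output values depend on that markup):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)       # tag name: 'title'
print(soup.p.attrs)          # attribute dict of the first <p>: {'class': ['title'], 'name': 'dr'}
print(soup.p.attrs['name'])  # a single attribute: 'dr'
print(soup.p['name'])        # shorthand for the line above
print(soup.p.string)         # text content: 'title example'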

Nested selection

Selections can be chained, e.g. soup.parent_element.element.string to reach a nested node's content.
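
For instance, with the same soup:

print(soup.head.title.string)   # select <head>, then the <title> inside it, then its text
print(soup.body.p.b.string)     # drill down body -> p -> b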

Association selection

(1) Child nodes and descendant nodes

html = '''
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1"><span>elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>lacie</span></a>,
<a href="http://example.com/tillie" class="sister" id="link3"><span>tillie</span></a>,
last sentence</p>
'''
from bs4 import BeautifulSoup

# direct children: the children attribute
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# all descendant nodes: the descendants attribute
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

(2) Parent and ancestor nodes

To get the direct parent node, use the parent property.

To get all ancestor nodes, use the parents property.
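
A minimal sketch using the html string just above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)             # direct parent of the first <a>: the <p class="story"> node
for i, ancestor in enumerate(soup.a.parents):
    print(i, ancestor.name)      # p, body, html, [document]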

(3) Sibling nodes

next_sibling: the next sibling node

previous_sibling: the previous sibling node

next_siblings: all following sibling nodes

previous_siblings: all preceding sibling nodes
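
A minimal sketch with the same markup (the exact output depends on the whitespace between tags):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_sibling)               # the text right after the first <a>
print(soup.a.previous_sibling)           # the text right before it ('link')
print(list(soup.a.next_siblings))        # everything that follows it inside the same parent
print(list(soup.a.previous_siblings))    # everything that precedes it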

(4) Extracting information

The accessors from Section 3 (name, attrs, string) work on nodes obtained through association selection as well.
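
For example, still against the same markup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
parent = soup.a.parent        # the <p class="story"> containing the first link
print(parent.name)            # 'p'
print(parent.attrs)           # its attribute dictionary
print(soup.a.span.string)     # text inside the <span> of the first <a>: 'elsie'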

4. Method selector:

find_all()

find_all(name, attrs, recursive, text, **kwargs)
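
The find_all() examples below query <ul>/<li> markup that is not reproduced in this post; a hypothetical fragment such as the following (the names list1 and elements are assumptions chosen to match the queries used below) makes them runnable:

from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <p>some useful links</p>
  <ul class="elements" id="list1" name="elements">
    <li>Foo</li>
    <li>Bar</li>
  </ul>
  <ul class="elements" id="list2">
    <li>Baz</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')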

(1) Name

print(soup.find_all(name='ul'))
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

(2) attrs

# query by attributes
print(soup.find_all(attrs={'id': 'list1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# equivalently, pass attributes as keyword arguments
# (class is a Python keyword, so use class_)
print(soup.find_all(id='list1'))
print(soup.find_all(class_='elements'))

(3) Text

The text parameter matches against the text of nodes; it accepts either a string or a compiled regular expression object.

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))

find()

Returns the first matching element rather than a list (or None if nothing matches).
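
For example, against the hypothetical fragment above:

print(soup.find(name='ul'))          # the first <ul> only, not a list
print(soup.find(class_='elements'))  # first node whose class is "elements"
print(soup.find(name='span'))        # None when nothing matches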

Note

Other method selectors follow the same calling conventions as find_all() and find():

find_parents() and find_parent(): all ancestor nodes / the direct parent node.

find_next_siblings() and find_next_sibling(): all following siblings / the first following sibling.

find_previous_siblings() and find_previous_sibling(): all preceding siblings / the first preceding sibling.

find_all_next() and find_next(): all matching nodes after the current node / the first such node.

find_all_previous() and find_previous(): all matching nodes before the current node / the first such node.
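
A short sketch of a few of these, again against the hypothetical fragment:

li = soup.find(name='li')              # the first <li> ('Foo')
print(li.find_parent(name='ul'))       # the enclosing <ul id="list1">
print(li.find_next_sibling())          # the next <li> sibling ('Bar')
print(li.find_all_next(name='li'))     # every <li> that comes after it in the document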

5. CSS selector:

Beautiful Soup also supports CSS selectors through the select() method, which takes a CSS selector string and returns a list of matching nodes.
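
For instance, reusing the hypothetical fragment from Section 4:

print(soup.select('.panel .elements'))   # by class
print(soup.select('ul li'))              # descendant selector: every <li> inside a <ul>
print(soup.select('#list2 li'))          # by id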

Nested selection

for ul in soup.select('ul'):
    print(ul.select('li'))

Get attributes

for ul in soup.select('ul'):
    print(ul['id'])
    # equivalent to
    print(ul.attrs['id'])

Get text

Besides the string property, text can also be retrieved with the get_text() method.

for li in soup.select('li'):
    # both produce the same output
    print(li.get_text())
    print(li.string)
