[Python3 Crawler] Using the Beautiful Soup Library

Source: Internet
Author: User

I studied regular expressions earlier, but found that writing a web crawler with regular expressions alone gets quite complicated. That is where Beautiful Soup comes in.

Simply put, Beautiful Soup is a Python library whose main purpose is to extract data from web pages.

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application can be written without much code.

Installing Beautiful Soup

Install it from the command line:

pip install beautifulsoup4

When the command finishes without errors, the installation has succeeded.
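A quick way to confirm the install from Python itself is to import the package and print its version. This is only a minimal sketch, and it assumes the package was installed into the same interpreter that runs the snippet.

```python
# Importing bs4 raises ImportError if the installation did not work.
import bs4

# Print the installed version string.
print(bs4.__version__)
```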

Using Beautiful Soup

1. First, import the class from the bs4 library

from bs4 import BeautifulSoup

2. Define the HTML content (used by the examples that follow)

The following HTML is used as the example throughout. It is an excerpt from Alice in Wonderland (hereafter "the Alice document"):

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

3. Create a BeautifulSoup object

# Create a BeautifulSoup object; naming the parser avoids a "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")
# If the HTML content is stored in the file a.html, create the object from the file instead:
# soup = BeautifulSoup(open("a.html"), "html.parser")
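Both ways of creating the object can be tried side by side. The sketch below is illustrative only: the short `snippet` string and the temporary directory are assumptions made for the example, and the file name a.html is borrowed from the text above. It uses Python's built-in "html.parser" so no extra parser needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical small document, used only for this example.
snippet = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# From a string.
soup_from_string = BeautifulSoup(snippet, "html.parser")

# From a file: write the snippet to a temporary a.html, then parse the open handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(snippet)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)  # Demo
print(soup_from_file.p.string)        # Hello
```

Opening the file in a with block closes the handle as soon as parsing is done, which a bare open() call does not guarantee.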

4. Formatted output

# Formatted output
print(soup.prettify())

Output result: the same document, re-indented so that each tag and each string sits on its own line.
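To see the effect of prettify() on something small, the sketch below formats a one-line fragment. The fragment and the variable names are made up for illustration; the exact indentation is left for you to inspect when you run it.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, used only for this example.
compact = "<div><p>Hello <b>world</b></p></div>"
soup_small = BeautifulSoup(compact, "html.parser")

# prettify() returns a new string; the parse tree itself is not modified.
pretty = soup_small.prettify()
print(pretty)

# The single input line is spread over several output lines.
print(len(pretty.splitlines()) > 1)  # True
```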

5. Beautiful Soup transforms a complex HTML document into a complex tree structure

Each node is a Python object, and all of these objects fall into 4 types:

    • Tag
    • NavigableString
    • BeautifulSoup
    • Comment

(1) Tag

A Tag corresponds to a tag in the HTML, such as:

<title></title>

<a></a>

<p></p>

...

and every other tag.

Let's see how easily Beautiful Soup can fetch tags.

# Get tags
print(soup.title)
# Output: <title>The Dormouse's story</title>
print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>
print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.p)
# Output: <p class="title"><b>The Dormouse's story</b></p>

Note, however, that this only finds the first matching tag in the whole document, as the output for the <a> tag makes clear.
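When every matching tag is needed rather than just the first, the library's find_all() method can be used. It is not covered in the text above, so treat this as a side note; the shortened document and the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Shortened version of the Alice document, keeping only the three links.
doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup_links = BeautifulSoup(doc, "html.parser")

# Attribute-style access returns only the first <a> tag.
first_link = soup_links.a
print(first_link["id"])  # link1

# find_all() returns every matching tag, in document order.
all_ids = [a["id"] for a in soup_links.find_all("a")]
print(all_ids)  # ['link1', 'link2', 'link3']
```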

We can use type() to check the type of these tags.

# Check the data type of a tag
print(type(soup.title))
# Output: <class 'bs4.element.Tag'>

A Tag has two important attributes, name and attrs.

# Inspect the two attributes of a Tag: name and attrs
print(soup.a.name)
# Output: a
print(soup.a.attrs)
# Output: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

The output above shows that the attrs attribute of the <a> tag is a dictionary. To get a specific value from that dictionary, we can do this:

p = soup.a.attrs
print(p['class'])
# print(p.get('class')) is equivalent to the line above
# Output: ['sister']
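A Tag can also be indexed directly, without going through attrs first. The sketch below shows that shortcut and the difference between indexing and get(). The second class value "big" and the variable names are hypothetical, added only to show why class comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class values, used only for this example.
soup_attr = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = soup_attr.a

# Dictionary-style lookup directly on the tag.
print(link["href"])       # http://example.com/elsie
print(link.get("id"))     # link1

# get() returns None for a missing attribute; link["title"] would raise KeyError.
print(link.get("title"))  # None

# class can hold several values in HTML, so it is returned as a list of strings.
print(link["class"])      # ['sister', 'big']
```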

(2) NavigableString

We now have the tag itself, so how do we get the text inside it?

# Get the text inside a tag (NavigableString)
print(soup.a.string)
# Output: Elsie

Likewise, we can check its type with type().

print(type(soup.a.string))
# Output: <class 'bs4.element.NavigableString'>
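One detail worth knowing is that .string only returns a value when the tag has a single child. The sketch below, built on a made-up sentence, shows that case alongside get_text(), a library method not covered above that joins all the text inside a tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment, used only for this example.
soup_text = BeautifulSoup(
    "<p>Their names were <a>Elsie</a> and <a>Lacie</a>.</p>", "html.parser"
)

# <a> has exactly one child, a string, so .string returns it.
name = soup_text.a.string
print(name)                               # Elsie
print(isinstance(name, NavigableString))  # True
print(isinstance(name, str))              # True, NavigableString subclasses str

# <p> has several children, so .string is None; get_text() joins all the text.
print(soup_text.p.string)                 # None
print(soup_text.p.get_text())             # Their names were Elsie and Lacie.
```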

(3) BeautifulSoup

The BeautifulSoup object (soup) itself also has these two attributes, name and attrs, but their values are special:

# Inspect the attributes of the BeautifulSoup object
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {}
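The BeautifulSoup object behaves much like a Tag that stands for the whole document, which is why the same attributes exist on it. A minimal sketch, using a made-up one-paragraph document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, used only for this example.
soup_doc = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

# The document object has a fixed, special name and no attributes of its own.
print(soup_doc.name)   # [document]
print(soup_doc.attrs)  # {}

# An ordinary tag inside it reports its real tag name.
print(soup_doc.p.name)  # p

# BeautifulSoup is a subclass of Tag, so tag navigation works on it directly.
print(isinstance(soup_doc, Tag))  # True
```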

(4) Comment

Let's change the first <a> tag in the HTML above so that its content is a comment:

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>

We can extract the comment content in the same way, through .string:

# Get the text inside the tag
print(soup.a.string)
# Output:  Elsie  (the <!-- --> markers are stripped; the spaces inside the comment are kept)

Check its type:

print(type(soup.a.string))
# Output: <class 'bs4.element.Comment'>
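Because a comment prints just like ordinary text, a crawler that only wants visible text may need to tell the two apart. One possible approach, sketched here with a made-up fragment and hypothetical variable names, is an isinstance check against Comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: one link holds a comment, the other holds plain text.
soup_c = BeautifulSoup(
    '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>', "html.parser"
)

kinds = {}
for a in soup_c.find_all("a"):
    text = a.string
    # Comment is a subclass of NavigableString, so test for Comment first.
    if isinstance(text, Comment):
        kinds[a["id"]] = ("comment", text.strip())
    else:
        kinds[a["id"]] = ("text", str(text))

print(kinds["link1"])  # ('comment', 'Elsie')
print(kinds["link2"])  # ('text', 'Lacie')
print(issubclass(Comment, NavigableString))  # True
```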
