[Python3 Crawler] Using the Beautiful Soup Library

Last Update:2018-03-28 Source: Internet

Author: User

I had been learning regular expressions, but writing a web crawler with regular expressions alone turns out to be quite complicated. That is where Beautiful Soup comes in.

Simply put, Beautiful Soup is a Python library whose main purpose is to extract data from web pages.

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application can be written with very little code.
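As a rough illustration of the difference, the sketch below extracts link targets from a tiny page fragment twice: once with a regular expression and once with Beautiful Soup. The fragment, the pattern, and the variable names are made up for this example and are not part of the original article.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for this illustration.
page = '<p><a class="x" href="/a">First</a> <a href="/b" class="x">Second</a></p>'

# Regular-expression approach: the pattern has to anticipate
# attribute order and quoting style by hand.
regex_links = re.findall(r'<a[^>]*href="([^"]*)"', page)

# Beautiful Soup approach: parse once, then ask for the attribute by name.
soup = BeautifulSoup(page, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)
print(soup_links)
```

Both approaches find the same two links here, but the regular expression would need changes for single-quoted or unquoted attributes, while the parser-based version would not.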

Installing Beautiful Soup

Install it from the command line:

pip install beautifulsoup4

If the installation succeeds, pip reports "Successfully installed beautifulsoup4" followed by the version number.
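One simple way to confirm the installation from Python itself is to import the package and print its version. This is a minimal check, assuming the package was installed into the same interpreter that runs the script; the variable name is only illustrative.

```python
import bs4

# If this import succeeds, the package is available to this interpreter.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```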

Using Beautiful Soup

1. First, import BeautifulSoup from the bs4 library

from bs4 import BeautifulSoup

2. Define the HTML content (used by the examples that follow)

The following HTML will be used as an example many times. It is a passage from Alice in Wonderland (hereafter referred to as the Alice document):

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

3. Create a BeautifulSoup object

# Create a BeautifulSoup object (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(html, "html.parser")
# If the HTML content is stored in the file a.html, create the object from the open file instead:
# soup = BeautifulSoup(open("a.html"), "html.parser")
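The file-based variant can be sketched as follows. To keep the example self-contained it first writes a small page to a temporary file; the page content, the temporary directory, and the name file_soup are assumptions made for this illustration. A with-block is used so the file is closed after parsing.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small hypothetical page to a temporary a.html so the example can run anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Demo</title></head><body><p>Hello</p></body></html>")

    # Pass the open file object to BeautifulSoup; the with-block closes it afterwards.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

# The document was fully read during construction, so it stays usable after the file is closed.
print(file_soup.title.string)
```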

4. Formatted output

# Formatted output
print(soup.prettify())

Output: the same document, re-indented with each tag and each string on its own line.
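To see the effect of prettify() on something small, the sketch below formats a one-line fragment. The fragment and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, used only for this illustration.
compact = "<div><p>Hello <b>world</b></p></div>"
demo_soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a string in which nested tags are indented on separate lines.
pretty = demo_soup.prettify()
print(pretty)
```

Because prettify() adds whitespace, it is best suited to inspecting a document rather than to producing the HTML you intend to serve.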

5. Beautiful Soup transforms a complex HTML document into a tree structure

Each node in the tree is a Python object, and all objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment
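The four types can be observed together in one small document. This is a minimal sketch; the fragment and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, plain text, and a comment.
doc = "<p>plain text<!-- a comment --></p>"
demo_soup = BeautifulSoup(doc, "html.parser")

tag = demo_soup.p
text, comment = tag.contents  # the <p> tag has exactly two children

print(type(demo_soup).__name__)  # BeautifulSoup
print(type(tag).__name__)        # Tag
print(type(text).__name__)       # NavigableString
print(type(comment).__name__)    # Comment
```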

(1) Tag

A Tag corresponds to a tag in the HTML, such as:

<a></a>

<p></p>

...

and every other tag.

The following shows how easily Beautiful Soup retrieves tags:

# Get tags
print(soup.title)
# Output: <title>The Dormouse's story</title>
print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>
print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.p)
# Output: <p class="title"><b>The Dormouse's story</b></p>

Note, however, that this kind of access returns only the first matching tag in the whole document, as the output for the <a> tag makes clear.
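When every matching tag is needed rather than just the first, find_all() can be used instead of attribute-style access. The sketch below uses a shortened copy of the Alice document; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, containing only the three links.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # only the first <a> tag
all_links = soup.find_all("a")  # every <a> tag, in document order

print(first_link["id"])               # link1
print([a["id"] for a in all_links])   # ['link1', 'link2', 'link3']
```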

We can use type() to check the type of the objects returned above:

# Check the data type of a retrieved tag
print(type(soup.title))
# Output: <class 'bs4.element.Tag'>

A Tag has two important attributes, name and attrs:

# Inspect the two Tag attributes, name and attrs
print(soup.a.name)
# Output: a
print(soup.a.attrs)
# Output: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

The output above shows that the attrs attribute of the <a> tag is a dictionary. To get a specific value from it, index it by key:

p = soup.a.attrs
print(p['class'])
# print(p.get('class')) is equivalent to the line above
# Output: ['sister']
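A Tag can also be indexed directly, without going through attrs first. The sketch below shows direct indexing, get() with a missing attribute, and has_attr(); the one-tag fragment and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Single link taken from the Alice document.
fragment = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(fragment, "html.parser").a

href = link["href"]            # direct indexing, raises KeyError if the attribute is missing
classes = link["class"]        # class is multi-valued, so a list is returned
missing = link.get("title")    # get() returns None for a missing attribute
has_id = link.has_attr("id")   # True when the attribute is present

print(href)      # http://example.com/elsie
print(classes)   # ['sister']
print(missing)   # None
print(has_id)    # True
```

Using get() is the safer choice when an attribute may be absent, since direct indexing raises an exception in that case.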

(2) NavigableString

We can now retrieve tags, so how do we get the text inside a tag?

# Get the text inside a tag (NavigableString)
print(soup.a.string)
# Output: Elsie

Similarly, we can check its type with type():

print(type(soup.a.string))
# Output: <class 'bs4.element.NavigableString'>
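A NavigableString behaves like an ordinary Python string, and .string only returns a value when the tag has a single child. The sketch below illustrates both points; the fragments and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

demo_soup = BeautifulSoup(
    '<a href="http://example.com/elsie">Elsie</a><p>mixed <b>content</b></p>',
    "html.parser",
)

text = demo_soup.a.string
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True, NavigableString subclasses str
plain = str(text)                         # convert to a plain str that no longer references the tree

# A tag with more than one child has no single string, so .string is None.
print(demo_soup.p.string)      # None
# get_text() joins all the text inside the tag instead.
print(demo_soup.p.get_text())  # mixed content
```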

(3) BeautifulSoup

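The BeautifulSoup object represents the document as a whole. It also has the name and attrs attributes, but with special values:

# Inspect the attributes of the BeautifulSoup object
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {}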
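Because the BeautifulSoup object is itself a kind of Tag, most of what works on a tag also works on the whole document. A minimal sketch, with a made-up fragment and variable names:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo_soup = BeautifulSoup("<p>Hello</p>", "html.parser")

print(demo_soup.name)              # [document]
print(demo_soup.attrs)             # {}
print(isinstance(demo_soup, Tag))  # True, BeautifulSoup subclasses Tag
print(demo_soup.p.name)            # p, an ordinary tag reports its own name
```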

(4) Comment

Let's change the first <a> tag in the HTML above so that its content is a comment:

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>

The text of the comment can be extracted with .string as well:

# Get the text inside the tag
print(soup.a.string)
# Output: Elsie (the text of the comment, without the <!-- and --> markers)

Check its type:

print(type(soup.a.string))
# Output: <class 'bs4.element.Comment'>
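Since printing a Comment looks the same as printing ordinary text, it can help to check the type before using the value. The sketch below keeps only the link texts that are not comments; the two-link fragment and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two links: the first contains a comment, the second contains ordinary text.
fragment = (
    '<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
)
demo_soup = BeautifulSoup(fragment, "html.parser")

# Comment is a subclass of NavigableString, so isinstance() tells the two apart.
visible_texts = [
    a.string for a in demo_soup.find_all("a")
    if not isinstance(a.string, Comment)
]
print(visible_texts)  # ['Lacie']
```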

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
