I had been learning regular expressions, but found that writing a web crawler with them alone gets quite complex. That is where Beautiful Soup comes in.
Simply put, Beautiful Soup is a Python library whose main job is to extract data from web pages.
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application does not take much code.
Installing Beautiful Soup
Install it from the command line with pip:
pip install beautifulsoup4
The installation has succeeded once pip reports that beautifulsoup4 was successfully installed.
Using Beautiful Soup
1. First import the class from the bs4 library
from bs4 import BeautifulSoup
2. Define the HTML content (used by the examples that follow)
The following HTML is used as the example many times. It is a passage from Alice in Wonderland (hereafter "the Alice document"):
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
3. Create a BeautifulSoup object
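# Create a BeautifulSoup object
# ("html.parser" is Python's built-in parser; naming it avoids a parser warning)
soup = BeautifulSoup(html, "html.parser")
# If the HTML content is stored in the file a.html, create the object from the file instead:
# soup = BeautifulSoup(open("a.html"), "html.parser")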
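When the HTML lives in a file, the file-based call can be written so that the file handle is closed after parsing. The following is a minimal sketch, assuming the built-in "html.parser". The temporary file only stands in for a.html, and the sample markup and variable names are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page written to a temporary file (stands in for a.html).
sample = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(sample)
    path = f.name

# Open the file in a with-block so the handle is closed once parsing is done.
with open(path, encoding="utf-8") as fp:
    file_soup = BeautifulSoup(fp, "html.parser")

os.remove(path)  # clean up the temporary file

print(file_soup.title.string)  # Demo
```

The parsed tree stays usable after the file is closed, because Beautiful Soup reads the whole document while the object is being constructed.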
4. Formatted output
# Formatted output
print(soup.prettify())
The result is the same document re-indented, with each tag and each string on its own line.
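As a small illustration of what prettify() does, the sketch below runs it on a single paragraph instead of the whole Alice document. The names compact, small_soup, and pretty are chosen here for the example, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# One paragraph taken from the Alice document, written on a single line.
compact = '<p class="title"><b>The Dormouse\'s story</b></p>'
small_soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a str; nested tags are indented under their parent.
pretty = small_soup.prettify()
print(pretty)
```

Note that prettify() adds whitespace for readability, so it is meant for inspecting a document, not for producing output that must keep the original spacing.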
5. Beautiful Soup transforms a complex HTML document into a tree structure
Each node in the tree is a Python object, and all of these objects fall into 4 types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
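The four types can be seen side by side in a tiny document. This is a minimal sketch, assuming the built-in "html.parser". The one-line markup and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, a string, and a comment.
doc = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

tag = doc.p                # the <p> element
text = doc.p.contents[0]   # the string "text"
note = doc.p.contents[1]   # the comment

for obj in (doc, tag, text, note):
    print(type(obj).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```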
(1) Tag
A Tag corresponds to a tag in the HTML, such as:
<title></title>
<a></a>
<p></p>
...
and every other tag.
Let's see how easily Beautiful Soup retrieves tags:
# Get tags
print(soup.title)
# Output: <title>The Dormouse's story</title>
print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>
print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.p)
# Output: <p class="title"><b>The Dormouse's story</b></p>
Note, however, that this form returns only the first matching tag in the whole document. The output for the <a> tag makes this clear: only the first of the three links is printed.
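To collect every matching tag instead of only the first, Beautiful Soup offers find_all(), which goes slightly beyond what this article covers. The sketch below uses a shortened stand-in for the Alice document. The names links_html, links_soup, first, and all_links are chosen for the example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the Alice document: just the three links.
links_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

first = links_soup.a                   # attribute access: first <a> only
all_links = links_soup.find_all("a")   # every <a>, in document order

print(first["id"])                     # link1
print([a["id"] for a in all_links])    # ['link1', 'link2', 'link3']
```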
We can use type() to check what kind of object we got back:
# Check the data type of a tag
print(type(soup.title))
# Output: <class 'bs4.element.Tag'>
A Tag has two important attributes, name and attrs:
# Inspect a tag's two attributes, name and attrs
print(soup.a.name)
# Output: a
print(soup.a.attrs)
# Output: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
The output above shows that the attrs attribute of the <a> tag is a dictionary. To get a specific value out of it, index it like any other dictionary:
p = soup.a.attrs
print(p['class'])
# print(p.get('class')) is equivalent to the line above
# Output: ['sister']
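A Tag can also be indexed directly, without going through attrs first. The following is a small sketch on a single link. The names a_soup and link are chosen for the example, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

a_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = a_soup.a

print(link["href"])       # http://example.com/elsie
print(link.get("id"))     # link1
print(link.get("title"))  # None, because the attribute is absent
print(link["class"])      # ['sister'], class is multi-valued so it is a list
```

Using get() is the safer choice when an attribute may be missing, since link["title"] would raise a KeyError here.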
(2) NavigableString
We have the tags, so how do we get the content inside them?
# Get the text inside a tag (NavigableString)
print(soup.a.string)
# Output: Elsie
Likewise, we can check its type with type():
print(type(soup.a.string))
# Output: <class 'bs4.element.NavigableString'>
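One detail worth knowing is that .string only works when a tag has a single child. The sketch below, on a made-up paragraph, shows the None result for a tag with mixed content and uses get_text() as one way to collect all the text. The name s_soup is chosen for the example.

```python
from bs4 import BeautifulSoup

s_soup = BeautifulSoup("<p>Once upon a time <b>Elsie</b></p>", "html.parser")

print(s_soup.b.string)      # Elsie
print(s_soup.p.string)      # None, because <p> has two children
print(s_soup.p.get_text())  # Once upon a time Elsie
```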
(3) BeautifulSoup
The soup object itself also has these two attributes, but their values are special:
# Inspect the attributes of the BeautifulSoup object
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {}
(4) Comment
Let's change the first link in the HTML above to look like this (the content of the <a></a> tag is replaced with a comment):
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
The content of the comment can still be extracted with .string:
# Get the text inside the tag
print(soup.a.string)
# Output: Elsie (the comment markers are not included)
Check its type:
print(type(soup.a.string))
# Output: <class 'bs4.element.Comment'>
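Because printing a comment looks the same as printing ordinary text, a type check is one way to tell the two apart before using the value. This is a minimal sketch on a made-up pair of links. The names c_soup and kinds are chosen for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

c_soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>',
    "html.parser",
)

kinds = {}
for a in c_soup.find_all("a"):
    s = a.string
    # Comment is a subclass of NavigableString, so test for Comment first.
    kinds[a["id"]] = "comment" if isinstance(s, Comment) else "text"
    print(a["id"], kinds[a["id"]], s.strip())
```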