Installing and Introducing Python's BeautifulSoup Library


I. Preface

In previous articles I described how to crawl blog posts, Wikipedia infoboxes, and images by parsing HTML source with Python. The related articles are:
[Python learning] Simply crawl the Wikipedia programming-language infobox
[Python learning] A simple web crawler: crawling blog posts, with an introduction to the approach
[Python learning] Simply crawl pictures from an image gallery
The core code is as follows:

```python
# coding=utf-8
import urllib
import re

# Download the static HTML page
url = 'http://www.csdn.net/'
content = urllib.urlopen(url).read()
open('csdn.html', 'w+').write(content)

# Get the title
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M | re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title

# Get the hyperlink text
href = r'<a href=.*?>(.*?)</a>'
m = re.findall(href, content, re.S | re.M)
for text in m:
    print unicode(text, 'utf-8')
    break  # only output one link
```

The output results are as follows:

>>> csdn.net - the world's largest Chinese IT community, providing IT professionals with the most comprehensive information dissemination and service platform. Login >>>

The core code of the image download is as follows:

```python
import os
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()
url = "http://creatim.allyes.com.cn/imedia/csdn/20150228/15_41_49_5b9c9e6a.jpg"
filename = os.path.basename(url)
urllib.urlretrieve(url, filename)
```

But this approach of parsing HTML with regular expressions to crawl site content has many drawbacks, such as:
1. Regular expressions are tied to the literal HTML source rather than to a more abstract structure, so small changes in a page's markup can break the program.
2. The program has to be written against the actual HTML source, and may run into character entities such as &amp; and other HTML features; content such as <span></span>, icon hyperlinks, and subscripts all needs special handling.
3. Regular expressions are not fully readable; as the HTML and the query expressions grow more complex, the code quickly becomes messy.
As Beginning Python (2nd edition) suggests, there are two solutions: the first is to use the tidy program (and its Python bindings) with XHTML parsing, and the second is to use the BeautifulSoup library.
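As a quick illustration of drawback 1, a parser is robust to formatting that defeats a line-oriented regular expression. A minimal sketch in Python 3 with bs4 (the HTML string is invented for illustration):

```python
from bs4 import BeautifulSoup

# A <title> split across two lines: a naive single-line regex would miss it,
# but the parser recovers the text regardless of layout.
html = '<html><head><title>Example\nPage</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string.replace('\n', ' '))  # Example Page
```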


II. Installing and Introducing the Beautiful Soup Library


Beautiful Soup is an HTML/XML parser written in Python that handles non-standard markup and generates a parse tree. It provides simple, common operations for navigating, searching, and modifying the parse tree, and can save you a great deal of programming time.
As the book says: "The bad pages weren't written by you; you're just trying to get some data out of them." Now you don't have to worry about what the HTML looks like; the parser takes care of it for you.
Download address:
https://www.crummy.com/software/BeautifulSoup/
The installation process is as follows: python setup.py install (or simply pip install beautifulsoup4)
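After installing, a quick sanity check can confirm the package works end to end. A minimal Python 3 sketch, assuming Beautiful Soup 4 is importable:

```python
import bs4

# Parse a trivial fragment to confirm the installation
soup = bs4.BeautifulSoup('<b>ok</b>', 'html.parser')
print(soup.b.string)  # ok
```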


For specific usage, the official documentation (which also has a Chinese translation) is recommended.
The official BeautifulSoup usage example is based on "Alice in Wonderland":

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# The "three sisters" document from the official documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc)
print soup.prettify()
```

The output is printed as a standard indented structure, as follows:


Below is a quick introduction to the BeautifulSoup library (for details, see the official documentation).

If you want to get all the text in the article, the code is as follows:

```python
# Get all text content from the document
print soup.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
```
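Related to get_text(), the .stripped_strings generator yields each text fragment with the surrounding whitespace removed. A minimal Python 3 sketch (the two-paragraph HTML is invented):

```python
from bs4 import BeautifulSoup

html = '<p> One </p><p> Two </p>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())              # get_text() keeps the raw whitespace
print(list(soup.stripped_strings))  # ['One', 'Two']
```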

You may also encounter two typical errors during this process:
1. ImportError: No module named BeautifulSoup
After successfully installing the BeautifulSoup 4 library, "from BeautifulSoup import BeautifulSoup" may raise this error.


The reason is that the BeautifulSoup 4 package was renamed to bs4, so it must be imported with "from bs4 import BeautifulSoup".
2. TypeError: an integer is required
You may encounter this error when using "print soup.title.string" to get the value of the title, as follows:


It appears to be an IDLE bug, since the same statement runs without error on the command line (reference: StackOverflow). You can also work around the problem with the following code:
```python
print unicode(soup.title.string)
print str(soup.title.string)
```


III. Common Beautiful Soup Methods


Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All objects can be summed up into four types: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag
A Tag object corresponds to a tag in the XML or HTML document, and it has many methods and attributes. The two most important attributes are name and attrs. Usage is as follows:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '<p class="title" id="start"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html)
tag = soup.p
print tag.name
# p
print tag.attrs
# {'class': ['title'], 'id': 'start'}
```

In BeautifulSoup, every tag has a name, obtained via .name. A tag may have many attributes, which are manipulated in the same way as a dictionary; the full set can be obtained directly through .attrs. Please refer to the documentation for modifying and deleting attributes.
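For the modification and deletion operations the documentation covers, the dictionary analogy carries over directly. A minimal Python 3 sketch (the <p> fragment is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" id="start">text</p>', 'html.parser')
tag = soup.p
tag['id'] = 'intro'  # modify an attribute, dictionary-style
del tag['class']     # delete an attribute
print(tag)
```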
2. NavigableString
Strings are often contained within tags, and Beautiful Soup uses the NavigableString class to wrap them. A NavigableString is the same as a Unicode string in Python and also supports some of the features of navigating and searching the document tree. The unicode() method converts a NavigableString object directly into a Unicode string.

```python
print unicode(tag.string)
# The Dormouse's story
print type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print tag
# <p class="title" id="start"><b>No longer bold</b></p>
```

This gets the string value of tag = soup.p for "<p class="title" id="start"><b>The Dormouse's story</b></p>". The string contained in a tag cannot be edited in place, but it can be replaced with the replace_with() method.
NavigableString objects support most, but not all, of the attributes defined for navigating and searching the document tree. In particular, a string cannot contain other content (whereas a tag can contain a string or another tag), so strings do not support the .contents or .string attributes or the find() method.
If you want to use a NavigableString object outside of Beautiful Soup, call unicode() on it first to convert it to an ordinary Unicode string. Otherwise, even after Beautiful Soup has finished running, the string keeps a reference into the parse tree, which wastes memory.
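In Python 3 the same conversion is done with str() instead of unicode(). A minimal sketch (the <b> fragment is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')
s = soup.b.string
print(type(s).__name__)      # NavigableString
plain = str(s)               # a plain string, with no reference to the parse tree
print(type(plain).__name__)  # str
```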

3. BeautifulSoup object
This object represents the entire content of a document. Most of the time it can be treated as a Tag object: it supports navigating the document tree and most of the methods for searching it.
Note: Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attrs attributes. However, it is sometimes convenient to inspect its .name attribute, so it is given a special value: soup.name returns "[document]".
The other types defined in Beautiful Soup may appear in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like the Comment object, these classes are all subclasses of NavigableString that merely add some extra functionality.
4. Comment
Tag, NavigableString, and BeautifulSoup cover almost all of the content in HTML and XML, but there are a few special leftover pieces to watch out for, chiefly comments. The Comment object is a special type of NavigableString.

```python
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
print type(comment)
# <class 'bs4.element.Comment'>
print unicode(comment)
# Hey, buddy. Want to buy a used parser?
```

Having described these four objects, the following briefly introduces navigating and searching the document tree and the commonly used functions.
5. Navigating the document tree
A tag may contain multiple strings or other tags, all of which are children of that tag. BeautifulSoup provides many operations and traversal properties for child nodes. The examples below use the "Alice" document from the official documentation.
The simplest way to navigate a document is to use the name of the tag you want, as follows:

```python
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
```

Note: Dot access only gets the first tag with the given name. You can also chain the calls deeper into the document tree; for example, soup.body.b gets the first <b> tag inside the <body> tag.
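A minimal Python 3 sketch of this first-match behavior (the two-block HTML is invented):

```python
from bs4 import BeautifulSoup

html = '<p><b>one</b></p><div><b>two</b></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.b.string)      # one: dot access stops at the first <b> in the document
print(soup.div.b.string)  # two: chaining narrows the search to the <div>
```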
If you want to get all the <a> tags, use the find_all() method. In the earlier articles on crawling Wikipedia and other HTML, we often combined it with regular expressions.

```python
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
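find_all() also accepts a compiled regular expression as a filter, which is how the "find_all() plus regex" combination mentioned above typically looks. A minimal Python 3 sketch reusing the example.com links:

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie">Elsie</a>'
        '<a href="http://example.com/lacie">Lacie</a>')
soup = BeautifulSoup(html, 'html.parser')
# Only anchors whose href matches the pattern are returned
links = soup.find_all('a', href=re.compile(r'lacie'))
print([a['href'] for a in links])  # ['http://example.com/lacie']
```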

Child nodes: a tag's child nodes are available after HTML parsing, and the tag's .contents property outputs them as a list. Strings do not have a .contents property, because a string has no children.

```python
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag.contents
# [u'The Dormouse's story']
```

Through the tag's .children generator, you can loop over the tag's child nodes:

```python
for child in title_tag.children:
    print(child)
# The Dormouse's story
```

Descendant nodes: similarly, the .descendants property lets you recursively loop over all of a tag's descendants:

```python
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```
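The difference between .children and .descendants can be seen by counting them on the same tag. A minimal Python 3 sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                     'html.parser')
head_tag = soup.head
print(len(list(head_tag.children)))     # 1: only the <title> tag
print(len(list(head_tag.descendants)))  # 2: the <title> tag plus its inner string
```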

Parent node: get an element's parent through the .parent property. In the example "Alice" document, the <head> tag is the parent of the <title> tag. Note: the parent of a top-level tag such as <html> is the BeautifulSoup object itself, and the .parent of the BeautifulSoup object is None.

```python
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
```

Sibling nodes: because the <b> and <c> tags are at the same level (they are children of the same element), <b> and <c> are called sibling nodes. When a document is printed in standard format, sibling nodes have the same indentation level. You can also use this relationship in your code.

```python
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
```

Within the document tree, use the .next_sibling and .previous_sibling properties to query sibling nodes. The <b> tag has a .next_sibling but no .previous_sibling, because it is the first of its siblings; likewise, the <c> tag has a .previous_sibling but no .next_sibling:

```python
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
```
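There are also plural forms, .next_siblings and .previous_siblings, which iterate over all following or preceding siblings at once. A minimal Python 3 sketch (the three-child markup is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>1</b><c>2</c><d>3</d></a>', 'html.parser')
# .next_siblings walks every sibling after <b> in document order
sibs = [str(s) for s in soup.b.next_siblings]
print(sibs)  # ['<c>2</c>', '<d>3</d>']
```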

This basic introduction is enough to crawl web content with the BeautifulSoup library; for modifying and deleting page content, I suggest you read the documentation. The next article will crawl Wikipedia's programming-language content again. I hope this article is helpful; if there are errors or shortcomings, please bear with me. I recommend reading the official documentation and the Beginning Python book.
(By: Eastmount, 2015-3-25, 6 p.m.)

