This article uses BeautifulSoup 3. The current release is BeautifulSoup 4, whose package name changed to bs4.
(1) Download and install
    # Download and install BeautifulSoup
    pip install BeautifulSoup
Alternatively, you can download the installation package and install it manually.
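For readers on the newer release, here is a minimal sketch of an import that works with either version. It assumes BeautifulSoup 4 was installed separately with `pip install beautifulsoup4`, a step this article does not cover.

```python
# Prefer BeautifulSoup 4 (package name bs4); fall back to BeautifulSoup 3,
# which only runs on Python 2.
try:
    from bs4 import BeautifulSoup
except ImportError:
    from BeautifulSoup import BeautifulSoup

print(BeautifulSoup.__name__)
```

Both packages expose a class named BeautifulSoup, so the rest of a script can often stay unchanged, although some attribute names differ between the two versions.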
(2) Quick Start
    # BeautifulSoup quick start
    import urllib2
    from BeautifulSoup import BeautifulSoup

    html_doc = urllib2.urlopen('http://baike.baidu.com/view/1059363.htm')
    soup = BeautifulSoup(html_doc)
    print soup.title
Results:
    # BeautifulSoup result
    <title>前门大街_百度百科</title>
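The same quick start can be tried without network access. The sketch below is not the article's BeautifulSoup 3 code: it assumes BeautifulSoup 4 (bs4) on Python 3, and the inline `html_doc` string is a hypothetical stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page standing in for the document fetched with urllib2.
html_doc = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(html_doc, "html.parser")
title_tag = soup.title
print(title_tag)
```

As in the article's example, `soup.title` returns the whole `<title>` tag, not only its text.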
(3) BeautifulSoup Object Introduction
BeautifulSoup mainly contains three types of objects:
- BeautifulSoup.BeautifulSoup
- BeautifulSoup.Tag
- BeautifulSoup.NavigableString
The following example illustrates these three types:
    # BeautifulSoup example
    from BeautifulSoup import BeautifulSoup
    import urllib2

    html_doc = urllib2.urlopen('http://www.baidu.com')
    soup = BeautifulSoup(html_doc)

    print type(soup)
    print type(soup.title)
    print type(soup.title.string)

    print soup.title
    print soup.title.string
Result:
    # BeautifulSoup example result
    <class 'BeautifulSoup.BeautifulSoup'>
    <class 'BeautifulSoup.Tag'>
    <class 'BeautifulSoup.NavigableString'>
    <title>百度一下,你就知道</title>
    百度一下,你就知道
The example above shows clearly that BeautifulSoup mainly includes three kinds of objects:
- BeautifulSoup.BeautifulSoup // BeautifulSoup object
- BeautifulSoup.Tag // Tag object
- BeautifulSoup.NavigableString // navigable string (text) object
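For comparison, the three object types can be inspected the same way under BeautifulSoup 4. This sketch assumes bs4 on Python 3 and a hypothetical inline document. In bs4 the classes live in the `bs4` and `bs4.element` modules, not in a `BeautifulSoup` module.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical inline document, used in place of a downloaded page.
html_doc = "<html><head><title>Example page</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)               # BeautifulSoup
print(type(soup.title).__name__)         # Tag
print(type(soup.title.string).__name__)  # NavigableString

# The soup object is itself a Tag subclass, which is why tags and the
# whole document support the same navigation attributes.
print(isinstance(soup, Tag))             # True
```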
(4) BeautifulSoup parse tree
1. BeautifulSoup.Tag object methods
Get a Tag object: access a tag with dot notation
    # BeautifulSoup example
    title = soup.title
    print type(title.contents)
    print title.contents
    print title.contents[0]

    # BeautifulSoup example result
    <type 'list'>
    [u'\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053']
    百度一下,你就知道
contents: gets the list of children of the current tag. If the tag has no child tags, only text, then the string attribute and contents[0] return the same content, as the example above shows.
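The relationship between contents and string can be checked on a small document. This is a sketch under assumptions: it uses BeautifulSoup 4 (bs4) on Python 3, not the article's BeautifulSoup 3, and the inline HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, not the page fetched in the article.
html_doc = (
    "<html><head><title>Example page</title></head>"
    "<body><p>one<b>two</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title
print(title.contents)                     # list holding a single text node
print(title.contents[0] == title.string)  # True: the tag has only one child

p = soup.p
print(len(p.contents))  # 2: the text "one" and the <b> tag
print(p.string)         # None: more than one child, so no single string
```

When a tag mixes text with child tags, `string` is `None`, so `contents` is the safer way to see everything inside it.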
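next, parent: next gets the element that follows the current tag in parse order, which is its first child when the tag has children; parent gets the tag that encloses the current tag.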
    # BeautifulSoup example
    html = soup.html
    print html.next
    print ''
    print html.next.next
    print html.next.next.nextSibling

    # BeautifulSoup example result
    <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
    <meta content="always" name="referrer"/>
    <meta name="theme-color" content="#2932e1"/>
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
    <link rel="icon" sizes="any" mask="mask" href="//www.baidu.com/img/baidu.svg"/>
    <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//t1.baidu.com"/>
    <link rel="dns-prefetch" href="//t2.baidu.com"/>
    <link rel="dns-prefetch" href="//t3.baidu.com"/>
    <link rel="dns-prefetch" href="//t10.baidu.com"/>
    <link rel="dns-prefetch" href="//t11.baidu.com"/>
    <link rel="dns-prefetch" href="//t12.baidu.com"/>
    <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
    <title>百度一下,你就知道</title>......</head>

    <meta http-equiv="content-type" content="text/html;charset=utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
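The same walk can be reproduced on a small document. The sketch below assumes BeautifulSoup 4 (bs4) on Python 3, where the attribute is spelled `next_element`, not `next` as in the article's BeautifulSoup 3 code. The inline HTML is a hypothetical stand-in for the Baidu page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document written without whitespace between tags,
# so that no whitespace text nodes appear in the tree.
html_doc = (
    "<html><head><meta charset='utf-8'/>"
    "<title>Example page</title></head><body></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

html = soup.html
head = html.next_element        # first child of <html>
first_meta = head.next_element  # first child of <head>

print(head.name)               # head
print(first_meta.name)         # meta
print(first_meta.parent.name)  # head
```

Because `next_element` follows parse order, stepping forward from a tag with children always descends into it, while `parent` climbs back out.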
nextSibling, previousSibling: get the next sibling tag and the previous sibling tag of the current tag.
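Sibling navigation can be sketched as follows, assuming BeautifulSoup 4 (bs4) on Python 3, where these attributes are spelled `next_sibling` and `previous_sibling`. The list markup is hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
first, second, third = soup.find_all("li")

print(second.previous_sibling)  # <li>a</li>
print(second.next_sibling)      # <li>c</li>
print(first.previous_sibling)   # None: there is nothing before the first item

# With line breaks between tags, the nearest sibling is a whitespace string,
# so find_next_sibling() is used to skip ahead to the next <li> tag.
spaced = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
item = spaced.li
print(item.next_sibling == "\n")     # True
print(item.find_next_sibling("li"))  # <li>b</li>
```

Real pages usually contain whitespace between tags, which is why a sibling lookup may return a text node where a tag was expected.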
BeautifulSoup Study Notes