Beautiful Soup
Official introduction to Beautiful Soup:
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.
Official website: https://www.crummy.com/software/BeautifulSoup/
1. Installation
Find "cmd.exe" in "C:\Windows\System32", run it as Administrator, and enter "pip install beautifulsoup4" on the command line.
If pip warns that its version is too old, upgrade it with "python -m pip install --upgrade pip".
C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
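The same check can be made from inside Python. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of pip or Beautiful Soup.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be located for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The beautifulsoup4 package is imported under the module name "bs4".
    print("bs4 installed:", is_installed("bs4"))
```

If this prints False, running the pip command above again in the same Python environment is the usual next step.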
Testing the Beautiful Soup installation:
Demo HTML page address: http://www.cnblogs.com/yan-lei
>>> import requests
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> r.text
'\r\n\r\n\r\n\r\n\r\n\r\nPython学习者 - 博客园\r\n......
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Python学习者 - 博客园......
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
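The installation can also be tested without a network connection. The sketch below parses a small inline page instead of a live request; the HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small hypothetical page, used in place of a live HTTP response.
demo = "<html><head><title>Demo page</title></head><body><p>data</p></body></html>"

soup = BeautifulSoup(demo, "html.parser")
title_text = soup.title.string  # text inside <title>
first_p_text = soup.p.string    # text inside the first <p>

print(title_text)
print(first_p_text)
```

If both lines print without an ImportError, the library is installed and the built-in parser works.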
2. Using the Beautiful Soup library
Take HTML as an example: every HTML file is organized as a set of tags enclosed in "<>", and the nesting of those tags forms a top-down tag tree. The Beautiful Soup library is a library for parsing, traversing, and maintaining this tag tree.
<p>...</p>: a tag
- Name: the tag name, which usually appears as an opening and closing pair
- Attributes: zero or more
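The top-down tag tree described above can be walked through parent, child, and sibling links. This is a minimal sketch on a made-up two-paragraph document; the markup is an assumption chosen to keep the tree small.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between tags, so the tree
# contains only tag nodes and their text.
html = '<html><body><p class="title">Hello</p><p>World</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                                  # first <p> in document order
parent_name = first_p.parent.name                 # the tag that encloses it
child_names = [child.name for child in soup.body.children]
sibling_text = first_p.next_sibling.string        # text of the following <p>

print(parent_name, child_names, sibling_text)
```

In real pages the whitespace between tags appears as extra text nodes among the children, so the lists are usually longer than in this sketch.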
Importing the Beautiful Soup library
The Beautiful Soup library is also known as beautifulsoup4 or bs4. By convention it is imported in one of the following ways; most of the time only the BeautifulSoup class is needed.
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
The tag tree is converted into a BeautifulSoup object, so an HTML document, its tag tree, and the BeautifulSoup object can be treated as equivalent.
from bs4 import BeautifulSoup
soup1 = BeautifulSoup("data", "html.parser")
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
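Both ways of building the object, from a string and from an open file, can be tried side by side. In this sketch a temporary file stands in for a saved page such as demo.html; the markup and the file location are assumptions.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Build the object directly from a string.
soup1 = BeautifulSoup(markup, "html.parser")

# Build the object from an open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, encoding="utf-8") as f:
        soup2 = BeautifulSoup(f, "html.parser")

print(str(soup1) == str(soup2))
```

Opening the file in a with block closes the handle once parsing is done, which a bare open() call inside the constructor does not guarantee.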
Beautiful Soup library parsers
soup = BeautifulSoup('data', 'html.parser')
Parser | How to use | Requirement
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | Install the bs4 library
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml
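Because the lxml parsers need a separate install, a script may want to fall back to the built-in parser when lxml is missing. The sketch below shows one possible approach; the helper name `make_soup` is hypothetical, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser (for example lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from the same fragment (lxml wraps it in html and body tags, for instance), so it is safer to pick one parser for code whose results must be reproducible.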
newsoup = BeautifulSoup("<!--This is a comment-->", "html.parser")
Basic elements of the BeautifulSoup class
Basic element | Description
Tag | A tag, the most basic unit of information organization; <> and </> mark its beginning and end
Name | The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes | The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString | The non-attribute string inside the tag, that is, the text between <> and </>. Format: <tag>.string
Comment | The comment portion of the string inside a tag, represented by the special Comment type
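The elements in the table can be inspected on a small document. The following is a minimal sketch; the markup, including the class and id attributes, is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one ordinary paragraph and one tag holding a comment.
html = '<p class="title" id="intro">Hello</p><b><!--This is a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)          # the tag's name
print(tag.attrs)         # the tag's attributes as a dictionary
print(type(tag.string))  # NavigableString for ordinary text

comment = soup.b.string
print(type(comment))     # Comment for the text of an HTML comment
```

Since .string returns the comment text without its <!-- --> markers, checking the type with isinstance(value, Comment) is one way to tell comments apart from ordinary text.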
Python Web Crawler and Information Extraction (II) -- BeautifulSoup