Python web crawler and information extraction (2) -- BeautifulSoup
BeautifulSoup official introduction:
Beautiful Soup is a Python library for extracting data from HTML and XML files. It provides the usual ways of navigating, searching, and modifying a document, working on top of your favorite parser.
https://www.crummy.com/software/BeautifulSoup/
Install BeautifulSoup
Find "cmd.exe" in "C: \ Windows \ System32", run it as an administrator, and enter "pip install beautifulsoup4" in the command line.
C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
If pip warns that its version is too low, upgrade it with: python -m pip install --upgrade pip
Beautiful Soup library installation test:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
Demo HTML page address: http://www.cnblogs.com/yan-lei
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Use of Beautiful Soup Library
Take HTML as an example: an HTML document is organized as a set of <> tags that form parent-child relationships, i.e. a tag tree. BeautifulSoup is a library for parsing, traversing, and maintaining this tag tree.
<p>...</p>: a tag
- The tag name (here 'p') appears in the opening and closing pair.
- Attributes: a tag has 0 or more attributes.
Importing the Beautiful Soup library
The Beautiful Soup library is also known as beautifulsoup4 or bs4. The conventional way to import it is as follows, i.e. the BeautifulSoup class is what gets used:
from bs4 import BeautifulSoup
import bs4
Beautiful Soup class
Parsing converts the tag tree into a BeautifulSoup object; in this sense the HTML document, the tag tree, and the BeautifulSoup object can be treated as equivalent.
from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<html>data</html>", "html.parser")   # parse from a string; any HTML markup works here
Or parse from a file: soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lei\AppData\Local\Programs\Python\Python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg\bs4\__init__.py", line 191, in __init__
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
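The error comes from open() using the system default encoding (gbk on a Chinese Windows installation) while the file itself is UTF-8. A minimal workaround, assuming the demo file above is UTF-8 encoded:

from bs4 import BeautifulSoup

# Either state the encoding explicitly...
soup2 = BeautifulSoup(open("D://demo.html", encoding="utf-8"), "html.parser")

# ...or open the file in binary mode and let BeautifulSoup detect the encoding.
soup2 = BeautifulSoup(open("D://demo.html", "rb"), "html.parser")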
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
Beautiful Soup library parser
Parser | Usage | Condition (installation)
--- | --- | ---
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | install the bs4 library
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml
html5lib's parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib
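A small sketch of switching parsers on the same markup; only 'html.parser' ships with Python and bs4, and the others assume the extra installs listed in the table, so they are left commented out here:

from bs4 import BeautifulSoup

markup = "<a><b /></a>"
print(BeautifulSoup(markup, "html.parser"))   # built-in HTML parser, no extra install
# Requires `pip install lxml`:
# print(BeautifulSoup(markup, "lxml"))
# print(BeautifulSoup(markup, "xml"))
# Requires `pip install html5lib`:
# print(BeautifulSoup(markup, "html5lib"))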
Basic elements of the Beautiful Soup class
Basic element | Description
--- | ---
Tag | A tag, the most basic unit of information organization; <> and </> mark its start and end
Name | The tag's name; for <p>...</p> it is 'p'. Format: <tag>.name
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString | The non-attribute string inside a tag, i.e. the text between <>...</>. Format: <tag>.string
Comment | A comment string inside a tag, a special Comment type
- Tag: any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several identical <tag> elements, soup.<tag> returns the first one.
- Tag name: every <tag> has a name, obtained via <tag>.name; it is a string.
- Tag attrs (attributes): a <tag> can have 0 or more attributes; <tag>.attrs is a dictionary.
- NavigableString: the non-attribute string inside a tag, <tag>.string; it can span multiple levels of tags.
- Comment: comments inside a tag come back as a special Comment type (see the sketch after the session below).
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.cnblogs.com/yan-lei/')
>>> html = r.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.title
<title>Python learner - blog</title>
>>> soup.a
<a name="top"></a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'body'
>>> soup.a.attrs
{'name': 'top'}
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> type(soup.a.attrs)
<class 'dict'>
>>> soup.h1.string
'Python learner'
>>> type(soup.h1.string)
<class 'bs4.element.NavigableString'>
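The session above does not show the Comment type. A minimal sketch with a made-up snippet (not from the demo page): an HTML comment comes back as Comment with its <!-- --> markers stripped, while ordinary text comes back as NavigableString, so the type has to be checked explicitly.

from bs4 import BeautifulSoup

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string)          # This is a comment
print(type(newsoup.b.string))    # <class 'bs4.element.Comment'>
print(newsoup.p.string)          # This is not a comment
print(type(newsoup.p.string))    # <class 'bs4.element.NavigableString'>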
Traversing HTML content with the bs4 library
In HTML, the <>...</> tags form parent-child relationships that make up a tree of tags. The tree can be traversed in three directions: downward, upward, and sideways (among siblings).
Use the following HTML file for testing: E:\BeautifulSoupTest.html
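The test file itself is not reproduced in the post. A made-up page with a broadly similar structure (UTF-8 meta tag, a title, a header <div>, a few nested tags) is enough to follow the traversal examples below; the element names and ids here are assumptions, not the author's actual file:

<html>
<head>
    <meta charset="UTF-8">
    <title>BeautifulSoup</title>
</head>
<body>
    <div id="header">
        <p><img src="logo.png"><a href="#top">top</a></p>
    </div>
    <div id="content">
        <p>Some sample text with a <a href="http://example.com/">link</a>.</p>
    </div>
</body>
</html>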
Downward traversal of the tag tree
Attribute | Description
--- | ---
.contents | A list of child nodes; all children of <tag> are stored in the list
.children | An iterator over the child nodes, similar to .contents, for looping over the children
.descendants | An iterator over all descendant nodes, for looping over the whole subtree
The BeautifulSoup object itself is the root node of the tag tree.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('E:\\BeautifulSoupTest.html', 'rb'), 'html.parser')
>>> soup.head.contents              # returns a list
['\n', <meta charset="UTF-8"><title>BeautifulSoup</title></meta>]
>>> len(soup.body.contents)
9
>>> for child in soup.body.children:    # iterate over the direct children
...     print(child)
...
<div id="header">
for child in soup.body.children:        # iterate over the direct children
    print(child)
for child in soup.body.descendants:     # iterate over all descendants
    print(child)
Upward traversal of the tag tree
Attribute | Description
--- | ---
.parent | The parent tag of the node
.parents | An iterator over the node's ancestor tags, for looping over the ancestors
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
img
div
body
html
[document]
# .parents walks through all ancestor nodes, including the document (soup) itself,
# so the two cases have to be distinguished
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Parallel (sibling) traversal of the tag tree
Attribute | Description
--- | ---
.next_sibling | The next sibling node tag, in HTML text order
.previous_sibling | The previous sibling node tag, in HTML text order
.next_siblings | An iterator over all following sibling node tags, in HTML text order
.previous_siblings | An iterator over all preceding sibling node tags, in HTML text order
* All parallel traversal occurs between nodes under the same parent node.
soup.div.next_sibling           # the sibling tag right after the <div>
soup.div.previous_sibling       # the sibling tag right before the <div>
# iterate over the following siblings
for sibling in soup.div.next_siblings:
    print(sibling)
# iterate over the preceding siblings
for sibling in soup.div.previous_siblings:
    print(sibling)
HTML output with the bs4 library
The prettify() method of the bs4 library
.prettify() prints the HTML with a newline ('\n') after each tag and each piece of text.
.prettify() can also be called on an individual tag: <tag>.prettify()
print(soup.prettify())
The bs4 library converts any HTML input to UTF-8. Python 3.x handles UTF-8 strings by default, so the parsed output can be printed and processed without extra conversion.
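For example, a made-up snippet with non-ASCII text shows both points: .prettify() on a single tag, and the UTF-8 output printing cleanly under Python 3.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")
print(soup.p.prettify())    # the tag, its text, and the closing tag each on their own line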
Information markup:
- Marked-up information forms an organizational structure and adds a dimension to the information.
- Marked-up information can be communicated, stored, and displayed.
- The markup is as valuable as the information it structures.
- Marked-up information is easier for programs to understand and use.
HTML as information markup:
HTML is the information organization format of the World Wide Web (WWW).
HTML organizes different types of information with predefined <>...</> tags.
XML eXtensible Markup Language
XML is a common information format developed based on HTML.
- Basic format: <name>...</name>
- Empty-element shorthand: <name/>
- Comment format: <!-- ... -->
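A made-up instance, just to show the three forms together:

<person id="1">
    <name>example</name>
    <contact />
    <!-- this is a comment -->
</person>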
JSON JavaScript Object Notation
JSON uses typed key-value pairs: key : value
A value in double quotes ("...") is a string; a value without quotes (for example a number) is not a string.
YAML Ain't Markup Language
YAML uses untyped key-value pairs: key : value
- Indentation expresses nesting (which key a value belongs to).
- A leading '-' expresses parallel (sibling) items.
- '|' introduces a literal block of text.
- '#' starts a comment.
key : value
key : #Comment
    -value1
    -value2
key :
    subkey : subvalue
Comparison of the three information markup formats:
XML is the earliest general-purpose information markup language; it is highly extensible but verbose. Typical use: information exchange and transmission on the Internet.
JSON has typed values and is well suited to processing by programs (especially JavaScript); it is more concise than XML but has no comments. Typical use: communication between mobile applications and cloud services.
YAML values are untyped; it has the highest proportion of effective text, is very readable, and supports comments. Typical use: configuration files for all kinds of systems.
General information-extraction method 1: completely parse the markup of the information, then extract the key information.
This requires a parser for the markup format (XML, JSON, or YAML), for example the tag-tree traversal provided by the bs4 library.
Advantage: the parsed information is accurate.
Disadvantage: the extraction procedure is cumbersome and slow.
Method 2: ignore the markup and search the text directly for the key information.
This requires a text search function over the information.
Advantage: the extraction procedure is simple and fast.
Disadvantage: the accuracy of the results depends on the content of the information.
Method 3 (fusion): combine markup parsing and text search to extract the key information; this needs both a parser and a text search function.
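As a concrete illustration of the fusion approach (a common pattern, not code from the original post): parse the demo page into a tag tree, then search it for all <a> tags with find_all(), described next, and read each link's href attribute.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnblogs.com/yan-lei/")
soup = BeautifulSoup(r.text, "html.parser")
# markup parsing (tag tree) + search (find_all) + attribute extraction
for link in soup.find_all("a"):
    print(link.get("href"))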
Searching HTML content with the bs4 library: <>.find_all(name, attrs, recursive, string, **kwargs)
find_all() returns a list holding the search results.
- name: a search string (or list of strings) matched against tag names.
- attrs: a search string matched against tag attribute values; attribute searches can also be annotated this way.
- recursive: whether to search all descendants; defaults to True.
- string: a search string matched against the string content between <>...</>.
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
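A few illustrative calls against the demo page (the exact output depends on the page's current content, and the class and id values used below are only assumptions):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnblogs.com/yan-lei/")
soup = BeautifulSoup(r.text, "html.parser")

print(soup.find_all("a"))                   # name: all <a> tags
print(soup.find_all(["a", "b"]))            # a list searches several tag names at once
print(soup.find_all("div", "header"))       # attrs: <div> tags whose class includes "header" (assumed class)
print(soup.find_all(id="link1"))            # keyword argument: tags with id="link1" (assumed id)
print(soup.find_all("a", recursive=False))  # only direct children of the document root
print(soup.find_all(string="Python"))       # string: text nodes equal to "Python"
print(soup("a"))                            # soup(...) shorthand for soup.find_all(...)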
Extension methods
Method | Description
--- | ---
<>.find() | Searches and returns only the first result; same parameters as .find_all()
<>.find_parents() | Searches among ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent() | Returns the first result among ancestor nodes; same parameters as .find()
<>.find_next_siblings() | Searches among following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling() | Returns the first result among following sibling nodes; same parameters as .find()
<>.find_previous_siblings() | Searches among preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling() | Returns the first result among preceding sibling nodes; same parameters as .find()
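A short sketch of the single-result variants, reusing the soup from the earlier examples; which tags actually exist depends on the page, so the results are only illustrative:

first_a = soup.find("a")                     # first <a> tag, or None if there is none
if first_a is not None:
    print(first_a.find_parent("div"))        # nearest enclosing <div> ancestor
    print(first_a.find_next_sibling())       # the sibling immediately after it
    print(first_a.find_previous_siblings())  # list of all preceding siblings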