Python web crawler and information extraction (2) -- BeautifulSoup
BeautifulSoup official introduction:
Beautiful Soup is a Python library for extracting data from HTML and XML files. It provides the usual ways of navigating, searching, and modifying a document, working on top of your favorite parser.
https://www.crummy.com/software/BeautifulSoup/
Install BeautifulSoup
Find "cmd.exe" in "C: \ Windows \ System32", run it as an administrator, and enter "pip install beautifulsoup4" in the command line.
C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
If pip warns that its version is too low, upgrade it with: python -m pip install --upgrade pip
Beautiful Soup library installation test:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
Demo HTML page address: http://www.cnblogs.com/yan-lei
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Use of Beautiful Soup Library
Take HTML as an example: an HTML document is organized as a set of <> tags that form parent-child relationships, i.e. a tag tree. BeautifulSoup is a library for parsing, traversing, and maintaining this tag tree.
<p>...</p>: a tag
- The tag name (here 'p') appears in the opening and closing pair.
- Attributes: a tag has 0 or more attributes.
Importing the Beautiful Soup library
The Beautiful Soup library is also known as beautifulsoup4 or bs4. The conventional way to import it is as follows, i.e. the BeautifulSoup class is what gets used:
from bs4 import BeautifulSoup
import bs4
Beautiful Soup class
Parsing converts the tag tree into a BeautifulSoup object; in this sense the HTML document, the tag tree, and the BeautifulSoup object can be treated as equivalent.
from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<html>data</html>", "html.parser")   # parse from a string; any HTML markup works here
Or parse from a file: soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lei\AppData\Local\Programs\Python\Python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg\bs4\__init__.py", line 191, in __init__
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
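The error comes from open() using the system default encoding (gbk on a Chinese Windows installation) while the file itself is UTF-8. A minimal workaround, assuming the demo file above is UTF-8 encoded:

from bs4 import BeautifulSoup

# Either state the encoding explicitly...
soup2 = BeautifulSoup(open("D://demo.html", encoding="utf-8"), "html.parser")

# ...or open the file in binary mode and let BeautifulSoup detect the encoding.
soup2 = BeautifulSoup(open("D://demo.html", "rb"), "html.parser")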
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
Beautiful Soup library parser
Parser | Usage | Condition (installation)
--- | --- | ---
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | install the bs4 library
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml
html5lib's parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib
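A small sketch of switching parsers on the same markup; only 'html.parser' ships with Python and bs4, and the others assume the extra installs listed in the table, so they are left commented out here:

from bs4 import BeautifulSoup

markup = "<a><b /></a>"
print(BeautifulSoup(markup, "html.parser"))   # built-in HTML parser, no extra install
# Requires `pip install lxml`:
# print(BeautifulSoup(markup, "lxml"))
# print(BeautifulSoup(markup, "xml"))
# Requires `pip install html5lib`:
# print(BeautifulSoup(markup, "html5lib"))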
Basic elements of the Beautiful Soup class
Basic element | Description
--- | ---
Tag | A tag, the most basic unit of information organization; <> and </> mark its start and end
Name | The tag's name; for <p>...</p> it is 'p'. Format: <tag>.name
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString | The non-attribute string inside a tag, i.e. the text between <>...</>. Format: <tag>.string
Comment | A comment string inside a tag, a special Comment type
- Tag: any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several identical <tag> elements, soup.<tag> returns the first one.
- Tag name: every <tag> has a name, obtained via <tag>.name; it is a string.
- Tag attrs (attributes): a <tag> can have 0 or more attributes; <tag>.attrs is a dictionary.
- NavigableString: the non-attribute string inside a tag, <tag>.string; it can span multiple levels of tags.
- Comment: comments inside a tag come back as a special Comment type (see the sketch after the session below).
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.cnblogs.com/yan-lei/')
>>> html = r.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.title
<title>Python learner - blog</title>
>>> soup.a
<a name="top"></a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'body'
>>> soup.a.attrs
{'name': 'top'}
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> type(soup.a.attrs)
<class 'dict'>
>>> soup.h1.string
'Python learner'
>>> type(soup.h1.string)
<class 'bs4.element.NavigableString'>
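The session above does not show the Comment type. A minimal sketch with a made-up snippet (not from the demo page): an HTML comment comes back as Comment with its <!-- --> markers stripped, while ordinary text comes back as NavigableString, so the type has to be checked explicitly.

from bs4 import BeautifulSoup

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string)          # This is a comment
print(type(newsoup.b.string))    # <class 'bs4.element.Comment'>
print(newsoup.p.string)          # This is not a comment
print(type(newsoup.p.string))    # <class 'bs4.element.NavigableString'>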
Traversing HTML content with the bs4 library
In HTML, the <>...</> tags form parent-child relationships that make up a tree of tags. The tree can be traversed in three directions: downward, upward, and sideways (among siblings).
Use the following HTML file for testing: E:\BeautifulSoupTest.html
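The test file itself is not reproduced in the post. A made-up page with a broadly similar structure (UTF-8 meta tag, a title, a header <div>, a few nested tags) is enough to follow the traversal examples below; the element names and ids here are assumptions, not the author's actual file:

<html>
<head>
    <meta charset="UTF-8">
    <title>BeautifulSoup</title>
</head>
<body>
    <div id="header">
        <p><img src="logo.png"><a href="#top">top</a></p>
    </div>
    <div id="content">
        <p>Some sample text with a <a href="http://example.com/">link</a>.</p>
    </div>
</body>
</html>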
Downward traversal of the tag tree
Attribute | Description
--- | ---
.contents | A list of child nodes; all children of <tag> are stored in the list
.children | An iterator over the child nodes, similar to .contents, for looping over the children
.descendants | An iterator over all descendant nodes, for looping over the whole subtree
The BeautifulSoup object itself is the root node of the tag tree.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('E:\\BeautifulSoupTest.html', 'rb'), 'html.parser')
>>> soup.head.contents              # returns a list
['\n', <meta charset="UTF-8"><title>BeautifulSoup</title></meta>]
>>> len(soup.body.contents)
9
>>> for child in soup.body.children:    # iterate over the direct children
...     print(child)
...
<div id="header">
for child in soup.body.children:        # iterate over the direct children
    print(child)
for child in soup.body.descendants:     # iterate over all descendants
    print(child)
Upward traversal of the tag tree
Attribute | Description
--- | ---
.parent | The parent tag of the node
.parents | An iterator over the node's ancestor tags, for looping over the ancestors
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
img
div
body
html
[document]
# .parents walks through all ancestor nodes, including the document (soup) itself,
# so the two cases have to be distinguished
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Parallel (sibling) traversal of the tag tree
Attribute | Description
--- | ---
.next_sibling | The next sibling node tag, in HTML text order
.previous_sibling | The previous sibling node tag, in HTML text order
.next_siblings | An iterator over all following sibling node tags, in HTML text order
.previous_siblings | An iterator over all preceding sibling node tags, in HTML text order
* All parallel traversal occurs between nodes under the same parent node.
soup.div.next_sibling           # the sibling tag right after the <div>
soup.div.previous_sibling       # the sibling tag right before the <div>
# iterate over the following siblings
for sibling in soup.div.next_siblings:
    print(sibling)
# iterate over the preceding siblings
for sibling in soup.div.previous_siblings:
    print(sibling)
HTML output with the bs4 library
The prettify() method of the bs4 library
.prettify() prints the HTML with a newline ('\n') after each tag and each piece of text.
.prettify() can also be called on an individual tag: <tag>.prettify()
print(soup.prettify())
The bs4 library converts any HTML input to UTF-8. Python 3.x handles UTF-8 strings by default, so the parsed output can be printed and processed without extra conversion.
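For example, a made-up snippet with non-ASCII text shows both points: .prettify() on a single tag, and the UTF-8 output printing cleanly under Python 3.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")
print(soup.p.prettify())    # the tag, its text, and the closing tag each on their own line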
Information markup:
- Marked-up information forms an organizational structure and adds a dimension to the information.
- Marked-up information can be communicated, stored, and displayed.
- The markup is as valuable as the information it structures.
- Marked-up information is easier for programs to understand and use.
HTML as information markup:
HTML is the information organization format of the World Wide Web (WWW).
HTML organizes different types of information with predefined <>...</> tags.
XML eXtensible Markup Language
XML is a common information format developed based on HTML.
- Basic format: <name>...</name>
- Empty-element shorthand: <name/>
- Comment format: <!-- ... -->
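A made-up instance, just to show the three forms together:

<person id="1">
    <name>example</name>
    <contact />
    <!-- this is a comment -->
</person>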
JSON JavaScript Object Notation
JSON uses typed key-value pairs: key : value
A value in double quotes ("...") is a string; a value without quotes (for example a number) is not a string.
YAML Ain't Markup Language
YAML uses untyped key-value pairs: key : value
- Indentation expresses nesting (which key a value belongs to).
- A leading '-' expresses parallel (sibling) items.
- '|' introduces a literal block of text.
- '#' starts a comment.
key : value
key : #Comment
    -value1
    -value2
key :
    subkey : subvalue
Comparison of the three information markup formats:
XML is the earliest general-purpose information markup language; it is highly extensible but verbose. Typical use: information exchange and transmission on the Internet.
JSON has typed values and is well suited to processing by programs (especially JavaScript); it is more concise than XML but has no comments. Typical use: communication between mobile applications and cloud services.
YAML values are untyped; it has the highest proportion of effective text, is very readable, and supports comments. Typical use: configuration files for all kinds of systems.
General information-extraction method 1: completely parse the markup of the information, then extract the key information.
This requires a parser for the markup format (XML, JSON, or YAML), for example the tag-tree traversal provided by the bs4 library.
Advantage: the parsed information is accurate.
Disadvantage: the extraction procedure is cumbersome and slow.
Method 2: ignore the markup and search the text directly for the key information.
This requires a text search function over the information.
Advantage: the extraction procedure is simple and fast.
Disadvantage: the accuracy of the results depends on the content of the information.
Method 3 (fusion): combine markup parsing and text search to extract the key information; this needs both a parser and a text search function.
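As a concrete illustration of the fusion approach (a common pattern, not code from the original post): parse the demo page into a tag tree, then search it for all <a> tags with find_all(), described next, and read each link's href attribute.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnblogs.com/yan-lei/")
soup = BeautifulSoup(r.text, "html.parser")
# markup parsing (tag tree) + search (find_all) + attribute extraction
for link in soup.find_all("a"):
    print(link.get("href"))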
Searching HTML content with the bs4 library: <>.find_all(name, attrs, recursive, string, **kwargs)
find_all() returns a list holding the search results.
- name: a search string (or list of strings) matched against tag names.
- attrs: a search string matched against tag attribute values; attribute searches can also be annotated this way.
- recursive: whether to search all descendants; defaults to True.
- string: a search string matched against the string content between <>...</>.
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
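A few illustrative calls against the demo page (the exact output depends on the page's current content, and the class and id values used below are only assumptions):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnblogs.com/yan-lei/")
soup = BeautifulSoup(r.text, "html.parser")

print(soup.find_all("a"))                   # name: all <a> tags
print(soup.find_all(["a", "b"]))            # a list searches several tag names at once
print(soup.find_all("div", "header"))       # attrs: <div> tags whose class includes "header" (assumed class)
print(soup.find_all(id="link1"))            # keyword argument: tags with id="link1" (assumed id)
print(soup.find_all("a", recursive=False))  # only direct children of the document root
print(soup.find_all(string="Python"))       # string: text nodes equal to "Python"
print(soup("a"))                            # soup(...) shorthand for soup.find_all(...)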
Extension methods
Method | Description
--- | ---
<>.find() | Searches and returns only the first result; same parameters as .find_all()
<>.find_parents() | Searches among ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent() | Returns the first result among ancestor nodes; same parameters as .find()
<>.find_next_siblings() | Searches among following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling() | Returns the first result among following sibling nodes; same parameters as .find()
<>.find_previous_siblings() | Searches among preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling() | Returns the first result among preceding sibling nodes; same parameters as .find()
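A short sketch of the single-result variants, reusing the soup from the earlier examples; which tags actually exist depends on the page, so the results are only illustrative:

first_a = soup.find("a")                     # first <a> tag, or None if there is none
if first_a is not None:
    print(first_a.find_parent("div"))        # nearest enclosing <div> ancestor
    print(first_a.find_next_sibling())       # the sibling immediately after it
    print(first_a.find_previous_siblings())  # list of all preceding siblings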