Python web crawler and information extraction (2) -- BeautifulSoup

Source: Internet
Author: User
Tags: python, web crawler


BeautifulSoup official introduction:

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document tree.

https://www.crummy.com/software/BeautifulSoup/

Install BeautifulSoup

Find "cmd.exe" in "C: \ Windows \ System32", run it as an administrator, and enter "pip install beautifulsoup4" in the command line.

C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

If pip reports that its version is too old, upgrade it with python -m pip install --upgrade pip.

Beautiful Soup library installation test:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

Demo HTML page address: http://www.cnblogs.com/yan-lei

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Use of Beautiful Soup Library

Take HTML as an example. Any HTML document is organized as a set of "<>" tags that nest inside one another and form a tag tree. BeautifulSoup is a library for parsing, traversing, and maintaining this tag tree.

<p>...</p> : Tag

  • Tag names usually come in an opening/closing pair.
  • A tag has 0 or more attributes (Attributes).
Importing the Beautiful Soup library

The Beautiful Soup library is also known as beautifulsoup4 or bs4. The conventional way to import it is shown below; in practice only the BeautifulSoup class is used.

from bs4 import BeautifulSoup
import bs4
Beautiful Soup class

Parsing converts the tag tree into a BeautifulSoup object, so the HTML document, the tag tree, and the BeautifulSoup object can be treated as equivalent views of the same content.

from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<html>data</html>", "html.parser")   # any HTML string works here

Using soup2 = BeautifulSoup(open("D://demo.html"), "html.parser") raises an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lei\AppData\Local\Programs\Python\Python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg\bs4\__init__.py", line 191, in __init__
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
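The error occurs because open() uses the Windows default encoding (GBK here) while the file was saved in another encoding. Two common workarounds, sketched under the assumption that demo.html is UTF-8 encoded:

from bs4 import BeautifulSoup

# Workaround 1: tell open() the file's encoding explicitly (assumes the file is UTF-8)
soup2 = BeautifulSoup(open("D://demo.html", encoding="utf-8"), "html.parser")

# Workaround 2: open in binary mode and let BeautifulSoup detect the encoding itself
soup3 = BeautifulSoup(open("D://demo.html", "rb"), "html.parser")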

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.

Beautiful Soup library parsers
Parser              Usage                               Condition
bs4 HTML parser     BeautifulSoup(mk, 'html.parser')    install the bs4 library
lxml HTML parser    BeautifulSoup(mk, 'lxml')           pip install lxml
lxml XML parser     BeautifulSoup(mk, 'xml')            pip install lxml
html5lib parser     BeautifulSoup(mk, 'html5lib')       pip install html5lib
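A minimal sketch of the same markup handed to two of these parsers; the second line assumes lxml has already been installed with pip install lxml:

from bs4 import BeautifulSoup

markup = "<p class='title'>data</p>"
print(BeautifulSoup(markup, "html.parser").p)   # built-in bs4 HTML parser
print(BeautifulSoup(markup, "lxml").p)          # lxml HTML parser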
Basic elements of the BeautifulSoup class
Basic element     Description
Tag               A tag, the most basic unit of information organization; <> marks the start and </> the end
Name              The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes        The attributes of a tag, organized as a dictionary. Format: <tag>.attrs
NavigableString   The non-attribute string inside a tag, i.e. the text between <>...</>. Format: <tag>.string
Comment           A comment string inside a tag, a special Comment type
  • Tag: any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several identical <tag> elements, soup.<tag> returns the first one.
  • Tag name: every <tag> has a name, obtained with <tag>.name; it is a string.
  • Tag attrs (attributes): a <tag> can have 0 or more attributes; <tag>.attrs is a dictionary.
  • NavigableString: the string of a tag, <tag>.string, can span multiple levels of tags.
  • Comment: a comment inside a tag is a special Comment type.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.cnblogs.com/yan-lei/')
>>> html = r.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.title
<title>Python learner-blog</title>
>>> soup.a
<a name="top"></a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'body'
>>> soup.a.attrs
{'name': 'top'}
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> type(soup.a.attrs)
<class 'dict'>
>>> soup.h1.string
'Python learner'
>>> type(soup.h1.string)
<class 'bs4.element.NavigableString'>
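The Comment type is easiest to see with a small stand-alone snippet (a sketch, not taken from the demo page above):

>>> from bs4 import BeautifulSoup
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>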
HTML content traversal based on the bs4 library

In HTML, tags nest inside one another, which gives the document a tree structure. There are three ways to traverse this tag tree: downward, upward, and sideways (between siblings).

The examples below use a local HTML file for testing: E:\BeautifulSoupTest.html

Downward traversal of the tag tree

  
Attribute       Description
.contents       List of child nodes; all the children of <tag> stored in a list
.children       Iterator over the child nodes, similar to .contents; used to loop over the children
.descendants    Iterator over all descendant nodes, containing every descendant; used for loop traversal

The BeautifulSoup object is the root node of the tag tree.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('E:\\BeautifulSoupTest.html', 'rb'), 'html.parser')
>>> soup.head.contents   # returns a list
['\n', <meta charset="UTF-8"><title>BeautifulSoup</title></meta>]
>>> len(soup.body.contents)
9
>>> for child in soup.body.children:   # iterate over the child nodes
...     print(child)
...
<div id="header">

for child in soup.body.children:       # iterate over the direct children
    print(child)

for child in soup.body.descendants:    # iterate over all descendants
    print(child)
Upward traversal of the tag tree
Attribute   Description
.parent     The parent tag of a node
.parents    Iterator over a node's ancestor tags; used to loop over the ancestors
>>> for parent in soup.a.parents:
...     if parent is None:
...             print(parent)
...     else:
...             print(parent.name)
...
p
img
div
body
html
[document]
# .parents includes every ancestor up to the soup object itself, whose own
# parent is None, so that case has to be handled separately
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Sideways traversal of the tag tree (sibling nodes)
Attribute            Description
.next_sibling        Returns the next sibling node tag in HTML text order
.previous_sibling    Returns the previous sibling node tag in HTML text order
.next_siblings       Iterator; returns all following sibling node tags in HTML text order
.previous_siblings   Iterator; returns all preceding sibling node tags in HTML text order

* Sideways traversal always takes place among nodes that share the same parent.

# the next sibling tag of the div tag
soup.div.next_sibling
# the previous sibling tag of the div tag
soup.div.previous_sibling

# iterate over the following sibling nodes
for sibling in soup.div.next_siblings:
    print(sibling)

# iterate over the preceding sibling nodes
for sibling in soup.div.previous_siblings:
    print(sibling)
HTML output based on bs4 Library

The prettify() method of the bs4 library

.prettify() adds a newline ('\n') after each tag and its content, producing readable HTML text.

.prettify() can also be applied to a single tag: <tag>.prettify()

print(soup.prettify())

The bs4 library converts any HTML input to UTF-8 encoding. Since Python 3.x uses UTF-8 by default, this makes the parsed output easy to work with.
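For example, a minimal sketch of prettify() applied to a single tag (the markup here is only an illustration, not the demo page):

>>> from bs4 import BeautifulSoup
>>> tagsoup = BeautifulSoup("<p><a href='#top'>link</a>data</p>", "html.parser")
>>> print(tagsoup.a.prettify())
<a href="#top">
 link
</a>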

Information markup:
  • Marked-up information forms an organizational structure, adding a dimension to the information.
  • Marked-up information can be communicated, stored, or displayed.
  • The markup structure is as valuable as the information itself.
  • Marked-up information is easier for programs to understand and use.
HTML information markup:

HTML is the information organization format of the WWW (World Wide Web).

HTML organizes different types of information using predefined <>... </> tags.

XML eXtensible Markup Language

XML is a general-purpose information format that developed out of HTML.

  • Basic XML form: <name>...</name>
  • Abbreviated form for empty elements: <name/>
  • Comment form: <!-- ... -->
JSON JavaScript Object Notation

Typed key-value pairs: key : value

A value wrapped in double quotes ("...") is a string; a value without quotes is numeric.
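As a quick illustration, Python's json module shows how the quoting determines the type (a minimal sketch with made-up keys):

>>> import json
>>> obj = json.loads('{"name": "BeautifulSoup", "version": 4}')
>>> type(obj["name"])
<class 'str'>
>>> type(obj["version"])
<class 'int'>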

YAML Ain't Markup Language

Untyped key-value pairs: key : value

Indentation expresses the nesting (ownership) relationship.

  • - expresses parallel (sibling) items
  • | denotes a whole block of literal text
  • # marks a comment
key : value
key :  # Comment
 - value1
 - value2
key :
    subkey : subvalue
Comparison of the three information markup formats:

XML is the earliest general-purpose information markup language; it is highly extensible but verbose. Typical use: information exchange and transmission on the Internet.

JSON has typed values and is well suited to program processing (especially JavaScript); it is more concise than XML but has no comments. Typical use: communication between the cloud and the nodes of mobile applications.

YAML values are untyped, the proportion of plain text is the highest, readability is good, and comments are allowed. Typical use: configuration files for all kinds of systems.

General method 1 of information extraction: completely parse the marked-up form of the information, then extract the key information.

This applies to markup in XML, JSON, or YAML.

It requires a markup parser, for example the tag-tree traversal provided by the bs4 library.

Advantage: accurate parsing of the information.

Disadvantage: The extraction process is cumbersome and slow.

Method 2: ignore the markup and search the text directly for the key information.

Search

Only a text search function over the information is required.

Advantage: the extraction process is simple and fast.

Disadvantage: the accuracy of the results depends on the content of the information.

Method 3: Fusion

Combine structural (markup) parsing with text search to extract the key information.
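A typical fusion task is extracting all URL links from a page: parsing finds the <a> tags (using the find_all() method described in the next section), and a text search filters their href values. A rough sketch against the demo page used earlier; the 'http' filter is only an illustrative assumption:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnblogs.com/yan-lei/")
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.find_all("a"):      # parsing: every <a> tag in the tag tree
    href = link.get("href", "")
    if "http" in href:               # searching: keep only hrefs that contain 'http'
        print(href)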

Searching HTML content with the bs4 library: <>.find_all(name, attrs, recursive, string, **kwargs)

It returns a list that stores the search results.

  • name: a retrieval string for the tag name.
  • attrs: a retrieval string for tag attribute values; it can be used to restrict the search to tag attributes.
  • recursive: whether to search all descendants; the default is True.
  • string: a retrieval string for the text content between <>...</>.

<tag>(...) is equivalent to <tag>.find_all(...)

soup(...) is equivalent to soup.find_all(...)

>>> soup.div()
[
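A few sketched examples of the find_all() parameters, continuing with the soup object parsed earlier; the tag names and the 'header' id are assumptions about the demo page, not guaranteed matches:

import re

soup.find_all('a')                             # search by tag name
soup.find_all(['a', 'b'])                      # several tag names at once
soup.find_all('div', attrs={'id': 'header'})   # search by attribute value
soup.find_all('a', recursive=False)            # only direct children of soup
soup.find_all(string=re.compile('Python'))     # search the strings between tags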

Extension methods

Method                        Description
<>.find()                     Searches and returns only one result; same parameters as .find_all()
<>.find_parents()             Searches among the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()              Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()       Searches among the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()        Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()   Searches among the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()    Returns one result from the preceding sibling nodes; same parameters as .find()
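A minimal sketch of a few of these extension methods, again reusing the soup object from the examples above; which tags actually exist on the page is an assumption:

first_a = soup.find('a')                      # one Tag (or None) instead of a list
if first_a is not None:
    print(first_a.find_parent('div'))         # the nearest enclosing <div>, if any
    print(first_a.find_next_sibling())        # the sibling immediately after it
    print(first_a.find_previous_siblings())   # all preceding siblings, as a list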
