Python web crawler and information extraction (II): Beautiful Soup

Source: Internet
Author: User
Tags xml parser python web crawler

Beautiful Soup

BeautifulSoup Official Introduction:

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.
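A minimal sketch of the three operations named in the introduction (navigating, searching, modifying). The markup and the variable names are made up for illustration and do not come from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<html><head><title>Demo</title></head><body><a href="/old">link</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Navigate: reach a tag by attribute access.
page_title = soup.title.string

# Search: collect every <a> tag in the document.
links = soup.find_all("a")

# Modify: change an attribute in place.
links[0]["href"] = "/new"

print(page_title)
print(soup.a)
```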

Official website: https://www.crummy.com/software/BeautifulSoup/

1. Installation

Find "cmd.exe" in "C:\Windows\System32", run it as Administrator, and enter "pip install beautifulsoup4" on the command line.

Tip: if pip reports that its version is too old, upgrade it with python -m pip install --upgrade pip.

C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Testing the Beautiful Soup installation:

Demo HTML page address: http://www.cnblogs.com/yan-lei

>>> import requests
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> r.text
'\r\n\r\n\r\n\r\n\r\n\r\nPython学习者 - 博客园\r\n......
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Python学习者 - 博客园......

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Data</p>', 'html.parser')
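If the demo page is unreachable, the installation can also be checked offline. The following is a small sketch under that assumption: it parses an inline string instead of a downloaded page and prints the installed version. The variable names are illustrative.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a one-tag document; this needs no network access.
soup = BeautifulSoup("<p>Data</p>", "html.parser")

installed_version = bs4.__version__  # version string of the installed package
paragraph_text = soup.p.string

print("bs4 version:", installed_version)
print("parsed text:", paragraph_text)
```

If both lines print without an ImportError, the library is installed and the built-in parser works.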

2. Using the Beautiful Soup library

Take HTML as an example: every HTML file is organized as a set of tags enclosed in "<>", and the nesting of these tags forms a top-down tag tree. The Beautiful Soup library is a library for parsing, traversing, and maintaining this tag tree.

<p>...</p>: a tag

    • Name: the tag name, which usually appears in pairs (opening and closing)
    • Attributes: zero or more attributes
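The tag tree described above can be made visible with a short traversal. This is only a sketch: the markup and the walk helper are hypothetical and not part of the original article. It prints each tag's name and attributes, indented by nesting depth.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, used only for illustration.
html = '<html><body><p class="title"><b>Hello</b></p><p>World</p></body></html>'
soup = BeautifulSoup(html, "html.parser")


def walk(node, depth=0, lines=None):
    """Collect one indented line per tag, depth-first."""
    if lines is None:
        lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep only tags
            lines.append("  " * depth + f"{child.name} {child.attrs}")
            walk(child, depth + 1, lines)
    return lines


tree_lines = walk(soup)
print("\n".join(tree_lines))
```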

Importing the Beautiful Soup library

The Beautiful Soup library is also known as beautifulsoup4 or bs4. By convention it is imported in one of the following ways; most of the time only the BeautifulSoup class is needed.

from bs4 import BeautifulSoup
import bs4
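A brief sketch of when each import style is useful: the class import is enough for parsing, while importing the package itself gives access to the element types, for example in type checks. The variable names are illustrative.

```python
import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Data</p>", "html.parser")

# `from bs4 import BeautifulSoup` covers parsing;
# `import bs4` gives access to the element types for type checks.
is_tag = isinstance(soup.p, bs4.element.Tag)
is_string = isinstance(soup.p.string, bs4.element.NavigableString)

print(is_tag, is_string)
```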

The BeautifulSoup class

Parsing a document converts its tag tree into a BeautifulSoup object, so the HTML document, its tag tree, and the BeautifulSoup object can be treated as equivalent.

from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<p>Data</p>", "html.parser")
soup2 = BeautifulSoup(open("d://demo.html"), "html.parser")

A BeautifulSoup object corresponds to the entire contents of an HTML/XML document.
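The two construction styles, from a string and from an open file, can be compared in a self-contained way. This sketch assumes a temporary file as a stand-in for demo.html, and it closes the file handle explicitly; all names are illustrative.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>Data</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from an open file handle; a temporary file stands in for demo.html.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(markup)
    tmp_path = tmp.name

try:
    with open(tmp_path, encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")
finally:
    os.remove(tmp_path)  # clean up the temporary file

same_document = str(soup_from_string) == str(soup_from_file)
print(same_document)
```

Using a with block for the file avoids leaving the handle open, which the one-line open(...) form does not guarantee.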

Beautiful Soup parsers

soup = BeautifulSoup('<p>Data</p>', 'html.parser')

Parser                          Usage                               Requirement
Python's built-in HTML parser   BeautifulSoup(mk, 'html.parser')    Install the bs4 library
lxml's HTML parser              BeautifulSoup(mk, 'lxml')           pip install lxml
lxml's XML parser               BeautifulSoup(mk, 'xml')            pip install lxml
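Because the lxml parsers need a separate install, a script may want to fall back to the built-in parser when lxml is missing. The sketch below shows one way to do that; make_soup is a hypothetical helper, not part of bs4, and it relies on the FeatureNotFound exception that bs4 raises for an unavailable parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred), preferred
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, parser_used = make_soup("<p>Data</p>")
print(parser_used, soup.p.string)
```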

newsoup = BeautifulSoup("This is a comment", "html.parser")

Basic elements of the Beautiful Soup class

Basic element      Description
Tag                A tag, the most basic unit of information organization; opened with <> and closed with </>
Name               The tag's name; for <p>...</p> the name is 'p'. Format: <tag>.name
Attributes         The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString    The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment            The comment part of a string inside a tag; a special Comment type
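The elements in the table can be inspected on a small document. This is a sketch with assumed markup (a comment inside a b tag and plain text inside a p tag); the markup and variable names are illustrative, not taken from the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup, used only for illustration.
markup = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                 # Tag
tag_name = tag.name          # Name
tag_attrs = tag.attrs        # Attributes, a dict
text = tag.string            # NavigableString
comment = soup.b.string      # Comment

print(tag_name, tag_attrs)
print(type(text).__name__, type(comment).__name__)

# Comment is a subclass of NavigableString, so check for Comment first
# when the two need to be told apart.
print(isinstance(comment, Comment), isinstance(comment, NavigableString))
```

Note that .string returns the comment text without the comment markers, so checking the type is the only way to tell a comment from ordinary text.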
