Python Web Crawler and Information Extraction (II): Beautiful Soup

Last Update:2017-09-30 Source: Internet

Author: User

Tags xml parser python web crawler

Beautiful Soup

Official introduction to Beautiful Soup:

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.

Official website: https://www.crummy.com/software/BeautifulSoup/

1. Installation

Find "cmd.exe" in "C:\Windows\System32", run it as Administrator, and enter "pip install beautifulsoup4" on the command line.

Tip: if pip reports that its version is too old, upgrade it with "python -m pip install --upgrade pip".

C:\Windows\system32>pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in c:\users\lei\appdata\local\programs\python\python35\lib\site-packages\beautifulsoup4-4.5.0-py3.5.egg
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Testing the Beautiful Soup installation:

Demo HTML page address: http://www.cnblogs.com/yan-lei

>>> import requests
>>> r = requests.get("http://www.cnblogs.com/yan-lei/")
>>> r.text
'\r\n\r\n\r\n\r\n\r\n\r\nPython学习者 - 博客园\r\n......
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup
Python学习者 - 博客园......

from bs4 import BeautifulSoup
soup = BeautifulSoup('Data', 'html.parser')
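The installation check above depends on a live web page. As a minimal offline sketch of the same check, the snippet below parses a small hard-coded HTML string; the string `demo` and its contents are made up for illustration and are not taken from the page used above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, so no network access is needed.
demo = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(demo, "html.parser")

# If the library is installed correctly, the title text can be read back.
title_text = soup.title.string
print(title_text)

# prettify() prints the parsed tag tree with one tag per line.
print(soup.prettify())
```

If the import fails, the library is not installed for the interpreter you are running.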

2. Using the Beautiful Soup Library

Take HTML as an example: any HTML file is organized as a set of "<>" tags, and these tags nest in top-down relationships to form a tag tree. The BeautifulSoup library is a library of functions for parsing, traversing, and maintaining this tag tree.

<p> ... </p>: a tag

Name: the tag name (here "p"), which usually appears in pairs, once in the opening tag and once in the closing tag
Attributes: zero or more attributes
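To make the name and attribute parts of a tag concrete, here is a small sketch. The markup, including the `class` and `id` values, is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> tag with two attributes and a nested <b> tag with none.
html = '<p class="title" id="intro">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # first <p> tag in the tree
tag_name = tag.name           # the tag's name
tag_attrs = tag.attrs         # the tag's attributes as a dictionary
empty_attrs = soup.b.attrs    # a tag with zero attributes gives an empty dictionary

print(tag_name)
print(tag_attrs)
print(empty_attrs)
```

Note that Beautiful Soup treats `class` as a multi-valued attribute, so its value is a list rather than a plain string.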

Importing the Beautiful Soup Library

The Beautiful Soup library is also known as beautifulsoup4 or bs4. By convention it is imported in one of the following ways; most code mainly uses the BeautifulSoup class.

from bs4 import BeautifulSoup
import bs4

The BeautifulSoup Class

The tag tree is converted into a BeautifulSoup object; from that point on, the HTML document, the tag tree, and the BeautifulSoup object can be treated as equivalent.

from bs4 import BeautifulSoup
soup1 = BeautifulSoup("Data", "html.parser")
soup2 = BeautifulSoup(open("d://demo.html"), "html.parser")

A BeautifulSoup object corresponds to the entire contents of an HTML/XML document.
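The two ways of building a BeautifulSoup object, from a string and from an open file, can be sketched as follows. To keep the example runnable anywhere, it writes a temporary demo.html instead of assuming that "d://demo.html" exists; the markup is illustrative only.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup; any HTML document would do.
markup = "<html><body><p>Data</p></body></html>"

# 1) Build the object directly from a string.
soup1 = BeautifulSoup(markup, "html.parser")

# 2) Build the object from an open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    # The with-statement closes the file once parsing is done.
    with open(path, encoding="utf-8") as f:
        soup2 = BeautifulSoup(f, "html.parser")

# Both objects represent the same whole document.
same_text = soup1.p.string == soup2.p.string
print(same_text)
```

Opening the file in a with-statement avoids leaving the handle open, which the one-line open(...) form does not guarantee.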

Beautiful Soup Parsers

soup = BeautifulSoup('Data', 'html.parser')

Parser	Usage	Requirement
bs4 HTML parser	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
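Because the lxml-based parsers in the table need a separate install, one possible pattern is to try the preferred parser and fall back to 'html.parser' when it is missing. The helper name `make_soup` is hypothetical and not part of bs4; the sketch relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser (for example lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Data</p>")
text = soup.p.string
print(text)
```

Different parsers may build slightly different trees from incomplete markup (lxml, for instance, wraps a fragment in html and body tags), so a fallback like this is best kept to cases where that difference does not matter.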

newsoup = BeautifulSoup("This is a comment", "html.parser")

Basic Elements of the BeautifulSoup Class

Basic element	Description
Tag	A tag, the most basic unit of information organization; <> and </> mark its beginning and end
Name	The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special string type (Comment)
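The elements in the table above can be inspected on a small document. This is a minimal sketch: the markup is invented for the example, and it places a comment inside a <b> tag so that the difference between an ordinary string and a Comment is visible.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup: one tag holding only a comment, one holding plain text.
markup = "<b><!--This is a comment--></b><p class='note'>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

is_tag = isinstance(soup.p, Tag)   # Tag: the basic unit of the tree
p_name = soup.p.name               # Name: <tag>.name
p_attrs = soup.p.attrs             # Attributes: <tag>.attrs, a dictionary

p_string = soup.p.string           # NavigableString: <tag>.string
b_string = soup.b.string           # here .string returns a Comment object

is_comment = isinstance(b_string, Comment)
p_is_comment = isinstance(p_string, Comment)

print(p_name, p_attrs)
print(type(p_string).__name__, type(b_string).__name__)
```

Since .string returns the comment text without the comment markers, checking the type with isinstance is a simple way to tell comments apart from ordinary text when extracting content.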
