Python crawler tool: the BeautifulSoup library

Source: Internet
Author: User
Tags python web crawler

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you.

BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree". (Traversal means visiting each node in the tree exactly once along a search route.) Home page: https://www.crummy.com/software/BeautifulSoup

The BeautifulSoup library is commonly referred to as bs4, after its package name. It is imported with from bs4 import BeautifulSoup; in other words, what you mainly use is the BeautifulSoup class from the bs4 package.

Parsers for the bs4 library
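
The second argument of the BeautifulSoup constructor names the parser. As a minimal sketch (the HTML string below is made up for illustration), the following uses Python's built-in 'html.parser', which needs no extra installation; 'lxml' and 'html5lib' are alternatives that have to be installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = '<html><body><p class="intro">Hello</p></body></html>'

# 'html.parser' ships with Python; 'lxml' and 'html5lib' are separate installs.
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p         # the first <p> tag in the document
p_text = first_p.string  # the text inside that tag
print(first_p.name, p_text)
```

Swapping 'html.parser' for 'lxml' leaves the rest of the code unchanged, which is why the parser choice is usually made once, at construction time.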

Basic elements of the BeautifulSoup class

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('http://www.pmcaff.com/site/selection')
    soup = BeautifulSoup(res.text, 'lxml')

    # Any tag in the HTML can be accessed as soup.<tag>. When the document
    # contains several tags with the same name, soup.<tag> returns the first one.
    print(soup.a)

    # Every tag has a name, available as <tag>.name (a string).
    print(soup.a.name)

    # A tag may have one or more attributes, stored in the dictionary <tag>.attrs.
    print(soup.a.attrs)
    print(soup.a.attrs['class'])

    # <tag>.string returns the non-attribute string inside the tag.
    print(soup.a.string)

    # A comment is a special string type; it is also returned by <tag>.string.
    soup1 = BeautifulSoup('<p><!--Here is the comment--></p>', 'lxml')
    print(soup1.p.string)
    print(type(soup1.p.string))

Running result:

    <a class="no-login" href="">Login</a>
    a
    {'href': '', 'class': ['no-login']}
    ['no-login']
    Login
    Here is the comment
    <class 'bs4.element.Comment'>
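
Because .string returns a comment just like ordinary text, it can be useful to tell the two apart. The sketch below reproduces the elements above offline, on a made-up HTML string instead of the live page, and adds a hypothetical helper, visible_text, that ignores comments. It assumes the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up markup that mirrors the structure of the output above.
html = ('<div><a class="no-login" href="">Login</a>'
        '<p><!--Here is the comment--></p></div>')
soup = BeautifulSoup(html, 'html.parser')

link = soup.a
tag_name = link.name                # the tag's name
link_classes = link.attrs['class']  # class is multi-valued, so it is a list
link_text = link.string             # the text inside the tag
comment = soup.p.string             # a Comment object, not ordinary text


def visible_text(tag):
    """Hypothetical helper: return tag.string, or None if it is a comment."""
    s = tag.string
    return None if isinstance(s, Comment) else s


print(tag_name, link_classes, link_text)
print(visible_text(link), visible_text(soup.p))
```

Checking with isinstance against Comment is one way to keep comments out of extracted text; filtering by type(...) as in the output above works as well.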

HTML content traversal in the bs4 library

Basic HTML Structure

Downward traversal of the tag tree

The BeautifulSoup object is the root node of the tag tree.

    # Iterate over the direct children of <body>
    for child in soup.body.children:
        print(child.name)

    # Iterate over all descendants of <body>
    for child in soup.body.descendants:
        print(child.name)
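
The difference between the two iterators is easiest to see on a small document. The following is a minimal sketch on a made-up HTML string, assuming the built-in 'html.parser'; text nodes are skipped with an isinstance check so that only tags are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document: <body> has two direct children, <div> and <p>.
html = '<html><body><div><a href="#">link</a></div><p>text</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# .children yields only the direct children of <body>.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_names)       # ['div', 'p']
print(descendant_names)  # ['div', 'a', 'p']
```

The nested a tag appears only in the second list, because it is a descendant of body but not a direct child.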

Upward traversal of the tag tree

    # soup.a.parents yields every ancestor of the tag, ending with the
    # BeautifulSoup object itself, so the loop guards with if...else
    # before reading the name.
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)

Running result:

    div
    div
    body
    html
    [document]
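
The same ancestor chain can be reproduced offline. This sketch uses a made-up HTML string with the same nesting as the output above and assumes the built-in 'html.parser'. The last entry, '[document]', is the name of the BeautifulSoup object itself, whose own parent is None.

```python
from bs4 import BeautifulSoup

# Made-up document with the nesting html > body > div > div > a.
html = '<html><body><div><div><a href="#">link</a></div></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Collect the names of all ancestors of the first <a> tag, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]

print(ancestor_names)  # ['div', 'div', 'body', 'html', '[document]']
print(soup.parent)     # None: the root has no parent
```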

Parallel (sibling) traversal of the tag tree

    # Iterate over the siblings that follow the tag
    for sibling in soup.a.next_siblings:
        print(sibling)

    # Iterate over the siblings that precede the tag
    for sibling in soup.a.previous_siblings:
        print(sibling)
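
Note the naming: next_siblings and previous_siblings (plural) are iterators over all siblings in one direction, while next_sibling and previous_sibling (singular) return a single node or None. A minimal sketch on a made-up HTML string, assuming the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Made-up paragraph with four sibling tags and no text between them.
html = '<p><b>one</b><a href="#">two</a><i>three</i><u>four</u></p>'
soup = BeautifulSoup(html, 'html.parser')

# Plural attributes iterate over every sibling in one direction.
next_names = [s.name for s in soup.a.next_siblings]
prev_names = [s.name for s in soup.a.previous_siblings]

# Singular attributes return just the adjacent node, or None at the edge.
single_next = soup.a.next_sibling
first_prev = soup.b.previous_sibling

print(next_names)   # ['i', 'u']
print(prev_names)   # ['b']
print(single_next)  # <i>three</i>
print(first_prev)   # None
```

In real pages the siblings of a tag are often text nodes (for example the whitespace between tags), so sibling loops usually need to handle strings as well as tags.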

The prettify() method of the bs4 library

The prettify() method reformats the parsed document with consistent line breaks and indentation. It is called as soup.prettify(). In PyCharm, use print(soup.prettify()) to display the result.
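
As a minimal sketch of what prettify() does, the following formats a made-up one-line HTML string, assuming the built-in 'html.parser'. Each tag and each string ends up on its own line, indented according to its depth.

```python
from bs4 import BeautifulSoup

# Made-up HTML written on a single line.
html = '<div><p>Hello</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns a string; it does not modify the tree.
pretty = soup.prettify()
print(pretty)

# Individual tags can be prettified too, not only the whole document.
pretty_p = soup.p.prettify()
```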

Operating Environment: Mac, Python 3.6, PyCharm 2016.2

Reference: the MOOC course "Python Web Crawler and Information Extraction" (Chinese University MOOC)

Author: du wangdan, Internet product manager
