Python crawler tool: the BeautifulSoup library
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you.
BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree". (Traversal means visiting each node in the tree exactly once along a search route.) Home page: https://www.crummy.com/software/BeautifulSoup
BeautifulSoup is often referred to as bs4, after its package name. The usual import statement is from bs4 import BeautifulSoup; in other words, what you mainly use is the BeautifulSoup class from the bs4 package.
Parsers for the bs4 library
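The parser is selected by the second argument passed to BeautifulSoup. The following is a minimal sketch, assuming only Python's built-in 'html.parser' is available; the sample markup is made up for illustration. The other parser names listed in the comments ('lxml', 'xml', 'html5lib') are accepted by bs4 only when the corresponding package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><body><p class='intro'>Hello, <b>bs4</b>!</p></body></html>"

# 'html.parser' ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
first_p_text = soup.p.get_text()
print(first_p_text)

# Other parser names bs4 accepts, if the matching package is installed:
#   BeautifulSoup(html, "lxml")      # needs: pip install lxml
#   BeautifulSoup(html, "xml")       # lxml's XML parser, also needs lxml
#   BeautifulSoup(html, "html5lib")  # needs: pip install html5lib
```

Different parsers can build slightly different trees from malformed markup, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.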
Basic elements of the BeautifulSoup class
import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.pmcaff.com/site/selection')
soup = BeautifulSoup(res.text, 'lxml')

# Any tag in the HTML can be accessed as soup.<tag>. When the document
# contains several identical tags, soup.<tag> returns the first one.
print(soup.a)

# Every tag has a name, available as <tag>.name (a string).
print(soup.a.name)

# A tag may have zero or more attributes, stored in a dictionary.
print(soup.a.attrs)
print(soup.a.attrs['class'])

# <tag>.string returns the non-attribute text inside the tag.
print(soup.a.string)

# A comment is a special type; it is also retrieved with <tag>.string.
soup1 = BeautifulSoup('<p><!--Here is the comment--></p>', 'lxml')
print(soup1.p.string)
print(type(soup1.p.string))
Running result:
<a class="no-login" href="">Login</a>
a
{'href': '', 'class': ['no-login']}
['no-login']
Login
Here is the comment
<class 'bs4.element.Comment'>
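Because <tag>.string returns a comment in the same way it returns ordinary text, code that only wants visible text usually has to check the type. The sketch below is one way to do that, assuming the built-in 'html.parser' and a made-up two-paragraph snippet; the helper name describe is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one paragraph holds a comment, the other plain text.
html = "<p><!--Here is the comment--></p><p>Here is the text</p>"
soup = BeautifulSoup(html, "html.parser")


def describe(node):
    """Label a .string value as a comment, plain text, or missing."""
    # Comment is a subclass of NavigableString, so test for it first.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "none"


kinds = [describe(p.string) for p in soup.find_all("p")]
print(kinds)
```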
Traversing HTML content with the bs4 library
Basic HTML Structure
Downward traversal of the tag tree
The BeautifulSoup object is the root node of the tag tree.
# Iterate over the direct children of <body>
for child in soup.body.children:
    print(child.name)

# Iterate over all descendants of <body>
for child in soup.body.descendants:
    print(child.name)
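The difference between .children and .descendants is easier to see on a small document. This is a sketch on made-up markup, assuming the built-in 'html.parser'; text nodes are filtered out with an isinstance check so that only tag names are collected.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with two levels of nesting under <body>.
html = "<body><div><p>one</p><p>two</p></div><a href='#'>link</a></body>"
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of <body>.
child_names = [child.name for child in soup.body.children]

# .descendants walks the whole subtree, text nodes included,
# so keep only the Tag objects.
descendant_names = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

print(child_names)       # ['div', 'a']
print(descendant_names)  # ['div', 'p', 'p', 'a']
```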
Upward traversal of the tag tree
# Iterating over .parents yields every ancestor, up to and including
# the soup object itself, hence the if/else check.
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Running result:
div
div
body
html
[document]
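The same ancestor chain can be reproduced without a network request. The sketch below uses made-up markup that mirrors the nesting in the output above and assumes the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> nested inside two <div> elements.
html = "<html><body><div><div><a href=''>Login</a></div></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the immediate parent up to the BeautifulSoup object,
# whose name is the special string '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['div', 'div', 'body', 'html', '[document]']
```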
Parallel (sibling) traversal of the tag tree
# Iterate over the siblings that follow the tag
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the siblings that precede the tag
for sibling in soup.a.previous_siblings:
    print(sibling)
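Note the plural: .next_siblings and .previous_siblings are iterators, while .next_sibling and .previous_sibling return a single node. A small sketch on made-up markup, assuming the built-in 'html.parser':

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: four sibling tags inside one <p>.
html = "<p><b>first</b><a href='#'>second</a><i>third</i><u>fourth</u></p>"
soup = BeautifulSoup(html, "html.parser")

# Plural forms iterate over every sibling on that side of the tag.
following = [s.name for s in soup.a.next_siblings if isinstance(s, Tag)]
preceding = [s.name for s in soup.a.previous_siblings if isinstance(s, Tag)]

# Singular form returns only the adjacent node.
adjacent = soup.a.next_sibling.name

print(following)  # ['i', 'u']
print(preceding)  # ['b']
print(adjacent)   # i
```

Siblings share the same parent, and they can be text nodes as well as tags, which is why the sketch filters with isinstance.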
The prettify() method of the bs4 library
The prettify() method reformats the markup into a standard, indented layout and is called as soup.prettify(). In PyCharm, use print(soup.prettify()) to display the output.
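A minimal sketch of prettify() on a one-line, made-up snippet, assuming the built-in 'html.parser'. The method returns a string in which tags and text are spread over separate, indented lines.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup to be reformatted.
html = "<p><a href='#'>Login</a></p>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str; print it to see the indented layout.
pretty = soup.prettify()
print(pretty)
```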
Operating Environment: Mac, Python 3.6, PyCharm 2016.2
Reference: the course "Python Web Crawler and Information Extraction" on China University MOOC
Author: du wangdan, Internet product manager