Beautiful Soup parses anything you give it, and does the tree traversal stuff for you.
The BeautifulSoup library is a collection of functions for parsing, traversing, and maintaining the "tag tree" (a traversal is a search route that visits each node in the tree exactly once). Home page: https://www.crummy.com/software/BeautifulSoup
The BeautifulSoup library is usually called bs4. It is imported with: from bs4 import BeautifulSoup. What this import brings in is the BeautifulSoup class, which is the main entry point of bs4.
Parsers of the bs4 library
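As a rough illustration of how a parser is selected, the sketch below passes the parser name as the second argument to BeautifulSoup. The markup string is made up for this example. It uses 'html.parser' because that parser ships with Python; the other names listed in the comments need third-party packages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<p>Hello, <b>soup</b></p>"

# 'html.parser' is built into Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.string)

# Other parser names accepted by bs4, each backed by a third-party package:
#   'lxml'     -> pip install lxml      (HTML parsing)
#   'xml'      -> pip install lxml      (XML parsing)
#   'html5lib' -> pip install html5lib
```

The examples in this post use 'lxml', so that package has to be installed to run them unchanged.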
Basic elements of the BeautifulSoup class
import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.pmcaff.com/site/selection')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.a)
# Any tag that exists in HTML syntax can be accessed with soup.<tag>.
# When the document contains several identical <tag> elements, soup.<tag> returns the first one.

print(soup.a.name)
# Each <tag> has its own name, available as <tag>.name (a string).

print(soup.a.attrs)
print(soup.a.attrs['class'])
# A <tag> may have zero or more attributes; <tag>.attrs is a dictionary.

print(soup.a.string)
# <tag>.string returns the non-attribute string inside the tag.

soup1 = BeautifulSoup('<p><!--Here is the note--></p>', 'lxml')
print(soup1.p.string)
print(type(soup1.p.string))
# Comment is a special type; it is also returned by <tag>.string.
Output:
<a class="no-login" href="">Login</a>
a
{'href': '', 'class': ['no-login']}
['no-login']
Login
Here is the note
<class 'bs4.element.Comment'>
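The result above depends on a live page, so here is a minimal offline sketch of the same four elements (name, attrs, string, Comment). The HTML snippet is an assumption modeled on the output shown above, and 'html.parser' is swapped in for 'lxml' so the sketch runs without extra packages.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed markup, modeled on the output shown above.
html = '<div><a class="no-login" href="">Login</a><p><!--Here is the note--></p></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                 # first <a> in the document
print(tag.name)              # tag name, a str
print(tag.attrs)             # all attributes, a dict
print(tag.attrs["class"])    # class is multi-valued, so this is a list
print(tag.string)            # text inside the tag

note = soup.p.string         # the only child of <p> is a comment
print(isinstance(note, Comment))
```

Because a Comment comes back from .string just like ordinary text, checking isinstance(note, Comment) is one way to tell the two apart.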
Traversing HTML content with the bs4 library
The basic structure of HTML
Downward traversal of the tag tree
Here the BeautifulSoup object is the root node of the tag tree.
# Iterate over the child nodes
for child in soup.body.children:
    print(child.name)

# Iterate over all descendant nodes
for child in soup.body.descendants:
    print(child.name)
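To make the difference between children and descendants concrete, here is a small sketch on a made-up document. It assumes 'html.parser' and filters with isinstance(node, Tag), because both iterators also yield text nodes, not only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, used only for this illustration.
html = "<body><div><p>one</p></div><span>two</span></body>"
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of <body>.
child_names = [node.name for node in soup.body.children if isinstance(node, Tag)]

# .descendants walks the whole subtree below <body>.
descendant_names = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_names)       # ['div', 'span']
print(descendant_names)  # ['div', 'p', 'span']
```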
Upward traversal of the tag tree
# Iterate over all ancestor nodes. The ancestors include the soup object itself,
# hence the if...else check.
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
div
div
body
html
[document]
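The same chain of ancestors can be reproduced offline. The sketch below assumes a small document in which the link sits inside two nested div elements, mirroring the output above, and uses 'html.parser'.

```python
from bs4 import BeautifulSoup

# Assumed markup: <a> nested in two <div> elements, as the output above suggests.
html = '<html><body><div><div><a href="">Login</a></div></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the nearest ancestor up to the BeautifulSoup object,
# whose name is the special string '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['div', 'div', 'body', 'html', '[document]']
```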
Sibling traversal of the tag tree
# Iterate over the following siblings (note the plural: next_siblings)
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding siblings (note the plural: previous_siblings)
for sibling in soup.a.previous_siblings:
    print(sibling)
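Sibling traversal only covers nodes that share the same parent, and text nodes count as siblings too. A minimal sketch on a made-up paragraph, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text and tags on both sides of <a>, all inside one <p>.
html = "<p>before <a>link</a> after <b>bold</b></p>"
soup = BeautifulSoup(html, "html.parser")

# Siblings after <a>, in document order: a text node, then the <b> tag.
following = [str(node) for node in soup.a.next_siblings]

# Siblings before <a>, nearest first: only the leading text node.
preceding = [str(node) for node in soup.a.previous_siblings]

print(following)  # [' after ', '<b>bold</b>']
print(preceding)  # ['before ']
```

The singular forms next_sibling and previous_sibling return a single node (or None) rather than an iterator, which is why the loops use the plural forms.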
The prettify() method of the bs4 library
The prettify() method formats the markup in a standard layout, with each tag and each string on its own line. It is called as soup.prettify(). In PyCharm, use print(soup.prettify()) to display the result.
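A short sketch of what prettify() does to a one-line document. The markup is made up for the illustration and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup.
soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

# prettify() returns a str with each tag and string on its own line,
# indented according to its depth in the tag tree.
pretty = soup.prettify()
print(pretty)

# Stripping the indentation shows the line-by-line structure.
lines = [line.strip() for line in pretty.splitlines()]
```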
Operating environment: Mac, Python 3.6, PyCharm 2016.2
Reference: China University MOOC course "Python Web Crawler and Information Extraction"
-----End-----
Du Wangdan, Internet Product Manager
Python crawler tool: BeautifulSoup Library