Parsing HTML with Python's BeautifulSoup (repost)

Last Update:2016-09-29 Source: Internet

Author: User

Original address: http://www.cnblogs.com/twinsclover/archive/2012/04/26/2471704.html

Preface

When I used Python to crawl web pages, I always relied on regular expressions or the SGMLParser class in the sgmllib library. In complicated situations, though, SGMLParser often falls short. (Am I being too naive? After all, BeautifulSoup itself inherits from SGMLParser.) So I searched around and found BeautifulSoup. It provides a very friendly parse tree, from which we can easily extract tag names, attributes (attrs), text, and so on.

For installation, see: http://www.crummy.com/software/BeautifulSoup/

Getting started

(PS: The official documentation is really the best introduction; this post just records some simple usage.)

(Official documentation: http://www.crummy.com/software/beautifulsoup/bs3/documentation.zh.html)

First, we introduce some of the most commonly used methods in practical work:

The HTML code used as the example is taken directly from the official example.

0. Initialize:
soup = BeautifulSoup(html)  # html is the HTML source code as a string; type(html) == str
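As a rough sketch of the initialization step on a current setup: the code below assumes the bs4 package (Beautiful Soup 4) on Python 3 instead of the BS3 import this article was written for. The HTML string is a stand-in modelled on the official sample (including the title text), not necessarily the exact markup used here, and the names HTML and soup are only for illustration.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the example markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)

# bs4 takes the source string plus a parser name; "html.parser" ships with Python.
soup = BeautifulSoup(HTML, "html.parser")

print(type(HTML) is str)    # the input is an ordinary string
print(soup.title.string)    # text of the <title> tag
```

Naming the parser explicitly keeps the result independent of whichever optional parsers happen to be installed.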
1. Use a tag to get the parse tree of the corresponding code block:
Since we want to analyze HTML, we first have to find the tag block that is useful to us, and BeautifulSoup provides a very convenient way to do so.
# When a tag is used as the search condition, we get the parse tree that contains this tag block:
# <tag><xxx>ooo</xxx></tag>
# Here we get the head block
head = soup.find('head')
# or
# head = soup.head
# or
# head = soup.contents[0].contents[0]
After running this, we get the parse tree of the head block.
I still recommend the second method. The find method looks for a subtree that matches the criteria within the current tag's parse tree (the current HTML code block) and returns it. It supports a variety of query styles, including the much-loved regular expressions; these are described in detail later.
The contents property is a list that holds the direct children of the parse tree.
For example:
html = soup.contents[0]
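The three ways of reaching the head block can be compared side by side. This is a minimal sketch assuming bs4 on Python 3 and a stand-in sample document written without whitespace between tags, so that contents[0] really is a tag; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

head_by_find = soup.find("head")                  # search by tag name
head_by_attr = soup.head                          # attribute shortcut
head_by_contents = soup.contents[0].contents[0]   # walk the child lists

# All three expressions reach the same node of the parse tree.
print(head_by_find is head_by_attr is head_by_contents)
print(head_by_find)

# contents lists only the direct children of a node.
print([child.name for child in soup.html.contents])
```

Walking contents by index depends on the exact layout of the source (stray whitespace becomes a child too), which is one reason searching by tag name tends to be more robust.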
2. Use contents[], parent, nextSibling, and previousSibling to find parent, child, and sibling tags

To make parsing HTML code blocks more convenient and flexible, BeautifulSoup provides several simple ways to get directly to the parent, children, and siblings of the current tag block.
Suppose we have already obtained the body tag block and want to find the html, head, and two p tag blocks:
# body = soup.body
html = body.parent            # html is the parent of body
head = body.previousSibling   # head is on the same level as body and is its previous sibling
p1 = body.contents[0]         # p1 and p2 are children of body; contents[0] gives p1
p2 = p1.nextSibling           # p2 is on the same level as p1 and is its next sibling; body.contents[1] also works
print p1.text
# u'This is paragraphone.'
print p2.text
# u'This is paragraphtwo.'
# Note: 1. The text of each tag includes its own text and the text of its descendants.
#       2. All text is automatically converted to Unicode; if needed, encode it yourself with encode(xxx).
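The same navigation can be sketched with bs4 on Python 3, which is an assumption on my part; there the attributes are spelled previous_sibling and next_sibling. The markup is again a stand-in modelled on the official sample.

```python
from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

body = soup.body
html = body.parent              # parent of body
head = body.previous_sibling    # sibling just before body
p1 = body.contents[0]           # first child of body
p2 = p1.next_sibling            # sibling just after p1

print(html.name, head.name)
print(p1["id"], p2["id"])
print(p2 is body.contents[1])   # both expressions reach the same node
print(p1.get_text())            # text of p1 including its descendants
```

Note that bs4's get_text() keeps the space before <b>, so the text may look slightly different from the BS3 output quoted in the comments above.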
But what if we are looking for an ancestor or a deeper descendant tag? Write a while loop? No, BeautifulSoup already provides methods for that.
3. Use find, findParent, findNextSibling, and findPreviousSibling to search for ancestor or descendant tags:
With the basics above, this should be easy to understand. The find method, for example (which, as I understand it, is the same as findChild), starts from the current node, traverses the entire subtree, and returns as soon as it finds a match.
The plural forms of these methods find all tags that match and return them as a list. The correspondence is: find -> findAll, findParent -> findParents, findNextSibling -> findNextSiblings ...
For example:
print soup.findAll('p')
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
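To make the singular and plural forms concrete, here is a small sketch assuming bs4 on Python 3, where the methods are spelled find_all, find_parent, find_next_sibling, and find_previous_sibling. The markup is a stand-in modelled on the official sample.

```python
from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

paragraphs = soup.find_all("p")   # plural form: every match, as a list
first = soup.find("p")            # singular form: the first match only
print(len(paragraphs), first is paragraphs[0])

b = soup.find("b")                                      # first <b> in the document
print(b.find_parent("p")["id"])                         # nearest <p> ancestor
print(first.find_next_sibling("p")["id"])               # later sibling
print(paragraphs[1].find_previous_sibling("p")["id"])   # earlier sibling

print(soup.find("table"))   # the singular form gives None when nothing matches
```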
　　
Here we focus on several uses of find; the other methods work analogously:
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
(PS: Only a few uses are covered here; for the complete list, see the official documentation: http://www.crummy.com/software/beautifulsoup/bs3/documentation.zh.html#the%20basic%20find%20method:%20findall%28name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs%29)
1) Search by tag:
find(tagname)          # search directly for the tag named tagname, e.g. find('head')
find(list)             # search for any tag in the list, e.g. find(['head', 'body'])
find(dict)             # search for any tag in the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))   # search for tags matching a regex, e.g. find(re.compile('^p')) finds tags whose name starts with p
find(lambda)           # search with a function and return tags for which it is True, e.g. find(lambda tag: len(tag.name) == 1) finds tags whose name has length 1
find(True)             # search all tags
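The tag filters listed above can be tried out roughly as follows. The sketch assumes bs4 on Python 3 and uses find_all so that every match is visible; the helper names() is made up for the example, the markup is a stand-in modelled on the official sample, and the dict form is left out.

```python
import re

from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to its tag names."""
    return [tag.name for tag in tags]


print(names(soup.find_all("head")))                            # exact tag name
print(names(soup.find_all(["head", "body"])))                  # any name in the list
print(names(soup.find_all(re.compile("^b"))))                  # names starting with b
print(names(soup.find_all(lambda tag: len(tag.name) == 1)))    # one-letter names
print(names(soup.find_all(True)))                              # every tag
```

Results come back in document order, which is why the regex search lists body before the two b tags.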
2) Search by attributes (attrs):
find(id='xxx')                                          # find the tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})   # find the tag whose id attribute matches the regex and whose align attribute is xxx
find(attrs={'id': True, 'align': None})                 # find the tag that has an id attribute but no align attribute
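A short sketch of attribute searches, assuming bs4 on Python 3 and stand-in markup modelled on the official sample. It covers the keyword form, a regex combined with an exact value, and True for "attribute present"; the None form is not exercised here.

```python
import re

from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# Keyword form: exact match on the id attribute.
first = soup.find(id="firstpara")
print(first["align"])

# attrs dict: id must match the regex and align must equal "blah".
hit = soup.find(attrs={"id": re.compile("para$"), "align": "blah"})
print(hit["id"])

# True matches any tag that has the attribute, whatever its value.
with_id = soup.find_all(attrs={"id": True})
print([tag["id"] for tag in with_id])
```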
3) Search by text (text):
Note that searching by text invalidates any other search values given, such as tag and attrs.
The usage is the same as for searching by tag.
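Text searches can be sketched in the same way. This assumes bs4 on Python 3, where the argument is named string (text is the older spelling), and stand-in markup modelled on the official sample. Only text criteria are passed, so the results are strings rather than tags.

```python
import re

from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

exact = soup.find_all(string="one")                    # exact text
either = soup.find_all(string=["one", "two"])          # any text in the list
partial = soup.find_all(string=re.compile("paragraph"))  # regex on the text
title_text = soup.find(string=re.compile("title"))     # first match only

print(exact)
print(either)
print(partial)
print(title_text)
```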
4) recursive, limit:
recursive=False means that only the direct children are searched; otherwise the entire subtree is searched. The default is True.
When using findAll or a similar method that returns a list, the limit parameter restricts the number of results. For example, findAll('p', limit=2) returns the first two tags found.
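The effect of the two parameters shows up in a quick sketch, assuming bs4 on Python 3 (find_all instead of findAll) and stand-in markup modelled on the official sample, in which the b tags are grandchildren of body.

```python
from bs4 import BeautifulSoup

# Assumed stand-in markup, modelled on the official sample.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")
body = soup.body

limited = soup.find_all("p", limit=1)             # stop after the first match
deep_b = body.find_all("b")                       # whole subtree (default)
direct_b = body.find_all("b", recursive=False)    # direct children only
direct_p = body.find_all("p", recursive=False)    # the <p> tags are direct children

print(len(limited), len(deep_b), len(direct_b), len(direct_p))
```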
4. Use next and previous to find the neighboring tags in document order (rarely used)
Here we mainly look at next, which returns the tag block that follows the current tag in document order (top to bottom in the source). This is not the same as contents, so do not confuse the two.
Let's look at an example.
<a>
    a
    <b>b</b>
    <c>c</c>
</a>
Let's look at the actual effect of next:
a = soup.a
b = soup.b
n1 = b.next
n2 = n1.next
Print the results:
print a.next
# u'a'
print n1
# u'b'
print n2
# <c>c</c>
So next simply returns the "next" element in document order, regardless of its position in the parse tree.
Of course, there are also the findNext and findAllNext methods.
As for previous, it refers to the preceding tag block and works analogously.
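For comparison, the same walk can be sketched with bs4 on Python 3 (an assumption), where next and previous are spelled next_element and previous_element. The markup is written without whitespace between tags so that the results line up with the simplified output shown above.

```python
from bs4 import BeautifulSoup

# Compact version of the example; whitespace between tags would show up
# as extra text nodes in the walk.
soup = BeautifulSoup("<a>a<b>b</b><c>c</c></a>", "html.parser")

a = soup.a
b = soup.b
n1 = b.next_element     # the text inside <b>
n2 = n1.next_element    # the element that follows that text: <c>

print(repr(a.next_element))
print(repr(n1))
print(n2)

# next_element follows document order; next_sibling stays on one tree level.
print(b.next_sibling is n2)
print(b.find_next("c") is n2)
print([tag.name for tag in a.find_all_next(True)])
print(repr(n2.previous_element))
```

Here next_element steps from the text inside b to the c tag, which also happens to be b's next sibling; in general the two are different walks.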

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
