Reading the Official BeautifulSoup Documentation: Searching the HTML Tree (1)

Last Update:2016-06-16 Source: Internet

Author: User


The four kinds of objects and their properties were described earlier. In practice, though, extracting the tags we need (and the information they contain) from messy HTML is more complicated, so now we can look at some of the search methods.

The two main methods are find_all() and find(). They work almost identically, except that find_all() returns every tag that matches the criteria, while find() returns only the first one. Let's take a closer look at find_all().
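As a minimal sketch of the difference in return values (the markup below is a hypothetical sample made up for this illustration, and the standard-library "html.parser" backend is an assumption, not something the original article specifies):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = """
<p class="story">
  <a id="link1" href="http://example.com/elsie">Elsie</a>
  <a id="link2" href="http://example.com/lacie">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")      # list of every matching tag
first_link = soup.find("a")         # only the first matching tag

no_tables = soup.find_all("table")  # empty list when nothing matches
no_table = soup.find("table")       # None when nothing matches

print(len(all_links), first_link["id"], no_tables, no_table)
```

Because find() can return None, it is usually worth checking its result before indexing into it.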

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

find_all() searches all descendants of the tag it is called on; the calling tag itself is not among them (calling it on the soup object searches the whole document). In my tests, regular-expression filters are case-sensitive by default.
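Both points can be checked with a small experiment. This is only a sketch: the nested markup is a hypothetical sample, and "html.parser" is an assumed choice of parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <div> nested inside another <div>.
html = "<div id='outer'><div id='inner'><b>bold</b></div></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Only descendants are searched, so the outer <div> is not in its own results.
inner_ids = [d["id"] for d in outer.find_all("div")]

# A regular expression is case-sensitive unless compiled with re.IGNORECASE.
strict = outer.find_all(re.compile("^B"))
relaxed = outer.find_all(re.compile("^B", re.IGNORECASE))
relaxed_names = [t.name for t in relaxed]

print(inner_ids, strict, relaxed_names)
```

Here the pattern "^B" finds nothing because the parser stores tag names in lowercase, while the case-insensitive version finds the b tag.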

The first parameter is name, which refers to the tag name. You can pass a string, and find_all() will look for tags whose name equals that string. You can pass a list of strings, and it will look for tags whose name matches any one of them. You can pass a function, which must take exactly one parameter (the tag) and return True or False; the tags for which it returns True are selected. You can even pass a regular expression.

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]

Any keyword argument that find_all() does not recognize is turned into a filter on the tag attribute of that name (with some exceptions, such as HTML5 data-* attributes, whose names are not valid Python keyword arguments). The value can again be a string, a regular expression, or a function. Note that here the function's single parameter is the value of the attribute being filtered, not the whole tag. Several attributes can be filtered at once. These are called keyword arguments.

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    def not_lacie(href):
        return href and not re.compile("lacie").search(href)

    soup.find_all(href=not_lacie)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Attributes can also be filtered through the second parameter, attrs. Unlike the first parameter, it is passed as a dictionary that maps attribute names to filters:

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]
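The attrs dictionary is also the way around attribute names that cannot be used as keyword arguments. The sketch below assumes a small hypothetical document and the "html.parser" backend; it shows a data-* attribute, two attributes filtered together, and the HTML attribute called name, which would otherwise collide with the first parameter of find_all():

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = """
<div data-foo="value" id="a">first</div>
<div data-foo="other" id="b">second</div>
<input name="email">
"""
soup = BeautifulSoup(html, "html.parser")

# A data-* attribute is not a valid keyword argument, so it goes through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

# Several attributes can be combined; each value may be a string or a regex.
by_two = soup.find_all(attrs={"data-foo": re.compile("^o"), "id": "b"})

# The keyword "name" is taken by the first parameter (the tag name),
# so an HTML attribute called "name" must also be passed via attrs.
by_name = soup.find_all(attrs={"name": "email"})

print([t["id"] for t in by_data], [t["id"] for t in by_two], [t.name for t in by_name])
```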

The attribute most commonly filtered on is class, but class is a reserved word in Python, so the keyword argument is spelled class_. It is used like any other keyword argument. One thing worth mentioning: a tag's class may have multiple values (it is perfectly reasonable for a tag to belong to more than one class), and the tag is returned as long as any one of those values matches. If you match against a string containing several classes, the order must be exactly the same as in the tag's class attribute, otherwise the match fails. To match several classes regardless of order, use select() instead of find_all():

    soup.find_all(class_=re.compile("itl"))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    def has_six_characters(css_class):
        return css_class is not None and len(css_class) == 6

    soup.find_all(class_=has_six_characters)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    # css_soup is the small sample document used by the official documentation
    css_soup = BeautifulSoup('<p class="body strikeout"></p>')

    css_soup.find_all("p", class_="body")
    # [<p class="body strikeout"></p>]

    css_soup.find_all("p", class_="strikeout body")
    # []

    css_soup.select("p.strikeout.body")
    # [<p class="body strikeout"></p>]

The fourth parameter is string (called text in older versions). Used on its own, it searches the strings in the document instead of the tags and returns the matching strings. It can also be combined with a tag name, in which case find_all() returns the tags whose .string matches the condition.

    soup.find_all(string="Elsie")
    # [u'Elsie']

    soup.find_all(string=["Tillie", "Elsie", "Lacie"])
    # [u'Elsie', u'Lacie', u'Tillie']

    soup.find_all(string=re.compile("Dormouse"))
    # [u"The Dormouse's story", u"The Dormouse's story"]

    def is_the_only_string_within_a_tag(s):
        """Return True if this string is the only child of its parent tag."""
        return (s == s.parent.string)

    soup.find_all(string=is_the_only_string_within_a_tag)
    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The limit parameter is simple: it puts an upper bound on the search. With limit=2, for example, the method stops and returns as soon as it has found 2 matches. With limit=1, find_all() finds the same tag as find(), except that find_all() returns it inside a one-element list while find() returns the tag itself.
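A short sketch of limit in action, assuming a hypothetical three-item list and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)  # stops after two matches
only_one = soup.find_all("li", limit=1)   # a list holding a single tag
single = soup.find("li")                  # the tag itself, not a list

print([li.string for li in first_two], len(only_one), single.string)
```

Note that only_one[0] and single refer to the same tag; the only difference is the list wrapper.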

The recursive parameter defaults to True, in which case the method examines all descendants of the calling tag. If it is set to False, only the tag's direct children are examined.
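The effect of recursive can be seen on a tiny document. This is a sketch with assumed sample markup, parsed with "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: <title> is a grandchild of <html>, not a child.
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
children = soup.html.find_all("head", recursive=False)  # <head> is a direct child

print(len(deep), shallow, len(children))
```

The title tag is found by the default search but not with recursive=False, because it sits two levels below the html tag.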

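Because find_all() is used so often, there is also a shorthand: calling the soup object or a tag as if it were a function is the same as calling find_all() on it. The two lines below are equivalent.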

    soup.find_all("a")
    soup("a")

Apart from limit, and the fact that find() returns a single tag rather than a list, find() and find_all() take the same arguments and behave the same way, so find() is not covered separately here.

This article is an English version of an article originally published in Chinese on aliyun.com.