The previous part described the four kinds of objects and their properties. In practice, though, pulling the tags we need (and the information they contain) out of messy HTML is more complicated, so now we look at the search methods.
The two main methods are, of course, find_all() and find(). They are roughly the same, except that find_all() returns every tag that matches the criteria, while find() returns only the first one. Let's take a closer look at find_all().
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
find_all() looks through all descendants of the tag it is called on (called on the soup object, that means the whole document); the calling tag itself is not tested. From my own testing, regular-expression filters are case-sensitive by default.
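The two points above can be checked with a small sketch. The HTML snippet and the variable names below are made up for illustration; only find(), find_all() and the re module are real APIs.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: a div nested inside another div, and a link with a capital "E".
html = """
<div id="outer">
  <div id="inner"><a href="http://example.com/Elsie">Elsie</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Only descendants are searched, so the outer div does not find itself.
inner_ids = [d["id"] for d in outer.find_all("div")]

# The pattern is lowercase and the href has a capital "E": no match by default.
case_sensitive = soup.find_all(href=re.compile("elsie"))

# Passing re.IGNORECASE to the compiled pattern makes the filter case-insensitive.
case_insensitive = soup.find_all(href=re.compile("elsie", re.IGNORECASE))

print(inner_ids, len(case_sensitive), len(case_insensitive))
```

If a case-insensitive search is what you want, the flag goes on the compiled pattern, not on find_all().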
The first parameter is name, which refers to the tag name. You can pass a string, and find_all() will look for tags whose name equals that string. You can pass a list of strings, and it will look for tags whose name matches any one of them. You can pass a regular expression, which is matched against the tag name. You can also pass a function: it must take exactly one argument, the tag, and return True or False, and the tags for which it returns True are selected.
soup.find_all('b')
# [<b>The Dormouse's story</b>]

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

def has_class_but_no_id(tag):
    # Select tags that define a class attribute but no id attribute.
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
Any argument that find_all() does not recognize is turned into a filter on the tag attribute of the same name (except for some HTML5 attributes, such as data-*, whose names are not valid Python keyword names). Here too you can filter with a string, a regular expression, or a function, but note that the function's single argument is now the value of the attribute being filtered, not the whole tag. You can also filter on several attributes of a tag at once. These are called keyword arguments.
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data-foo is not a valid keyword name, so this line does not even compile:
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

# Several attributes can be filtered at once.
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

def not_lacie(href):
    # The argument is the attribute value (or None), not the tag.
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Attribute filters can also be passed through the second parameter, attrs. Unlike keyword arguments, attrs takes a dictionary that maps attribute names to filters, so it also works for names such as data-foo:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
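As a rough illustration of the attrs dictionary, here is a minimal sketch on a made-up snippet (the data-bar attribute and the variable names are assumptions, not part of the original example). The dictionary values accept the same kinds of filters as keyword arguments, and True matches any tag that has the attribute at all.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with HTML5 data-* attributes.
html = (
    '<div data-foo="value" data-bar="42">foo!</div>'
    '<div data-foo="other">bar!</div>'
)
data_soup = BeautifulSoup(html, "html.parser")

# Exact string match on one attribute.
exact = data_soup.find_all(attrs={"data-foo": "value"})

# Several attributes at once: a regular expression and "attribute is present".
both = data_soup.find_all(attrs={"data-foo": re.compile("^v"), "data-bar": True})

# True alone matches every tag that carries the attribute, whatever its value.
any_foo = data_soup.find_all(attrs={"data-foo": True})

print(len(exact), len(both), len(any_foo))
```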
One attribute that is searched very often is class, but class is a reserved word in Python, so the keyword argument is named class_ instead. It is used exactly like any other keyword argument. It is worth mentioning that a tag can have several values in its class attribute (a tag belonging to more than one class is perfectly reasonable); when you search with class_, the tag is returned as long as any one of its classes matches. You can also match the full string value of the class attribute, but then the order must be exactly the same as in the tag. If the order differs, the match fails and you need to use select() instead of find_all():
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    # The argument is a single class value (or None), not the tag.
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# css_soup holds a single tag with two classes.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="strikeout body")
# []

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
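The order problem can be reproduced in a few lines. This is only a sketch on made-up markup; has_both_classes is a hypothetical helper, shown as one possible way to stay with find_all() when select() is not wanted.

```python
from bs4 import BeautifulSoup

# Made-up markup: one tag with two classes, one tag with a single class.
css_soup = BeautifulSoup(
    '<p class="body strikeout">both</p><p class="strikeout">one</p>',
    "html.parser",
)

def has_both_classes(tag):
    # Hypothetical helper: for a multi-valued attribute such as class,
    # tag.get("class") returns a list, so a set comparison ignores the order.
    return tag.name == "p" and {"body", "strikeout"} <= set(tag.get("class", []))

any_one = css_soup.find_all("p", class_="strikeout")       # one class is enough
swapped = css_soup.find_all("p", class_="strikeout body")  # wrong order
selected = css_soup.select("p.strikeout.body")             # CSS selector, order-free
by_function = css_soup.find_all(has_both_classes)          # name filter as function

print(len(any_one), len(swapped), len(selected), len(by_function))
```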
The fourth parameter is string (older versions called it text). Used on its own, it searches the strings in the document instead of the tags, and it accepts the same kinds of filters as name. It can also be combined with a tag filter, in which case find_all() returns the tags whose .string matches the condition.
soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
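To make the difference between the two uses of string concrete, here is a small sketch on a made-up snippet (the markup and variable names are assumptions). It also shows that a plain string filter has to equal the whole string, so a partial match needs a regular expression.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup: two links, the second with a longer text.
html = '<p>Sisters: <a href="#e">Elsie</a> and <a href="#l">Lacie Smith</a></p>'
soup = BeautifulSoup(html, "html.parser")

# string on its own returns the matching strings, not tags.
strings = soup.find_all(string="Elsie")

# Combined with a tag name, it returns the tags whose .string matches.
tags = soup.find_all("a", string="Elsie")

# A plain string must equal the whole text, so "Lacie" alone finds nothing...
exact_only = soup.find_all("a", string="Lacie")

# ...while a regular expression matches part of the text.
partial = soup.find_all("a", string=re.compile("Lacie"))

print(strings, tags, exact_only, partial)
```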
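The limit parameter is simple: it gives the search an upper bound. With limit=2, for example, find_all() returns as soon as it has found 2 results and stops looking. So with limit=1, find_all() finds the same tag as find(); the only difference is that find_all() returns it inside a one-element list, while find() returns the tag itself (or None when nothing matches).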
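A minimal sketch of limit and of the find() / find_all() relationship, using a made-up three-link snippet (the markup and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Made-up markup with three links.
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

# Stop after two matches.
first_two = soup.find_all("a", limit=2)

# limit=1 gives a one-element list; find() gives the tag itself.
as_list = soup.find_all("a", limit=1)
single = soup.find("a")

# When nothing matches, find_all() returns an empty list and find() returns None.
no_list = soup.find_all("table")
no_tag = soup.find("table")

print([a["id"] for a in first_two], as_list[0] == single, len(no_list), no_tag)
```

The None case matters in practice: chaining an attribute access onto a find() that found nothing raises an AttributeError, so it is worth checking the result first.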
The recursive parameter defaults to True, in which case find_all() examines all descendants of the tag: its children, its children's children, and so on. If it is set to False, only the direct children are examined.
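The effect of recursive can be seen on a tiny document. This is a sketch with made-up variable names; the title sits two levels below html, so it is a descendant but not a direct child.

```python
from bs4 import BeautifulSoup

# <title> is a grandchild of <html>, <head> is a direct child.
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): all descendants are examined.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are examined.
shallow = soup.html.find_all("title", recursive=False)
direct_children = soup.html.find_all("head", recursive=False)

print(len(deep), len(shallow), len(direct_children))
```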
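Because find_all() is used so often, there is also a shorthand for it: calling the soup object (or a tag) like a function is the same as calling find_all() on it. The following two lines are equivalent: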
soup.find_all("a")
soup("a")
Apart from the limit behaviour mentioned above, find() and find_all() are no different, so I will not repeat the details here ...
Reading the official BeautifulSoup documentation: searching the HTML tree (1)