Continuing from the previous section on Tag: a tag has a property called string. tag.string is the second of the four objects we need to master, NavigableString, which represents the text inside a tag (including whitespace characters). If the tag contains another tag, the inner tag must sit directly against it with no whitespace in between, otherwise .string returns None; the reason is explained below. A NavigableString is essentially a wrapper around an ordinary Unicode string that also supports some of the features for navigating and searching the HTML tree. We can use .replace_with() to replace the contents of the tag, and unicode() (str() in Python 3) to convert it to a normal string.
    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>
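The session above is written for Python 2. As a rough Python 3 sketch of the same steps, assuming a one-line blockquote as input and the built-in "html.parser" backend (both chosen here only for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Illustrative markup; the parser choice is an assumption, not from the original post.
soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

text = tag.string                                 # the text node inside the tag
is_navigable = isinstance(text, NavigableString)  # it is a NavigableString, not a plain str

plain = str(text)            # Python 3 replacement for unicode(tag.string)
plain_type = type(plain)     # exactly str, no longer tied to the parse tree

tag.string.replace_with("No longer bold")  # swap the text node in place
result = str(tag)
print(result)
```

Converting to a plain string is mainly useful when the text should outlive the soup object, since a NavigableString keeps a reference to the whole tree.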
The last object to cover is Comment, which is a special kind of NavigableString. In my testing, .string only returns the comment when the comment is written directly against its parent tag; otherwise it returns None. (The reason is also explained below.)
    from bs4 import BeautifulSoup

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    # <b> <!--preceded by a space--></b>  --- this would make .string return None
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>
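To make the whitespace remark concrete, here is a small sketch that parses both variants side by side. The variable names and the "html.parser" backend are assumptions made for this illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Comment written directly against its parent tag.
tight = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser")
# Same structure, but a space precedes the comment.
spaced = BeautifulSoup("<b> <!--preceded by a space--></b>", "html.parser")

comment = tight.b.string
is_comment = isinstance(comment, Comment)
is_also_navigable = isinstance(comment, NavigableString)  # Comment subclasses NavigableString

spaced_string = spaced.b.string           # None: <b> now has two children
spaced_children = len(spaced.b.contents)  # the whitespace string plus the comment

print(is_comment, is_also_navigable, spaced_string, spaced_children)
```

The space counts as a text node of its own, so the b tag no longer has a single child and .string has nothing unambiguous to return.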
With these four object types covered, we can look at how to find what we need in the tree ...
Here's the simplest way:
    soup.head    # find the first head
    soup.body.b  # <b>The Dormouse's story</b>; find the first body, then the first b inside it
    soup.b       # if that b is also the first b in the document, as in the example above, it can be accessed directly
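A self-contained sketch of this attribute-style navigation, using a shortened "Dormouse" document made up for the illustration (the markup and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Shortened illustrative document, not the full example from the previous section.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>The Dormouse's story</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head = soup.head                     # first <head> in the document
first_b_in_body = soup.body.b        # first <b> inside the first <body>
first_b = soup.b                     # first <b> anywhere in the document
same = first_b is first_b_in_body    # here both routes reach the same Tag object
missing = soup.table                 # a tag name that does not occur gives None

print(first_b, same, missing)
```

Attribute access only ever returns the first match, so it suits quick lookups rather than collecting every occurrence.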
.contents and .children:
The difference between the two is that .contents returns a list, while .children returns a generator that is iterated lazily (the items are the same) ...
For example, for this test.html:
    <body>
    <b>
    AABBCCDD
    </b>
    </body>
and this code:
    soup = BeautifulSoup(open('test.html'), 'lxml')
    print(soup.body.contents)
The result is:
    ['\n', <b>
    AABBCCDD
    </b>, '\n']
It is worth mentioning that the BeautifulSoup object itself also has .contents.
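The comparison can be sketched as follows. The HTML is passed as a string and parsed with "html.parser" so the snippet does not depend on test.html or lxml; both are assumptions for the illustration:

```python
from bs4 import BeautifulSoup

html = "<body>\n<b>\nAABBCCDD\n</b>\n</body>"
soup = BeautifulSoup(html, "html.parser")

as_list = soup.body.contents      # a real list: newline, the <b> tag, newline
as_iter = soup.body.children      # an iterator over the same children
same_items = list(as_iter) == as_list

# The BeautifulSoup object is the root of the tree and has .contents too.
top_level = [child.name for child in soup.contents]

print(len(as_list), same_items, top_level)
```

Use .contents when indexing or measuring length is needed, and .children when a single pass over the children is enough.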
.descendants:
The only difference between this and .children is that .descendants returns all descendants at every depth, while .children returns only the immediate children; there is not much more to say.
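A minimal sketch of that difference on a nested snippet (the markup is made up for the illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p><b>AABBCCDD</b></p></body>", "html.parser")

children = list(soup.body.children)        # only the direct child: <p>
descendants = list(soup.body.descendants)  # <p>, then <b>, then the text inside <b>

print(len(children), len(descendants))
```

Note that text nodes count as descendants as well, so the string inside the innermost tag shows up in the list.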
.string:
If a tag has only one child and that child is a NavigableString (that is, the tag contains nothing but plain text), we can use .string to get it.
    <b>
     AABBCCDD
    </b>
The b.contents of the above HTML is:
    ['\n AABBCCDD\n']
This matches the situation just described, so b.string can be used to get the text ...
Likewise, if a tag has only one child and that child is another tag that has a .string, then the parent tag's .string is the same as its child's .string.
This does not work for the following example:
    <body>
    <b>
     AABBCCDD
    </b>
    </body>
For this example, the result of calling .contents on body is:
    ['\n', <b>
     AABBCCDD
    </b>, '\n']
Here body has three children (two newline strings and the b tag), so body.string is None.
It only works if the above example is written like this:
    <body><b> </b></body>
That is why I said earlier that, for .string to work, the inner tag must sit directly against its parent with no whitespace in between.
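The .string cases discussed in this section can be checked with a short sketch. The helper body_string is a hypothetical name introduced only for this illustration, and "html.parser" is an assumed parser choice:

```python
from bs4 import BeautifulSoup

def body_string(html):
    """Hypothetical helper: parse the snippet and return body.string (None if ambiguous)."""
    return BeautifulSoup(html, "html.parser").body.string

only_text = body_string("<body>AABBCCDD</body>")          # single text child
tight = body_string("<body><b>AABBCCDD</b></body>")       # single tag child: inherits b.string
spaced = body_string("<body>\n<b>AABBCCDD</b>\n</body>")  # newlines add extra children

# Whitespace inside a single text child is kept as part of the string.
padded = BeautifulSoup("<b>\n AABBCCDD\n</b>", "html.parser").b.string

print(repr(only_text), repr(tight), repr(spaced), repr(padded))
```

When .string comes back as None on real pages, formatting whitespace around the inner tag is the first thing to check.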
Python crawler (2) ...