Python crawler (2) ...

Source: Internet
Author: User

Continuing from the previous section on Tag: a tag has a property called string. tag.string is the second of the four objects we have to master, NavigableString, which represents the text inside a tag (including whitespace characters). If the tag contains another tag instead of text, that inner tag must sit directly against it with no whitespace in between, otherwise .string returns None; the reason is explained below. A NavigableString is a wrapper around an ordinary Unicode string that additionally provides methods for searching the HTML tree. We can call .replace_with() on it to replace the text inside the tag, and unicode() to convert it to a normal string (in Python 3, use str() instead).

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
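The session above is written for Python 2. As a minimal sketch of the same steps under Python 3, assuming bs4 is installed and using the built-in "html.parser" (the sample markup here is an assumption, not the tag from the previous section):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup chosen for illustration only.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

text = tag.string    # a NavigableString, not a plain str
plain = str(text)    # Python 3 counterpart of unicode(tag.string)

print(type(text).__name__)
print(type(plain).__name__)

# replace_with() swaps the string inside the tag in place.
tag.string.replace_with("No longer bold")
print(tag)
```

Converting with str() is useful when the text should outlive the parse tree, since a NavigableString keeps a reference to the tree it came from.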

The last object to mention is Comment, which is really just a special kind of NavigableString. In practice I found that the comment must be written directly against its parent tag for .string to return it; otherwise .string returns None. (The reason is also explained below.)

from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# With "<b> <!--preceded by a space--></b>", soup.b.string would return None
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
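To make the whitespace remark concrete, here is a small sketch comparing the two markups side by side. It assumes Python 3 and the built-in "html.parser"; the variable names tight and spaced are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

tight = BeautifulSoup(
    "<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser")
spaced = BeautifulSoup("<b> <!--preceded by a space--></b>", "html.parser")

comment = tight.b.string
print(isinstance(comment, Comment))          # Comment object
print(isinstance(comment, NavigableString))  # ...which is also a NavigableString

# The leading space becomes a separate string child, so <b> has two
# children and .string gives up.
print(spaced.b.contents)
print(spaced.b.string)
```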

Now that we know these four types of objects, we can explore how to find what we need.

Here's the simplest way:

soup.head
# the first <head> tag
soup.body.b
# <b>The Dormouse's story</b>
# finds the first body, then the first b inside that body
soup.b
# if that b is the first b in the document, as in the example above, it can also be accessed directly
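A runnable sketch of this attribute-style navigation, assuming Python 3 and the built-in "html.parser". The HTML string below is a made-up sample, not the document used in the article.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p><b>second bold</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.head)    # first <head> in the document
print(soup.body.b)  # first <b> inside the first <body>
print(soup.b)       # first <b> anywhere; the same element here
print(soup.i)       # a tag name that does not occur gives None
```

Because a missing tag yields None, chaining such as soup.i.b raises an AttributeError, so it is worth checking intermediate results when the page structure is uncertain.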

.contents and .children:

The difference between the two is that .contents returns a list, while .children returns an iterator over the same items.

For example, for this test.html:

<body>
<b>
AABBCCDD
</b>
</body>

and this code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'), 'lxml')
print(soup.body.contents)

The result is:

['\n', <b>
AABBCCDD
</b>, '\n']

It is worth mentioning that the BeautifulSoup object itself also has .contents.
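The list-versus-iterator difference can be checked directly. This is a sketch assuming Python 3 and the built-in "html.parser" (rather than lxml), with the HTML passed as a string instead of being read from test.html:

```python
from bs4 import BeautifulSoup

html = "<body>\n<b>\nAABBCCDD\n</b>\n</body>"
soup = BeautifulSoup(html, "html.parser")

contents = soup.body.contents  # a real list: supports len() and indexing
children = soup.body.children  # an iterator: consume it with a loop or list()

print(type(contents).__name__)
print(len(contents))
print(list(children) == contents)  # same items, different container

# The BeautifulSoup object has .contents too; with html.parser and this
# input its only child is the <body> tag.
print(soup.contents)
```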

.descendants:

The only difference between this and .children is that .descendants returns all descendants, while .children returns only the immediate children, so there is not much more to say about it.
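A short sketch of that difference, assuming Python 3 and the built-in "html.parser"; the nested markup is a made-up sample:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p><b>AABBCCDD</b></p></body>", "html.parser")

children = list(soup.body.children)        # only the <p> tag
descendants = list(soup.body.descendants)  # <p>, then <b>, then the text

print(len(children))
print(len(descendants))
for node in descendants:
    print(repr(node))
```

Note that .descendants walks depth-first and includes strings as well as tags.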

.string:

If a tag has only one child and that child is a NavigableString (in which case the content of the tag is plain text), then we can use .string to get it.

<b>
    AABBCCDD
</b>

The b.contents of the above HTML is:

['\n    AABBCCDD\n']

This matches the situation described above, so b.string returns that text.

Likewise, if a tag has only one child and that child is another tag which itself has a .string, then the parent tag's .string equals its child's .string.
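Both rules can be seen in one small sketch, assuming Python 3 and the built-in "html.parser". The body wrapper is added here only to show the single-child case:

```python
from bs4 import BeautifulSoup

# <body> has exactly one child (<b>), and <b> has exactly one child (text).
soup = BeautifulSoup("<body><b>\n    AABBCCDD\n</b></body>", "html.parser")
b = soup.b

print(b.contents)          # a single NavigableString, whitespace included
print(repr(b.string))      # that same string
print(repr(soup.body.string))  # passed up from the only child
```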

This does not work for the following example:

<body>
<b>
    AABBCCDD
</b>
</body>

For this example, calling .contents on body gives:

['\n', <b>
    AABBCCDD
</b>, '\n']

body has three children rather than one, so body.string is None.

Unless the example is written like this:

<body><b>AABBCCDD</b></body>

That is why I said earlier that, for .string to work, the inner tag must sit directly against its parent with no whitespace between them.
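The two layouts can be compared in a few lines. This is a sketch assuming Python 3 and the built-in "html.parser"; the names spaced and tight are illustrative only:

```python
from bs4 import BeautifulSoup

spaced = BeautifulSoup("<body>\n<b>\n    AABBCCDD\n</b>\n</body>", "html.parser")
tight = BeautifulSoup("<body><b>AABBCCDD</b></body>", "html.parser")

# The newlines around <b> are parsed as extra string children of <body>.
print(len(spaced.body.contents))
print(spaced.body.string)

# Without that whitespace, <body> has a single child and .string passes through.
print(len(tight.body.contents))
print(tight.body.string)
```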
