Python Learning: Navigating Trees

Source: Internet
Author: User

The findAll function finds tags by their name and attributes. But what if you need to find a tag based on its position in the document? This is where tree navigation (Navigating Trees) comes in. In Chapter 1, we saw how to navigate the BeautifulSoup tag tree in a single direction:

    bsObj.tag.subTag.anotherSubTag

Now we will use the fictional online shopping site at http://www.pythonscraping.com/pages/page3.html as a sample page to scrape.

The HTML of this page can be mapped to a tree (some tags are omitted for brevity), as shown below:

html
  - body
    - div.wrapper
      - h1
      - div.content
      - table#giftList
        - tr
          - th
          - th
          - th
          - th
        - tr.gift#gift1
          - td
          - td
            - span.excitingNote
          - td
          - td
            - img
        - ... (other table rows omitted) ...
      - div.footer

In the following sections, we continue to use this HTML tag structure as the example.

1. Handling child tags and other descendant tags

In computer science and some branches of mathematics, you often hear about terrible things being done to children: moving them, storing them, deleting them, even killing them. Thankfully, BeautifulSoup treats child tags more kindly.

Like many other libraries, BeautifulSoup draws a clear distinction between children and descendants. As in a human family tree, a child tag is exactly one level below its parent tag, whereas a descendant tag can be at any level below it. For example, the tr tags are children of the table tag, while the tr, th, td, img, and span tags are all descendants of the table tag (at least in our example page). All children are descendants, but not all descendants are children.
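The difference between children and descendants can be sketched on a small inline document. The HTML below is a hypothetical miniature of the gift table, not the real page, and it is written without whitespace between tags so that no whitespace-only text nodes appear in the tree.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature of the gift table (an assumption for illustration).
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr id="gift1">'
    '<td>Vegetable Basket <span class="excitingNote">Now with kale!</span></td>'
    '<td>$15.00</td>'
    '</tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Children: only the tags exactly one level below the table.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# Descendants: every tag at any depth below the table, in document order.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```

The isinstance check keeps only tags, because both .children and .descendants also yield text nodes such as "Item" and "$15.00".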

In general, BeautifulSoup functions always deal with the descendants of the currently selected tag. For example, bsObj.body.h1 selects the first h1 tag that is a descendant of the body tag; it will not find tags located outside the body.

Similarly, bsObj.div.findAll("img") finds the first div tag in the document and then retrieves a list of all img tags that are descendants of that div tag.
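As a rough illustration of this scoping, the sketch below uses a hypothetical inline page (the file names a.jpg and b.jpg are made up) and the find_all spelling, which is the current name for findAll in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two top-level divs, each containing one image.
SAMPLE_HTML = (
    '<html><head><title>Gifts</title></head><body>'
    '<div class="wrapper"><h1>Totally Normal Gifts</h1>'
    '<div class="content"><img src="a.jpg"></div></div>'
    '<div class="footer"><img src="b.jpg"></div>'
    '</body></html>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The h1 is nested inside div.wrapper, but it is still a descendant of body.
first_h1 = soup.body.h1

# soup.div is the first div in the document; the search below is limited
# to that div's descendants, so the image in div.footer is not included.
first_div = soup.div
images_in_first_div = first_div.find_all("img")

# Searching from the top of the document finds both images.
all_images = soup.find_all("img")

print(first_h1.get_text())
print(len(images_in_first_div), len(all_images))
```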

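If you want to find only the child tags, you can use the .children attribute:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    # Naming the parser explicitly avoids a warning from bs4
    bsObj = BeautifulSoup(html, "html.parser")

    # Iterate over the direct children of the giftList table only
    for child in bsObj.find("table", {"id": "giftList"}).children:
        print(child)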

This code prints the rows of products in the giftList table. If you use the .descendants attribute instead of .children, more than 20 tags are printed, including the img tags, the span tags, and the individual td tags. It is important to understand the difference between children and descendants!

2. Handling sibling tags

BeautifulSoup's next_siblings attribute makes it easy to collect data from tables, especially tables with a header row:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    # Naming the parser explicitly avoids a warning from bs4
    bsObj = BeautifulSoup(html, "html.parser")

    # Start at the first tr (the header row) and walk the siblings after it
    for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
        print(sibling)

This code prints every product row in the table except the first row, which holds the column headings. Why is the header row skipped? There are two reasons. First, an object cannot be its own sibling: whenever you get the siblings of a tag, the tag itself is not included. Second, next_siblings returns only the siblings that come after the tag. If we select a tag in the middle of a group and use next_siblings on it, only the siblings after it are returned. So by selecting the header row and then using next_siblings, we get every row in the table except the header row itself.
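Both reasons can be checked on a small inline table. This is a minimal sketch: the table below is a hypothetical stand-in for the real page, and the row ids gift2 and gift3 are assumed for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical gift table: one header row followed by three product rows.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
    '<tr id="gift3"><td>Fish Painting</td><td>$10,005.00</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .tr selects the first row of the table, which is the header row.
header_row = soup.find("table", {"id": "giftList"}).tr

# The header row is not its own sibling, so only product rows are returned.
rows_after_header = [
    row["id"] for row in header_row.next_siblings if isinstance(row, Tag)
]

# Starting from a middle row returns only the rows that follow it.
middle_row = soup.find("tr", {"id": "gift2"})
rows_after_middle = [
    row["id"] for row in middle_row.next_siblings if isinstance(row, Tag)
]

print(rows_after_header)
print(rows_after_middle)
```

On a page that has line breaks between rows, next_siblings also yields the whitespace text nodes between the tags; the isinstance check skips them.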

To complement next_siblings, the previous_siblings attribute is often useful when there is an easily selectable tag at the end of a group of sibling tags.

There are also next_sibling and previous_sibling, which work like next_siblings and previous_siblings except that they return a single tag instead of a group of tags.
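The following sketch shows these attributes on a single hypothetical table row; the cell contents are assumptions made for illustration. Note that previous_siblings walks backwards, so the nearest sibling comes first.

```python
from bs4 import BeautifulSoup

# Hypothetical product row with four cells and no whitespace between tags.
SAMPLE_HTML = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>Fresh veggies</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cells = soup.find_all("td")
first_cell, last_cell = cells[0], cells[-1]

# previous_siblings yields every sibling before the tag, nearest first.
texts_before_last = [cell.get_text() for cell in last_cell.previous_siblings]

# The singular forms return one node, or None when there is no such sibling.
second_cell_text = first_cell.next_sibling.get_text()
nothing_before_first = first_cell.previous_sibling

print(texts_before_last)
print(second_cell_text, nothing_before_first)
```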

3. Handling parent tags

When scraping pages, you will need to find a tag's parent far less often than its children or siblings. Typically, when we look at an HTML page with the goal of scraping it, we start from the top-level tags and work out how to drill down to the exact block of data we want. Occasionally, however, you will run into special cases that call for BeautifulSoup's parent-finding attributes, parent and parents. For example:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    # Naming the parser explicitly avoids a warning from bs4
    bsObj = BeautifulSoup(html, "html.parser")

    # From the image, go up to its td, then back one td, and read its text
    print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
          .parent.previous_sibling.get_text())

This code prints the price of the product represented by the image at ../img/gifts/img1.jpg (in this example, the price is $15.00).

How does this work? The following diagram shows part of the tree structure of the HTML page we are working with, with the steps numbered:

<tr>
  - <td>
  - <td>
  - <td> (3)
    - "$15.00" (4)
  - <td> (2)
    - <img src="../img/gifts/img1.jpg"> (1)

(1) Select the image tag with src="../img/gifts/img1.jpg".

(2) Select the parent of the image tag (in this example, the <td> tag).

(3) Select the previous sibling of that <td> tag with previous_sibling (in this example, the <td> tag that contains the dollar price of the product).

(4) Select the text inside that tag, "$15.00".
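The four steps above can be sketched one at a time on a hypothetical inline copy of the row (the cell text is assumed; only the image path and the price come from the example). The last lines also show parents, which yields every ancestor rather than just the nearest one.

```python
from bs4 import BeautifulSoup

# Hypothetical copy of the gift1 row, with no whitespace between tags.
SAMPLE_HTML = (
    '<table id="giftList"><tr id="gift1">'
    '<td>Vegetable Basket</td>'
    '<td>Fresh veggies</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# (1) Select the image tag by its src attribute.
image = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# (2) Move up to the td that contains the image.
image_cell = image.parent

# (3) Move sideways to the td just before it, which holds the price.
price_cell = image_cell.previous_sibling

# (4) Read the text inside that td.
price = price_cell.get_text()
print(price)

# parents walks all the way up; the last entry is the BeautifulSoup object.
ancestor_names = [tag.name for tag in image.parents]
print(ancestor_names)
```

Splitting the chain into named steps makes it easier to see which step fails if the page layout changes, for example if whitespace between cells makes previous_sibling return a text node instead of a <td> tag.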
