The findAll function finds tags by their name and attributes. But what if you need to find a tag based on
its location in the document? That is where tree navigation (Navigating Trees) comes in. In Chapter 1, we
looked at navigating a BeautifulSoup tag tree in a single direction:
    bsObj.tag.subTag.anotherSubTag
Now we will use the fictional online shopping site http://www.pythonscraping.com/pages/page3.html as a sample
page to scrape. This HTML page can be mapped to a tree (some tags are omitted for brevity), as shown below:
html
  - body
    - div.wrapper
      - h1
      - div.content
      - table#giftList
        - tr
          - th
          - th
          - th
          - th
        - tr.gift#gift1
          - td
          - td
            - span.excitingNote
          - td
          - td
            - img
        - ...other table rows omitted...
      - div.footer
In the following sections, we still take this HTML tag structure as an example.
1. Handling child tags and other descendants
In computer science and some branches of mathematics, you often hear about terrible things being done to
"children" (a metaphor for child nodes): they get moved, stored, deleted, and even killed. Thankfully,
BeautifulSoup treats child tags far less cruelly.
As in many other libraries, BeautifulSoup draws an important distinction between children and descendants.
As in a human family tree, a child tag sits exactly one level below its parent, whereas a descendant tag can
sit at any level below it. For example, the tr tags are children of the table tag, while the tr, th, td, img,
and span tags are all descendants of the table tag (at least in our example page). All children are
descendants, but not all descendants are children.
In general, BeautifulSoup functions always deal with the descendants of the currently selected tag. For
example, bsObj.body.h1 selects the first h1 tag that is a descendant of the body tag; it will not find tags
located outside the body. Similarly, bsObj.div.findAll("img") finds the first div tag in the document and
then retrieves a list of all img tags that are descendants of that div.
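The scoping rule above can be sketched on a small inline document. This is only an illustration: SAMPLE_HTML and its file names are made up for the example and are not the real sample page, and find_all is the current spelling of findAll in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the sample page (not its real markup).
SAMPLE_HTML = (
    '<html><body>'
    '<div class="wrapper"><h1>Gifts</h1><img src="a.jpg">'
    '<div class="content"><img src="b.jpg"></div></div>'
    '<div class="footer"><img src="c.jpg"></div>'
    '</body></html>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Dot navigation picks the first matching descendant of each tag in the chain.
heading_text = soup.body.h1.get_text()

# soup.div is the first div in the document (div.wrapper); find_all then
# searches only that div's descendants, so the footer image is not returned.
sources = [img["src"] for img in soup.div.find_all("img")]

print(heading_text)
print(sources)
```

The image inside div.footer is skipped because it is not a descendant of the first div, even though it appears in the same document.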
If you want to find only the tags that are children, you can use the .children attribute:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # Iterate over the direct children of the giftList table only
    for child in bsObj.find("table", {"id": "giftList"}).children:
        print(child)
This code prints the product rows in the giftList table. If you used .descendants instead of .children,
more than 20 tags would be found within the table and printed, including the img tags, the span tags, and
each individual td tag. It is important to understand the difference between children and descendants!
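To make the difference concrete without fetching the page, here is a minimal sketch on an inline table. SAMPLE_HTML is a made-up, shortened stand-in for the giftList table, and the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened stand-in for the giftList table.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr id="gift1">'
    '<td>Vegetable Basket <span class="excitingNote">New!</span></td>'
    '<td>$15.00</td>'
    '</tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes one level below the table (the two rows).
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks every level below the table and also yields text nodes,
# so keep only Tag objects to compare like with like.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```

Both attributes are generators rather than lists, which is why the sketch collects their results with list comprehensions before printing.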
2. Handling sibling tags
BeautifulSoup's next_siblings generator makes it easy to collect data from tables, especially tables
with a header row:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # Start at the first row (the header) and walk the siblings that follow it
    for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
        print(sibling)
This code prints every product row in the table except the first row, which holds the column headings. Why
is the header row skipped? There are two reasons. First, an object cannot be its own sibling: whenever you
get the siblings of a tag, the tag itself is not included. Second, next_siblings returns only the siblings
that come after the tag. If you select a tag in the middle of a group and call next_siblings on it, only the
siblings that follow it are returned. So by selecting the header row and then using next_siblings, you select
every row in the table except the header row itself.
As a complement to next_siblings, previous_siblings is useful when the tag that is easy to select sits at
the end of the group of sibling tags you want.
There are also next_sibling and previous_sibling, which work much like next_siblings and previous_siblings
except that they return a single tag rather than a collection of tags.
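The four sibling attributes can be compared side by side in a small sketch. SAMPLE_HTML and its row ids are invented for this example, and the markup deliberately contains no whitespace between tags; on a real page the whitespace between rows usually appears as text nodes among the siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical table written without whitespace between tags, so every
# sibling is a tag rather than a whitespace text node.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>First gift</td></tr>'
    '<tr id="gift2"><td>Second gift</td></tr>'
    '<tr id="gift3"><td>Third gift</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
middle_row = soup.find("tr", {"id": "gift2"})

# Only the rows after the selected tag; the tag itself is never included.
after_ids = [row["id"] for row in middle_row.next_siblings]

# Rows before the selected tag, nearest one first.
before_ids = [row["id"] for row in middle_row.previous_siblings]

# The singular forms return one node, or None when there is no such sibling.
next_id = middle_row.next_sibling["id"]
previous_id = middle_row.previous_sibling["id"]
last_row = soup.find("tr", {"id": "gift3"})
nothing_after_last = last_row.next_sibling is None

print(after_ids, before_ids, next_id, previous_id, nothing_after_last)
```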
3. Handling parent tags
When scraping pages, you will need to find a tag's parent far less often than its children or siblings.
Typically, when we look at an HTML page with the goal of scraping it, we start at the top layer of tags and
work out how to drill down to the exact block that holds the data we want. Occasionally, however, you will
run into a special case that calls for BeautifulSoup's parent-finding attributes, .parent and .parents. For example:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # Image -> its parent <td> -> the previous <td> (the price) -> its text
    print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
This code prints the price of the product represented by the image at ../img/gifts/img1.jpg (in this example,
the price is $15.00).
How does this work? The following diagram shows the relevant part of the HTML page's tree structure, with
the steps numbered:
<tr>
  - <td>
  - <td>
  - <td> (3)
    - "$15.00" (4)
  - <td> (2)
    - <img src="../img/gifts/img1.jpg"> (1)
(1) Select the image tag with src="../img/gifts/img1.jpg";
(2) Select the parent of the image tag (in this example, a <td> tag);
(3) Select the previous_sibling of that <td> tag (in this example, the <td> tag that contains the dollar
price of the product);
(4) Select the text inside that tag, "$15.00".
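The four steps can also be written out one at a time, which makes each intermediate result easy to inspect. This is a sketch on a made-up single-row table (SAMPLE_HTML and the cell contents other than the price and image path are assumptions), written without whitespace between the cells so that previous_sibling lands directly on the price cell.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row stand-in for the gift table.
SAMPLE_HTML = (
    '<table><tr>'
    '<td>Some gift</td>'
    '<td>A short description</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# (1) Select the image tag by its src attribute.
image = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# (2) Step up to its parent, the <td> that wraps the image.
image_cell = image.parent

# (3) Step back to the previous sibling, the <td> that holds the price.
price_cell = image_cell.previous_sibling

# (4) Read the text inside that cell.
price = price_cell.get_text()

# .parents walks all the way up: td, tr, table, then the document itself.
ancestor_names = [ancestor.name for ancestor in image.parents]

print(price)
print(ancestor_names)
```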