Python Crawler BeautifulSoup4 Series 4: Child Nodes [Reprint]

Source: Internet
Author: User

From the blog: Shanghai-leisurely

Original address: http://www.cnblogs.com/yoyoketang/tag/beautifulsoup4/

Objective

Often we cannot locate an element directly. In such cases we can first locate its parent element; it is then much easier to find the child element through the parent.

I. Child nodes

1. Take the post summary on the blog home page as an example, starting from the tag <div class="c_b_p_desc">.

2. This div tag is the parent node.

3. The string "Abstract: This preface introduces in detail ..." is a child node of the div above (a string is normally also considered a child node of its tag).

4. <a class="c_b_p_desc_readmore" href="http://www.cnblogs.com/yoyoketang/p/6906558.html">Read full text</a> is also a child node of the div.
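To see concretely that both the summary string and the <a> tag count as child nodes of the div, we can parse a small local copy of that markup. The HTML string below is an assumed, simplified stand-in for the real page, not the live markup:

```python
# coding: utf-8
from bs4 import BeautifulSoup

# assumed simplified stand-in for the summary block on the blog home page
html = ('<div class="c_b_p_desc">Abstract: a preview of the post...'
        '<a class="c_b_p_desc_readmore" '
        'href="http://www.cnblogs.com/yoyoketang/p/6906558.html">'
        'Read full text</a></div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="c_b_p_desc")

print(div.contents)          # two children: a NavigableString and an <a> Tag
print(div.contents[0])       # the summary string
print(div.contents[1].name)  # the <a> tag holding "Read full text"
```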

II. .contents

1. The .contents attribute of a Tag object returns all the child nodes as a list.

2. The len() function counts the number of child nodes.

3. Individual child nodes can be taken out by subscript.

# coding:utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cnblogs.com/yoyoketang/")
# the whole HTML page returned by the home-page request
blog = r.content
# parse the HTML with html.parser
soup = BeautifulSoup(blog, "html.parser")
# find() returns the first tag on the page matching the attribute
tag_soup = soup.find(class_="c_b_p_desc")
# len() gets the number of child nodes
print(len(tag_soup.contents))

# loop over and print the child nodes
for i in tag_soup.contents:
    print(i)

# take the 1st child node (the string) by subscript
print(tag_soup.contents[0])
# take the 2nd child node (the <a> tag) by subscript
print(tag_soup.contents[1])

III. .children

1. The .children attribute gives access to the same direct child nodes as .contents above, but it returns an iterator rather than a list.

2. Because it is an iterator, it can only be read out with a for loop; child nodes cannot be obtained from it by subscript.

(In general .contents is used more; .children may perform faster, but that is just my guess!)
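A minimal sketch of the difference, again using an inline HTML snippet (an assumption, not the live page): .contents supports indexing directly, while .children must be looped over or materialized with list() first:

```python
# coding: utf-8
from bs4 import BeautifulSoup

# assumed simplified markup for illustration
html = '<div class="c_b_p_desc">Summary text<a href="#">read more</a></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="c_b_p_desc")

# .contents is a real list, so it supports subscripts
print(div.contents[0])

# .children is an iterator: read it out with a for loop
for child in div.children:
    print(child)

# or materialize it with list() to count or index it
children = list(div.children)
print(len(children))
```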

IV. .descendants

1. .contents above can only get the direct child nodes of an element. If those child nodes have child nodes of their own (that is, grandchild nodes), the .descendants attribute can be used to get all the element's descendant nodes.

2. The div obtained here has two child nodes but three descendant nodes, because the <a> tag has a string child node of its own, "Read full text".

# coding:utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cnblogs.com/yoyoketang/")
# the whole HTML page returned by the home-page request
blog = r.content
# parse the HTML with html.parser
soup = BeautifulSoup(blog, "html.parser")
# find() returns the first tag on the page matching the attribute
tag_soup = soup.find(class_="c_b_p_desc")

# number of direct child nodes
print(len(list(tag_soup.children)))

# number of descendant nodes
print(len(list(tag_soup.descendants)))
for i in tag_soup.descendants:
    print(i)
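The two-children-but-three-descendants claim can be checked offline with a snippet shaped like the blog's summary div (an assumed simplification of the real markup); the extra descendant is the string inside the <a> tag:

```python
# coding: utf-8
from bs4 import BeautifulSoup

# assumed simplified copy of the summary div
html = ('<div class="c_b_p_desc">Abstract: preview...'
        '<a class="c_b_p_desc_readmore" href="#">Read full text</a></div>')
soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="c_b_p_desc")

print(len(list(div.children)))     # direct children: the string and the <a> tag
print(len(list(div.descendants)))  # those two plus the string inside <a>
```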

V. Crawling the tag content of the blog home page

1. The tag list on the left side of the blog is not contained in this link: http://www.cnblogs.com/yoyoketang/

2. By capturing the network traffic you can see that its actual URL address is: http://www.cnblogs.com/yoyoketang/mvc/blog/sidecolumn.aspx?blogApp=yoyoketang

3. You can then locate the parent element first: <div class="catListTag">

VI. Reference code:

# coding:utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cnblogs.com/yoyoketang/mvc/blog/sidecolumn.aspx?blogApp=yoyoketang")
# the whole HTML page returned by the request
blog = r.content
# parse the HTML with html.parser
soup = BeautifulSoup(blog, "html.parser")
tag_soup = soup.find(class_="catListTag")
# print(tag_soup.prettify())
ul_soup = tag_soup.find_all("a")
print(ul_soup)
for i in ul_soup:
    print(i.string)
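Since the reference code depends on the live page, the same find / find_all / .string pattern can also be exercised against an inline copy of the sidebar markup. The HTML below (class name, tag names, and counts) is an assumed simplification for illustration only:

```python
# coding: utf-8
from bs4 import BeautifulSoup

# assumed simplified version of the sidebar's tag list
html = ('<div class="catListTag">'
        '<ul><li><a href="#">python(12)</a></li>'
        '<li><a href="#">selenium(8)</a></li></ul></div>')

soup = BeautifulSoup(html, "html.parser")
tag_soup = soup.find(class_="catListTag")   # locate the parent div first
links = tag_soup.find_all("a")              # then collect all <a> descendants
for a in links:
    print(a.string)                          # the text of each tag link
```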
