Objective
Often we cannot locate an element directly. In that case we can locate its parent element first, because it is easier to find the child element through the parent.
I. Child nodes
1. Take a post summary on the blog home page as an example, with the tag <div class="c_b_p_desc"> as the starting point
2. This div tag is then the parent node
3. The string "Abstract: Preface this article in detail ..." is a child node of that div (a string is usually considered a child of the tag that contains it)
4. The tag <a class="c_b_p_desc_readmore" href="http://www.cnblogs.com/yoyoketang/p/6906558.html">Read full text</a> is also a child node of the div
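The parent/child relationship described above can be checked offline. This is a minimal sketch, assuming a simplified, hand-written copy of the summary block (the sample text and the "#" link are placeholders, not content fetched from the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of the summary block (not the live page).
html = ('<div class="c_b_p_desc">Abstract: sample summary text'
        '<a class="c_b_p_desc_readmore" href="#">Read full text</a></div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="c_b_p_desc")

# The <a> tag sits directly inside the div, so its parent is the div.
link = div.find("a")
print(link.parent.name)

# The leading text is a string node whose parent is the same div.
first_string = div.find(string=True)
print(type(first_string).__name__, first_string.parent.name)
```

Both the string and the a tag report the div as their parent, which is what makes them its child nodes.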
II. .contents
1. The .contents attribute of a tag object gets all of its child nodes and returns them as a list
2. The len() function counts the number of child nodes
3. A specific child node can be retrieved by subscript
    # coding: utf-8
    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.cnblogs.com/yoyoketang/")
    # Get the entire HTML of the page after requesting the home page
    blog = r.content
    # Parse the HTML with html.parser
    soup = BeautifulSoup(blog, "html.parser")
    # find() returns the first tag object on the page that matches the attribute
    tag_soup = soup.find(class_="c_b_p_desc")
    # len() gets the number of child nodes
    print(len(tag_soup.contents))
    # Loop over the child nodes and print each one
    for i in tag_soup.contents:
        print(i)
    # Take the first child node (the string) by subscript
    print(tag_soup.contents[0])
    # Take the second child node (the a tag) by subscript
    print(tag_soup.contents[1])
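Because the live page changes over time, the same steps can be tried on a fixed snippet. A minimal sketch, assuming a hand-written stand-in for the summary block (the sample text and "#" link are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the summary block; no whitespace between
# nodes, so the div has exactly one string child and one <a> child.
html = ('<div class="c_b_p_desc">Abstract: sample summary text'
        '<a class="c_b_p_desc_readmore" href="#">Read full text</a></div>')

soup = BeautifulSoup(html, "html.parser")
tag_soup = soup.find(class_="c_b_p_desc")

children = tag_soup.contents   # a plain Python list of direct children
print(len(children))           # number of direct child nodes
print(children[0])             # the string child
print(children[1])             # the <a> tag child
```

Note that on real pages, line breaks and indentation between tags also show up as string child nodes, so the count is often larger than the number of visible elements.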
III. .children
1. .children produces an iterator over the child nodes and serves the same purpose as .contents above
2. Because it is an iterator rather than a list, the nodes can only be read with a for loop and cannot be retrieved by subscript
(In general .contents above is used more often. .children may perform better, but that is only my guess.)
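The difference from .contents can be seen in a short offline sketch. It assumes a hand-written stand-in for the summary block (placeholder text and link), and the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the summary block (not the live page).
html = ('<div class="c_b_p_desc">Abstract: sample summary text'
        '<a class="c_b_p_desc_readmore" href="#">Read full text</a></div>')

soup = BeautifulSoup(html, "html.parser")
tag_soup = soup.find(class_="c_b_p_desc")

children = tag_soup.children
try:
    children[0]               # an iterator does not support subscripts
    subscriptable = True
except TypeError:
    subscriptable = False
print(subscriptable)

# Reading the nodes with a for loop works as usual.
for child in tag_soup.children:
    print(repr(child))

# Wrapping the iterator in list() gives back subscript access.
as_list = list(tag_soup.children)
print(as_list[1])
```

Converting with list() is the usual way to count or index the nodes when only .children is at hand.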
IV. .descendants
1. .contents above only gets the direct child nodes of an element. If those child nodes have child nodes of their own (that is, grandchild nodes), use .descendants to get all descendant nodes
2. The div obtained here has two child nodes but three descendant nodes, because the a tag contains the string child node "Read full text"
    # coding: utf-8
    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.cnblogs.com/yoyoketang/")
    # Get the entire HTML of the page after requesting the home page
    blog = r.content
    # Parse the HTML with html.parser
    soup = BeautifulSoup(blog, "html.parser")
    # find() returns the first tag object on the page that matches the attribute
    tag_soup = soup.find(class_="c_b_p_desc")

    # Get the number of direct child nodes
    print(len(list(tag_soup.children)))

    # Get the number of descendant nodes
    print(len(list(tag_soup.descendants)))

    for i in tag_soup.descendants:
        print(i)
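The two-versus-three count can be reproduced without a network request. A minimal sketch, assuming a hand-written stand-in for the summary block (placeholder text and link):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the summary block (not the live page).
html = ('<div class="c_b_p_desc">Abstract: sample summary text'
        '<a class="c_b_p_desc_readmore" href="#">Read full text</a></div>')

soup = BeautifulSoup(html, "html.parser")
tag_soup = soup.find(class_="c_b_p_desc")

direct = list(tag_soup.children)          # string + <a> tag
everything = list(tag_soup.descendants)   # string + <a> tag + string inside <a>

print(len(direct), len(everything))

# Descendants are yielded in document order, parents before their children.
for node in everything:
    print(repr(node))
```

The extra descendant is the "Read full text" string, which is a child of the a tag and therefore a grandchild of the div.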
V. Crawling the tag list on the blog home page
1. The tag list in the left sidebar of the blog is not served from this link: http://www.cnblogs.com/yoyoketang/
2. By capturing the network traffic you can see that its URL is: http://www.cnblogs.com/yoyoketang/mvc/blog/sidecolumn.aspx?blogApp=yoyoketang
3. You can locate the parent element first: <div class="catlisttag">
VI. Reference code
    # coding: utf-8
    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.cnblogs.com/yoyoketang/mvc/blog/sidecolumn.aspx?blogApp=yoyoketang")
    # Get the entire HTML returned by the request
    blog = r.content
    # Parse the HTML with html.parser
    soup = BeautifulSoup(blog, "html.parser")
    # Class name as given in the text above; the match is case-sensitive,
    # so check it against the actual page source
    tag_soup = soup.find(class_="catlisttag")

    # print(soup.prettify())

    # Find every a tag under the parent element
    ul_soup = tag_soup.find_all("a")
    print(ul_soup)
    for i in ul_soup:
        print(i.string)
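The "locate the parent first, then search inside it" pattern can be tried on a fixed snippet as well. This is a sketch under stated assumptions: the sidebar markup below is invented for illustration (the tag names "python" and "selenium" and the surrounding ul/li structure are hypothetical), and only the class name comes from the text above:

```python
from bs4 import BeautifulSoup

# Hypothetical sidebar markup; the real response may be structured differently.
html = ('<div class="other"><a href="#">unrelated link</a></div>'
        '<div class="catlisttag"><ul>'
        '<li><a href="#">python</a>(3)</li>'
        '<li><a href="#">selenium</a>(5)</li>'
        '</ul></div>')

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the parent element.
tag_soup = soup.find(class_="catlisttag")

# Step 2: search only inside that parent, so links elsewhere are ignored.
names = [a.string for a in tag_soup.find_all("a")]
print(names)
```

Searching from tag_soup instead of soup keeps the unrelated link in the other div out of the result.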
Python crawler BeautifulSoup4 series 4: child nodes