BeautifulSoup's advanced applications: find and find_all


BeautifulSoup is an important part of learning Python. It is used to parse content such as HTML/XML, which is especially useful when crawling specific page information from the messy and irregular HTML pages found on the web. For installing the BeautifulSoup module, you can refer to the blog.

As for how to fetch the content of a web page, you can view the blog post summarizing that.

Each of these methods has a singular and a plural form: the plural form finds all tags that match the requirements and returns them as a list. Their correspondence is: find -> find_all, find_parent -> find_parents, find_next_sibling -> find_next_siblings, and so on.
Take a simple example of crawling the Baidu homepage:

>import urllib
>from bs4 import BeautifulSoup
>url = "http://www.baidu.com"           # the page to crawl
>content = urllib.urlopen(url).read()   # open the url with urllib's urlopen and read the page content into content
>soup = BeautifulSoup(content)          # convert content into BeautifulSoup-parsed data
>print content                          # print the HTML of the page
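The snippet above is written for Python 2 (urllib.urlopen, print statements). A roughly equivalent Python 3 sketch, assuming bs4 is installed and using the standard-library urllib.request plus an explicit parser, would be:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
content = urllib.request.urlopen(url).read()    # fetch the raw HTML bytes
soup = BeautifulSoup(content, "html.parser")    # parse with an explicit parser
print(soup.prettify()[:500])                    # show the first part of the parsed page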

The BeautifulSoup(content) call is described on the official site as follows: "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work." It converts a complex HTML document into a tree structure in which each node is a Python object.

The main use of BeautifulSoup is easiest to show on a small HTML document. The sample HTML string was truncated in the original post:

from bs4 import BeautifulSoup
html = ""   # the sample HTML document (truncated here)
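Judging from the outputs quoted further down (a "Dormouse's story" paragraph with id="hehe" and three "sister" links), the sample document is essentially the "three sisters" snippet from the BeautifulSoup documentation. The reconstruction below is an assumption made so that the later examples can be reproduced:

from bs4 import BeautifulSoup

# Reconstructed sample document (assumed from the outputs quoted later in this post).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="hehe"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())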

This is the process of reading an HTML string into a soup object and then outputting it via the prettify() function. There are several ways to output the HTML of a soup object:
1 soup.prettify()
2 soup.html
3 soup.contents
4 soup
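As a quick sketch of what each form returns (using the soup built from the sample document above):

print(soup.prettify())    # the document re-indented, one tag per line
print(soup.html)          # the <html> Tag object, rendered as markup when printed
print(soup.contents)      # a list of the soup's direct children
print(soup)               # the whole document as-is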
You can also use soup.<tag name> to get the first matching tag in the HTML, for example:
print soup.p outputs: <p class="title" id="hehe"><b>The Dormouse's story</b></p>
print soup.p.string outputs the tag's text: The Dormouse's story
In addition, you can also use the get_text() function to output a tag's content:

>import re
>pid = soup.find(href=re.compile("^http:"))   # regular-expression matching with re, discussed further below
>p1 = soup.p.get_text()
>print p1
The Dormouse's story
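In bs4, get_text() also accepts a separator and a strip flag, which is handy when a tag contains nested tags. A small sketch, assuming the reconstructed sample document above:

story = soup.find('p', class_='story')
print(story.get_text())                    # all text inside the tag, concatenated as-is
print(story.get_text(" ", strip=True))     # whitespace-stripped pieces joined by spaces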

You can obtain the attributes of a tag through the get() function:

>soup = BeautifulSoup(html, 'html.parser')
>pid = soup.find_all('a', {'class': 'sister'})
>for i in pid:
    print i.get('href')                    # call get() on each item to fetch the tag's attribute value
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
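Besides get(), a tag behaves like a dictionary for its attributes, and .attrs exposes the whole mapping; get() simply returns None instead of raising KeyError when the attribute is missing. A short sketch:

first_link = soup.find('a')
print(first_link['href'])       # dictionary-style access (raises KeyError if the attribute is absent)
print(first_link.get('href'))   # get() returns None if the attribute is absent
print(first_link.attrs)         # the full attribute dictionary: class, href, id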

Other tags can be fetched the same way, but the output is only the first matching object in the document; if you want to search for all matching tags, the find/find_all functions are required.
BeautifulSoup provides the powerful search functions find and find_all (the camel-case alias findAll also works); these two methods are valid on Tag objects as well as on the top-level BeautifulSoup object.

find_all(name, attrs, recursive, text, limit, **kwargs)

for link in soup.find_all('a'):    # soup.find_all returns a list
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
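The remaining parameters of find_all() work as the signature suggests. A brief sketch of recursive and text (the latter is called string in newer bs4 versions):

import re

print(soup.html.find_all('a', recursive=False))    # [] -- recursive=False looks only at direct children
print(soup.find_all(text=re.compile("sisters")))   # text matches the text inside tags rather than tag names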

find_all can also search for tags by their attributes. For example, look for the p tag with id="hehe"; a result set is returned:

>pid = soup.find_all('p', id='hehe')          # search by the tag's id attribute
>print pid
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]
>pid = soup.find_all('p', {'id': 'hehe'})     # search with a dictionary of attributes; the result is returned as a list []
>print pid
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]

You can use regular expressions to search tag content:

>pid = soup.find_all(id=re.compile("he$"))    # using a regular expression
>print pid
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]
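Regular expressions can also be applied to tag names, not just attribute values. For example, matching every tag whose name starts with "b" (a small sketch against the sample document):

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)        # body, b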

You can also search by several attribute values of a tag at once:

>pp = soup.find_all('a', attrs={'href': re.compile('^http'), 'id': 'link1'})
# search by multiple attribute values of the tag; attrs must not be omitted here,
# while 'a' can be omitted -- it only further restricts the tag name
>print pp
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]    # the result is a list
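Because class is a reserved word in Python, bs4 also accepts a class_ keyword as a shortcut for the attrs dictionary. A sketch:

print(soup.find_all('a', class_='sister', limit=1))     # class_ is the keyword shortcut for searching by CSS class
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]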

You can limit the number of search results with limit=n:

>pid = soup.find_all('a', limit=2)            # limit the search to the first two matching results
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

You can also pass find_all a list of tag names and get a list of matches back:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Here the argument to find_all is given as a list containing the a and b tag names, and the result is returned as a list.
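Beyond lists, find_all() also accepts True (match every tag) and even a function that takes a tag and returns a boolean. A sketch of both:

print([tag.name for tag in soup.find_all(True)])    # True matches every tag in the document

def class_but_no_id(tag):
    # function filter: keep tags that have a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(class_but_no_id))               # e.g. the <p class="story"> paragraph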

Read and modify attributes:

>p1 = soup.p
>print p1                  # print the contents of p1
<p class="title" id="hehe"><b>The Dormouse's story</b></p>
>print p1['id']            # print p1's id attribute
hehe
>p1['id'] = 'haha'         # modify the value of p1's id attribute
>print p1['id']
haha
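A tag's attributes can also be inspected and removed through .attrs and del, in the same dictionary style. A sketch:

p1 = soup.p
print(p1.attrs)            # the tag's full attribute dictionary
p1['align'] = 'left'       # add or overwrite an attribute
del p1['align']            # remove it again
print(p1)                  # the tag reflects the modifications immediately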

The usage of find and find_all in BeautifulSoup is the same; the difference is that find returns only the first of the values that find_all would find. Example:

>soup = BeautifulSoup(html, 'html.parser')
>pid = soup.find(href=re.compile("^http:"))   # again using re regular-expression matching
>print pid
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
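One practical difference worth remembering: when nothing matches, find() returns None while find_all() returns an empty list, so the two need different checks. A sketch:

missing = soup.find('table')
if missing is None:                      # find() returns None when there is no match
    print("no <table> tag found")

print(soup.find_all('table'))            # find_all() returns an empty list [] instead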

Recommended:

For installing BeautifulSoup, you can view the blog.

For installing and updating modules in Python (pip, easy_install), you can view the blog.

For resolving garbled-text problems in Python, you can view the blog.
