Python parses HTML using BeautifulSoup

Source: Internet
Author: User

Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz

Description: This version works well with Python 2.7.

Install: Unpack the archive, then run python setup.py install

On Debian/Ubuntu-based Linux systems you can also run: sudo apt-get install python-bs4

Official documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

(You can also use pyquery)

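Output document

Open the output file for writing in UTF-8, for example with io.open("test.html", "w", encoding="utf-8").

When you call __str__, prettify, or renderContents, you can specify the encoding of the output. The default encoding (the one used by str) is UTF-8. For example, a document parsed from an ISO-8859-1 string can be written back out in that same encoding with soup.__str__("ISO-8859-1"). This call form comes from the older Beautiful Soup 3 API; in bs4, use soup.encode("ISO-8859-1") or pass the encoding to prettify.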
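The encoding options above can be sketched as follows. This is a minimal example, not the author's code: it assumes Python 3, a recent bs4, and the built-in "html.parser" backend, and the sample markup and the temporary output path are made up for illustration.

```python
import io
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup containing one non-ASCII character (e-acute).
soup = BeautifulSoup(u"<p>caf\u00e9</p>", "html.parser")

# encode() returns bytes in the requested encoding (UTF-8 by default).
utf8_bytes = soup.encode("utf-8")
latin1_bytes = soup.encode("iso-8859-1")

# prettify() returns bytes when given an encoding, text otherwise.
pretty_bytes = soup.prettify("iso-8859-1")

# Write the document to a file opened with an explicit encoding.
out_path = os.path.join(tempfile.mkdtemp(), "test.html")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(soup.prettify())

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, encoding="utf-8") as f:
    written = f.read()
```

The same character ends up as two bytes in the UTF-8 output and as a single byte in the ISO-8859-1 output, which is why the encoding has to match whatever will read the file.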

Four types of objects

Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. These objects fall into 4 types:

    • Tag: An HTML tag. It has two important properties, name and attrs
    • NavigableString: The text inside a tag
    • BeautifulSoup: The document as a whole; in most cases you can treat it as a Tag object
    • Comment: A special NavigableString that holds a comment such as <!-- Comment -->
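The four object types listed above can be seen in a short sketch. It assumes Python 3, a recent bs4, and the built-in "html.parser" backend; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup: one tag with attributes, text, and a comment.
html = ('<p class="title" name="dromouse">'
        "<b>The Dormouse's story</b><!-- a comment --></p>")
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # Tag
text = soup.b.string          # NavigableString
comment = soup.p.contents[1]  # Comment (second child of <p>)

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# The two Tag properties mentioned above.
print(tag.name)
print(tag.attrs)
```

BeautifulSoup is a subclass of Tag and Comment is a subclass of NavigableString, which is why the document object supports the same search and navigation calls as any tag.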

Tag:

    • print(type(soup.a))  # <class 'bs4.element.Tag'>
      print(soup.p.attrs)  # {'class': ['title'], 'name': 'dromouse'}
    • css_soup = BeautifulSoup('<p class="body strikeout"></p>')
      css_soup.p['class']  # ['body', 'strikeout']


NavigableString:

    • print(soup.p.string)  # The Dormouse's story

Commonly used operations:

    soup.title
    # <title>The Dormouse's story</title>
    soup.title.name
    # u'title'
    soup.title.string
    # u"The Dormouse's story"
    soup.title.parent.name
    # u'head'
    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>
    soup.p['class']
    # [u'title']  (class is a multi-valued attribute, so a list is returned)
    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    print(soup.find_all("a", attrs={"class": "sister"}, limit=2))

    import re
    # string= requires Beautiful Soup 4.4.0 or later; earlier versions call it text=
    soup.find(string=re.compile("sisters"))
    # u'Once upon a time there were three little sisters; and their names were\n'

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
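The search calls above can be run end to end against the "three sisters" sample from the official documentation. This is a minimal sketch assuming Python 3, bs4 4.4.0 or later (for the string= argument), and the built-in "html.parser" backend.

```python
import re

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# limit=2 stops the search after the first two matches.
first_two = soup.find_all("a", attrs={"class": "sister"}, limit=2)
ids = [a["id"] for a in first_two]

# find() returns only the first match (or None if nothing matches).
tillie = soup.find(id="link3")

# string= matches text nodes instead of tags; a compiled regex is allowed.
sentence = soup.find(string=re.compile("sisters"))

print(ids)
print(tillie["href"])
print(sentence)
```

Note that find returns None when nothing matches, so chained calls such as soup.find(...).find(...) are worth guarding when the page structure is not guaranteed.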

    head_tag.contents
    # [<title>The Dormouse's story</title>]
    list(head_tag.children)  # .children is an iterator over the same nodes
    # [<title>The Dormouse's story</title>]
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>
    sibling_soup.b.next_sibling
    # <c>text2</c>
    sibling_soup.c.previous_sibling
    # <b>text1</b>
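The navigation attributes above can be tried with the following sketch. It assumes Python 3, a recent bs4, and the built-in "html.parser" backend; the two small documents are stand-ins for whatever head_tag, title_tag, and sibling_soup referred to in the original page.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
head_tag = soup.head
title_tag = soup.title

contents = head_tag.contents        # list of direct children
children = list(head_tag.children)  # iterator over the same children
parent_name = title_tag.parent.name

# Siblings are nodes that share the same parent.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
next_sib = sibling_soup.b.next_sibling
prev_sib = sibling_soup.c.previous_sibling

print(contents)
print(parent_name)
print(next_sib)
print(prev_sib)
```

In real pages the whitespace between tags is itself a text node, so next_sibling often returns a newline string rather than the next tag.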

find_all is the same method as findAll; findAll is the older Beautiful Soup 3 name, kept in bs4 as an alias.

find_all(name, attrs, recursive, string, limit, **kwargs)
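The parameters in the signature above behave roughly as shown in this sketch. It assumes Python 3, bs4 4.4.0 or later, and the built-in "html.parser" backend; the nested sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags directly under the div, one nested deeper.
html = (
    '<div id="outer">'
    '<p class="a">one</p>'
    '<span><p class="b">two</p></span>'
    '<p class="a">three</p>'
    "</div>"
)
outer = BeautifulSoup(html, "html.parser").find(id="outer")

all_p = outer.find_all("p")                      # name: every <p> descendant
direct_p = outer.find_all("p", recursive=False)  # recursive: direct children only
limited = outer.find_all("p", limit=1)           # limit: stop after one match
by_attrs = outer.find_all("p", attrs={"class": "b"})  # attrs: attribute filter
by_kwarg = outer.find_all("p", class_="a")       # **kwargs: class_ avoids the keyword
by_string = outer.find_all(string="two")         # string: match text nodes

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
print([p.get_text() for p in by_kwarg])
```

Keyword arguments other than the named ones are treated as attribute filters, so find_all(id="outer") and find_all(attrs={"id": "outer"}) select the same tags.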

My program:

    from bs4 import BeautifulSoup

    # teamcity_home (the base URL of the TeamCity server) is defined elsewhere.

    def parse_html(text):
        soup = BeautifulSoup(text, from_encoding="UTF-8")
        # Find the element with id="historytable", locate the first table
        # inside it, and get all of its tr rows
        target = soup.find(id="historytable").find('table').findAll('tr')
        results = []
        rec = []
        for tr in target[1:]:  # skip the header row (th)
            tds = tr.findAll('td')  # all td cells in this row
            # text of the span inside the second td
            build_no = str(tds[1].span.string.strip())
            # text of the a node inside the first td
            patch = str(tds[0].a.string)
            status_node = tds[2].find('a')
            status = str(status_node.find('span').string)
            # build the link from the tag's href attribute
            status_link = '%s/%s' % (teamcity_home, status_node.attrs['href'])
            # replace non-breaking spaces, which do not display correctly
            started = str(tds[5].string.replace(u'\xa0', ' '))
            print('-' * 10)
            print('%s\t%s\t%s\t%s\t' % (patch, build_no, status, started))
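To try the program above without a live server, the following sketch runs the same extraction steps against a small inline table. It is an illustration under stated assumptions, not the author's code: the sample HTML, the TEAMCITY_HOME value, and the parse_rows name are all hypothetical, and it targets Python 3 with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

TEAMCITY_HOME = "http://teamcity.example.com"  # hypothetical base URL

# Hypothetical page fragment shaped like the table the program expects:
# patch link in td 0, build number in td 1, status link in td 2, start time in td 5.
SAMPLE = u"""
<div id="historytable"><table>
<tr><th>Patch</th><th>Build</th><th>Status</th><th></th><th></th><th>Started</th></tr>
<tr>
  <td><a href="#">patch-1</a></td>
  <td><span> #42 </span></td>
  <td><a href="viewLog.html?buildId=42"><span>Success</span></a></td>
  <td></td><td></td>
  <td>01 Jan\u00a010:00</td>
</tr>
</table></div>
"""


def parse_rows(text):
    """Return one dict per data row instead of printing, so it can be tested."""
    soup = BeautifulSoup(text, "html.parser")
    rows = soup.find(id="historytable").find("table").find_all("tr")
    results = []
    for tr in rows[1:]:  # skip the header row
        tds = tr.find_all("td")
        status_node = tds[2].find("a")
        results.append({
            "patch": str(tds[0].a.string),
            "build_no": tds[1].span.string.strip(),
            "status": str(status_node.find("span").string),
            "status_link": "%s/%s" % (TEAMCITY_HOME, status_node["href"]),
            # Replace the non-breaking space with a regular space.
            "started": tds[5].string.replace(u"\xa0", " "),
        })
    return results


rows = parse_rows(SAMPLE)
for row in rows:
    print("\t".join([row["patch"], row["build_no"], row["status"], row["started"]]))
```

Returning a list of dicts rather than printing inside the loop keeps the parsing step separate from the output step, which makes it easier to check against a saved copy of the page.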
