Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz
Description: this version works better with Python 2.7.
Install: unpack the archive, then run python setup.py install
On Linux you can also install it with: sudo apt-get install python-bs4
Official documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
(You can also use pyquery)
Output documents

When writing a parsed document back to a file, open the file with an explicit encoding, e.g. codecs.open("test.html", "w", "utf-8"). When you call __str__, prettify(), or renderContents(), you can specify the encoding of the output. The default encoding (the one str uses) is UTF-8. For example, you can parse an ISO-8859-1 document and output it again in the same encoding with soup.__str__("ISO-8859-1").
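The round trip above can be sketched as follows. This is a minimal example using bs4's `encode()` method and Python 3 file I/O (on Python 2 you would use `codecs.open` instead); the parser choice `html.parser` and the sample markup are my own assumptions, not from the article.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# A small document encoded as ISO-8859-1 (Latin-1).
markup = u"<p>Sacr\xe9 bleu!</p>".encode("iso-8859-1")
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-1")

# encode() picks the byte encoding of the output; UTF-8 is the default.
utf8_bytes = soup.encode("utf-8")
latin1_bytes = soup.encode("iso-8859-1")

# Write the document out as UTF-8 by opening the file with an explicit encoding.
path = os.path.join(tempfile.gettempdir(), "test.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(soup.prettify())
```

`prettify()` always returns a Unicode string, so the file encoding is controlled by how the file is opened, not by the soup itself.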
Four types of objects
Beautiful Soup transforms a complex HTML document into a complex tree; every node is a Python object, and all objects can be summed up into 4 types:
- Tag: corresponds to an HTML tag; its two important properties are name and attrs
- NavigableString: the text inside a tag
- BeautifulSoup: the document as a whole; you can mostly treat it as a Tag object
- Comment: the content of a comment, <!-- ... -->
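The four types can be seen in one small runnable sketch; the sample markup and the `html.parser` choice are assumptions for illustration:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<html><body><p class="title"><b>Hello</b><!-- a comment --></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Tag: an element node with .name and .attrs
p = soup.p
print(type(p))          # <class 'bs4.element.Tag'>
print(p.name, p.attrs)  # p {'class': ['title']}

# NavigableString: the text inside a tag
text = soup.b.string
print(type(text))       # <class 'bs4.element.NavigableString'>

# BeautifulSoup: the whole document, usable like a Tag
print(soup.name)        # [document]

# Comment: a special NavigableString for <!-- ... -->
comment = p.contents[1]
print(type(comment))    # <class 'bs4.element.Comment'>
```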
Tag:

print type(soup.a)  # <class 'bs4.element.Tag'>
print soup.p.attrs  # {'class': ['title'], 'name': 'dromouse'}

A multi-valued attribute such as class comes back as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']  # ['body', 'strikeout']
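The multi-valued-attribute behaviour can be checked directly; the contrast with the single-valued id attribute is my own addition, and `html.parser` is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# class is defined by HTML as multi-valued, so bs4 returns it as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
print(css_soup.p["class"])   # ['body', 'strikeout']

# id is single-valued and comes back as a plain string
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
print(id_soup.p["id"])       # my id
```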
NavigableString:

These shortcuts are very handy:

soup.title               # <title>The Dormouse's story</title>
soup.title.name          # u'title'
soup.title.string        # u"The Dormouse's story"
soup.title.parent.name   # u'head'
soup.p                   # <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']          # u'title'
soup.a                   # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print soup.find_all("a", attrs={"class": "sister"}, limit=2)

soup.find(string=re.compile("sisters"))

soup.find(id="link3")  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
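These three search patterns can be exercised end to end. A minimal sketch with made-up markup modelled on the "three sisters" document from the official docs; note that the `string=` keyword requires a reasonably recent bs4 (older releases spelled it `text=`):

```python
import re
from bs4 import BeautifulSoup

html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<p>Once upon a time there were three little sisters.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs filter plus limit: stop after the first two matches
first_two = soup.find_all("a", attrs={"class": "sister"}, limit=2)

# find with a compiled regex matches against the text of the document
sisters_text = soup.find(string=re.compile("sisters"))

# find by id returns the single matching tag (or None)
link3 = soup.find(id="link3")
```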
head_tag.contents                # [<title>The Dormouse's story</title>]
head_tag.children                # an iterator over the same children
title_tag.parent                 # the <head> tag
sibling_soup.b.next_sibling      # <c>text2</c>
sibling_soup.c.previous_sibling  # <b>text1</b>
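The navigation properties above assume some surrounding setup; a self-contained sketch (the two sample documents and the `html.parser` choice are my own):

```python
from bs4 import BeautifulSoup

head_soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

head_tag = head_soup.head
title_tag = head_tag.contents[0]      # first child of <head>

print(head_tag.contents)              # .contents is a list of direct children
print(list(head_tag.children))        # .children yields the same nodes lazily
print(title_tag.parent.name)          # head

print(sibling_soup.b.next_sibling)      # <c>text2</c>
print(sibling_soup.c.previous_sibling)  # <b>text1</b>
```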
find_all == findAll (the same method under its old camelCase name)

find_all(name, attrs, recursive, string, limit, **kwargs)
My program:
def parse_html(text):
    soup = BeautifulSoup(text, from_encoding="UTF-8")
    # Find the table with id="historytable", locate the first table inside it,
    # and get all of its tr rows
    target = soup.find(id="historytable").find('table').findAll('tr')
    results = []
    rec = []
    for tr in target[1:]:                          # skip the th header row
        tds = tr.findAll('td')                     # get all the td cells
        build_no = str(tds[1].span.string.strip()) # text of the span in the second td
        patch = str(tds[0].a.string)               # text of the a node in the first td
        status_node = tds[2].find('a')
        status = str(status_node.find('span').string)
        # teamcity_home is defined elsewhere; read the link from the tag's attributes
        status_link = '%s/%s' % (teamcity_home, status_node.attrs['href'])
        started = str(tds[5].string.replace(u'\xa0', ' '))  # strip non-breaking spaces
        print '-' * 10
        print '%s\t' % patch,
        print '%s\t' % build_no,
        print '%s\t' % status,
        print '%s\t' % started
Parsing HTML in Python with BeautifulSoup