# Xiaodeng
# Python3
# Parsing HTML source with Beautiful Soup

html_doc = """"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

# 1. Print the full HTML source, pretty-printed
# print(soup.prettify())

# 2. Get the HTML page title
# print(soup.title.string)
"""
There are other similar usages:
1) print(soup.title.title)  # result: None, because <title> contains no nested <title> tag
2) print(soup.title.name)   # result: title
"""

# 3. Find all tags with a given name (p, a, title); each call returns a list
# print(soup.find_all("p"))
# print(soup.find_all("a"))
# print(soup.find_all("title"))
"""
To get only the first <p> tag: print(soup.p)
"""

# 4. Read the value of the class attribute of the first <p> tag
# print(soup.p["class"])

# 5. Find all tags with id="link3", including their contents
# print(soup.find_all(id="link3"))

# 6. Extract the link of every <a> tag in the document
"""
for key in soup.find_all("a"):
    print(key.get("href"))
"""

# 7. Get all the text content of the document
# print(soup.get_text())

# 8. Explore the Tag data type
soup = BeautifulSoup('<b class="boldest">extremely bold</b>', "html.parser")
tag = soup.b
# print(type(tag))  # <class 'bs4.element.Tag'>

# 9. Get the tag name; every tag has a name, available through .name
# print(soup.b.name)

# 10. Work with tag attributes
# A tag can have any number of attributes.
# The tag <b class="boldest"> has a "class" attribute with the value "boldest".
# Attributes are accessed the same way as dictionary entries.
# print(soup.b["class"])

# 11. Delete a tag attribute
# del tag["class"]

# 12. Regular expressions
# Find all tags whose names start with "b"; in the example this matches <body> and <b>
"""
import re
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
"""
import re
soup = BeautifulSoup(html_doc, "html.parser")
# print(soup.find_all(href=re.compile("tillie")))  # links whose href contains "tillie"

# 13. Match against a list (both <a> and <p> tags)
soup = BeautifulSoup(html_doc, "html.parser")
# print(soup.find_all(["a", "p"]))

# 14. Find the <a> tags with id="link3", including their contents
# find_all(name, attrs, recursive, text, **kwargs)
# find_all() searches all descendant tags of the current tag and keeps those that match the filters.
# Here are a few examples:
# print(soup.find_all("a", id="link3"))

# 15. Find the <a> tags with class="sister"
# class is a reserved word in Python, so the keyword argument is spelled class_
# print(soup.find_all("a", class_="sister"))

# 16. The text parameter searches the string content of the document.
# Like the name parameter, text accepts a string, a regular expression, a list, or True
# print(soup.find_all(text="Elsie"))
# print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

# 17. Limit the number of results
# print(soup.find_all("a", limit=2))

# 18. To search only the direct children of a tag, pass recursive=False
doc = """"""
soup = BeautifulSoup(doc, "html.parser")
# print(soup.find_all("title", recursive=False))

# 19. Finding parent nodes, sibling nodes, and related methods (to be studied)

# 20. Find the title tag with a CSS selector
soup = BeautifulSoup(html_doc, "html.parser")
# print(soup.select("title"))

# 21. Find the direct child tags of a tag
# Note: put a space on each side of ">"; older Beautiful Soup releases reject "p>b"
# print(soup.select("p > b"))  # direct child <b> tags of <p>
# print(soup.select("body > b"))

# 22. Find tags with class="sister" by CSS class name
result = soup.select(".sister")
# print(result)

# 23. Search by tag id, soup.select("#link1")
result = soup.select("#link1")
# print(result)  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
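
Because html_doc is empty in the listing, the find_all() calls above return nothing when run as written. The following is a minimal sketch of several of those calls against a sample document. The markup is an assumption, modeled on the "three sisters" example from the Beautiful Soup documentation, and may differ from the input the script was originally run against; the variable names are hypothetical as well.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample markup standing in for the empty html_doc.
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Item 2: text of the <title> tag
title_text = soup.title.string

# Item 4: class is a multi-valued attribute, so it comes back as a list
p_class = soup.p["class"]

# Item 6: href of every <a> tag
hrefs = [a.get("href") for a in soup.find_all("a")]

# Item 12: tag names starting with "b", and links whose href contains "tillie"
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]
tillie_ids = [a["id"] for a in soup.find_all(href=re.compile("tillie"))]

# Items 15 and 17: filter by CSS class, and cap the number of results
sisters = soup.find_all("a", class_="sister")
first_two = soup.find_all("a", limit=2)

print(title_text, p_class, hrefs, b_names, tillie_ids, sep="\n")
```

Collecting the results in variables rather than printing each call directly makes it easier to compare the filters side by side.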
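
Items 11, 18, and 21 to 23 above can be tried together on a tiny document. This is only a sketch: the markup and the variable names are invented for illustration and are not the author's original input.

```python
from bs4 import BeautifulSoup

# Assumed markup, invented for this example.
sample_html = (
    "<html><head><title>Demo</title></head><body>"
    '<p class="story"><b>bold</b> '
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Items 21 to 23: CSS selectors for a direct child, a class, and an id
direct_b = soup.select("p > b")
by_class = soup.select(".sister")
by_id = soup.select("#link1")

# Item 18: recursive=False looks only at direct children.
# <title> is not a direct child of the document root, but it is one of <head>.
root_titles = soup.find_all("title", recursive=False)
head_titles = soup.head.find_all("title", recursive=False)

# Item 11: attributes behave like dictionary entries and can be deleted
link = by_id[0]
del link["class"]
remaining_attrs = sorted(link.attrs)
sisters_after_delete = soup.select(".sister")

print(direct_b, by_class, root_titles, head_titles, remaining_attrs, sep="\n")
```

Deleting the class attribute changes the parsed tree in place, which is why the ".sister" selector no longer matches the link afterwards.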