BeautifulSoup is a library for parsing crawled content, and its find and find_all methods are especially useful. After parsing, it builds a tree structure out of the page, giving the document a key-value shape similar to JSON, which makes operating on the page's content much easier and more convenient.
There is not much to say about installing the library: using pip, just run pip install beautifulsoup4 from cmd (the PyPI package is named beautifulsoup4, while the import name is bs4).
First, copy the example code over from the official documentation, as follows:
```python
from bs4 import BeautifulSoup

# The "three little sisters" example document from the official docs
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))
```
Here html_doc stands in for the page we would normally crawl down; using the content provided in the documentation keeps the example simple. We parse html_doc directly, using the html.parser parser.
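html.parser is Python's built-in parser; BeautifulSoup also accepts third-party parsers such as lxml or html5lib when they are installed. A quick sketch (the one-tag document is made up for illustration) showing that the built-in parser copes even with imperfect HTML:

```python
from bs4 import BeautifulSoup

# A deliberately broken snippet: the <a> tag is never closed
broken = "<a>unclosed"

# html.parser ships with Python, so no extra install is needed;
# it silently closes the dangling tag at the end of the document
soup = BeautifulSoup(broken, 'html.parser')
print(soup.a.string)  # unclosed
```

If speed matters on large pages, passing 'lxml' instead of 'html.parser' (after installing lxml) is a common choice.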
In Sublime Text, pressing Ctrl+B runs the script (a Python IDE plugin package is recommended so you can run code directly without switching to cmd).
```
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[Finished in 0.2s]
```
The output of the code is shown above: find_all('a') returns every <a> tag in the document.
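find and find_all accept more than a bare tag name: they can also filter by attributes. A minimal sketch (assuming beautifulsoup4 is installed; the small document here is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<html><body>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Filter by CSS class (class_ because "class" is a Python keyword)
sisters = soup.find_all('a', class_='sister')
# find returns only the first match, not a list
first = soup.find('a')
# Keyword arguments match HTML attributes, e.g. id
by_id = soup.find(id='link2')

print(len(sisters))   # 2
print(first['id'])    # link1
print(by_id.string)   # Lacie
```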
Following the documentation, we can run further queries against soup and check that the results agree (the output below is pasted directly from the site, with duplicates omitted):
```python
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
As shown above, the parsed soup clearly forms key-value style data: soup.title and similar attribute lookups print each required piece of content on its own. (Lines beginning with # show the printed output.)
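One caveat worth knowing: indexing a tag like soup.p['class'] raises a KeyError when the attribute is absent. A small sketch (the one-line document is made up for illustration) of the safer .get accessor:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com">x</a>', 'html.parser')
tag = soup.a

# Direct indexing works when the attribute exists...
print(tag['href'])     # http://example.com
# ...but for an attribute that may be missing, .get returns None
# instead of raising KeyError
print(tag.get('id'))   # None
```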
There are other useful operations too. For example, to pull out every link's URL:
```python
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
```
Using a for loop makes it easy to work through the child elements of a complex parent container. (Lines beginning with # show the printed output.)
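The same looping idea applies when walking a tag's direct children. A minimal sketch (the document is made up for illustration): .children yields text nodes as well as tags, so an isinstance check against bs4's Tag class keeps only the elements.

```python
from bs4 import BeautifulSoup, Tag

html = ('<p class="story">Once <a id="link1">Elsie</a>'
        ' and <a id="link2">Lacie</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# .children iterates direct children of the <p>, including the bare
# text pieces; keep only real tags and read their id attribute
ids = [c.get('id') for c in soup.p.children if isinstance(c, Tag)]
print(ids)  # ['link1', 'link2']
```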
The last example from the documentation strips all the tags from the page and displays only the text content. The method is as follows:
```python
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
```
Pulling out just the text content directly is also very convenient.
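get_text also takes optional arguments that control how the text pieces are joined. A short sketch (the one-line document is made up for illustration), assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

# Default: the raw text pieces are concatenated as-is
text_default = soup.get_text()
# With a separator and strip=True, each piece is stripped of
# surrounding whitespace and then joined with the separator
text_spaced = soup.get_text(' ', strip=True)

print(text_default)  # Hello world!
print(text_spaced)   # Hello world !
```

The strip=True form is handy on real crawled pages, which are usually full of stray newlines and indentation.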
The above covers the relatively simple, everyday uses of BeautifulSoup.
Python crawler --- BeautifulSoup (1)