Python crawler --- BeautifulSoup (1)


BeautifulSoup is a tool for parsing crawled content, and its find and find_all methods are particularly useful. After parsing, it builds a tree structure that exposes the web page as key-value pairs, somewhat like JSON, which makes operating on the page's content much easier and more convenient.

  

Installing the library needs little explanation: using Python's pip, simply run pip install beautifulsoup4 in cmd (the package is published as beautifulsoup4, even though it is imported as bs4).
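As a quick sanity check that the install worked (a minimal sketch; bs4 exposes a __version__ attribute):

    import bs4                # installed as beautifulsoup4, imported as bs4
    print(bs4.__version__)    # prints the installed version string, e.g. 4.12.3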

  

First, copy the example code from the documentation, as follows:

    from bs4 import BeautifulSoup

    # The "three sisters" sample document from the BeautifulSoup docs.
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all('a'))

In practice, html_doc would be the page we crawled; here it is simply the sample content provided in the documentation, which is convenient to work with directly.

We parse html_doc directly, using the built-in html.parser.
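When crawling a real page, html_doc would instead be fetched over HTTP. A minimal sketch using the standard library's urllib (the fetch step is not part of the original example, and the URL is just a placeholder):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Placeholder URL -- substitute the page you actually want to crawl.
    html_doc = urlopen('http://example.com/').read()
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all('a'))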

In Sublime Text, press Ctrl+B to run the script (installing a Python IDE plug-in package for Sublime is recommended, so scripts can be run directly without going through cmd).

  

Running the code produces the output below: every row containing an <a> tag is printed.

    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [Finished in 0.2s]
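This is also a good place to see the difference between the find and find_all methods mentioned at the start: find returns only the first match, while find_all returns every match in a list (a short sketch reusing the soup built above):

    print(soup.find('a'))            # only the first <a> tag (the Elsie link)
    print(len(soup.find_all('a')))   # 3 -- find_all returns a list of all matches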

Following the documentation, we can access the various parts of the soup and confirm that the results agree. (The snippet below is pasted directly from the documentation site, not retyped.)

  

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

As shown above, once parsing is complete the soup exposes the page as key-value style data, and attributes such as soup.title print out exactly the content required. (Lines beginning with # show the printed output.)
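A parsed tag also behaves like a dictionary over its HTML attributes, which is what makes this key-value access so convenient (a short sketch reusing the soup built above):

    link = soup.find(id="link1")
    print(link['href'])    # http://example.com/elsie
    print(link.get('id'))  # link1 -- .get() returns None instead of raising KeyError
    print(link.attrs)      # the whole attribute dict, e.g. {'href': ..., 'class': ['sister'], 'id': 'link1'}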

There are other ways to traverse the results as well:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Looping like this makes it easy to operate on the children of a complex parent element. (Again, lines beginning with # show the printed output.)
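The same looping style also works for walking a tag's direct children through the .children generator (a sketch reusing the soup above):

    # The first <p class="story"> is the paragraph holding the three links.
    story = soup.find('p', class_='story')
    for child in story.children:   # direct children: text nodes and <a> tags
        print(repr(child))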

The last example from the documentation extracts all of the text from the page and displays it directly. The method is as follows:

  

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Pulling out just the text content this way is also very convenient.
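get_text() also accepts optional arguments that are handy when cleaning up crawled pages, such as a separator string and whitespace stripping; a minimal sketch:

    # Join text fragments with a single space and strip surrounding whitespace.
    print(soup.get_text(separator=' ', strip=True))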

The above is a relatively simple introduction to using BeautifulSoup.

