Python's BeautifulSoup tag lookup and information extraction

Source: Internet
Author: User

I. Finding tags


(1) Find all a tags

>>> for x in soup.find_all('a'):
...     print(x)
...
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
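The article never shows how the soup object is built, so here is a minimal, self-contained sketch of the same lookup. The HTML_DOC string is an assumption: it is modelled on the "three sisters" sample from the Beautiful Soup documentation, which the printed output appears to come from. The built-in "html.parser" backend is also a choice made here, not something the article specifies.

```python
from bs4 import BeautifulSoup

# Assumed sample document (not shown in the article), modelled on the
# "three sisters" example from the Beautiful Soup documentation.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(HTML_DOC, "html.parser")

# find_all returns a list-like ResultSet of every matching Tag, in document order.
a_tags = soup.find_all("a")
for tag in a_tags:
    print(tag)
```

Each item in the result is a Tag object, so attributes can be read with tag["id"] or tag.get("href").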


(2) Find all a tags whose href attribute value contains the keyword "lacie"

>>> import re
>>> for x in soup.find_all('a', href=re.compile('lacie')):
...     print(x)
...
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
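To make the difference between a regular expression filter and a plain string filter concrete, the sketch below runs both against the same markup. The html string is an assumed reduction of the sample document to its three links; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, reduced to the three links used in the article.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is applied with search(), so it matches anywhere in href.
by_regex = soup.find_all("a", href=re.compile("lacie"))

# A plain string must equal the whole attribute value, so this finds nothing.
by_exact = soup.find_all("a", href="lacie")

print([a["id"] for a in by_regex])  # ['link2']
print(by_exact)                     # []
```

This is why the keyword search needs re.compile: the keyword is only a fragment of the URL, not the full attribute value.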


(3) Find all a tags whose string content contains the keyword "Elsie"

>>> for x in soup.find_all('a', string=re.compile('Elsie')):
...     print(x)
...
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
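The string filter is case-sensitive by default, which is easy to trip over because the link text is "Elsie" while the URL uses "elsie". A small sketch under the same assumed sample markup, showing the case-sensitive and case-insensitive variants side by side:

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, reduced to the three links used in the article.
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches: the tag's string is "Elsie".
exact_case = soup.find_all("a", string=re.compile("Elsie"))

# No match: the pattern is lowercase and regexes are case-sensitive by default.
wrong_case = soup.find_all("a", string=re.compile("elsie"))

# Matches again once the IGNORECASE flag is set.
any_case = soup.find_all("a", string=re.compile("elsie", re.IGNORECASE))

print(len(exact_case), len(wrong_case), len(any_case))  # 1 0 1
```

Note that the string argument requires Beautiful Soup 4.4.0 or later; older releases used the name text for the same filter.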


(4) Find all child tags of the body tag and print them in a loop

>>> import bs4
>>> for x in soup.find('body').children:
...     if isinstance(x, bs4.element.Tag):  # use isinstance to filter out blank-line (whitespace) content
...         print(x)
...
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
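The isinstance check is needed because .children yields every direct child node, including the whitespace between tags, which Beautiful Soup represents as NavigableString objects. The sketch below shows what .children actually contains and, as an alternative not used in the article, how find_all(True, recursive=False) returns only the direct child tags. The markup is an assumed, shortened version of the sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed, shortened sample markup: two paragraphs separated by newlines.
html = """<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# .children mixes Tag objects with the whitespace strings between them.
all_children = list(body.children)
whitespace_nodes = [c for c in all_children if isinstance(c, NavigableString)]

# Approach from the article: keep only Tag instances.
child_tags = [c for c in all_children if isinstance(c, Tag)]

# Alternative: match any tag (True), but only among direct children.
direct_tags = body.find_all(True, recursive=False)

print(len(all_children), len(whitespace_nodes))  # 5 3
print([t.name for t in child_tags])              # ['p', 'p']
```

Both approaches skip the nested b tag, because it is a grandchild of body rather than a direct child.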


II. Information Extraction (link extraction)


(1) Parse the tag structure, find all a tags, extract the value of the href attribute (that is, the link) from each a tag, and append it to an initially empty list.

>>> linklist = []
>>> for x in soup.find_all('a'):
...     link = x.get('href')
...     if link:
...         linklist.append(link)
...
>>> for x in linklist:  # verify: loop over linklist and print each link
...     print(x)
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


Summary: link extraction <---> attribute content extraction <---> x.get('href')
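The "if link" guard matters because x.get('href') returns None for an a tag that has no href attribute. A minimal sketch of this, assuming sample markup with one extra anchor (id="top", hypothetical) that lacks an href; it also shows href=True as a shorter way to skip such tags:

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the first anchor is a hypothetical one without href.
html = """
<a id="top">Top</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Loop form, as in the article: .get() returns None when href is missing.
linklist = []
for x in soup.find_all("a"):
    link = x.get("href")
    if link:
        linklist.append(link)

# Compact form: href=True keeps only tags that have an href attribute at all.
links = [a["href"] for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Both lists hold the same three URLs; the anchor without href is skipped in each case.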

(2) Parse the tag structure, find all a tags whose href contains the keyword "elsie", extract the link from each, and append it to an initially empty list.

>>> linklst = []
>>> for x in soup.find_all('a', href=re.compile('elsie')):
...     link = x.get('href')
...     if link:
...         linklst.append(link)
...
>>> for x in linklst:  # verify: loop over linklst and print each link
...     print(x)
...
http://example.com/elsie

Summary: when searching for tags, add a regular-expression match on the href attribute value <---> href=re.compile('elsie')
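The filter-then-extract pattern can be wrapped in a small helper. The function name links_matching is hypothetical, and escaping the keyword with re.escape is a choice made here so that characters such as "." in a keyword are treated literally; the article itself passes the pattern directly.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup, reduced to the three links used in the article.
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")


def links_matching(soup, keyword):
    """Return the href of every a tag whose href contains keyword (hypothetical helper)."""
    pattern = re.compile(re.escape(keyword))  # treat the keyword as literal text
    # The href filter only returns tags that have a matching href,
    # so indexing with a["href"] is safe here.
    return [a["href"] for a in soup.find_all("a", href=pattern)]


print(links_matching(soup, "elsie"))  # ['http://example.com/elsie']
print(links_matching(soup, "Elsie"))  # [] because the URL is lowercase
```

The second call returns an empty list: the match is case-sensitive, and the URLs in this sample are lowercase even though the link text is capitalized.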

(3) Parse the tag structure, find all a tags, and extract the string content of each a tag with get_text(); to keep only the tags whose string contains the keyword "Elsie", add string=re.compile('Elsie') to find_all and append the results to an initially empty list.

>>> for x in soup.find_all('a'):
...     string = x.get_text()
...     print(string)
...
Elsie
Lacie
Tillie
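The loop in item (3) prints the text of every a tag without filtering or storing anything. A minimal sketch of the filtered variant that the item describes, assuming the same three links: it keeps only the text containing the keyword and collects it in a list. A plain substring test on get_text() is used here as one possible approach; the names keyword and names are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, reduced to the three links used in the article.
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

keyword = "Elsie"  # case-sensitive substring to look for in the link text
names = []
for x in soup.find_all("a"):
    text = x.get_text()
    if keyword in text:
        names.append(text)

# Unfiltered version, matching the loop in the article.
all_names = [x.get_text() for x in soup.find_all("a")]

print(names)      # ['Elsie']
print(all_names)  # ['Elsie', 'Lacie', 'Tillie']
```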
