Using Python's BeautifulSoup in detail

Source: Internet
Author: User

Because of my job, a lot of my daily working time goes into repeatedly logging in to the intranet.

So I studied and tested BeautifulSoup in detail and summarized its usage here, as a reference for the next time I need to crawl pages.

The first step is to import the module and initialize it:

 from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with BeautifulSoup 4 use: from bs4 import BeautifulSoup
 soup = BeautifulSoup(opener)
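As a rough illustration of the initialization step, the sketch below builds a soup object from an in-memory HTML string rather than a live response. It assumes the BeautifulSoup 4 package (`bs4`) and Python's built-in `html.parser`; the sample markup and the name `html_doc` are made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical page source standing in for the fetched intranet page.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Naming the parser explicitly avoids depending on which parsers are installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)  # text inside <title>
```

BeautifulSoup also accepts an open file-like object in place of the string, which is probably what `opener` stands for in the line above.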

#1, Tag method
If you walk down the tags one level at a time and only need the first (or only) tag of a given name at each level, you can use:

soup.head.title


However, this approach cannot reach later sibling tags with the same name; something like title[2] does not work.
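The limitation just described can be worked around by collecting all same-name tags into a list and indexing that list. The following is a minimal sketch under the assumption that BeautifulSoup 4 (`bs4`) is used; the markup and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>First</title></head>"
    "<body><p>one</p><p>two</p><p>three</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access walks down one tag per step and
# always yields the first tag with that name.
title_tag = soup.head.title
first_p = soup.body.p

# To reach a later sibling with the same name, collect them all and index the list.
all_p = soup.body.find_all("p")
third_p = all_p[2]

print(title_tag.string, first_p.string, third_p.string)
```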

#2, .contents method
.contents follows the document tree and returns a list of child nodes (tags and strings). Note that .contents always returns a list, never a single element.
Use .contents to move down the tree, and .parent to move back up.

There are two ways to use it:

soup.contents
soup.contents[x].contents


For soup.contents, the return value is a list holding everything at the top level of the HTML document. For example, it might have three elements: [u'\n', ...]. soup.contents[x] retrieves a single element of that list.

soup.contents[x].contents


The return value is the list of all nodes one level below the target tag (that is, the children whose parent is that tag). If x is wrong here, it may cause an error: an element that is a plain string rather than a tag has no child list, so the trailing .contents fails.
For example:

soup.contents[1] = u'HTML'
soup.contents[2] = u'\n'
soup.contents[3] = <html>...</html>
and

soup.contents[3].contents = [u'\n', <head>...</head>, u'\n', <body>...</body>, u'\n']


And so on: soup.contents[3].contents[3] is the fourth element of the list above, the body tag.
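The walk down with .contents, the walk up with .parent, and the failure on string elements can be sketched as follows. This assumes BeautifulSoup 4 (`bs4`) with `html.parser`; the markup has no whitespace between tags, so the indices differ from the ones in the listing above. The helper `child_nodes` is hypothetical and only guards against calling .contents on a string.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<html><head><title>T</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

top_level = soup.contents      # list of top-level nodes
html_tag = top_level[0]        # the <html> tag
children = html_tag.contents   # one level further down: head and body
body_tag = children[1]

# .parent goes the other way, from a child back up to its enclosing tag.
assert body_tag.parent is html_tag


def child_nodes(node):
    """Return node.contents for tags, and an empty list for plain strings."""
    return node.contents if isinstance(node, Tag) else []


text_node = body_tag.p.contents[0]  # the string inside <p>
print(child_nodes(body_tag), child_nodes(text_node))
```

Checking the node type before chaining another .contents is one way to avoid the error that a wrong index x would otherwise trigger.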

#3, .next method
.next moves from one element, such as an item of a .contents list, to the element that follows it. It applies to a single element only, not to a list.
For example:

soup.contents[1] = u'HTML'
soup.contents[2] = u'\n'

Then soup.contents[1].next is equivalent to soup.contents[2].
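The equivalence above holds when the starting element is a plain string. For a tag that has children, stepping to the next element goes into the tag first. The sketch below shows both cases, assuming BeautifulSoup 4 (`bs4`), where the documented attribute names are `next_element` and `next_sibling`; the markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>intro<p>one</p><p>two</p></div>", "html.parser")
items = soup.div.contents        # ['intro', <p>one</p>, <p>two</p>]

# A string has no children, so its next element is the next list item.
after_string = items[0].next_element

# A tag with children: next_element follows parse order and steps inside it,
# while next_sibling stays on the same level of the tree.
inside_tag = items[1].next_element
same_level = items[1].next_sibling

print(repr(after_string), repr(inside_tag), repr(same_level))
```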

#Search methods

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

The two main methods are .find('p') and .findAll('p').
find returns a single result rather than a list: the first matching tag pair, searching from the start of the document (or None if nothing matches). If that first tag sits high in the tree and contains a lot of content, everything nested inside it, including inner tags with the same name, comes back as part of it.

findAll returns a list. When tags with the same name are nested, the inner tag shows up inside its parent's entry; because the search is recursive by default, it also appears later in the list as an element of its own.
For example:

soup.findAll(onclick='document.location ... ')
soup.findAll(attrs={'style': r'outline:none;'})  # finds the tags whose style attribute is 'outline:none;'
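To make the behaviour of the two search methods concrete, here is a small sketch with nested same-name tags and an attribute filter. It assumes BeautifulSoup 4 (`bs4`), where `find_all` is the current spelling of `findAll`; the markup, ids and hrefs are made up.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div id="outer"><div id="inner">x</div></div>'
    '<a style="outline:none;" href="/a">A</a>'
    '<a href="/b">B</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_div = soup.find("div")      # a single Tag, or None when nothing matches
all_divs = soup.find_all("div")   # always a list

# The search is recursive by default, so the nested div is listed on its own
# in addition to appearing inside the outer one.
div_ids = [d["id"] for d in all_divs]

# recursive=False restricts the search to direct children of the node searched.
top_divs = soup.find_all("div", recursive=False)

# Filtering on an attribute value through attrs.
styled = soup.find_all(attrs={"style": "outline:none;"})

print(div_ids, len(top_divs), [a["href"] for a in styled])
```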


#4, .attrs[x]: getting attribute values
After locating a single tag with any of the methods above, you can read its attributes through .attrs on the tag (a dictionary in BeautifulSoup 4; BeautifulSoup 3 returned a list of (name, value) pairs). Appending something like .attrs['id'] to the tag returns the value of the tag's id attribute.
For example:

 soup.contents[3] == <meta abc="god" href="/" />
 soup.contents[3].attrs == {'abc': 'god', 'href': '/'}
 soup.contents[3].attrs['href'] == '/'
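The same attribute lookup can be tried on a standalone tag. This is a minimal sketch assuming BeautifulSoup 4 (`bs4`), where .attrs is a dictionary; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<meta abc="god" href="/">', "html.parser")
meta = soup.find("meta")

attrs = meta.attrs         # dictionary of attribute names to values
href = meta["href"]        # shorthand for meta.attrs["href"]
missing = meta.get("id")   # None instead of a KeyError when the attribute is absent

print(attrs, href, missing)
```

Using .get() is a reasonable choice when a page may or may not carry the attribute, since indexing with a missing key raises KeyError.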


