Because of my job, a large part of my daily work goes into logging in to the intranet over and over.
So I studied and tested BeautifulSoup in detail and summarized the methods I need for crawling pages.
The first step is to import the module and initialize it:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; in version 4: from bs4 import BeautifulSoup
soup = BeautifulSoup(opener)
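As a minimal sketch of the initialization step, assuming the bs4 package (BeautifulSoup 4) on Python 3. The HTML string is made up for illustration and stands in for the fetched page.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical page content standing in for the fetched intranet page.
html = "<html><head><title>Intranet</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser that ships with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string
print(page_title)
```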
#1, Tag method
If you go down one tag per level, and the tag you want at each level is the first or the only one with that name, you can use
soup.head.title
However, this cannot reach the later ones among several sibling tags with the same name: there is no title[2].
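A short sketch of dotted tag access and its limit, assuming BeautifulSoup 4 (bs4). The sample page is invented. Indexing the list returned by find_all is one way to reach the later siblings.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

# Hypothetical page with three sibling <p> tags.
html = ("<html><head><title>Intranet</title></head>"
        "<body><p>first</p><p>second</p><p>third</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Dotted access walks down one tag per level.
title_tag = soup.head.title
print(title_tag.string)

# soup.body.p only ever yields the first <p>; siblings with the same name
# are reached by indexing the list returned by find_all instead.
first_p = soup.body.p
third_p = soup.body.find_all("p")[2]
print(first_p.string, third_p.string)
```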
#2, .contents method
This navigates by the document tree and returns a list of child nodes (tags and strings). Note that .contents always returns a list, never a single element.
Use contents to move down the tree to the children, and parent to move up the tree to the enclosing tag.
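The two directions can be sketched like this, assuming BeautifulSoup 4 (bs4) and an invented page without whitespace between tags:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

html = "<html><head><title>T</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .contents is always a list, even when there is a single child.
children = soup.html.contents
print(type(children).__name__, len(children))

body = children[1]          # step down to one child
print(body.name)
print(body.parent.name)     # step back up to the enclosing tag
```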
There are two uses:
soup.contents
The return value is a list holding the entire top-level content of the HTML document. For example, it may have three elements: [u'\n', ...]. soup.contents[x] then picks out one element of that list.
soup.contents[x].contents
The return value is the list of all nodes one level below the target tag (that is, with that tag as the parent, the list of its children). If x is wrong here, the call may raise
an error: an element that is not a tag (a string, for example) has no child list, so the trailing .contents fails.
For example:
soup.contents[1] = u'HTML'
soup.contents[2] = u'\n'
soup.contents[3] = ...
and
soup.contents[3].contents = [u'\n', '\n', <body>...</body>, u'\n']
And so on: soup.contents[3].contents[x] picks element x of the list above, counting from 0, so the body tag in the list as shown is soup.contents[3].contents[2].
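Because whitespace strings and tags sit side by side in a contents list, it can help to check what an index points at before chaining another .contents. A minimal sketch, assuming BeautifulSoup 4 (bs4) and a made-up page. The indices depend entirely on the page being parsed.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4
from bs4.element import Tag

# Hypothetical page; the newlines become string nodes in the tree.
html = "<html>\n<head></head>\n<body><p>x</p></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

html_tag = soup.contents[0]
kids = html_tag.contents

# Only Tag objects have children of their own, so test before chaining.
for i, node in enumerate(kids):
    if isinstance(node, Tag):
        print(i, node.name, len(node.contents))
    else:
        print(i, repr(node), "is a string, not a tag")

tag_names = [node.name for node in kids if isinstance(node, Tag)]
```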
#3, .next method
.next applies only to a single element, such as one item of a contents list; it cannot be called on the list itself.
For example:
soup.contents[1] = u'HTML'
soup.contents[2] = u'\n'
Then soup.contents[1].next is equivalent to soup.contents[2]
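A small sketch of the same idea, assuming BeautifulSoup 4 (bs4), where this attribute is spelled next_element. It moves to the next element in parse order, so for a text node that is the following sibling, while for a tag with children it is the first child. The sample markup is invented.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

html = "<p>one</p>\n<p>two</p>"
soup = BeautifulSoup(html, "html.parser")

# Three top-level nodes: a tag, a newline string, and another tag.
first, newline, second = soup.contents

# For a text node, the next parsed element is the following sibling.
print(newline.next_element is second)

# For a tag with children, the next parsed element is its first child.
print(repr(first.next_element))

# next_sibling stays on the same level regardless of children.
print(first.next_sibling is newline)
```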
# Search methods
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
The two main ones: .find('p') and .findAll('p')
find returns a single tag object rather than a list: the first matching tag pair, searching from the start of the document. If that first tag pair sits high in the tree and holds a lot of content,
everything nested inside it, including inner tags of the same name, comes back as part of the result.
findAll returns a list. If a tag contains more tags of the same name, the inner tags are shown inside their parent's entry; with the default recursive=True each of them is also returned
as its own list element, and recursive=False limits the search to direct children.
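The behaviour with nested tags of the same name can be checked with a small sketch, assuming BeautifulSoup 4 (bs4), where find_all is the current spelling of findAll. The ids are made up. With the default recursive=True the nested tag shows up both inside its parent and as its own list element.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

html = '<div id="outer"><div id="inner">x</div></div><div id="last">y</div>'
soup = BeautifulSoup(html, "html.parser")

# find returns one Tag (or None when nothing matches).
first = soup.find("div")
print(first["id"])
print(first.find("div")["id"])   # the nested div lives inside the result

# find_all returns a list, in document order.
all_ids = [d["id"] for d in soup.find_all("div")]
print(all_ids)

# recursive=False only looks at direct children of the object searched.
top_ids = [d["id"] for d in soup.find_all("div", recursive=False)]
print(top_ids)
```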
For example:
soup.findAll(onclick='document.location ... ')
soup.findAll(attrs={'style': r'outline:none;'})  # finds the tags whose attributes include style='outline:none;'
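Both forms of attribute search can be tried on a toy page. This is a sketch assuming BeautifulSoup 4 (bs4). The links and the onclick value go() are hypothetical. A plain string filter matches the attribute value exactly.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

# Hypothetical links: one styled, one with an onclick handler, one plain.
html = ('<a style="outline:none;" href="/a">A</a>'
        '<a onclick="go()" href="/b">B</a>'
        '<a href="/c">C</a>')
soup = BeautifulSoup(html, "html.parser")

# Search through the attrs dictionary.
by_style = soup.find_all(attrs={"style": "outline:none;"})

# Search with the attribute name as a keyword argument.
by_onclick = soup.find_all(onclick="go()")

style_hrefs = [tag["href"] for tag in by_style]
onclick_hrefs = [tag["href"] for tag in by_onclick]
print(style_hrefs, onclick_hrefs)
```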
#4, .attrs[x]: getting attribute values
After locating a single tag with the methods described above, use .attrs on that tag to get its attributes.
Appending .attrs['id'] (and so on) to the tag returns the value of the tag's id attribute. This works in BeautifulSoup 4, where .attrs is a dict; in BeautifulSoup 3 .attrs is a list of (name, value) pairs, and tag['id'] is the lookup by name.
For example:
soup.contents[3] == <meta abc="god" href="/" />
soup.contents[3].attrs == {'abc': 'god', 'href': '/'}
soup.contents[3].attrs['href'] == '/'
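The same meta tag can be inspected in a short sketch, assuming BeautifulSoup 4 (bs4), where .attrs is a dict keyed by attribute name:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

soup = BeautifulSoup('<meta abc="god" href="/" />', "html.parser")
meta = soup.find("meta")

print(meta.attrs)             # all attributes of the tag, as a dict
print(meta.attrs["href"])     # one attribute value by name
print(meta["abc"])            # shorthand for meta.attrs["abc"]
print(meta.get("id"))         # a missing attribute gives None, not an error
```

Using get avoids a KeyError when a tag may lack the attribute.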