1. Import BeautifulSoup
from BeautifulSoup import BeautifulSoup
2. Instantiate a soup object
html="
The HTML string can be read from a local file or fetched over the network.
The HTML used for testing is:
3. BeautifulSoup objects
BeautifulSoup works with three kinds of objects:
1). Soup object
print type(soup)  # <class 'BeautifulSoup.BeautifulSoup'>
2). Tag object
print type(soup.html)  # <class 'BeautifulSoup.Tag'>
3). String object
print type(soup.div.string)  # <class 'BeautifulSoup.NavigableString'>
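The three object types can be checked with a short script. This is only a sketch under assumptions: it uses Python 3 and the newer bs4 package rather than the BeautifulSoup 3 module used in this article, it picks the standard-library "html.parser" backend, and the sample_html string is a made-up stand-in, not the test HTML of this article.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample document, for illustration only.
sample_html = (
    "<html><head><title>demo</title></head>"
    '<body><div id="div1">i am div1</div></body></html>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# The soup itself, a tag inside it, and the text inside a tag
# are three different types.
print(type(soup).__name__)             # BeautifulSoup
print(type(soup.html).__name__)        # Tag
print(type(soup.div.string).__name__)  # NavigableString
```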
4. Parse the soup
1). Use tags
soup = BeautifulSoup(html)
print soup.html
print soup.body
print soup.p
An element can be reached directly by its tag name, but only the first matching tag is returned. For example, if the document contains two <p> tags, soup.p returns only the first one.
This form of access returns a Tag object.
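The "first match only" behavior is easy to see on a document with two <p> tags. The sketch below assumes Python 3 with the bs4 package (where findAll is also spelled find_all) and a hypothetical two-paragraph document.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <p> tags.
sample_html = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access stops at the first matching tag.
first_p = soup.p
print(first_p.string)  # first

# To get every <p>, search explicitly instead.
all_p = soup.find_all("p")
print(len(all_p))  # 2
```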
2). contents, parent
contents returns the child elements of a node as a list. For example, soup.contents[0] returns the html node, and soup.contents[0].contents is a list containing the head and body tags.
Oddly, len(soup.contents[0].contents) equals 5: besides the head and body tags there are three more elements. These are most likely the whitespace (newline) strings between the tags, which the parser keeps as text nodes.
Use contents to move down the tree to child nodes, and use parent to move back up to the enclosing tag.
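The extra entries in contents can be inspected directly. The following sketch assumes Python 3, the bs4 package with the "html.parser" backend, and a hypothetical document whose tags are separated by newlines; other parsers or other input may produce a different number of whitespace nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document; note the newlines between the tags.
sample_html = (
    "<html>\n"
    "<head><title>demo</title></head>\n"
    "<body><p>hello</p></body>\n"
    "</html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

root = soup.contents[0]  # the <html> tag
children = root.contents

# Keep only real tags, dropping the whitespace strings between them.
tag_children = [c.name for c in children if isinstance(c, Tag)]
print(tag_children)  # ['head', 'body']

# parent walks back up: <p> sits inside <body>.
print(soup.p.parent.name)  # body
```

Filtering on Tag is one simple way to skip the whitespace strings when only the element children matter.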
3). next
next returns the element that follows in parse order; for a tag that has content, this is its first child.
print soup.div.next  # i am div1
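In the newer bs4 package the same step is spelled next_element. The sketch below assumes Python 3, bs4, and a hypothetical two-div document; it shows that the step follows parse order, so it goes from a tag to its text and from that text to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two sibling divs.
sample_html = '<div id="div1">i am div1</div><div id="div2">i am div2</div>'
soup = BeautifulSoup(sample_html, "html.parser")

# From the first <div>, the next parsed element is its own text.
text_node = soup.div.next_element
print(text_node)  # i am div1

# From that text, the next parsed element is the second <div>.
following = text_node.next_element
print(following["id"])  # div2
```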
4). findAll
Two search methods are provided: find and findAll. Both are available only on Tag objects and on the top-level soup object; NavigableString objects do not have them.
findAll(name, attrs, recursive, text, limit, **kwargs)
soup.findAll('div')  # [<div id="div1">i am div1</div>, <div id="div2">i am div2</div>]
print soup.findAll('div', id='div1')  # [<div id="div1">i am div1</div>]
print soup.findAll('div', {'id': 'div2'})  # [<div id="div2">i am div2</div>]
The attrs argument can be passed as a dictionary.
import re
pat = re.compile(r'div\d+')
print soup.findAll('div', {'id': pat})  # [<div id="div1">i am div1</div>, <div id="div2">i am div2</div>]
An attribute value can also be matched with a regular expression.
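The three ways of filtering by attribute can be compared side by side. This sketch assumes Python 3 and the bs4 package, where the method is spelled find_all; the sample document is hypothetical, with a third div added so that the regular expression has something to reject.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document: two ids match div\d+, one does not.
sample_html = (
    '<div id="div1">i am div1</div>'
    '<div id="div2">i am div2</div>'
    '<div id="other">i am other</div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Attribute given as a keyword argument.
by_kwarg = soup.find_all("div", id="div1")

# Attribute given as a dictionary (the attrs argument).
by_dict = soup.find_all("div", {"id": "div2"})

# Attribute value matched by a compiled regular expression.
by_regex = soup.find_all("div", {"id": re.compile(r"div\d+")})

print([d["id"] for d in by_kwarg])  # ['div1']
print([d["id"] for d in by_dict])   # ['div2']
print([d["id"] for d in by_regex])  # ['div1', 'div2']
```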
5. Modify attributes
pat = re.compile(r'div\d+')
a = soup.findAll('div', {'id': pat})[0]
a['id'] = 'ddd'
print a
This changes the id attribute of the tag.
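A self-contained version of this step might look as follows. It is a sketch assuming Python 3 and the bs4 package, with a hypothetical one-div document; find is used here in place of findAll(...)[0], which returns the same first match.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document with a single div.
sample_html = '<div id="div1">i am div1</div>'
soup = BeautifulSoup(sample_html, "html.parser")

# Locate the first div whose id matches the pattern.
tag = soup.find("div", {"id": re.compile(r"div\d+")})

# Assigning to a key rewrites the attribute in the tree itself.
tag["id"] = "ddd"

print(tag)  # <div id="ddd">i am div1</div>
```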
6. Access attributes
a[‘id‘]
Indexing a tag like a dictionary reads the value of that attribute.
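Dictionary-style indexing raises an error when the attribute is absent, so a lookup with a fallback is sometimes handier. The sketch below assumes Python 3 and the bs4 package; the sample document and the "title" attribute name are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document; the div has an id but no title attribute.
sample_html = '<div id="div1">i am div1</div>'
soup = BeautifulSoup(sample_html, "html.parser")
tag = soup.div

# Direct indexing works for attributes that exist.
tag_id = tag["id"]
print(tag_id)  # div1

# get() returns None (or a given default) for missing attributes.
missing = tag.get("title")
print(missing)  # None

# Direct indexing on a missing attribute raises KeyError.
try:
    tag["title"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```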
Reference: http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html