A basic tutorial on parsing HTML in Python with the BeautifulSoup library


BeautifulSoup is a third-party Python library that parses HTML/XML content so that specific information can be extracted from a webpage. The latest version is v4; this article summarizes some common methods for parsing HTML with v3.

Preparation

1. Install Beautiful Soup

This article uses Beautiful Soup to parse the content of a page, although the example here is simple enough that plain string operations would also work.

Run

sudo easy_install beautifulsoup4

to install it.

2. Install the requests Module

The requests module is used to load the web page to be requested.

Enter import requests at the Python prompt; if an error is raised, the requests module is not installed.
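
As a quick hedged check (a Python 2 interpreter session, matching the rest of this article):

import requests               # raises ImportError if the module is missing
print requests.__version__   # prints the installed version when the import succeeds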

I planned to install modules with the easy_install tool, but found that the easy_install command did not exist on my system, so I first ran sudo apt-get install python-setuptools to install it.

Run sudo easy_install requests to install the requests module.

Basics

1. Initialization
Import Module

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup           # process html
#from BeautifulSoup import BeautifulStoneSoup     # process xml
#import BeautifulSoup                             # all

Create an object: initialize from a string. In practice a BeautifulSoup object is usually initialized with HTML returned by urllib2 or by a browser.

doc = ['<html><head><title>hello</title></head>',
       '<body><p>This is paragraph one of pythonclub.org.</p>',
       '<p>This is paragraph two of pythonclub.org.</p>',
       '</body></html>']
soup = BeautifulSoup(''.join(doc))
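
As a minimal sketch of the urllib2 case mentioned above (the URL is a placeholder; respHtml is reused in the encoding example below):

import urllib2
from BeautifulSoup import BeautifulSoup

respHtml = urllib2.urlopen('http://example.com').read()   # fetch the raw html (placeholder URL)
soup = BeautifulSoup(respHtml)                            # parse it the same way as a plain string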

Encoding: when the HTML uses another encoding (something other than UTF-8 or ASCII), such as GB2312, the corresponding character encoding must be specified so that BeautifulSoup can parse it correctly.

htmlCharset = "GB2312"soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)

2. Get tag content
Search for the tag block you are interested in; the parse tree of the matching tag block is returned.

head = soup.find('head')
#head = soup.head
#head = soup.contents[0].contents[0]
print head

Returned content: <head><title>hello</title></head>
Description: the contents attribute is a list holding the direct children in the parse tree.

html = soup.contents[0]    # the <html> ... </html> block

3. Get related nodes
Use parent to obtain the parent node

body = soup.body
html = body.parent    # html is the parent of body

Use nextSibling and previousSibling to obtain the siblings immediately after and before a node.

head = body.previousSibling    # head and body are on the same level; head is the sibling just before body
p1 = body.contents[0]          # p1 and p2 are children of body; contents[0] gets p1
p2 = p1.nextSibling            # p2 is on the same level as p1, the sibling just after it; body.contents[1] works too

Flexible use of contents[] can also locate related nodes. For ancestors or descendants, use findParent(s), findNextSibling(s), and findPreviousSibling(s), as in the sketch below.
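
A minimal hedged sketch of those methods (BS3 API), reusing the soup built in the initialization example:

p1 = soup.find('p')                       # first <p> in the document
body = p1.findParent('body')              # nearest <body> ancestor
p2 = p1.findNextSibling('p')              # sibling <p> after p1
p1_again = p2.findPreviousSibling('p')    # sibling <p> before p2, i.e. p1 again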

4. Detailed usage of find/findAll
Function prototype: find(name=None, attrs={}, recursive=True, text=None, **kwargs). findAll returns every matching result as a list.
Tag search

find(tagname)          # search for a tag named tagname, e.g. find('head')
find(list)             # search for any tag in the list, e.g. find(['head', 'body'])
find(dict)             # search for any tag in the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))   # search by regular expression, e.g. find(re.compile('^p')) matches tags starting with p
find(lambda)           # search with a function that returns True, e.g. find(lambda name: len(name) == 1) matches tags whose name is one character long
find(True)             # search all tags

Attrs search

find(id='xxx')                                         # tags whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # tags whose id matches the regex and whose align attribute is xxx
find(attrs={'id': True, 'align': None})                # tags that have an id attribute but no align attribute

resp1 = soup.findAll('a', attrs={'href': match1})
resp2 = soup.findAll('h1', attrs={'class': match2})
resp3 = soup.findAll('img', attrs={'id': match3})

Text Search
Searching by text disables the other criteria, such as tag and attrs. Otherwise the usage is the same as for tag search.

print p1.text    # u'This is paragraph one of pythonclub.org.'
print p2.text    # u'This is paragraph two of pythonclub.org.'
# Note:
# 1. The text of a tag includes the text of the tag itself and of all its descendants.
# 2. All text is automatically converted to unicode; call encode(xxx) to transcode manually if needed.
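
A minimal hedged sketch of searching by text (BS3 API), reusing the doc from the initialization example:

import re

soup.find(text='This is paragraph one of pythonclub.org.')   # exact match, returns the text node
soup.findAll(text=re.compile('paragraph'))                    # every text node matching the regex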

Recursive and limit attributes
recursive=False searches only the direct children; otherwise the whole subtree is searched (the default is True). When findAll or a similar method returns a list, the limit argument caps the number of results, e.g. findAll('p', limit=2) returns the first two matching tags.
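
A minimal hedged sketch, reusing the soup and body objects from the examples above:

soup.findAll('p', limit=2)            # stop after the first two <p> tags found
body.findAll('p', recursive=False)    # only <p> tags that are direct children of body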

Example
This section uses the article list page of a blog as an example and extracts the article titles from the page with Python.

The article list on that page has the following HTML structure:

<Ul class = "listing"> <li class = "listing-item"> <span class = "date"> 2014-12-03 </span> <a href = "/post/linux_funtion_advance_feature "title =" advanced features of Linux functions "> advanced features of Linux functions </a> </li> <li class =" listing-item "> <span class =" date "> 2014-12-02 </span> <a href = "/post/cgdb" title = "cgdb usage"> cgdb usage </a> </li>... </ul>

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'a http parse test program'

__author__ = 'kuring lv'

import requests
import bs4

archives_url = "http://kuring.me/archive"

def start_parse(url):
    print "start to get (%s) content" % url
    response = requests.get(url)
    print "webpage content retrieved"
    soup = bs4.BeautifulSoup(response.content.decode("utf-8"))
    #soup = bs4.BeautifulSoup(response.text)
    # the with statement guards against forgetting to call close()
    # the file is written in utf-8
    with open('archives.txt', 'w') as f:
        for archive in soup.select("li.listing-item a"):
            f.write(archive.get_text().encode('utf-8') + "\n")
            print archive.get_text().encode('utf-8')

# when this module is run from the command line, __name__ equals '__main__'
# when another module imports it, __name__ equals 'parse_html'
if __name__ == '__main__':
    start_parse(archives_url)

