A basic tutorial on parsing HTML in Python with the BeautifulSoup library


BeautifulSoup is a third-party Python library that parses HTML/XML content so that specific information can be extracted from a webpage. The latest version is v4; this article summarizes some common methods for parsing HTML with v3.

Preparation

1. Install Beautiful Soup

This article uses Beautiful Soup to parse the content of a page, although the example here is simple enough that plain string operations would also work.

Run

sudo easy_install beautifulsoup4

to install it.

2. Install the requests Module

The requests module is used to load the web page to be requested.

Enter import requests at the Python prompt; if an error is raised, the requests module is not installed.
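
As a quick hedged check (a Python 2 interpreter session, matching the rest of this article):

import requests               # raises ImportError if the module is missing
print requests.__version__   # prints the installed version when the import succeeds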

I planned to install modules with the easy_install tool, but found that the easy_install command did not exist on my system, so I first ran sudo apt-get install python-setuptools to install it.

Run sudo easy_install requests to install the requests module.

Basics

1. Initialization
Import Module

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup           # process html
#from BeautifulSoup import BeautifulStoneSoup     # process xml
#import BeautifulSoup                             # all

Create an object: initialize from a string. In practice a BeautifulSoup object is usually initialized with HTML returned by urllib2 or by a browser.

doc = ['<html><head><title>hello</title></head>',
       '<body><p>This is paragraph one of pythonclub.org.</p>',
       '<p>This is paragraph two of pythonclub.org.</p>',
       '</body></html>']
soup = BeautifulSoup(''.join(doc))
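
As a minimal sketch of the urllib2 case mentioned above (the URL is a placeholder; respHtml is reused in the encoding example below):

import urllib2
from BeautifulSoup import BeautifulSoup

respHtml = urllib2.urlopen('http://example.com').read()   # fetch the raw html (placeholder URL)
soup = BeautifulSoup(respHtml)                            # parse it the same way as a plain string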

Encoding: when the HTML uses another encoding (something other than UTF-8 or ASCII), such as GB2312, the corresponding character encoding must be specified so that BeautifulSoup can parse it correctly.

htmlCharset = "GB2312"soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)

2. Get tag content
Search for the tag block you are interested in; the parse tree of the matching tag block is returned.

head = soup.find('head')
#head = soup.head
#head = soup.contents[0].contents[0]
print head

Returned content: <head><title>hello</title></head>
Description: the contents attribute is a list holding the direct children in the parse tree.

html = soup.contents[0]    # the <html> ... </html> block

3. Get related nodes
Use parent to obtain the parent node

body = soup.body
html = body.parent    # html is the parent of body

Use nextSibling and previousSibling to obtain the siblings immediately after and before a node.

head = body.previousSibling    # head and body are on the same level; head is the sibling just before body
p1 = body.contents[0]          # p1 and p2 are children of body; contents[0] gets p1
p2 = p1.nextSibling            # p2 is on the same level as p1, the sibling just after it; body.contents[1] works too

Flexible use of contents[] can also locate related nodes. For ancestors or descendants, use findParent(s), findNextSibling(s), and findPreviousSibling(s), as in the sketch below.
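
A minimal hedged sketch of those methods (BS3 API), reusing the soup built in the initialization example:

p1 = soup.find('p')                       # first <p> in the document
body = p1.findParent('body')              # nearest <body> ancestor
p2 = p1.findNextSibling('p')              # sibling <p> after p1
p1_again = p2.findPreviousSibling('p')    # sibling <p> before p2, i.e. p1 again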

4. Detailed usage of find/findAll
Function prototype: find(name=None, attrs={}, recursive=True, text=None, **kwargs). findAll returns every matching result as a list.
Tag search

find(tagname)          # search for a tag named tagname, e.g. find('head')
find(list)             # search for any tag in the list, e.g. find(['head', 'body'])
find(dict)             # search for any tag in the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))   # search by regular expression, e.g. find(re.compile('^p')) matches tags starting with p
find(lambda)           # search with a function that returns True, e.g. find(lambda name: len(name) == 1) matches tags whose name is one character long
find(True)             # search all tags

Attrs search

find(id='xxx')                                         # tags whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # tags whose id matches the regex and whose align attribute is xxx
find(attrs={'id': True, 'align': None})                # tags that have an id attribute but no align attribute

resp1 = soup.findAll('a', attrs={'href': match1})
resp2 = soup.findAll('h1', attrs={'class': match2})
resp3 = soup.findAll('img', attrs={'id': match3})

Text Search
Searching by text disables the other criteria, such as tag and attrs. Otherwise the usage is the same as for tag search.

print p1.text    # u'This is paragraph one of pythonclub.org.'
print p2.text    # u'This is paragraph two of pythonclub.org.'
# Note:
# 1. The text of a tag includes the text of the tag itself and of all its descendants.
# 2. All text is automatically converted to unicode; call encode(xxx) to transcode manually if needed.
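
A minimal hedged sketch of searching by text (BS3 API), reusing the doc from the initialization example:

import re

soup.find(text='This is paragraph one of pythonclub.org.')   # exact match, returns the text node
soup.findAll(text=re.compile('paragraph'))                    # every text node matching the regex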

Recursive and limit attributes
recursive=False searches only the direct children; otherwise the whole subtree is searched (the default is True). When findAll or a similar method returns a list, the limit argument caps the number of results, e.g. findAll('p', limit=2) returns the first two matching tags.
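
A minimal hedged sketch, reusing the soup and body objects from the examples above:

soup.findAll('p', limit=2)            # stop after the first two <p> tags found
body.findAll('p', recursive=False)    # only <p> tags that are direct children of body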

Example
This section uses the article list page of a blog as an example and extracts the article titles from the page with Python.

The article list on that page has the following HTML structure:

<Ul class = "listing"> <li class = "listing-item"> <span class = "date"> 2014-12-03 </span> <a href = "/post/linux_funtion_advance_feature "title =" advanced features of Linux functions "> advanced features of Linux functions </a> </li> <li class =" listing-item "> <span class =" date "> 2014-12-02 </span> <a href = "/post/cgdb" title = "cgdb usage"> cgdb usage </a> </li>... </ul>

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'a http parse test program'

__author__ = 'kuring lv'

import requests
import bs4

archives_url = "http://kuring.me/archive"

def start_parse(url):
    print "start to get (%s) content" % url
    response = requests.get(url)
    print "webpage content retrieved"
    soup = bs4.BeautifulSoup(response.content.decode("utf-8"))
    #soup = bs4.BeautifulSoup(response.text)
    # the with statement guards against forgetting to call close()
    # the file is written in utf-8
    with open('archives.txt', 'w') as f:
        for archive in soup.select("li.listing-item a"):
            f.write(archive.get_text().encode('utf-8') + "\n")
            print archive.get_text().encode('utf-8')

# when this module is run from the command line, __name__ equals '__main__'
# when another module imports it, __name__ equals 'parse_html'
if __name__ == '__main__':
    start_parse(archives_url)

