Using Python's BeautifulSoup Library to Parse HTML: Basic Usage Tutorial

BeautifulSoup is a third-party Python library that helps parse content such as HTML/XML in order to extract specific information from pages. The latest release is v4; this article mainly summarizes the common methods I used with v3 to parse HTML.

Preparation

1. Installing Beautiful Soup

To parse the content of a page, this article uses Beautiful Soup. Of course, the examples in this article are simple enough that plain string operations could also handle them.

Run

sudo easy_install beautifulsoup4

to install it.

2. Installing the requests module

The requests module is used to fetch the web page to be parsed.

Entering import requests in the Python interactive shell raised an error stating that the requests module is not installed.

I intended to install it online with easy_install, only to find that the easy_install command was not present on the system. The easy_install tool itself can be installed with sudo apt-get install python-setuptools.

Then execute sudo easy_install requests to install the requests module.
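To verify the installation, here is a minimal sketch; the URL is just a placeholder:

import requests

response = requests.get('http://example.com')   # placeholder URL
print response.status_code                      # 200 means the request succeeded
print len(response.content)                     # size of the raw page in bytes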

Basics

1. Initialization
Import the module:

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup          # process HTML
#from BeautifulSoup import BeautifulStoneSoup    # process XML
#import BeautifulSoup                            # everything

Create the object: initialize from a string. Commonly, the HTML returned by urllib2 or by a browser is used to initialize the BeautifulSoup object.

doc = ['<html><head><title>Hello</title></head>',
       '<body><p>This is paragraph one of the pythonclub.org.</p>',
       '<p>This is paragraph two of the pythonclub.org.</p>',
       '</body></html>']
soup = BeautifulSoup(''.join(doc))

Specify the encoding: when the HTML uses an encoding other than UTF-8 or ASCII, such as GB2312, you need to pass the appropriate character encoding so that BeautifulSoup can parse it correctly.

htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)
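For example, a minimal sketch of fetching and parsing a GB2312 page with urllib2; the URL is hypothetical:

import urllib2
from BeautifulSoup import BeautifulSoup

respHtml = urllib2.urlopen('http://example.com/gb2312.html').read()   # hypothetical URL
soup = BeautifulSoup(respHtml, fromEncoding='GB2312')
print soup.originalEncoding   # the encoding BeautifulSoup actually used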

2. Getting tag content
Find the tag block of interest; the result is the parse tree corresponding to that tag block.

head = soup.find('head')
#head = soup.head
#head = soup.contents[0].contents[0]
print head

Output: <head><title>Hello</title></head>
To explain: the contents attribute is a list that holds the direct children of a node in the parse tree.

html = soup.contents[0]    # <html> ... </html>
head = html.contents[0]    # <head> ... </head>
body = html.contents[1]    # <body> ... </body>

3. Getting related nodes
Get the parent node using parent:

body = soup.body
html = body.parent    # html is the parent of body

Use nextSibling and previousSibling to get the siblings before and after a node:

head = body.previousSibling   # head and body are on the same level; head is the sibling before body
p1 = body.contents[0]         # p1 and p2 are children of body; contents[0] gets p1
p2 = p1.nextSibling           # p2 is on the same level as p1 and is the sibling after p1; body.contents[1] also works

Flexible use of contents[] can also locate related nodes. To search for ancestors, use findParent(s); to search the siblings after or before a node, use findNextSibling(s) and findPreviousSibling(s), as sketched below.
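A short sketch of these methods on the sample document above (illustrative, using the v3 method names):

p1 = soup.find('p')                        # the first <p> in the sample document
body = p1.findParent('body')               # nearest ancestor tag named body
p2 = p1.findNextSibling('p')               # the sibling <p> after p1
print p2.findPreviousSibling('p') is p1    # True: walks back to p1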

4. find/findAll usage explained
Function prototype: find(name=None, attrs={}, recursive=True, text=None, **kwargs). findAll takes the same arguments (plus limit, covered below) and returns all matching results as a list.
Tag Search

find(tagname)         # search directly by tag name, e.g. find('head')
find(list)            # search for any tag in a list, e.g. find(['head', 'body'])
find(dict)            # search for any tag in a dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))  # search by regular expression, e.g. find(re.compile('^p')) finds tags whose names begin with p
find(lambda)          # search by a function that returns True for a match, e.g. find(lambda tag: len(tag.name) == 1) finds tags whose name has length 1
find(True)            # search all tags

Attrs Search

find(id='xxx')                                         # find tags whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # find tags whose id matches the regex and whose align attribute is xxx
find(attrs={'id': True, 'align': None})                # find tags that have an id attribute but no align attribute

resp1 = soup.findAll('a', attrs={'href': match1})
resp2 = soup.findAll('h1', attrs={'class': match2})
resp3 = soup.findAll('img', attrs={'id': match3})

Text Search
Searching by text causes the other search values, such as tag and attrs, to be ignored; otherwise the method works the same way as tag search.

print p1.text
# u'This is paragraph one of the pythonclub.org.'
print p2.text
# u'This is paragraph two of the pythonclub.org.'

Note: 1. The text of a tag includes the text of the tag itself and of all its descendants. 2. All text has already been converted to Unicode automatically; if needed, you can re-encode it yourself with encode(xxx).
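A sketch of an actual text search on the sample document (the regular expression is illustrative):

import re

hits = soup.findAll(text=re.compile('paragraph'))   # matches NavigableString nodes, not tags
for hit in hits:
    print hit.parent.name, ':', hit                 # the enclosing tag name and the text itself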

Recursive and limit attributes
recursive=False means only the direct children are searched; otherwise the entire subtree is searched. The default is True. When using findAll or similar methods that return a list, the limit attribute restricts the number of results returned; for example, findAll('p', limit=2) returns only the first two tags found.
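A sketch of both attributes on the sample document (illustrative):

body = soup.body
print len(body.findAll('p', recursive=False))   # 2: both <p> tags are direct children of <body>
print len(soup.findAll('p', recursive=False))   # 0: soup's only direct child is <html>
print len(soup.findAll('p', limit=1))           # 1: stop after the first match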

Example
This article takes my blog's article list page as an example and uses Python to extract the article titles from the page.

The HTML of the article list section of the page is as follows:

 
  
  
<ul>
  <li class="listing-item">2014-12-03 <a href="...">Linux functions advanced features</a></li>
  <li class="listing-item">2014-12-02 <a href="...">Use of cgdb</a></li>
  ...
</ul>

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'a HTTP parse test program'

__author__ = 'kuring lv'

import requests
import bs4

archives_url = "http://kuring.me/archive"

def start_parse(url):
    print "start getting (%s) content" % url
    response = requests.get(url)
    print "get page content complete"
    soup = bs4.BeautifulSoup(response.content.decode("utf-8"))
    #soup = bs4.BeautifulSoup(response.text)
    # the with statement guards against a missing call to the close method
    # the file is written with utf-8 encoding
    with open('archives.txt', 'w') as f:
        for archive in soup.select("li.listing-item a"):
            f.write(archive.get_text().encode('utf-8') + "\n")
            print archive.get_text().encode('utf-8')

# when run from the command line, __name__ equals '__main__'
# when imported by another module, __name__ equals 'parse_html'
if __name__ == '__main__':
    start_parse(archives_url)
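If the script is saved as parse_html.py (the module name assumed in its final comments), running python parse_html.py fetches the archive page and writes one article title per line to archives.txt.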