Python crawler (14): the BeautifulSoup4 parser

Source: Internet
Author: User

CSS selectors: BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is to parse HTML/XML documents and extract data from them.

lxml traverses only locally, whereas Beautiful Soup is based on the HTML DOM: it loads the entire document and parses the whole DOM tree. Its time and memory overhead are therefore much larger, and its performance is lower than lxml's.
Beautiful Soup is simple to use for parsing HTML and has a very user-friendly API. It supports CSS selectors and the HTML parser in the Python standard library, and it also supports lxml's XML parser.
Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

| Scraping tool | Speed | Ease of use | Installation difficulty |
|---|---|---|---|
| Regular expressions | Fastest | Difficult | None (built-in) |
| BeautifulSoup | Slow | Simplest | Simple |
| lxml | Fast | Simple | Moderate |

Example:

First import the bs4 library:

    # 07-urllib2_beautifulsoup_prettify.py
    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a Beautiful Soup object
    soup = BeautifulSoup(html)

    # Or open a local HTML file to create the object
    # soup = BeautifulSoup(open('index.html'))

    # Pretty-print the contents of the soup object
    print(soup.prettify())

Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

  • If we run this under IPython 2, we see a warning that no parser was explicitly specified.

  • This means that when we do not specify a parser, Beautiful Soup defaults to the best available HTML parser on this system ("lxml"). If you run the code on another system, or in a different virtual environment, a different parser may be chosen and the behavior may differ.
  • We can avoid the warning by naming the parser explicitly: soup = BeautifulSoup(html, "lxml")
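
Passing the parser as the second argument can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes the standard-library parser "html.parser" instead of "lxml" so that it runs without lxml installed, and the short html string is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
html = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser makes the result independent of what else is installed.
# "html.parser" ships with Python; "lxml" would need the lxml package.
soup = BeautifulSoup(html, "html.parser")

print(soup.p.b.string)  # The Dormouse's story
```

Different parsers can build slightly different trees from malformed markup, which is why pinning one is safer than relying on the default.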
Four types of objects

Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:

    • Tag
    • NavigableString
    • BeautifulSoup
    • Comment
1. Tag

Put simply, a Tag is a tag in the HTML, for example:

    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The title, head, a, and p tags above, together with the content they enclose, are Tags. Let's use Beautiful Soup to get them:

    # -*- coding: utf-8 -*-
    # 08-urllib2_beautifulsoup_tag.py
    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a Beautiful Soup object, naming the parser explicitly
    soup = BeautifulSoup(html, "lxml")

    print(soup.title)
    # <title>The Dormouse's story</title>

    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

    print(soup.p)
    # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

    print(type(soup.p))
    # <class 'bs4.element.Tag'>

We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the document. Querying all matching tags is covered later.

A Tag has two important properties: name and attrs.

    print(soup.name)
    # [document]
    # The soup object itself is special: its name is [document]

    print(soup.head.name)
    # head
    # For any other tag, the output is the name of the tag itself

    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}
    # All attributes of the p tag, returned as a dictionary

    print(soup.p['class'])  # equivalent to soup.p.get('class')
    # ['title']

    # Attributes and content can be modified
    soup.p['class'] = 'newClass'
    print(soup.p)
    # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

    # An attribute can also be deleted
    del soup.p['class']
    print(soup.p)
    # <p name="dromouse"><b>The Dormouse's story</b></p>
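
One practical difference between the two access styles is how they treat an attribute that is not present. The sketch below assumes a made-up one-tag document and the standard-library "html.parser"; it is only an illustration of attribute access on a Tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; "html.parser" is assumed so no extra package is needed.
soup = BeautifulSoup('<p class="title" name="dromouse"><b>text</b></p>', "html.parser")
p = soup.p

print(p.name)       # p
print(p["class"])   # ['title']  (class can hold several values, so it is a list)
print(p["name"])    # dromouse
print(p.get("id"))  # None: .get() tolerates a missing attribute

try:
    p["id"]         # index access on a missing attribute raises
except KeyError:
    print("p['id'] raises KeyError")
```

Using .get() is the safer choice when scraping pages where an attribute may be absent on some tags.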

2. NavigableString

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

    print(soup.p.string)
    # The Dormouse's story

    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>

3. BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object; it is a special Tag, and we can inspect its type, name, and attributes to get a feel for it.

    print(type(soup.name))
    # <type 'unicode'>  (Python 2; <class 'str'> on Python 3)

    print(soup.name)
    # [document]

    print(soup.attrs)  # the document itself has no attributes
    # {}

4. Comment

A Comment object is a special type of NavigableString whose output does not include the comment markers.

    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

    print(soup.a.string)
    # Elsie

    print(type(soup.a.string))
    # <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we output it with .string, the comment markers are stripped.
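
Because .string hides the comment markers, a comment can be mistaken for ordinary text. One way to tell them apart is to check the type before using the value. The sketch below assumes a made-up two-link document and the standard-library "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: the first link holds a comment, the second plain text.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    # Comment is a subclass of NavigableString, so isinstance separates the two.
    kind = "comment" if isinstance(a.string, Comment) else "text"
    print(a["id"], kind, a.string.strip())
# link1 comment Elsie
# link2 text Lacie
```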

Traversing the document tree

1. Direct child nodes: the .contents and .children properties

.contents

A Tag's .contents property returns the tag's child nodes as a list.

    print(soup.head.contents)
    # [<title>The Dormouse's story</title>]

The output is a list, so we can use a list index to get one of its elements:

    print(soup.head.contents[0])
    # <title>The Dormouse's story</title>

.children

It does not return a list, but we can iterate over it to get all the child nodes.
If we print .children, we can see that it is a list iterator object.

    print(soup.head.children)
    # <listiterator object at 0x7f71457f5710>  (Python 2 repr)

    for child in soup.body.children:
        print(child)

Results:

    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
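
The relationship between the two properties can be checked directly: .contents is already a list, while .children has to be consumed to see its items. A minimal sketch, assuming a made-up head fragment and the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

print(type(head.contents).__name__)  # list
children = list(head.children)       # materialize the iterator into a list
print(children == head.contents)     # True: same nodes, in the same order
print(children[0].name)              # title
```

Iterating over .children avoids building a second list when the nodes are only visited once.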

2. All descendant nodes: the .descendants property

The .contents and .children properties cover only a tag's direct children. The .descendants property recursively iterates over all of a tag's descendants; as with .children, we need to loop over it to get the content.

    for child in soup.descendants:
        print(child)
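
The difference between direct children and all descendants shows up as soon as tags are nested. The sketch below counts both for a made-up paragraph, assuming the standard-library "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: p has two child tags and one text node between them.
soup = BeautifulSoup("<p><b>bold</b> and <i>italic</i></p>", "html.parser")
p = soup.p

# Direct children: <b>, " and ", <i>
print(len(p.contents))           # 3

# Descendants also include the strings inside <b> and <i>:
# <b>, "bold", " and ", <i>, "italic"
print(len(list(p.descendants)))  # 5
```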

3. Node content: the .string property

If a tag has only one child node and that node is a NavigableString, .string returns that child. If a tag's only child is another tag, .string returns the same result as that child's .string.
Put simply: if a tag contains no other tags, .string returns the tag's text. If a tag contains exactly one tag, .string returns the innermost text. For example:

    print(soup.head.string)
    # The Dormouse's story

    print(soup.title.string)
    # The Dormouse's story
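
When a tag has more than one child, .string has nothing unambiguous to return and yields None. The sketch below shows that case on a made-up fragment (parsed with the standard-library "html.parser", an assumption of this example) and uses get_text() as one way to collect all the text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: div has a single child tag, ul has two.
html = "<div><p>only child</p></div><ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")

print(soup.div.string)     # only child  (passes through the single <p>)
print(soup.ul.string)      # None        (two children, so no single string)
print(soup.ul.get_text())  # ab          (concatenates all strings below ul)
```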

Searching the document tree

1. find_all(name, attrs, recursive, text, **kwargs)

1) The name parameter

The name parameter finds all tags whose name matches; string objects are automatically ignored.

A. Passing a string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup finds the tags whose name matches that string exactly. The following example finds all <b> tags in the document, and then all <a> tags:

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

    print(soup.find_all('a'))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

B. Passing a regular expression
If you pass in a regular expression, Beautiful Soup filters tag names with the regular expression's search() method. The example below finds all tags whose names start with b, which means the <body> and <b> tags are found.

    import re

    for tag in soup.find_all(re.compile('^b')):
        print(tag.name)
    # body
    # b
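
The anchor ^ matters in the pattern above. A short sketch on a made-up document (parsed with the standard-library "html.parser", an assumption of this example) compares an anchored pattern with an unanchored one:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal document with html, head, title, body and b tags.
html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only names that start with "b".
print([t.name for t in soup.find_all(re.compile("^b"))])  # ['body', 'b']

# Unanchored: any name containing "t" anywhere.
print([t.name for t in soup.find_all(re.compile("t"))])   # ['html', 'title']
```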

C. Passing a list
If you pass in a list, Beautiful Soup returns everything that matches any element of the list. The following code finds all <a> and <b> tags in the document:

    soup.find_all(['a', 'b'])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2) Keyword parameters

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
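
Keyword arguments filter on attribute values, and a few variations are worth knowing. The sketch below is an illustration on a made-up fragment, assuming the standard-library "html.parser": class_ stands in for the reserved word class, True matches any value, and the attrs dictionary covers attribute names that are not valid Python identifiers.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the data-role attribute is invented for the example.
html = (
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '<a class="other" id="link3" data-role="youngest">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="sister")                 # class is a Python keyword
with_href = soup.find_all(href=True)                      # any tag that has an href
by_data = soup.find_all(attrs={"data-role": "youngest"})  # name with a hyphen

print([a["id"] for a in by_class])   # ['link2']
print([a["id"] for a in with_href])  # ['link2']
print([a["id"] for a in by_data])    # ['link3']
```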

3) The text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list.

    soup.find_all(text='Elsie')
    # [u'Elsie']  (when link1 contains the plain text "Elsie")

    soup.find_all(text=['Tillie', 'Elsie', 'Lacie'])
    # [u'Elsie', u'Lacie', u'Tillie']

    soup.find_all(text=re.compile("Dormouse"))
    # [u"The Dormouse's story", u"The Dormouse's story"]
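
A string search returns the strings themselves, not the tags that contain them; the enclosing tag is reachable through .parent. The sketch below assumes Beautiful Soup 4.4.0 or later, where the same filter is also accepted under the name string, and uses a made-up fragment parsed with the standard-library "html.parser".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
html = '<p class="title"><b>The Dormouse\'s story</b></p><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# Regular expression filter on string content (string= is the newer name for text=).
matches = soup.find_all(string=re.compile("Dormouse"))
print(len(matches))            # 1
print(matches[0].parent.name)  # b

# Exact string filter; .parent leads back to the enclosing tag.
exact = soup.find_all(string="Lacie")
print(exact[0].parent["id"])   # link2
```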

CSS selectors

This is another way of searching, and it works much like the find_all method.

    • When writing CSS, a tag name is written without any prefix, a class name is prefixed with ., and an id name is prefixed with #
    • Here we can filter elements in a similar way, using the soup.select() method, whose return type is list
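
The selector forms listed above can be sketched as follows. This is an illustration on a made-up fragment, assuming the standard-library "html.parser" and a Beautiful Soup installation with CSS selector support; the attribute and child selectors at the end go slightly beyond the three forms named in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the document used earlier.
html = """<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a class="sister" id="link3" href="http://example.com/tillie">Tillie</a>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("b")[0].string)                   # by tag name: The Dormouse's story
print([a["id"] for a in soup.select(".sister")])    # by class: ['link2', 'link3']
print(soup.select("#link2")[0].string)              # by id: Lacie

# Combined forms: attribute suffix match and direct-child combinator.
print(soup.select('a[href$="tillie"]')[0]["id"])    # link3
print(soup.select("p > b")[0].string)               # The Dormouse's story
```

Since select() always returns a list, an index such as [0] is needed even when only one element matches.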
