Python modules--beautifulsoup4 and lxml


BeautifulSoup4 and lxml

These two libraries are mainly used for parsing HTML/XML documents. BeautifulSoup makes parsing HTML relatively simple: its API is very user-friendly, it supports CSS selectors, and it can work with the HTML parser from the Python standard library as well as the lxml XML parser. Examples of BeautifulSoup and lxml are described below.
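As a quick illustration of the parser choice mentioned above, here is a minimal sketch (the HTML snippet is made up) that builds the same soup with the standard library's html.parser and with lxml; the 'lxml' parser requires the lxml package to be installed.

from bs4 import BeautifulSoup

html = '<html><body><p class="intro">Hello</p></body></html>'  # made-up snippet

# standard-library parser: no extra dependency needed
soup_std = BeautifulSoup(html, 'html.parser')

# lxml parser: generally faster, requires `pip install lxml`
soup_lxml = BeautifulSoup(html, 'lxml')

print(soup_std.p.text)   # Hello
print(soup_lxml.p.text)  # Hello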

First, the BeautifulSoup4 library:

Install: pip install beautifulsoup4. If you leave out the 4, the old beautifulsoup (version 3) will be installed by default.
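A quick way to confirm which version ended up installed is to import the package and print its version attribute; this is a minimal sketch, assuming beautifulsoup4 was installed with pip as above.

import bs4

# bs4 is the import name of the beautifulsoup4 package
print(bs4.__version__)  # should print a 4.x version string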

Data structures: BeautifulSoup transforms a complex HTML document into a tree structure, each node of which is a Python object. All objects can be summed up into 4 kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag: the tags we use when writing pages (e.g. the <a> hyperlink tag).

NavigableString: simply put, a string that can be traversed.
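To make the four object kinds concrete, here is a minimal sketch over a made-up HTML snippet that produces one object of each type; the snippet and variable names are just for illustration.

from bs4 import BeautifulSoup

html = '<p class="title"><b>Hello</b><!-- a comment --></p>'  # made-up snippet
soup = BeautifulSoup(html, 'lxml')

tag = soup.b                   # Tag
text = soup.b.string           # NavigableString ('Hello')
comment = soup.p.contents[1]   # Comment (a subclass of NavigableString)

print(type(soup))     # <class 'bs4.BeautifulSoup'>
print(type(tag))      # <class 'bs4.element.Tag'>
print(type(text))     # <class 'bs4.element.NavigableString'>
print(type(comment))  # <class 'bs4.element.Comment'>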

Searching the document:

Get the Web page source code using the requests library:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com/s?wd=python'
headers = {
    'user-agent': 'mozilla/5.0 (Windows NT 10.0; WOW64) applewebkit/537.36 '
                  '(khtml, like Gecko) chrome/64.0.3282.140 safari/537.36',
}
req = requests.session()
response = req.get(url, headers=headers, verify=False)
html_test = response.text

html_test holds the page source code. It does not include content rendered by JavaScript, so it may not be exactly the same as what the browser shows!

Before you can parse the contents of a document, you must first create a BeautifulSoup object from it. As shown below, it is of type <class 'bs4.BeautifulSoup'>.

soup = BeautifulSoup(html_test, 'lxml')
print(soup, type(soup))

Getting a tag: soup.<tag name> matches the first one; it returns the first occurrence of that tag.

print(soup.span)

Getting tag attributes:

print(type(soup.a))
print(soup.a['id'])      # raises an error if the attribute does not exist
print(soup.a.attrs)      # output the tag's attributes and values
print(soup.a.get('id'))  # get() is recommended for fetching attributes; returns None if absent
The output of the code:
    1. <class 'bs4.element.Tag'>
    2. result_logo
    3. {'href': '/', 'id': 'result_logo', 'onmousedown': "return c({'fm':'tab','tab':'logo'})"}
    4. None

Getting the contents of the document: after getting a tag (or the soup itself), there are several different ways to get the contents of the tag, as follows:

strings: adding .strings directly returns a generator, which has no next() method (in Python 3), so it is queried as follows:

a = soup.div.strings
a.__next__()

After execution, you can call a.__next__() again; each call returns the next piece of text content, but most of the time this is cumbersome.
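Instead of calling __next__() by hand, the generator can be consumed with a for loop or list(); the sketch below assumes the page has at least one <div> tag and also shows the whitespace-stripped variant .stripped_strings.

# iterate over the generator directly
for text in soup.div.strings:
    print(repr(text))

# or collect everything at once; .stripped_strings skips pure-whitespace strings
all_text = list(soup.div.stripped_strings)
print(all_text)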

Getting text content with the search methods find() and find_all():

string: soup.find_all('p') gets all p tags and returns a list; soup.find('p') returns only the first one, of type 'bs4.element.Tag'.

print(soup.find_all('p')[1])
print(soup.find_all('i', class_='c-icon-lidot'))  # limit by the class attribute

The string can only be a tag name, not anything else; otherwise find_all() returns an empty list and find() returns None.

The elements of the list returned by find_all() are also of type 'bs4.element.Tag'!
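The following minimal sketch demonstrates the behaviour described above; the tag name 'no-such-tag' is made up, and the last two lines assume the page contains at least one <p> tag.

print(soup.find_all('no-such-tag'))  # -> [] (an empty ResultSet)
print(soup.find('no-such-tag'))      # -> None

first_p = soup.find_all('p')[0]
print(type(first_p))                 # -> <class 'bs4.element.Tag'>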

Regular expressions: you need to import re and then use re.compile() to create a pattern object from a string containing the regular expression.

import re

for i in soup.find_all(re.compile('span')):
    print(i.text)  # prints all text content within the span tags

The elements of soup.find_all(re.compile('span')) are still of type 'bs4.element.Tag'!

List: the find_all method can also accept a list parameter, and BeautifulSoup will return content that matches any of the elements in the list.

print(soup.find_all(['i', 'a']))  # get all i tags and a tags

The returned data type is bs4.element.ResultSet, which is similar to a list and can be indexed and iterated.
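A short sketch of working with the ResultSet, assuming the page contains at least one <i> or <a> tag:

result = soup.find_all(['i', 'a'])
print(type(result))     # <class 'bs4.element.ResultSet'>
print(result[0])        # indexing works like a list
for tag in result[:3]:  # so does slicing and iteration
    print(tag.name)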

Method (passing a function): if there is no suitable built-in filter, we can also define our own method; the method accepts only one element as its parameter.

def has_class_and_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

for tag in soup.find_all(has_class_and_no_id):
    print(tag)

# the data type returned by soup.find_all(has_class_and_no_id) is 'bs4.element.ResultSet'
# as above, it behaves like a list and can be indexed

Getting content with select(): CSS selectors. As when writing CSS, tag names take no prefix, class names are prefixed with a dot (.), and IDs are prefixed with a hash (#). The return value is a list.

Tag name lookup: soup.select('h3 a') takes the a tags under h3 tags (at any depth); soup.select('h3 > a') is similar but matches only direct children.

for i in soup.select('h3 a'):
    # .text returns a str when fetching the content
    result_1 = i.text
    # get_text() also returns a str when fetching the content
    result_2 = i.get_text()
    # .string returns a NavigableString; when there is no single string it returns None
    result_3 = i.string
    # .strings returns a generator; if there is no content it simply yields nothing
    result_4 = i.strings
    print(result_1, type(result_1))
    print(result_2, type(result_2))
    print(result_3, type(result_3))
    print(result_4, type(result_4))

Class name lookup or ID lookup: soup.select('.c-gap-left-small'), soup.select('#content_bottom')

Combined lookup: soup.select('a.c-gap-left-small')
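To tie the selector forms together, here is a self-contained sketch over a made-up HTML snippet (the class and ID names mirror the examples above, but the markup itself is invented):

from bs4 import BeautifulSoup

html = '''
<div id="content_bottom">
  <h3><a class="c-gap-left-small" href="/x">link 1</a></h3>
  <h3><span><a href="/y">link 2</a></span></h3>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.select('h3 a'))                # tag name lookup, any depth: both links
print(soup.select('h3 > a'))              # direct children only: just link 1
print(soup.select('.c-gap-left-small'))   # class name lookup
print(soup.select('#content_bottom'))     # id lookup
print(soup.select('a.c-gap-left-small'))  # combined lookup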
