Python modules--beautifulsoup4 and lxml

Last Update:2018-03-02 Source: Internet

Author: User



BeautifulSoup4 and lxml

These two libraries are mainly used for parsing HTML/XML documents. BeautifulSoup makes parsing HTML relatively simple: its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and the lxml XML parser. Examples of BeautifulSoup and lxml are described below:

First, the beautifulsoup4 library:

Install: pip install beautifulsoup4. If you leave off the 4, pip installs the older BeautifulSoup 3 instead.

Object types: Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag: the same tags we use when writing a page (e.g. the <a> hyperlink tag)

NavigableString: simply put, a string that can be traversed
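As a rough illustration of the four kinds of object, the sketch below parses a tiny document and prints the type of each node. The markup in sample_html is a made-up example, and the standard-library "html.parser" is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from the article.
sample_html = '<p><a href="/">Home</a><!-- a note --></p>'

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

link = soup.a                # a Tag
link_text = soup.a.string    # the NavigableString inside the tag
note = soup.p.contents[1]    # the Comment that follows the <a> tag

for node in (soup, link, link_text, note):
    print(type(node).__name__)
```

Comment is a special kind of NavigableString, so code that handles strings will also see comments unless it checks for them.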

Search for documents:

Get the Web page source code using the requests library:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.baidu.com/s?wd=python'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    }
    req = requests.session()
    # verify=False turns off TLS certificate verification
    response = req.get(url, headers=headers, verify=False)
    html_test = response.text

html_test holds the source code of the web page. It does not include content generated by JavaScript, so it may not be exactly the same as the page shown in a browser!

Before you can parse the contents of a document, you must first create a BeautifulSoup object, as follows. Its type is <class 'bs4.BeautifulSoup'>.

    soup = BeautifulSoup(html_test, 'lxml')
    print(soup, type(soup))

Getting a tag: soup.<tag name> returns the first occurrence of that tag in the document.

    print(soup.span)
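A minimal offline sketch of the same two steps, building the soup and reading a tag by name, is given here. The sample_html string is a made-up example and "html.parser" is an assumption chosen to avoid the lxml dependency. It also shows what happens when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = "<div><span>first</span><span>second</span></div>"

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

first_span = soup.span   # only the first <span> is returned
missing = soup.table     # there is no <table>, so this is None

print(first_span)
print(missing)
```

Because a missing tag yields None rather than an exception, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None first.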

Getting tag attributes:

    print(type(soup.a))
    print(soup.a['id'])      # raises an error if the tag has no such attribute
    print(soup.a.attrs)      # the tag's attributes and their values
    print(soup.a.get('id'))  # get() is the recommended way to read an attribute; it returns None if the attribute is missing
The result of running the code:

    <class 'bs4.element.Tag'>
    result_logo
    {'href': '/', 'id': 'result_logo', 'onmousedown': "return c({'fm':'tab','tab':'logo'})"}
    None
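The difference between indexing and get() can be sketched on a small made-up tag. The markup and the "title" attribute name are illustrative assumptions, and "html.parser" again replaces lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = '<a id="result_logo" class="logo big" href="/">Home</a>'

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")
link = soup.a

link_id = link["id"]              # indexing works when the attribute exists
link_classes = link["class"]      # class is multi-valued, so this is a list
missing_get = link.get("title")   # get() returns None for a missing attribute

try:
    link["title"]                 # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(link_id, link_classes, missing_get, raised)
```

get() also accepts a default value, for example link.get("title", "n/a"), which can be more convenient than checking for None.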

Getting the contents of the document: once you have a tag (or the soup itself), there are several different ways to get the contents of the tag, as follows:

.strings: appending .strings to a tag returns a generator. The author could not call a .next() method on it, so the following is used instead:

    a = soup.div.strings
    a.__next__()

After this, you can call a.__next__() again; each call returns the next piece of text content. Most of the time, though, this is cumbersome.
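A less cumbersome way to consume the generator is the built-in next() function or a plain loop, as sketched below. The sample markup is made up, "html.parser" is an assumption, and stripped_strings is a related Beautiful Soup attribute that skips whitespace-only strings.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = "<div><p>one</p>\n<p>two</p></div>"

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

gen = soup.div.strings
first = next(gen)    # the built-in next() works on any generator

# Collect every string, including the newline between the paragraphs.
all_strings = list(soup.div.strings)

# stripped_strings trims whitespace and drops strings that end up empty.
clean_strings = list(soup.div.stripped_strings)

print(repr(first), all_strings, clean_strings)
```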

Searching with find() and find_all() to get content:

string: soup.find_all('p') gets all <p> tags and returns a list; soup.find('p') returns only the first one, of type 'bs4.element.Tag'.

    print(soup.find_all('p')[1])
    print(soup.find_all('i', class_='c-icon-lidot'))  # restrict by the class attribute

The string must be a tag name and nothing else; otherwise find_all() returns an empty list and find() returns None.

Each element of the list returned by find_all() is also of type 'bs4.element.Tag'!
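The behaviour of find() and find_all() with a string filter can be checked on a small made-up document. The markup is illustrative (it borrows the class name c-icon-lidot from the example above), and "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = (
    "<p>intro</p>"
    '<p><i class="c-icon-lidot">dot</i><i class="other">x</i></p>'
)

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = soup.find_all("p")                    # every <p>, as a list
first_p = soup.find("p")                           # only the first <p>
icons = soup.find_all("i", class_="c-icon-lidot")  # filtered by class
no_tables = soup.find_all("table")                 # nothing matches: empty list
no_table = soup.find("table")                      # nothing matches: None

print(len(paragraphs), first_p, icons, no_tables, no_table)
```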

Regular expressions: import re, then use re.compile() to create a pattern object from a string containing the regular expression.

    import re
    for i in soup.find_all(re.compile('span')):
        print(i.text)  # the text content of each matched span tag

The elements of soup.find_all(re.compile('span')) are still 'bs4.element.Tag'!
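One detail worth knowing is that Beautiful Soup tests tag names with the pattern's search() method, so an unanchored pattern matches any tag whose name merely contains the text. The sketch below, on made-up markup with "html.parser" as an assumed parser, compares unanchored and anchored patterns.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = "<body><b>bold</b><span>text</span><table></table></body>"

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

# Unanchored: any tag name containing the letter "b".
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored at the start: tag names beginning with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Anchored at both ends: only the tag named exactly "span".
exactly_span = [tag.name for tag in soup.find_all(re.compile("^span$"))]

print(contains_b, starts_with_b, exactly_span)
```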

list: the find_all() method can also accept a list argument, and BeautifulSoup will return everything that matches any of the elements in the list.

    print(soup.find_all(['i', 'a']))  # get all <i> tags and <a> tags

The returned data type is bs4.element.ResultSet, which is similar to a list: it is ordered and can be indexed.
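A short sketch of the list filter on made-up markup (with "html.parser" as an assumed parser) shows that the matches come back in document order, not grouped by tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = "<p><i>one</i><a href='/'>two</a><b>three</b><i>four</i></p>"

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

matches = soup.find_all(["i", "a"])      # any <i> or <a> tag
names = [tag.name for tag in matches]    # tag names in document order
first_text = matches[0].get_text()       # a ResultSet can be indexed like a list

print(names, first_text)
```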

Method (a function): if no suitable filter exists, we can also define our own function. The function accepts a single element (a tag) as its only parameter.

    def has_class_and_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    for tag in soup.find_all(has_class_and_no_id):
        print(tag)
    # soup.find_all(has_class_and_no_id) returns a 'bs4.element.ResultSet'
    # as above, it behaves like a list: it is ordered and can be indexed
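Any function that takes a tag and returns True or False can serve as a filter. As a further sketch, the hypothetical helper is_external_link below (the name and the sample markup are both illustrative assumptions) keeps only <a> tags whose href starts with "http".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = (
    '<a href="https://example.com">ext</a>'
    '<a href="/local">local</a>'
    "<a>no href</a>"
)

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")


def is_external_link(tag):
    """Hypothetical filter: True for <a> tags whose href starts with 'http'."""
    return tag.name == "a" and tag.get("href", "").startswith("http")


external = soup.find_all(is_external_link)
print(external)
```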

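Getting content with select(): this method takes a CSS selector. As when writing CSS, a tag name takes no prefix, a class name is prefixed with ., and an id is prefixed with #. The return value is a list.

Tag name lookup: soup.select('h3 a') gets every <a> tag nested anywhere under an <h3> tag. soup.select('h3 > a') is stricter and matches only <a> tags that are direct children of an <h3>, so the two give the same result only when every such <a> sits directly inside the <h3>.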

    for i in soup.select('h3 a'):
        # .text returns a str
        result_1 = i.text
        # get_text() returns a str
        result_2 = i.get_text()
        # .string returns a NavigableString, or None if the tag has no single string
        result_3 = i.string
        # .strings returns a generator; it yields nothing if the tag has no text
        result_4 = i.strings
        print(result_1, type(result_1))
        print(result_2, type(result_2))
        print(result_3, type(result_3))
        print(result_4, type(result_4))
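The differences between these accessors show up most clearly on tags with mixed or empty content. The sketch below uses made-up markup with three <a> tags (plain text, text plus a nested tag, and empty) and assumes "html.parser" as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = "<h3><a>plain</a><a>mixed <em>content</em></a><a></a></h3>"

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")
plain, mixed, empty = soup.select("h3 a")

plain_string = plain.string          # a single string child: NavigableString
mixed_string = mixed.string          # several children: None
mixed_text = mixed.get_text()        # all nested text joined into one str
empty_string = empty.string          # no children: None
empty_text = empty.get_text()        # an empty str, never None
empty_strings = list(empty.strings)  # the generator yields nothing

print(plain_string, mixed_string, mixed_text, empty_string, empty_strings)
```

In practice this means get_text() (or .text) is the safer choice when a tag may contain nested markup, while .string is only reliable for tags known to hold a single piece of text.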

Class name lookup or id lookup: soup.select('.c-gap-left-small'), soup.select('#content_bottom')

Combined lookup: soup.select('a.c-gap-left-small')
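The selector forms above can be tried side by side on a small document. In this sketch the markup is made up (it reuses the class and id names from the examples above), and "html.parser" is an assumed stand-in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
sample_html = (
    '<h3><a class="c-gap-left-small">direct</a>'
    "<span><a>nested</a></span></h3>"
    '<div id="content_bottom"><p class="c-gap-left-small">para</p></div>'
)

# Assumption: "html.parser" (standard library) stands in for "lxml" here.
soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: every <a> anywhere inside an <h3>.
descendants = [a.get_text() for a in soup.select("h3 a")]

# Child selector: only <a> tags that are direct children of an <h3>.
children = [a.get_text() for a in soup.select("h3 > a")]

# Class lookup, id lookup, and tag plus class combined.
by_class = [tag.name for tag in soup.select(".c-gap-left-small")]
by_id = soup.select("#content_bottom")
combined = [tag.name for tag in soup.select("a.c-gap-left-small")]

print(descendants, children, by_class, len(by_id), combined)
```

When only the first match is needed, soup.select_one() takes the same selector and returns a single tag or None.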


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
