BeautifulSoup4 and lxml
Both libraries are mainly used to parse HTML/XML documents. BeautifulSoup makes parsing HTML relatively simple: its API is user-friendly, and it supports CSS selectors,
the HTML parser in the Python standard library, and the lxml parser. Examples of BeautifulSoup and lxml are described below:
First, the BeautifulSoup4 library:
Install: pip install beautifulsoup4. If you leave off the 4, pip installs the legacy BeautifulSoup 3 package instead.
Object types: Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. All objects
fall into 4 kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag: a tag as written in the page (e.g. the <a> hyperlink tag)
NavigableString: simply a string that can be traversed
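The four object kinds can be seen side by side in a small sketch. The HTML snippet and variable names below are made up for illustration, and the standard-library 'html.parser' is used so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical HTML, invented for this illustration.
html = '<p><a href="/">Home</a><!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.a              # a Tag
text = link.string         # the NavigableString inside the tag
note = soup.p.contents[1]  # the Comment that follows the <a> tag

for obj in (soup, link, text, note):
    print(type(obj).__name__, repr(obj))
```

Comment is a subclass of NavigableString, so code that checks for plain text may also need to exclude comments explicitly.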
Search for documents:
Get the Web page source code using the requests library:
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.baidu.com/s?wd=python'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    }
    req = requests.session()
    response = req.get(url, headers=headers, verify=False)
    html_test = response.text
html_test holds the page source. It does not include content generated by JavaScript, so it may not match exactly what the browser shows!
Before you can parse the contents of a document, you must first create a BeautifulSoup object. As follows, it is of type <class 'bs4.BeautifulSoup'>:
    soup = BeautifulSoup(html_test, 'lxml')
    print(soup, type(soup))
Get a Tag: soup.<tag name> returns the first occurrence of that tag in the document.
    print(soup.span)
Get tag attributes:
    print(type(soup.a))
    print(soup.a['id'])      # raises KeyError if the attribute does not exist
    print(soup.a.attrs)      # outputs the tag's attributes and their values
    print(soup.a.get('id'))  # get() is recommended: it returns None if the attribute is missing
The output of the code:
- <class 'bs4.element.Tag'>
- result_logo
- {'href': '/', 'id': 'result_logo', 'onmousedown': "return c({'fm':'tab','tab':'logo'})"}
- None
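The difference between indexing a tag and calling get() can be checked with a minimal sketch. The one-line HTML and the 'title' attribute name are assumptions chosen for illustration, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> tag with id and href but no title attribute.
html = '<a id="result_logo" href="/">logo</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

logo_id = a['id']                      # existing attribute: plain lookup works
missing = a.get('title')               # missing attribute: get() returns None
fallback = a.get('title', 'no title')  # get() also accepts a default value

try:
    a['title']                         # missing attribute: indexing raises KeyError
    raised = False
except KeyError:
    raised = True

print(logo_id, missing, fallback, raised)
print(a.attrs)                         # all attributes as a dict
```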
Get the text content: once you have a tag (or the soup itself), there are several different ways to get its contents, as follows:
strings: appending .strings returns a generator. A generator has no next() method in Python 3, so the author calls __next__() instead (the built-in next(a) also works):
    a = soup.div.strings
    a.__next__()
Each further call to a.__next__() returns the next piece of text, but most of the time this is cumbersome.
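A less cumbersome way to consume the generator is to loop over it or turn it into a list. The sketch below assumes a made-up <div> and uses 'html.parser'; stripped_strings is the Beautiful Soup variant that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with some padding whitespace around the first string.
html = '<div><p> first </p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

gen = soup.div.strings
first = next(gen)                         # same effect as gen.__next__()

raw = list(soup.div.strings)              # every string, whitespace kept
clean = list(soup.div.stripped_strings)   # same strings, whitespace trimmed

for s in soup.div.stripped_strings:       # a for loop drives the generator for you
    print(s)
```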
Search-based methods: use find() and find_all() to locate tags and then get their text content:
String: soup.find_all('p') gets all p tags and returns a list; soup.find('p') returns only the first one, of type 'bs4.element.Tag'.
    print(soup.find_all('p')[1])
    print(soup.find_all('i', class_='c-icon-lidot'))  # restrict by the class attribute
The string must be a tag name. For anything else, find_all() returns an empty list and find() returns None.
Each element of the list returned by find_all() is also of type 'bs4.element.Tag'!
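The behaviour of find() and find_all(), including the no-match case, can be sketched on a small document. The HTML is invented for illustration (only the class name c-icon-lidot comes from the text above), and 'html.parser' replaces 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<div><p class="intro">one</p><p>two</p><i class="c-icon-lidot"></i></div>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')                           # first match, a Tag
all_p = soup.find_all('p')                         # every match, list-like
icons = soup.find_all('i', class_='c-icon-lidot')  # filter on the class attribute

nothing = soup.find('table')                       # no match: None
empty = soup.find_all('table')                     # no match: empty result

print(first_p.text, [p.text for p in all_p], len(icons), nothing, len(empty))
```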
Regular expression: import re, then use re.compile() to create a pattern object from a string containing the regular expression.
    import re
    for i in soup.find_all(re.compile('span')):
        print(i.text)  # prints all text content inside each matching tag
The elements of soup.find_all(re.compile('span')) are still of type 'bs4.element.Tag'!
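A compiled pattern is matched against tag names, and it matches anywhere in the name unless anchored. The sketch below uses made-up HTML and 'html.parser' to show an anchored prefix pattern next to an exact-name pattern.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<body><b>bold</b><span>s1</span><span>s2</span></body>'
soup = BeautifulSoup(html, 'html.parser')

# '^b' keeps every tag whose name starts with "b": here <body> and <b>.
b_names = [tag.name for tag in soup.find_all(re.compile('^b'))]

# '^span$' keeps only tags named exactly "span".
span_texts = [tag.text for tag in soup.find_all(re.compile('^span$'))]

print(b_names, span_texts)
```

If other tag names in a page happen to contain "span", anchoring the pattern as above avoids picking them up.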
List: find_all() also accepts a list, and BeautifulSoup returns every tag that matches any element of the list.
    print(soup.find_all(['i', 'a']))  # get all i tags and a tags
The returned data type is bs4.element.ResultSet, which behaves like a list: it can be indexed, and its elements are in document order.
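The list filter and the list-like result can be checked with a short sketch, assuming an invented paragraph of mixed tags and 'html.parser' as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<p><i>it</i><a href="/">link</a><b>bold</b><i>again</i></p>'
soup = BeautifulSoup(html, 'html.parser')

found = soup.find_all(['i', 'a'])     # tags named i OR a
names = [tag.name for tag in found]   # order follows the document

print(type(found).__name__, names, found[1].text)
```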
Function: if no built-in filter fits, you can define your own function. It takes a single argument, the tag, and returns True for tags that should be kept.
    def has_class_and_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    for tag in soup.find_all(has_class_and_no_id):
        print(tag)
    # the data type returned by soup.find_all(has_class_and_no_id) is 'bs4.element.ResultSet'
    # as above, it behaves like a list and can be indexed
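To see which tags such a filter keeps, the same function can be run on a small document. The HTML below is made up for illustration, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with class and id, one with class only, one with neither.
html = '<div class="box" id="main"><p class="note">keep</p><p>skip</p></div>'
soup = BeautifulSoup(html, 'html.parser')


def has_class_and_no_id(tag):
    """Return True for tags that have a class attribute but no id attribute."""
    return tag.has_attr('class') and not tag.has_attr('id')


matches = soup.find_all(has_class_and_no_id)
print([tag.text for tag in matches])
```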
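Searching with select(): select() takes a CSS selector. As in CSS, a tag name is written bare, a class name is prefixed with ".", and an id is prefixed with "#". The return value is a list.
Tag name lookup: soup.select('h3 a') takes every a tag anywhere under an h3 tag. soup.select('h3 > a') is stricter and takes only a tags that are direct children of an h3, so the two give the same result only when no a tag is nested deeper.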
    for i in soup.select('h3 a'):
        # .text returns a str
        result_1 = i.text
        # get_text() also returns a str
        result_2 = i.get_text()
        # .string returns a NavigableString, or None if the tag is empty or has more than one child
        result_3 = i.string
        # .strings always returns a generator; it yields nothing if the tag has no text
        result_4 = i.strings
        print(result_1, type(result_1))
        print(result_2, type(result_2))
        print(result_3, type(result_3))
        print(result_4, type(result_4))
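The four accessors differ most visibly when a tag contains nested markup. The following sketch assumes two invented links, one with a nested <b> tag and one with plain text, parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link has two children, the second has one string.
html = '<h3><a href="/x">Hello <b>world</b></a></h3><h3><a href="/y">plain</a></h3>'
soup = BeautifulSoup(html, 'html.parser')

nested, simple = soup.select('h3 a')

nested_text = nested.text               # all text joined into one str
nested_get_text = nested.get_text()     # same result as .text
nested_string = nested.string           # None: more than one child
nested_strings = list(nested.strings)   # each piece of text separately

simple_string = simple.string           # NavigableString: exactly one string child

print(nested_text, nested_string, nested_strings, simple_string)
```

When a tag may contain nested markup, .text or get_text() is usually the safer choice, since .string can silently be None.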
Class name or id lookup: soup.select('.c-gap-left-small'), soup.select('#content_bottom')
Combined lookup: soup.select('a.c-gap-left-small')
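The selector forms above can be compared on one small document. The markup is invented for illustration (only the class and id names come from the text), and 'html.parser' is used as the parser. It also puts the descendant selector 'h3 a' next to the child selector 'h3 > a'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '''
<div id="content_bottom">
  <h3><a class="c-gap-left-small" href="/1">direct</a></h3>
  <h3><span><a href="/2">nested</a></span></h3>
  <span class="c-gap-left-small">not a link</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

descendants = [a.text for a in soup.select('h3 a')]    # any depth under <h3>
children = [a.text for a in soup.select('h3 > a')]     # direct children only
by_class = [t.name for t in soup.select('.c-gap-left-small')]
by_id = [t.name for t in soup.select('#content_bottom')]
combined = [t.text for t in soup.select('a.c-gap-left-small')]  # tag AND class

print(descendants, children, by_class, by_id, combined)
```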