Crawlers that use regular expressions sometimes hang. That is, a regular expression can fall into a dead loop (catastrophic backtracking) and never finish matching.
Example: https://social.msdn.microsoft.com/forums/azure/en-us/3f4390ac-11eb-4d67-b946-a73ffb51e4f3/netcpu100
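To illustrate the problem (a minimal sketch; the pattern and input below are made up for demonstration, not taken from the linked thread), a nested quantifier like `(a+)+` forces Python's `re` engine to try exponentially many ways to split the input before it can report a failed match:

```python
import re
import time

# A pattern with a nested quantifier: each extra 'a' roughly doubles
# the number of ways the engine can partition the string.
pattern = re.compile(r'(a+)+$')

# 'aaaa...ab' can never match (the trailing 'b' breaks '$'),
# but the engine backtracks through every partition before giving up.
text = 'a' * 20 + 'b'

start = time.perf_counter()
result = pattern.match(text)
elapsed = time.perf_counter() - start

print(result)   # None: no match, but only after heavy backtracking
print(elapsed)  # grows roughly exponentially with the run of 'a's
```

Lengthen the run of `'a'`s a little and the match can take minutes; on a real page with messy HTML this is how a crawler ends up stuck at 100% CPU.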
So when parsing web pages, you can use the BeautifulSoup library instead of regular expressions.
The explanations of BeautifulSoup online are overly complicated.
I just pick out the parts I need; the rest can be learned when it is actually needed, so there is no point wasting time on it.
At least it saves a lot of trouble.
The explanations are in the comments.
Print out any of the variables and you'll see what it holds.
    #!/usr/bin/python3.4
    # -*- coding: utf-8 -*-
    import urllib.request
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        url = "http://www.lenggirl.com/"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            # 'Accept-Encoding': 'gzip',  # left out: urllib does not decompress gzip automatically
            'Connection': 'close',
            'Referer': None
        }
        # Build a Request so the headers are actually sent
        request = urllib.request.Request(url, headers=headers)
        data = urllib.request.urlopen(request).read()
        # Other decodings: ('utf-8') ('unicode_escape') ('gbk', 'ignore')
        data = data.decode('utf-8', 'ignore')
        # Initialize the page
        soup = BeautifulSoup(data, "html.parser")
        # The whole page, pretty-printed
        html = soup.prettify()
        # <head>...</head>
        head = soup.head
        # <body>...</body>
        body = soup.body
        # The first <p>...</p>
        p = soup.p
        # The text content of that <p>
        p_string = soup.p.string
        # soup.p.contents[0] is '2016'
        # soup.p.contents is ['2016\n']
        p_string = soup.p.contents[0]
        # Walk over every direct child of <body>
        for child in soup.body.children:
            # print(child)
            pass
        # All <a>...</a> and <p>...</p> tags
        a_and_p = soup.find_all(["a", "p"])
        # Every URL inside an <a>...</a>
        for myimg in soup.find_all('a'):
            img_src = myimg.get('href')
            # print(img_src)
        # The src of the <img> inside each <a class="a">...</a>
        for myimg in soup.find_all('a', class_='a'):
            img_src = myimg.find('img').get('src')
        # All page information
        # print(html)
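To try these accessors without hitting the network, the same calls can be run on an inline HTML string (the toy markup below is my own stand-in, not the page fetched above):

```python
from bs4 import BeautifulSoup

# Toy markup standing in for a downloaded page
html_doc = """
<html><head><title>demo</title></head>
<body>
<p>2016</p>
<a class="a" href="http://example.com/1"><img src="pic1.png"/></a>
<a href="http://example.com/2">plain link</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.p.string)       # '2016' - text of the first <p>
print(soup.p.contents[0])  # same value via the contents list

# Every href in the document
links = [a.get('href') for a in soup.find_all('a')]
print(links)               # ['http://example.com/1', 'http://example.com/2']

# src of the <img> inside <a class="a">
srcs = [a.find('img').get('src') for a in soup.find_all('a', class_='a')]
print(srcs)                # ['pic1.png']
```

The list comprehensions do the same thing as the `for` loops in the full script, just collecting the results instead of discarding them.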
BeautifulSoup for Python crawlers