I. Parse web page information with BeautifulSoup
from bs4 import BeautifulSoup

with open('c:/users/michael/desktop/plan-for-combating-master/week1/1_2/1_2code_of_video/web/new_index.html', 'r') as web_data:
    soup = BeautifulSoup(web_data, 'lxml')
    print(soup)
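As an aside (not part of the original code), if the lxml parser is not installed, BeautifulSoup's built-in 'html.parser' can be used instead, and prettify() prints the parsed tree in indented form. A minimal sketch using the same file path:

from bs4 import BeautifulSoup

# Minimal sketch: fall back to the built-in parser if the lxml package is unavailable
with open('c:/users/michael/desktop/plan-for-combating-master/week1/1_2/1_2code_of_video/web/new_index.html', 'r') as web_data:
    soup = BeautifulSoup(web_data, 'html.parser')
    print(soup.prettify())  # prettify() shows the parsed tree with indentation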
II. Get the location of the element you want to crawl
In the browser, right-click the element -> Inspect Element -> Copy -> Copy selector. The copied selectors look like this:

"""
body > div.main-content > ul > li:nth-child(1) > div.article-info > h3 > a
body > div.main-content > ul > li:nth-child(1) > div.article-info > p.meta-info > span:nth-child(2)
body > div.main-content > ul > li:nth-child(1) > div.article-info > p.description
body > div.main-content > ul > li:nth-child(1) > div.rate > span
body > div.main-content > ul > li:nth-child(1) > img
"""
images = soup.select('body > div.main-content > ul > li:nth-child(1) > img')
print(images)
Older versions of BeautifulSoup's select() do not support :nth-child (only :nth-of-type), so the selector is modified to:
images = soup.select('body > div.main-content > ul > li:nth-of-type(1) > img')
print(images)
This now returns a single image.
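As a side note (not in the original), BeautifulSoup also offers select_one(), which returns the first matching tag directly instead of a one-element list:

# select_one() returns the first match (or None), so no :nth-of-type index is needed
first_image = soup.select_one('body > div.main-content > ul > li > img')
print(first_image)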
images = soup.select('body > div.main-content > ul > li > img')
print(images)
Dropping the :nth-of-type(1) index gets all of the images.
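For example (an extra illustration, not in the original code), the src attribute of every image can then be collected with a list comprehension:

# Pull just the src attribute out of each <img> tag
image_urls = [img.get('src') for img in images]
print(image_urls)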
titles = soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
descs = soup.select('body > div.main-content > ul > li > div.article-info > p.description')
rates = soup.select('body > div.main-content > ul > li > div.rate > span')
cates = soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
print(images, titles, descs, rates, cates, sep='\n-----------\n')
This gets the rest of the information: titles, descriptions, ratings, and categories.
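For readability, the snippet below is a hypothetical reconstruction of one list item in new_index.html, inferred only from the selectors above; the real file's text and attribute values will differ. Parsing it and running one of the selectors shows how they map onto the markup:

from bs4 import BeautifulSoup

# Hypothetical markup inferred from the copied selectors (the values are made up)
sample_html = """
<body>
  <div class="main-content">
    <ul>
      <li>
        <img src="images/0001.jpg">
        <div class="article-info">
          <h3><a href="articles/0001.html">Sample title</a></h3>
          <p class="meta-info"><span>category-a</span> <span>category-b</span></p>
          <p class="description">Sample description text.</p>
        </div>
        <div class="rate"><span>4.5</span></div>
      </li>
    </ul>
  </div>
</body>
"""
sample_soup = BeautifulSoup(sample_html, 'lxml')
print(sample_soup.select('body > div.main-content > ul > li > div.article-info > h3 > a'))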
III. Get the text (get_text()) and attributes (get()) inside the tags
for title in titles:
    print(title.get_text())
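Two small variations (illustrative, not in the original): get_text(strip=True) trims surrounding whitespace, and get() reads a tag attribute such as an image's src:

for title in titles:
    print(title.get_text(strip=True))   # strip=True removes leading/trailing whitespace

for image in images:
    print(image.get('src'))             # read the src attribute; image['src'] also works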
Package into a dictionary:
for title, image, desc, rate, cate in zip(titles, images, descs, rates, cates):
    data = {
        'title': title.get_text(),
        'rate': rate.get_text(),
        'desc': desc.get_text(),
        'cate': cate.get_text(),
        'image': image.get('src')
    }
    print(data)
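One thing to note (an aside, not in the original): zip() stops at the shortest list, so if the lists have different lengths, items are silently dropped or paired up wrongly. A quick check:

# zip() truncates to the shortest input, so compare the list lengths first
print(len(images), len(titles), len(descs), len(rates), len(cates))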
Because each article's p.meta-info contains several category spans, the span-based cates list does not line up one-to-one with the other lists, so rise to the parent node instead:
cates = soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
for title, image, desc, rate, cate in zip(titles, images, descs, rates, cates):
    data = {
        'title': title.get_text(),
        'rate': rate.get_text(),
        'desc': desc.get_text(),
        'cate': list(cate.stripped_strings),
        'image': image.get('src')
    }
    print(data)
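For comparison (an aside, not in the original), stripped_strings yields every text fragment inside the tag with surrounding whitespace removed, whereas get_text() merges them into a single string:

cate = cates[0]
print(cate.get_text())              # one merged string containing all the category labels
print(list(cate.stripped_strings))  # a list with one entry per category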
# Find articles with ratings greater than 3 (info is the list of data dicts built in the complete code below)
for i in info:
    if float(i['rate']) > 3:
        print(i['title'], i['cate'])
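The same filter can also be written as a list comprehension (just an alternative form, not in the original):

# Keep only the articles whose rating is above 3
top_rated = [item for item in info if float(item['rate']) > 3]
for item in top_rated:
    print(item['title'], item['cate'])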
IV. Complete code
from bs4 import BeautifulSoup

info = []

with open('c:/users/michael/desktop/plan-for-combating-master/week1/1_2/1_2code_of_video/web/new_index.html', 'r') as web_data:
    soup = BeautifulSoup(web_data, 'lxml')
    # print(soup)

    """
    body > div.main-content > ul > li:nth-child(1) > div.article-info > h3 > a
    body > div.main-content > ul > li:nth-child(1) > div.article-info > p.meta-info > span:nth-child(2)
    body > div.main-content > ul > li:nth-child(1) > div.article-info > p.description
    body > div.main-content > ul > li:nth-child(1) > div.rate > span
    body > div.main-content > ul > li:nth-child(1) > img
    """

    images = soup.select('body > div.main-content > ul > li > img')
    titles = soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    descs = soup.select('body > div.main-content > ul > li > div.article-info > p.description')
    rates = soup.select('body > div.main-content > ul > li > div.rate > span')
    cates = soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
    # print(images, titles, descs, rates, cates, sep='\n-----------\n')

    for title, image, desc, rate, cate in zip(titles, images, descs, rates, cates):
        data = {
            'title': title.get_text(),
            'rate': rate.get_text(),
            'desc': desc.get_text(),
            'cate': list(cate.stripped_strings),
            'image': image.get('src')
        }
        # Add to the list
        info.append(data)

# Find articles with ratings greater than 3
for i in info:
    if float(i['rate']) > 3:
        print(i['title'], i['cate'])
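Once info has been built, the same data can be examined in other ways; for example (a small optional extension, not in the original), sorting the articles from highest to lowest rating:

# Sort the collected articles by rating, highest first
for item in sorted(info, key=lambda d: float(d['rate']), reverse=True):
    print(item['rate'], item['title'])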