Since we were free to choose a site of interest for data analysis, the site crawled this time is Xinhua News, at the URL "http://www.xinhuanet.com/". The crawled data is then analyzed and a word cloud is generated.
Import the packages required by the program

```python
import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime
import pandas
import sqlite3
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
```
Crawling page information
```python
url = "http://www.xinhuanet.com/"
f = open("css.txt", "w+")
res0 = requests.get(url)
res0.encoding = "utf-8"
soup = BeautifulSoup(res0.text, "html.parser")
newsgroup = []
# every <li> that contains a link is treated as a headline
for news in soup.select("li"):
    if len(news.select("a")) > 0:
        print(news.select("a")[0].text)
        title = news.select("a")[0].text
        f.write(title)
f.close()
```
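The extraction logic above can be checked offline on a small HTML fragment (the snippet below is an illustrative stand-in, not the actual Xinhua markup): list items without an `<a>` tag are skipped, and only link text is collected.

```python
from bs4 import BeautifulSoup

# hypothetical markup mimicking a news list
html = '<ul><li><a href="#">Headline A</a></li><li>no link here</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# same filter as in the crawler: keep <li> elements that contain a link
titles = [li.select("a")[0].text
          for li in soup.select("li")
          if len(li.select("a")) > 0]
print(titles)  # only "Headline A" survives the filter
```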
Segment and count the words in the TXT file

```python
f0 = open('css.txt', 'r')
qz = f0.read()
f0.close()
print(qz)
words = list(jieba.cut(qz))
# punctuation and stop characters to exclude from the count
ul = {':', '的', '“', ',', '”', '"', '。', '!', ':', '?', ' ', '\u3000', ',', '\n'}
dic = {}
keys = set(words) - ul
for i in keys:
    dic[i] = words.count(i)
c = list(dic.items())
c.sort(key=lambda x: x[1], reverse=True)
f1 = open('diectory.txt', 'w')
# keep the top 10 words; write each word once per occurrence
# so WordCloud later weights it by frequency
for i in range(10):
    print(c[i])
    for words_count in range(c[i][1]):
        f1.write(c[i][0] + ' ')
f1.close()
```
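As an aside, the dictionary-and-sort counting above can be expressed more compactly with the standard library's `collections.Counter`; a minimal sketch on a hypothetical word list (not the crawled data) looks like this:

```python
from collections import Counter

# hypothetical segmented words and stop set, for illustration only
words = ["news", "china", "news", "的", "world", "news"]
ul = {"的", ",", "。"}

# count every word that is not a stop character
counts = Counter(w for w in words if w not in ul)
top = counts.most_common(2)  # the two most frequent words, highest first
print(top)
```

`most_common(n)` replaces the manual `list(dic.items())` plus `sort(..., reverse=True)` step.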
Store the words in a database

```python
df = pandas.DataFrame(words)
print(df.head())
with sqlite3.connect('newsdb3.sqlite') as db:
    df.to_sql('newsdb3', con=db)
```
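To confirm the insert worked, the table can be read back with `pandas.read_sql`. The sketch below uses an in-memory database and a made-up word list so it runs standalone; the table and column names are illustrative:

```python
import sqlite3
import pandas

# hypothetical segmented words standing in for the crawled data
words = ["新华", "新闻", "中国"]
df = pandas.DataFrame(words, columns=["word"])

with sqlite3.connect(":memory:") as db:
    df.to_sql("newsdb3", con=db, index=False)
    # read the table back to verify the rows arrived intact
    back = pandas.read_sql("SELECT word FROM newsdb3", con=db)
print(back)
```

Note that `sqlite3.connect(...)` as a context manager commits the transaction but does not close the connection.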
Make Word Cloud
```python
f3 = open('diectory.txt', 'r')
cy_file = f3.read()
f3.close()
cy = WordCloud().generate(cy_file)
plt.imshow(cy)
plt.axis("off")
plt.show()
```
Final results
Full code
```python
import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime
import pandas
import sqlite3
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# crawl the headline links from the Xinhua front page
url = "http://www.xinhuanet.com/"
f = open("css.txt", "w+")
res0 = requests.get(url)
res0.encoding = "utf-8"
soup = BeautifulSoup(res0.text, "html.parser")
newsgroup = []
for news in soup.select("li"):
    if len(news.select("a")) > 0:
        print(news.select("a")[0].text)
        title = news.select("a")[0].text
        f.write(title)
f.close()

# segment the text and count word frequencies
f0 = open('css.txt', 'r')
qz = f0.read()
f0.close()
print(qz)
words = list(jieba.cut(qz))
ul = {':', '的', '“', ',', '”', '"', '。', '!', ':', '?', ' ', '\u3000', ',', '\n'}
dic = {}
keys = set(words) - ul
for i in keys:
    dic[i] = words.count(i)
c = list(dic.items())
c.sort(key=lambda x: x[1], reverse=True)
f1 = open('diectory.txt', 'w')
for i in range(10):
    print(c[i])
    for words_count in range(c[i][1]):
        f1.write(c[i][0] + ' ')
f1.close()

# store the words in a SQLite database
df = pandas.DataFrame(words)
print(df.head())
with sqlite3.connect('newsdb3.sqlite') as db:
    df.to_sql('newsdb3', con=db)

# generate and display the word cloud
f3 = open('diectory.txt', 'r')
cy_file = f3.read()
f3.close()
cy = WordCloud().generate(cy_file)
plt.imshow(cy)
plt.axis("off")
plt.show()
```
A complete Python course project.