Python: Scraping Chinese Herbal Medicine Information

Last updated: 2018-04-24. Source: Internet

Uploaded by: User



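1. Choose a topic or website that interests you. (No two students may pick the same one.)

Source URL: http://www.18ladys.com/

2. Write a crawler in Python to scrape data on the chosen topic from the web.

3. Run text analysis on the scraped data and generate a word cloud.

Figure 3-1: Word cloud produced by the crawler program

4. Explain the results of the text analysis.

Because the crawler collected only the category and name of each herb, with no further detail, the word cloud consists mostly of Chinese herbal medicine names.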
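To check this interpretation numerically, one could count how often each entry occurs in the exported text. The sketch below is only an illustration: it assumes the text follows a numbered `1.name 2.name` layout like the one the crawler writes, and the helper name `count_entries` and the sample string are hypothetical.

```python
import re
from collections import Counter


def count_entries(text):
    """Count entries in text laid out as '1.name 2.name ...'.

    Assumption: each entry is a number, a dot, then a name without spaces.
    """
    # Capture the name that follows each "<number>." prefix.
    names = re.findall(r"\d+\.(\S+)", text)
    return Counter(names)


if __name__ == "__main__":
    sample = "1.人參 2.當歸 3.人參"  # hypothetical sample, not real crawl output
    for name, count in count_entries(sample).most_common():
        print(name, count)
```

Names that appear most often in such a count are the ones a word cloud would draw largest.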

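5. Write a complete blog post describing the implementation process, the problems encountered and how they were solved, the data analysis approach, and the conclusions.

(1). Two files were written:

1). A .py file that scrapes the data and writes it to a .txt file

2). A .py file that generates the word cloud using the relevant Python packages

(2). Problems encountered and solutions:

1). Installing and configuring the wordcloud package caused major problems; having two Python versions installed on the machine led to many additional installation issues.

Solution: With a classmate's help, installed the package from a .whl file and removed the other Python version from the machine.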
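When several Python installations coexist, a package can end up installed for a different interpreter than the one running the script. As a hedged illustration (not the fix described above), the sketch below reports which interpreter is active and builds a `pip` command bound to that same interpreter. The helper name `pip_install_command` is hypothetical.

```python
import sys


def pip_install_command(package):
    """Return a pip command tied to the interpreter running this script."""
    # "python -m pip" installs into this interpreter's own environment.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Version:", sys.version.split()[0])
    print("Install with:", " ".join(pip_install_command("wordcloud")))
```

Running the printed command with the same interpreter that will later run the script avoids installing the package into the wrong Python version.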

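2). Scraping is too slow

Solution: Not yet solved. The crawl is expected to cover more than 100 pages, so speeding it up may require other techniques.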
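One commonly used technique for this kind of network-bound work is to download several pages at once with a thread pool. The following is a minimal sketch under stated assumptions, not a tested fix for this crawler: `fetch_all` is a hypothetical helper, and `fetch` stands for any function that downloads one page (for example, a wrapper around `requests.get`). A stand-in function is used here so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Apply fetch to every URL using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a real download such as requests.get(url).text
        return "page for " + url

    pages = fetch_all(["page1", "page2", "page3"], fake_fetch)
    print(pages)
```

Threads help here because most of the time is spent waiting for responses. A small `max_workers` value keeps the load on the target site modest.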

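6. Finally, submit all scraped data along with the crawler and data analysis source code.

(1). Text file part

1). The .py file that scrapes the data and writes the .txt file: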

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.18ladys.com/post/buchong/'


# Fetch a page and parse it
def catchSoup(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup


# Find each category and its page address (from the home page)
def kindSearch(soup):
    herbKind = []
    for new in soup.select('li'):
        links = new.select('a')
        # Skip the "home" entry and any list item without a link
        if new.text != '首頁' and links:
            herbKind.append([new.text, links[0].attrs['href']])
    return herbKind


# Remove an exact trailing suffix (str.rstrip would strip a set of characters instead)
def cutSuffix(text, suffix):
    return text[:-len(suffix)] if text.endswith(suffix) else text


# Find the herb names on a page
def nameSearch(soup):
    herbName = []
    for new in soup.select('h3'):
        pername = new.text.split('_')[0].lstrip('\xa0')
        pername = cutSuffix(pername, '圖片')
        pername = cutSuffix(pername, '的功效與作用')
        herbName.append(pername)
    return herbName


# Pagination links for a category, as [link text, address] pairs
def perPage(soup):
    kindPage = []
    for new in soup.select('.post.pagebar'):
        for detail in new.select('a'):
            kindPage.append([detail.text, detail.attrs['href']])
    # Drop the first and last entries of the page bar
    return kindPage[1:-1]


# Crawl every herb name in one category; kind is an index into the kindSearch result
def herbDetail(kind):
    soup = catchSoup(BASE_URL)                  # start from the home page
    kinds = kindSearch(soup)
    kindName = kinds[kind][0]                   # name of this category
    adds = kinds[kind][1]                       # address of this category's first page
    totalRecord = []                            # all herb names in this category
    print('Crawling ' + str(kind) + '.' + kindName)
    firstPage = catchSoup(adds)
    totalRecord.extend(nameSearch(firstPage))   # herbs on the first page
    for add in perPage(firstPage):              # herbs on page two onward
        totalRecord.extend(nameSearch(catchSoup(add[1])))
    print(totalRecord)
    return totalRecord


# ===========================================================
#                      Main
# ===========================================================
if __name__ == "__main__":
    # Category names and their page addresses (from the home page)
    totalKind = kindSearch(catchSoup(BASE_URL))
    # Table of contents listing every category
    detailContent = '目錄：\n'
    for n, item in enumerate(totalKind, 1):
        detailContent += str(n) + '.' + item[0] + ' '
    # Herb names per category; index 0 contributes only the table of contents
    for kind in range(1, min(20, len(totalKind))):
        totalRecord = herbDetail(kind)
        detailContent += '\n' + str(totalKind[kind][0]) + ':\n'
        for n, name in enumerate(totalRecord, 1):
            detailContent += str(n) + '.' + name + ' '
    with open('herbDetail.txt', 'a+', encoding='utf-8') as f:
        f.write(detailContent)
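A detail that matters when trimming fixed suffixes from scraped headings: `str.rstrip` takes a set of characters, not a literal suffix, so it can remove more than intended. The sketch below contrasts the two behaviors. The helper name `strip_suffix` and the sample heading are hypothetical and chosen only to make the difference visible.

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text only when it matches exactly."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


if __name__ == "__main__":
    heading = "妙用的功效與作用"  # hypothetical heading, for illustration only
    print(heading.rstrip("的功效與作用"))        # character-set stripping
    print(strip_suffix(heading, "的功效與作用"))  # exact-suffix removal
```

With this sample, `rstrip` also eats the trailing `用` of the name itself, while the exact-suffix helper leaves the name intact. On Python 3.9 and later, `str.removesuffix` provides the same exact-match behavior.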

2). Program run:

3). Exported document:

(2). Word cloud generation part

1). The .py file that generates the word cloud:

from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

# Read the crawled text and segment it into space-separated words
with open('D:\\herbDetail.txt', 'r', encoding='utf-8') as f:
    comment_text = f.read()
cut_text = " ".join(jieba.cut(comment_text))

# SimHei supplies the Chinese glyphs needed to render the words
cloud = WordCloud(
    font_path="C:\\Windows\\Fonts\\simhei.ttf",
    background_color='white',
    max_words=2000,
    max_font_size=40
)
word_cloud = cloud.generate(cut_text)
word_cloud.to_file("cloud4herb.jpg")

# Display the word cloud image
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

2). The generated word cloud image:

See step 3 above.
