Python Web Crawler (7): A Baidu Wenku Article Scraper

Last updated: 2018-07-24 Source: Internet

Uploader: User

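When crawling Baidu Wenku articles with the methods from the earlier posts, I found that only the few pages already displayed could be fetched; the content of pages not yet displayed was unavailable. To see the whole article, you have to manually click "Continue reading" at the bottom so that every page is shown.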

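Inspecting the elements shows that the HTML before expansion differs from the HTML after expansion: in the former, the text content of the hidden pages is not present. The crawler receives the pre-expansion HTML, so it can only get part of the content.

This article uses a tool to automate the page interaction and obtain the expanded HTML.

Controlling the browser with the Selenium automation tool

Installing Selenium

pip3 install selenium

Installing chromedriver.exe

I ran into quite a few pitfalls here.
Driver download address:
http://chromedriver.storage.googleapis.com/index.html
Be sure to download the chromedriver that matches your Chrome version. Note that a higher driver version number does not necessarily correspond to the latest Chrome browser; check the notes.txt file carefully for the mapping. For example, my Chrome is v62, and the chromedriver that supports it is v2.33. Move chromedriver.exe into the C:\Program Files (x86)\Google\Chrome\Application\ directory, then set the environment variable: press Win+R, enter sysdm.cpl, open Advanced, then Environment Variables, and add C:\Program Files (x86)\Google\Chrome\Application\ (the folder that contains chromedriver.exe) to Path, since Path entries are directories rather than individual files. Alternatively, specify the full path to the driver when starting Chrome:

browser = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')

In newer Selenium 4 releases the driver path is passed through a Service object rather than as a positional argument.

Driving the page automatically with Selenium: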

from selenium import webdriver

options = webdriver.ChromeOptions()
# Present a mobile user agent to the site
options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"')
# The options= keyword replaces the deprecated chrome_options=
driver = webdriver.Chrome(options=options)
driver.get('https://www.baidu.com/')
html = driver.page_source
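Since a missing or misplaced driver was the main pitfall described above, it may help to check where chromedriver actually is before starting the browser. The following is only a minimal sketch using the standard library; the function name `locate_driver` and its `fallback_dir` parameter are hypothetical helpers introduced for illustration and are not part of Selenium.

```python
import os
import shutil


def locate_driver(name="chromedriver", fallback_dir=None):
    """Return the full path of the driver executable, or None if not found.

    `name` and `fallback_dir` are illustrative parameters (assumptions),
    not anything defined by Selenium itself.
    """
    # First, search the directories listed in the PATH environment variable.
    found = shutil.which(name)
    if found:
        return found
    # Otherwise, look inside an explicitly given folder, with or without ".exe".
    if fallback_dir:
        for filename in (name, name + ".exe"):
            candidate = os.path.join(fallback_dir, filename)
            if os.path.isfile(candidate):
                return candidate
    return None
```

If the function returns None, the driver is neither on Path nor in the fallback folder, which is worth fixing before debugging anything else. Note that this only checks the location; whether the driver version matches the installed Chrome still has to be verified against notes.txt.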

Full code

# contents_bdwk.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# *** Drive the page with Selenium ***
options = webdriver.ChromeOptions()
# Set the user agent
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36"')
driver = webdriver.Chrome(options=options)
driver.get('https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html')  # put the article URL here
page = driver.find_element(By.XPATH, "//div[@id='html-reader-go-more']")
driver.execute_script('arguments[0].scrollIntoView();', page)  # scroll until the element is visible
nextpage = driver.find_element(By.XPATH, "//span[@class='moreBtn goBtn']")
nextpage.click()  # click "Continue reading" to expand the remaining pages

# *** Parse the expanded HTML ***
html = driver.page_source
driver.quit()  # the HTML is captured, so the browser can be closed
bf1 = BeautifulSoup(html, 'lxml')

# Get the article title
title = bf1.find_all('h1', class_='reader_ab_test with-top-banner')
bf2 = BeautifulSoup(str(title), 'lxml')
title = bf2.find('span')
title = title.get_text()
filename = title + '.txt'

# Get the article content
texts_list = []
result = bf1.find_all('div', class_='ie-fix')
for each_result in result:
    for each_text in each_result.find_all('p'):
        # get_text() always returns a string; .string can be None for
        # paragraphs with nested tags, which would make join() fail
        texts_list.append(each_text.get_text())
contents = ''.join(texts_list).replace('\xa0', '')

# *** Save as a .txt file ***
with open(filename, 'a', encoding='utf-8') as f:
    f.write(contents)
    f.write('\n\n')
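The script builds the output file name directly from the page title and joins the extracted fragments as they come. Titles can contain characters that Windows does not allow in file names, and a fragment list can contain None entries if it was collected with `.string`. The sketch below shows one possible way to guard against both using only the standard library; `safe_filename` and `join_texts` are hypothetical helper names introduced here for illustration, not functions from Selenium or BeautifulSoup.

```python
import re


def safe_filename(title, default="untitled"):
    """Turn an article title into a .txt file name (illustrative helper)."""
    # Replace characters that Windows forbids in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title)
    # Collapse runs of whitespace (including non-breaking spaces),
    # then trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a default name if nothing usable is left.
    return (cleaned or default) + ".txt"


def join_texts(pieces):
    """Join text fragments, skipping None/empty ones and removing \\xa0."""
    return "".join(p for p in pieces if p).replace("\xa0", "")
```

With these helpers, `filename = safe_filename(title)` and `contents = join_texts(texts_list)` could stand in for the corresponding lines of the script, under the assumption that replacing forbidden characters with underscores is acceptable for your use.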
