A Python Script for Downloading Articles from 《物理》

Last updated: 2015-02-13 Source: Internet

Uploaded by: User


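I'm hopeless at physics, but I still enjoy reading about it in my spare time. A good source is 《物理》, whose articles can be downloaded for free. Downloading them one by one is tedious, though, and each file is saved under the UTF-8 encoding of the article title, so it has to be renamed afterwards. The download links are not written directly into the page; they are generated when you click to download, so tools such as DownThemAll and Xunlei (迅雷) are of no use. So I wrote a download script myself.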
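As a minimal sketch of the renaming chore described above, assuming the browser saves each file under the percent-encoded UTF-8 bytes of the title (the post does not show the exact form), the standard library can recover the readable name. The helper name `decode_saved_name` and the sample string are hypothetical.

```python
from urllib.parse import unquote


def decode_saved_name(saved_name: str) -> str:
    """Turn a percent-encoded UTF-8 file name back into readable text.

    Assumption: the name is percent-encoded; the post only says it is
    "the UTF-8 encoding of the title".
    """
    return unquote(saved_name, encoding="utf-8")


# Hypothetical sample: the percent-encoded form of "物理.pdf"
print(decode_saved_name("%E7%89%A9%E7%90%86.pdf"))
```

If the saved names turn out to use a different escaping scheme, only the body of `decode_saved_name` would need to change.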

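Looking at the page source, the download link is generated from the file type (probably always pdf) and an ID. The site sends the request with POST, while I use GET; I'm not yet clear on the difference between the two, and I also plan to learn some jQuery.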
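To make the GET/POST difference concrete, here is a small sketch using only the standard library; it builds both kinds of request without sending anything. The endpoint and the parameter names (`attachType`, `id`) are the ones used by the script in this post; the ID value `12345` is made up. In a GET request the parameters travel in the URL's query string, whereas in a POST request they travel in the request body.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"
params = {"attachType": "PDF", "id": "12345"}  # the id is a made-up value
encoded = urlencode(params)

# GET: parameters are appended to the URL; there is no request body.
get_request = Request(ENDPOINT + "?" + encoded)

# POST: the URL stays bare; parameters are sent as the request body.
post_request = Request(ENDPOINT, data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

The script in this post relies on the server also accepting the GET form; the sketch only shows where the parameters end up in each case.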

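I originally hoped to download only the articles I was interested in. Each article on the page has a checkbox, and ticking it highlights the article; honestly, I don't know what the site uses this for. Perhaps I could tick the articles I'm interested in and then download them. Ticking a box changes the element's class from noselectrow to selectedrow. The relevant code is: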

    function hightlightrowaction(rowid) {
        var thisrow = $("#" + rowid);
        if ($(thisrow).hasClass("selectedrow")) {
            $(thisrow).removeClass("selectedrow");
            $(thisrow).addClass("noselectrow");
        } else {
            $(thisrow).addClass("selectedrow");
            $(thisrow).removeClass("noselectrow");
        }
    }


Sometimes, though, the class does not change after ticking the box. Something seems off, and I haven't figured it out yet.
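One thing worth noting: judging from the snippet above, the toggle only changes the class in the browser's DOM, so a script that fetches the page itself would not see which boxes were ticked. A possible workaround, sketched here as an assumption rather than something the original script does, is to choose articles by title keyword. The function name `select_titles` and the sample titles are hypothetical.

```python
def select_titles(titles, keywords):
    """Return the titles that contain at least one of the keywords."""
    return [title for title in titles if any(key in title for key in keywords)]


# Hypothetical titles, only for illustration.
sample_titles = ["量子计算简介", "超导材料的新进展", "物理学史漫谈"]
print(select_titles(sample_titles, ["量子", "超导"]))
```

A filter like this could be applied to the article titles before the download step, so that only matching articles are fetched.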

The Python script follows. It uses BeautifulSoup and requests. The regular expressions are rather crude.

    # -*- coding: utf-8 -*-
    """
    This script automatically downloads files from 《物理》
    (http://www.wuli.ac.cn/CN/volumn/home.shtml).

    Example usage:

    downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_1696.shtml")
    """
    import os
    import re

    import requests
    from bs4 import BeautifulSoup

    DOWNLOAD_URL = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"


    def hasDownloadLink(tag):
        # True for tags whose onclick handler calls showArticleFile(...)
        return tag.has_attr('onclick') and tag['onclick'].startswith('showArticleFile')


    def getFileTypeAndID(fileInfo):
        """
        :param fileInfo: value of the onclick attribute of the download link
        :return: file type (usually pdf) and file ID
        """
        m = re.match(r'[^,]*,\s*[\'\"](.*)[\'\"][^,]*,\s*([^\)]*).*', fileInfo)
        return m.group(1), m.group(2)


    def getPublicationYearMonth(tag):
        """
        :param tag: tag whose text contains the publication date
        :return: publication year and month in the form YYYY-MM
        """
        return re.match(r'.*(\d{4}-\d{2}).*', tag.get_text()).group(1)


    def modifyFileName(fname):
        # remove characters that Windows does not allow in file names
        for inValidChar in r'\/:*?"<>|':
            fname = fname.replace(inValidChar, '')
        return fname


    def writeLog(saveDirectory, errMsg):
        # write one failure message per line, encoded as UTF-8
        logPath = os.path.join(saveDirectory, "download log.txt")
        with open(logPath, 'w', encoding='utf-8') as fhandle:
            for msg in errMsg:
                fhandle.write(msg + "\n")


    def downloadFiles(saveDirectory, url, onlyDownloadSeleted=False):
        """
        :param saveDirectory: directory to store the downloaded files
        :param url: url of the download page
        :param onlyDownloadSeleted: not implemented yet. Ideally, it should allow
            one to download only the articles of interest instead of all files.
        :return: None
        """
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        volumeAndDateTag = soup.find(class_="STYLE5")
        yearMonth = getPublicationYearMonth(volumeAndDateTag)
        year = yearMonth[:4]
        absolutePath = os.path.join(saveDirectory, year, yearMonth)
        if not os.path.exists(absolutePath):
            os.makedirs(absolutePath)
        articleMark = "selectedrow" if onlyDownloadSeleted else "noselectrow"
        articles = soup.find_all(class_=articleMark)
        errMsg = []
        for index, article in enumerate(articles, 1):
            print('Downloading file %d, %d left.' % (index, len(articles) - index))
            # the title of an article is contained in its first anchor
            title = article.find('a').get_text()
            title = modifyFileName(title)
            try:
                downloadAnchor = article.find(hasDownloadLink)
                fileInfo = downloadAnchor['onclick']
                fileType, fileID = getFileTypeAndID(fileInfo)
                fileName = title + '.' + fileType.lower()
                filePath = os.path.join(absolutePath, fileName)
                param = {"attachType": fileType, "id": fileID}
                # skip files that were already downloaded
                if not os.path.exists(filePath):
                    articleFile = requests.get(DOWNLOAD_URL, params=param)
                    with open(filePath, "wb") as fhandle:
                        fhandle.write(articleFile.content)
            except Exception:
                errMsg.append(title + " download failed")

        if len(errMsg) > 0:
            writeLog(absolutePath, errMsg)


    if __name__ == "__main__":
        downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_921.shtml")
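The two regular expressions in the script are easier to follow on concrete input. The sketch below applies the same patterns to made-up strings; the real `onclick` attribute and date text on the site may be formatted differently, so both samples are assumptions, and the snake_case function names are only used here for illustration. Unlike the script, the helpers raise a `ValueError` when the input does not match, which makes a format change on the site easier to spot.

```python
import re

# Same patterns as in the script above.
FILE_INFO_PATTERN = re.compile(r'[^,]*,\s*[\'\"](.*)[\'\"][^,]*,\s*([^\)]*).*')
YEAR_MONTH_PATTERN = re.compile(r'.*(\d{4}-\d{2}).*')


def get_file_type_and_id(onclick_value):
    """Extract (file type, file ID) from an onclick attribute value."""
    match = FILE_INFO_PATTERN.match(onclick_value)
    if match is None:
        raise ValueError("unexpected onclick format: %r" % onclick_value)
    return match.group(1), match.group(2)


def get_publication_year_month(text):
    """Extract YYYY-MM from a line of text containing a YYYY-MM-DD date."""
    match = YEAR_MONTH_PATTERN.match(text)
    if match is None:
        raise ValueError("no date found in: %r" % text)
    return match.group(1)


# Hypothetical inputs; the site's actual markup is not shown in this post.
print(get_file_type_and_id("showArticleFile(this, 'PDF', 12345)"))
print(get_publication_year_month("2015, Vol. 44, No. 2, published 2015-02-12"))
```

Because `.` does not match a newline by default, the date pattern only works when the date sits on the first line of the text it is given.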


This article was originally written in Chinese and published on aliyun.com; an English version is also provided for informational purposes only.
