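Although I'm hopeless at physics, I still like to read about it now and then for my own edification. One good source is 《物理》, whose articles can be downloaded for free. Downloading them one by one is tedious, though, and each file is saved under the UTF-8 encoding of the article title, so it has to be renamed afterwards. The download link for an article is not written directly into the page; it is generated when you click to download, so tools like DownThemAll and Xunlei (迅雷) are of no use. So I wrote a download script myself.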
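Judging from the page source, the download request is built from the file type (which should always be pdf) and an id. The site uses POST while I use GET; I'm not yet clear on the difference between the two, and I also plan to learn some jQuery.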
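A rough sketch of the difference, using only the standard library: a GET request carries its parameters in the URL's query string, while a POST request carries them in the request body. The id value 12345 below is made up for illustration, and nothing is actually sent over the network; the requests are only constructed and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"
# Hypothetical parameter values, for illustration only.
params = {"attachType": "PDF", "id": "12345"}


def build_get(base_url, params):
    # GET: the parameters become part of the URL itself.
    return Request(base_url + "?" + urlencode(params))


def build_post(base_url, params):
    # POST: the URL stays bare and the parameters travel in the body.
    return Request(base_url, data=urlencode(params).encode("ascii"))


get_req = build_get(BASE_URL, params)
post_req = build_post(BASE_URL, params)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Whether a server accepts both methods for the same endpoint is up to the server; the script in this post relies on this endpoint accepting GET.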
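I originally hoped to download only the articles I'm interested in. Each article on the page has a checkbox, and checking it highlights that article; honestly, I don't know what the site uses this for. Perhaps I could check the articles I'm interested in and then download only those. When a box is checked, the class of the corresponding element changes from noselectrow to selectedrow. The relevant code is as follows: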
function hightlightrowaction(rowid) {
    var thisrow = $("#" + rowid);
    if ($(thisrow).hasClass("selectedrow")) {
        $(thisrow).removeClass("selectedrow");
        $(thisrow).addClass("noselectrow");
    } else {
        $(thisrow).addClass("selectedrow");
        $(thisrow).removeClass("noselectrow");
    }
}
But sometimes the class does not change after a box is checked. Something seems off, and I haven't figured it out yet.
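One thing worth keeping in mind here: the class is toggled by jQuery inside the browser, so a script that fetches the page itself would most likely only ever see the initial class on every row. A minimal sketch of that check, using the standard library's html.parser on made-up markup (the table structure and row ids are assumptions; only the two class names come from the site):

```python
from html.parser import HTMLParser

# Hypothetical markup, modeled only on the class names mentioned above.
SERVER_HTML = """
<table>
  <tr id="row1" class="noselectrow"><td><a>Article A</a></td></tr>
  <tr id="row2" class="noselectrow"><td><a>Article B</a></td></tr>
</table>
"""


class RowClassCounter(HTMLParser):
    """Count how many elements carry each of the two row classes."""

    def __init__(self):
        super().__init__()
        self.counts = {"selectedrow": 0, "noselectrow": 0}

    def handle_starttag(self, tag, attrs):
        # An element may have several classes separated by whitespace.
        classes = (dict(attrs).get("class") or "").split()
        for name in self.counts:
            if name in classes:
                self.counts[name] += 1


counter = RowClassCounter()
counter.feed(SERVER_HTML)
print(counter.counts)
```

If the real page behaves the same way, selecting articles in the browser cannot be seen by a separate script, and the selection would have to be passed in some other way, for example as a list of titles.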
The Python script is below; it uses BeautifulSoup and requests. The regexes are poorly written.
# -*- coding: utf-8 -*-
"""
This script downloads files from 《物理》
(http://www.wuli.ac.cn/CN/volumn/home.shtml) automatically.

Example usage:

    downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_1696.shtml")
"""
import os
import re

import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"


def hasDownloadLink(tag):
    # the download anchor calls showArticleFile(...) in its onclick handler
    return tag.has_attr('onclick') and tag['onclick'].startswith('showArticleFile')


def getFileTypeAndID(fileInfo):
    """
    :param fileInfo: value of the onclick attribute of a download anchor
    :return: file type (usually pdf) and file ID
    """
    m = re.match(r'[^,]*,\s*[\'\"](.*)[\'\"][^,]*,\s*([^\)]*).*', fileInfo)
    return m.groups()[0], m.groups()[1]


def getPublicationYearMonth(tag):
    """
    :param tag: tag whose text contains the volume and publication date
    :return: publication year and month in the form YYYY-MM
    """
    return re.match(r'.*(\d{4}-\d{2}).*', tag.get_text()).groups()[0]


def modifyFileName(fname):
    # get rid of characters which Windows does not allow in file names
    for inValidChar in r'\/:*?"<>|':
        fname = fname.replace(inValidChar, '')
    return fname


def writeLog(saveDirectory, errMsg):
    # one message per line, written as UTF-8
    logPath = os.path.join(saveDirectory, "download log.txt")
    with open(logPath, 'w', encoding='utf-8') as fhandle:
        for msg in errMsg:
            fhandle.write(msg + '\n')


def downloadFiles(saveDirectory, url, onlyDownloadSeleted=False):
    """
    :param saveDirectory: directory to store the downloaded files
    :param url: url of the download page
    :param onlyDownloadSeleted: not implemented yet.
        Ideally, it should allow one to download only the articles of
        interest instead of all files.
    :return: None
    """
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'html.parser')
    volumeAndDateTag = soup.find(class_="STYLE5")
    yearMonth = getPublicationYearMonth(volumeAndDateTag)
    year = yearMonth[:4]
    # files are stored under <saveDirectory>/<YYYY>/<YYYY-MM>/
    absolutePath = os.path.join(saveDirectory, year, yearMonth)
    if not os.path.exists(absolutePath):
        os.makedirs(absolutePath)
    articleMark = "selectedrow" if onlyDownloadSeleted else "noselectrow"
    articles = soup.find_all(class_=articleMark)
    errMsg = []
    for index, article in enumerate(articles, 1):
        print('Downloading file %d, %d left.' % (index, len(articles) - index))
        # the title of an article is contained in its first anchor
        title = modifyFileName(article.find('a').get_text().strip())
        try:
            downloadAnchor = article.find(hasDownloadLink)
            fileInfo = downloadAnchor['onclick']
            fileType, fileID = getFileTypeAndID(fileInfo)
            fileName = title + '.' + fileType.lower()
            filePath = os.path.join(absolutePath, fileName)
            param = {"attachType": fileType, "id": fileID}
            # skip files that were already downloaded
            if not os.path.exists(filePath):
                articleFile = requests.get(DOWNLOAD_URL, params=param)
                articleFile.raise_for_status()
                with open(filePath, "wb") as fhandle:
                    fhandle.write(articleFile.content)
        except Exception as e:
            errMsg.append("%s download failed: %s" % (title, e))

    if errMsg:
        writeLog(absolutePath, errMsg)


if __name__ == "__main__":
    downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_921.shtml")
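Since the regex for the onclick attribute is admittedly rough, here is a sketch of a stricter alternative that uses named groups and returns None instead of raising when nothing matches. The sample onclick string is hypothetical; I only assume the call looks like showArticleFile(something, 'TYPE', id), which is what the regex in the script implies. The names ONCLICK_PATTERN and parse_onclick are made up for this sketch.

```python
import re

# Hypothetical onclick value; the real attribute on the site may differ.
SAMPLE_ONCLICK = "showArticleFile('../article/downloadArticleFile.do','PDF',12345);"

ONCLICK_PATTERN = re.compile(
    r"""showArticleFile\(\s*
        [^,]*,\s*
        ['"](?P<type>\w+)['"]\s*,\s*
        ['"]?(?P<id>\d+)['"]?
        \s*\)""",
    re.VERBOSE,
)


def parse_onclick(value):
    """Return (file type, file id) as strings, or None if the value does not match."""
    m = ONCLICK_PATTERN.search(value)
    if m is None:
        return None
    return m.group("type"), m.group("id")


print(parse_onclick(SAMPLE_ONCLICK))
print(parse_onclick("alert(1)"))
```

Returning None makes it possible to log an unrecognized onclick value explicitly rather than failing on m.groups() with an AttributeError.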
A Python script for downloading articles from 《物理》