A Python script for downloading articles from 《物理》

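I'm hopeless at physics, but I still like reading about it in my spare time. One good source is 《物理》, whose articles can be downloaded for free. Downloading them one by one is tedious, though, and each file is saved under a name that is the UTF-8 encoding of the article title, so it has to be renamed afterwards. The article links are not written directly into the page; they are generated when you click the download button, so tools such as DownThemAll and Xunlei (迅雷) are no help. So I wrote a download script myself.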

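Looking at the page source, the download request is built from the file type (presumably always pdf) and the article id. The site sends it with POST, whereas I use GET. I'm not yet clear on the difference between the two, and I also plan to learn some jQuery.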
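For reference, the GET/POST difference comes down to where the parameters travel: with GET they are part of the URL's query string, with POST they are in the request body. The sketch below builds both kinds of request with the standard library, without sending anything. The endpoint and the parameter names attachType and id are the ones used in the download script in this post; the id value 1234 is a made-up placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and parameter names come from the download script in this post;
# the id value is a made-up placeholder.
DOWNLOAD_URL = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"
params = {"attachType": "PDF", "id": "1234"}
encoded = urlencode(params)

# GET: the parameters are appended to the URL as a query string.
get_request = Request(DOWNLOAD_URL + "?" + encoded)

# POST: the same parameters are carried in the request body instead.
post_request = Request(DOWNLOAD_URL, data=encoded.encode("ascii"))

# Nothing is sent over the network; we only inspect the prepared requests.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Whether a server accepts both methods for the same endpoint depends on how it is written, so this only shows how the two requests differ on the client side.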

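I originally hoped to download only the articles I'm interested in. Each article on the page has a checkbox, and ticking it highlights the article. Honestly, I don't know what the site uses this for. Perhaps I could tick the articles I want and then download just those. Ticking the box changes the element's class from noselectrow to selectedrow. The relevant code is: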

function hightlightrowaction(rowid) {
    var thisrow = $("#" + rowid);
    if ($(thisrow).hasClass("selectedrow")) {
        $(thisrow).removeClass("selectedrow");
        $(thisrow).addClass("noselectrow");
    } else {
        $(thisrow).addClass("selectedrow");
        $(thisrow).removeClass("noselectrow");
    }
}

Sometimes, however, the class does not change after ticking the box. Something seems off, and I haven't figured out what yet.
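As far as the jQuery snippet shows, the class is toggled in the browser, so a script that fetches the page on its own would presumably never see the ticked state. One alternative is to choose articles by title keyword instead. This is a minimal sketch of that idea, not something the site provides: selectTitles is a hypothetical helper and the titles are made up for illustration.

```python
def selectTitles(titles, keywords):
    """Return the titles that contain at least one of the keywords."""
    return [t for t in titles if any(k in t for k in keywords)]


# Made-up titles, for illustration only.
titles = ["量子計算簡介", "超導材料研究進展", "物理學史漫談"]
print(selectTitles(titles, ["量子", "超導"]))
```

A filter like this could be applied to the title of each article row before its download request is made.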

The Python script is below. It uses BeautifulSoup and requests. The regular expressions are pretty crude.

# -*- coding: utf-8 -*-
"""
This script automatically downloads files from 《物理》
(http://www.wuli.ac.cn/CN/volumn/home.shtml).

Example usage:

downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_1696.shtml")
"""
import os
import re

import requests
from bs4 import BeautifulSoup


def hasDownloadLink(tag):
    # A download link carries an onclick handler that calls showArticleFile(...)
    return tag.has_attr('onclick') and tag['onclick'].startswith('showArticleFile')


def getFileTypeAndID(fileInfo):
    """
    :param fileInfo: value of the onclick attribute of a download link
    :return: file type (usually pdf) and file ID
    """
    m = re.match(r'[^,]*,\s*[\'\"](.*)[\'\"][^,]*,\s*([^\)]*).*', fileInfo)
    return m.group(1), m.group(2)


def getPublicationYearMonth(tag):
    """
    :param tag: tag whose text contains the volume and publication date
    :return: publication year and month in the form YYYY-MM
    """
    return re.match(r'.*(\d{4}-\d{2}).*', tag.get_text()).group(1)


def modifyFileName(fname):
    # Remove characters that Windows does not allow in file names
    # ('*' was missing from the original list)
    for inValidChar in r'\/:*?"<>|':
        fname = fname.replace(inValidChar, '')
    return fname


def writeLog(saveDirectory, errMsg):
    # One message per line, written as UTF-8
    logPath = os.path.join(saveDirectory, "download log.txt")
    with open(logPath, 'w', encoding='utf-8') as fhandle:
        for msg in errMsg:
            fhandle.write(msg + '\n')


def downloadFiles(saveDirectory, url, onlyDownloadSeleted=False):
    """
    :param saveDirectory: directory to store the downloaded files
    :param url: url of the download page
    :param onlyDownloadSeleted: not implemented yet. Ideally, it should allow
        one to download only the articles of interest instead of all files.
    :return: None
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    volumeAndDateTag = soup.find(class_="STYLE5")
    yearMonth = getPublicationYearMonth(volumeAndDateTag)
    year = yearMonth[:4]
    # Files are stored under <saveDirectory>/<YYYY>/<YYYY-MM>/
    absolutePath = os.path.join(saveDirectory, year, yearMonth)
    os.makedirs(absolutePath, exist_ok=True)
    articleMark = "selectedrow" if onlyDownloadSeleted else "noselectrow"
    articles = soup.find_all(class_=articleMark)
    errMsg = []
    for index, article in enumerate(articles, 1):
        print('Downloading file %d, %d left.' % (index, len(articles) - index))
        # The title of an article is contained in its first anchor
        title = modifyFileName(article.find('a').get_text())
        try:
            downloadAnchor = article.find(hasDownloadLink)
            fileInfo = downloadAnchor['onclick']
            fileType, fileID = getFileTypeAndID(fileInfo)
            fileName = title + '.' + fileType.lower()
            filePath = os.path.join(absolutePath, fileName)
            param = {"attachType": fileType, "id": fileID}
            # Skip files that were already downloaded
            if not os.path.exists(filePath):
                articleFile = requests.get(
                    "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do",
                    params=param)
                articleFile.raise_for_status()
                with open(filePath, "wb") as fhandle:
                    fhandle.write(articleFile.content)
        except Exception as e:
            errMsg.append("%s download failed: %s" % (title, e))

    if errMsg:
        writeLog(absolutePath, errMsg)


if __name__ == "__main__":
    downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_921.shtml")
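The onclick parsing is probably the most fragile part of the script. A slightly stricter pattern, sketched below, captures the quoted file type and the numeric ID separately and reports a mismatch instead of failing. The sample onclick string is an assumption made up for illustration (the real attribute on the site may be formatted differently), and parseOnclick is a hypothetical helper name.

```python
import re

# Assumed shape: showArticleFile(<anything>, '<type>', <digits>)
ONCLICK_PATTERN = re.compile(
    r"showArticleFile\([^,]*,\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]?(\d+)['\"]?\s*\)"
)


def parseOnclick(onclick):
    """Return (fileType, fileID), or None if the string has an unexpected shape."""
    m = ONCLICK_PATTERN.search(onclick)
    return m.groups() if m else None


sample = "showArticleFile(this, 'PDF', 1234)"  # hypothetical example value
print(parseOnclick(sample))
print(parseOnclick("somethingElse()"))
```

Returning None for an unexpected string lets the caller record the article in the log rather than hit an AttributeError on a failed match.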

 
