I needed to download several hundred PDF files from a web page, far too many to save by hand. Python has the relevant modules, so I wrote a program to download the PDF files, and along the way got familiar with Python's urllib and urllib2 modules.
1, Problem description
The PDF files of hundreds of papers need to be downloaded from http://www.cvpapers.com/cvpr2014.html, as shown in the following illustration:
2, Problem solving
Automated downloading is implemented by combining Python's urllib module with the urllib2 module. The code is as follows:
test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib    # import the urllib module
import urllib2   # import the urllib2 module
import re        # import the regular-expression module: re

def getPDFFromNet(inputURL):
    req = urllib2.Request(inputURL)
    f = urllib2.urlopen(req)                # open the web page
    localDir = 'E:\downloadPDF\\'           # local folder where the downloaded PDF files are stored
    urlList = []                            # list to hold the extracted PDF download URLs
    for eachLine in f:                      # iterate over each line of the page
        line = eachLine.strip()             # strip leading/trailing whitespace (a habit)
        if re.match('.*pdf.*', line):       # only lines containing "pdf" carry a PDF download address
            wordList = line.split('\"')     # split on double quotes so the URL stands on its own
            for word in wordList:           # iterate over each token
                if re.match('.*\.pdf$', word):  # only tokens ending in ".pdf" are download URLs
                    urlList.append(word)        # store the extracted URL in the list
    for everyURL in urlList:                # iterate over the list, i.e. the URL of each PDF
        wordItems = everyURL.split('/')     # split the URL on '/' to pick out the PDF file name
        for item in wordItems:              # iterate over each token
            if re.match('.*\.pdf$', item):  # find the PDF file name
                PDFName = item              # the PDF file name
        localPDF = localDir + PDFName       # join the local directory and the PDF file name
        try:
            urllib.urlretrieve(everyURL, localPDF)  # download from the URL and save it under that file name in the local directory
        except Exception, e:                # skip any URL that fails to download
            continue

getPDFFromNet('http://www.cvpapers.com/cvpr2014.html')
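To see what the extraction step is doing, here is a minimal sketch of the same split-and-match logic applied to a single made-up HTML line (the href value is hypothetical, for illustration only):

import re

line = '<a href="http://www.example.com/papers/demo.pdf">paper</a>'  # hypothetical input line
for word in line.split('\"'):          # split on double quotes, as in the program above
    if re.match('.*\.pdf$', word):     # keep only tokens that end in ".pdf"
        print word                     # prints: http://www.example.com/papers/demo.pdf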
Notes:
(1) The backslashes in the program have to be escaped: the local directory is written 'E:\downloadPDF\\' (the trailing backslash doubled), and in the regular expressions the literal dot is written '\.';
(2) The urlretrieve function takes up to 3 parameters. The first is the target URL; the second is the absolute local path (including the file name) to save to. The function returns a tuple (filename, headers), where filename is the same as the second parameter. If urlretrieve is given only 1 parameter, the filename in the return value is a generated temporary file name, and the temporary file is removed when urllib.urlcleanup() is called. The 3rd parameter is a callback function, triggered once when the connection to the server is established and again after each block of data is transferred. The callback's name can be anything, but it must take exactly three arguments; it is usually defined as reporthook(block_read, block_size, total_size), where block_read is the number of blocks read so far, block_size is the size of each block in bytes, and total_size is the total size of the file in bytes. The reporthook function can be used to display the read progress.
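As a minimal sketch of the first two points (the target URL is the page from this article; the local file name 'cvpr2014.html' is just an example):

import urllib

# two arguments: saved under the given name; a (filename, headers) tuple is returned
filename, headers = urllib.urlretrieve('http://www.cvpapers.com/cvpr2014.html', 'cvpr2014.html')
print filename       # prints: cvpr2014.html
print headers        # the HTTP response headers

# one argument: saved under a generated temporary file name
tmpName, headers = urllib.urlretrieve('http://www.cvpapers.com/cvpr2014.html')
print tmpName        # a path to a temporary file
urllib.urlcleanup()  # removes the temporary file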
If you want to display the read progress, you can add the third parameter by changing the urlretrieve call in the program above to:

urllib.urlretrieve(everyURL, localPDF, reporthook)
The code for the reporthook callback function is as follows:
def reporthook(block_read, block_size, total_size):
    if not block_read:
        print "Connection opened"
        return
    if total_size < 0:
        # unknown size: report only what has been read so far
        print "read %d blocks (%d bytes)" % (block_read, block_read * block_size)
    else:
        amount_read = block_read * block_size
        print "read %d blocks, %d/%d bytes" % (block_read, amount_read, total_size)
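A quick standalone test of the callback might look like this (using this article's page as the target; the local file name is arbitrary):

import urllib

urllib.urlretrieve('http://www.cvpapers.com/cvpr2014.html', 'cvpr2014.html', reporthook)

Each transferred block prints one more progress line, so for large PDF files you can watch the download advance.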
To sum up, this is a simple little program that crawls data from the Web and downloads files. I hope it is helpful to students learning Python. Thank you!