"Python" python urllib modules, urllib2 modules to download files in bulk

Last Update:2017-01-18 Source: Internet

Author: User


I needed to download several hundred PDF files from a web page, far too many to download by hand. Python has modules for exactly this kind of task, so I wrote a program to download the PDF files and, along the way, became familiar with Python's urllib and urllib2 modules.

1. Problem description

The PDF files of several hundred papers listed at http://www.cvpapers.com/cvpr2014.html need to be downloaded.

2. Solution

The automated download is implemented by combining Python's urllib module with the urllib2 module. The code is as follows:

test.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib   # provides urlretrieve
import urllib2  # provides Request and urlopen
import re       # regular expressions

def getpdffromnet(inputurl):
    req = urllib2.Request(inputurl)
    f = urllib2.urlopen(req)  # open the web page
    localdir = 'E:\\downloadPDF\\'  # local folder in which the PDF files are stored
    urllist = []  # list of the extracted PDF download URLs
    for eachline in f:  # iterate over every line of the page
        line = eachline.strip()  # strip leading and trailing whitespace
        if re.match('.*pdf.*', line):  # only lines containing "pdf" can hold a download address
            wordlist = line.split('"')  # split on double quotes so that the URL stands alone
            for word in wordlist:  # check every fragment
                if re.match(r'.*\.pdf$', word):  # only the URL ends with ".pdf"
                    urllist.append(word)  # store the extracted URL
    for everyurl in urllist:  # process the URL of each PDF
        worditems = everyurl.split('/')  # split on "/" to extract the PDF file name
        for item in worditems:
            if re.match(r'.*\.pdf$', item):  # this fragment is the PDF file name
                pdfname = item
                localpdf = localdir + pdfname  # join the local directory and the file name
                try:
                    urllib.urlretrieve(everyurl, localpdf)  # download and save under the file's own name
                except Exception:
                    continue  # skip files that fail to download

getpdffromnet('http://www.cvpapers.com/cvpr2014.html')
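The two extraction steps in the program (find the PDF links, then derive a local file name from each link) can also be sketched for Python 3, where urllib and urllib2 were merged into urllib.request and urllib.parse. This is only a minimal sketch under some assumptions: it matches double-quoted href attributes instead of splitting each line on quotes, it resolves relative links with urljoin (the original program assumes the links can be passed to urlretrieve as they are), and the helper names extract_pdf_urls and pdf_filename as well as the sample page fragment are hypothetical.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Raw string, so the backslash in "\." reaches the regex engine unchanged.
PDF_HREF = re.compile(r'href\s*=\s*"([^"]+\.pdf)"', re.IGNORECASE)

def extract_pdf_urls(html_text, base_url):
    """Return absolute URLs of all double-quoted href values ending in .pdf."""
    urls = []
    for href in PDF_HREF.findall(html_text):
        absolute = urljoin(base_url, href)  # resolve relative links against the page URL
        if absolute not in urls:  # drop duplicates, keep the original order
            urls.append(absolute)
    return urls

def pdf_filename(url):
    """Return the last path component of the URL, for example 'paper.pdf'."""
    return posixpath.basename(urlsplit(url).path)

# Hypothetical page fragment, used only to illustrate the two helpers.
sample_html = '''
<a href="papers/first.pdf">First paper</a>
<a href="http://example.org/files/second.pdf">Second paper</a>
<a href="index.html">Home</a>
'''
pdf_urls = extract_pdf_urls(sample_html, "http://example.com/conf/list.html")
names = [pdf_filename(u) for u in pdf_urls]
```

Taking the file name from the parsed URL path replaces the loop over the "/"-separated fragments, and it also ignores any query string that might follow the path.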

Notes:

(1) Lines 1, 6, 8, and 23 of the original listing each contain an extra "\" that serves as an escape character.

(2) The urlretrieve call in the program above uses up to three arguments. The first is the target URL. The second is the absolute path (including the file name) under which the file is saved. The function returns a tuple (filename, headers), where filename is the local path of the saved file, that is, the second argument. If only the URL is passed, the returned filename is the name of a generated temporary file, which can be cleaned up afterwards by calling urllib.urlcleanup(). The third argument is a callback function. It is called once when the connection to the server is established and once after each block of data has been transferred. The callback can have any name, but it must accept three arguments. It is usually defined as reporthook(block_read, block_size, total_size), where block_read is the number of blocks transferred so far, block_size is the size of one block in bytes, and total_size is the total size of the file in bytes (negative when the size is unknown). The reporthook function can therefore be used to display the download progress.

To display the download progress, pass the callback as the third argument, changing the urlretrieve call in the program above to:

    urllib.urlretrieve(everyurl, localpdf, reporthook)

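The code of the reporthook callback function is as follows:

def reporthook(block_read, block_size, total_size):
    if not block_read:
        print "Connection opened"
        return
    if total_size < 0:
        # unknown size
        print "read %d blocks (%d bytes)" % (block_read, block_read * block_size)
    else:
        amount_read = block_read * block_size
        # the listing is cut off after the line above; this print is an assumed completion
        print "read %d blocks, or %d/%d bytes" % (block_read, amount_read, total_size)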
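For readers on Python 3, the same progress callback might look like the sketch below. It assumes urllib.request.urlretrieve, which takes the same url, filename, and reporthook arguments. The helper names format_progress and download_with_progress are hypothetical, and building the message in a separate function is only a design choice that lets the formatting be checked without a network connection.

```python
import urllib.request

def format_progress(block_count, block_size, total_size):
    """Build the progress message for one reporthook call."""
    if block_count == 0:
        return "Connection opened"
    amount_read = block_count * block_size
    if total_size < 0:
        # The server did not report a size.
        return "read %d blocks (%d bytes)" % (block_count, amount_read)
    # The last block is usually partial, so cap the figure at the total size.
    amount_read = min(amount_read, total_size)
    percent = 100.0 * amount_read / total_size if total_size else 100.0
    return "read %d/%d bytes (%.1f%%)" % (amount_read, total_size, percent)

def reporthook(block_count, block_size, total_size):
    """Callback with the three arguments that urlretrieve passes in."""
    print(format_progress(block_count, block_size, total_size))

def download_with_progress(url, local_path):
    """Download url to local_path and print progress (not called in this sketch)."""
    filename, headers = urllib.request.urlretrieve(url, local_path, reporthook)
    return filename
```

Because the number of bytes read is computed as block count times block size, it can overshoot the real size on the final block, which is why the sketch caps it at total_size.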

In summary, this is a small program that scrapes data from a web page and downloads files. I hope it is helpful to those learning Python. Thank you!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
