[Python] Batch downloading files from a webpage with Python's urllib and urllib2 modules

Last Update:2016-12-02 Source: Internet

Author: User


I needed to download PDF files from a webpage, but there were several hundred of them, far too many to download by hand. Python has modules for this kind of task, so I wrote a program to download the PDF files. Along the way, this was a chance to get familiar with Python's urllib and urllib2 modules.

1. Problem Description

Several hundred documents need to be downloaded from http://www.cvpapers.com/cvpr2014.html. The webpage looks as follows:

[Screenshot of the webpage]

2. Problem Solving

The urllib and urllib2 modules of Python 2 can be used to download the files automatically. The code is as follows:

test.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib   # import the urllib module
import urllib2  # import the urllib2 module
import re       # import the regular expression module


def getPDFFromNet(inputURL):
    req = urllib2.Request(inputURL)
    f = urllib2.urlopen(req)                 # open the webpage
    localDir = 'e:\\downloadPDF\\'           # local folder where the downloaded PDF files are stored
    urlList = []                             # list for storing the extracted PDF URLs
    for eachLine in f:                       # traverse each line of the webpage
        line = eachLine.strip()              # strip leading and trailing whitespace (a habit)
        if re.match('.*PDF.*', line):        # only lines containing the string "PDF" hold a PDF link
            wordList = line.split('\"')      # split the line on double quotes so the URL stands alone
            for word in wordList:            # traverse each string
                if re.match('.*\.pdf$', word):   # only the URLs end in ".pdf"
                    urlList.append(word)         # save the extracted URL to the list
    for everyURL in urlList:                 # traverse the list, i.e. the URL of each PDF file
        wordItems = everyURL.split('/')      # split the URL on "/" to extract the PDF file name
        for item in wordItems:               # traverse each string
            if re.match('.*\.pdf$', item):   # find the PDF file name
                PDFName = item
        localPDF = localDir + PDFName        # join the local folder and the PDF file name
        try:
            # download the file from the URL and store it locally under its own file name
            urllib.urlretrieve(everyURL, localPDF)
        except Exception, e:
            continue                         # skip files that fail to download


getPDFFromNet('http://www.cvpapers.com/cvpr2014.html')
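The link-extraction and file-naming steps of the program above can also be sketched in Python 3, where urllib2 no longer exists and URL helpers live in urllib.parse. This is only a minimal sketch under stated assumptions: the function names extract_pdf_urls and pdf_file_name, the sample markup, and the example.com addresses are hypothetical and not taken from the real page, and the sketch assumes every PDF link appears as a double-quoted href attribute. Raw strings (r'...') are used so that the backslash in the pattern does not need to be doubled.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Matches the value of a double-quoted href that ends in ".pdf" (case-insensitive).
PDF_LINK = re.compile(r'href="([^"]+\.pdf)"', re.IGNORECASE)


def extract_pdf_urls(html, base_url):
    """Return the absolute URLs of all PDF links found in html, in page order."""
    urls = []
    for link in PDF_LINK.findall(html):
        absolute = urljoin(base_url, link)  # resolve relative links against the page URL
        if absolute not in urls:            # skip duplicates, keep the first occurrence
            urls.append(absolute)
    return urls


def pdf_file_name(url):
    """Return the last path component of url, for example 'paper.pdf'."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical sample markup, not taken from the real page.
SAMPLE_HTML = '''
<a href="http://example.com/papers/Smith_Paper_2014.pdf">PDF</a>
<a href="docs/notes.PDF">PDF</a>
<a href="http://example.com/index.html">home</a>
'''

pdf_urls = extract_pdf_urls(SAMPLE_HTML, 'http://example.com/cvpr2014.html')
for url in pdf_urls:
    print(pdf_file_name(url), '<-', url)
```

Matching the href attribute directly avoids the two-stage filter (lines containing "PDF", then strings ending in ".pdf"), and taking the last path component replaces the loop over the "/"-separated parts of the URL.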

Note:

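(1) The backslash "\" is used as an escape character in several places in the code: in the local directory path, in the double-quote separator passed to split, and before the dot in the ".pdf" patterns;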

(2) The urlretrieve function is used here with up to three parameters. The first is the target URL, and the second is the absolute path (including the file name) under which the file is saved. The function returns a tuple (filename, headers), where filename is the path given as the second parameter. If only the URL is passed, the returned filename is the name of a generated temporary file, which is cleaned up by a later call to urllib.urlcleanup().
The third parameter is a callback function. It is called once when the connection to the server is established and once after each data block is transferred. The callback can have any name, but it must take three parameters. It is usually defined as reporthook(block_read, block_size, total_size), where block_read is the number of data blocks transferred so far, block_size is the size of each block in bytes, and total_size is the total size of the file in bytes (a negative value means the size is unknown). The reporthook function can therefore be used to display download progress.
To display the progress, add the third parameter by changing the urlretrieve call in the program above to the following:

urllib.urlretrieve(everyURL, localPDF, reporthook=reporthook)

The code of the reporthook callback function is as follows:

def reporthook(block_read, block_size, total_size):
    if not block_read:
        print "connection opened"
        return
    if total_size < 0:
        # unknown size
        print "read %d blocks (%d bytes)" % (block_read, block_read * block_size)
    else:
        amount_read = block_read * block_size
        print 'Read %d blocks, or %d/%d' % (block_read, amount_read, total_size)
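For readers on Python 3, the same callback mechanism can be tried without a network connection. The sketch below is an assumption-laden illustration, not the article's program: it uses urllib.request.urlretrieve (the Python 3 home of urlretrieve), the helper name format_progress is hypothetical, and a temporary local file fetched through a file:// URL stands in for a remote PDF. It also caps the byte count at total_size, since the last block is usually only partly filled.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def format_progress(block_read, block_size, total_size):
    """Build a one-line progress message from the three reporthook arguments."""
    if block_read == 0:
        return 'connection opened'
    amount_read = block_read * block_size
    if total_size < 0:  # size unknown
        return 'read %d blocks (%d bytes)' % (block_read, amount_read)
    # The last block is usually partial, so cap the count at the total size.
    amount_read = min(amount_read, total_size)
    percent = 100.0 * amount_read / total_size if total_size else 100.0
    return 'read %d/%d bytes (%.0f%%)' % (amount_read, total_size, percent)


def reporthook(block_read, block_size, total_size):
    """Callback with the three-parameter shape that urlretrieve expects."""
    print(format_progress(block_read, block_size, total_size))


# Offline demonstration: "download" a local file through a file:// URL.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / 'source.pdf'
source.write_bytes(b'x' * 20000)  # placeholder content, not a real PDF
target = os.path.join(work_dir, 'copy.pdf')

filename, headers = urlretrieve(source.as_uri(), target, reporthook=reporthook)
print(filename)  # the path passed as the second parameter
```

Separating the message formatting from the printing keeps the callback itself trivial and makes the progress arithmetic easy to check on its own.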

To sum up, this is a small program that extracts links from a webpage and downloads the files they point to. I hope it helps those who are learning Python. Thank you!

This article is an English version of an article originally published in Chinese on aliyun.com.
