[Python] Batch downloading files from a webpage with Python's urllib and urllib2 modules

Source: Internet
Author: User

I needed to download PDF files from a webpage, but there were several hundred of them, far too many to download by hand. Python has modules for this kind of task, so I wrote a program to download the PDFs. Along the way I became more familiar with Python's urllib and urllib2 modules.

1. Problem Description

Several hundred documents need to be downloaded from http://www.cvpapers.com/cvpr2014.html. The page lists the papers, each with a link to its PDF file.

2. Problem Solving

The download can be automated with Python's urllib and urllib2 modules. The code below is Python 2 (urllib2 does not exist in Python 3):

test.py

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import urllib   # import the urllib module
    import urllib2  # import the urllib2 module
    import re       # import the regular expression module

    def getPDFFromNet(inputURL):
        req = urllib2.Request(inputURL)
        f = urllib2.urlopen(req)        # open the webpage
        localDir = 'e:\\downloadPDF\\'  # local folder where the downloaded PDF files are stored
        urlList = []                    # list of the extracted PDF URLs
        for eachLine in f:              # traverse each line of the webpage
            line = eachLine.strip()     # strip leading and trailing whitespace (a habit)
            if re.match('.*PDF.*', line):    # only lines containing "PDF" hold PDF links
                wordList = line.split('\"')  # split the line on " so that the URL stands alone
                for word in wordList:        # traverse each piece
                    if re.match('.*\.pdf$', word):  # only the URLs end in ".pdf"
                        urlList.append(word)        # save the extracted URL to the list
        for everyURL in urlList:             # traverse the URL of each PDF
            wordItems = everyURL.split('/')  # split the URL on / to extract the PDF file name
            for item in wordItems:           # traverse each piece
                if re.match('.*\.pdf$', item):  # find the PDF file name
                    PDFName = item
            localPDF = localDir + PDFName    # join the local directory and the PDF file name
            try:
                # download the file from the URL and store it locally under its own name
                urllib.urlretrieve(everyURL, localPDF)
            except Exception, e:
                continue

    getPDFFromNet('http://www.cvpapers.com/cvpr2014.html')
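The link-extraction and file-naming steps can also be tried without touching the network. The following is only a minimal Python 3 sketch, not the program above: it uses a single regular expression over `href="..."` attributes instead of splitting each line on quotes, and the names `extract_pdf_urls`, `pdf_filename`, and `sample_html`, as well as the example.com URLs, are made up for illustration.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Assumption: PDF links appear as href="....pdf" with double quotes.
PDF_HREF = re.compile(r'href\s*=\s*"([^"]+\.pdf)"', re.IGNORECASE)


def extract_pdf_urls(html, base_url):
    """Return absolute URLs for every href ending in .pdf found in html."""
    # urljoin leaves absolute links alone and resolves relative ones
    # against the page URL.
    return [urljoin(base_url, href) for href in PDF_HREF.findall(html)]


def pdf_filename(url):
    """Return the last path component of the URL, i.e. the PDF file name."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical page content, used only to exercise the two helpers.
sample_html = '''
<a href="http://example.com/papers/Smith_Paper_2014.pdf">PDF</a>
<a href="docs/notes.pdf">PDF</a>
<a href="index.html">home</a>
'''

urls = extract_pdf_urls(sample_html, "http://example.com/cvpr2014.html")
names = [pdf_filename(u) for u in urls]
print(urls)
print(names)
```

Resolving links with `urljoin` is a design choice of this sketch: it lets relative links such as `docs/notes.pdf` be downloaded too, which a plain string split does not handle.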

Note:

(1) In test.py, four lines use a "\" as an escape character: the localDir path, the split on the double quote, and the two regular expressions that match ".pdf".

(2) The urlretrieve function called in test.py takes up to three parameters. The first parameter is the target URL. The second parameter is the absolute path (including the file name) under which the file is saved. The return value is a tuple (filename, headers), where filename is the second parameter. If urlretrieve is given only the URL, the returned filename is the name of a generated temporary file; such temporary files can be cleaned up afterwards with urllib.urlcleanup(). The third parameter is a callback function. It is called once when the connection to the server is established and once after each data block is transferred. The callback can have any name, but it must accept three parameters. It is usually defined as reporthook(block_read, block_size, total_size), where block_read is the number of blocks transferred so far, block_size is the size of each block in bytes, and total_size is the total size of the file in bytes (it can be negative when the server does not report a size). The reporthook function can therefore be used to display the download progress.
To display the progress, pass the callback as the third parameter by changing the urlretrieve call in test.py to the following:

urllib.urlretrieve(everyURL, localPDF, reporthook=reporthook) 

The code of the reporthook callback function is as follows:

    def reporthook(block_read, block_size, total_size):
        if not block_read:
            print "connection opened"
            return
        if total_size < 0:
            # unknown size
            print "read %d blocks (%d bytes)" % (block_read, block_read * block_size)
        else:
            amount_read = block_read * block_size
            print 'Read %d blocks, or %d/%d' % (block_read, amount_read, total_size)
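As a variation, the same callback can report a percentage. This is a hedged Python 3 sketch rather than the code above: the helper name `format_progress` is invented here, and the message building is separated from printing so it can be checked without a download.

```python
def format_progress(block_read, block_size, total_size):
    """Build a progress message from the three reporthook arguments."""
    if block_read == 0:
        # First call: the connection is open, nothing has been read yet.
        return "connection opened"
    amount_read = block_read * block_size
    if total_size < 0:
        # The server did not report a size, so no percentage is possible.
        return "read %d blocks (%d bytes)" % (block_read, amount_read)
    # The last block is usually shorter than block_size, so clamp the count.
    amount_read = min(amount_read, total_size)
    percent = 100.0 * amount_read / total_size if total_size else 100.0
    return "read %d/%d bytes (%.1f%%)" % (amount_read, total_size, percent)


def reporthook(block_read, block_size, total_size):
    """Callback with the three-argument signature urlretrieve expects."""
    print(format_progress(block_read, block_size, total_size))


# Simulate the calls made while downloading a 20000-byte file in 8192-byte blocks.
for block in range(4):
    reporthook(block, 8192, 20000)
```

In Python 3 the function lives in urllib.request, so the hook would be passed as `urllib.request.urlretrieve(url, path, reporthook=reporthook)`.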

To sum up, this is a simple program that extracts links from a webpage and downloads the linked files. I hope it helps those who are learning Python. Thank you!
