Batch download data using Python


I previously wrote a Python script to download data from the National Bureau of Statistics website; see http://blog.csdn.net/liminlu0314/article/details/7300240.

This task is similar, but the target is data exported by the Australian Bureau of Statistics (http://www.abs.gov.au), all of it XLS tables. The left side of the page has a tree directory listing Australian administrative region names, each with a "Get Data" link next to it. The tree is loaded dynamically, however, so a client cannot fetch all of its content in one request. I have never done network programming, and I know little about HTML or Ajax. Stupid people have stupid methods: I simply expanded every node of the tree in the browser, saved the page source, and organized it into a TXT file. Each line looks like the following:

<span id="uidynatreeCb3trSLA1">New South Wales<a href="/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010" onclick="openWindow('/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010','1');"> - Get Data</a></span><br>

There are 1665 rows in this format. As you can see, the get-data link is /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010. The plan, then, is to extract the get-data link from each such string and use urllib to fetch the HTML content at that URL.
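The extraction step needs nothing beyond string or regex operations. A minimal sketch, assuming the line format shown above (`parse_line` and its return layout are illustrative names of my own, not from the original script):

```python
import re

# One saved line from the expanded tree, in the format shown above.
LINE = ('<span id="uidynatreeCb3trSLA1">New South Wales'
        '<a href="/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010"'
        " onclick=\"openWindow('/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010','1');\">"
        ' - Get Data</a></span><br>')

def parse_line(line):
    """Pull the region name and the get-data href out of one saved line."""
    # Region name sits between the closing '>' of the span tag and '<a href='.
    m = re.search(r'>([^<]+)<a href="([^"]+)"', line)
    if m is None:
        return None
    name, href = m.group(1), m.group(2)
    if 'lookup/' not in href or 'Main+' not in href:
        return None
    # The product code is the part of the href between 'lookup/' and 'Main+'.
    code = href.split('lookup/')[1].split('Main+')[0]
    return name, href, code
```

For the sample line this yields the name `New South Wales`, the lookup href, and the product code `1`, which is everything the download step needs.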

Next, search the fetched HTML for the XLS link and save the file locally with the urlretrieve function. The roughly 72-line script is as follows:

```python
# -*- coding: gb2312 -*-
# Questions: contact me at liminlu0314@gmail.com
import sys
import urllib
import re

# Parse the XLS download URL out of the fetched HTML
def parserXlsUrl(content):
    iPos = content.find('%2Exls&')
    if iPos == -1:
        return ''
    strTemp = content[iPos - 100 : iPos + 300]
    iPos = strTemp.find('<a href="')
    strTemp = strTemp[iPos + 9:]
    iPos = strTemp.find('&Latest">')
    strTemp = strTemp[:iPos + 7]
    # e.g. /AUSSTATS/free.nsf/log?openagent&Region%5F1%2Exls&1&2006%2D2010
    #      %20National%20Regional%20Profile&Region&0&2006%2D2010&04%2E11%2E2011&Latest
    return 'http://www.abs.gov.au' + strTemp

# Parse the region name and id out of one saved line, then download its XLS
def save2Xls(lines):
    iPos = lines.find('>')
    strTmp = lines[iPos + 1:]
    iPos = strTmp.find('<')
    strName = strTmp[:iPos]          # region name
    iPos = strTmp.find('"')
    strTmp = strTmp[iPos + 1:]
    iPos = strTmp.find('"')
    strTmp = strTmp[:iPos]
    # e.g. /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010
    iPos = strTmp.find('lookup/')
    strTmp = strTmp[iPos + 7:]
    iPos = strTmp.find('Main+')
    strIdCode = strTmp[:iPos]
    strStart = 'http://www.abs.gov.au/AUSSTATS/abs@nrp.nsf/DetailsPage/'
    strMid = '2006-2010?OpenDocument&tabname=Details&prodno='
    strEnd = '&issue=2006-2010&num=&view=&'
    strUrl = strStart + strIdCode + strMid + strIdCode + strEnd
    html = urllib.urlopen(strUrl)    # open the connection
    content = html.read()            # read the page content
    strUrlXls = parserXlsUrl(content)
    if strUrlXls == '':
        return 0
    strXlsDir = './xls/'
    strXls = strXlsDir + strName + '_' + strIdCode + '.xls'
    urllib.urlretrieve(strUrlXls, strXls)   # start the download
    return 1

if __name__ == "__main__":
    f = open('geturl.htm', 'r')
    allLines = f.readlines()
    f.close()
    index = 0
    for eachLine in allLines:
        save2Xls(eachLine)
        index = index + 1
        print "current is %d of 1665, percent %d%%" % (index, int(index / 1665.0 * 100))
```
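The script above is Python 2 (`urllib.urlopen`, `urllib.urlretrieve`, and the `print` statement). On Python 3 the same calls live under `urllib.request`; a minimal sketch of just the download step, with `download_xls` being an illustrative name of my own rather than part of the original script:

```python
import urllib.request

def download_xls(url, dest):
    """Fetch url and write the response body to dest.

    This is the Python 3 equivalent of the urllib.urlretrieve call
    in the original script."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(dest, 'wb') as f:
        f.write(data)
```

`urlretrieve` still exists in Python 3 as `urllib.request.urlretrieve`, but it is documented as a legacy interface, so reading and writing explicitly as above is the safer long-term choice.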

After writing the program, I found I had no idea how far the download had progressed, so I added progress output. I like Python; I can't imagine writing the same code in C++.
