I previously wrote a Python script that downloads data from the National Bureau of Statistics website; see http://blog.csdn.net/liminlu0314/article/details/7300240.
This script is similar, except that it downloads data from the Australian Bureau of Statistics (http://www.abs.gov.au), which is published as XLS tables. The left side of the page shows a tree directory listing the names of the Australian administrative regions, each with a "Get Data" link next to it. The tree is loaded dynamically, however, so a client cannot fetch all of its content in one request. I had never done network programming and knew nothing about HTML or Ajax, so I took a brute-force approach: expand every node of the tree in the browser, save the page source, and clean it up into a TXT file. Each line looks like the following:
<span id="uidynatreeCb3trSLA1">New South Wales<a href="/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010" onclick="openWindow('/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010','1');"> - Get Data</a></span><br>
There are 1665 rows in this format in total. As you can see, the "Get Data" link is /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010. The first step is therefore to extract this link from each line, and then use urllib to fetch the HTML content at that URL.
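The href can be pulled out of one such line with plain string searches, no HTML parser needed. A minimal Python 3 sketch (the original script is Python 2; the sample line below is the one shown above, with the onclick attribute trimmed for brevity):

```python
# One line from the saved TXT file (onclick attribute omitted for brevity).
line = ('<span id="uidynatreeCb3trSLA1">New South Wales'
        '<a href="/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010">'
        ' - Get Data</a></span><br>')

# Locate the href attribute and slice out the value between its quotes.
start = line.find('href="') + len('href="')
end = line.find('"', start)
href = line[start:end]
print(href)  # /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010
```

The same slicing works for every one of the 1665 lines because the saved source uses the identical attribute layout on each row.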
Then search the fetched HTML for the XLS link and save the file locally with the urlretrieve function. The roughly 72 lines of code are as follows:
# -*- coding: gb2312 -*-
# Questions? Contact me at liminlu0314@gmail.com
import sys
import urllib
import re

# Parse the XLS download URL out of the fetched HTML.
def parserXlsUrl(content):
    iPos = content.find('%2EXls&')
    if iPos == -1:
        return ''
    strTemp = content[iPos - 100: iPos + 300]
    iPos = strTemp.find('<a href="')
    strTemp = strTemp[iPos + 9:]
    iPos = strTemp.find('&Latest"><')
    strTemp = strTemp[:iPos + 7]
    # e.g. /ausstats/freenrp.nsf/log?openagent&region%5F1%2EXls&1&2006%2d2010%20National%20Regional%20Profile&Region&0&2006%2d2010&04%2e11%2e2011&Latest
    return 'http://www.abs.gov.au' + strTemp

# Parse one line of the saved page source and download its XLS file.
def save2xls(lines):
    iPos = lines.find('>')
    strTmp = lines[iPos + 1:]
    iPos = strTmp.find('<')
    strName = strTmp[:iPos]            # region name
    iPos = strTmp.find('"')
    strTmp = strTmp[iPos + 1:]
    iPos = strTmp.find('"')
    strTmp = strTmp[:iPos]             # e.g. /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010
    iPos = strTmp.find('lookup/')
    strTmp = strTmp[iPos + 7:]
    iPos = strTmp.find('Main+')
    strIdCode = strTmp[:iPos]          # region product code
    strStart = 'http://www.abs.gov.au/AUSSTATS/abs@nrp.nsf/DetailsPage/'
    strMid = '2006-2010?OpenDocument&tabname=Details&prodno='
    strEnd = '&issue=2006-2010&num=&view=&'
    strUrl = strStart + strIdCode + strMid + strIdCode + strEnd
    html = urllib.urlopen(strUrl)      # open the connection
    content = html.read()              # read the page content
    strUrlXls = parserXlsUrl(content)
    if strUrlXls == '':
        return 0
    strXlsDir = './xls/'
    strXls = strXlsDir + strName + '_' + strIdCode + '.xls'
    urllib.urlretrieve(strUrlXls, strXls)  # download the file
    return 1

if __name__ == "__main__":
    f = open('geturl.htm', 'r')
    allLines = f.readlines()
    f.close()
    index = 0
    for eachLine in allLines:
        save2xls(eachLine)
        index = index + 1
        print "current is %d of 1665, percent %d%%" % (index, int(index / 1665.0 * 100))
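The trickiest part of save2xls is slicing the region name and the product code out of each line. That find-and-slice logic can be isolated and exercised on the sample line from earlier (a Python 3 sketch; parse_line is a hypothetical helper name, not part of the original script):

```python
def parse_line(line):
    """Extract the region name and product code from one saved line.

    Uses the same find-and-slice approach as the original save2xls.
    """
    i = line.find('>')                    # end of the opening <span> tag
    rest = line[i + 1:]
    name = rest[:rest.find('<')]          # text between </span...> and <a>
    i = rest.find('"')                    # opening quote of the href value
    href = rest[i + 1:]
    href = href[:href.find('"')]          # e.g. /ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010
    i = href.find('lookup/')
    code = href[i + len('lookup/'):]
    code = code[:code.find('Main+')]      # the part before "Main+Features..."
    return name, code

line = ('<span id="uidynatreeCb3trSLA1">New South Wales'
        '<a href="/ausstats/abs@nrp.nsf/lookup/1Main+Features12006-2010">'
        ' - Get Data</a></span><br>')
print(parse_line(line))  # ('New South Wales', '1')
```

For New South Wales the product code is just "1"; the name and code together give both the local file name and the two places the code is substituted into the DetailsPage URL.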
After running the program for a while, I found I couldn't tell how far the download had got, so I added a progress message. I really like Python for this kind of job; I can't imagine how much code the same task would take in C++.