This article describes how to use Python's urllib2 module to download HTML page resources. The page addresses to be crawled are kept in a separate list file, which makes them easy to organize and reuse. The sample contents of that list are:
http://www.jb51.net/article/83440.html
http://www.jb51.net/article/83437.html
http://www.jb51.net/article/83430.html
http://www.jb51.net/article/83449.html
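If you prefer to build the list file from code rather than by hand, a minimal sketch might look like the following. It assumes the file name u1.list (the same name the downloader script below reads) and simply writes the sample addresses above, one per line:

# Write the sample addresses into u1.list, one URL per line,
# so the downloader script can read them back later.
urls = [
    'http://www.jb51.net/article/83440.html',
    'http://www.jb51.net/article/83437.html',
    'http://www.jb51.net/article/83430.html',
    'http://www.jb51.net/article/83449.html',
]
with open('u1.list', 'w') as f:
    f.write('\n'.join(urls) + '\n')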
Now let's look at how the program works. The code is as follows:
#!/usr/bin/python
import os
import sys
import urllib2
import re

def Cdown_data(fileurl, fpath, dpath):
    # Create the target directory if it does not already exist
    if not os.path.exists(dpath):
        os.makedirs(dpath)
    try:
        # Fetch the resource and save it to the local file path
        getfile = urllib2.urlopen(fileurl)
        data = getfile.read()
        f = open(fpath, 'w')
        f.write(data)
        f.close()
    except:
        print fileurl, 'download failed'

with open('u1.list') as lines:
    for line in lines:
        URI = line.strip()
        # Skip addresses containing query strings or percent-encoded characters
        if '?' in URI or '%' in URI:
            continue
        # Skip bare domains such as http://www.example.com (only two slashes)
        elif URI.count('/') == 2:
            continue
        elif URI.count('/') > 2:
            try:
                # Directory part of the path, with the scheme stripped off
                dirpath = URI.rpartition('/')[0].split('//')[1]
                # Local file path, also with the scheme stripped off
                filepath = URI.split('//')[1]
                if filepath:
                    print URI, filepath, dirpath
                    Cdown_data(URI, filepath, dirpath)
            except:
                print URI, 'error'
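The script above targets Python 2, where urllib2 is available. On Python 3 that module was merged into urllib.request, so a rough equivalent of the download helper, kept as a sketch under that assumption, could look like this:

# Minimal Python 3 sketch of the download helper: urllib.request replaces urllib2,
# and the file is written in binary mode so non-text resources are not corrupted.
import os
import urllib.request

def cdown_data(fileurl, fpath, dpath):
    # Create the target directory if it does not exist yet
    if not os.path.exists(dpath):
        os.makedirs(dpath)
    try:
        with urllib.request.urlopen(fileurl) as resp:
            data = resp.read()
        with open(fpath, 'wb') as f:
            f.write(data)
    except Exception as exc:
        print(fileurl, 'download error:', exc)

The rest of the script (reading u1.list, filtering the addresses, and deriving the local paths) carries over unchanged apart from using print() as a function.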
Original Site: http://www.diyoms.com/python/1806.html