I've been reading the JDK source code recently and went looking for it online, but everything I found was incomplete. Later I discovered a place that hosts the complete source, covering the Java, C, and C++ parts, which was exactly what I wanted to dig into. However, every way I tried to download it turned out to be wrong, so I wrote a Python crawler and took care of it myself.
1. Analysis of the approach
1.1. Target address: http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/dddb1b026323/. Open it in the browser first and check that it really contains what you need.
1.2. Analyze the page structure. There are essentially two kinds of pages: directory listings and final source files. The distinguishing features are obvious, so the two cases can be told apart reliably (see the probe sketch right after this list).
1.3. The depth of the directory tree is not known in advance, which naturally suggests recursion.
1.4. Picking the valid directory links out of the pages naturally calls for regular expressions.
1.5. The download may get interrupted, so it should be resumable; add a simple skip-to mechanism.
1.6. A file or directory might be downloaded more than once, which wastes resources, so keep a global set of visited entries and deduplicate against it.
1.7. The documents' paths are very regular, so simply mirror the remote directory structure locally.
1.8. The local network environment may be unstable, so run the crawler on the company test server instead.
1.9. Start!
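To sanity-check point 1.2 before writing the full crawler, a minimal probe can fetch one page and classify it. This is only a sketch: it assumes the hgweb markup actually served at the target address, namely the class="sourcefirst" marker on file pages and the /jdk8u...-style entry links on listing pages (the same features the crawler below relies on):

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import urllib2

# fetch one page and decide whether it is a source-file view or a directory listing
html = urllib2.urlopen('http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/dddb1b026323/src').read()
if html.find('class="sourcefirst"') > 0:
    print('source file page: code lines sit in <span id="..."> elements')
else:
    # a directory listing: count the entry links, ignoring the [up] parent link
    links = re.findall(r'href="(/jdk8u[\w/._]+)">(?!\[up\])', html)
    print('directory page with %d entries' % len(links))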
2. Now, the code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import re
import os
import HTMLParser

dirbase = '/tmp'
urlbase = 'http://hg.openjdk.java.net'
url = urlbase + '/jdk8u/jdk8u/jdk/file/dddb1b026323/src'
# entry to skip to, used to resume an interrupted download on demand
skip_to_p = ''
# flag set once the skip target has been found
skip_find = False

textmod = {'user': 'Amei'}
textmod = urllib.urlencode(textmod)
print(url)
req = urllib2.Request(url='%s%s%s' % (url, '?', textmod))
res = urllib2.urlopen(req).read()

# global list of entries already searched, to avoid re-entry
allflist = []

# 1. grab the listing table body, 2. extract the valid entry links from it
table = re.findall(r'<tbody class="stripes2">(.+)</tbody>', res, re.S)
harr = re.findall(r'href="(/jdk8u[\w/._]+)">(?!\[up\])', table[0])


def down_src_recursion(harr):
    global allflist, skip_find
    if not harr:
        return False
    i = 0
    arrlen = len(harr)
    print('in new dir...')
    # keep the dedup list from growing without bound
    if len(allflist) > 1500:
        print('list too long, truncating...')
        allflist = allflist[-800:]
    for alink in harr:
        i += 1
        # strip the trailing slash to avoid scanning the same directory twice
        alink = alink.rstrip('/')
        if skip_to_p and not skip_find:
            if alink != skip_to_p:
                print('skip entry, target not reached yet: skip=%s, now=%s' % (skip_to_p, alink))
                continue
            else:
                skip_find = True
        if alink in allflist:
            print('directory has already been searched: ' + alink)
            continue
        pa = dirbase + alink
        if os.path.isfile(pa):
            print('file already exists, no download required: ' + pa)
            continue
        reqt = urllib2.Request(urlbase + alink)
        rest = urllib2.urlopen(reqt).read()
        allflist.append(alink)
        if rest.find('class="sourcefirst"') > 0:
            # a source file page: extract the code lines and write them out
            print('this is a source file: %s %d/%d' % (alink, i, arrlen))
            linearr = re.findall(r'<span id=".+">(.+)</span>', rest)
            fileobject = open(pa, 'w')
            for line in linearr:
                try:
                    # the page escapes special characters as HTML entities; nonstandard
                    # characters may throw an exception, which needs to be caught
                    line = HTMLParser.HTMLParser().unescape(line)
                except UnicodeDecodeError as e:
                    print('oops, ascii convert error occurred: %s' % e)
                fileobject.write(line + '\r\n')
            fileobject.close()
        else:
            # a directory page: create the local directory and recurse into it
            print('this is a directory: %s %d/%d' % (alink, i, arrlen))
            if not os.path.exists(pa):
                print('create directory: %s' % alink)
                os.makedirs(pa, mode=0777)
            ta = re.findall(r'<tbody class="stripes2">(.+)</tbody>', rest, re.S)
            ha = re.findall(r'href="(/jdk8u[\w/._]+)">(?!\[up\])', ta[0])
            down_src_recursion(ha)


# go...
down_src_recursion(harr)
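A note on resuming (points 1.5 and 1.6 above): the os.path.isfile check already skips files that were fully written in a previous run. On top of that, skip_to_p can be set to the entry at which the last run stopped, as printed in the log, for example skip_to_p = '/jdk8u/jdk8u/jdk/file/dddb1b026323/src/share' (an illustrative value); entries before it at that directory level are then skipped without issuing any HTTP requests.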
3. Run the code
python jdk-crawler.py
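The full tree takes a while to fetch, so on the test server it is more comfortable to run the script in the background and keep a log (a standard nohup invocation; the log file name is arbitrary):

nohup python jdk-crawler.py > jdk-crawler.log 2>&1 &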
4. How's the download going?
du -sh /tmp/jdk8u/
OK, that's it. Once the download on the test server finishes, pull the tree over to your own machine via FTP.