A quick try at writing a source-download crawler in Python

Source: Internet
Author: User

I have recently been studying the JDK source code. I searched the Internet for a copy, but everything I found was incomplete.

Later I found a site that hosts the complete sources, mainly the Java, C, and C++ code, which was exactly what I needed (mostly just to poke around). However, none of the download options I tried worked, so I wrote a Python crawler to take care of it.

1. Analysis of the approach

1.1. Target address: http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/dddb1b026323/. Open it first and check for yourself whether it meets your needs;

1.2. The pages come in two main forms: directory listings and final source files. Their characteristics are obvious, so the two can be told apart reliably (see the sketch after this list);

1.3. The directory depth is not known in advance, so recursion is the natural choice;

1.4. To pick out the valid directory links, regular expressions are the natural tool (also shown in the sketch below);

1.5. The download may be interrupted and need to resume from where it stopped, so a simple skip function is added;

1.6. A file or directory might otherwise be downloaded more than once, wasting resources, so a global set of visited paths is kept for deduplication;

1.7. Since the repository layout is very regular, the local copy simply mirrors its directory structure;

1.8. The local environment may be unstable, so the crawler runs on the company's test server instead;

1.9. Start!
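
Before the full script, here is a minimal sketch of how the two page forms from 1.2 are told apart and how the regular expressions from 1.4 pull out the valid links. The class names and the link pattern are the same ones used in the full script below; the classify_page helper is only an illustration, not part of the final crawler.

# -*- coding: utf-8 -*-
# minimal sketch: distinguish a source-file page from a directory listing
import re
import urllib2

URLBASE = 'http://hg.openjdk.java.net'

def classify_page(path):
    html = urllib2.urlopen(urllib2.Request(URLBASE + path)).read()
    # a source-file page marks its first code line with class="sourcefirst"
    if html.find('class="sourcefirst"') > 0:
        return 'file', []
    # a directory listing keeps its entries inside <tbody class="stripes2">;
    # valid child links start with /jdk8u, and the "[up]" parent link is excluded
    body = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', html, re.S)
    links = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', body[0]) if body else []
    return 'dir', links

kind, links = classify_page('/jdk8u/jdk8u/jdk/file/dddb1b026323/src')
print(kind, len(links))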

2. Now for the code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2
import re
import os
import HTMLParser

dirbase = '/tmp'
urlbase = 'http://hg.openjdk.java.net'
url = urlbase + '/jdk8u/jdk8u/jdk/file/dddb1b026323/src'

# path to skip ahead to, so an interrupted download can resume on demand
skip_to_p = ''
# secondary flag: set to True once the skip target has been found
skip_find = False

textmod = {'user': 'Amei'}
textmod = urllib.urlencode(textmod)
print(url)
req = urllib2.Request(url='%s%s%s' % (url, '?', textmod))
res = urllib2.urlopen(req)
res = res.read()

# global collection of paths already visited, to prevent re-entry
allflist = []

# 1. find the content table, 2. find the valid address links
table = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', res, re.S)
harr = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', table[0])


def down_src_recursion(harr):
    global allflist, skip_find
    if not harr:
        return False
    i = 0
    arrlen = len(harr)
    print("in new dir cur ...")
    # keep the visited list from growing without bound
    if len(allflist) > 1500:
        print('over-exists, cut to ...')
        allflist = allflist[-800:]
    for alink in harr:
        i += 1
        # strip the trailing directory separator to prevent unwanted scanning
        alink = alink.rstrip('/')
        if skip_to_p and not skip_find:
            if alink != skip_to_p:
                print('skip file, cause not found ..., skip=%s, now=%s' % (skip_to_p, alink))
                continue
            else:
                skip_find = True
        if alink in allflist:
            print('directory has already been searched: ' + alink)
            continue
        pa = dirbase + alink
        if os.path.isfile(pa):
            print('file already exists, no download required: ' + pa)
            continue
        reqt = urllib2.Request(urlbase + alink)
        rest = urllib2.urlopen(reqt)
        rest = rest.read()
        allflist.append(alink)
        if rest.find('class="sourcefirst"') > 0:
            print('this is a resource file: %s %d/%d' % (alink, i, arrlen))
            filename = alink.split('/')[-1]
            linearr = re.findall(r'<span id=".+">(.+)</span>', rest)
            fileobject = open(dirbase + alink, 'w')
            for line in linearr:
                try:
                    # the lines contain HTML entities that need to be unescaped; non-standard
                    # characters may throw an exception, which has to be caught
                    line = HTMLParser.HTMLParser().unescape(line)
                except UnicodeDecodeError as e:
                    print('oops, ascii convert error occurred:', e)
                fileobject.write(line + '\r\n')
            fileobject.close()
        else:
            print('this is a directory: %s %d/%d' % (alink, i, arrlen))
            if not os.path.exists(pa):
                print('create directory: %s' % alink)
                os.makedirs('/tmp' + alink, mode=0777)
            ta = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', rest, re.S)
            ha = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', ta[0])
            down_src_recursion(ha)

# go ...
down_src_recursion(harr)
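
A quick note on the skip feature from 1.5: if a run is interrupted, the file-exists check already avoids re-downloading finished files, and skip_to_p can additionally be set to the last top-level entry that was being processed (as printed on the console) so that earlier entries are not re-scanned at all. The path below is only a hypothetical example.

# hypothetical example: resume by skipping ahead to a top-level entry under /src
skip_to_p = '/jdk8u/jdk8u/jdk/file/dddb1b026323/src/share'
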
3. Let the code run
python jdk-crawler.py

4. How's the download going?
du -sh /tmp/jdk8u/

OK, that's it. Once the download on the test server finishes, the files can be moved to your own machine via FTP.
