A quick try at writing a source-download crawler in Python

Source: Internet
Author: User

I have recently been studying the JDK source code. I searched the Internet for a copy, but everything I found was incomplete.

Later I found a site that hosts the complete sources, mainly the Java, C, and C++ code, which was exactly what I needed (mostly just to poke around). However, none of the download options I tried worked, so I wrote a Python crawler to take care of it.

1. Analysis of the approach

1.1. Target address: http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/dddb1b026323/. Open it first and check for yourself whether it meets your needs;

1.2. The pages come in two main forms: directory listings and final source files. Their characteristics are obvious, so the two can be told apart reliably (see the sketch after this list);

1.3. The directory depth is not known in advance, so recursion is the natural choice;

1.4. To pick out the valid directory links, regular expressions are the natural tool (also shown in the sketch below);

1.5. The download may be interrupted and need to resume from where it stopped, so a simple skip function is added;

1.6. A file or directory might otherwise be downloaded more than once, wasting resources, so a global set of visited paths is kept for deduplication;

1.7. Since the repository layout is very regular, the local copy simply mirrors its directory structure;

1.8. The local environment may be unstable, so the crawler runs on the company's test server instead;

1.9. Start!
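
Before the full script, here is a minimal sketch of how the two page forms from 1.2 are told apart and how the regular expressions from 1.4 pull out the valid links. The class names and the link pattern are the same ones used in the full script below; the classify_page helper is only an illustration, not part of the final crawler.

# -*- coding: utf-8 -*-
# minimal sketch: distinguish a source-file page from a directory listing
import re
import urllib2

URLBASE = 'http://hg.openjdk.java.net'

def classify_page(path):
    html = urllib2.urlopen(urllib2.Request(URLBASE + path)).read()
    # a source-file page marks its first code line with class="sourcefirst"
    if html.find('class="sourcefirst"') > 0:
        return 'file', []
    # a directory listing keeps its entries inside <tbody class="stripes2">;
    # valid child links start with /jdk8u, and the "[up]" parent link is excluded
    body = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', html, re.S)
    links = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', body[0]) if body else []
    return 'dir', links

kind, links = classify_page('/jdk8u/jdk8u/jdk/file/dddb1b026323/src')
print(kind, len(links))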

2. Now for the code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2
import re
import os
import HTMLParser

dirbase = '/tmp'
urlbase = 'http://hg.openjdk.java.net'
url = urlbase + '/jdk8u/jdk8u/jdk/file/dddb1b026323/src'

# path to skip ahead to, so an interrupted download can resume on demand
skip_to_p = ''
# secondary flag: set to True once the skip target has been found
skip_find = False

textmod = {'user': 'Amei'}
textmod = urllib.urlencode(textmod)
print(url)
req = urllib2.Request(url='%s%s%s' % (url, '?', textmod))
res = urllib2.urlopen(req)
res = res.read()

# global collection of paths already visited, to prevent re-entry
allflist = []

# 1. find the content table, 2. find the valid address links
table = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', res, re.S)
harr = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', table[0])


def down_src_recursion(harr):
    global allflist, skip_find
    if not harr:
        return False
    i = 0
    arrlen = len(harr)
    print("in new dir cur ...")
    # keep the visited list from growing without bound
    if len(allflist) > 1500:
        print('over-exists, cut to ...')
        allflist = allflist[-800:]
    for alink in harr:
        i += 1
        # strip the trailing directory separator to prevent unwanted scanning
        alink = alink.rstrip('/')
        if skip_to_p and not skip_find:
            if alink != skip_to_p:
                print('skip file, cause not found ..., skip=%s, now=%s' % (skip_to_p, alink))
                continue
            else:
                skip_find = True
        if alink in allflist:
            print('directory has already been searched: ' + alink)
            continue
        pa = dirbase + alink
        if os.path.isfile(pa):
            print('file already exists, no download required: ' + pa)
            continue
        reqt = urllib2.Request(urlbase + alink)
        rest = urllib2.urlopen(reqt)
        rest = rest.read()
        allflist.append(alink)
        if rest.find('class="sourcefirst"') > 0:
            print('this is a resource file: %s %d/%d' % (alink, i, arrlen))
            filename = alink.split('/')[-1]
            linearr = re.findall(r'<span id=".+">(.+)</span>', rest)
            fileobject = open(dirbase + alink, 'w')
            for line in linearr:
                try:
                    # the lines contain HTML entities that need to be unescaped; non-standard
                    # characters may throw an exception, which has to be caught
                    line = HTMLParser.HTMLParser().unescape(line)
                except UnicodeDecodeError as e:
                    print('oops, ascii convert error occurred:', e)
                fileobject.write(line + '\r\n')
            fileobject.close()
        else:
            print('this is a directory: %s %d/%d' % (alink, i, arrlen))
            if not os.path.exists(pa):
                print('create directory: %s' % alink)
                os.makedirs('/tmp' + alink, mode=0777)
            ta = re.findall(r'<tbody class="stripes2">(.+)<\/tbody>', rest, re.S)
            ha = re.findall(r'href="(/jdk8u[\w\/\._]+)">(?!\[up\])', ta[0])
            down_src_recursion(ha)

# go ...
down_src_recursion(harr)
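
A quick note on the skip feature from 1.5: if a run is interrupted, the file-exists check already avoids re-downloading finished files, and skip_to_p can additionally be set to the last top-level entry that was being processed (as printed on the console) so that earlier entries are not re-scanned at all. The path below is only a hypothetical example.

# hypothetical example: resume by skipping ahead to a top-level entry under /src
skip_to_p = '/jdk8u/jdk8u/jdk/file/dddb1b026323/src/share'
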
3. Let the code run
python jdk-crawler.py

4. How's the download going?
du -sh /tmp/jdk8u/

OK, that's it. Once the download on the test server finishes, the files can be moved to your own machine via FTP.
