Splitting genomic data by chromosome and writing files: Python, awk, and R data.table speed comparison

Last Update:2017-03-26 Source: Internet

Author: User


The genome data files are so large that I worried about running out of system memory when processing them further in R, so I decided to split each file by chromosome first. Python, awk, and R can all do this simply and quickly, but is there a difference in speed? Before running them on several 50 GB files, I tested each script on 244 MB of data and compared the running times.
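A comparison like this can be driven from one small timing harness. The following is a minimal sketch, not the procedure actually used for the numbers in this article; the helper names `time_command` and `compare` are made up for illustration, and it measures wall-clock time for a single run of each command.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of program arguments) once and return the elapsed
    wall-clock time in seconds. Raises CalledProcessError if cmd fails."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def compare(commands):
    """Time every command in a {label: argument list} dict and return
    {label: seconds}, ordered from fastest to slowest."""
    timings = {name: time_command(cmd) for name, cmd in commands.items()}
    return dict(sorted(timings.items(), key=lambda item: item[1]))
```

It could be called as `compare({"awk": ["bash", "SplitChr.sh", ",", "test.csv"]})`, adding one entry per script; the file names here are placeholders. Output files from a previous run should be deleted between runs, because the awk version appends to existing files.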

First, awk. awk processes its input line by line with its own flexible syntax, and a single line of code does the job. Time taken: 24 s.

    #!/bin/bash
    # Split a delimited file into one file per value of the first column.
    main() {
        start_s=$(date +%s)
        # Assumption: the chromosome is in field 1 and fields 1-3 are written out.
        # Output goes to <inputfile>_<chromosome>; >> appends, so remove old
        # output files before rerunning.
        awk -F "$sep" -v out="$inputfile" \
            '{ print $1 "," $2 "," $3 >> (out "_" $1) }' "$inputfile"
        end_s=$(date +%s)
        echo "finished in $((end_s - start_s)) s"
    }

    if [ $# -eq 2 ]; then
        sep=$1
        inputfile=$2
        main
    else
        echo "usage: SplitChr.sh sep inputfile"
        echo "eg: SplitChr.sh , test.csv"
    fi
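For readers less familiar with awk, the core of the one-liner above can be written out step by step. This is a rough Python sketch of the same idea, not a translation of the script: the function name `split_by_first_field` is hypothetical, and as a simplification it writes each whole line rather than selected fields. Like awk, it opens an output file the first time a key is seen and keeps it open until the end.

```python
def split_by_first_field(path, sep=","):
    """Append every line of `path` to the file `<path>_<first field>`.

    Returns the sorted list of first-field values that were seen.
    """
    handles = {}
    try:
        with open(path) as src:
            for line in src:
                # The first field selects the output file.
                key = line.split(sep, 1)[0].rstrip("\n")
                if key not in handles:
                    # Append mode mirrors awk's >> redirection.
                    handles[key] = open(f"{path}_{key}", "a")
                handles[key].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Note that, as in the awk version, a header line is not treated specially: it ends up in a file named after its own first field.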

Next, Python. The language is simple and easy to write, so the program was quickly implemented. It also works line by line, but adds a detail the awk version lacks: only the chromosomes that are needed get written out. Time taken: 19.9 seconds.

    #!/usr/bin/env python3
    import sys
    import time


    def main():
        if len(sys.argv) != 3:
            print("usage: splitchr sep inputfile eg: splitchr ',' test.txt")
            sys.exit(1)
        sep = sys.argv[1]
        filename = sys.argv[2]
        with open(filename, 'r') as f:
            header = f.readline()
            if len(header.split(sep)) < 2:
                print("the sep can't be recognized!")
                sys.exit(1)
            # Chromosomes to keep: chr1..chr22, chrX, chrY
            chrlst = ["chr" + str(i) for i in list(range(1, 23)) + ["X", "Y"]]
            # One output file per chromosome, each starting with the header
            outputdic = {}
            for chri in chrlst:
                outputdic[chri] = open(filename + "_" + chri, 'w')
                outputdic[chri].write(header)
            # Route each line by the chromosome in its first column
            for eachline in f:
                tmpchr = eachline.strip().split(sep)[0]
                if tmpchr in outputdic:
                    outputdic[tmpchr].write(eachline)
            for out in outputdic.values():
                out.close()
        end = time.perf_counter()
        print("read: %f s" % (end - start))


    if __name__ == '__main__':
        start = time.perf_counter()
        main()

Finally, the R data.table package. data.table is an enhanced version of data.frame with greatly improved speed, but does it have an advantage over awk and Python?

    #!/usr/bin/Rscript
    library(data.table)

    main <- function(filename, sep) {
      started.at <- proc.time()
      # Read the whole file into memory; the chromosome column must be named "chr"
      dt <- fread(filename, sep = sep, header = TRUE)
      chrlst <- paste0("chr", c(1:22, "X", "Y"))
      for (chri in chrlst) {
        outputfile <- paste0(filename, "_", chri)
        # Select the rows of one chromosome with a join on the chr column
        fwrite(dt[.(chri), on = .(chr)], file = outputfile, sep = sep)
      }
      cat("finished in", timetaken(started.at), "\n")
    }

    arg <- commandArgs(trailingOnly = TRUE)
    if (length(arg) == 2) {
      sep <- arg[1]
      filename <- arg[2]
      main(filename, sep)
    } else {
      cat("usage: splitchr.r sep inputfile eg: splitchr.r '\\t' test.csv", "\n")
    }

Time taken: 10.6 seconds. As soon as the data had been read, the processing and writing finished almost immediately; those two steps take very little time, so the overall time is shorter.
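The read-everything-then-write pattern can be tried outside R as well. The sketch below mirrors only the overall shape of the data.table run (load the file, group rows by chromosome in memory, write each group in one call); it says nothing about how data.table works internally. The function name `split_in_memory` is hypothetical, and the approach assumes the whole file fits in memory, which is exactly the concern with 50 GB inputs.

```python
from collections import defaultdict


def split_in_memory(path, sep=","):
    """Read `path` fully, group data rows by first field, and write each
    group with the header to `<path>_<first field>`.

    Returns {first field: number of data rows written}.
    """
    with open(path) as src:
        header = src.readline()
        lines = src.readlines()

    # Group the rows in memory before touching any output file.
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        groups[line.split(sep, 1)[0].strip()].append(line)

    written = {}
    for key, rows in groups.items():
        with open(f"{path}_{key}", "w") as out:
            out.write(header)
            out.writelines(rows)  # one bulk write per chromosome
        written[key] = len(rows)
    return written
```

Whether this is actually faster than the line-by-line version in Python would have to be measured on the data in question.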

Summary

Although awk and Python both work line by line, these results suggest that awk runs slower than Python here; on the other hand, the awk version is a single line of code and is quick to write. Python in turn is slower than data.table. My guess is that this is because data.table is implemented in C and uses several optimizations, such as multi-threaded writing, hash-based reads, and operating by reference. Of course, these results are for reference only.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
