Splitting genomic data by chromosome and writing files: a speed comparison of Python, awk, and R data.table

Source: Internet
Author: User

The genome data is too large to process further in R without worrying about running out of system memory, so I decided to split the files by chromosome. Python, awk, and R can each do this simply and quickly, but is there a difference in speed? Before running them on several 50 GB files, I tested each script on 244 MB of data and compared the timings.
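A comparison like this needs each script to be timed the same way. The following is a minimal sketch of one way to do that from Python, assuming wall-clock time of a single run is what matters. The script and file names in `COMMANDS` are hypothetical placeholders, not names confirmed by the original post.

```python
import subprocess
import sys
import time

# Hypothetical script and input names; substitute your own.
COMMANDS = {
    "awk": ["sh", "SplitChr.sh", ",", "test.csv"],
    "python": [sys.executable, "splitchr.py", ",", "test.csv"],
    "data.table": ["Rscript", "splitchr.r", ",", "test.csv"],
}


def time_command(cmd):
    """Run cmd (a list of arguments) once and return the elapsed wall-clock seconds."""
    started = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, check=True)
    return time.perf_counter() - started


def time_all(commands):
    """Time every command in the mapping and return {name: seconds}."""
    return {name: time_command(cmd) for name, cmd in commands.items()}
```

Calling `time_all(COMMANDS)` would run the three scripts one after another. Output files from a previous run should be removed first, because a script that appends to its output files would otherwise grow them on every run.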

The first is awk. awk processes its input line by line, has its own syntax, and is very flexible; the whole job takes one line of code. Time: 24 s.

    #!/bin/sh
    # Split a delimited file into one file per value of its first column.
    # The field numbers of the awk command were lost in the source listing;
    # $1, $2, $3, with $1 as the chromosome column, are assumed here.
    main() {
        start_s=$(date +%s)
        # Append each record to "<inputfile>_<first field>".
        awk -F "$sep" -v prefix="$inputfile" \
            '{ print $1 "," $2 "," $3 >> (prefix "_" $1) }' "$inputfile"
        end_s=$(date +%s)
        # Elapsed seconds; subtracting hours, minutes and seconds separately
        # goes negative when a run crosses a minute or hour boundary.
        echo "finished in $((end_s - start_s)) s"
    }

    if [ $# -eq 2 ]; then
        sep=$1
        inputfile=$2
        main
    else
        echo "usage: SplitChr.sh sep inputfile"
        echo "eg: SplitChr.sh , test.csv"
    fi

Next is Python. Python is simple and easy to write, so the program was implemented quickly. It also processes the file line by line, but adds one detail that the awk version lacks: only the chromosomes that are needed are written out. Time: 19.9 seconds.

    #!/usr/bin/env python3
    import sys
    import time


    def main():
        if len(sys.argv) != 3:
            print("usage: splitchr sep inputfile  eg: splitchr ',' test.txt")
            sys.exit(1)
        sep = sys.argv[1]
        filename = sys.argv[2]
        start = time.perf_counter()
        with open(filename) as f:
            header = f.readline()
            if len(header.split(sep)) < 2:
                print("the sep can't be recognized!")
                sys.exit(1)
            # Only these chromosomes are kept: chr1..chr22, chrX, chrY.
            chrlst = ["chr" + str(i) for i in list(range(1, 23)) + ["X", "Y"]]
            # One output file per chromosome, each starting with the header.
            outputdic = {}
            for chri in chrlst:
                outputdic[chri] = open(filename + "_" + chri, "w")
                outputdic[chri].write(header)
            # The first field of each line is the chromosome name.
            for eachline in f:
                tmpchr = eachline.strip().split(sep)[0]
                if tmpchr in outputdic:
                    outputdic[tmpchr].write(eachline)
            for out in outputdic.values():
                out.close()
        print("read: %f s" % (time.perf_counter() - start))


    if __name__ == "__main__":
        main()

Finally, the data.table package for R. data.table is an enhanced version of data.frame with much better speed, but does it have an advantage over awk and Python?

    #!/usr/bin/Rscript
    library(data.table)

    main <- function(filename, sep) {
      started.at <- proc.time()
      # Read the whole file into memory at once.
      dt <- fread(filename, sep = sep, header = TRUE)
      chrlst <- paste0("chr", c(1:22, "X", "Y"))
      for (chri in chrlst) {
        outputfile <- paste0(filename, "_", chri)
        # Select the rows whose `chr` column equals chri and write them out;
        # nomatch = 0L avoids an NA row when a chromosome is absent.
        fwrite(dt[.(chri), on = .(chr), nomatch = 0L], file = outputfile, sep = sep)
      }
      cat("finished in", timetaken(started.at), "\n")
    }

    arg <- commandArgs(trailingOnly = TRUE)
    if (length(arg) == 2) {
      main(filename = arg[2], sep = arg[1])
    } else {
      cat("usage: splitchr.r sep inputfile  eg: splitchr.r '\\t' test.csv", "\n")
    }

Time: 10.6 seconds. Once the data had been read, processing and writing finished almost immediately. Because the processing and writing time is so short, the overall time is shorter.
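The read-everything-first, then write-each-group-in-bulk pattern can also be sketched in Python. This is only an illustration of the pattern under the assumption that the input fits in memory; it does not reproduce how data.table works internally. The function name `split_by_first_field` is made up for this example. Unlike the script in this post, it writes a file for every value found in the first column, not only for chr1 to chr22, chrX and chrY.

```python
from collections import defaultdict


def split_by_first_field(filename, sep=","):
    """Split a delimited file into one file per value of its first column.

    The whole file is held in memory, so this only suits inputs that fit in RAM.
    Returns the list of output file names.
    """
    groups = defaultdict(list)
    with open(filename) as handle:
        header = handle.readline()
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            key = line.rstrip("\n").split(sep, 1)[0]
            groups[key].append(line)

    outputs = []
    for key, lines in groups.items():
        output = f"{filename}_{key}"
        with open(output, "w") as out:
            out.write(header)
            out.writelines(lines)  # one bulk write per group
        outputs.append(output)
    return outputs
```

For the 50 GB files mentioned at the start, holding every line in memory would not be practical, which is where a streaming, line-by-line approach remains the safer choice.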

Summary

Although awk and Python both process the file line by line, the results above suggest that awk runs slower than Python internally. On the other hand, the awk version is a single line of code and quick to write. Python is slower than data.table, presumably because data.table is implemented in C and uses multi-threaded writing, hashed reading, memory addressing and other optimizations to speed things up. Of course, the above results are for reference only.
