I previously wrote a C program that checks whether reads contain adaptor sequence and discards every read in which an adaptor is detected. This time, filtering the data showed that many reads contain adaptor sequence. To improve the assembly without greatly reducing the amount of data, the adaptor has to be trimmed off the reads, and only the reads that end up too short are discarded. I wrote a short Python program for this: a hit with no more than 3 mismatches counts as an adaptor match, and reads shorter than 50 bp are filtered out. Adding more command-line parameters to the program below would make it applicable to more cases (single-end, paired-end, paired-end with singletons, etc.):
import re
import sys

from Bio import SeqIO


def rmPE(read1, read2, adaptor1, adaptor2, min_length):
    # Trim both mates; keep the pair only if both survive.
    res_1 = rmSE(read1, adaptor1, min_length)
    res_2 = rmSE(read2, adaptor2, min_length)
    if res_1 and res_2:
        return res_1, res_2
    return False


def rmSE(read, adaptor, min_length):
    # Returns the read unchanged (no adaptor), the read cut where the
    # adaptor starts, or False if the remaining part is too short.
    seq = read.seq
    seed_len = 6
    a_len = len(adaptor)
    seq_len = len(seq)
    # Slide a 6-base seed along the adaptor and look for exact hits in the read.
    for i in range(a_len - seed_len + 1):
        seed = adaptor[i:i + seed_len]
        pos = 0
        while pos < seq_len:
            find_pos = seq.find(seed, pos)
            if find_pos < 0:  # find() returns -1 when there is no hit
                break
            mistaken_count = 0
            _b = find_pos
            _e = find_pos + seed_len
            # Extend to the left of the seed, counting mismatches.
            while _b >= 0 and i >= find_pos - _b:
                if adaptor[i - find_pos + _b] != seq[_b]:
                    mistaken_count += 1
                if mistaken_count > 3:
                    break
                _b -= 1
            else:
                # The left side passed; extend to the right of the seed.
                while _e < seq_len and i - find_pos + _e < a_len:
                    if adaptor[i - find_pos + _e] != seq[_e]:
                        mistaken_count += 1
                    if mistaken_count > 3:
                        break
                    _e += 1
                else:
                    # No more than 3 mismatches: cut the read where the adaptor starts.
                    if find_pos - i > min_length:
                        return read[:find_pos - i]
                    return False
            pos = find_pos + 1
    return read


def rmAdaptor(argv):
    argv.pop(0)
    read1_file, read2_file, reads_file, adaptor1, adaptor2, out_prefix, min_length = argv
    min_length = int(min_length)  # command-line arguments arrive as strings
    with open('%s.1.fq' % out_prefix, 'w') as read1_out, \
            open('%s.2.fq' % out_prefix, 'w') as read2_out:
        # Paired reads: both files must list the mates in the same order.
        pairs = zip(SeqIO.parse(read1_file, 'fastq'), SeqIO.parse(read2_file, 'fastq'))
        for read1, read2 in pairs:
            rmPE_res = rmPE(read1, read2, adaptor1, adaptor2, min_length)
            if rmPE_res:
                read1_out.write(rmPE_res[0].format('fastq'))
                read2_out.write(rmPE_res[1].format('fastq'))
    with open('%s.single.fq' % out_prefix, 'w') as reads_out:
        # Unpaired reads: the mate number after '/' or a space in the
        # header line selects which adaptor to look for.
        for reads in SeqIO.parse(reads_file, 'fastq'):
            mate = re.search(r'[\s/](\d)', reads.description)
            rmSE_res = False
            if mate and mate.group(1) == '1':
                rmSE_res = rmSE(reads, adaptor1, min_length)
            elif mate and mate.group(1) == '2':
                rmSE_res = rmSE(reads, adaptor2, min_length)
            if rmSE_res:
                reads_out.write(rmSE_res.format('fastq'))


if __name__ == '__main__':
    rmAdaptor(sys.argv)
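The matching rule described above (an exact 6-base seed, extended over the whole overlap with at most 3 mismatches, then a minimum-length check) can be tried on plain strings without Biopython. The following is only a minimal sketch of that idea, not the program itself: the function names `find_adaptor_start` and `trim_read`, the `max_mismatch` parameter, and the example sequences are assumptions made for illustration.

```python
def find_adaptor_start(seq, adaptor, seed_len=6, max_mismatch=3):
    """Return the read position where the adaptor starts, or None if no hit.

    A hit is an exact seed match whose whole overlap between read and
    adaptor contains at most max_mismatch mismatches.
    """
    for i in range(len(adaptor) - seed_len + 1):
        seed = adaptor[i:i + seed_len]
        pos = seq.find(seed)
        while pos != -1:
            start = pos - i  # where adaptor[0] would sit on the read (may be < 0)
            lo = max(start, 0)
            hi = min(len(seq), start + len(adaptor))
            # Count mismatches over every base where read and adaptor overlap.
            mismatches = sum(
                1 for k in range(lo, hi) if seq[k] != adaptor[k - start]
            )
            if mismatches <= max_mismatch:
                return start
            pos = seq.find(seed, pos + 1)
    return None


def trim_read(seq, adaptor, min_length=50):
    """Return seq unchanged, cut before the adaptor, or None if too short."""
    start = find_adaptor_start(seq, adaptor)
    if start is None:
        return seq
    if start > min_length:
        return seq[:start]
    return None


# Example data (hypothetical): a 60-base insert followed by the first
# 20 bases of an adaptor, with one base changed to simulate a sequencing error.
adaptor = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
insert = "ACGT" * 15
read = insert + "AGATCGGAAGTGCACACGTC"
trimmed = trim_read(read, adaptor)
print(len(read), len(trimmed))  # 80 60
```

Returning None for a discarded read, rather than False, is a choice made in this sketch so that an empty string can never be confused with "discard".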
Removal of adaptors from sequencing reads