The CDS sequence is extracted from the GFF file and converted to the amino acid sequence.
Recently on the principles of bioinformatics, I plan to record assignments in some classes. First assignment: as a question.
Basic Ideas:
1. Read the initial termination position and positive and negative chain information of CDS from GFF. GFF format see http://blog.sina.com.cn/s/blog_8a4f556e0102yd3l.html.
2. Extract the CDS sequence from the FNA file using information such as the start/end position. FNA format see http://boyun.sh.cn/bio? P = 1192.
3. Use the CDS sequence and codon table to obtain and output the FAA file.
Note: When the CDS in GFF is in the negative chain, base complementary pairing is required, that is, reverse complementing (5 'to 3' with 3 'to 5 ').
The python code is provided below. Python3.6
The following problems need to be solved: the comments of the obtained FNA file and the FAA file are incomplete, and the command line parameters are not provided, and the job is handed in and supplemented.
1 # bioinformatics homework 2 import re 3 class CDS2AA (): 4 Pa = re. compile (r's \ s + ') 5 Pa = re. compile (R' [TCAG] TG ') # bacterial start code NTG 6 def _ init _ (self, fna, gff): 7 self. fna = fna 8 self. gff = gff 9 def N2M (self, sequence): 10 hash = {'A': 'T', 't': 'A', 'C': 'G ', 'G': 'C'} 11 sequence = ''. join ([hash [I] for I in sequence]) # convert Positive and Negative chains to 12 return sequence [:-1] 13 def Get_CDS_index (self, line): 14 line = self. pa. spli T (line) 15 CDS = (line [0], line [3], line [4], line [6]) 16 return CDS17 def Seq2AA (self, sequence, hash): 18 AA = hash [sequence [: 3] 19 if self. pa. match (sequence [: 3]): 20 AA = 'M' # Start code 21 for I in range (3, len (sequence)-3, 3 ): 22 AA + = hash [sequence [I: I + 3] 23 return AA24 def CDS2AA (self): 25 fn = open (self. fna, 'R') 26 gf = open (self. gff, 'R') 27 r = open('AA_sequence.txt ', 'w') 28 w = open('CDS.txt', 'w') 2 9 hash_AA ={}# AA hash, sequence2AA30 with open('AA.txt ', 'R') as f: 31 for line in f: 32 line = line. strip () 33 if line: 34 line = self. pa. split (line) 35 hash_AA [line [0] = line [1] # AA hash36 hash_CDS ={}# CDS hash, CDS2sequence37 for line in fn: 38 line = line. strip () 39 if line. startswith ('>'): 40 A = self. pa. split (line) [0]. replace ('>', '') 41 hash_CDS [A] ='' 42 else: 43 hash_CDS [A] + = line44 fn. close () 45 for line in gf: 46 line = line. strip () 47 if 'cds' in line: 48 I = self. get_CDS_index (line) 49 sequence = hash_CDS [I [0] [int (I [1])-1: int (I [2])] 50 if I [3] = '-': 51 sequence = self. n2M (sequence) 52 # w. write (I [0] + '\ n' + sequence +' \ n') 53 s = self. pa. split (line) 54 p = re. compile (r'id = (. *?);. *? Dbxref = (.*?);. *? Name = (.*?);. *? Gbkey = (.*?);. *? Product = (.*?);. *? Protein_id = (.*?); ') 55 m = re. findall (p, line) 56 s = s [0] + '_' + m [0] [0] + m [0] [2] + '\ tdbxref =' + m [0] [1] + '\ tprotein =' + m [0] [4] + '\ tprotein_id =' + m [0] [5] + '\ tgbkey =' + m [0] [3] 57 w. write (s + '\ n' + sequence +' \ n') 58 AA = self. seq2AA (sequence, hash_AA) 59 r. write (I [0] + '\ n' + AA +' \ n') 60 w. close () 61 r. close () 62 63 if _ name __= = '_ main __': 64 fna = 'gca _ 000160075.2 _ ASM16007v2_genomic.fna '65 gff = 'gca _ 000160075.2 _ ASM16007v2_genomic.gff '66 m = CDS2AA (fna, gff) 67 m. CDS2AA ()
I will solve some other problems before submitting my homework. Interesting assignment questions will be uploaded one after another.