The CDS sequence is extracted from the GFF file and converted to the amino acid sequence.

Last Update:2017-11-27 Source: Internet

Author: User

Tags faa

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The CDS sequence is extracted from the GFF file and converted to the amino acid sequence.

Recently on the principles of bioinformatics, I plan to record assignments in some classes. First assignment: as a question.

Basic Ideas:

1. Read the initial termination position and positive and negative chain information of CDS from GFF. GFF format see http://blog.sina.com.cn/s/blog_8a4f556e0102yd3l.html.

2. Extract the CDS sequence from the FNA file using information such as the start/end position. FNA format see http://boyun.sh.cn/bio? P = 1192.

3. Use the CDS sequence and codon table to obtain and output the FAA file.

Note: When the CDS in GFF is in the negative chain, base complementary pairing is required, that is, reverse complementing (5 'to 3' with 3 'to 5 ').

The python code is provided below. Python3.6

The following problems need to be solved: the comments of the obtained FNA file and the FAA file are incomplete, and the command line parameters are not provided, and the job is handed in and supplemented.

1 # bioinformatics homework 2 import re 3 class CDS2AA (): 4 Pa = re. compile (r's \ s + ') 5 Pa = re. compile (R' [TCAG] TG ') # bacterial start code NTG 6 def _ init _ (self, fna, gff): 7 self. fna = fna 8 self. gff = gff 9 def N2M (self, sequence): 10 hash = {'A': 'T', 't': 'A', 'C': 'G ', 'G': 'C'} 11 sequence = ''. join ([hash [I] for I in sequence]) # convert Positive and Negative chains to 12 return sequence [:-1] 13 def Get_CDS_index (self, line): 14 line = self. pa. spli T (line) 15 CDS = (line [0], line [3], line [4], line [6]) 16 return CDS17 def Seq2AA (self, sequence, hash): 18 AA = hash [sequence [: 3] 19 if self. pa. match (sequence [: 3]): 20 AA = 'M' # Start code 21 for I in range (3, len (sequence)-3, 3 ): 22 AA + = hash [sequence [I: I + 3] 23 return AA24 def CDS2AA (self): 25 fn = open (self. fna, 'R') 26 gf = open (self. gff, 'R') 27 r = open('AA_sequence.txt ', 'w') 28 w = open('CDS.txt', 'w') 2 9 hash_AA ={}# AA hash, sequence2AA30 with open('AA.txt ', 'R') as f: 31 for line in f: 32 line = line. strip () 33 if line: 34 line = self. pa. split (line) 35 hash_AA [line [0] = line [1] # AA hash36 hash_CDS ={}# CDS hash, CDS2sequence37 for line in fn: 38 line = line. strip () 39 if line. startswith ('>'): 40 A = self. pa. split (line) [0]. replace ('>', '') 41 hash_CDS [A] ='' 42 else: 43 hash_CDS [A] + = line44 fn. close () 45 for line in gf: 46 line = line. strip () 47 if 'cds' in line: 48 I = self. get_CDS_index (line) 49 sequence = hash_CDS [I [0] [int (I [1])-1: int (I [2])] 50 if I [3] = '-': 51 sequence = self. n2M (sequence) 52 # w. write (I [0] + '\ n' + sequence +' \ n') 53 s = self. pa. split (line) 54 p = re. compile (r'id = (. *?);. *? Dbxref = (.*?);. *? Name = (.*?);. *? Gbkey = (.*?);. *? Product = (.*?);. *? Protein_id = (.*?); ') 55 m = re. findall (p, line) 56 s = s [0] + '_' + m [0] [0] + m [0] [2] + '\ tdbxref =' + m [0] [1] + '\ tprotein =' + m [0] [4] + '\ tprotein_id =' + m [0] [5] + '\ tgbkey =' + m [0] [3] 57 w. write (s + '\ n' + sequence +' \ n') 58 AA = self. seq2AA (sequence, hash_AA) 59 r. write (I [0] + '\ n' + AA +' \ n') 60 w. close () 61 r. close () 62 63 if _ name __= = '_ main __': 64 fna = 'gca _ 000160075.2 _ ASM16007v2_genomic.fna '65 gff = 'gca _ 000160075.2 _ ASM16007v2_genomic.gff '66 m = CDS2AA (fna, gff) 67 m. CDS2AA ()

I will solve some other problems before submitting my homework. Interesting assignment questions will be uploaded one after another.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More