2015 "Shenzhen Cup" Mathematical Modeling Summer Camp, Problem B: the k-mer index problem for DNA sequences

Source: Internet
Author: User

A classmate gave me this problem and asked how I would approach it. I don't know much about mathematical modeling, so I can only answer it from a programming point of view.

This is the specific question:

This problem comes from the k-mer indexing problem for DNA sequences.

Given a DNA sequence containing only the four letters A, T, C, G, such as S = "CTGTACTGTAT", and an integer k: starting at the first position of S, take a contiguous short string of k letters, called a k-mer (e.g. for k = 5 the short string is CTGTA); then, starting at the second position of S, take another k-mer (for k = 5 this short string is TGTAC); and so on until the end of S. This gives a set containing all k-mers. For the sequence S, all 5-mers are

{CTGTA, TGTAC, GTACT, TACTG, ACTGT, CTGTA, TGTAT}

Typically these k-mers need a data indexing method so that later operations can access them quickly. For example, for 5-mers, when querying CTGTA, the index should return its positions in the DNA sequence S as {1, 6}.
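As a small illustration of the definition above, here is a minimal C++ sketch that collects every k-mer of a single sequence together with its 1-based start positions. The function name `kmerPositions` and the use of `std::map` are my own choices for illustration; they are not part of the problem statement.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: map each k-mer of s to the list of its
// 1-based start positions, in increasing order.
std::map<std::string, std::vector<int>> kmerPositions(const std::string &s, int k)
{
    assert(k > 0);
    std::map<std::string, std::vector<int>> index;
    for (int i = 0; i + k <= static_cast<int>(s.size()); ++i)
        index[s.substr(i, k)].push_back(i + 1);  // store the 1-based position
    return index;
}
```

For S = "CTGTACTGTAT" and k = 5, `kmerPositions(S, 5)["CTGTA"]` holds the positions 1 and 6, which matches the example.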

Problem

One million DNA sequences are given in a file, numbered 1 to 1,000,000, and each sequence has length 100.

(1) For a given k, design and implement a data indexing method that, for any k-mer, returns the numbers of the DNA sequences in which it occurs and its positions within those sequences. Each index only needs to support a single value of k; it does not need to support all values of k.

(2) Once the index is built, queries should be as fast as possible and memory usage as small as possible.

(3) Give the time complexity and space complexity of building the index.

(4) Give the time complexity and space complexity of an index query.

(5) Assuming a memory limit of 8 GB, analyze the maximum value of k that the designed index method can support and the corresponding query efficiency.

(6) The index method will be evaluated on the following points, in order of importance from high to low:

·  Index query speed

·  Index memory usage

·  The range of k values that can be supported under 8 GB of memory

·  Time to build the index

To build an index, we must record where each subsequence occurs, which means keeping the number of the sequence it came from.

The data actually comes as two files of a bit over 70 MB, with names ending in .fa.

After a quick search on Baidu: an .fa file is a data file. The full name is FASTA-formatted sequence file.

It is a data file format from bioinformatics, a text-based format used to represent nucleotide sequences or amino acid sequences. In this format, each nucleotide (base) or amino acid is encoded with a single letter, and a sequence name and annotations may be placed before the sequence.
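To make the format concrete, here is a minimal sketch of a FASTA reader in C++. It relies only on the commonly described layout: a line starting with `>` is a header (name and annotations), and the lines that follow hold the sequence, possibly split across several lines. The function name `readFasta` is hypothetical, and I have not checked it against the actual contest files.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read all sequences from a FASTA stream.
// A line starting with '>' is a header and opens a new record;
// every other non-empty line is appended to the current sequence.
// Data that appears before the first header is ignored.
std::vector<std::string> readFasta(std::istream &in)
{
    std::vector<std::string> seqs;
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();          // tolerate Windows line endings
        if (line.empty())
            continue;
        if (line[0] == '>')
            seqs.emplace_back();      // header: start a new sequence
        else if (!seqs.empty())
            seqs.back() += line;      // sequence data, may span lines
    }
    return seqs;
}
```

With this sketch, the sequence stored at `seqs[i]` would get sequence number i + 1. Passing a `std::ifstream` opened on the data file works the same way as the string stream used for testing.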

I tried opening it with Notepad, and Notepad simply froze.

So these should be the data that needs to be read. We read them in binary mode.

About 70 MB, 1 million sequences of 100 bases each, to be indexed within 8 GB of memory.
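To get a feeling for the sizes involved: a sequence of length 100 contains 100 - k + 1 k-mers, so the whole data set contains 1,000,000 × (101 - k) of them. The sketch below just does this arithmetic. The `bytesPerEntry` parameter is my own assumption about how an index entry might be stored, not something the problem specifies.

```cpp
#include <cassert>
#include <cstdint>

// Number of k-mers in numSeqs sequences of length seqLen each.
std::uint64_t totalKmers(std::uint64_t numSeqs, std::uint64_t seqLen, std::uint64_t k)
{
    assert(k >= 1 && k <= seqLen);
    return numSeqs * (seqLen - k + 1);
}

// Rough memory estimate in bytes for a flat array of index entries.
// bytesPerEntry is an assumption of this sketch, e.g. 16 bytes for an
// 8-byte key, a 4-byte sequence number and a 1-byte position, padded.
std::uint64_t indexBytes(std::uint64_t numSeqs, std::uint64_t seqLen,
                         std::uint64_t k, std::uint64_t bytesPerEntry)
{
    return totalKmers(numSeqs, seqLen, k) * bytesPerEntry;
}
```

For k = 5 this gives 96,000,000 entries, or about 1.5 GB at the assumed 16 bytes per entry, which is comfortably below the 8 GB limit.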

Since this is an index, and searching the raw strings for a given subsequence is really time-consuming, I set up a rule:

Replace A with 1;

Replace T with 2;

Replace C with 3;

Replace G with 4.

The code can be written as follows. For now it only does the numeric conversion:

    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Encode the k-mer starting at index n of s as a decimal number,
    // one digit per base: A=1, T=2, C=3, G=4.
    // long long holds at most 18 such digits, so this works for k <= 18.
    long long getnum(const string &s, int n, int k)
    {
        long long numall = 0;
        for (int i = n; i < n + k; i++)
        {
            int digit = 0;
            if (s[i] == 'A')
                digit = 1;
            else if (s[i] == 'T')
                digit = 2;
            else if (s[i] == 'C')
                digit = 3;
            else if (s[i] == 'G')
                digit = 4;
            numall = numall * 10 + digit;  // append the digit on the right
        }
        return numall;
    }

    int main()
    {
        string s;  // the DNA sequence
        int k;     // length of each k-mer
        getline(cin, s);
        cin >> k;
        int len = s.size();
        vector<long long> a;  // numeric values of all k-mers, in order
        for (int i = 0; i + k <= len; i++)
        {
            a.push_back(getnum(s, i, k));
            cout << a[i] << endl;
        }
        return 0;
    }

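If we enter the base sequence from the example, CTGTACTGTAT, with k = 5, the rule above gives the values 32421, 24213, 42132, 21324, 13242, 32421 and 24212, one per 5-mer.

This way, searching becomes a matter of comparing numbers, which is easier to do in code.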
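The decimal rule above spends one decimal digit per base. A more compact alternative, which is not what this article uses, is to spend two bits per base so that a k-mer with k up to 32 fits in one 64-bit integer. The sketch below shows that variant; the codes 0 to 3 and the names `baseCode` and `packKmer` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 2-bit code per base. The A, T, C, G order mirrors the
// article's rule but starts at 0 so that each base fits in two bits.
inline int baseCode(char c)
{
    switch (c)
    {
    case 'A': return 0;
    case 'T': return 1;
    case 'C': return 2;
    case 'G': return 3;
    default:  return -1;  // not a valid base
    }
}

// Pack the k-mer starting at index pos of s into a 64-bit integer.
// Valid for 1 <= k <= 32 and sequences containing only A, T, C, G.
std::uint64_t packKmer(const std::string &s, std::size_t pos, int k)
{
    assert(k >= 1 && k <= 32 && pos + k <= s.size());
    std::uint64_t value = 0;
    for (int i = 0; i < k; ++i)
    {
        int code = baseCode(s[pos + i]);
        assert(code >= 0);
        value = (value << 2) | static_cast<std::uint64_t>(code);
    }
    return value;
}
```

Two equal k-mers always pack to the same integer, so everything said here about sorting and searching numeric values carries over unchanged.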

The other part is building the index. For searching, the first idea that comes to mind is to start from the beginning and compare one entry at a time until a match is found.

However, with the memory limit, 1 million sequences could be hard for the computer to handle.

Following this line of thought, we sort first.

Bubble sort? Selection sort? Bucket sort? Which one is better? To be honest, I was undecided too. Bubble sort was the first thing I thought of, so we can go with bubble sort.

Remember that the problem asks for specific positions, so we should keep a second array that corresponds element by element to the value array, i.e. a numbering.

Once sorting is done, the numbers stored in the numbering array must not be altered, and the value array is in ascending order.

Given a subsequence, we use binary search to find its position in the sorted array, then look up its number, which gives its position in the original array.
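The sort-then-search idea can be sketched as follows. Two things here differ from the text and are my own choices: instead of two parallel arrays, each value is stored in one `Entry` together with its sequence number and position, and `std::sort` is swapped in for bubble sort, because an O(n²) sort on tens of millions of entries would not finish in reasonable time. All names (`Entry`, `encode`, `buildIndex`, `query`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One index entry: the encoded k-mer plus where it came from.
struct Entry
{
    long long value;  // decimal code, A=1 T=2 C=3 G=4 (the article's rule)
    int seq;          // 1-based sequence number
    int pos;          // 1-based start position inside that sequence
};

// Encode a k-mer as a decimal number, one digit per base (k <= 18).
// Assumes the input contains only A, T, C and G.
long long encode(const std::string &s, int start, int k)
{
    assert(k >= 1 && k <= 18);
    long long v = 0;
    for (int i = start; i < start + k; ++i)
    {
        int d = s[i] == 'A' ? 1 : s[i] == 'T' ? 2 : s[i] == 'C' ? 3 : 4;
        v = v * 10 + d;
    }
    return v;
}

// Order entries by value, then by sequence number, then by position.
bool byValue(const Entry &a, const Entry &b)
{
    if (a.value != b.value) return a.value < b.value;
    if (a.seq != b.seq) return a.seq < b.seq;
    return a.pos < b.pos;
}

// Build: collect every k-mer with its numbering, then sort by value.
std::vector<Entry> buildIndex(const std::vector<std::string> &seqs, int k)
{
    std::vector<Entry> index;
    for (int n = 0; n < static_cast<int>(seqs.size()); ++n)
        for (int i = 0; i + k <= static_cast<int>(seqs[n].size()); ++i)
            index.push_back({encode(seqs[n], i, k), n + 1, i + 1});
    std::sort(index.begin(), index.end(), byValue);
    return index;
}

// Query: binary search for the first entry with the wanted value,
// then walk over the run of equal values.
std::vector<std::pair<int, int>> query(const std::vector<Entry> &index,
                                       const std::string &kmer)
{
    long long v = encode(kmer, 0, static_cast<int>(kmer.size()));
    auto it = std::lower_bound(index.begin(), index.end(), v,
                               [](const Entry &e, long long x) { return e.value < x; });
    std::vector<std::pair<int, int>> hits;  // (sequence number, position)
    for (; it != index.end() && it->value == v; ++it)
        hits.push_back({it->seq, it->pos});
    return hits;
}
```

With N entries in total, building costs O(N log N) time in this sketch, and a query costs O(log N) plus the number of hits. Space is one `Entry` per k-mer.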

Finally, the numeric value is translated back into a base sequence according to the rule.
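Translating back is just the rule applied in reverse, digit by digit. A minimal sketch, assuming the decimal encoding A=1, T=2, C=3, G=4 described earlier (the function name `decode` is hypothetical):

```cpp
#include <algorithm>
#include <string>

// Turn a decimal code back into its base sequence.
// Digits outside 1..4 cannot come from the rule and become '?'.
std::string decode(long long value)
{
    static const char bases[] = {'?', 'A', 'T', 'C', 'G'};
    std::string s;
    while (value > 0)
    {
        int d = static_cast<int>(value % 10);  // last digit = last base
        s.push_back(d >= 1 && d <= 4 ? bases[d] : '?');
        value /= 10;
    }
    std::reverse(s.begin(), s.end());  // digits were read right to left
    return s;
}
```

For example, `decode(32421)` gives back "CTGTA".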

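The overall flow is roughly this: convert every k-mer to a number, sort the numbers while keeping their numbering, find a queried k-mer by binary search, and translate the result back.

This is purely my personal opinion. If there are mistakes, corrections are welcome.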


@ Mayuko

