Summer Camp for the 2015 "Shenzhen Cup" Mathematical Modeling Challenge, Question B: k-mer Index of DNA Sequences

Source: Internet
Author: User


This is a question given to me by a student from Shandong University of Science and Technology. I don't know much about mathematical modeling, so I can only use computer programs to solve it.

The problem statement is as follows:

This problem comes from the k-mer indexing problem for DNA sequences.

A DNA sequence is given that contains only the four letters A, T, C and G, such as S = "CTGTACTGTAT". Given an integer k, start from the first position of S and take a short string of k consecutive letters, called a k-mer (for k = 5, this string is CTGTA). Then take another k-mer starting from the second position of S (for k = 5, this string is TGTAC), and continue in this way until the end of S. The result is the set of all k-mers of S. For example, all 5-mers of the sequence S are

{CTGTA, TGTAC, GTACT, TACTG, ACTGT, CTGTA, TGTAT}

These k-mers usually need a data indexing method so that later operations can access them quickly. For example, for 5-mers, when CTGTA is queried, the index can return its positions in the DNA sequence S, namely {1, 6}.
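The sliding-window definition above can be illustrated with a minimal sketch. It does not build an index; it only scans one sequence and reports where a query k-mer starts, using 1-based positions as in the problem statement. The function name kmerPositions is hypothetical and is not part of the original problem.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the 1-based start positions at which `query` occurs in `s`,
// found by sliding a window of length query.size() over s.
// (kmerPositions is an illustrative name, not from the original text.)
std::vector<int> kmerPositions(const std::string& s, const std::string& query)
{
    std::vector<int> positions;
    const std::size_t k = query.size();
    if (k == 0 || k > s.size()) return positions;
    for (std::size_t i = 0; i + k <= s.size(); ++i) {
        // compare the k-mer starting at offset i with the query
        if (s.compare(i, k, query) == 0)
            positions.push_back(static_cast<int>(i) + 1);
    }
    return positions;
}
```

Calling kmerPositions("CTGTACTGTAT", "CTGTA") yields the positions 1 and 6. A linear scan like this costs time proportional to the sequence length for every query, which is the cost an index is meant to avoid.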

Problem

Now 1 million DNA sequences are given in a file. They are numbered 1 to 1,000,000, and each sequence has length 100.

(1) Provide and implement a data indexing method that, for a given k, returns for any k-mer the numbers of the DNA sequences that contain it and its positions within those sequences. Each index only needs to support one value of k; it does not need to support all values of k.

(2) Once an index is created, queries should be as fast as possible and memory use should be as small as possible.

(3) Provide the time complexity and space complexity analysis of index creation.

(4) Provide the time complexity and space complexity analysis of index queries.

(5) Assuming a memory limit of 8 GB, analyze the maximum k that the index method supports and the corresponding query efficiency.

(6) The index method will be evaluated on the following points, listed from most to least important:

· Index query speed

· Index memory usage

· The range of K values supported in 8 GB memory

· Index creation time

 

To create the index, we need to record both the number of the sequence each subsequence comes from and its position within that sequence.

The data actually comes as two files of more than 70 MB each, with names ending in .fa.

A search on Baidu shows that an .fa file is a data file; the full name is FASTA Formatted Sequence File.

It is a text-based data file format used in bioinformatics to represent nucleotide or amino acid sequences. In this format, each base or amino acid is encoded as a single letter, and a sequence name and comments can be placed before the sequence.

I tried to open it with Notepad, and Notepad froze.

So these should be the data to read. We read the data in binary mode.
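As a rough illustration of reading such a file, here is a minimal sketch of a FASTA reader. It assumes the usual FASTA layout, in which each record is a header line starting with '>' followed by one or more sequence lines; the actual layout of the contest files was not checked. The function name readFasta is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read every record of a FASTA stream and return the sequences in file order.
// Assumption: a record is a header line starting with '>' followed by one or
// more sequence lines. The header text itself is ignored in this sketch.
std::vector<std::string> readFasta(std::istream& in)
{
    std::vector<std::string> sequences;
    std::string line;
    while (std::getline(in, line)) {
        // a file opened in binary mode may still carry Windows line endings
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            sequences.emplace_back();      // header: start a new record
        } else if (!sequences.empty()) {
            sequences.back() += line;      // sequence data may span several lines
        }
    }
    return sequences;
}
```

The function accepts any input stream, so a std::ifstream opened with std::ios::binary could be passed in. With this convention, sequences[i] would hold the sequence numbered i + 1.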

So we have more than 70 MB of data, 1,000,000 sequences of 100 bases each, and the index must be created within 8 GB of memory.

When building an index, searching for a given subsequence by comparing strings takes a lot of time. Therefore, I map each base to a digit with the following rule:

Replace A with 1;

Replace T with 2;

Replace C with 3;

Replace G with 4.

The code can be written as follows:

// This code only converts each k-mer to its numeric value.
#include <iostream>
#include <string>
using namespace std;

int getnum(int, int);

string s;    // the input sequence
int a[100];  // the array that saves the values (at most 100 k-mers)

int main()
{
    int len, k;       // len: length of the sequence; k: length of each k-mer
    int i;
    getline(cin, s);  // read the sequence (gets() is no longer part of C++)
    cin >> k;         // enter the integer k
    len = s.size();   // obtain the sequence length
    for (i = 0; i < (len - k + 1) && i < 100; i++) {
        a[i] = getnum(i, k);
        cout << a[i] << endl;
    }
}

// Obtain the numeric value of the k-mer that starts at position n.
// An int can hold every possible value only for k up to 9.
int getnum(int n, int k)
{
    int numall = 0;
    for (int end = n + k; n < end; n++) {
        int digit = 0;  // stays 0 for any letter other than A, T, C, G
        if (s[n] == 'A') digit = 1;
        else if (s[n] == 'T') digit = 2;
        else if (s[n] == 'C') digit = 3;
        else if (s[n] == 'G') digit = 4;
        numall = numall * 10 + digit;  // append the digit on the right
    }
    return numall;
}

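Applying the rule to the base sequence from the example with k = 5 gives one number per 5-mer: 32421, 24213, 42132, 21324, 13242, 32421, 24212.

Searching for numbers in the code is more convenient than searching for strings.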

The other part is building the index. Since this is a search problem, the first idea is to compare the entries one by one from the beginning.

However, with 1,000,000 sequences and the memory limit, that is hard on the computer as well.
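To put rough numbers on the size of the problem, the following sketch counts the k-mers and estimates the memory of a flat array of index entries. The bytesPerEntry parameter is an assumption for illustration (for example, 16 bytes for a numeric code plus a sequence number and position); the original text does not specify an entry layout. The function names are hypothetical.

```cpp
#include <cstdint>

// Number of k-mers contained in numSeq sequences of length seqLen each:
// every sequence contributes (seqLen - k + 1) windows.
std::uint64_t kmerCount(std::uint64_t numSeq, std::uint64_t seqLen, std::uint64_t k)
{
    if (k == 0 || k > seqLen) return 0;
    return numSeq * (seqLen - k + 1);
}

// Bytes needed if every k-mer occupies bytesPerEntry bytes in the index.
// bytesPerEntry is an assumed figure, not something given by the problem.
std::uint64_t indexBytes(std::uint64_t numSeq, std::uint64_t seqLen,
                         std::uint64_t k, std::uint64_t bytesPerEntry)
{
    return kmerCount(numSeq, seqLen, k) * bytesPerEntry;
}
```

For 1,000,000 sequences of length 100 and k = 5, kmerCount gives 96,000,000 entries. At an assumed 16 bytes per entry that is 1,536,000,000 bytes, about 1.5 GB, so under this assumption a flat array of entries would fit within the 8 GB limit.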

So, following our idea, we sort first.

Bubble sort? Selection sort? Bucket sort? Which one is better? I struggled with this as well. The first one that came to mind was bubble sort, so we choose bubble sort.

Remember that the problem asks for the specific location, so we use a second array, number, that corresponds element by element to the original array.

After sorting is completed, the original array is in ascending order, and the number array, which records where each value came from, must not be changed any further.
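The sorting step with its companion number array might be sketched as below. Note that this sketch swaps in std::sort for the bubble sort chosen in the text: bubble sort needs on the order of n² comparisons, which is impractical for tens of millions of entries, while std::sort needs on the order of n log n. The function name sortWithNumbers is hypothetical, and both arrays are assumed to have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sort `codes` in ascending order and move each entry of `number` along
// with its code, so number[i] still tells where codes[i] came from.
// Assumption: codes.size() == number.size().
void sortWithNumbers(std::vector<std::uint64_t>& codes,
                     std::vector<std::uint32_t>& number)
{
    // pair every code with its original number
    std::vector<std::pair<std::uint64_t, std::uint32_t>> paired(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i)
        paired[i] = std::make_pair(codes[i], number[i]);

    // pairs compare by code first, then by number (std::sort replaces bubble sort)
    std::sort(paired.begin(), paired.end());

    // write the sorted order back into the two parallel arrays
    for (std::size_t i = 0; i < paired.size(); ++i) {
        codes[i] = paired[i].first;
        number[i] = paired[i].second;
    }
}
```

For the codes of the example 5-mers with number = {1, 2, 3, 4, 5, 6, 7}, the sorted codes are {13242, 21324, 24212, 24213, 32421, 32421, 42132} and number becomes {5, 4, 7, 2, 1, 6, 3}.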

Given a subsequence, we use binary search to find its place in the sorted array and then look up the number array to obtain its position in the original array.

Finally, the number is decoded back into the base sequence according to the rule.
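The query and decoding steps could look roughly like the sketch below. It assumes two parallel arrays, codes (sorted ascending) and number (the original position of each code), and the digit rule A = 1, T = 2, C = 3, G = 4 from the text. The names encode, decode and query are hypothetical, and 64-bit codes are an assumption that limits k to 19 digits.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode a k-mer with the rule A=1, T=2, C=3, G=4, one decimal digit per base.
// Every possible code fits in 64 bits only for k up to 19.
std::uint64_t encode(const std::string& kmer)
{
    std::uint64_t code = 0;
    for (char c : kmer) {
        std::uint64_t digit = 0;   // stays 0 for an unexpected letter
        if (c == 'A') digit = 1;
        else if (c == 'T') digit = 2;
        else if (c == 'C') digit = 3;
        else if (c == 'G') digit = 4;
        code = code * 10 + digit;
    }
    return code;
}

// Turn a code back into letters; digits outside 1..4 become '?'.
std::string decode(std::uint64_t code)
{
    static const char letters[] = {'?', 'A', 'T', 'C', 'G'};
    std::string kmer;
    while (code > 0) {
        const unsigned digit = static_cast<unsigned>(code % 10);
        kmer.push_back(digit <= 4 ? letters[digit] : '?');
        code /= 10;
    }
    std::reverse(kmer.begin(), kmer.end());  // digits were read right to left
    return kmer;
}

// Binary search in the sorted codes; return number[] for every match.
std::vector<std::uint32_t> query(const std::vector<std::uint64_t>& codes,
                                 const std::vector<std::uint32_t>& number,
                                 std::uint64_t code)
{
    auto range = std::equal_range(codes.begin(), codes.end(), code);
    std::vector<std::uint32_t> hits;
    for (auto it = range.first; it != range.second; ++it)
        hits.push_back(number[static_cast<std::size_t>(it - codes.begin())]);
    return hits;
}
```

With the sorted example arrays, query(codes, number, encode("CTGTA")) returns the positions 1 and 6, and decode(32421) gives back "CTGTA". Each lookup costs on the order of log n comparisons plus the number of matches.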

All the processes are like this:


This is purely my personal opinion. If there are any mistakes, please point them out.


@ Mayuko

