Full Bucket Sorting Algorithm for the BWT Rotation Matrix
Luo Weifeng, 2011-7-3

Disclaimer: this algorithm is not professional work, so experts may prefer to skip it.

First of all, let's celebrate our successful arrival in Beijing and our stay in the 798 Art Zone. The residence is quite good, and there is an artistic atmosphere everywhere. The trip was smooth, with almost no stops along the way. Coming to the imperial capital once again feels very special. Another thing to celebrate is the new version of the CSDN blog. Many of my peers moved to other platforms because CSDN was hard to use, and I was left struggling on here. The new version of the blog is awesome, and the online editor has improved greatly. To celebrate the launch, I plan to write a post every day for the next few days.

The BWT (Burrows-Wheeler transform) has an important application in human genome sequencing. The open-source bzip is a successful example of BWT-based compression. For more information, see Wikipedia. This article introduces a bucket sorting method that I designed to generate L (the last column of the rotation matrix), which the BWT algorithm requires.

Input: a 2 M human genome sequence file, chr1.fa. The file consists of the letters N, A, C, G, and T in no particular order, and the entire file is one complete sequence.
Required output: the L string of the BWT algorithm.

Algorithm Analysis:
Data Model:
The data required by the algorithm is stored in a region of the heap. The region starts at pointer P and has length N. The first string therefore runs from P to P + (N-1). Element b of string a is located at P + (a + b) % N:
F_a(b) = P + (a + b) % N
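The addressing formula can be tried out on a tiny string. This is only a minimal sketch: it uses a Python string in place of the heap region, so P is implicitly index 0, and the names `char_at` and `rotation` are made up for illustration.

```python
def char_at(data, a, b):
    """Return element b of rotation a, i.e. the character at (a + b) % N."""
    n = len(data)
    return data[(a + b) % n]


def rotation(data, a):
    """Materialise rotation a in full (only sensible for small inputs)."""
    return "".join(char_at(data, a, b) for b in range(len(data)))


if __name__ == "__main__":
    # Print every row of the rotation matrix for a toy input.
    for a in range(4):
        print(a, rotation("ACGT", a))
```

The point of the formula is that no rotation ever needs to be copied: a single integer a is enough to identify a whole row of the matrix.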
Core Ideas:
The core idea is to bucket-sort the millions of rotation strings by exploiting the fact that the only useful characters in a human genome string are A, C, G, and T. Each sub-bucket is sorted in the same way until a bucket holds only one or zero elements, at which point collection of that bucket is triggered. Each collection checks whether the parent bucket can be merged, and merging continues upward until everything has been merged.
Details:
A bucket stores string IDs. The element 4 in a bucket represents the 5th string. According to the formula above, that is the string running from P + (4 + 0) % N to P + (4 + N-1) % N.
Preprocessing: The preprocessing program first reads the original data file, filters out every N, and loads the remaining content into a character array on the heap. It then initializes bucket 0, which contains all strings. Only the offset of each string is saved in the bucket. For example, if 4 is saved, it denotes the character sequence from P + 4 to P + (4 + N-1) % N.
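A rough sketch of the preprocessing step follows. It makes a few assumptions that the post does not state: FASTA header lines starting with `>` are skipped, lowercase letters are folded to uppercase, and anything other than A, C, G, T (including N and newlines) is dropped. The function name `preprocess` is hypothetical, and the bucket is an in-memory list rather than a file.

```python
def preprocess(lines):
    """Filter the raw input and build bucket 0.

    Returns (data, bucket0), where bucket0 holds the offset of every rotation.
    """
    kept = []
    for line in lines:
        if line.startswith(">"):      # assumed FASTA header line, not sequence data
            continue
        for ch in line.upper():       # assumption: treat 'acgt' the same as 'ACGT'
            if ch in "ACGT":          # drops N, newlines and anything else
                kept.append(ch)
    data = "".join(kept)
    bucket0 = list(range(len(data)))  # one offset per rotation
    return data, bucket0


if __name__ == "__main__":
    data, bucket0 = preprocess([">chr1\n", "NNacgt\n", "ANNT\n"])
    print(data, bucket0)
```

For a real file, the same function could be fed an open file object, since iterating over it yields lines.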
Distribution: The control program passes bucket 0 to the distribution program. The distribution program looks at the first letter of each character sequence (A, C, G, or T) and distributes the sequence to bucket 0A, 0C, 0G, or 0T accordingly. While distributing, it also counts the number of entries placed in each bucket. If a bucket receives two or more entries, it is distributed again at the next level. Otherwise, the bucket is marked as collected. (For example, if bucket 0AA contains fewer than two entries, bucket 0AA is renamed to bucket $AA.)
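One level of distribution might look like the sketch below. It keeps buckets as in-memory lists instead of files, and `distribute` and `needs_more_distribution` are names invented here. The `depth` argument selects which character of each rotation is inspected (0 for the first letter).

```python
def distribute(data, bucket, depth):
    """Split a bucket of rotation offsets into four child buckets
    according to the character at position `depth` of each rotation."""
    n = len(data)
    children = {"A": [], "C": [], "G": [], "T": []}
    for offset in bucket:
        children[data[(offset + depth) % n]].append(offset)
    return children


def needs_more_distribution(child):
    """A child with two or more entries goes to the next level;
    otherwise it is ready to be marked as collected."""
    return len(child) >= 2


if __name__ == "__main__":
    data = "ACGTAT"
    level0 = distribute(data, range(len(data)), 0)
    for letter, child in level0.items():
        print("0" + letter, child, needs_more_distribution(child))
```

No two strings are ever compared with each other here: each offset is routed purely by one character lookup.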
Collection: When collection is triggered, check whether all buckets in the same directory have been collected. If any bucket has not been collected, stop. Otherwise, create the parent collection bucket and gather the data of the four sub-buckets into it. If the parent bucket is not $ (the final bucket), collection of the parent bucket is triggered in turn. (Continuing the previous example: when collection of bucket $AA is triggered, the program first checks whether $AC, $AG, and $AT exist. If any one is missing, it stops and exits. If all exist, it creates the parent collection bucket $A and collects the data in $AA, $AC, $AG, and $AT into $A.)
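Distribution and collection together can be sketched as a small recursive function. This is an in-memory approximation, not the file-based program from the post: collection here is simply concatenating the four sorted children in A, C, G, T order. The depth cap (`depth >= n`) is an added assumption so that identical rotations, as in "AAAA", do not recurse forever. The name `sort_rotations` is hypothetical.

```python
def sort_rotations(data, bucket=None, depth=0):
    """Return rotation offsets in sorted order using only
    distribution (split by character) and collection (concatenate)."""
    n = len(data)
    if bucket is None:
        bucket = list(range(n))           # bucket 0: every rotation
    if len(bucket) < 2 or depth >= n:     # zero or one entry: already collected
        return list(bucket)

    # Distribution: route each offset by its character at this depth.
    children = {c: [] for c in "ACGT"}
    for offset in bucket:
        children[data[(offset + depth) % n]].append(offset)

    # Collection: gather the four sorted children into the parent.
    collected = []
    for c in "ACGT":
        collected.extend(sort_rotations(data, children[c], depth + 1))
    return collected


if __name__ == "__main__":
    print(sort_rotations("ACGTAT"))
```

Because it recurses once per character of shared prefix, this form is only suitable for small inputs.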
Conclusion: At this point, the top-level collection bucket $ has been generated. Take the strings out of the bucket in order and output the last character of each one. The result is the last column of the BWT rotation matrix. For example, if an int value a is read from the bucket, the character to output is the one at P + (a + N-1) % N.
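Reading off the last column can be illustrated as follows. The sketch assumes the sorted offsets are already available as a list (here written out by hand for the toy string "ACGTAT"), and `last_column` is a name chosen for this example.

```python
def last_column(data, order):
    """For each rotation offset a, in sorted order, emit the character
    at (a + N - 1) % N, which is the last character of that rotation."""
    n = len(data)
    return "".join(data[(a + n - 1) % n] for a in order)


if __name__ == "__main__":
    data = "ACGTAT"
    # Sorted rotations of ACGTAT: ACGTAT, ATACGT, CGTATA, GTATAC, TACGTA, TATACG
    order = [0, 4, 1, 2, 5, 3]
    print(last_column(data, order))
```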
Problem: The most direct problem is that, when each bucket is a file, the length of the file name grows with the size of the raw data. The solution is to keep the naming scheme described above at the logical level only. In practice, the 32-character hexadecimal MD5 digest of the bucket name is used as the actual file name.
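Mapping a logical bucket name to a fixed-length file name could be done along these lines. It uses the standard `hashlib` module; the function name `bucket_filename` and the ASCII encoding of the name are assumptions for the sketch, not details from the original project.

```python
import hashlib


def bucket_filename(bucket_name):
    """Map a logical bucket name (e.g. '0ACGT...') of any length
    to a 32-character hexadecimal MD5 digest usable as a file name."""
    return hashlib.md5(bucket_name.encode("ascii")).hexdigest()


if __name__ == "__main__":
    for name in ("0A", "$AA", "0" + "ACGT" * 1000):
        print(len(name), bucket_filename(name))
```

However long the logical name gets, the file name stays at 32 characters. Two different names hashing to the same digest is possible in principle, so a careful implementation may want to guard against that.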
Another problem is that the method as described above can run out of memory or overflow the stack at run time. The fix is straightforward: use a separate control function to drive the two atomic operations (distribution and collection) and abandon recursion entirely. That change will come later. For now the implementation is recursive, so note that this program has a bug. I will add the control program to the main program later to avoid the memory problems.
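One possible shape for such a non-recursive control loop is sketched below. It is an assumption about how the planned fix might look, not the author's code: an explicit stack of (bucket, depth) pairs replaces the call stack, and buckets are in-memory lists. Children are pushed in T, G, C, A order so that A is popped first, which means finished buckets arrive at the result already in sorted order.

```python
def sort_rotations_iterative(data):
    """Sort rotation offsets with an explicit stack instead of recursion."""
    n = len(data)
    result = []                          # plays the role of the final bucket $
    stack = [(list(range(n)), 0)]        # start from bucket 0 at depth 0
    while stack:
        bucket, depth = stack.pop()
        if len(bucket) < 2 or depth >= n:
            result.extend(bucket)        # collected: append in sorted position
            continue
        # Distribution by the character at this depth.
        children = {c: [] for c in "ACGT"}
        for offset in bucket:
            children[data[(offset + depth) % n]].append(offset)
        # Push in reverse alphabetical order so 'A' is processed first.
        for c in "TGCA":
            if children[c]:
                stack.append((children[c], depth + 1))
    return result


if __name__ == "__main__":
    print(sort_rotations_iterative("ACGTAT"))
```

The depth of shared prefixes no longer affects the call stack, so long repeated regions cost loop iterations rather than stack frames.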
If you want to download the project directly: it is a Code::Blocks project written in standard C. The project is still in rough shape and will be revised in a few days.
Link: http://download.csdn.net/source/3414631
Advantages and disadvantages of the algorithm: The disadvantage is obvious. It requires a large number of file operations, with files being created and deleted dynamically. The advantage is equally obvious. The entire generation process needs no comparisons, and the distribute-and-collect model is almost the same as the one used by distributed Hadoop. In other words, this model can easily be extended to distributed processing. On a single machine it may not be as good as more established algorithms, nor as fast at sorting, but once it is extended to distributed processing the performance gain should be very noticeable.