Basic Algorithms - Find: Linear Index Lookup

Source: Internet
Author: User

Several of the algorithms described earlier rely on the data being ordered. In real applications, however, data sets can be unexpectedly large, and keeping every record sorted by one of its keywords at all times is very expensive. Massive data is therefore usually stored unsorted, in the order it arrives.

So how can you quickly find the data you need? The answer is: an index.

An index associates a keyword with its corresponding record. An index consists of index entries, each of which contains, at a minimum, the keyword and the location of the corresponding record in storage.

By structure, indexes can be divided into linear indexes, tree indexes, and multilevel indexes. A linear index organizes the collection of index entries into a linear structure, also known as an index table.

Dense index

A dense index is a linear index in which every record in the data set has its own index entry, and the index entries are sorted by key.

Because the index entries are ordered, keyword lookups can use ordered search algorithms such as binary search, interpolation search, or Fibonacci search.
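A minimal sketch of the idea in C: a sorted dense index maps each key to the position of its record in an unsorted primary table, and binary search runs over the index. The structure and function names here are illustrative, not from the original article.

```c
/* One dense-index entry per record: the index is kept sorted by key,
 * while the primary table may remain in arrival order. */
struct DenseItem {
    int key;   /* search key; entries are sorted ascending by key */
    int pos;   /* position of the record in the primary table     */
};

/* Binary search over the sorted index. Returns the record's position
 * in the primary table, or -1 if the key is not present. */
int dense_search(const struct DenseItem *idx, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   /* avoids overflow of lo + hi */
        if (idx[mid].key == key)
            return idx[mid].pos;
        if (idx[mid].key < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

Note that only the index entries need to be sorted; the records themselves stay where they were written.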

The benefit of a dense index is that it condenses a large data set: even when the data set itself cannot fit in memory, its index can be loaded into memory all at once, the keys can be sorted in memory, and each index entry points to the original data record it represents on disk.

The ability to use fast lookup algorithms is clearly the advantage of a dense index. But if the data set is very large, the index table itself becomes very large; on a computer with limited memory, the index table has to be placed on disk as well, which greatly reduces efficiency.

Block Index

Because a dense index has as many index entries as there are records in the data set, its space cost is significant. To reduce the number of index entries, the data set can be divided into blocks that are ordered between one another, with one index entry created per block.

Block ordering divides the data set into several blocks, which must satisfy the following condition:

Records within each block are unordered, but the blocks are ordered with respect to one another: every key in a block is smaller than every key in the block that follows it.

The index entry structure is defined with three data items:

(1) The maximum key, which stores the largest keyword in each block; this guarantees that the smallest keyword in the next block is larger than every keyword in this block.

(2) The record count, which stores the number of records in the block, for use when looping through it.

(3) A pointer to the first data element of the block, so that traversal of the block's records can begin there.

This works out well: the bulky data blocks are stored on disk, while the index table stays in memory. This model does not require sorting the original data set, because the blocks need not be stored contiguously. Determine the number of blocks before the data is generated, along with where each block is stored (blocks are not contiguous with one another, but storage within a block is contiguous); then the range of keys stored in each block is fixed, and when new data arrives it is easy to determine which block it belongs in.

As an example:

Suppose I want to design a block index to find data, with roughly 3,600 records estimated. Following the optimal split (take this on faith for the moment; see the analysis below), I use 60 blocks of 60 records each. The 60 blocks correspond to 60 folder directories on disk that store the data, and the blocks are not contiguous with one another. Assuming the keyword range for these 3,600 records is 1-3000, the first block stores records with keywords 1-50. When a new record arrives whose keyword is between 1 and 50, it is appended directly to the first block. Also, if the record's keyword is greater than the block's maximum key in the index table, that maximum key entry is updated.
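The bookkeeping in this example can be sketched as follows. The constants and names (`KEY_SPAN`, `block_of`, `index_insert`) are illustrative assumptions based on the example's numbers, not part of the original article.

```c
#define NUM_BLOCKS 60   /* 60 blocks, as in the example              */
#define KEY_SPAN   50   /* each block covers 50 keys: 1-50, 51-100…  */

/* Index entry for one block, following the three items above. */
struct BlockItem {
    int maxkey;  /* largest key currently stored in the block */
    int count;   /* number of records in the block             */
    int start;   /* position of the block's first record       */
};

/* Because the key range is split evenly across the blocks, the
 * block a key belongs to can be computed directly. */
int block_of(int key)
{
    return (key - 1) / KEY_SPAN;  /* keys 1-50 -> block 0, 51-100 -> block 1, ... */
}

/* Register a new record in the index: bump the block's record count
 * and, if the key exceeds the block's current maximum, update maxkey. */
void index_insert(struct BlockItem *idx, int key)
{
    struct BlockItem *b = &idx[block_of(key)];
    b->count++;
    if (key > b->maxkey)
        b->maxkey = key;
}
```

Appending the record's payload to the block's directory on disk would happen alongside this index update.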

Analysis of the average lookup length for a block index table

Suppose there are n records divided into m blocks, each holding t records, so n = m×t. Let Lb and Lw be the average lookup lengths in the index table and within a block, respectively. With sequential search in both places, Lb = (m+1)/2 and Lw = (t+1)/2, so the average search length is ASL = Lb + Lw = (m+t)/2 + 1. Since m×t = n is fixed, m+t is minimized when m = t = √n, giving a best average of ASL = √n + 1. For the 3,600-record example above, √3600 = 60, which is why 60 blocks of 60 records is the optimal split, with an average of 61 comparisons.

The analysis above assumes sequential search on the index table as well. Since the blocks are ordered with respect to one another, a faster algorithm such as binary search can be used on the index table to improve efficiency.
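A sketch of that improvement, assuming the same three-field block entry as above: a binary lower-bound search on the index table's maximum keys picks the block, then a sequential scan runs inside it (records within a block are unordered). The names here are illustrative.

```c
/* Block index entry, matching the three items described earlier. */
struct BlockItem {
    int maxkey;  /* largest key in the block              */
    int count;   /* number of records in the block        */
    int start;   /* position of the block's first record  */
};

/* Two-level block search: binary search on the ordered index table
 * (lower bound on maxkey), then a sequential scan inside the block.
 * Returns the element's subscript in the primary table, or -1. */
int block_search(const int *table, const struct BlockItem *idx,
                 int m, int elem)
{
    int lo = 0, hi = m;                 /* lower_bound over idx[].maxkey */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (idx[mid].maxkey < elem)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == m)
        return -1;                      /* elem exceeds every block's max key */

    for (int j = idx[lo].start; j < idx[lo].start + idx[lo].count; j++)
        if (table[j] == elem)           /* records are unordered within the block */
            return j;
    return -1;
}
```

This brings the index-table term of the ASL down from (m+1)/2 to about log2(m), while the in-block term stays (t+1)/2.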


#define ILMSIZE 60    /* predefined integer constant, >= number of index entries m        */
#define MAXSIZE 3600  /* predefined integer constant, >= number of records in the table n */

/* index entry structure */
struct IndexItem {
    int index;   /* maximum key of the block              */
    int start;   /* subscript of the block's first record */
    int length;  /* number of records in the block        */
};

/* index table */
typedef struct IndexItem IndexList[ILMSIZE];

/* primary table holding the original data */
typedef int MainList[MAXSIZE];

/*
 * Input : primary table a, index table b, number of index entries m,
 *         element to search for elem
 * Output: subscript of the found element, or -1 on failure
 */
int BlockSearch(MainList a, IndexList b, int m, int elem)
{
    int i;
    for (i = 0; i < m; i++)          /* sequential search of the index table */
        if (b[i].index >= elem)
            break;
    if (i == m)
        return -1;                   /* lookup failed: elem exceeds every block's max key */

    int end = b[i].start + b[i].length;
    for (int j = b[i].start; j < end; j++)   /* sequential scan within the block */
        if (a[j] == elem)
            return j;
    return -1;                       /* lookup failed: elem is not in the block */
}
