Seven Search Algorithms

Source: Internet
Author: User
Tags: mysql, index

Read Catalogue

    • 1. Sequential Search
    • 2. Binary Search
    • 3. Interpolation Search
    • 4. Fibonacci Search
    • 5. Tree-Based Search
    • 6. Block Search
    • 7. Hash Search

Finding a specific element in a large body of information is one of the most common basic operations in computing; compiler symbol-table lookup is a typical example. This article introduces seven common search algorithms. Strictly speaking, binary search, interpolation search, and Fibonacci search can be grouped into one family: interpolation search and Fibonacci search are both optimizations built on top of binary search. Tree-based search and hash search are described in more detail in later posts.

  Definition of searching: in a lookup table, determine the data element (or record) whose key equals a given value.

   Categories of search algorithms:

1) Static vs. dynamic search. Note: "static" or "dynamic" refers to the lookup table. A dynamic table is one on which insert and delete operations are performed.

2) Unordered vs. ordered search. Unordered search: the searched sequence may or may not be sorted. Ordered search: the searched sequence must be an ordered sequence.

Average search length (ASL): the expected number of comparisons against the given key when a search succeeds. For a lookup table with n data elements, the average search length of a successful search is:

    ASL = P1*C1 + P2*C2 + ... + Pn*Cn

Pi: the probability that the i-th data element in the table is the one searched for.
Ci: the number of comparisons made when the i-th data element is found.

1. Sequential Search

   Description: Sequential search works on linear tables stored either sequentially (arrays) or as linked lists.

   Basic idea: Sequential search, also called linear search, is an unordered search algorithm. Starting from one end of the linear table, scan the nodes one by one and compare each node's key with the given value k. If they are equal, the search succeeds; if the scan reaches the end without finding a node whose key equals k, the search fails.

   Complexity analysis: Assuming each element is searched for with equal probability, the average search length of a successful search is ASL = 1/n * (1+2+3+...+n) = (n+1)/2. An unsuccessful search requires n+1 comparisons. So the time complexity of sequential search is O(n).

   C++ implementation:
// Sequential search
int SequenceSearch(int a[], int value, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (a[i] == value)
            return i;
    return -1;
}
2. Binary Search

  Description: The elements must be ordered; if they are unordered, sort them first.

  Basic idea: Binary search, also known as half-interval search, is an ordered search algorithm. Compare the given value k with the key of the middle node; the middle node splits the linear table into two sub-tables. If they are equal, the search succeeds; otherwise, the comparison between k and the middle node's key determines which sub-table to search next. Repeat recursively until the element is found, or until the sub-table is empty, meaning the table contains no such node.

  Complexity analysis: In the worst case the number of key comparisons is log2(n+1), and the expected time complexity is O(log2 n).

Note: The precondition of binary search is that the ordered table is stored sequentially. For a static lookup table, sorted once and never changed afterwards, binary search achieves good efficiency. For data sets that require frequent insert or delete operations, however, maintaining sorted order is a considerable amount of work, so binary search is not recommended there. -- "Big Talk Data Structures"

  C++ implementation:

// Binary search, iterative version
int BinarySearch1(int a[], int value, int n)
{
    int low, high, mid;
    low = 0;
    high = n - 1;
    while (low <= high)
    {
        mid = (low + high) / 2;
        if (a[mid] == value)
            return mid;
        if (a[mid] > value)
            high = mid - 1;
        if (a[mid] < value)
            low = mid + 1;
    }
    return -1;
}

// Binary search, recursive version
int BinarySearch2(int a[], int value, int low, int high)
{
    if (low > high)    // base case: not found
        return -1;
    int mid = low + (high - low) / 2;
    if (a[mid] == value)
        return mid;
    if (a[mid] > value)
        return BinarySearch2(a, value, low, mid - 1);
    else
        return BinarySearch2(a, value, mid + 1, high);
}
3. Interpolation Search

Before introducing interpolation search, consider a question: why must the algorithm above split in half, rather than at one quarter or some other point?

For example, when looking up "apple" in an English dictionary, do you subconsciously open it near the front or near the back? If you then look up "zoo", how do you search? Clearly you would not start from the middle, but would turn purposefully toward the front or the back. Likewise, to find the value 5 in an array of 100 elements drawn in ascending order from the range 1 to 10000, we would naturally start from a small array index.

As this analysis shows, binary search's choice of split point is not adaptive (it is, so to speak, one-size-fits-all). In binary search the split point is computed as mid = (low + high) / 2, i.e. mid = low + 1/2 * (high - low). By analogy, we can improve the split point to:

    mid = low + (key - a[low]) / (a[high] - a[low]) * (high - low)

That is, the fixed ratio 1/2 is replaced by an adaptive one based on where the keyword sits within the ordered table, so that mid lands closer to the keyword key, indirectly reducing the number of comparisons.

   Basic idea: Based on binary search, the choice of split point is made adaptive, which can improve search efficiency. Interpolation search is, of course, also an ordered search.

Note: For tables that are long and whose keywords are uniformly distributed, the average performance of interpolation search is much better than binary search. Conversely, if the keys are distributed very unevenly, interpolation search is not necessarily the right choice.

Complexity analysis: The time complexity of a successful (or failed) search is O(log2(log2 n)).

   C++ implementation:
// Interpolation search
int InsertionSearch(int a[], int value, int low, int high)
{
    if (low > high || value < a[low] || value > a[high])
        return -1;                       // base case: out of range / not found
    if (a[high] == a[low])               // avoid division by zero
        return (a[low] == value) ? low : -1;
    // Multiply before dividing so integer arithmetic does not truncate to 0.
    int mid = low + (high - low) * (value - a[low]) / (a[high] - a[low]);
    if (a[mid] == value)
        return mid;
    if (a[mid] > value)
        return InsertionSearch(a, value, low, mid - 1);
    else
        return InsertionSearch(a, value, mid + 1, high);
}
4. Fibonacci Search

Before introducing the Fibonacci search algorithm, let us first introduce a closely related concept that everyone knows well: the golden section.

The golden ratio, also called the golden section, refers to dividing a whole into two parts such that the ratio of the smaller part to the larger part equals the ratio of the larger part to the whole, approximately 0.618:1.

0.618 is widely regarded as the most aesthetically pleasing ratio. Its influence shows up not only in art forms such as painting, sculpture, music, and architecture, but also in management, engineering design, and other fields, which is why it is called the golden section.

Do you remember the Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 ... (starting with the third number, each number is the sum of the two before it)? As the sequence grows, the ratio of each number to the one after it gets closer and closer to 0.618. This property lets us apply the golden ratio to search techniques.

   Basic idea: Fibonacci search is another refinement of binary search. It uses the golden-section idea to choose the split point in the sequence, improving search efficiency; like the others, it is an ordered search algorithm. Recall that binary search compares the key with the element at position mid = (low + high) / 2, with three possible outcomes:

1) equal: the element at position mid is the one sought;

2) greater than: low = mid + 1;

3) less than: high = mid - 1.

Fibonacci search is very similar to binary search, but it splits the ordered table according to the Fibonacci sequence. It requires the number of records in the table to be one less than a Fibonacci number: n = F(k) - 1.

The key value is first compared with the record at position F(k-1), i.e. mid = low + F(k-1) - 1, and again there are three cases:

1) equal: the element at position mid is the one sought;

2) greater than: low = mid + 1, k -= 2;

Explanation: low = mid + 1 means the element to find lies in the range [mid+1, high]; k -= 2 because that range contains n - (F(k-1) - 1) - 1 = F(k) - 1 - F(k-1) = F(k-2) - 1 elements, so Fibonacci search can be applied recursively.

3) less than: high = mid - 1, k -= 1.

Explanation: high = mid - 1 means the element to find lies in the range [low, mid-1]; k -= 1 because that range contains F(k-1) - 1 elements, so Fibonacci search can again be applied recursively.

   Complexity analysis: In the worst case the time complexity is O(log2 n), and the expected complexity is also O(log2 n).

C++ implementation:

// FibonacciSearch.cpp
#include "stdafx.h"
#include <memory>
#include <cstring>
#include <iostream>
using namespace std;

const int max_size = 20;    // length of the Fibonacci array

/* Construct a Fibonacci array */
void Fibonacci(int *F)
{
    F[0] = 0;
    F[1] = 1;
    for (int i = 2; i < max_size; ++i)
        F[i] = F[i - 1] + F[i - 2];
}

/* Fibonacci search: a is the array to search, n its length, key the keyword */
int FibonacciSearch(int *a, int n, int key)
{
    int low = 0;
    int high = n - 1;

    int F[max_size];
    Fibonacci(F);               // construct the Fibonacci array F

    int k = 0;
    while (n > F[k] - 1)        // find n's position in the Fibonacci sequence
        ++k;

    int *temp;                  // extend array a to length F[k]-1
    temp = new int[F[k] - 1];
    memcpy(temp, a, n * sizeof(int));
    for (int i = n; i < F[k] - 1; ++i)
        temp[i] = a[n - 1];

    while (low <= high)
    {
        int mid = low + F[k - 1] - 1;
        if (key < temp[mid])
        {
            high = mid - 1;
            k -= 1;
        }
        else if (key > temp[mid])
        {
            low = mid + 1;
            k -= 2;
        }
        else
        {
            delete[] temp;      // free the extended array before returning
            if (mid < n)
                return mid;     // equal: mid is the position found
            else
                return n - 1;   // mid >= n: a padding value, return n-1
        }
    }
    delete[] temp;
    return -1;
}

int main()
{
    int a[] = {0, 16, 24, 35, 47, 59, 62, 73, 88, 99};
    int key = 100;
    int index = FibonacciSearch(a, sizeof(a) / sizeof(int), key);
    cout << key << " is located at: " << index;
    return 0;
}
5. Tree-Based Search

  5.1 The simplest tree-based search: the binary search tree.

Basic idea: A binary search tree first organizes the data to be searched into a tree, ensuring that values in each node's left subtree are smaller than those in its right subtree; searching then compares the key against each node from the root downward to narrow the range. This makes searching very efficient, but the tree must be built first.

  A binary search tree (binary search tree, also called a binary sorted tree) is either an empty tree, or a binary tree with the following properties:

1) If the left subtree of any node is not empty, then the value of all nodes on the left subtree is less than the value of its root node;

2) If the right subtree of any node is not empty, the value of all nodes on the right subtree is greater than the value of its root node;

3) The left and right subtrees of any node are themselves binary search trees.

  Binary search tree property: an in-order traversal of a binary search tree yields the keys in sorted order.

The same set of keys can produce binary search trees of different shapes, depending on insertion order.

For a detailed explanation of searching, inserting, and deleting in binary search trees, see the companion post in this series on binary search trees.

  Complexity analysis: As with binary search, the time complexity of insertion and search is O(log n), but in the worst case it still degrades to O(n). The reason is that inserting and deleting elements can leave the tree unbalanced (for example, searching for "93" in a degenerate tree requires n comparisons). What we want is good time complexity even in the worst case, and that is the original motivation for balanced search trees.

[Figure: performance comparison of binary search tree lookup, sequential search, and binary search.]

By optimizing the binary search tree we obtain other efficient tree-based search algorithms, such as balanced trees and red-black trees.

  5.2 Balanced search trees: the 2-3 search tree (2-3 tree)

  Definition of the 2-3 search tree: Unlike a binary tree, a 2-3 tree allows each node to hold one or two keys. An ordinary 2-node holds one key and two links to children; a 3-node holds two keys and three links. A 2-3 search tree is defined as follows:

1) it is either empty, or:

2) for a 2-node, the node holds one key (and its corresponding value) and two links to left and right children; the left child is itself a 2-3 node whose values are all smaller than the key, and the right child is a 2-3 node whose values are all larger than the key;

3) for a 3-node, the node holds two keys (and their corresponding values) and three links to left, middle, and right children. The left child is a 2-3 node whose values are all smaller than the smaller key; the middle child is a 2-3 node whose values lie between the two keys; the right child is a 2-3 node whose values are all larger than the larger key.

  Properties of the 2-3 search tree:

  1) An in-order traversal of a 2-3 search tree yields the keys in sorted order;

2) In a perfectly balanced 2-3 search tree, the root is the same distance from every null link. (This is what the word "balance" means for balanced trees: the longest root-to-leaf distance corresponds to the worst case of the search algorithm, and since in a balanced tree every root-to-leaf distance is the same, even the worst case is logarithmic.)


  Analysis of Complexity:

2-3 Tree Search efficiency is closely related to the height of the tree.

    • In the worst case, when all nodes are 2-nodes, the search cost is lg N.
    • In the best case, when all nodes are 3-nodes, the search cost is log3 N, approximately 0.631 lg N.

For example, for a 2-3 tree with 1 million nodes the height is between 12 and 20; for a 2-3 tree with 1 billion nodes, between 18 and 30.

Insertion also takes only a few operations, because only the nodes associated with the inserted node need to be modified, without examining the rest of the tree, so its efficiency is similar to that of search.

  5.3 Balanced search trees: the red-black tree (red-black tree)

The 2-3 search tree keeps the tree balanced after every insertion; in the worst case all nodes are 2-nodes and the tree height is lg N, which guarantees the worst-case time complexity. However, the 2-3 tree is fairly complicated to implement, so a simpler data structure that encodes the 2-3 tree was devised: the red-black tree (red-black tree).

  Basic idea: The idea of the red-black tree is to encode the 2-3 search tree by attaching extra information to its 3-nodes. The red-black tree divides the links between nodes into two kinds: red links, which bind two 2-nodes together to represent a single 3-node, and black links, which are the ordinary links of the 2-3 tree. Specifically, a 3-node is represented by two 2-nodes connected by a left-leaning red link, where one 2-node is the left child of the other. The advantage of this representation is that searching works exactly as in an ordinary binary search tree, with no modification.

  Definition of the red-black tree:

A red-black tree is a balanced search tree whose links are red or black and which satisfies:

    • red links lean left;
    • no node has two red links attached to it;
    • the tree is perfectly black-balanced: every path from the root to a leaf node contains the same number of black links.

As you can see, a red-black tree is really another way of drawing a 2-3 tree: if we draw each red link horizontally, the two 2-nodes it joins form one 3-node of the 2-3 tree.

  Property of the red-black tree: the whole tree is perfectly black-balanced, i.e. the number of black links on every path from the root to a leaf node is the same (this corresponds to property 2 of the 2-3 tree: the distance from the root to every leaf is equal).

Complexity analysis: The worst case is a red-black tree whose leftmost path consists entirely of 3-nodes while the rest are 2-nodes, so the longest (red-and-black) path is twice the length of the all-black path.

[Figure: a typical red-black tree, in which the longest path (alternating red and black links) is twice the shortest path.]

The average height of a red-black tree is about lg N.

Looking at the time complexity of the red-black tree in the various cases, it can be seen that the red-black tree, as an implementation of the 2-3 search tree, guarantees logarithmic time complexity even in the worst case.

As a symbol-table implementation, the red-black tree is widely used inside many programming languages, for example:

    • Java: java.util.TreeMap, java.util.TreeSet;
    • C++ STL: map, multimap, multiset;
    • .NET: SortedDictionary, SortedSet, and so on.

  5.4 B-trees and B+ trees (B tree / B+ tree)

Among the balanced search trees above, a node of a 2-3 tree holds at most 2 keys, and its red-black-tree implementation uses node colors to encode those two keys.

Wikipedia defines the B-tree as follows: "In computer science, a B-tree is a tree data structure that stores data in sorted order and allows searches, sequential access, insertions, and deletions in O(log n) time. The B-tree is a generalization of the binary search tree in which a node can have more than 2 children. Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It reduces the number of intermediate steps needed to locate a record, speeding up access, and is widely used in databases and file systems."

  B-tree definition:

A B-tree can be seen as an extension of the 2-3 search tree that allows each node to have up to M children.

    • The root node has at least two children;

    • each node holds at most M-1 keys, sorted in ascending order;

    • the child between key i-1 and key i contains only values between those two keys;

    • every other (non-root) internal node has at least M/2 children.

[Figure: a B-tree of order M=4.]

As you can see, the B-tree extends the 2-3 tree by allowing a node to hold more than 2 elements. Insertion and rebalancing in a B-tree are similar to the 2-3 tree and are not covered here.

B + Tree Definition:

The B+ tree is a variant of the B-tree; it differs from the B-tree in that:

    • a node with k children holds exactly k keys;
    • the non-leaf nodes serve only as an index; all record data is stored in the leaf nodes;
    • all leaf nodes of the tree form an ordered linked list, so all records can be traversed in key order.

[Figure: an example B+ tree.]


  The difference between B-trees and B+ trees is that the non-leaf nodes of a B+ tree contain only navigational information and no actual values, and all leaf nodes are chained together with a linked list, which makes range queries and traversal easy.

The advantages of B+ trees are:

    • Because internal nodes of a B+ tree store no record data, more keys fit in one memory page. The data is packed more tightly with better spatial locality, so accesses to related data on neighboring leaves also get a better cache hit ratio.
    • The leaf nodes of a B+ tree are chained, so traversing the whole tree needs only one linear pass over the leaves. And because the data is stored in order and linked, range searches are also convenient. A B-tree, by contrast, requires recursively traversing every level, and adjacent elements may not be contiguous in memory, so its cache behavior is not as good as a B+ tree's.

  The advantage of B-trees, on the other hand, is that each node contains both key and value, so frequently accessed elements may sit closer to the root and be reached more quickly.


B-trees and B+ trees are often used in file systems and database systems. By increasing the number of keys per node, they allow contiguous data to be located and accessed faster, effectively reducing search time and improving the spatial locality of storage so as to reduce I/O operations. They are widely used in file systems and databases, for example:

    • Windows: the HPFS file system;
    • Mac: the HFS and HFS+ file systems;
    • Linux: the ReiserFS, XFS, ext3, and JFS file systems;
    • Databases: Oracle, MySQL, SQL Server, and so on.

For how the B/B+ tree is applied to database indexes, see Zhang Yang's article "The data structures and algorithms behind MySQL indexes", which gives a detailed introduction to how MySQL uses B+ trees for indexing; it is recommended reading.

Summary of tree-based search:

The binary search tree has good average search performance, O(log n), but in the worst case it degenerates to O(n). Building on it, we can use balanced search trees. Among these, the 2-3 search tree rebalances itself after every insertion, keeping the tree height within a fixed range and thus guaranteeing the worst-case time complexity. But the 2-3 tree is hard to implement, and the red-black tree is a simple, efficient implementation of it: it cleverly uses color marks in place of the 3-nodes that are awkward to handle in a 2-3 tree. The red-black tree is an efficient balanced search tree with very wide application; the internals of many programming languages use red-black trees to a greater or lesser extent.

In addition, another extension of the 2-3 search tree, the B/B+ balanced tree, is widely used in file systems and database systems.

6. Block Search

Block search, also called indexed sequential search, is an improvement on sequential search.
  Algorithm idea: Divide the n data elements into m blocks (m ≤ n) that are "ordered by block". The elements within each block need not be ordered, but the blocks themselves must be ordered relative to one another: every key in the 1st block must be smaller than every key in the 2nd block, every key in the 2nd block smaller than every key in the 3rd block, and so on.
  Algorithm flow:
Step 1: Take the largest key of each block to form an index table;
Step 2: The search proceeds in two stages: first, binary-search (or sequentially search) the index table to determine which block the target record can only be in; then search within that block sequentially.

7. Hash Search

  What is a hash table (hash)?

We use an array with a larger index range to store the elements. A function (the hash function, also called a scatter function) is designed so that each element's key corresponds to a function value (an array index), and the element is stored at that array position. Put simply, the hash function "classifies" each element by its key and stores the element in the slot of the corresponding "class". However, we cannot guarantee a one-to-one correspondence between keys and function values; it is quite possible that different elements compute the same function value, producing a "collision": in other words, different elements are sorted into the same "class". We will see a simple way to resolve collisions later. In general, "direct addressing" and "collision resolution" are the two defining characteristics of hash tables.

  What is a hash function?

The rule for a hash function is: through some transformation, the keys are dispersed reasonably evenly across a sequential structure of a given size. The more evenly they are dispersed, the lower the time complexity of later lookups, and the higher the space complexity.

  Algorithm idea: The idea of hashing is very simple. If all keys are integers, a plain unordered array can serve as the table: use the key as the index and store its corresponding value at that slot, so the value for any key can be accessed instantly. This covers simple keys; we then extend the scheme to handle more complex key types.

  Algorithm Flow:

1) Build the hash table from the data using the chosen hash function;
2) Resolve address collisions with the chosen collision-handling method. Common methods are separate chaining (the "zipper" method) and linear probing; a detailed introduction can be found in the companion post in this series on hash tables;
3) Perform the hash lookup against the table built above.

A hash table is a classic example of trading time against space. With no memory limit, we could index directly by the key, making every lookup O(1); with no time limit, we could use an unordered array and sequential search, using very little memory. The hash table strikes a balance between these two extremes using a moderate amount of both time and space; by adjusting the hash function, you choose the trade-off between them.

  Analysis of Complexity :

A brief note on search complexity: for a collision-free hash table, lookup complexity is O(1) (note that the hash table must be built before any lookup).

  What price do we pay for using hashing?
When storing a large data set, the first structure that comes to mind may be the map, i.e. what we often call key-value pairs; programmers who frequently use Python will be familiar with this. The advantage of a map is that, given a key, we can quickly find the corresponding value. The essence of a map is a hash table; so what do we give up in exchange for this extremely fast lookup?

Hashing is a typical space-for-time trade-off. Take an array of length 100: searching it needs only a traversal matching each record; in terms of space, if the array stores byte-sized data, it occupies 100 bytes. Now use hashing instead: as we said before, a hash must have a rule constraining the relationship between key and storage position, which requires a hash table of fixed length. The data is still a 100-byte array, but suppose we need another 100 bytes to record the relationship between keys and positions; then the total space is 200 bytes. The size of the table recording the rule varies with the rule and may change.

[Figure: performance comparison of hashing and the other search algorithms.]

