Search data structures and algorithms in detail (Python implementation)


I. Basic concepts

Lookup (searching): determining, based on a given value, which data element (or record) in a lookup table has a keyword equal to that value.

Lookup table (search table): a collection of data elements (or records) of the same type.

Keyword (key): the value of a data item in a data element, also known as a key value.

Primary key: a keyword that uniquely identifies a data element or record.

Lookup tables can be divided as follows:

    • Static lookup table (static search table): a lookup table used only for lookup operations. Its main operations are:
    • querying whether a "specific" data element is in the table;
    • retrieving a "specific" data element and its various attributes.
    • Dynamic search table: a lookup table that also inserts or deletes during lookup:
    • inserting data while looking up;
    • deleting data while looking up.

II. Unordered table lookup

That is, a linear search over unsorted data: traverse the data elements one by one.

Algorithm analysis: in the best case the key is found at the first position, which is O(1); in the worst case it is found at the last position, which is O(n); so the average number of comparisons is (n+1)/2. The overall time complexity is O(n).

# The most basic sequential search over an unordered list
# Time complexity: O(n)

def sequential_search(lis, key):
    length = len(lis)
    for i in range(length):
        if lis[i] == key:
            return i
    return False


if __name__ == '__main__':
    LIST = [1, 5, 8, 123, 7, 222]
    result = sequential_search(LIST, 123)
    print(result)

III. Ordered table lookup

The data in the lookup table must be sorted on the primary key!

1. Binary search

Algorithm core: repeatedly compare the middle element of the lookup table with the search value, halving the search range at each step.

# Binary search over an ordered lookup table
# Time complexity: O(log n)

def binary_search(lis, key):
    low = 0
    high = len(lis) - 1
    time = 0
    while low <= high:
        time += 1
        mid = (low + high) // 2
        if key < lis[mid]:
            high = mid - 1
        elif key > lis[mid]:
            low = mid + 1
        else:
            # Print the number of comparisons
            print('times: %s' % time)
            return mid
    print("times: %s" % time)
    return False


if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 200, 222, 444]
    result = binary_search(LIST, 99)  # example key
    print(result)

2. Interpolation Lookup

Binary search is already quite good, but there is still room for optimization.

Sometimes halving the range is not aggressive enough: if every step could exclude nine-tenths of the data, wouldn't that be better? Choosing the split value is the key problem; interpolation aims to shrink the range at a faster rate.

The core of interpolation search is the formula:

value = (key - lis[low]) / (lis[high] - lis[low])

This value replaces the 1/2 used in binary search.

The binary search code above can be reused directly; only the line that computes mid needs to change.

# Interpolation search algorithm
# Time complexity: O(log n)

def interpolation_search(lis, key):
    low = 0
    high = len(lis) - 1
    time = 0
    while low <= high:
        time += 1
        # Computing mid is the core of the interpolation algorithm
        mid = low + int((high - low) * (key - lis[low]) / (lis[high] - lis[low]))
        print("mid=%s, low=%s, high=%s" % (mid, low, high))
        if key < lis[mid]:
            high = mid - 1
        elif key > lis[mid]:
            low = mid + 1
        else:
            # Print the number of comparisons
            print('times: %s' % time)
            return mid
    print("times: %s" % time)
    return False


if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 200, 222, 444]
    result = interpolation_search(LIST, 444)
    print(result)

The overall time complexity of interpolation search is still O(log n). Its advantage: when the table holds a large amount of data and the keywords are fairly uniformly distributed, its average performance is much better than binary search. Conversely, interpolation is not suitable for data with an extremely uneven distribution.

3. Fibonacci Lookup

Inspired by the interpolation algorithm, the Fibonacci search was invented. Its core is likewise to optimize the reduction rate so that the number of comparisons is minimized.

The algorithm uses a ready-made list of Fibonacci numbers:

F = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...]

# Fibonacci search algorithm
# Time complexity: O(log n)

def fibonacci_search(lis, key):
    # Requires a ready-made Fibonacci list whose largest element
    # exceeds the number of elements in the lookup table.
    F = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610,
         987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368]
    low = 0
    high = len(lis) - 1

    # To make the table length satisfy the Fibonacci property, append
    # copies of the last element until the length reaches F[k] - 1.
    k = 0
    while high > F[k] - 1:
        k += 1
    print(k)
    i = high
    while F[k] - 1 > i:
        lis.append(lis[high])
        i += 1
    print(lis)

    # Main loop of the algorithm. time counts the comparisons.
    time = 0
    while low <= high:
        time += 1
        # Guard against the F list index underflowing
        if k < 2:
            mid = low
        else:
            mid = low + F[k - 1] - 1
        print("low=%s, mid=%s, high=%s" % (low, mid, high))
        if key < lis[mid]:
            high = mid - 1
            k -= 1
        elif key > lis[mid]:
            low = mid + 1
            k -= 2
        else:
            if mid <= high:
                # Print the number of comparisons
                print('times: %s' % time)
                return mid
            else:
                # mid points into the padded area, so the match
                # is the last real element
                print("times: %s" % time)
                return high
    print("times: %s" % time)
    return False


if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 222, 444]
    result = fibonacci_search(LIST, 444)
    print(result)

Algorithm analysis: the overall time complexity of Fibonacci search is also O(log n). On average it performs better than binary search; in the worst case, however, for instance when the key is 1, the search stays in the left part of the table throughout and is then less efficient than binary search.

Summary: computing mid in binary search uses addition and division, interpolation search uses more complex arithmetic, and Fibonacci search uses only the simplest addition and subtraction. When searching massive amounts of data, this subtle difference may affect the final search efficiency. The three ordered-table search methods differ in essence only in their choice of split point; each has its merits and demerits, and the choice should fit the actual situation.

IV. Linear index lookup

For large amounts of unordered data, an index table is typically constructed to improve lookup speed.

An index is the process of associating a keyword with a record that corresponds to it.

An index consists of several index entries, each containing at least information such as the keyword and its corresponding location in the memory.

By structure, indexes can be divided into linear indexes, tree indexes, and multilevel indexes.

Linear index: a collection of index entries organized as a linear structure, also called an index table.

Linear indexes can be divided into dense indexes, block indexes, and inverted indexes.

1. Dense index

A dense index is a linear index in which an index entry is established for each record in the data collection.

This is effectively taking an unordered set and building an ordered linear table from it; the index entries must be sorted by key.

It is equivalent to doing, in advance, the sorting work that the search process requires.

2. Block index

A large unordered data set is divided into blocks, such that each block is internally unordered but the blocks are ordered relative to one another.

This is effectively an intermediate, compromise state between ordered and unordered lookup. When the volume of data is too large, building a complete dense index costs too much time and space; but with no sorting or indexing at all, a full traversal per lookup is unacceptable. The compromise is to sort or index to a limited degree.

Block indexing is more efficient than the O(n) of a plain traversal, but still considerably worse than O(log n). A small sketch of block search follows.
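As an illustrative sketch (the (max_key, elements) block layout here is an assumption made for this example, not something given in the text above), block search first locates the block whose maximum key bounds the target, then scans inside that one block:

# A minimal sketch of block search, assuming the data has been split into
# blocks ordered by their maximum key, while the elements inside each
# block remain unordered.
def block_search(blocks, key):
    # blocks: list of (max_key, elements) pairs, sorted by max_key
    for block_no, (max_key, elements) in enumerate(blocks):
        if key <= max_key:            # the key can only be in this block
            for i, item in enumerate(elements):
                if item == key:
                    return (block_no, i)
            return None               # correct block, but the key is absent
    return None                       # key exceeds every block maximum


if __name__ == '__main__':
    blocks = [(20, [7, 14, 3, 20]), (40, [35, 22, 40, 28]), (60, [52, 44, 60])]
    print(block_search(blocks, 22))   # (1, 1): second block, second element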

3. Inverted index

Instead of determining attribute values from a record, the record's location is determined from an attribute value; this is called an inverted index. The record-number table stores the addresses or references of all records sharing the same secondary keyword (either pointers to the records or their primary keys).

Inverted indexing is the most basic search engine indexing technology.
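As a toy illustration (the sample documents and whitespace tokenization are assumptions made for this example), an inverted index maps each attribute value, here a word, to the set of records containing it:

# A minimal inverted-index sketch: map each word to the ids of the
# documents that contain it.
docs = {0: "big data search", 1: "search engine index", 2: "inverted index"}

inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

print(inverted["index"])    # {1, 2}
print(inverted["search"])   # {0, 1}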

V. Binary sort tree

A binary sort tree is also called a binary search tree. It is either an empty tree or a binary tree with the following properties:

    • If its left subtree is not empty, the values of all nodes in the left subtree are less than the value of its root node;
    • If its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node;
    • Its left and right subtrees are themselves binary sort trees.

The purpose of constructing a binary sort tree is often not to sort, but to speed up searching for, inserting, and deleting keywords.

Binary sort tree operations:

1. Search: compare the node's value with the keyword; if they are equal, the search succeeds; if the keyword is smaller, continue in the left subtree; if larger, in the right subtree. Recurse downward in this way, finally returning a boolean or the node found.

2. Insert: starting from the root, compare the keyword with each node one by one, going left when smaller and right when larger; when the corresponding subtree is empty, link the new node in at that spot.

3. Delete: if the node to delete is a leaf, delete it directly; if it has only a left or only a right subtree, link that subtree to the parent node after the deletion; if it has both a left and a right subtree, replace the deleted node with its predecessor (or successor) in an in-order traversal of the tree.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Liu Jiang
# Python 3.5


class BSTNode:
    """A node of the binary sort tree.

    To keep the focus on the algorithm, checks such as data-type
    validation are omitted.
    """
    def __init__(self, data, left=None, right=None):
        """
        :param data: the value stored in the node
        :param left: the left child
        :param right: the right child
        """
        self.data = data
        self.left = left
        self.right = right


class BinarySortTree:
    """A binary sort tree built from BSTNode, keeping a pointer to the root."""
    def __init__(self):
        self._root = None

    def is_empty(self):
        return self._root is None

    def search(self, key):
        """
        Search by key.
        :param key: the key to look for
        :return: the node's value, or None
        """
        bt = self._root
        while bt:
            entry = bt.data
            if key < entry:
                bt = bt.left
            elif key > entry:
                bt = bt.right
            else:
                return entry
        return None

    def insert(self, key):
        """
        Insert operation.
        :param key: the key to insert
        """
        bt = self._root
        if not bt:
            self._root = BSTNode(key)
            return
        while True:
            entry = bt.data
            if key < entry:
                if bt.left is None:
                    bt.left = BSTNode(key)
                    return
                bt = bt.left
            elif key > entry:
                if bt.right is None:
                    bt.right = BSTNode(key)
                    return
                bt = bt.right
            else:
                bt.data = key
                return

    def delete(self, key):
        """
        The most complex operation on a binary sort tree.
        :param key: the key to delete
        """
        # p is kept as the parent of q for the re-linking below
        p, q = None, self._root
        if not q:
            print("Empty tree!")
            return
        while q and q.data != key:
            p = q
            if key < q.data:
                q = q.left
            else:
                q = q.right
        if not q:
            # The key is not in the tree
            return
        # The node to delete has been found and is referenced by q;
        # p is its parent, or None if q is the root.
        if not q.left:
            if p is None:
                self._root = q.right
            elif q is p.left:
                p.left = q.right
            else:
                p.right = q.right
            return
        # Find the rightmost node of q's left subtree and link q's right
        # subtree to its right. This may deepen the tree and is not the
        # most efficient scheme; other methods can be designed.
        r = q.left
        while r.right:
            r = r.right
        r.right = q.right
        if p is None:
            self._root = q.left
        elif p.left is q:
            p.left = q.left
        else:
            p.right = q.left

    def __iter__(self):
        """
        In-order traversal implemented with a plain Python list as the
        stack, used to display the binary sort tree we created.
        :return: the node values in sorted order
        """
        stack = []
        node = self._root
        while node or stack:
            while node:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.data
            node = node.right


if __name__ == '__main__':
    # Sample data chosen for illustration
    lis = [62, 58, 88, 48, 73, 99, 35, 51, 93, 29, 37, 49, 56, 36, 50]
    bs_tree = BinarySortTree()
    for i in range(len(lis)):
        bs_tree.insert(lis[i])
    # bs_tree.insert(100)
    bs_tree.delete(58)
    for i in bs_tree:
        print(i, end=" ")
    # print("\n", bs_tree.search(4))

Binary sort tree summary:

    • Binary sort trees are stored with links, preserving the advantages of linked structures for insertion and deletion.
    • In the best case a query takes a single comparison, and the number of comparisons never exceeds the depth of the tree. In other words, the lookup performance of a binary sort tree depends on its shape, which leads to the balanced binary tree discussed next.
    • A given set of elements can produce different binary sort trees. When the tree happens to be a complete binary tree, the lookup time complexity is O(log n), comparable to binary search.
    • In the most extreme case, a completely skewed tree, the time complexity is O(n), equivalent to sequential search, the worst outcome.

VI. Balanced binary tree

Balanced binary tree (AVL tree, named after its inventors' initials): a height-balanced binary sort tree in which the heights of each node's left and right subtrees differ by at most 1.

A balanced binary tree must first of all be a binary sort tree!

Balance factor: the depth of a node's left subtree minus the depth of its right subtree.

In a balanced binary tree, the balance factor of every node, branch and leaf alike, is only -1, 0, or 1; as soon as any node's factor falls outside these three values, the tree is unbalanced.

Minimum unbalanced subtree: the subtree rooted at the node closest to the insertion point whose balance factor has an absolute value greater than 1.

The idea behind the balanced binary tree: whenever a new node is inserted, check whether it destroys the tree's balance; if so, find the minimum unbalanced subtree. While preserving the binary-sort-tree property, adjust the links among the nodes of the minimum unbalanced subtree with the appropriate rotation, turning it into a new balanced subtree.

The following illustrates the construction of a balanced binary tree from [1, 2, 3, 4, 5, 6, 7, 10, 9]. (The step-by-step figures are omitted here; a sketch exposing the balance factors for this sequence follows.)
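In place of the omitted figures, here is a small sketch (an illustrative addition, not an AVL implementation) that inserts the same sequence into a plain binary sort tree and reports each node's balance factor; factors outside {-1, 0, 1} mark exactly the spots where an AVL tree would rotate:

# Insert [1, 2, 3, 4, 5, 6, 7, 10, 9] into an ordinary binary sort tree
# and report every node's balance factor (left depth minus right depth).
class Node:
    def __init__(self, data):
        self.data, self.left, self.right = data, None, None


def insert(root, data):
    if root is None:
        return Node(data)
    if data < root.data:
        root.left = insert(root.left, data)
    elif data > root.data:
        root.right = insert(root.right, data)
    return root


def depth(node):
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def balance_factors(node, out):
    if node:
        out.append((node.data, depth(node.left) - depth(node.right)))
        balance_factors(node.left, out)
        balance_factors(node.right, out)
    return out


if __name__ == '__main__':
    root = None
    for x in [1, 2, 3, 4, 5, 6, 7, 10, 9]:
        root = insert(root, x)
    # Ascending input degenerates into a right chain: node 1's factor is -8
    print(balance_factors(root, []))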
VII. Multi-way search tree (B-tree)

Multi-way search tree: each node can have more than two children, and multiple elements can be stored at each node.
For a multi-way search tree, how many elements each node can store and how many children it may have are critical. Four forms are in common use: the 2-3 tree, the 2-3-4 tree, the B-tree, and the B+ tree.

2-3 Trees

2-3 tree: every node has 2 children, or 3 children, or no children.

A 2-node contains one element and has two children (or no children; never just one child). As in a binary sort tree, its left subtree contains elements smaller than its element, and its right subtree elements larger.

A 3-node contains two elements and has three children (or no children; never just one or two children).

All leaves of a 2-3 tree must be on the same level.

The insert operation is illustrated in the original figures, which are omitted here.

The delete operation is likewise illustrated in figures omitted here.
2-3-4 Tree

A 2-3-4 tree is an extension of the 2-3 tree that also uses 4-nodes. A 4-node contains three elements (small, medium, large) and has four children (or no children).

Its insert and delete operations are illustrated in the original figures, which are omitted here.

B-Tree

A B-tree is a balanced multi-way search tree. The maximum number of children of any node is called the order of the B-tree. The 2-3 tree is a B-tree of order 3, and the 2-3-4 tree a B-tree of order 4.

The B-tree data structure is mainly used for data exchange between main memory and external storage.



B+ Tree

To solve problems such as traversing all the elements of a B-tree, the B+ tree adds a new way of organizing elements on top of the original structure.

The B+ tree is a variant of the B-tree that arose from the needs of file systems; strictly speaking, it is no longer a tree in the most basic sense.

In a B+ tree, every element that appears in a branch node is listed again, as a successor, at the leaf level. In addition, each leaf node stores a pointer to the next leaf node.

All leaf nodes together contain the complete keyword information, along with pointers to the corresponding records, and the leaves themselves are linked in order of increasing keyword.

The structure of the B+ tree is particularly suitable for range searches, for example, finding people aged between 20 and 30.

VIII. Hash table

Hash table: there is no relationship among the elements themselves. An element's storage location is computed directly by a function of the element's keyword. This one-to-one mapping function is called a hash function.

Hashing stores records in a contiguous storage space called a hash table. A keyword's storage location is called its hash address.

A hash table is a lookup-oriented storage structure. It is the best solution for finding a record equal to a given value. But it does not suit keywords that match many records, such as finding everyone whose sex is "male"; nor range lookups, such as finding people aged between 20 and 30; nor sorting, maximum, minimum, and the like.

A hash table is therefore typically used for data whose keywords do not repeat, like Python's dictionary type.
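As a quick illustration, Python's built-in dict behaves exactly this way (the sample keys and values below are made up for the example):

# Python's dict is a hash table: average O(1) lookup by key, unique keys.
ages = {"alice": 24, "bob": 31}
ages["carol"] = 27           # insert: the key's hash picks the slot
ages["bob"] = 32             # the same key again overwrites, never duplicates
print("bob" in ages)         # True: membership test via hashing, not a scan
print(ages.get("dave"))      # None: an absent key simply isn't found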

Designing a hash function that is simple, uniform, and makes good use of storage is the most critical problem in hashing.

However, every general-purpose hash function faces the problem of collisions.

Collision (conflict): two different keywords produce the same result after being run through the hash function.

8.1 Constructing a hash function

A good hash function: simple to compute, with uniformly distributed hash addresses.

1. Direct addressing method

For example, take a linear function of the keyword as the hash function:

f(key) = a*key + b (a, b are constants)

2. Digit analysis method

Extract digits from the keyword and assign addresses according to the characteristics of those digits.

3. Mid-square method

Square the keyword, then take some middle portion of the result.

4. Folding method

Split the keyword's digits into parts, compute on them separately, then combine the results: a way of playing with the digits.

5. Division (remainder) method

One of the most common methods.

For a table of length m, the hash formula is:

f(key) = key mod p (p <= m)

mod: the modulo (remainder) operation.

The crux of this method is the choice of p; when the data volume is large, collisions are still inevitable. Typically p is chosen to be a prime number close to m. A tiny numeric illustration follows.
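For example (the table size and keys here are assumptions made for this illustration), take m = 12 slots and p = 11, a prime near m:

# Division-method hashing: several keys land on the same address, which a
# collision-handling strategy (next section) must then resolve.
m, p = 12, 11
keys = [12, 67, 56, 16, 25, 37]
for key in keys:
    print(key, "->", key % p)   # 12 -> 1, 67 -> 1, 56 -> 1: collisions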

6. Random number method

Take the value of a random function of the keyword as its hash address:

f(key) = random(key)

To summarize: in practice, different hash methods are chosen according to the characteristics of the data, considering mainly:

    • the time required to compute the hash address
    • the length of the keywords
    • the size of the hash table
    • the distribution of the keywords
    • the frequency with which records are looked up

8.2 Handling hash collisions

1. Open addressing method

Whenever a collision occurs, look for the next empty hash address; as long as the hash table is large enough, an empty address will always be found and the record stored.

The formula is:

f_i(key) = (f(key) + d_i) mod m (d_i = 1, 2, 3, ..., m-1)

This simple collision resolution is called linear probing: if a slot is occupied, knock on the doors one by one and take the first empty slot found, regardless of whether a later key was "destined" for it.

The biggest problem with linear probing is the accumulation of collisions: when one key takes a slot that belongs to another, the other key must in turn go hunting for a free slot, and clusters build up.

Improved variants include quadratic probing and random probing.

2. Rehashing method

When a collision occurs, compute another hash function; one of the prepared functions will eventually resolve the collision. This keeps keywords from clustering, but correspondingly increases computation time.

3. Chaining (linked address) method

When a collision occurs, instead of switching to another address, store all keywords with the same hash value (synonyms) in a singly linked list; the hash table itself stores only the head pointers of these synonym lists, as the sketch below shows.

The advantage is that collisions are never a worry; the disadvantage is that the random access performance of the hash structure is reduced. In essence, a linked list is brought in to patch up the hash structure.
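A minimal chaining sketch (table size and keys are assumptions for the example): each slot of the table is a Python list acting as the synonym list, so colliding keys simply accumulate in the same bucket:

# Separate chaining: every slot holds the list of all keys hashing there.
m = 11
table = [[] for _ in range(m)]


def insert_chain(key):
    table[key % m].append(key)     # synonyms pile up in one bucket


def search_chain(key):
    return key in table[key % m]   # scan only the bucket for this hash


for k in [12, 23, 34, 7]:          # 12, 23 and 34 all hash to slot 1
    insert_chain(k)
print(table[1])                     # [12, 23, 34]
print(search_chain(23), search_chain(99))  # True False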

4. Public overflow area method

Set aside a separate storage area for all colliding records. This method is appropriate when colliding data are few relative to the base table.

8.3 Hash table lookup implementation

Here is a simple implementation:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Liu Jiang
# Python 3.5
# Checks for data types, table overflow, and so on are omitted.


class HashTable:
    def __init__(self, size):
        # Use a plain list to hold the hash table's elements
        self.elem = [None for i in range(size)]
        self.count = size  # maximum table length

    def hash(self, key):
        # Hash function using the division (remainder) method
        return key % self.count

    def insert_hash(self, key):
        """Insert a keyword into the hash table."""
        address = self.hash(key)       # initial hash address
        while self.elem[address]:      # the slot is taken: a collision
            address = (address + 1) % self.count  # linear probing: try the next slot
        self.elem[address] = key       # store directly when there is no collision

    def search_hash(self, key):
        """Look up a keyword; return a boolean."""
        start = address = self.hash(key)
        while self.elem[address] != key:
            address = (address + 1) % self.count
            if not self.elem[address] or address == start:
                # An empty slot, or we have cycled back to the start:
                # the keyword is not in the table.
                return False
        return True


if __name__ == '__main__':
    # Sample keyword collection (illustrative values)
    list_a = [12, 67, 56, 16, 25, 37, 22, 29, 15, 47, 48, 34]
    hash_table = HashTable(12)
    for i in list_a:
        hash_table.insert_hash(i)

    for i in hash_table.elem:
        if i:
            print((i, hash_table.elem.index(i)), end=" ")
    print("\n")

    print(hash_table.search_hash(15))
    print(hash_table.search_hash(33))

8.4 Hash table lookup performance analysis

In the absence of collisions, the lookup time complexity is O(1), the best case possible.

In practice, however, collisions are unavoidable. Three factors chiefly affect lookup performance:

    • whether the hash function is uniform
    • the method used to handle collisions
    • the hash table's fill (load) factor, i.e. how full the table is

That is the entire content of this article; I hope it helps you in your study.
