Write a search engine with Golang

Last Update:2016-04-14 Source: Internet

Author: User

Tags blizzard

The previous part covered the rationale behind the inverted index. The principle is simple and easy to understand; the key question is how to design the inverted table. The second column of the inverted table is easy to design. The first column holds the keys, and to support fast lookups its structure needs to satisfy the following two conditions.

Lookups must be very fast, so that the position of a keyword can be found in a very short time.
Adding keywords must also be fast, so that incoming documents are indexed as quickly as possible.

Beyond these two conditions, there are some nice-to-have properties:

Using as little memory as possible would be good.
Being able to traverse the entire column in order would be even better.

To support both lookup and insertion, the first thing that comes to mind is a sequential structure such as a linked list. Inserting into a linked list is no problem, but lookup is O(n), and who can put up with that? So the plain linked list is not the first choice. There is, however, a variant of the linked list that we can use: the skip list.

Skip list (skiplist)

What is a skip list? We can think of it as a variant of the linked list: several levels of ordered linked lists arranged in parallel. Wikipedia defines it as follows:

A skip list is a randomized data structure based on parallel linked lists, with efficiency comparable to a binary search tree (O(log n) average time for most operations).

Let's take a look at a diagram of a skip list (image source)

The bottom level is an ordinary ordered linked list. Nodes 1, 3, 4, 6 and 9 also appear in a second-level list, and nodes 1, 4 and 6 appear again in a third-level list. Searching through these three levels is more efficient than scanning the bottom list alone. In general, a skip list is built probabilistically: whether a node gets an extra level is decided at random, so that an element in level i appears in level i+1 with a fixed probability p (usually 0.5 or 0.25). On average, each element appears in 1/(1-p) lists, while the tallest element (usually a special head element at the front of the skip list) appears in O(log_{1/p} n) lists.

When searching, start from the head element in the top-level list and walk along each list until reaching the last element that is less than or equal to the target, then drop down one level and continue. The expected number of steps in each list is at most 1/p, which can be seen by tracing the search path backwards from the target until it reaches an element that appears in the next higher list. So the total expected cost of a lookup is O((log_{1/p} n)/p), which is O(log n) when p is a constant. By choosing different values of p, you can trade lookup cost against storage cost.

For example, to find element 7 in the figure above, we traverse 1->4->6->7, which is more efficient than scanning a single linked list.

Although implementations normally use probability to decide whether to raise the level of a node, a real problem can be analyzed on its own terms. For example, if we know how long the bottom list is, we can simply promote every 10th element by one level. With such a skip list we can estimate both the storage space and the average query time fairly well.
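As a rough illustration of that estimate, the sketch below counts the nodes on each level when every 10th node of a level is promoted to the level above. The function name levelSizes and the "levels times step" bound are assumptions made for this example, not something taken from the project.

```go
package main

import "fmt"

// levelSizes returns the node count of each level, bottom level first,
// when every step-th node of a level is promoted to the level above.
// step must be at least 2; smaller values yield a single level.
func levelSizes(n, step int) []int {
	if step < 2 {
		return []int{n}
	}
	var sizes []int
	for n > 0 {
		sizes = append(sizes, n)
		n /= step
	}
	return sizes
}

func main() {
	const step = 10
	sizes := levelSizes(1000000, step)
	total := 0
	for _, s := range sizes {
		total += s
	}
	fmt.Println("levels:", len(sizes))  // levels: 7
	fmt.Println("total nodes:", total) // total nodes: 1111111
	// Roughly step moves per level at most, so this is a loose upper bound.
	fmt.Println("rough bound on steps per lookup:", len(sizes)*step) // 70
}
```

For a million keys, the extra levels cost only about 11% more nodes than the bottom list, while a lookup touches on the order of tens of nodes instead of up to a million.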

The skip list is a very useful data structure and is fairly easy to implement. If you can implement a linked list, a skip list is just a group of linked lists; inserts and deletes simply have to update several lists.

I do not use a skip list in the project yet. I will add one later when it is needed, so there is no project code to look at for now. Sorry to let you down.
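In the meantime, the search and insert procedures described above can be sketched in a few dozen lines of Go. This is not code from the project: the names SkipList, node, maxLevel and randomLevel, the int keys, and the cap of 16 levels are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	maxLevel = 16  // assumed cap on the number of levels
	p        = 0.5 // probability that a node is promoted one level up
)

type node struct {
	key  int
	next []*node // next[i] is the successor on level i
}

// SkipList is an ordered set of int keys.
type SkipList struct {
	head  *node // sentinel node; its key is never compared
	level int   // number of levels currently in use
}

func NewSkipList() *SkipList {
	return &SkipList{head: &node{next: make([]*node, maxLevel)}, level: 1}
}

// randomLevel flips a biased coin to decide how many levels a new node spans.
func randomLevel() int {
	lvl := 1
	for lvl < maxLevel && rand.Float64() < p {
		lvl++
	}
	return lvl
}

// Contains walks right on each level while the next key is smaller than
// the target, then drops down one level.
func (s *SkipList) Contains(key int) bool {
	x := s.head
	for i := s.level - 1; i >= 0; i-- {
		for x.next[i] != nil && x.next[i].key < key {
			x = x.next[i]
		}
	}
	x = x.next[0]
	return x != nil && x.key == key
}

// Insert adds key if it is not present yet.
func (s *SkipList) Insert(key int) {
	update := make([]*node, maxLevel) // last node before key on each level
	x := s.head
	for i := s.level - 1; i >= 0; i-- {
		for x.next[i] != nil && x.next[i].key < key {
			x = x.next[i]
		}
		update[i] = x
	}
	if n := x.next[0]; n != nil && n.key == key {
		return // already present
	}
	lvl := randomLevel()
	for i := s.level; i < lvl; i++ {
		update[i] = s.head // new levels start at the head
	}
	if lvl > s.level {
		s.level = lvl
	}
	n := &node{key: key, next: make([]*node, lvl)}
	for i := 0; i < lvl; i++ {
		n.next[i] = update[i].next[i]
		update[i].next[i] = n
	}
}

// Keys returns all keys in ascending order by walking the bottom level.
func (s *SkipList) Keys() []int {
	var keys []int
	for x := s.head.next[0]; x != nil; x = x.next[0] {
		keys = append(keys, x.key)
	}
	return keys
}

func main() {
	s := NewSkipList()
	for _, k := range []int{6, 1, 9, 4, 3, 7} {
		s.Insert(k)
	}
	fmt.Println(s.Contains(7), s.Contains(5)) // true false
	fmt.Println(s.Keys())                     // [1 3 4 6 7 9]
}
```

The levels a node ends up on vary from run to run, but the bottom level always holds every key in order, which is what makes a sequential traversal of the whole column possible.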

A skip list can also be used together with hashing. A hash table has buckets and therefore occupies a lot of memory. If the hash values are stored in a skip list instead, and the skip list is loaded into memory with mmap, you save memory and still get good query speed, and the implementation is quite simple.

A skip list is also a good fit for the auto-increment primary key of a search engine. First, primary key lookups are not that frequent in a search engine: most queries go through keywords, and the primary key only needs to be looked up when a document changes, so the speed requirements are not especially high. Second, because an auto-increment key only grows, an insert can simply append to the end of the list without searching, so inserts are fast as well.

Hash table

A hash table is another way to implement the first column. It is a data structure that accesses a memory location directly according to the key. That is, it computes a function of the key that maps the data being queried to a position in a table, which speeds up the lookup. This mapping function is called a hash function, and the array that holds the records is called a hash table.

Hashing is one of the foundations of big data technology, as we all know, so I will not go into depth here; Introduction to Algorithms has a chapter that explains it very clearly. Instead, here is one hash-related thing that I find particularly interesting.

The core of a hash table is its hashing algorithm. A good hashing algorithm produces fewer collisions and brings lookup time closer to O(1), so choosing a good one is very important.

There are more hashing algorithms than anyone could list, and different algorithms suit different scenarios. One that I know of is something of a legend: it comes from World of Warcraft (!!!!! For the Horde!!!) and is known as the Blizzard hash. The hash values it produces are said to be completely unpredictable, and it is called a "one-way hash" (a one-way hash is an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible).

The following is a Go implementation of this algorithm. It used to be in my project, but I later stopped using a hash table and deleted it. The algorithm is claimed to handle any string easily, with a very low collision probability.

    // cryptTable is the base table shared by all hash types
    // (5 blocks of 0x100 entries each).
    var cryptTable [0x500]uint64

    // initCryptTable initializes the base table required by the hash
    // calculation. It must be called once before hashKey.
    func initCryptTable() {
        var seed, index1, index2 uint64 = 0x00100001, 0, 0
        for index1 = 0; index1 < 0x100; index1++ {
            index2 = index1
            for i := 0; i < 5; i++ {
                seed = (seed*125 + 3) % 0x2aaaab
                temp1 := (seed & 0xffff) << 0x10
                seed = (seed*125 + 3) % 0x2aaaab
                temp2 := seed & 0xffff
                cryptTable[index2] = temp1 | temp2
                index2 += 0x100
            }
        }
    }

    // hashKey hashes lpszString. dwHashType (0 to 4) selects the block of
    // cryptTable to use; different types give the position hash and the
    // check hashes for the same key.
    func hashKey(lpszString string, dwHashType int) uint64 {
        var seed1, seed2 uint64 = 0x7fed7fed, 0xeeeeeeee
        for i := 0; i < len(lpszString); i++ {
            key := lpszString[i]
            // ASCII-only upper-casing keeps ch below 0x100, so the table
            // index always stays in range.
            if key >= 'a' && key <= 'z' {
                key -= 'a' - 'A'
            }
            ch := int(key)
            seed1 = cryptTable[(dwHashType<<8)+ch] ^ (seed1 + seed2)
            seed2 = uint64(ch) + seed1 + seed2 + (seed2 << 5) + 3
        }
        return seed1
    }

There are many ways to implement a hash table. The most basic is an array plus linked lists, also known as chained (open-chain) hashing. The length of the array is the number of buckets, and the linked lists resolve conflicts: when an insert collides, the new node is hung at the end of the list for that bucket, and when a query hits a conflict, we continue searching linearly through the list under that bucket.

There is also the closed-chain hash, which is really a circular array whose length is the number of buckets. When an insert hits a conflict, it moves on to the next slot until it finds one without a conflict, wrapping around to the head of the array if it reaches the end. Lookups work the same way.

There is a small problem here. When a collision occurs, both the open-chain and the closed-chain hash have to match linearly and compare the actual values of two pieces of data. So whichever implementation is used, each node has to store the original key, otherwise there is no way to tell colliding entries apart. This leads to two problems:

If the key is a long string, the hash table needs more storage space to keep that string around for comparison.
String comparison is slow, so when collisions are frequent it hurts performance, although on current machines these comparisons are no big deal.

However, the programmers at Thunder Bluff (that is, Blizzard) came up with a better way. Using the hash function above with different dwHashType values, they hash the key three times and get three integers. The first integer determines the position; the second and third are stored in the hash table node in place of the original string and are used to resolve conflicts. To look up a key, first compute its three hashes and use the first one to locate a node. If that node's second and third hash values match those of the key, return the node. If they do not, then regardless of whether the table is open-chain or closed-chain, move on to the next node and compare again: if the values match, return that node, otherwise keep going. With this bit of juggling, the storage problem is solved, since each node only needs to store three integers of fixed size, and the second problem is solved too, since comparing two integers is faster than comparing strings.

OK, that is all for skip lists and hash tables. My code has no skip list yet; it will be added later. It did have a hash table at one point, but I later replaced it with a B+ tree to save memory, so the hash table code is not available for now. At least the Blizzard hash is written out above.

The next chapter covers the B+ tree in detail. My code uses a B+ tree, and so do the indexes of most databases.
