1. Hash table
Note: A special data structure.
Features: quick search, insertion, and deletion.
1.1 Basic Ideas
Array features: addressing is easy, but insertion and deletion are difficult.
Linked list features: addressing is difficult, but insertion and deletion are easy.
A hash table combines the advantages of both: easy addressing, and easy insertion and deletion.
1.2 Basic Concepts
- Hash table: a data structure that is accessed directly by key (key value); in other words, a map.
- Hash function: the mapping function of a hash table.
- Standard definition: if the key is $K$, the value is stored at storage location $F(K)$. The queried record can therefore be obtained directly, without comparing keys. The mapping relation $F$ is the hash function, and the table built on this idea is a hash table.
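The definition above can be sketched in a few lines of Python. This is only an illustration, not a production design: the class name `ChainedHashTable`, the bucket count, and the use of Python's built-in `hash` as $F$ are all assumptions made for the example. The sketch also assumes separate chaining, where keys that land in the same location share a small list, so a short comparison inside one bucket is still needed; the "no comparison" lookup in the definition is the idealized case.

```python
class ChainedHashTable:
    """Toy hash table: an array of buckets, each bucket a short list of (key, value) pairs."""

    def __init__(self, size=8):
        # The array gives cheap addressing; the per-bucket lists give cheap insert/delete.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Plays the role of F(K): maps a key to a storage location.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```

A lookup only ever scans one bucket, which is why search, insertion, and deletion stay fast as long as the keys are spread evenly.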
2. Hash Algorithm
A hash algorithm is not a specific algorithm, but a collective name of a class of algorithms.
A hash algorithm is also called a hashing algorithm. It is generally written $F(data) = key$: an input $data$ of any length is processed by the hash algorithm, and a fixed-length output $key$ is produced.
In short, a hash algorithm can be loosely compared to a pseudo-random number generator that converts "plaintext -> ciphertext". That is, a piece of information is mapped to a fixed-length numeric string by the hash algorithm.
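The fixed-length property is easy to observe with Python's standard `hashlib` module. SHA-256 is used here only as one example of the class of algorithms; the helper name `digest_hex` is made up for this sketch.

```python
import hashlib

def digest_hex(data: bytes) -> str:
    # SHA-256 always yields 32 bytes (64 hex characters), whatever the input length.
    return hashlib.sha256(data).hexdigest()

short = digest_hex(b"a")
long_ = digest_hex(b"a" * 1_000_000)

print(len(short), len(long_))      # 64 64
print(short == digest_hex(b"a"))   # True: the same input always gives the same output
```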
Properties:
- Irreversible: given only the hash value of $x$, $x$ cannot be recovered.
- Collision resistant ("no conflict"): knowing $x$, it is computationally infeasible to find a $y \neq x$ such that the hash values of $x$ and $y$ are the same.
"Collision" means that different input data correspond to the same hash value.
Note: the existence of collisions does not violate the "no conflict" property. Because inputs of any length are mapped to a fixed-length output, collisions must exist; the property only says that they are hard to find. A good hash algorithm should be highly collision resistant.
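To make "collision" concrete, the sketch below uses a deliberately weak hash. `toy_hash` is a hypothetical function invented for this example and is not one of the common hash algorithms. Because it only sums bytes, any two inputs with the same bytes in a different order collide, which is exactly what a good hash algorithm makes hard to find.

```python
import hashlib

def toy_hash(data: bytes) -> int:
    # Deliberately weak: byte sum modulo 256, so reordering bytes never changes the result.
    return sum(data) % 256

print(toy_hash(b"ab") == toy_hash(b"ba"))  # True: two different inputs, one hash value

# A well-designed algorithm such as SHA-256 gives these two inputs different digests.
print(hashlib.sha256(b"ab").digest() == hashlib.sha256(b"ba").digest())  # False
```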
Hamming distance: in information coding, the number of bit positions at which two valid codewords differ is called the Hamming distance. Equivalently, it is the number of 1s in the XOR of the two binary strings.
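As a small sketch of that definition for values held as Python integers (the function name `hamming_distance` is chosen for this example):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 exactly where the two bit patterns differ; count those 1s.
    return bin(a ^ b).count("1")

# 1011101 XOR 1001001 = 0010100, which contains two 1s.
print(hamming_distance(0b1011101, 0b1001001))  # 2
```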
For common hash algorithms, see "Several common hash algorithms".
3. Locality-sensitive hashing
Locality-sensitive hashing (LSH) is similar in concept to a transformation between spatial domains.
Feature: it preserves data similarity.
If two texts are similar in the original data space, they are still similar after the hash conversion. Conversely, if the two texts are not similar, they remain dissimilar after the conversion.
This preservation is relative. Keeping data similarity does not mean guaranteeing 100% of the similarity, but keeping as much of it as possible (after dimensionality reduction).
3.1 Shingling
Shingling is a technique used for deduplication.
Map the string to be queried to a set of its substrings. For example, taking single characters as the units, the string "abcede" is mapped to the set {a, b, c, d, e}.
Note that the elements of the set are not repeated. This step is called shingling, which means building the set of short strings that occur in the document, that is, the shingle set.
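A minimal sketch of this step follows. The function name `shingles` and the parameter `k` (the shingle length) are assumptions for the example: the text's own example corresponds to `k = 1`, while longer shingles such as `k = 2` are shown only for comparison.

```python
def shingles(text: str, k: int = 1) -> set[str]:
    # Every contiguous substring of length k; using a set drops the repeats.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(shingles("abcede")))       # ['a', 'b', 'c', 'd', 'e']
print(sorted(shingles("abcede", k=2)))  # ['ab', 'bc', 'ce', 'de', 'ed']
```

The second "e" in "abcede" does not produce a second element, which is the non-repetition noted above.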
3.2 simhash
A traditional hash function maps the original content to a signature value as uniformly and randomly as possible. With a traditional hash function, if two signature values are equal, the original contents are the same with high probability. If the signature values are not equal, nothing is learned except that the original contents differ.
Traditional hash functions therefore cannot measure the similarity of the original content in the signature dimension. simhash, by contrast, is a locality-sensitive hash: the signature it generates reflects, to a certain extent, the similarity of the original content.
For text deduplication, many NLP-related algorithms can solve the problem with high accuracy. Deduplicating text at big-data scale, however, demands high algorithmic efficiency.
The biggest advantages of simhash:
- You can map text to numbers.
- Similar texts get similar signatures.
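The notes above do not spell out how a simhash signature is computed, so the following is only a sketch of the commonly described scheme, under several assumptions: tokens are whitespace-separated lowercase words, every token has weight 1, the per-token hash is the first 8 bytes of SHA-256, and the fingerprint is 64 bits wide. The names `simhash`, `hamming_distance`, and `BITS` are chosen for this example.

```python
import hashlib

BITS = 64  # fingerprint width (an assumption for this sketch)

def simhash(text: str) -> int:
    counts = [0] * BITS
    for token in text.lower().split():
        # Stable per-token hash: first 8 bytes of SHA-256 read as a 64-bit integer.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the votes at position i sum to a positive value.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("hash tables map keys to storage locations")
print(hamming_distance(a, b), hamming_distance(a, c))
```

Texts that share most of their tokens cast mostly the same votes, so their fingerprints tend to differ in only a few bits, and the Hamming distance between two fingerprints serves as a cheap similarity measure.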