String similarity algorithms

Source: Internet
Author: User

According to dongle2001's compiled introduction to string similarity algorithms, there are three main types of algorithm:

1. Edit distance (Levenshtein distance)

The edit distance is the minimum number of insertions, deletions, and substitutions required to convert the source string (s) into the target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and it is also commonly used to measure how much a text has changed from its original version. Edit distance is also called Levenshtein distance, after the Russian scientist Levenshtein, who first proposed it.
The Levenshtein distance is computed by dynamic programming. The idea is to compare the two strings from left to right, record the distance (not, strictly speaking, the similarity) between the prefixes that have already been compared, and use those values to obtain the distance at the next character position. Take gumbo and gambol as an example. To compute matrix entry D[3, 3], that is, the distance between gum and gam, we look at the three prefix pairs that have already been compared: gu-gam, gum-ga, and gu-ga. D[3, 3] is the minimum of D[2, 3] + 1, D[3, 2] + 1, and D[2, 2] plus 0 if the current characters match or 1 if they differ. The matrix is therefore filled in from top left to bottom right.
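The matrix construction described above can be sketched in Python as follows. This is a minimal illustration, not the implementation from any of the articles cited here; the function name `levenshtein` and the unit cost for every edit are assumptions made for the example.

```python
def levenshtein(s: str, t: str) -> int:
    """Return the edit distance between s and t, assuming unit cost per edit."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] holds the distance between the first i characters of s
    # and the first j characters of t.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of the prefix of s
    for j in range(cols):
        d[0][j] = j  # insert all j characters of the prefix of t
    # Fill the matrix from top left to bottom right.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[-1][-1]


print(levenshtein("gumbo", "gambol"))  # 2: substitute u -> a, then append l
```

Keeping the full matrix makes the correspondence with D[i, j] easy to see; if only the final distance is needed, two rows are enough.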

2. Longest common substring (LCS)
The LCS problem is to find the longest substring shared by two strings. One solution is to use a matrix that records, for every pair of positions in the two strings, whether the two characters match: 1 if they match, 0 otherwise. The longest diagonal run of 1s then marks the longest common substring, and its position in the matrix gives the position of that substring in each string.
For example, build the matching matrix between the string 21232523311324 (along the X direction) and the string 312123223445 (along the Y direction). The longest diagonal run of 1s corresponds to the longest common substring, which is 21232.
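A short Python sketch of the matrix approach follows. It is a common variation rather than a literal transcription: instead of storing 0/1 and scanning the diagonals afterwards, each cell stores the length of the diagonal run of matches that ends there. The function name `longest_common_substring` is an assumption made for the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the first longest substring of a that also occurs in b."""
    # run[i][j] = length of the diagonal run of matching characters
    # that ends at a[i - 1] and b[j - 1]; 0 if those characters differ.
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len = 0  # length of the longest run seen so far
    best_end = 0  # index in a just past the end of that run
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] > best_len:
                    best_len, best_end = run[i][j], i
    return a[best_end - best_len:best_end]


print(longest_common_substring("21232523311324", "312123223445"))  # 21232
```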


3. Cosine similarity (vector space algorithm)
The cosine theorem is an old mathematical result that is widely used across many disciplines and in practice. Here it is applied to string similarity: each string is represented as a vector, and the cosine of the angle between the two vectors measures how similar the strings are.
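As a rough illustration of the vector space idea, the sketch below turns each string into a vector of character counts and returns the cosine of the angle between the two vectors. The choice of character counts as features is an assumption made for this example; the text above does not say how the vectors are built, and words or n-grams are equally common choices. The function name `cosine_similarity` is likewise hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(s: str, t: str) -> float:
    """Cosine of the angle between the character-count vectors of s and t."""
    a, b = Counter(s), Counter(t)
    # Dot product over the characters that appear in both strings.
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty string matches nothing
    return dot / (norm_a * norm_b)
```

Because the vectors only count characters, this measure ignores order: "ab" and "ba" score as identical, which is a known limitation of the bag-of-features representation.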

For more information, see the original document.

Alibaba CPP's string similarity algorithm (Levenshtein distance algorithm) implements the LD algorithm in C++.

Thanh Dao's "An improvement on capturing similarity between strings" implements an improved LD algorithm in C#.
