After all, DNA matching is a little removed from everyday life. Since it is all just string matching, can we use it to build a spell checker?
First, we need to introduce the concept of edit distance. Edit distance is the number of single-character operations needed to turn one string into another. There are usually three kinds of operation:
1. Insert: "AC" ------> "ABC". The edit distance is 1.
2. Delete: "ABC" ------> "AC". The edit distance is 1.
3. Replace: "ABC" ------> "ADC". The edit distance is 1.
Here we define the edit distance between two strings x and y:
dist = len(x) + len(y) - score(x, y)
Here score is the score returned by the global alignment. But what should scoring_matrix be? I picked a few strings whose edit distance I could work out by hand and tried candidate values one by one. The scoring matrix is built from three parameters: diag_score (the two characters are the same), off_diag_score (two characters, but different), and dash_score (one character against a '-'). The values that work are:
diag_score: 2
off_diag_score: 1
dash_score: 0
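Why these values work: an alignment with m matching columns, s mismatched columns, and g gap columns satisfies len(x) + len(y) = 2m + 2s + g and scores 2m + s, so the difference is s + g, which is exactly the number of edits. The sketch below checks this with a small self-contained global alignment. It is only an illustration: the function names are my own, and it does not use the build_scoring_matrix or global alignment code from the earlier articles.

```python
def global_alignment_score(x, y, diag_score=2, off_diag_score=1, dash_score=0):
    """Best global alignment score of x and y (hypothetical helper for this sketch)."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against dashes.
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + dash_score
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + dash_score
    for i in range(1, rows):
        for j in range(1, cols):
            pair = diag_score if x[i - 1] == y[j - 1] else off_diag_score
            table[i][j] = max(
                table[i - 1][j - 1] + pair,       # align x[i-1] with y[j-1]
                table[i - 1][j] + dash_score,     # align x[i-1] with '-'
                table[i][j - 1] + dash_score,     # align '-' with y[j-1]
            )
    return table[-1][-1]


def edit_distance(x, y):
    """dist = len(x) + len(y) - score(x, y), with the scores 2 / 1 / 0."""
    return len(x) + len(y) - global_alignment_score(x, y)


if __name__ == "__main__":
    for a, b in [("AC", "ABC"), ("ABC", "AC"), ("ABC", "ADC")]:
        print(a, b, edit_distance(a, b))
```

The three pairs are the insert, delete, and replace examples from above, and each should come out with a distance of 1.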
Note that you should not pick strings that are too complex for this, because it is easy to miscount the edit distance by hand. For example, if we try to align "abcdged" with "acdad" ourselves, it is hard to find the best match.
The source code also includes a word_list.txt, which contains more than 70 thousand words. With it, we can find the words that are within a given edit distance of a given string.
def check_spelling(checked_word, dist, word_list):
    # Lowercase letters only; the scores are the ones worked out above.
    alphabet = set('qwertyuiopasdfghjklzxcvbnm')
    scoring_matrix = build_scoring_matrix(alphabet=alphabet, diag_score=2,
                                          off_diag_score=1, dash_score=0)
    answer = set()
    # Compare the checked word against every word in the list.
    for word in word_list:
        edit_distance = get_edit_distance(checked_word, word, scoring_matrix)
        if edit_distance <= dist:
            answer.add(word)
    return answer
get_edit_distance is simply a function that computes the edit distance with the formula given above. I tested it briefly: finding the similar words within an edit distance of 1 took more than a dozen seconds (and that is probably on the fast side, because my computer is well specced). But a spell checker needs to feel real-time. Once I mistype a word and press the space bar, the hint should appear right away. A dozen seconds is not acceptable.
-------- Aside --------
Here I tested the efficiency of set and list. Iterating over a list is faster than iterating over a set, while checking whether an element is in the container is far faster with a set than with a list. In my test the set also appeared to use less memory than the list, although in CPython a set's hash table normally takes more memory than a list holding the same elements, so that last result should be treated with caution.
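A comparison along these lines can be reproduced with timeit. This is a minimal sketch, not the test I originally ran: the data is synthetic (number strings instead of word_list.txt), and the exact timings will differ from machine to machine.

```python
import sys
import timeit

# Synthetic stand-in for the word list (assumption: 50,000 distinct strings).
words_list = [str(i) for i in range(50000)]
words_set = set(words_list)
target = "49999"  # last element, i.e. the worst case for the list scan

# Membership test: linear scan for the list, hash lookup for the set.
list_lookup = timeit.timeit(lambda: target in words_list, number=100)
set_lookup = timeit.timeit(lambda: target in words_set, number=100)

# Full iteration over each container.
list_iter = timeit.timeit(lambda: sum(1 for _ in words_list), number=20)
set_iter = timeit.timeit(lambda: sum(1 for _ in words_set), number=20)

# Size of the container itself (the string objects are shared by both).
list_bytes = sys.getsizeof(words_list)
set_bytes = sys.getsizeof(words_set)

if __name__ == "__main__":
    print("lookup   list: %.6fs  set: %.6fs" % (list_lookup, set_lookup))
    print("iterate  list: %.6fs  set: %.6fs" % (list_iter, set_iter))
    print("memory   list: %d bytes  set: %d bytes" % (list_bytes, set_bytes))
```

Note that sys.getsizeof only measures the container, not the strings it refers to.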
-------- End of aside --------
When the user enters a word, we can first check whether it is in the set of correctly spelled words (checking whether an object is in a set is very efficient). Only if the word is not in that set do we go on to show the words within an edit distance of 1 or 2.
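The flow just described could look roughly like this. It is a sketch under a few assumptions of mine: spell_check is a hypothetical name, find_similar is any function with the same signature as check_spelling, and trying distance 1 before distance 2 is my own choice. toy_find_similar is only a stand-in so the example runs on its own; it compares same-length words position by position and is not a real edit distance.

```python
def spell_check(word, word_set, find_similar, max_dist=2):
    """Return (is_correct, suggestions) for one typed word."""
    # Fast path: a set membership test, no alignment needed.
    if word in word_set:
        return True, set()
    # Slow path: only misspelled words pay for the similarity search.
    for dist in range(1, max_dist + 1):
        suggestions = find_similar(word, dist, word_set)
        if suggestions:
            return False, suggestions
    return False, set()


def toy_find_similar(word, dist, words):
    """Stand-in for check_spelling: same-length words differing in <= dist positions."""
    return {w for w in words
            if len(w) == len(word)
            and sum(a != b for a, b in zip(w, word)) <= dist}


if __name__ == "__main__":
    vocabulary = {"cat", "car", "dog"}
    print(spell_check("cat", vocabulary, toy_find_similar))
    print(spell_check("cax", vocabulary, toy_find_similar))
```

In the real program, check_spelling would be passed in place of toy_find_similar.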
This reminds me of the spell checker in early versions of Word: typing itself was fast, but I had to wait a long time for the spelling check to show up. Was it using an approach like this one?
Once we are in the lookup routine, we can compare word lengths before doing any matching. If the length difference is greater than the given dist, there is no need to run the match at all and the word can be discarded directly. This should improve the efficiency.
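The length check is safe because every insert or delete changes the length by exactly one, so the edit distance can never be smaller than the length difference. Here is a minimal sketch of the filter. The names similar_words and levenshtein are hypothetical, and levenshtein is a compact stand-in for get_edit_distance so that the example runs by itself.

```python
def levenshtein(a, b):
    """Edit distance with insert, delete, and replace each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # keep or replace
            ))
        previous = current
    return previous[-1]


def similar_words(checked_word, dist, word_list):
    """Words within `dist` edits, skipping words whose length rules them out."""
    answer = set()
    for word in word_list:
        # Cheap pre-filter: the length gap is a lower bound on the distance.
        if abs(len(word) - len(checked_word)) > dist:
            continue
        if levenshtein(checked_word, word) <= dist:
            answer.add(word)
    return answer


if __name__ == "__main__":
    words = ["hello", "help", "held", "he", "helicopter", "world"]
    print(similar_words("helo", 1, words))
```

How much time this saves depends on how the word lengths are distributed, so I have not put a number on it.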
In the Internet era, the performance of cloud computing speaks for itself. Online editors have to save in real time anyway, and sending a single word takes very little bandwidth. So the computation can be moved to the cloud, which makes it almost real-time. Checking whether a word is in the set of correctly spelled words can be done in JavaScript on the client, which both relieves the server and means the user does not feel any lag.
I wrote three articles about string matching in one go, which felt great. A plagiarism checker is coming next. Because I am job hunting at the moment, the fourth article may be a little late, so stay tuned.
Source code: GitHub
Character string matching 3: spelling error check program