Use MATLAB to find the "near words" (similar-looking words) of English words

Last Update:2018-12-05 Source: Internet

Author: User


Introduction

Recently, I have been studying Bo's vocabulary, which is about 12000 words.

It is very challenging to remember so many words in a short time. My personal habits are as follows:

Divide an unfamiliar long word into familiar short words, then make a sentence from the short words that includes the meaning of the new word;
Read more example sentences that use the new word to reinforce its meaning in context;
Use a word that looks very similar to the new word to help remember it;
Review several times, following the Ebbinghaus forgetting curve, to reduce forgetting.

The third method uses "near words", that is, words with a similar shape, as a memory aid.

I think this method is very useful, just like a friend introducing you to new friends.

In addition, when memorizing a word, it is easy to confuse it with its near words. It is therefore helpful to look up a word's near words while memorizing it.

Putting near words together makes them easier to remember and to tell apart.

For example:

complexity
complicacy
complicate
complicity
simplicity

Although many English learners and research institutions have compiled word lists on the internet, unfortunately, no dictionary offers a near-word search function.

Although wildcard characters can be used in dictionary software such as Kingsoft, searching for near words with wildcards is very troublesome.

So I wrote a small program to help me memorize words.

Method

This is essentially an approximate string matching problem [4]. It is the same problem that spell checkers [3] solve when they offer spelling suggestions and corrections based on word similarity.

Regular expressions can also be used, but you would have to define a large number of transformation rules yourself, which is very troublesome.
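To make the point concrete, here is a minimal MATLAB sketch of the regular expression approach. It handles only one kind of rule (a single substituted letter) by generating one pattern per letter position. The word list is a hypothetical in-memory sample, and the variable names are illustrative assumptions rather than part of the original program.

```matlab
% Sketch: find words that differ from the query by at most one substituted letter.
word  = 'simple';
words = {'simple', 'sample', 'dimple', 'simply', 'single', 'apple'};  % hypothetical sample list

% One rule per position: replace that letter with the wildcard '.'
pats = cell(1, length(word));
for k = 1:length(word)
    p = word;
    p(k) = '.';
    pats{k} = p;
end

% Combine the rules into one anchored alternation, e.g. ^(.imple|s.mple|...)$
pattern = ['^(' strjoin(pats, '|') ')$'];

% Keep the words for which regexp finds a match
isHit = ~cellfun(@isempty, regexp(words, pattern, 'once'));
hits  = words(isHit);
disp(hits);
```

Inserted or deleted letters, or two substitutions at once, would each need further families of patterns, which is why the rule set grows so quickly.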

Here, we mainly calculate the edit distance between two words [1].

A toy program is developed in MATLAB to implement near-word search using the classic Levenshtein distance [2] algorithm (for algorithm details, see Wikipedia and related articles). Word-to-word matching uses dynamic programming, so it is fast.

In addition, Jaro-Winkler distance [5] and phonetic distance [6] could be considered. I only use the Levenshtein distance here.

An intuitive example of the Levenshtein algorithm:
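As an illustration, the following MATLAB sketch fills in the dynamic programming table for two words from the near-word group in the introduction. It is a standalone script under my own naming (D, cost), not the code of the toy program itself.

```matlab
% Sketch: fill the Levenshtein table for two example words.
s1 = 'complicate';
s2 = 'complicity';
m = length(s1);
n = length(s2);

% D(i+1, j+1) holds the edit distance between s1(1:i) and s2(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = 0:m;   % turning s1(1:i) into the empty string takes i deletions
D(1, :) = 0:n;   % turning the empty string into s2(1:j) takes j insertions

for i = 1:m
    for j = 1:n
        cost = double(s1(i) ~= s2(j));               % 0 when the letters match
        D(i + 1, j + 1) = min([D(i, j + 1) + 1, ...  % deletion
                               D(i + 1, j) + 1, ...  % insertion
                               D(i, j) + cost]);     % substitution or match
    end
end

dist = D(m + 1, n + 1);
disp(D);
fprintf('Edit distance between "%s" and "%s": %d\n', s1, s2, dist);
```

The two words differ only in two letters (a/i and e/y), so the bottom-right cell of the table is 2.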

Usage

Provide the word you want to query;
Provide the similarity threshold n (n bounds the edit distance, that is, the number of single-character insertions, deletions, and substitutions needed to turn one word into the other);
Provide the dictionary you want to search (a txt word list is used here).

When the program runs, it traverses the word list and calculates the edit distance between each dictionary word and the query word. Finally, the threshold is used to select the most similar words.
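The steps above can be sketched end to end as a single MATLAB script. The word list here is a small hypothetical in-memory sample rather than a dictionary file, and the comparison uses a strict "less than" against the threshold, which is an assumption about how the threshold is meant to be applied.

```matlab
% Sketch: traverse a word list, compute each edit distance, filter by threshold.
query      = 'simple';
distThresh = 3;
wordList   = {'simple'; 'sample'; 'simply'; 'dimple'; 'complex'; 'banana'};  % hypothetical

editDist = zeros(size(wordList));
n = length(query);
for k = 1:numel(wordList)
    w = wordList{k};
    m = length(w);
    D = zeros(m + 1, n + 1);      % D(i+1, j+1): distance between w(1:i) and query(1:j)
    D(:, 1) = 0:m;
    D(1, :) = 0:n;
    for i = 1:m
        for j = 1:n
            cost = double(w(i) ~= query(j));
            D(i + 1, j + 1) = min([D(i, j + 1) + 1, D(i + 1, j) + 1, D(i, j) + cost]);
        end
    end
    editDist(k) = D(m + 1, n + 1);
end

% Keep the words whose distance is below the threshold
similarWords = wordList(editDist < distThresh);
fprintf('%s\n', similarWords{:});
```

With this sample, "complex" is exactly three edits away from "simple", so whether it is kept depends on choosing `<` or `<=` for the comparison.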

Effect

The word library is the CET-4/CET-6 (College English Test Band 4 and Band 6) word list.

Set the edit distance threshold to 3.

For the input word "simple", four near words are found in the level 4 and level 6 vocabulary (counting the word itself):

Code

Main function:

%% this is the main function
% Input :
%   wordToMatch - input word
%   distThresh  - edit distance threshold, usually use 3
%   dicPath     - file path of the word list file, txt format with every
%                 line of a single word
%
% Output :
%   command window output
%
% Created by visionfans @ 2011.07.20
function findSimilarWords(wordToMatch, distThresh, dicPath)
    %% check parameters and fill in defaults
    if nargin < 1
        error('Wrong arguments!');
    end
    if nargin < 2
        distThresh = 3;
    end
    if nargin < 3
        dicPath = '46.txt';
    end

    % calcEditDist reads the query word from this global variable
    global word;
    word = wordToMatch;

    %% load word list
    wordList = loadWordList(dicPath);

    %% calculate edit distance
    editDist = cellfun(@calcEditDist, wordList);

    %% filter the similar words (distance strictly below the threshold)
    similarWords = wordList(editDist < distThresh);

    %% display results
    fprintf('There are %d similar words with "%s" : \n', length(similarWords), word);
    cellfun(@(x) fprintf('\t%s\n', x), similarWords);
end

Levenshtein distance calculation function:

%% this function is used to calculate the Levenshtein Edit Distance
%
% S1 and S2 are two words you want to calculate their edit distance
%
% Created by visionfans @ 2011.07.20
function dist = calcEditDist(s1, s2)
    global word;
    if nargin == 1
        s2 = word;
    end

    %% calculate the edit distance with DP
    m = length(s1);
    n = length(s2);
    % an empty string (e.g. a blank dictionary line) never counts as a match
    if m*n == 0
        dist = Inf;
        return;
    end

    % table(i+1,j+1) is the distance between s1(1:i) and s2(1:j); the extra
    % row and column stand for the empty prefix, so the last letters of
    % both words are compared as well
    table = zeros(m+1, n+1);
    table(:,1) = 0:m;
    table(1,:) = 0:n;

    for i = 2:m+1
        for j = 2:n+1
            if s1(i-1) == s2(j-1)
                table(i,j) = table(i-1,j-1);
            else
                table(i,j) = 1 + min([table(i-1,j), table(i,j-1), table(i-1,j-1)]);
            end
        end
    end

    %% set result
    dist = table(m+1, n+1);
end

Dictionary loading function:

%% this function is used to load the dictionary file
% The dictionary file is a text file with the format of every line be a
% single word.
%
% You can find a word list file with adequate common words here:
%    Kevin's Word List Page - http://wordlist.sourceforge.net/
%
% Created by visionfans @ 2011.07.20
function wordList = loadWordList(dictPath)
    fprintf('Loading word list ...\n');
    fid = fopen(dictPath);
    if fid == -1
        error('Cannot open word list file: %s', dictPath);
    end
    wordList = {};   % stays empty if the file has no lines
    i = 1;
    tline = fgetl(fid);
    while ischar(tline)
        wordList{i,1} = tline;
        tline = fgetl(fid);
        i = i+1;
    end
    fclose(fid);
end
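As a possible alternative to the line-by-line loop, a word list with one word per line can also be read in a single call to textscan. The sketch below writes a small hypothetical word list to a temporary file so that it runs on its own. Note that '%s' splits on any whitespace, so this only suits lists whose entries contain no spaces.

```matlab
% Sketch: read a one-word-per-line file with textscan.
% Create a hypothetical temporary word list so the example is self-contained.
tmpPath = [tempname '.txt'];
fid = fopen(tmpPath, 'w');
fprintf(fid, '%s\n', 'simple', 'sample', 'simply');
fclose(fid);

% Read every whitespace-delimited token as one word.
fid = fopen(tmpPath, 'r');
C = textscan(fid, '%s');
fclose(fid);
wordList = C{1};   % N-by-1 cell array of words

delete(tmpPath);   % clean up the temporary file
disp(wordList);
```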

Supplement

More comprehensive dictionary files can be found in [7].

Thanks to the Jukuu network engineer YNYS for providing the word list.

The material for this article is available at [8]. If you are interested, you can try it out yourself.

---------------------------------------------------------------------------

This is for personal use, so I have not bothered to port it to C.

References

[1] Edit distance, http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html

[2] Levenshtein distance, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Levenshtein_distance

[3] How to Write a Spelling Corrector, http://norvig.com/spell-correct.html

[4] Approximate string matching, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Approximate_string_matching

[5] Jaro-Winkler distance, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Jaro-Winkler_distance

[6] Soundex, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Soundex

[7] Dictionary & Glossary Links; Downloadable Word Lists, http://www.net-comber.com/wordurls.html

[8] Vocabulary, http://download.csdn.net/source/3455828

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
