Using MATLAB to find the "near-shape words" (look-alike words) of English words

Source: Internet
Author: User
Introduction

Recently, I have been studying Bo's vocabulary, which contains about 12,000 words.

Memorizing so many words in a short time is very challenging. My personal habits are as follows:

  1. Break unfamiliar long words into familiar short words, then make a sentence from the short words that includes the meaning of the new word;
  2. Read more example sentences containing the new word to reinforce its meaning in context;
  3. Use a word that looks very similar to the new word to help remember it;
  4. Review several times following the Ebbinghaus forgetting curve to reduce forgetting.

The third method is to remember a word through its "near-shape words", that is, words that look very similar to it.

I think this method is very useful; it is like a friend introducing you to new friends.

In addition, when memorizing a word it is easy to confuse it with its near-shape words. It is therefore best to look up a word's near-shape words while memorizing it.

Putting near-shape words together makes them easier to remember and to tell apart.

For example:

complexity, complicacy, complicate, complicity, simplicity

Although many English learners and research institutions have compiled word lists on the internet, unfortunately no dictionary offers a near-word search function.

Although wildcard characters can be used in dictionary software such as Kingsoft, searching for near words with wildcards is very troublesome.

So I wrote a simple little program, which has helped me memorize words.

Method

This is essentially an approximate string matching problem [4]. It is the same problem that spell checkers [3] solve when they automatically offer spelling suggestions and corrections based on word similarity.

Regular expressions can also be used. However, you would have to define a large number of transformation rules yourself, which is very troublesome.
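To make the drawback of the regular-expression approach concrete, here is a minimal sketch. The word list and the pattern are assumptions chosen for illustration only; they do not come from the original program.

```matlab
% Hypothetical in-memory word list, used only for illustration.
words = {'complexity', 'complicacy', 'complicate', ...
         'complicity', 'simplicity', 'compile'};

% One hand-written rule: prefix "compl", suffix "ity", any letters between.
pattern = '^compl[a-z]*ity$';

% With a cell array input, regexp returns one result per word;
% an empty result means that word did not match.
hits = regexp(words, pattern, 'match', 'once');
isMatch = ~cellfun(@isempty, hits);
matches = words(isMatch);

fprintf('%s\n', matches{:});
```

This single rule finds "complexity" and "complicity" but misses "complicacy", "complicate", and "simplicity". Each missed variant would need another rule, which is why a distance-based measure is more convenient.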

Here, we mainly calculate the edit distance between two words [1].

A toy program was developed in MATLAB to implement near-word search using the classic Levenshtein distance [2] algorithm (for algorithm details, see Wikipedia and related articles). Word-to-word matching uses dynamic programming, so it is fast.

In addition, the Jaro-Winkler distance [5] and phonetic distance [6] could be considered. Only the Levenshtein distance is used here.

An intuitive example of the Levenshtein algorithm:
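As a worked illustration, the sketch below fills in the full dynamic-programming table for two example words. The word pair "kitten" / "sitting" and the variable names are assumptions chosen for illustration; they are not taken from the original article.

```matlab
% Example word pair (an assumption, chosen only for illustration).
s1 = 'kitten';
s2 = 'sitting';
m = length(s1);
n = length(s2);

% D(i+1, j+1) holds the edit distance between s1(1:i) and s2(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = (0:m)';   % i deletions turn s1(1:i) into the empty string
D(1, :) = 0:n;      % j insertions turn the empty string into s2(1:j)

for i = 1:m
    for j = 1:n
        cost = double(s1(i) ~= s2(j));   % 0 if the letters agree, else 1
        del = D(i, j + 1) + 1;           % delete s1(i)
        ins = D(i + 1, j) + 1;           % insert s2(j)
        sub = D(i, j) + cost;            % substitute, or keep a match
        D(i + 1, j + 1) = min([del, ins, sub]);
    end
end

disp(D);                       % the whole table
dist = D(m + 1, n + 1);        % bottom-right cell is the final distance
fprintf('distance(%s, %s) = %d\n', s1, s2, dist);
```

The bottom-right cell of the table is the answer. For this pair it is 3: substitute k with s, substitute e with i, and insert g at the end.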

Usage

  1. Provide the word you want to query;
  2. Provide the similarity threshold n (n is the edit distance threshold; two words match when one can be turned into the other within that many editing operations: insertion, deletion, and substitution);
  3. Provide the dictionary you want to search (a txt word list is used here);

When the program runs, it traverses the word list and calculates the edit distance between each dictionary word and the query word. Finally, the threshold is used to keep only the most similar words.
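The traverse-and-filter step described above can be sketched as follows. The function names `nearWords` and `levDist` are hypothetical, and the sketch takes an in-memory cell array instead of a file. It keeps words whose distance is at most `maxDist`, whereas the article's own code uses a strict comparison against the threshold.

```matlab
function similar = nearWords(query, wordList, maxDist)
% NEARWORDS  Hypothetical helper: keep the words within maxDist edits of query.
    dists = cellfun(@(w) levDist(w, query), wordList);  % one distance per word
    similar = wordList(dists <= maxDist);               % threshold filter
end

function d = levDist(a, b)
% LEVDIST  Levenshtein distance between two character vectors.
    m = length(a);
    n = length(b);
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';
    D(1, :) = 0:n;
    for i = 1:m
        for j = 1:n
            cost = double(a(i) ~= b(j));
            D(i + 1, j + 1) = min([D(i, j + 1) + 1, D(i + 1, j) + 1, D(i, j) + cost]);
        end
    end
    d = D(m + 1, n + 1);
end
```

For example, `nearWords('complicity', {'complexity', 'complicacy', 'complicate', 'complicity', 'simplicity', 'apple'}, 2)` keeps the five look-alike words and drops 'apple'.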

Effect

The word list used is the level 4 and level 6 vocabulary.

The edit distance threshold is set to 3.

For the query word, four near words are found in the level 4 and level 6 vocabulary (counting the word itself).

Code

Main function:

    %% This is the main function.
    % Input:
    %   wordToMatch - input word
    %   distThresh  - edit distance threshold, usually 3
    %   dicPath     - file path of the word list file, txt format with
    %                 a single word on every line
    %
    % Output:
    %   command window output
    %
    % Created by visionfans @ 2011.07.20
    function findSimilarWords(wordToMatch, distThresh, dicPath)
        %% check parameters and fill in defaults
        if nargin < 1
            error('Wrong arguments!');
        end
        if nargin < 2
            distThresh = 3;
        end
        if nargin < 3
            dicPath = '46.txt';
        end

        % calcEditDist reads the query word from this global
        global word;
        word = wordToMatch;

        %% load word list
        wordList = loadWordList(dicPath);

        %% calculate edit distance
        editDist = cellfun(@calcEditDist, wordList);

        %% filter the similar words (distance strictly below the threshold)
        similarWords = wordList(editDist < distThresh);

        %% display results
        fprintf('There are %d similar words with "%s" : \n', length(similarWords), word);
        cellfun(@(x) fprintf('\t%s\n', x), similarWords);
    end

Levenshtein distance calculation function:

    %% This function calculates the Levenshtein edit distance.
    %
    % s1 and s2 are the two words whose edit distance you want.
    %
    % Created by visionfans @ 2011.07.20
    function dist = calcEditDist(s1, s2)
        global word;
        if nargin == 1
            s2 = word;
        end

        %% calculate the edit distance with DP
        m = length(s1);
        n = length(s2);
        if m*n == 0
            % an empty string (e.g. a blank dictionary line) never matches
            dist = Inf;
            return;
        end

        % table(i,j) is the distance between s1(1:i-1) and s2(1:j-1), so the
        % table needs one extra row and column for the empty prefix
        table = zeros(m+1, n+1);
        table(:,1) = 0:m;
        table(1,:) = 0:n;

        for i = 2:m+1
            for j = 2:n+1
                if s1(i-1) == s2(j-1)
                    table(i,j) = table(i-1,j-1);
                else
                    table(i,j) = 1 + min([table(i-1,j), table(i,j-1), table(i-1,j-1)]);
                end
            end
        end

        %% set result
        dist = table(m+1, n+1);
    end

Dictionary loading function:

    %% This function loads the dictionary file.
    % The dictionary file is a text file with a single word on every line.
    %
    % You can find a word list file with adequate common words here:
    %    Kevin's Word List Page - http://wordlist.sourceforge.net/
    %
    % Created by visionfans @ 2011.07.20
    function wordList = loadWordList(dictPath)
        fprintf('Loading word list ...\n');
        fid = fopen(dictPath);
        if fid == -1
            error('Cannot open word list file: %s', dictPath);
        end

        wordList = {};   % stays empty if the file has no lines
        i = 1;
        tline = fgetl(fid);
        while ischar(tline)   % fgetl returns -1 at end of file
            wordList{i,1} = tline;
            tline = fgetl(fid);
            i = i+1;
        end
        fclose(fid);
    end

Supplement

There are many more comprehensive dictionary files, which can be found in [7].

Thanks to YNYS, a network engineer at Jukuu, for providing the word list.

The material for this article has been posted at [8]. If you are interested, you can try it yourself.

This was written for personal use, so I have not bothered to port it to C.

References

[1] Edit distance, http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html

[2] Levenshtein distance - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Levenshtein_distance

[3] How to Write a Spelling Corrector, http://norvig.com/spell-correct.html

[4] Approximate string matching - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Approximate_string_matching

[5] Jaro-Winkler distance - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Jaro-Winkler_distance

[6] Soundex - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Soundex

[7] Dictionary & Glossary Links; Downloadable Word Lists, http://www.net-comber.com/wordurls.html

[8] Vocabulary, http://download.csdn.net/source/3455828
