MMSEG Custom Word Segmentation Thesaurus

Last Update:2015-03-18 Source: Internet

Author: User



The following describes how to build your own Coreseek word segmentation dictionary. The dictionary that ships with Coreseek is not very large, so using it as-is may return a large number of useless results. Building a dedicated segmentation dictionary is essential for good search results.

I. First, go to Sogou at http://pinyin#sogou#com/dict/ and download the dictionaries you want.

II. The downloaded dictionaries are not text files, so they cannot be used directly and must first be converted to text. Search the Internet for a Sogou-to-Google dictionary conversion tool and use it to convert every dictionary you downloaded into a text file. Merge the results into a single file and save it with UTF-8 encoding. If you want to use my conversion tool below, the file must be named words.txt. If you would rather do the conversion yourself, refer to the official method at http://www#coreseek#cn/opensource/mmseg/
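
The merge in step II can be scripted. The following is a minimal sketch, not part of the original tool: the function name merge_word_lists and the directory name in the usage note are hypothetical, and it assumes the converted files are already UTF-8 with one word per line. It strips a leading byte order mark, drops blank lines and duplicates, and writes CRLF line endings because the conversion script below splits on "\r\n".

```php
<?php
// Hypothetical helper: merge several converted word lists into one UTF-8 file.
// Assumes each input file is already UTF-8 with one word per line.
function merge_word_lists(array $files, string $target): int
{
    $words = [];
    foreach ($files as $file) {
        $text = file_get_contents($file);
        if ($text === false) {
            continue; // skip unreadable files
        }
        // Strip a UTF-8 byte order mark if the converter wrote one.
        if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
            $text = substr($text, 3);
        }
        // Split on explicit line endings only, so multibyte characters stay intact.
        foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
            $line = trim($line);
            if ($line !== '') {
                $words[$line] = true; // array keys remove duplicates
            }
        }
    }
    // The conversion script splits on "\r\n", so write CRLF line endings.
    file_put_contents($target, implode("\r\n", array_keys($words)));
    return count($words);
}
```

For example, merge_word_lists(glob('converted/*.txt'), 'words.txt') would merge every text file in a (hypothetical) converted directory and return the number of distinct words written.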

III. We now have an initial word list, but it cannot be used directly; it has to be cleaned up and converted to the format Coreseek uses. Here is a small program I wrote to make the conversion easier. The source is as follows:

<?php
/**
 * Last edit 2012-8-11
 * [Email protected] www.4ji.cn
 */
ini_set('max_execution_time', '6000');

// Turn off output buffering so that progress messages appear immediately.
$buffer = ini_get('output_buffering');
if ($buffer) ob_end_flush();

echo "Processing new thesaurus ...\n";
flush();
$filename = "words.txt";
$handle = fopen($filename, "r");
$content = fread($handle, filesize($filename));
fclose($handle);

$content = trim($content);
$arr1 = explode("\r\n", $content);
$arr1 = array_flip(array_flip($arr1)); // remove duplicate words
foreach ($arr1 as $key => $value) {
    $value = dealchinese($value); // keep only the Chinese characters
    if (!empty($value)) {
        $arr1[$key] = $value;
    } else {
        unset($arr1[$key]);
    }
}

echo "Processing original thesaurus ...\n"; flush();
$filename2 = "unigram.txt";
$handle2 = fopen($filename2, "r");
$content2 = fread($handle2, filesize($filename2));
fclose($handle2);
$content2 = dealchinese($content2, "\r\n"); // one existing word per line
$arr2 = explode("\r\n", $content2);

echo "Deleting terms that already exist ...\n"; flush();
$array_diff = array_diff($arr1, $arr2);

echo "Formatting thesaurus ...\n"; flush();
$words = '';
foreach ($array_diff as $k => $word) {
    // Each entry: word, tab, frequency 1, then a second line "x:1".
    $words .= $word . "\t1\r\nx:1\r\n";
}
echo $words;
file_put_contents('words_new.txt', $words, FILE_APPEND);
echo 'done!';

function dealchinese($str, $join = '') {
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches); // match every run of Chinese characters
    $str = join($join, $matches[0]); // rejoin the matches
    return $str;
}
?>
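
The core of the script is three operations: keep only the Chinese characters of each word, drop words that unigram.txt already contains, and write each remaining word as a two-line entry. The sketch below isolates that logic on in-memory arrays so it can be tried without any files. The function names keep_chinese and build_entries are hypothetical and not part of the original tool; the entry layout is copied from the script above.

```php
<?php
// Keep only characters in U+4E00..U+9FFF, as dealchinese() does in the script.
function keep_chinese(string $s): string
{
    // preg_match_all returns 0 (no match) or false (error); both mean "nothing to keep".
    if (!preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $s, $m)) {
        return '';
    }
    return implode('', $m[0]);
}

// Hypothetical helper: build unigram.txt entries for words that are not yet known.
function build_entries(array $newWords, array $existingWords): string
{
    $clean = array_map('keep_chinese', $newWords);
    $clean = array_filter($clean, function ($w) { return $w !== ''; });
    $clean = array_unique($clean);
    $fresh = array_diff($clean, $existingWords);

    $out = '';
    foreach ($fresh as $word) {
        // Word, tab, frequency 1, then a second line "x:1".
        $out .= $word . "\t1\r\nx:1\r\n";
    }
    return $out;
}
```

Calling build_entries with a word that is already in the existing list, or with a string that contains no Chinese characters, produces no entry for it.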

Here's how to use it:

1. Put three files in the same directory on a server that can run PHP: words.txt, the conversion tool words_format.php, and c:\coreseek\etc\unigram.txt.

2. Then visit words_format.php in a browser.

3. Wait for the program to finish. How long it takes depends on how many words you have; with a very large word list the page may appear to hang partway through. When it finishes, a file named words_new.txt is produced in the same directory. Back up the original unigram.txt, then append the contents of words_new.txt to it and save.

4. Copy the unigram.txt obtained above to C:\coreseek\bin, change to C:\coreseek\bin on the command line, and run mmseg -u unigram.txt. When the command finishes, a file named unigram.txt.uni is generated in the directory containing unigram.txt. Rename that file to uni.lib to complete the dictionary.

5. Test whether the new dictionary is being used for segmentation. Create a new text file test.txt under C:\coreseek\bin and enter the text you want to test, for example: Four Seasons clothing network in large fabric accessories, then save it. Be sure to include a word from your new dictionary; here, Four Seasons Clothing network is a word I added. Then, in the same command line, run mmseg -d C:\coreseek\bin test.txt > result.txt. When it finishes, open the newly generated result.txt. If the segmentation result looks similar to Four Seasons clothing net/x in the big/x Fabric/x Accessories/x, the new dictionary is in effect. If not, check what went wrong and rebuild it.

View help: /usr/local/mmseg3/bin/mmseg

6. Copy the resulting uni.lib to C:\coreseek\etc, overwriting the original file, and you're done.
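
Two of the manual steps above are easy to get wrong: appending words_new.txt to unigram.txt without keeping a backup (step 3), and reading result.txt by eye (step 5). The following is a rough sketch of both, under stated assumptions. The function names append_new_words and is_single_token are hypothetical, and the check assumes that mmseg -d separates tokens with "/x", as the example output in step 5 suggests.

```php
<?php
// Hypothetical helper for step 3: back up the dictionary, then append the new entries.
function append_new_words(string $unigram, string $newWords): bool
{
    if (!copy($unigram, $unigram . '.bak')) {
        return false; // refuse to modify the dictionary without a backup
    }
    $existing = file_get_contents($unigram);
    $extra = file_get_contents($newWords);
    if ($existing === false || $extra === false) {
        return false;
    }
    // Make sure the original ends with a line break before appending.
    if ($existing !== '' && substr($existing, -1) !== "\n") {
        $extra = "\r\n" . $extra;
    }
    return file_put_contents($unigram, $extra, FILE_APPEND) !== false;
}

// Hypothetical helper for step 5: was $word kept as a single token?
// Assumes tokens in the output are separated by "/x".
function is_single_token(string $resultText, string $word): bool
{
    $tokens = array_map('trim', explode('/x', $resultText));
    return in_array($word, $tokens, true);
}
```

For example, is_single_token(file_get_contents('result.txt'), $word) returns true only when the new word appears as one token rather than being split into smaller pieces.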


