MMSEG Custom Word Segmentation Dictionary

Source: Internet
Author: User

This article describes how to build your own word segmentation dictionary for Coreseek. The dictionary bundled with Coreseek is not very large, and using it as-is may return a large number of useless results. To get good search results it is essential to build a dedicated segmentation dictionary.

I. First, go to Sogou at http://pinyin.sogou.com/dict/ and download the dictionaries you want.

II. The downloaded dictionaries are not text files, so they cannot be used directly and must first be converted to text. Search the Internet for a Sogou-to-Google dictionary conversion tool and use it to convert every dictionary you downloaded into a text file, then merge the results into a single file named words.txt. The file must be saved in UTF-8 encoding, and if you want to use my conversion tool below as-is, the file name must be words.txt. If you prefer to do the conversion yourself, refer to the official method at http://www.coreseek.cn/opensource/mmseg/

III. We now have an initial word list, but it cannot be used directly yet: it still has to be cleaned up and converted to the format Coreseek uses. To make the conversion easier, I wrote a small program. Its source is as follows:

<?php
/**
 * Last edit 2012-8-11
 * [Email protected] www.4ji.cn
 **/
ini_set('max_execution_time', '6000');

// Turn off output buffering so progress messages reach the browser immediately.
$buffer = ini_get('output_buffering');
if ($buffer) ob_end_flush();

echo 'Processing new dictionary ...<br />';
flush();
$filename = "words.txt";
$handle = fopen($filename, "r");
$content = fread($handle, filesize($filename));
fclose($handle);

$content = trim($content);
$arr1 = explode("\r\n", $content);
$arr1 = array_flip(array_flip($arr1)); // drop duplicate entries
foreach ($arr1 as $key => $value) {
    // Keep only the Chinese characters of each entry; discard empty entries.
    $value = dealchinese($value);
    if (!empty($value)) {
        $arr1[$key] = $value;
    } else {
        unset($arr1[$key]);
    }
}

echo 'Processing original dictionary ...<br />';
flush();
$filename2 = "unigram.txt";
$handle2 = fopen($filename2, "r");
$content2 = fread($handle2, filesize($filename2));
fclose($handle2);
$content2 = dealchinese($content2, "\r\n");
$arr2 = explode("\r\n", $content2);

echo 'Deleting terms that already exist ...<br />';
flush();
$array_diff = array_diff($arr1, $arr2);

echo 'Formatting dictionary ...<br />';
flush();
$words = '';
foreach ($array_diff as $k => $word) {
    $words .= $word . "\t1\r\nx:1\r\n";
}
echo $words;
file_put_contents('words_new.txt', $words, FILE_APPEND);
echo 'done!';

function dealchinese($str, $join = '') {
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches); // match every run of Chinese characters
    $str = join($join, $matches[0]); // rejoin the matched runs
    return $str;
}
?>
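
To make the output format of the script easier to see, here is a minimal sketch of the formatting step on its own. The function name formatUnigramEntries is hypothetical and introduced only for illustration. Each word becomes two lines: the word, a tab and the frequency 1, followed by a line containing x:1, exactly as the formatting loop in the script writes them.

```php
<?php
// Hypothetical helper (not part of Coreseek or mmseg): it mirrors the
// formatting loop of the conversion script above.
function formatUnigramEntries(array $words): string
{
    $out = '';
    foreach ($words as $word) {
        // Line 1: the word, a tab, and the frequency 1.
        // Line 2: the "x:1" line that the script writes after every word.
        $out .= $word . "\t1\r\n" . "x:1\r\n";
    }
    return $out;
}

// Print two sample entries with tab, CR and LF shown as \t, \r and \n.
echo addcslashes(formatUnigramEntries(['四季服装网', '面料']), "\t\r\n"), "\n";
```

The sample words are placeholders; any list of words produced by the cleaning step can be passed in the same way.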

Here's how to use the conversion tool:

1. Put three files in the same directory on a server that can run PHP: words.txt, the conversion tool words_format.php, and c:\coreseek\etc\unigram.txt.

2. Then visit words_format.php.

3. Wait for the program to finish. How long it takes depends on how many words you have; with a very large word list the page may appear to hang partway through. When it finishes, a file named words_new.txt is generated in the same directory. Append its contents to the original unigram.txt and save the result, keeping a backup.

4. Copy the unigram.txt obtained above to C:\coreseek\bin, open a command line in C:\coreseek\bin, and run mmseg -u unigram.txt. When the command finishes, a file named unigram.txt.uni is generated in the directory where unigram.txt is located. Rename that file to uni.lib to complete the dictionary build.

5. Test whether the new dictionary is actually used for segmentation. Create a new text file test.txt under C:\coreseek\bin and enter the keywords you want to test, for example: Four Seasons clothing network in large fabric accessories, then save it. Be sure to include a keyword from your new dictionary; in this example, Four Seasons clothing network is one of my new keywords. Then, in the same command line, run mmseg -d C:\coreseek\bin test.txt>result.txt. When it finishes, open the newly generated result.txt. If you see a segmentation result similar to Four Seasons clothing net/x in the big/x Fabric/x Accessories/x, the new dictionary is in effect. Otherwise, check what went wrong and rebuild the dictionary.

View help: /usr/local/mmseg3/bin/mmseg


6. Copy the resulting uni.lib to C:\coreseek\etc, overwriting the original file, and you're done.
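
The check in step 5 can also be done in code rather than by eye. The following is a minimal sketch: the function name isSingleToken is hypothetical, and the sample line is an assumed illustration of the /x-separated output that step 5 describes, not real output captured from mmseg.

```php
<?php
// Hypothetical helper: split segmenter output on the "/x" marker and report
// whether $word came out as one whole token.
function isSingleToken(string $segmented, string $word): bool
{
    $tokens = array_map('trim', explode('/x', $segmented));
    return in_array($word, $tokens, true);
}

// Assumed sample line in the shape step 5 describes.
$sample = '四季服装网/x 中大/x 面料/x 辅料/x ';
var_dump(isSingleToken($sample, '四季服装网')); // the new keyword kept whole
var_dump(isSingleToken($sample, '服装网'));     // only a fragment of a token
```

In practice the string would come from file_get_contents('result.txt'). If the new keyword does not appear as a single token, the dictionary was probably not rebuilt or not copied to the right place.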


