Building Your Own Custom Coreseek Word Segmentation Dictionary

Source: Internet
Author: User


This article explains how to build your own Coreseek word segmentation dictionary. The dictionary that ships with Coreseek is not very large, so using it as-is may return many irrelevant results. To get accurate search results, you need to build a dictionary tailored to your own content.

I. First, go to Sogou (http://pinyin.sogou.com/dict/) and download the word lists you want.

II. The downloaded word lists are not text files, so they cannot be used directly and must first be converted to text. Search online for a small conversion tool that can export the downloaded word lists as text files, then merge the exported files into a single file named words.txt. Save the file in UTF-8 encoding. The filename must be words.txt if you want to convert it directly with the script I provide below. If you would rather do the conversion yourself, refer to the method described on the official site: http://www.coreseek.cn/opensource/mmseg/
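The merge step can be sketched in PHP as follows. This is only a minimal sketch, assuming the conversion tool produced several UTF-8 text files with one word per line; the function name merge_word_lists and the export directory in the usage comment are hypothetical and are not part of Coreseek or of any particular conversion tool.

```php
<?php
// Hypothetical helper: merge several one-word-per-line UTF-8 files into one list.
function merge_word_lists(array $files, string $target): int
{
    $words = [];
    foreach ($files as $file) {
        // Accept Windows, old Mac, and Unix line endings in the exported files.
        $lines = preg_split('/\r\n|\r|\n/', file_get_contents($file));
        foreach ($lines as $line) {
            // Strip a UTF-8 byte order mark and surrounding whitespace.
            $word = trim(str_replace("\xEF\xBB\xBF", '', $line));
            if ($word !== '') {
                $words[$word] = true; // array keys remove duplicates
            }
        }
    }
    // The conversion script splits on "\r\n", so write Windows line endings.
    file_put_contents($target, implode("\r\n", array_keys($words)));
    return count($words);
}

// Example usage (assumed layout: exported files sit in ./export):
// merge_word_lists(glob(__DIR__ . '/export/*.txt'), __DIR__ . '/words.txt');
```

Writing the merged list without a byte order mark and with "\r\n" line endings keeps it compatible with a script that splits words.txt on "\r\n".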

III. We now have a preliminary word list, but it still cannot be used directly; it has to be cleaned up and converted to the format Coreseek uses. Here is a small program I wrote to make the conversion easier. The source is as follows:

<?php
/**
 * Last edit 2012-8-11
 * copyright@ www.4ji.cn
 */
ini_set('max_execution_time', '6000');

// Turn off output buffering so progress messages reach the browser immediately.
$buffer = ini_get('output_buffering');
if ($buffer) ob_end_flush();

echo 'Processing the new word list...<br />';
flush();
$filename = "words.txt";
$handle = fopen($filename, "r");
$content = fread($handle, filesize($filename));
fclose($handle);

$content = trim($content);
$arr1 = explode("\r\n", $content);
$arr1 = array_flip(array_flip($arr1)); // remove duplicate words
foreach ($arr1 as $key => $value) {
    $value = dealchinese($value); // keep only the Chinese characters
    if (!empty($value)) {
        $arr1[$key] = $value;
    } else {
        unset($arr1[$key]);
    }
}

echo 'Processing the original dictionary...<br />';
flush();
$filename2 = "unigram.txt";
$handle2 = fopen($filename2, "r");
$content2 = fread($handle2, filesize($filename2));
fclose($handle2);
$content2 = dealchinese($content2, "\r\n");
$arr2 = explode("\r\n", $content2);

echo 'Removing entries that already exist...<br />';
flush();
$array_diff = array_diff($arr1, $arr2);

echo 'Formatting the new entries...<br />';
flush();
$words = '';
foreach ($array_diff as $k => $word) {
    $words .= $word . "\t1\r\nx:1\r\n";
}
//echo $words;
file_put_contents('words_new.txt', $words, FILE_APPEND);
echo 'done!';

function dealchinese($str, $join = '') {
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches); // match every run of Chinese characters
    $str = join($join, $matches[0]); // rejoin the matched runs
    return $str;
}
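The formatting loop in the script writes each new word as a two-line entry: the word, a tab, and the number 1, followed by a line reading x:1. The sketch below isolates that step so the layout is easier to see. The function name format_unigram_entry is hypothetical, and treating the number as a word frequency that can be changed is an assumption; the script itself always writes 1.

```php
<?php
// Hypothetical helper mirroring the formatting loop in the conversion script.
// Assumption: the number after the tab is a word frequency; the script always uses 1.
function format_unigram_entry(string $word, int $freq = 1): string
{
    // Line 1: word, tab, number. Line 2: "x:" followed by the same number.
    return $word . "\t" . $freq . "\r\n" . "x:" . $freq . "\r\n";
}

// Prints one two-line entry for a sample word.
echo format_unigram_entry('面料');
```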

Use it as follows:

1. Put three files in the same directory on a server that can run PHP: words.txt, the conversion script words_format.php, and C:\coreseek\etc\unigram.txt.

2. Then open words_format.php in a browser.

3. Wait for the script to finish. The run time depends on how many words you have; with a very large list the page may appear to hang partway through. When it finishes, it produces words_new.txt in the same directory. Append the contents of this file to the end of the original unigram.txt and save the result for the next step.

4. Copy the resulting unigram.txt to C:\coreseek\bin, then open a command line in C:\coreseek\bin and run mmseg -u unigram.txt. When the command completes, a file named unigram.txt.uni is generated in the same directory as unigram.txt. Rename it to uni.lib to finish building the dictionary.

5. Test whether the new dictionary segments text correctly. Create a new text file test.txt under C:\coreseek\bin and enter the text you want to test, for example: Four Seasons Clothing Network large fabric accessories. Then save the file. Be sure the text contains a keyword from your new word list; in this example, Four Seasons Clothing Network is my new keyword. Then, at the same command line, run mmseg -d C:\coreseek\bin test.txt>result.txt. When it finishes, open the newly generated result.txt. If the segmentation result looks similar to Four Seasons Clothing Network/x fabric/x accessories/x, the dictionary was generated correctly. If the new keyword has been split apart, similar to Four/x Seasons/x Clothing/x Network/x fabric/x accessories/x, the new dictionary is not correct. Check what went wrong and rebuild it.

6. Copy uni.lib to C:\coreseek\etc, overwriting the original file.
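The check in step 5 can also be done with a short script instead of reading result.txt by eye. This is a minimal sketch, assuming mmseg marks the end of each token with /x followed by a space, as the results described in step 5 suggest; the function name keyword_kept_whole and the placeholder keyword in the usage comment are hypothetical.

```php
<?php
// Hypothetical helper: true if $keyword appears as one whole token in mmseg output.
function keyword_kept_whole(string $output, string $keyword): bool
{
    // Assumed token separator: "/x" followed by optional whitespace.
    $tokens = preg_split('#/x\s*#u', trim($output), -1, PREG_SPLIT_NO_EMPTY);
    // Strict comparison so the keyword must match a token exactly.
    return in_array($keyword, array_map('trim', $tokens), true);
}

// Example usage (replace the placeholder with a keyword from your own word list):
// var_dump(keyword_kept_whole(file_get_contents('result.txt'), 'your keyword'));
```

A result of false means the keyword was split into smaller tokens, which points back to a problem in the word list or in the dictionary build.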
