A Chinese Word Segmentation Search Tool for ASP.NET
Jieba is a Chinese word segmentation library for Python. Someone has ported it to the .NET platform, where it can fully replace the combination of Lucene.Net and Pangu word segmentation.
I wrote this because, in an interview yesterday, I was asked how to implement keyword search on a website. I talked about SQL fuzzy queries, SQL statement optimization, and caching. I had come across keyword segmentation before, but the .NET platform has no mature segmentation-and-search library. Java has Lucene; it has been ported to .NET as well, but the port is updated slowly. While learning Python I had noticed its word segmentation and word cloud libraries, so I wondered whether a Python segmentation library had been ported to .NET. I looked up Python's jieba library and, sure enough, there is a port!
Introduction from the original article, "The .NET version of jieba Chinese word segmentation: jieba.NET":
On the .NET platform, the commonly used word segmentation component is Pangu, but it has not been updated for a long time. The most obvious difference is the built-in dictionary: the jieba dictionary has about 500,000 entries, while the Pangu dictionary has about 170,000, so the two produce different segmentation results. In addition, for out-of-vocabulary words, jieba uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm. The results look good.
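To make the HMM step a little more concrete, here is a minimal sketch of Viterbi decoding over the four tags B/M/E/S (begin, middle, end of a word, and single-character word), the tagging scheme jieba uses for unknown words. This is only an illustration: the `ToyViterbi` class, its method names, and the probabilities are made-up toy values, not jieba's trained model or jieba.NET's actual code.

```csharp
using System;
using System.Collections.Generic;

public static class ToyViterbi
{
    // Tag order used for all arrays: 0 = B, 1 = M, 2 = E, 3 = S.
    const string Tags = "BMES";

    static double L(double p) { return Math.Log(p); } // Math.Log(0) is -Infinity

    // Toy start probabilities (assumption): a sentence can only start with B or S.
    public static readonly double[] ToyStart = { L(0.6), L(0), L(0), L(0.4) };

    // Toy transition probabilities (assumption), rows = from, columns = to.
    public static readonly double[,] ToyTrans =
    {
        //  to B     to M     to E     to S
        { L(0),   L(0.3), L(0.7), L(0)   }, // from B
        { L(0),   L(0.3), L(0.7), L(0)   }, // from M
        { L(0.5), L(0),   L(0),   L(0.5) }, // from E
        { L(0.5), L(0),   L(0),   L(0.5) }, // from S
    };

    // Returns the most likely tag string; emit(state, ch) is a log-probability.
    public static string Decode(string text, double[] start, double[,] trans,
                                Func<int, char, double> emit)
    {
        int n = text.Length;
        if (n == 0) return string.Empty;

        var score = new double[n, 4]; // best log-score of a path ending in each tag
        var back = new int[n, 4];     // previous tag on that best path

        for (int s = 0; s < 4; s++)
            score[0, s] = start[s] + emit(s, text[0]);

        for (int t = 1; t < n; t++)
        {
            for (int s = 0; s < 4; s++)
            {
                int best = 0;
                double bestScore = double.NegativeInfinity;
                for (int p = 0; p < 4; p++)
                {
                    double cand = score[t - 1, p] + trans[p, s];
                    if (cand > bestScore) { bestScore = cand; best = p; }
                }
                score[t, s] = bestScore + emit(s, text[t]);
                back[t, s] = best;
            }
        }

        // The last character must close a word, so only E or S may end the path.
        int last = score[n - 1, 2] >= score[n - 1, 3] ? 2 : 3;
        var tags = new char[n];
        for (int t = n - 1; t >= 0; t--)
        {
            tags[t] = Tags[last];
            last = back[t, last];
        }
        return new string(tags);
    }

    // Cuts the text after every E or S tag.
    public static List<string> Split(string text, string tags)
    {
        var words = new List<string>();
        int begin = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (tags[i] == 'E' || tags[i] == 'S')
            {
                words.Add(text.Substring(begin, i - begin + 1));
                begin = i + 1;
            }
        }
        return words;
    }
}
```

With a uniform emission function, `ToyViterbi.Decode("abcd", ToyViterbi.ToyStart, ToyViterbi.ToyTrans, (s, ch) => 0.0)` yields the tags `BEBE`, which `Split` turns into the two words `ab` and `cd`. In a real model the emission scores depend on the character, which is what lets the decoder recognize words that are not in the dictionary.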
Source code on GitHub: https://github.com/anderscui/jieba.NET
You can also search for the package and install it directly from the NuGet Package Manager in VS2013.
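Some people in the comments said it would already be impressive if the library could correctly segment the sentence "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作" (roughly: "every month, when passing the subordinate departments, the female officer of the industry and information office personally gives instructions on installing 24-port switches and other technical devices"). I tested it myself: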
using System;
using JiebaNet.Segmenter;

var segmenter = new JiebaSegmenter();
// Test sentence, roughly: "every month, when passing the subordinate departments, the female officer of the industry
// and information office personally gives instructions on installing 24-port switches and other technical devices".
const string sentence = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作";
Console.WriteLine("Original sentence: {0}", sentence);
var segments1 = segmenter.Cut(sentence, cutAll: true); // full mode: every dictionary word found in the text
Console.WriteLine("[Full mode]: {0}", string.Join("/", segments1));
var segments2 = segmenter.Cut(sentence); // accurate mode is the default
Console.WriteLine("[Accurate mode]: {0}", string.Join("/", segments2));
var segments3 = segmenter.Cut(sentence); // accurate mode also uses the HMM for new words by default
Console.WriteLine("[New word recognition]: {0}", string.Join("/", segments3));
var segments4 = segmenter.CutForSearch(sentence); // search engine mode
Console.WriteLine("[Search engine mode]: {0}", string.Join("/", segments4));
var segments5 = segmenter.Cut(sentence);
Console.WriteLine("[Disambiguation]: {0}", string.Join("/", segments5));
Console.Read();
Running result:
Good: except for full mode, every mode segments the sentence in the order we would naturally read it.
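Coming back to the interview question of how to search a site by keyword, here is a minimal sketch of how segmentation output could feed a keyword index. The `TinyIndex` class is hypothetical and not part of jieba.NET. It accepts any tokenizer function, so it makes no assumption about the segmenter beyond "text in, terms out", and it only does exact term matching with AND semantics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a tiny in-memory inverted index (term -> document ids).
public sealed class TinyIndex
{
    private readonly Func<string, IEnumerable<string>> _tokenize;
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>();

    public TinyIndex(Func<string, IEnumerable<string>> tokenize)
    {
        if (tokenize == null) throw new ArgumentNullException("tokenize");
        _tokenize = tokenize;
    }

    // Segments the text and records the document id under every term.
    public void Add(int docId, string text)
    {
        foreach (var term in _tokenize(text))
        {
            if (string.IsNullOrWhiteSpace(term)) continue;
            HashSet<int> ids;
            if (!_postings.TryGetValue(term, out ids))
            {
                ids = new HashSet<int>();
                _postings[term] = ids;
            }
            ids.Add(docId);
        }
    }

    // Returns the ids of documents that contain every term of the query, sorted.
    public List<int> Search(string query)
    {
        HashSet<int> result = null;
        var terms = _tokenize(query).Where(t => !string.IsNullOrWhiteSpace(t)).Distinct();
        foreach (var term in terms)
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(term, out ids)) return new List<int>();
            if (result == null) result = new HashSet<int>(ids);
            else result.IntersectWith(ids);
        }
        return result == null ? new List<int>() : result.OrderBy(id => id).ToList();
    }
}
```

With jieba.NET installed, the tokenizer could be `text => segmenter.CutForSearch(text)`, so that both the indexed pages and the user's query are split the same way. For anything beyond a small site, a database full-text index or a dedicated search engine is likely a better fit than an in-memory dictionary.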