.NET Core Chinese Word Segmentation Component: jieba.NET Core

Last Update:2017-05-15 Source: Internet

Author: User

Features

Three segmentation modes are supported:
- Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every span of the sentence that can form a word. It is very fast but cannot resolve ambiguity. Specifically, segmentation in this mode does not use word frequencies to find the maximum-probability path, nor does it use the HMM.
- Search engine mode: builds on precise mode by segmenting long words again to improve recall; suitable for search engine indexing.
Traditional Chinese segmentation is supported.
Custom dictionaries and custom words can be added.
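
To make the idea behind full mode concrete, here is a minimal, self-contained sketch that emits every substring found in a dictionary, with no frequencies and no HMM. It is a toy illustration only, not jieba.NET's implementation: the class name FullModeSketch, the method CutAll and the seven-word dictionary are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

static class FullModeSketch
{
    // Toy "full mode": scan left to right and emit every substring that
    // is a dictionary word. Overlapping words are all kept, which is why
    // this style of scan cannot resolve ambiguity.
    public static List<string> CutAll(string sentence, ISet<string> dictionary)
    {
        var words = new List<string>();
        for (var start = 0; start < sentence.Length; start++)
        {
            for (var length = 1; start + length <= sentence.Length; length++)
            {
                var candidate = sentence.Substring(start, length);
                if (dictionary.Contains(candidate))
                {
                    words.Add(candidate);
                }
            }
        }
        return words;
    }

    static void Main()
    {
        // Hypothetical mini dictionary; a real dictionary is far larger.
        var dictionary = new HashSet<string>
        {
            "我", "来到", "北京", "清华", "清华大学", "华大", "大学"
        };
        var words = CutAll("我来到北京清华大学", dictionary);
        Console.WriteLine(string.Join("/", words));
    }
}
```

With this toy dictionary the overlapping candidates 清华, 清华大学, 华大 and 大学 are all emitted. Precise mode would instead have to choose one consistent path through them, which is where word frequencies come in.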

jieba.NET Core Usage

Download the code and open it with VS 2017, or open the project using VS Code.

Set jieba.NET as the startup project. The Program.cs code is as follows:

    using System;
    using System.Text;
    using JiebaNet.Segmenter;

    class Program
    {
        static void Main(string[] args)
        {
            // Register the code-page encodings, which .NET Core does not include by default.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            var segmenter = new JiebaSegmenter();

            // The sample sentences are shown in English translation; the segmenter expects Chinese text.
            var segments = segmenter.Cut("I came to Tsinghua University in Beijing", cutAll: true);
            Console.WriteLine("[Full mode]: {0}", string.Join("/ ", segments));

            segments = segmenter.Cut("I came to Beijing Tsinghua University"); // defaults to precise mode
            Console.WriteLine("[Precise mode]: {0}", string.Join("/ ", segments));

            segments = segmenter.Cut("He came to NetEase Hang Research Building"); // defaults to precise mode, also uses the HMM
            Console.WriteLine("[New word recognition]: {0}", string.Join("/ ", segments));

            segments = segmenter.CutForSearch("Xiao Ming's master's degree from the Chinese Academy of Sciences, after studying at Kyoto University in Japan"); // search engine mode
            Console.WriteLine("[Search engine mode]: {0}", string.Join("/ ", segments));

            segments = segmenter.Cut("married and not yet married");
            Console.WriteLine("[Disambiguation]: {0}", string.Join("/ ", segments));

            Console.ReadKey();
        }
    }

The result of running the program is as follows:

[Screenshot of the program output: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418194540274-274601975.png]

The JiebaSegmenter.Cut method switches between two modes, precise mode and full mode, through its cutAll parameter. Precise mode is the most basic and natural mode: it tries to cut the sentence as accurately as possible and is suitable for text analysis. Full mode scans out every possible word in the sentence. It is faster but cannot resolve ambiguity, because it does not search for the maximum-probability path and does not use the HMM to discover out-of-vocabulary words.

CutForSearch uses search engine mode, which builds on precise mode by splitting long words again to improve recall. It is suitable for search engine indexing.

POS Tagging

Part-of-speech tagging uses ICTCLAS-compatible notation. For a list of the tags used in ICTCLAS and jieba, refer to: POS tagging.

In TestDemo.cs, the PosCutDemo method demonstrates POS tagging.

    public void PosCutDemo()
    {
        var posSeg = new PosSegmenter();
        var s = "a cloud of high-energy ions gigantic in the distant and mysterious space of the earth";

        var tokens = posSeg.Cut(s);
        // Print each token as word/tag.
        Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));
    }

The result of the call is as follows:

[Screenshot of the POS tagging output: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418194933446-889993404.png]

Keyword extraction

Now let's extract keywords. jieba.NET provides two keyword extraction algorithms, TF-IDF and TextRank. The TF-IDF class is JiebaNet.Analyser.TfidfExtractor, and the TextRank class is JiebaNet.Analyser.TextRankExtractor.

    public void ExtractTagsDemo()
    {
        var text =
            @"Programmer (English: Programmer) is a professional who engages in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, especially in China. Software practitioners are divided into four categories: junior programmer, senior programmer, system analyst and project manager.";

        var extractor = new TfidfExtractor();
        var keywords = extractor.ExtractTags(text);
        foreach (var keyword in keywords)
        {
            Console.WriteLine(keyword);
        }
    }

    public void ExtractTagsDemo2()
    {
        var text = @"In mathematics and computer science, an algorithm is a concrete sequence of calculation steps, often used in computation, data processing and automated reasoning. To be precise, an algorithm is an effective method expressed as a finite list. An algorithm should contain clearly defined instructions for calculating a function.

The instructions in an algorithm describe a computation that, when run, starts from an initial state and an initial input (possibly empty), passes through a finite series of clearly defined states, and eventually produces an output and stops at a final state. The transition from one state to the next is not necessarily deterministic. Some algorithms, including randomized algorithms, incorporate random input.

The concept of a formal algorithm derives in part from attempts to solve the decision problem posed by Hilbert, and from later attempts to define effective computability or effective methods. These attempts included the recursive functions proposed by Kurt Gödel, Jacques Herbrand and Stephen Cole Kleene in 1930, 1934 and 1935 respectively, the λ-calculus proposed by Alonzo Church in 1936, Emil Leon Post's Formulation 1 of 1936, and the Turing machine proposed by Alan Turing in 1937. Even today, it is often difficult to turn an intuitive idea into a formalized algorithm.";

        var extractor = new TfidfExtractor();
        // Top ten keywords, restricted to nouns and verbs.
        var keywords = extractor.ExtractTags(text, 10, Constants.NounAndVerbPos);
        foreach (var keyword in keywords)
        {
            Console.WriteLine(keyword);
        }
    }

The ExtractTagsDemo method extracts all the keywords:

[Screenshot of the ExtractTagsDemo output: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418195605352-470402075.png]

The ExtractTagsDemo2 method extracts the top ten keywords, restricted to nouns and verbs:

[Screenshot of the ExtractTagsDemo2 output: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418200346931-1517766175.png]

The result returned by the ExtractTagsWithWeight method contains each keyword's weight in addition to the keyword itself.
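
As a rough illustration of what such a weight means under TF-IDF, the sketch below scores each word as its term frequency multiplied by its inverse document frequency. It is a simplified, self-contained example and not jieba.NET's code: the class TfIdfSketch, the IDF values and the fallback value defaultIdf are hypothetical, and the library's own extractor may differ in details such as stop-word filtering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdfSketch
{
    // Score = (occurrences of the word / total words) * IDF of the word.
    // Words missing from the IDF table fall back to defaultIdf.
    public static List<KeyValuePair<string, double>> Score(
        IReadOnlyList<string> words,
        IReadOnlyDictionary<string, double> idf,
        double defaultIdf)
    {
        var total = (double)words.Count;
        return words
            .GroupBy(word => word)
            .Select(group => new KeyValuePair<string, double>(
                group.Key,
                group.Count() / total *
                    (idf.TryGetValue(group.Key, out var value) ? value : defaultIdf)))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }

    static void Main()
    {
        // Already-segmented input and a hypothetical IDF table.
        var words = new[] { "programmer", "program", "programmer", "is", "program", "programmer" };
        var idf = new Dictionary<string, double>
        {
            { "programmer", 5.0 }, { "program", 4.0 }, { "is", 0.5 }
        };
        foreach (var pair in Score(words, idf, defaultIdf: 1.0))
        {
            Console.WriteLine("{0}\t{1:F4}", pair.Key, pair.Value);
        }
    }
}
```

Frequent words with a high IDF rank first, while a common function word such as "is" ends up near the bottom even though it appears in the text.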

Returning the start and end positions of words in the original text

    public void TokenizeDemo()
    {
        var segmenter = new JiebaSegmenter();
        var s = "yonghe Garment Ornament co., ltd.";

        var tokens = segmenter.Tokenize(s);
        foreach (var token in tokens)
        {
            Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
        }
    }

Calling the TokenizeDemo method returns the corresponding positions:

[Screenshot of the TokenizeDemo output: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418202612602-1933734487.png]

Adding New Words

Adding in code

    var segmenter = new JiebaSegmenter();
    var segments = segmenter.Cut(@".NETCore2.0 release time, .NET Core 2.0 Preview and .NET Standard 2.0 Preview are released in mid-to-late May.");
    Console.WriteLine("[Precise mode]: {0}", string.Join("/ ", segments));

    // Register the new words, then segment the same sentence again.
    segmenter.AddWord("release time");
    segmenter.AddWord(".NETCore2.0");
    segments = segmenter.Cut(@".NETCore2.0 release time, .NET Core 2.0 Preview and .NET Standard 2.0 Preview are released in mid-to-late May.");
    Console.WriteLine("[After adding words]: {0}", string.Join("/ ", segments));

Calling segmenter.AddWord adds a new word. Here the words "release time" and ".NETCore2.0" are added.

[Screenshot of the output before and after AddWord: http://images2015.cnblogs.com/blog/443844/201704/443844-20170418201639134-923070241.png]

You can see that the newly added words are recognized.

Adding via a dictionary

The dictionary format is the same as the main dictionary format: each line contains a word, its frequency (optional) and its part of speech (optional), separated by spaces. When the frequency is omitted, the segmenter uses an automatically calculated frequency that guarantees the word can be segmented out.

    Innovation Office 3 i
    cloud computing 5
    Catherine nz
    Taichung
    Machine Learning 3
    Deep learning 8
    linezero 2
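
The line format described above (word, optional frequency, optional part of speech, separated by spaces) can be sketched as a small parser. This is an illustrative, self-contained example and not jieba.NET's own loader: UserDictEntry and UserDictSketch.ParseLine are hypothetical names, and the sketch assumes the word itself contains no spaces, as is the case for Chinese entries.

```csharp
using System;

sealed class UserDictEntry
{
    public string Word { get; set; }
    public int? Frequency { get; set; } // null when the frequency is omitted
    public string Tag { get; set; }     // null when the part of speech is omitted
}

static class UserDictSketch
{
    // Parses one line of the form: word [frequency] [part-of-speech]
    public static UserDictEntry ParseLine(string line)
    {
        var parts = line.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0 || parts.Length > 3)
        {
            throw new FormatException("Expected: word [frequency] [part-of-speech]");
        }

        var entry = new UserDictEntry { Word = parts[0] };
        for (var i = 1; i < parts.Length; i++)
        {
            // A numeric field is the frequency; anything else is the tag.
            if (int.TryParse(parts[i], out var frequency))
            {
                entry.Frequency = frequency;
            }
            else
            {
                entry.Tag = parts[i];
            }
        }
        return entry;
    }

    static void Main()
    {
        foreach (var line in new[] { "创新办 3 i", "云计算 5", "凯特琳 nz", "台中" })
        {
            var entry = ParseLine(line);
            Console.WriteLine("{0} | freq: {1} | tag: {2}",
                entry.Word,
                entry.Frequency.HasValue ? entry.Frequency.Value.ToString() : "(auto)",
                entry.Tag ?? "(none)");
        }
    }
}
```

Keeping the frequency nullable mirrors the rule in the text: a missing frequency is not zero, it is a signal that the segmenter should compute a suitable value itself.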

Then call the segmenter.LoadUserDict() method, passing in the dictionary path.

For more details, see the code and README.md.
