.NET Core Chinese Word Segmentation Component: jieba.NET Core

Source: Internet
Author: User

jieba.NET Core is a Chinese word segmentation component for .NET Core. I needed Chinese word segmentation for a real project.

I found jieba.NET, but it had no .NET Core version. After seeing someone ask about .NET Core in an issue, I made jieba.NET support .NET Core.

jieba.NET Core: https://github.com/linezero/jieba.NET

jieba word segmentation features
  • Three word segmentation modes are supported:
    • Exact mode, which aims to segment the sentence as precisely as possible and is suitable for text analysis;
    • Full mode, which scans out every word that can be formed in the sentence. It is very fast but cannot resolve ambiguity: the segmentation neither uses word frequencies to find the maximum-probability path nor uses the HMM;
    • Search engine mode, which builds on exact mode and further splits long words to improve recall, making it suitable for search-engine tokenization.
  • Traditional Chinese segmentation is supported.
  • Custom dictionaries and words can be added.

jieba.NET Core usage

Download the code and open the project with VS 2017 or VS Code.

Select jieba.NET as the startup project. The Program.cs code is as follows:

using System;
using System.Text;
using JiebaNet.Segmenter;

class Program
{
    static void Main(string[] args)
    {
        // Register the code-page encodings, which .NET Core does not include by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var segmenter = new JiebaSegmenter();

        // "I came to Tsinghua University in Beijing"
        var segments = segmenter.Cut("我来到北京清华大学", cutAll: true);
        Console.WriteLine("[Full mode]: {0}", string.Join("/", segments));

        segments = segmenter.Cut("我来到北京清华大学"); // exact mode is the default
        Console.WriteLine("[Exact mode]: {0}", string.Join("/", segments));

        // "He came to the NetEase Hangyan Building"
        segments = segmenter.Cut("他来到了网易杭研大厦"); // exact mode by default; the HMM model is also used
        Console.WriteLine("[New word recognition]: {0}", string.Join("/", segments));

        // "Xiao Ming received his master's degree from the Institute of Computing Technology,
        // Chinese Academy of Sciences, and later studied at Kyoto University in Japan"
        segments = segmenter.CutForSearch("小明硕士毕业于中国科学院计算所，后在日本京都大学深造"); // search engine mode
        Console.WriteLine("[Search engine mode]: {0}", string.Join("/", segments));

        // "Those who have been married and those who have not yet been married"
        segments = segmenter.Cut("结过婚的和尚未结过婚的");
        Console.WriteLine("[Ambiguity resolution]: {0}", string.Join("/", segments));

        Console.ReadKey();
    }
}

Running the program prints one line per call, with the segmented words joined by "/".

The JiebaSegmenter.Cut method supports two modes: exact mode and full mode. Exact mode is the most basic and natural mode and is suitable for text analysis. Full mode scans out every string in the sentence that can form a word; it is faster but cannot resolve ambiguity, because it neither computes the maximum-probability path nor uses the HMM to find out-of-vocabulary words.

CutForSearch uses search engine mode: on top of exact mode, it further splits long words to improve recall, which makes it suitable for search-engine tokenization.
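The difference between full mode and exact mode can be sketched with a toy segmenter. This is only an illustration under stated assumptions: the dictionary, its frequencies, and the names ToySegmenter, FullCut, and ExactCut are all hypothetical and are not jieba.NET's code or data. FullCut lists every multi-character dictionary word it finds, while ExactCut uses dynamic programming to pick the single segmentation with the highest total log-probability.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only; not jieba.NET's implementation or dictionary.
public static class ToySegmenter
{
    // Hypothetical word frequencies.
    private static readonly Dictionary<string, int> Freq = new Dictionary<string, int>
    {
        ["清华"] = 50, ["华大"] = 5, ["大学"] = 80, ["清华大学"] = 30,
        ["清"] = 10, ["华"] = 10, ["大"] = 20, ["学"] = 20
    };

    private static readonly double Total = Freq.Values.Sum();

    // Full mode (simplified): emit every multi-character substring found in the dictionary.
    public static List<string> FullCut(string s)
    {
        var words = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            for (int j = i + 2; j <= s.Length; j++)
            {
                string w = s.Substring(i, j - i);
                if (Freq.ContainsKey(w)) words.Add(w);
            }
        }
        return words;
    }

    // Exact mode (simplified): choose the segmentation with the maximum total log-probability.
    public static List<string> ExactCut(string s)
    {
        int n = s.Length;
        var best = new double[n + 1]; // best[i]: best log-probability for the suffix starting at i
        var next = new int[n + 1];    // next[i]: end index of the first word on that best path
        for (int i = n - 1; i >= 0; i--)
        {
            best[i] = double.NegativeInfinity;
            for (int j = i + 1; j <= n; j++)
            {
                string w = s.Substring(i, j - i);
                int f;
                if (!Freq.TryGetValue(w, out f))
                {
                    if (j - i > 1) continue; // unknown multi-character strings are not words
                    f = 1;                   // unknown single characters get a floor frequency
                }
                double score = Math.Log(f / Total) + best[j];
                if (score > best[i]) { best[i] = score; next[i] = j; }
            }
        }
        var words = new List<string>();
        for (int i = 0; i < n; i = next[i]) words.Add(s.Substring(i, next[i] - i));
        return words;
    }
}
```

With this toy dictionary, FullCut("清华大学") lists 清华, 清华大学, 华大, and 大学, including the overlapping candidates, while ExactCut("清华大学") returns the single word 清华大学 because that path has the highest probability.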

Part-of-speech tagging

Part-of-speech tagging uses a tag set compatible with ICTCLAS. For the list of tags used in ICTCLAS and jieba, see the part-of-speech tag reference.

In TestDemo.cs, the PosCutDemo method demonstrates part-of-speech tagging.

// Requires: using System; using System.Linq; using JiebaNet.Segmenter.PosSeg;
public void PosCutDemo()
{
    var posSeg = new PosSegmenter();
    // "An enormous cloud of high-energy ions drifts swiftly through distant, mysterious space"
    var s = "一团硕大无朋的高能离子云，在遥远而神秘的太空中迅疾地飘移";

    var tokens = posSeg.Cut(s);
    // Print each token as word/tag.
    Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));
}

The call prints each word followed by its part-of-speech tag, in word/tag form.

Keyword extraction

Now let's extract keywords. jieba.NET provides two keyword extraction algorithms, TF-IDF and TextRank. The TF-IDF class is JiebaNet.Analyser.TfidfExtractor, and the TextRank class is JiebaNet.Analyser.TextRankExtractor.

// Requires: using System; using JiebaNet.Analyser; using JiebaNet.Segmenter;
// The sample texts below are English renderings; the original post passes the Chinese source text.
public void ExtractTagsDemo()
{
    var text = "A programmer is a professional engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, especially in China. Software practitioners are divided into four categories: junior programmers, senior programmers, system analysts, and project managers.";
    var extractor = new TfidfExtractor();
    var keywords = extractor.ExtractTags(text);
    foreach (var keyword in keywords)
    {
        Console.WriteLine(keyword);
    }
}

public void ExtractTagsDemo2()
{
    var text = @"In mathematics and computer science, an algorithm is a specific sequence of computational steps, commonly used for calculation, data processing, and automated reasoning. More precisely, an algorithm is an effective method expressed as a finite list. It should contain well-defined instructions for computing a function.
The instructions in an algorithm describe a computation that, when run, starts from an initial state and an initial input (possibly empty), passes through a finite series of well-defined states, and eventually produces output and stops in a final state. The transition from one state to the next is not necessarily deterministic; some algorithms, including randomized algorithms, incorporate random input.
The concept of a formal algorithm grew partly out of attempts to solve the decision problem posed by Hilbert, and took shape in later attempts to define effective calculability or effective method. These attempts include the recursive functions proposed by Kurt Gödel, Jacques Herbrand, and Stephen Cole Kleene in 1930, 1934, and 1935 respectively, the lambda calculus proposed by Alonzo Church in 1936, Emil Leon Post's Formulation 1 in 1936, and Alan Turing's Turing machine in 1937. Even today, intuitive ideas are often difficult to define as formal algorithms.";
    var extractor = new TfidfExtractor();
    // Top 10 keywords, restricted to nouns and verbs.
    var keywords = extractor.ExtractTags(text, 10, Constants.NounAndVerbPos);
    foreach (var keyword in keywords)
    {
        Console.WriteLine(keyword);
    }
}

The ExtractTagsDemo method extracts all keywords.

The ExtractTagsDemo2 method extracts the top 10 keywords, restricted to nouns and verbs.

The ExtractTagsWithWeight method returns each keyword together with its corresponding weight.
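To make the idea of a keyword weight concrete, here is a minimal TF-IDF sketch. It is an illustration under assumptions, not the TfidfExtractor implementation: the class ToyTfidf, the method TopKeywords, the IDF table, and the fallback IDF value are all hypothetical, and the input is assumed to be already segmented into words. Each word's weight is its term frequency in the text multiplied by its inverse document frequency.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only; not the JiebaNet.Analyser implementation.
public static class ToyTfidf
{
    // words: the already-segmented text.
    // idf: a hypothetical inverse-document-frequency table.
    // defaultIdf: the value used for words missing from the table.
    public static List<KeyValuePair<string, double>> TopKeywords(
        IEnumerable<string> words,
        IDictionary<string, double> idf,
        double defaultIdf,
        int count)
    {
        var list = words.ToList();
        return list
            .GroupBy(w => w)
            .Select(g => new KeyValuePair<string, double>(
                g.Key,
                // weight = term frequency * inverse document frequency
                (double)g.Count() / list.Count *
                    (idf.TryGetValue(g.Key, out var v) ? v : defaultIdf)))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal) // deterministic order for equal weights
            .Take(count)
            .ToList();
    }
}
```

For example, with the words algorithm, is, algorithm, step and IDF values of 5.0, 0.1, and 4.0 for algorithm, is, and step, the top two keywords are algorithm (weight 2.5) and step (weight 1.0). The frequent but uninformative word is ranks last because of its low IDF.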

Getting the start and end positions of words in the original text

public void TokenizeDemo()
{
    var segmenter = new JiebaSegmenter();
    // "Yonghe Clothing & Accessories Co., Ltd."
    var s = "永和服装饰品有限公司";

    var tokens = segmenter.Tokenize(s);
    foreach (var token in tokens)
    {
        Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
    }
}

Calling the TokenizeDemo method prints each word together with its start and end positions.
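How such positions relate to the segmented words can be sketched as follows. This is a hypothetical helper (TokenOffsets.FromSegments is not part of jieba.NET), and it assumes the segments concatenate back to the original text with nothing skipped. In this sketch the start index is inclusive and the end index is exclusive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of jieba.NET.
public static class TokenOffsets
{
    // Assumes the segments, joined in order, reproduce the original text exactly.
    public static List<(string Word, int Start, int End)> FromSegments(IEnumerable<string> segments)
    {
        var result = new List<(string Word, int Start, int End)>();
        int pos = 0;
        foreach (var word in segments)
        {
            // Start is inclusive; End is exclusive (Start + word length).
            result.Add((word, pos, pos + word.Length));
            pos += word.Length;
        }
        return result;
    }
}
```

Given the segments 永和, 服装, 饰品, this returns (永和, 0, 2), (服装, 2, 4), (饰品, 4, 6), so each pair of indexes can be used directly to slice the word back out of the source string.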

 

Add new words

Add the following code:

var segmenter = new JiebaSegmenter();
// English rendering of the sample sentence; the original post uses Chinese input.
var text = @".NETCore2.0 release time: .NET Core 2.0 preview and .NET Standard 2.0 Preview are expected around mid-to-late May.";

var segments = segmenter.Cut(text);
Console.WriteLine("[Exact mode]: {0}", string.Join("/", segments));

// Add the new words, then segment the same text again.
segmenter.AddWord("release time");
segmenter.AddWord(".NETCore2.0");
segments = segmenter.Cut(text);
Console.WriteLine("[Exact mode]: {0}", string.Join("/", segments));

Call segmenter.AddWord to add new words. Here, "release time" and ".NETCore2.0" are added.

In the second output, the newly added words are recognized.

Add a dictionary

The user dictionary uses the same format as the main dictionary: each line contains a word, its frequency (optional), and its part of speech (optional), separated by spaces. When the frequency is omitted, the segmenter uses an automatically calculated frequency that ensures the word can be segmented out.

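Innovation Office 3 i
cloud computing 5
terene nz
middleware
machine learning 3
deep learning 8
linezero 2

(In the original post each entry is a single Chinese word; the English renderings contain spaces only because of translation.)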
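The line format described above can be read with a small parser like the one below. This is a sketch under assumptions, not the code jieba.NET uses inside LoadUserDict: the class UserDictParser and the method ParseLine are hypothetical, and the sketch assumes the word itself contains no spaces and that a second field is a frequency only when it parses as an integer.

```csharp
using System;

// Hypothetical parser for illustration; not jieba.NET's LoadUserDict implementation.
public static class UserDictParser
{
    // Parses "word [frequency] [part-of-speech]" with space-separated fields.
    public static (string Word, int? Freq, string Tag) ParseLine(string line)
    {
        var parts = line.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0) throw new FormatException("empty dictionary line");

        string word = parts[0];
        int? freq = null;
        string tag = null;

        int idx = 1;
        // The frequency is optional: treat the next field as one only if it is an integer.
        if (idx < parts.Length && int.TryParse(parts[idx], out int f))
        {
            freq = f;
            idx++;
        }
        // The part-of-speech tag is optional as well.
        if (idx < parts.Length) tag = parts[idx];

        return (word, freq, tag);
    }
}
```

With this sketch, "linezero 2" yields the word linezero with frequency 2 and no tag, "创新办 3 i" yields frequency 3 and tag i, and "凱特琳 nz" yields no frequency and tag nz.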

Then call the segmenter.LoadUserDict() method, passing in the dictionary path.

For more details, see the code and README.md.

Reference: http://www.cnblogs.com/anderslly/p/jiebanet.html
