A Word Segmenter for .NET (Jieba.NET, a .NET Port of the jieba Chinese Word Segmenter)


Brief introduction

I usually write small programs in Python, and anything related to text analysis inevitably involves Chinese word segmentation, which is how I came across jieba ("stutter"), a Chinese word segmenter implemented in Python. jieba is very easy to use, and its segmentation results are impressive; you can experience them on its online demo site (note the third line of text).

The most common segmentation component on the .NET platform is Pan Gu Segment, but it has not been updated for a long time. The most obvious difference is the built-in dictionary: jieba's dictionary has 500,000 entries, while Pan Gu's has 170,000, which leads to noticeably different segmentation results. In addition, for out-of-vocabulary words, jieba "uses an HMM model based on Chinese characters' word-forming ability, together with the Viterbi algorithm", and the results look good.

Based on these two points, plus my own interest in Chinese word segmentation, I tried porting jieba to the .NET platform; the code is on GitHub: Jieba.NET. Before trying out Jieba.NET, let me briefly introduce how jieba itself is implemented.

An Analysis of jieba's Implementation

jieba itself provides little documentation, but article series such as "Understanding and Analysis of the Algorithm Flow of the Python Chinese Word Segmentation Module jieba" and "A Study of the jieba Segmentation Source Code (1)" offer a glimpse of the overall idea behind its implementation. In short, its core modules and segmentation process are roughly:

    • Prefix dictionary (trie): stores the main dictionary and supports dynamically adding or deleting entries. The dictionary can be understood as the set of words jieba "knows", i.e. the registered (in-vocabulary) words.
    • Directed acyclic graph (DAG): using the prefix dictionary, all possible segmentations of a sentence can be enumerated.
    • Maximum-probability path: from the DAG, every segmentation candidate is known, and each corresponds to a path with a probability. Since entries have different probabilities, different candidates have different probabilities, and jieba selects the most probable path. At this point, the registered words have been divided as reasonably as possible.
    • HMM model and Viterbi algorithm: after the maximum-probability path, some unregistered words (words not in the prefix dictionary) may remain; the HMM and Viterbi algorithm then attempt a further division to produce the final result.
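The first three steps can be sketched in a few lines of Python (jieba's own language). The dictionary, frequencies, and sentence below are tiny illustrative assumptions, not jieba's real data; jieba's actual implementation differs in detail but follows the same DAG-plus-dynamic-programming shape:

```python
import math

# Toy prefix dictionary: word -> frequency (a stand-in for jieba's ~500k-entry dict)
FREQ = {"我": 1000, "来": 800, "来到": 300, "北京": 500,
        "清华": 200, "清华大学": 400, "大学": 600, "华": 50, "到": 400}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a known word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # fall back to a single character
    return dag

def max_prob_path(sentence, dag):
    """Right-to-left dynamic programming over log probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - math.log(TOTAL)
             + route[j + 1][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

sentence = "我来到北京清华大学"
print("/".join(max_prob_path(sentence, build_dag(sentence))))
# → 我/来到/北京/清华大学
```

Even with this toy dictionary, the higher combined probability of "清华大学" as one entry beats splitting it into "清华" + "大学", which is exactly the disambiguation the maximum-probability path provides.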

This process is very similar to how humans segment text. For example, on seeing the phrase "linguists attend academic conferences", we divide it into "linguists / attend / academic conferences". Although the process feels instantaneous, it does contain the first three steps above: before segmenting, the brain already holds a "prefix dictionary" that includes "language", "linguistics", "linguists", and so on; the brain knows that several segmentations are possible, but it ultimately chooses the most probable one, discarding less probable alternatives (for example, splitting "linguists" into "language" plus "scholars").

The preceding sentence contains only registered words. Now consider another sentence: "He came to the NetEase Hangyan building." Most people quickly divide it as "he / came to / NetEase / Hangyan / building": apart from "Hangyan", every word is registered and easy to split off. For "Hangyan" we have to consider whether it is two words or one new word, perhaps through a process like: I know NetEase has an R&D center or research institute in Hangzhou, so "Hangyan" ("Hangzhou Research") is probably an abbreviation related to that; ah, now I have learned a new word. Although this process differs from an HMM's, it at least shows that jieba, too, tries to discover unregistered words in some principled way.

That said, as one might imagine, an HMM based on state-transition probabilities (see "The HMM Model in Chinese Word Segmentation") tends to find words that look natural or normal; for new person names, organization names, or internet coinages, the results will not be as good.
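The unregistered-word step can be sketched as a Viterbi decode over the four character states B/M/E/S (begin, middle, end, single word) that jieba uses. All probabilities below are made-up toy values for illustration, not jieba's trained model:

```python
import math

STATES = "BMES"  # Begin, Middle, End, Single
NEG_INF = float("-inf")

# Assumed toy log-probabilities; jieba loads trained tables instead.
start_p = {"B": math.log(0.6), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.4)}
trans_p = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}

def viterbi(obs, emit_p):
    """emit_p[state][char] -> log emission probability; unseen chars get -10."""
    V = [{s: start_p[s] + emit_p[s].get(obs[0], -10.0) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in obs[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            prob, prev = max(
                (V[-2][p] + trans_p[p].get(s, NEG_INF) + emit_p[s].get(ch, -10.0), p)
                for p in STATES)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(("E", "S"), key=lambda s: V[-1][s])  # a word must end in E or S
    return path[best]

emit_p = {s: {} for s in STATES}  # uniform emissions for this sketch
print(viterbi("杭研", emit_p))   # → ['B', 'E']
```

With these toy parameters, the two characters of "杭研" are tagged B then E, i.e. recognized as a single two-character word, which is exactly how the HMM step can accept an unregistered word.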

Using Jieba.NET

The current version of Jieba.NET is 0.37.1, matching jieba's version number, and it can be installed via NuGet:

PM> Install-Package jieba.NET

After installation, copy the Resources directory to the directory where the assembly is located. Below are examples of word segmentation, POS tagging, and keyword extraction, respectively.

Segmentation

var segmenter = new JiebaSegmenter();
var segments = segmenter.Cut("我来到北京清华大学", cutAll: true); // "I came to Tsinghua University in Beijing"
Console.WriteLine("[Full mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("我来到北京清华大学"); // the default is precise mode
Console.WriteLine("[Precise mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("他来到了网易杭研大厦"); // "He came to the NetEase Hangyan building"; the default precise mode also uses the HMM
Console.WriteLine("[New word recognition]: {0}", string.Join("/ ", segments));

segments = segmenter.CutForSearch("小明硕士毕业于中国科学院计算所，后在日本京都大学深造"); // search engine mode
Console.WriteLine("[Search engine mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("结过婚的和尚未结过婚的"); // "those who have married and those who have not yet married"
Console.WriteLine("[Ambiguity resolution]: {0}", string.Join("/ ", segments));

The result of the operation is:

"Full mode": I/Come/BEIJING/Tsinghua/Tsinghua/Huada/University "precise mode": I/Come/Beijing/Tsinghua University "new word recognition": he/came//NetEase/Hang/building "search engine mode": Xiao Ming/master/graduate/In/China/Science/college/Academy/ Chinese Academy of Sciences/calculation/Calculation Institute/,/after/in/Japan/Kyoto/University/Japan Kyoto University/Further study "ambiguity cancellation": Married//and/not married/married/

The JiebaSegmenter.Cut method supports two modes via its cutAll parameter: precise mode and full mode. Precise mode is the default and most natural mode; it tries to cut the sentence as accurately as possible and is suitable for text analysis. Full mode scans out every dictionary word in the sentence; it is faster but cannot resolve ambiguity, because it neither searches for the maximum-probability path nor finds unregistered words via the HMM.

CutForSearch uses search engine mode, which further splits long words on top of precise mode to improve recall; it is suitable for segmentation in search engines.
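The idea behind search engine mode is simple enough to sketch in Python: after a precise-mode cut, re-scan each long word for shorter words that are also in the dictionary. The dictionary below is an assumed toy; real implementations also re-rank and deduplicate:

```python
def cut_for_search(words, freq):
    """Re-scan long words from a precise-mode cut for shorter dictionary words."""
    result = []
    for w in words:
        for n in (2, 3):  # try 2-grams, then 3-grams, inside long words
            if len(w) > n:
                for i in range(len(w) - n + 1):
                    gram = w[i:i + n]
                    if gram in freq:
                        result.append(gram)
        result.append(w)  # always keep the original long word as well
    return result

freq = {"清华", "大学", "科学", "学院", "科学院"}  # toy dictionary
print(cut_for_search(["清华大学"], freq))
# → ['清华', '大学', '清华大学']
```

Emitting both the fragments and the original word is what raises recall: a query for either "清华" or "清华大学" can now match the same document.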

POS Tagging

Part-of-speech tagging uses notation compatible with ICTCLAS; for the list of tags used by ICTCLAS and jieba, see: POS tagging.

var posSeg = new PosSegmenter();
var s = "一团硕大无比的高能离子云，在遥远而神秘的太空中迅疾地漂移"; // "A gigantic cloud of high-energy ions drifts swiftly through distant, mysterious space"
var tokens = posSeg.Cut(s);
Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));

The result is:

一团/m 硕大无比/i 的/uj 高能/n 离子/n 云/ns ，/x 在/p 遥远/a 而/c 神秘/an 的/uj 太空/n 中/f 迅疾/z 地/uv 漂移/v

Keyword Extraction

Consider the following text about algorithms, taken from Wikipedia:

In mathematics and computer science, an algorithm is a concrete sequence of computational steps, commonly used in calculation, data processing, and automated reasoning. More precisely, an algorithm is an effective method expressible as a finite list of well-defined instructions for computing a function. Starting from an initial state and initial input (possibly empty), the instructions describe a computation that, when executed, proceeds through a finite series of well-defined states and eventually produces output and terminates in a final state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input. The concept of the formal algorithm derives in part from attempts to solve the decision problem posed by Hilbert, and from later attempts to formalize the notion of an effective method of computation. These attempts included the recursive functions presented by Kurt Gödel, Jacques Herbrand, and Stephen Cole Kleene in 1930, 1934, and 1935 respectively, Alonzo Church's lambda calculus in 1936, Emil Leon Post's Formulation 1 in 1936, and Alan Turing's Turing machines in 1937. Even today, it is often difficult to define an intuitive idea as a formal algorithm.

Jieba.NET provides two keyword-extraction algorithms, TF-IDF and TextRank. The class for TF-IDF is JiebaNet.Analyser.TfidfExtractor, and the class for TextRank is JiebaNet.Analyser.TextRankExtractor.

var extractor = new TfidfExtractor();
// Extract the top 10 keywords, nouns and verbs only
var keywords = extractor.ExtractTags(text, 10, Constants.NounAndVerbPos);
foreach (var keyword in keywords)
{
    Console.WriteLine(keyword);
}

The running result is

algorithm
define
calculation
attempt
formalize
state
input
contain

The corresponding ExtractTagsWithWeight methods return the associated weight values along with the keywords. TextRankExtractor's interface is exactly the same as TfidfExtractor's, so it is not repeated here.
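For intuition, the TF-IDF weighting behind this kind of extractor can be sketched as follows. The IDF table, fallback value, and document here are tiny assumptions made for the example; the real extractor loads a large pre-computed IDF file trained on a corpus:

```python
from collections import Counter

# Assumed toy IDF table (rarer words get larger IDF).
IDF = {"algorithm": 3.2, "calculation": 2.5, "the": 0.1,
       "state": 2.0, "input": 1.8}
MEDIAN_IDF = 2.0  # fallback for words missing from the table

def extract_tags(words, top_k=3):
    """Rank words by term frequency times inverse document frequency."""
    tf = Counter(words)
    total = sum(tf.values())
    weights = {w: (c / total) * IDF.get(w, MEDIAN_IDF) for w, c in tf.items()}
    return [w for w, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]]

doc = ["the", "algorithm", "state", "algorithm", "input", "the", "calculation"]
print(extract_tags(doc))
# → ['algorithm', 'calculation', 'state']
```

Note how "the" is frequent but scores lowest: its tiny IDF reflects that it appears in nearly every document, which is why TF-IDF surfaces topical words rather than common ones.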

Summary

Word segmentation, POS tagging, and keyword extraction are jieba's three main functional modules. Jieba.NET currently stays as consistent as possible with jieba in both features and interfaces, though it may later provide additional extensions on top of jieba. Development of Jieba.NET has only just begun, and many details still need polishing. Trials and feedback are very welcome, and I hope to discuss with everyone how to build a better Chinese word-segmentation library.

Anders Cui
Source: http://anderslly.cnblogs.com
Original link: http://www.cnblogs.com/anderslly/p/jiebanet.html

