Jieba Chinese word segmentation, .NET version: jieba.NET
jieba is very easy to use and its segmentation results are impressive. If you are interested, you can try its online demo site (pay particular attention to the third line of text).
On the .NET platform, the most common word segmentation component is Pan Gu, but it has not been updated for a long time. The most obvious difference is the built-in dictionary: jieba's dictionary contains about 500,000 entries while Pan Gu's contains about 170,000, which leads to different segmentation results. In addition, for unknown words (words not in the dictionary), jieba applies an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm, and the results look quite good.
Based on these two points, and out of interest in Chinese word segmentation, I tried to port jieba to the .NET platform and put the code on GitHub: jieba.NET. Before describing how to use jieba.NET, let me briefly introduce how jieba itself is implemented.
Analysis of jieba's implementation
jieba itself provides only a little documentation, but the article series "Understanding and analyzing the jieba segmentation algorithm of the Python Chinese word segmentation module" (part 1) gives a good view of jieba's overall design based on its source code. In short, its core modules and segmentation process are roughly as follows:
Prefix dictionary (Trie): stores the main dictionary and supports dynamically adding or deleting entries. This dictionary can be thought of as the words jieba already knows, i.e. the known (in-vocabulary) words;
Directed acyclic graph (DAG): built with the prefix dictionary, it records all the possible ways the characters of a sentence can form words;
Maximum probability path: the DAG covers all possible segmentation results, each corresponding to a path with its own probability, because different entries occur with different frequencies. Finding the path with the maximum probability gives the most reasonable segmentation of the known words;
HMM model and Viterbi algorithm: after the maximum probability path is found, there may still be unknown words (words not in the prefix dictionary). The HMM model and the Viterbi algorithm are then used to segment these remaining fragments and produce the final result (see the sketch after this list).
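To make the first three steps concrete, here is a minimal, self-contained sketch of building a DAG from a toy prefix dictionary and then picking the maximum probability path with dynamic programming. This is not jieba.NET's actual code; the dictionary, the log probabilities, and the sample word are invented for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

class MaxProbDemo
{
    static void Main()
    {
        // Toy prefix dictionary: word -> log probability (made-up values).
        var dict = new Dictionary<string, double>
        {
            { "语", -8.0 }, { "言", -8.5 }, { "语言", -5.0 },
            { "学", -7.5 }, { "家", -7.0 }, { "学家", -6.5 },
            { "语言学", -6.0 }, { "语言学家", -5.5 }
        };
        var sentence = "语言学家";  // "linguist"
        var n = sentence.Length;

        // DAG: dag[i] lists every j such that sentence[i..j] is a known word.
        var dag = new List<int>[n];
        for (var i = 0; i < n; i++)
        {
            dag[i] = new List<int>();
            for (var j = i; j < n; j++)
            {
                if (dict.ContainsKey(sentence.Substring(i, j - i + 1)))
                    dag[i].Add(j);
            }
            if (dag[i].Count == 0) dag[i].Add(i);  // fall back to the single character
        }

        // Dynamic programming from right to left: route[i] = (best log prob from i, end index of first word).
        var route = new Tuple<double, int>[n + 1];
        route[n] = Tuple.Create(0.0, 0);
        for (var i = n - 1; i >= 0; i--)
        {
            route[i] = dag[i]
                .Select(j => Tuple.Create(
                    (dict.TryGetValue(sentence.Substring(i, j - i + 1), out var p) ? p : -10.0)
                    + route[j + 1].Item1, j))
                .OrderByDescending(t => t.Item1)
                .First();
        }

        // Walk the best path to print the segmentation.
        var pos = 0;
        while (pos < n)
        {
            var j = route[pos].Item2;
            Console.Write(sentence.Substring(pos, j - pos + 1) + "/");
            pos = j + 1;
        }
        Console.WriteLine();
    }
}

With these toy numbers the sample text comes out as a single word because that path has the highest total log probability; jieba does the same thing, only over its full dictionary.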
This process is similar to how a person segments a sentence. For example, when we read the sentence "a linguist attends an academic conference", we instantly divide it into "a linguist / attends / an academic conference". Although this happens in an instant, it does cover the first three steps above: before segmenting, the brain already holds a "prefix dictionary" containing words such as language, linguistics, and linguist; the brain knows that the sentence could be split in several ways, but it finally selects the most probable one, and the less likely candidates are discarded.
The previous sentence contains only known words. Now look at another one: "He came to the NetEase Hangyan Building". Most people can quickly divide it as "He / came to / NetEase / Hangyan / Building". Except for "Hangyan", all the words are known words and are easy to split off. For "Hangyan", we have to think about whether it is two separate characters or a new word. The reasoning may go something like this: I know NetEase has an R&D center or research institute in Hangzhou, so "Hangyan" is probably an abbreviation for it; well, I have just learned a new word. Although this process is quite different from an HMM, it at least shows that jieba does try to discover unknown words in some way.
However, as you can imagine, the words that an HMM based on state transition probabilities can discover (see the article "Chinese word segmentation: the HMM model explained") tend to be "natural", ordinary words; for new person names, organization names, or Internet slang it will not work nearly as well. A compact sketch of this HMM decoding step follows.
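As an illustration of that HMM step (toy values only, not jieba's trained parameters), the sketch below runs the Viterbi algorithm over the four hidden states jieba assigns to each character: B (begin of word), M (middle), E (end) and S (single-character word), and then cuts the text after every E or S.

using System;
using System.Collections.Generic;
using System.Linq;

class ViterbiDemo
{
    static readonly char[] States = { 'B', 'M', 'E', 'S' };

    // Invented log probabilities, just to show the mechanics.
    static readonly Dictionary<string, double> Transitions = new Dictionary<string, double>
    {
        { "BE", -0.5 }, { "BM", -1.0 }, { "MM", -1.2 }, { "ME", -0.7 },
        { "EB", -0.8 }, { "ES", -1.0 }, { "SB", -0.7 }, { "SS", -1.1 }
    };
    static double Start(char s) => s == 'B' ? -0.6 : s == 'S' ? -0.9 : -10.0;
    static double TransProb(char from, char to)
    {
        double p;
        return Transitions.TryGetValue("" + from + to, out p) ? p : -100.0;  // "impossible" transition
    }
    static double Emit(char state, char ch) => -3.0;  // uniform toy emission probability

    static void Main()
    {
        var text = "杭研大厦";  // a toy fragment left over after dictionary-based segmentation
        var n = text.Length;
        var v = new double[n, States.Length];   // v[t, s]: best log prob ending in state s at position t
        var back = new int[n, States.Length];   // back-pointers for the traceback

        for (var s = 0; s < States.Length; s++)
            v[0, s] = Start(States[s]) + Emit(States[s], text[0]);

        for (var t = 1; t < n; t++)
            for (var s = 0; s < States.Length; s++)
            {
                v[t, s] = double.NegativeInfinity;
                for (var p = 0; p < States.Length; p++)
                {
                    var score = v[t - 1, p] + TransProb(States[p], States[s]) + Emit(States[s], text[t]);
                    if (score > v[t, s]) { v[t, s] = score; back[t, s] = p; }
                }
            }

        // Trace back the best state sequence, then cut the text after every E or S.
        var tags = new char[n];
        var last = Enumerable.Range(0, States.Length).OrderByDescending(s => v[n - 1, s]).First();
        tags[n - 1] = States[last];
        for (int t = n - 1, s = last; t > 0; t--) { s = back[t, s]; tags[t - 1] = States[s]; }

        for (var t = 0; t < n; t++)
        {
            Console.Write(text[t]);
            if (tags[t] == 'E' || tags[t] == 'S') Console.Write("/");
        }
        Console.WriteLine();
    }
}

With these toy numbers the fragment comes out as two two-character words; in the real model, the trained transition and emission probabilities decide whether a fragment like "Hangyan" is kept together as a new word.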
jieba.NET usage
The current version of jieba.NET is 0.37.1, keeping its version number in line with jieba. You can install it via NuGet:
PM> Install-Package jieba.NET
After installation, copy the Resources directory to the directory where the assembly is located. Below are examples of word segmentation, part-of-speech tagging, and keyword extraction.
Word Segmentation
var segmenter = new JiebaSegmenter();
var segments = segmenter.Cut("I came to Beijing Tsinghua University", cutAll: true);
Console.WriteLine("[Full mode]: {0}", string.Join("/", segments));

segments = segmenter.Cut("I came to Beijing Tsinghua University");  // exact mode is the default
Console.WriteLine("[Exact mode]: {0}", string.Join("/", segments));

segments = segmenter.Cut("He came to the NetEase Hangyan Building");  // exact mode by default; the HMM model is also used
Console.WriteLine("[New word recognition]: {0}", string.Join("/", segments));

segments = segmenter.CutForSearch("James graduated from the Institute of Computing Science, Chinese Academy of Sciences, and later studied at Kyoto University, Japan");  // search engine mode
Console.WriteLine("[Search engine mode]: {0}", string.Join("/", segments));

segments = segmenter.Cut("Married Monk not married");
Console.WriteLine("[Eliminate ambiguity]: {0}", string.Join("/", segments));
The running result is:
[Full mode]: I/Come/BEIJING/Tsinghua University/Huada/University
[Exact mode]: I/Come/BEIJING/Tsinghua University
[New word recognition]: He/came to/NetEase/Hangyan/Building
[Search engine mode]: James/master/graduated/in/China/Science/,/post/in/Japan/Kyoto/University/Kyoto University/further studies
[Eliminate ambiguity]: Married/and/not married/
The JiebaSegmenter.Cut method supports two modes: exact mode and full mode. Exact mode is the default and most natural one: it tries to split the sentence as accurately as possible and is suitable for text analysis. Full mode scans out every fragment in the sentence that can form a word; it is faster, but it cannot resolve ambiguity, because it neither searches for the maximum probability path nor uses the HMM to find unknown words.
CutForSearch implements the search engine mode: on top of exact mode it splits long words again to improve recall, making it suitable for segmenting text for search engines.
Part-of-speech tagging
The part-of-speech tags are compatible with ICTCLAS. For the list of tags used by ICTCLAS and jieba, see the part-of-speech tagging documentation.
var posSeg = new PosSegmenter();
var s = "A group of enormous high-energy ion clouds, drifting rapidly in the distant and mysterious space";
var tokens = posSeg.Cut(s);
Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));
The running result is:
A group of/m Great Friends/I/uj high energy/n ions/n cloud/ns, /x in/p distant/a and/c mysterious/a/uj space/n/f fast disease/z location/uv floating/v
Keyword extraction
Let's take the following description of algorithms from Wikipedia as an example:
In mathematics and computer science, an algorithm is a specific sequence of computational steps, commonly used in computation, data processing, and automated reasoning. Precisely speaking, an algorithm is an effective method expressed as a finite list of well-defined instructions for computing a function.
The instructions in an algorithm describe a computation which, when executed, starts from an initial state and an initial input (which may be empty), passes through a finite and well-defined series of states, and finally produces output and stops at a final state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input.
The concept of a formal algorithm partly stems from attempts to solve the Entscheidungsproblem (decision problem) posed by Hilbert, and from the later attempts to define effective computability or effective methods. These attempts include the recursive functions proposed by Kurt Gödel, Jacques Herbrand, and Stephen Cole Kleene in 1930, 1934, and 1935 respectively, the lambda calculus proposed by Alonzo Church in 1936, Emil Leon Post's Formulation 1 of 1936, and Alan Turing's Turing machine of 1937. Even today, it is often difficult to pin an intuitive idea down as a formal algorithm.
Now let's try to extract keywords from it. jieba.NET provides two keyword extraction algorithms: TF-IDF and TextRank. The corresponding classes are JiebaNet.Analyser.TfidfExtractor and JiebaNet.Analyser.TextRankExtractor.
var extractor = new TfidfExtractor();
// Extract the top 10 keywords, restricted to nouns and verbs
var keywords = extractor.ExtractTags(text, 10, Constants.NounAndVerbPos);
foreach (var keyword in keywords)
{
    Console.WriteLine(keyword);
}
The running result is
Algorithm
Definition
Computing
Try
Formalization
Proposal
Status
Command
Input
Include
In addition to the keywords themselves, the corresponding ExtractTagsWithWeight method also returns the weight value of each keyword. TextRankExtractor exposes exactly the same interface as TfidfExtractor.
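As a sketch (the exact shape of the weighted result items, assumed here to expose Word and Weight properties, should be verified against the current jieba.NET API), the weighted variant and the TextRank counterpart can be used like this:

var tfidf = new TfidfExtractor();
// Same parameters as ExtractTags, but each item also carries its weight.
foreach (var tag in tfidf.ExtractTagsWithWeight(text, 10, Constants.NounAndVerbPos))
{
    Console.WriteLine("{0}\t{1}", tag.Word, tag.Weight);
}

// TextRankExtractor is used in exactly the same way.
var textRank = new TextRankExtractor();
foreach (var keyword in textRank.ExtractTags(text, 10, Constants.NounAndVerbPos))
{
    Console.WriteLine(keyword);
}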
Summary
Word segmentation, part-of-speech tagging, and keyword extraction are the three main functional modules of jieba. jieba.NET currently tries to stay consistent with jieba in both functionality and interface, but it may provide extensions beyond jieba in the future. The development of jieba.NET has just begun and many details need to be improved. Your trial and feedback are very welcome, and I hope to discuss with you how to build a better Chinese word segmentation component.
Integration of jieba.NET and Lucene.Net
There are already two projects that integrate Chinese word segmentation with Lucene.Net: Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg. Referring to their code, we can implement the simplest possible integration: jiebaForLuceneNet. A brief introduction follows.
1. JiebaTokenizer
The main integration point is to define a subclass of Tokenizer. It must implement the abstract method IncrementToken, which is used to advance through the tokens produced from the text stream; this is where the word segmentation component does its work.
public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        var token = tokens[position];
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    End();
    return false;
}
The termAtt and offsetAtt lines need the word, start index, and end index of each token, and these three values are exactly what the JiebaSegmenter.Tokenize method provides, so we only need to call it when initializing JiebaTokenizer:
tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
This gives us all the tokens produced by segmentation. The TokenizerMode.Search parameter makes the result of the Tokenize method contain more fine-grained tokens. For example, the word "linguist" yields four tokens: "[language, (0, 2)], [scientist, (2, 4)], [linguistics, (0, 3)], [linguist, (0, 4)]" (the indices refer to the original Chinese characters), which helps both index creation and searching.
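As a quick illustration (a sketch, not part of jiebaForLuceneNet itself), the tokens and their offsets can be inspected directly; the Word, StartIndex and EndIndex properties are the same ones used in IncrementToken above:

var segmenter = new JiebaSegmenter();
foreach (var token in segmenter.Tokenize("语言学家", TokenizerMode.Search))
{
    // Each token carries the word plus its character offsets in the original text.
    Console.WriteLine("[{0}, ({1}, {2})]", token.Word, token.StartIndex, token.EndIndex);
}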
2. JiebaAnalyzer
The Tokenizer class takes care of segmentation, but to build indexes and to search we also need an Analyzer. JiebaAnalyzer simply needs to call JiebaTokenizer:
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}
Besides JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that case does not matter; the latter filters out stop words. The stop word list used here combines NLTK's English stop words with the Harbin Institute of Technology Chinese stop word list.
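The StopWords set referenced above could be loaded from such a merged list in a straightforward way. A minimal sketch (the file name and location are assumptions made for illustration) might be:

using System.Collections.Generic;
using System.IO;
using System.Linq;

static class StopWordsLoader
{
    // Load one stop word per line into the set that StopFilter expects.
    public static ISet<string> Load(string path = "Resources/stopwords.txt")
    {
        return new HashSet<string>(
            File.ReadAllLines(path)
                .Select(line => line.Trim())
                .Where(line => line.Length > 0));
    }
}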
3. Create indexes and search
When creating the index, pass a JiebaAnalyzer instance to the IndexWriter:
var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // replaces older entry if any
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }

    analyzer.Close();
}
When searching, we first segment the user's input:
private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}
For example, if the user enters "linguist", this function returns the space-separated sub-tokens shown in the earlier example, which prepares the query for the search step (we could also append a * to each word so that documents match even on a partial match). The final search is implemented as follows:
private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}
The searchField parameter can specify a particular field to search in; if it is empty, all fields are searched. With this, the most basic integration is done.
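A hypothetical call, using the field names from the code above, might look like:

// Search only the Title field; pass an empty searchField to search Id, Title and Content together.
var results = SearchQuery("linguist*", "Title");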
The full implementation and sample code of JiebaTokenizer and JiebaAnalyzer can be found in jiebaForLuceneNet.
4. Luke.Net
Luke.Net can inspect the index content generated by Lucene.Net, which is particularly helpful when developing and debugging against Lucene.