When processing English text, the words are already separated, so you can usually split on the spaces between them (although some names and place names, such as New York, need to be treated as a single unit). For Chinese and languages with similar writing systems, dedicated word segmentation is required, and one of the best-known tools for Chinese word segmentation is jieba. Below we introduce jieba's features, underlying principles, and basic usage.
1. Features
1) Supports three segmentation modes
Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis
Full mode: scans out all the words that can be found in the sentence; very fast, but cannot resolve ambiguity
Search engine mode: on top of precise mode, long words are segmented again to improve recall; suitable for segmentation in search engines
2) Supports segmentation of traditional Chinese
3) Supports custom dictionaries (new dictionaries can be loaded, or the built-in dictionary can be updated)
2. Principle
The algorithm behind jieba segmentation consists mainly of the following three parts:
1) Efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all possible words formed by the Chinese characters in a sentence
2) Dynamic programming to find the maximum-probability path, i.e. the best segmentation according to word frequencies
3) For out-of-vocabulary (unknown) words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm
Let's walk through these three parts one by one.
Part 1: efficient word-graph scanning based on a Trie (prefix tree), generating a directed acyclic graph of all possible words formed by the Chinese characters in the sentence.
First, a quick introduction to the Trie tree, which is also called a dictionary tree or prefix tree. Its purpose is to organize strings into a tree shape. Consider the Trie tree built from five Chinese words: "Tsinghua University", "Tsinghua", "fresh", "Chinese", and "Chinese people". The structure of the tree is shown below.
Each block in this tree represents a node. "Root" is the root node and does not represent any character; purple nodes are branch nodes; green nodes are leaf nodes. Every node other than the root contains exactly one character. Going from the root node to a leaf node, the characters along the path are joined together to form a word. The number in a leaf node marks the word's chain in the dictionary tree (there are as many chains as there are words in the dictionary), and chains that share a common prefix form a single branch. In addition, the following features of the Trie tree deserve special emphasis:
1) Words with the same prefix must lie on the same branch. For example, "Tsinghua" and "fresh" both share the prefix character 清 (Qing), so the Trie tree only needs one 清 node, and the 华 and 新 nodes share it as their parent; the two words can thus be stored with only three nodes in total. To some extent this reduces the storage space needed for the dictionary.
2) Words in a Trie tree can only share prefixes, not other parts of a word. For example, although the suffix of "Chinese" (中华) is the prefix of "Chinese people" (华人), they must still occupy two independent chains in the tree; they cannot be chained together head to tail. This also shows that the Trie tree can only compress dictionary storage through common prefixes and cannot share other identical characters inside words. There is one apparent "exception": for compound words, the end of one word may look like the start of another. In the example Trie tree, "Tsinghua University" appears to be built from the two words "Tsinghua" and "university", but the leaf-node markers make it clear that the tree actually contains only the two words "Tsinghua" and "Tsinghua University", which share a prefix, rather than "Tsinghua" and "university". So if "university" were itself a dictionary word, it would have to be built as a new chain starting from the root node.
3) Any complete word in the Trie tree must run from the root node to a leaf node, which means that looking up a word must also start at the root node and end at a leaf node.
To search for a string in the Trie tree, you start from the root node and follow the string character by character along a single chain until a leaf node is reached, which confirms that the string is a word. This brings two advantages:
1) Words with a common prefix lie on the same branch, so the search space is greatly reduced (for example, all chains whose first character differs are excluded immediately).
2) The Trie tree is essentially a deterministic finite automaton (DFA), which means that moving from one node (state) of the Trie tree to another is entirely determined by the state transition function, and the state transition function is essentially a mapping. In other words, when searching the Trie tree character by character, moving from one character to the next does not require traversing all the child nodes of the current node.
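To make the structure concrete, here is a minimal Trie sketch in Python. This is for illustration only: it assumes the five example words above are 清华大学, 清华, 清新, 中华 and 华人, and recent versions of jieba actually keep their prefix dictionary as a flat word-to-frequency dict rather than an explicit node-based tree.

class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # True if the path from the root to here spells a word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:                               # words with a common prefix share these nodes
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True                           # mark the end of a complete word

    def search(self, word):
        node = self.root
        for ch in word:                               # follow the chain character by character
            node = node.children.get(ch)
            if node is None:                          # no transition for this character
                return False
        return node.is_word                           # a complete word must end on a word marker

trie = Trie()
for w in ["清华大学", "清华", "清新", "中华", "华人"]:   # assumed example words
    trie.insert(w)
print(trie.search("清华"))   # True
print(trie.search("大学"))   # False: "大学" only occurs as a suffix, so it has no chain of its own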
After the Trie tree is built, a directed acyclic graph (DAG) of the sentence is generated based on it.
If the string to be segmented has m characters, consider the positions to the left and right of each character; there are then m+1 points, numbered 0 to m. Treating each candidate word as an edge, a segmentation word graph can be generated from the dictionary. This word graph is a directed graph with positive weights. The word graph of the phrase 有意见分歧 ("there are differences of opinion") is shown below.
In this word graph, the edge for 有 starts at 0 and ends at 1; the edge for 有意 starts at 0 and ends at 2; and so on. A segmentation scheme is a path from the source node 0 to the end node 5, and there are two such paths in total.
Path 1: 0-1-3-5, corresponding to segmentation S1: 有 / 意见 / 分歧 (have / opinions / divergence)
Path 2: 0-2-3-5, corresponding to segmentation S2: 有意 / 见 / 分歧 (intend / see / divergence)
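Here is a rough sketch of how such a DAG can be built from a prefix dictionary, in the spirit of jieba's internal get_DAG routine. The toy FREQ dictionary below is an illustrative stand-in (prefixes that are not words get frequency 0), not jieba's real dict.txt; the DAG maps each start index to the index of the last character of every candidate word.

FREQ = {"有": 10, "有意": 5, "意": 0, "意见": 8, "见": 6, "分": 2, "分歧": 7, "歧": 1}

def get_dag(sentence):
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in FREQ:     # extend while the fragment is still a known prefix
            if FREQ[frag] > 0:
                ends.append(i)            # sentence[k:i+1] is a dictionary word: edge k -> i+1
            i += 1
            frag = sentence[k:i + 1]
        if not ends:
            ends.append(k)                # fall back to the single character
        dag[k] = ends
    return dag

print(get_dag("有意见分歧"))
# {0: [0, 1], 1: [2], 2: [2], 3: [3, 4], 4: [4]} -- exactly the two paths described above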
Part 2: dynamic programming to find the maximum-probability path, i.e. the best segmentation according to word frequencies.
The Trie tree used by jieba records the frequency of each word (occurrences divided by the total count, which can be treated as a word's probability when the corpus is large enough). Once the frequency of each word is known, dynamic programming can be used to find the most probable segmentation. Dynamic programming usually searches for the optimal path from left to right, but here it runs from right to left. This is mainly because the informational weight of a Chinese sentence tends to lie toward the end; the latter part carries the backbone of the sentence, so computing from right to left is often more accurate than computing from left to right.
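Below is a sketch of that right-to-left dynamic programming, reusing the toy FREQ dictionary and get_dag from the sketch above. jieba's real implementation works with log frequencies in the same fashion, but the numbers here are only illustrative.

import math

def calc_best_route(sentence, dag, freq):
    total = math.log(sum(freq.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}                          # base case past the last character
    for k in range(n - 1, -1, -1):                 # right to left
        route[k] = max(
            (math.log(freq.get(sentence[k:x + 1]) or 1) - total + route[x + 1][0], x)
            for x in dag[k]
        )
    return route

sentence = "有意见分歧"
route = calc_best_route(sentence, get_dag(sentence), FREQ)

k, words = 0, []
while k < len(sentence):                           # walk the best path to read off the words
    x = route[k][1]
    words.append(sentence[k:x + 1])
    k = x + 1
print(words)   # ['有', '意见', '分歧'] with this toy dictionary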
Part 3: for out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm.
Out-of-vocabulary words are words not recorded in the dictionary (the Trie tree). Such words are segmented with an HMM model (so even if the dictionary is removed, segmentation is still possible). How is the HMM model constructed? First, four states BEMS are defined: B is begin, the starting position of a word; E is end, the ending position; M is middle, a position inside a word; and S is single, a character that forms a word on its own, with nothing before or after it. The states (B, E, M, S) are used to label the characters of Chinese words. For example, 北京 (Beijing) can be labeled BE, i.e. 北/B 京/E, meaning 北 is the starting position and 京 is the ending position; 中华民族 (the Chinese nation) can be labeled BMME: begin, middle, middle, end.
The author of jieba trained on a large corpus and obtained the three files in the finalseg directory (prob_trans.py, prob_emit.py, prob_start.py).
Three probability tables are computed in total:
prob_trans.py
1) The position transition probabilities, i.e. the transition probabilities between the four states B (begin), M (middle), E (end), and S (single-character word);
{'B': {'E': 0.8518218565181658, 'M': 0.14817814348183422},
 'E': {'B': 0.5544853051164425, 'S': 0.44551469488355755},
 'M': {'E': 0.7164487459986911, 'M': 0.2835512540013088},
 'S': {'B': 0.48617017333894563, 'S': 0.5138298266610544}}
P(E|B) = 0.851 and P(M|B) = 0.149, which shows that when we are at the beginning of a word, the probability that the next character ends the word is much higher than the probability that the next character is in the middle of the word. This matches intuition, since two-character words are more common than longer words.
prob_emit.py
2) The emission probability from a position state to a character, e.g. P("和"|M), the probability that the character 和 ("and") appears in the middle of a word;
prob_start.py
3) The probability that a word begins in a given state; in practice there are only two possibilities, B or S. This is the starting vector, i.e. the initial state distribution of the HMM system. The transitions between B, E, M and S are somewhat similar to a bigram (2-gram) model, i.e. a two-element model of the probability of one word following another, which is a special case of the n-gram model. For example, generally speaking, the probability of 北京 (Beijing) appearing after 中国 (China) is greater than the probability of 北海 (Beihai) appearing after 中国; that is, "中国北京" is more likely than "中国北海" to be a Chinese phrase. It is not certain, however, that the author actually uses a bigram model here; given how good jieba's segmentation performance is, it presumably does not rely on just a simple bigram model.
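Below is a sketch of the Viterbi decoding over the B/E/M/S states, in the spirit of jieba.finalseg. The probability tables are assumed to hold log probabilities like those in prob_start.py, prob_trans.py and prob_emit.py; PREV encodes which states may legally precede each state (for example, M can only follow B or M).

MIN_FLOAT = -3.14e100                                 # stand-in for log(0)
PREV = {'B': 'ES', 'M': 'MB', 'S': 'SE', 'E': 'BM'}   # legal predecessor states

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]                                          # V[t][state] = best log-probability so far
    path = {}
    for s in states:                                  # initialise with the first character
        V[0][s] = start_p.get(s, MIN_FLOAT) + emit_p[s].get(obs[0], MIN_FLOAT)
        path[s] = [s]
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            em = emit_p[s].get(obs[t], MIN_FLOAT)
            prob, prev = max(                         # best legal predecessor for state s
                (V[t - 1][p] + trans_p[p].get(s, MIN_FLOAT) + em, p) for p in PREV[s]
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, last = max((V[-1][s], s) for s in 'ES')     # a word can only end in E or S
    return path[last]

Given the decoded state sequence, the text is then cut after every E and every S to produce the new words.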
So the overall jieba segmentation process is:
1) Build the Trie tree
2) Given a sentence, use regular expressions to extract runs of contiguous Chinese and English characters and cut them into a list of phrase blocks. For each block, construct the DAG (first generate the DAG, then look up the dictionary) and use dynamic programming to find the maximum-probability path. Characters in the DAG that are not found in the dictionary are combined into new fragments, which are segmented with the HMM model.
3) Use Python's yield syntax to return a generator over the resulting words.
3. Application
1) Segmentation
jieba.cut
This method accepts three input parameters: the string to segment, the cut_all parameter controlling whether full mode is used, and the HMM parameter controlling whether the HMM model is used.
jieba.cut_for_search
This method accepts two parameters: the string to segment and whether to use the HMM model. It is suitable for building inverted indexes for search engines, with a finer granularity. The string to segment can be a Unicode string, a UTF-8 string, or a GBK string.
Note: it is not recommended to pass in GBK strings directly, as they may be incorrectly decoded as UTF-8.
jieba.cut and jieba.cut_for_search return a generator; you can iterate over it with a for loop to obtain each word (as Unicode), or use jieba.lcut and jieba.lcut_for_search, which return a list directly.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer; all global segmentation functions are mappings to this tokenizer.
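A minimal usage sketch of the calls above (the sample sentences are just placeholders):

import jieba

text = "我来到北京清华大学"

print("/".join(jieba.cut(text, cut_all=True)))     # full mode
print("/".join(jieba.cut(text, cut_all=False)))    # precise mode (the default)
print("/".join(jieba.cut_for_search("小明硕士毕业于中国科学院计算所")))   # search engine mode
print(jieba.lcut(text))                            # like cut(), but returns a list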
2) Adding a custom dictionary
Loading a dictionary:
Developers can specify their own custom dictionary to include words that are not in the jieba vocabulary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary
The dictionary format is the same as dict.txt: one word per line, where each line has three parts separated by spaces in this fixed order: the word, the word frequency (optional), and the part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
When the word frequency is omitted, an automatically computed frequency is used that guarantees the word can be segmented out.
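A small sketch of what this looks like in practice; userdict.txt is a hypothetical path and the entries are only examples of the "word [frequency] [POS]" format:

# userdict.txt (UTF-8), one word per line:
#   云计算 5
#   创新办 3 i
#   八一双鹿

import jieba
jieba.load_userdict("userdict.txt")   # loaded on top of the default dictionary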
Adjusting the dictionary:
Use add_word(word, freq=None, tag=None) and del_word(word) to dynamically modify the dictionary in your program.
Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented out.
Note: the automatically computed word frequencies may not be effective when the HMM new-word discovery feature is enabled.
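A sketch of adjusting the dictionary at runtime; the words and sentences are small illustrative examples:

import jieba

jieba.add_word("石墨烯")                 # make sure "石墨烯" is treated as one word
jieba.del_word("自定义词")               # remove a word from the dictionary

jieba.suggest_freq("台中", tune=True)    # raise the frequency so "台中" stays together
print("/".join(jieba.cut("「台中」正确应该不会被切开", HMM=False)))

jieba.suggest_freq(("中", "将"), tune=True)   # make "中将" split into "中" / "将" here
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))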
3) Keyword Extraction
Keyword extraction based on the TF-IDF algorithm:
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
sentence is the text to extract keywords from
topK is how many keywords with the highest TF-IDF weights to return; the default is 20
withWeight controls whether the keyword weights are returned as well; the default is False
allowPOS keeps only words with the specified parts of speech; the default is empty, i.e. no filtering
jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the path to an IDF frequency file
Keyword extraction based on the Textrank algorithm:
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) is used directly; the interface is the same, but note that parts of speech are filtered by default.
jieba.analyse.TextRank() creates a new custom TextRank instance
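A minimal sketch of both interfaces; the sample text is an arbitrary placeholder:

import jieba.analyse

text = "京津冀协同发展是国家的一项重大战略，交通一体化是其中的重要内容。"

# TF-IDF based extraction
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)

# TextRank based extraction (note the default POS filter)
for word, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print(word, weight)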
4) POS Tagging
jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom part-of-speech tagger; the tokenizer parameter specifies the jieba.Tokenizer used internally for segmentation. jieba.posseg.dt is the default part-of-speech tagger.
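A minimal sketch using the default tagger:

import jieba.posseg as pseg

for word, flag in pseg.cut("我爱北京天安门"):
    print(word, flag)
# e.g. 我 r / 爱 v / 北京 ns / 天安门 ns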
5) Parallel segmentation
Principle: the target text is split by lines, the lines are distributed to multiple Python processes for parallel segmentation, and the results are then merged, giving a considerable speedup.
It is based on Python's multiprocessing module and currently does not support Windows.
Usage:
jieba.enable_parallel(4)
# Turn on parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel()
# Turn off parallel segmentation mode
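A sketch of segmenting a file in parallel; corpus.txt is a placeholder path, and parallel mode only supports the default tokenizers (jieba.dt and jieba.posseg.dt):

import jieba

jieba.enable_parallel(4)                   # 4 worker processes (not supported on Windows)

with open("corpus.txt", "rb") as f:
    content = f.read().decode("utf-8")
words = "/".join(jieba.cut(content))

jieba.disable_parallel()                   # back to single-process mode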
6) Tokenize: returns the start and end positions of each word in the original text
jieba.tokenize(text)
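A small sketch; jieba.tokenize yields (word, start, end) tuples, requires Unicode input, and also accepts mode='search' for finer granularity:

import jieba

for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
    print(word, start, end)
# e.g. 永和 0 2 / 服装 2 4 / 饰品 4 6 / 有限公司 6 10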
7) Lazy loading mechanism
jieba uses lazy loading: import jieba and jieba.Tokenizer() do not immediately trigger loading of the dictionary; the dictionary is only loaded and the prefix dictionary built once it is actually needed. If you want to initialize jieba up front, you can do so manually:
jieba.initialize()
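A small sketch of manual initialization; the dictionary path is a placeholder, and jieba.set_dictionary can optionally point jieba at a different main dictionary before initializing:

import jieba

jieba.set_dictionary("data/dict.txt.big")   # optional: use a different main dictionary
jieba.initialize()                          # build the prefix dictionary now instead of on first use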