Cover image: by Lucas Davies
I. Preface
Word segmentation is a concept that most front-end developers will probably never come across. That doesn't stop us from understanding it; after all, we should grow in many directions. Today I'll give a brief introduction to word segmentation, try to describe the concepts in accessible language, and offer a working solution at the end. I hope it helps you.
Put simply, word segmentation is cutting a sentence into separate words according to their meaning. That may not say much by itself, so let's look at where it applies. Word segmentation is the foundation of text mining and is commonly used in natural language processing, word search, recommendation, and similar fields.
II. Word Segmentation: Theory and Algorithms

2.1 What Is Word Segmentation?
Let's first understand the concept of word segmentation.
Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain conventions. In English, words are separated by spaces, but Chinese has no explicit delimiter.

It is precisely this missing delimiter that makes segmenting Chinese text prone to ambiguity. For example, the classic sentence 乒乓球拍卖完了 can be read as 乒乓球/拍卖/完了 ("the ping-pong balls have been auctioned off") or as 乒乓球拍/卖/完了 ("the ping-pong paddles are sold out").
2.2 Word Segmentation Algorithms
Chinese word segmentation is difficult, but there are mature solutions. Existing segmentation algorithms fall into three categories:
- String-matching-based segmentation
- Understanding-based segmentation
- Statistics-based segmentation
1. String-matching-based segmentation
This approach, also known as mechanical segmentation, maintains a large dictionary in advance and matches the sentence against the dictionary's entries; wherever a match succeeds, a word has been found.

In practice it is a little more complicated, because once the dictionary is large enough, different matching algorithms come into play; we won't open that topic here. Efficient word-graph scanning is usually implemented on top of a Trie (prefix tree), as in the sketch below.
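To make the idea concrete, here is a minimal sketch of the simplest mechanical method, forward maximum matching, over a toy dictionary. The dictionary and names are invented for illustration; a production system would use a far larger dictionary stored in a Trie.

```python
# A minimal sketch of mechanical (dictionary-matching) segmentation using
# forward maximum matching. The toy dictionary is invented for illustration.
DICT = {"来到", "北京", "清华", "清华大学", "大学"}
MAX_WORD_LEN = max(len(w) for w in DICT)

def forward_max_match(sentence):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
            if sentence[i:j] in DICT or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

print("/ ".join(forward_max_match("我来到北京清华大学")))
# -> 我/ 来到/ 北京/ 清华大学
```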
2. Understanding-based segmentation
This approach has the computer imitate a human's understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use that syntactic and semantic information to resolve ambiguity.

It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control module. Under the coordination of the control module, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge ambiguities; in other words, it simulates the process by which a human understands a sentence. Because Chinese linguistic knowledge is so broad and complex, it is hard to encode all of it in a form machines can read directly, so understanding-based segmentation systems are still largely experimental.
3. Statistics-based segmentation
Here, a large amount of already-segmented text is supplied, and a statistical machine learning model learns the rules of segmentation from it (this is called training), so that it can then segment text it has never seen.

With the construction of large-scale corpora and the progress of statistical machine learning methods, statistics-based Chinese word segmentation has gradually become the mainstream approach.
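At its simplest, the "training" step is just counting. Here is a toy sketch, with an invented three-line corpus, of estimating word probabilities from pre-segmented text; real systems model much more than unigram frequencies.

```python
# Toy illustration of "training" on an already-segmented corpus
# (words separated by spaces). The corpus is invented for illustration.
from collections import Counter

corpus = ["我 来到 北京", "我 爱 北京", "北京 大学"]
counts = Counter(w for line in corpus for w in line.split())
total = sum(counts.values())

word_prob = {w: c / total for w, c in counts.items()}
print(word_prob["北京"])  # 3 of 8 tokens -> 0.375
```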
2.3 Requirements for Word Segmentation
Although segmentation algorithms are simple to explain, experience so far shows that there is almost no universal segmentation system that performs well everywhere.
Each field has its own special vocabulary, which makes it hard to capture every linguistic feature from limited training data. For example, a segmenter trained on the People's Daily will not segment web fantasy novels well.
This is inevitable: in word segmentation, there is no silver bullet.
Different scenarios also place very different requirements on a segmenter, usually along two dimensions: segmentation speed and segmentation accuracy.

In word search, for example, speed matters more than accuracy. In some question-answering systems, which need a deeper understanding of the text, accuracy matters more than speed.

Different fields and different scenarios demand different things, so we cannot judge a segmenter's accuracy one-sidedly. Moreover, as new words appear and training data changes, segmentation accuracy fluctuates too. That is why fewer and fewer companies nowadays boast about their segmentation accuracy.
2.4 Word Segmentation Solutions
Word segmentation solves real problems, and after a long period of iteration the market has produced a number of distinctive segmenters, for example IK, Jieba, Ansj, HanLP, the Stanford segmenter, and so on.

Look into each of them if you are interested; next I will take one open-source library, Jieba, and explain it.
III. Jieba

3.1 Advantages of Jieba
Jieba is open source and billed as the best Chinese word segmentation component for Python. It is MIT-licensed, so you can use it without worry.

Jieba is also very simple to use: a few lines of code give you segmentation and part-of-speech tagging, at a respectable speed.

Internally it maintains a dictionary, built from the author's analysis of the People's Daily; new words outside the dictionary are recognized with an HMM model.
It offers three segmentation modes: precise mode, full mode, and search mode. Full mode finds all possible words, while search mode starts from precise mode and further splits long words, improving recall for search engines.

As for speed, precise mode reaches about 400 KB/s and full mode about 1.5 MB/s. Besides the Python original, others have ported Jieba to a variety of languages, including JavaScript, Java, Golang, R, PHP, and so on.
Using Jieba
The Jieba code is compatible with Python 2 and 3. Install it with one of the following commands before use:

```
pip install jieba
pip3 install jieba
```
I won't walk through the full API here; if you are interested, see the documentation on GitHub (the address is at the end of the article).

Here is a simple code example to give a feel for the convenience and power of Jieba.
```python
# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
```
The output:
```
[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Precise mode]: 我/ 来到/ 北京/ 清华大学
[New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦
(here "杭研" is not in the dictionary, but the Viterbi algorithm still recognized it)
[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
```
As mentioned earlier, Jieba maintains a phrase dictionary internally. If you need proper nouns segmented your own way, you can also supply a custom dictionary via `jieba.Tokenizer(dictionary=DEFAULT_DICT)`.
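For instance, words can be added at runtime or loaded from a user dictionary file. A small sketch using Jieba's documented `add_word` and `load_userdict` APIs (the words chosen here are just examples):

```python
import jieba

print("/ ".join(jieba.cut("他来到了网易杭研大厦")))

# Register a new word at runtime so it is never split apart.
jieba.add_word("杭研大厦")
print("/ ".join(jieba.cut("他来到了网易杭研大厦")))

# Alternatively, load a whole user dictionary file (hypothetical file name),
# one entry per line in the form: word [frequency] [part-of-speech]
# jieba.load_userdict("userdict.txt")
```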
3.2 Jieba's Segmentation Algorithm
Matching algorithms can get complicated; here is a brief introduction to the principles behind Jieba's segmentation.
First, Jieba ships with a dictionary, dict.txt, containing over 20,000 entries along with their occurrence counts and parts of speech, which the author trained himself from People's Daily data.

Jieba first loads the dictionary data into a Trie, the well-known prefix tree: words that share their first few characters share a prefix and can be stored along the same path, which makes lookup fast.

Then, when a sentence needs to be segmented, Jieba uses the Trie built earlier to generate a directed acyclic graph (DAG). The point of this step is to find every possible word in the sentence, which removes ambiguity from the segmentation and improves its accuracy.

At this point, the segmentation of all words recorded in the dictionary is basically complete.
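As a simplified illustration of those two steps (a sketch in the spirit of Jieba's precise mode, not its actual source), we can build the DAG of candidate words from a toy frequency dictionary and then pick the most probable path through it by dynamic programming:

```python
# Sketch of DAG construction plus max-probability routing over a toy
# frequency dictionary (word -> count). All values are invented.
import math

FREQ = {"我": 5, "来到": 4, "北京": 6, "清华": 3, "大学": 4, "清华大学": 5}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    """DAG[i] lists every j such that sentence[i:j+1] is a known word;
    a lone character is always allowed as a fallback."""
    return {i: [j for j in range(i, len(sentence))
                if sentence[i:j + 1] in FREQ] or [i]
            for i in range(len(sentence))}

def best_route(sentence, dag):
    """Right-to-left dynamic programming: route[i] holds the best
    (log-probability, end-index) for the suffix starting at i."""
    route = {len(sentence): (0.0, 0)}
    for i in range(len(sentence) - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL)
             + route[j + 1][0], j)
            for j in dag[i])
    return route

sentence = "我来到北京清华大学"
route, i, words = best_route(sentence, get_dag(sentence)), 0, []
while i < len(sentence):
    j = route[i][1]
    words.append(sentence[i:j + 1])
    i = j + 1
print("/ ".join(words))  # -> 我/ 来到/ 北京/ 清华大学
```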
But if you delete the dict.txt dictionary, Jieba can still segment text; the resulting words just mostly have length 2. That is because, for words not in the dictionary, it predicts the segmentation with a Hidden Markov Model (HMM), decoded with the Viterbi algorithm.

In the HMM, each Chinese character is tagged with one of four states, BEMS: B (begin) marks the start of a word, E (end) the end, M (middle) a middle position, and S (single) a character that forms a word by itself, with nothing before or after. In other words, it tags Chinese text with the four states (B, E, M, S). For example, 北京 can be tagged BE, i.e. 北/B 京/E, meaning 北 is the start position and 京 the end position; 中华民族 can be tagged BMME: begin, middle, middle, end.
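Below is a minimal, hedged sketch of Viterbi decoding over the four BEMS states. All the probabilities are made-up toy numbers; Jieba's real trained parameters live in its finalseg package.

```python
# Toy Viterbi decoder over the B/M/E/S character states. The start,
# transition, and emission probabilities are invented for illustration.
import math

STATES = ["B", "M", "E", "S"]
start = {"B": math.log(0.6), "S": math.log(0.4), "M": -1e9, "E": -1e9}
trans = {  # which states may follow which, with toy log-probabilities
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}

def viterbi(chars, emit):
    """emit(state, char) -> log emission probability (a toy callable)."""
    V = [{s: start[s] + emit(s, chars[0]) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in chars[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            prob, prev = max(
                (V[-2][p] + trans[p].get(s, -1e9) + emit(s, ch), p)
                for p in STATES)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(("E", "S"), key=lambda s: V[-1][s])  # a word ends in E or S
    return path[best]

def tags_to_words(chars, tags):
    """Cut the character stream after every E or S tag."""
    words, word = [], ""
    for ch, tag in zip(chars, tags):
        word += ch
        if tag in ("E", "S"):
            words.append(word)
            word = ""
    return words

chars = list("北京欢迎你")
tags = viterbi(chars, lambda s, ch: -1.0)  # uniform toy emissions
print(tags_to_words(chars, tags))          # one plausible grouping
```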
By training on a large corpus, the author obtained the model that ships under the finalseg directory; study it yourself if you are interested.
By now things should be basically clear. Jieba's segmentation consists of three main steps:
- Load the dict.txt dictionary and build the Trie tree.
- For the sentence to be segmented, use the Trie to build a DAG matching all possible words.
- Finally, use the HMM model to handle words not covered by the dictionary.
That is the execution flow of Jieba's segmentation.
IV. Jieba for Java and Android

4.1 The Java Version of Jieba
Jieba has grown to support numerous versions. The Java edition was not developed by the original author; Huaban developed it following the original author's segmentation theory.

However, the Java version is not as fully featured as the original Python version; some features, such as keyword extraction, have been stripped out and are not implemented.
If you are interested, go straight to GitHub: github.com/huaban/jieba-analysis/
1. Add the dependency (stable version)
```xml
<dependency>
  <groupId>com.huaban</groupId>
  <artifactId>jieba-analysis</artifactId>
  <version>1.0.2</version>
</dependency>
```
2. How to use
```java
@Test
public void testDemo() {
    JiebaSegmenter segmenter = new JiebaSegmenter();
    String[] sentences = new String[] {
        "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。",
        "我不喜欢日本和服。",
        "雷猴回归人间。",
        "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
        "结果婚的和尚未结过婚的"};
    for (String sentence : sentences) {
        System.out.println(segmenter.process(sentence, SegMode.INDEX).toString());
    }
}
```
3. Performance evaluation
The author ran tests on a machine with the following configuration:

Processor: 2 × Intel(R) Pentium(R) CPU G620 @ 2.60GHz
Memory: 8GB
The results were quite good: single-threaded, reading the test text line by line and calling the segmenter tens of thousands of times in a loop, the measured efficiency was:
```
Looping 10,000 times
Run 1: time elapsed: 12373, rate: 2486.986533 kb/s, words: 917319.94/s
Run 2: time elapsed: 12284, rate: 2505.005241 kb/s, words: 923966.10/s
Run 3: time elapsed: 12336, rate: 2494.445880 kb/s, words: 920071.30/s

Looping 20,000 times
Run 1: time elapsed: 22237, rate: 2767.593144 kb/s, words: 1020821.12/s
Run 2: time elapsed: 22435, rate: 2743.167762 kb/s, words: 1011811.87/s
Run 3: time elapsed: 22102, rate: 2784.497726 kb/s, words: 1027056.34/s

Summary: dictionary load time about 1.8 s; segmentation throughput
over 2 MB per second, close to 1,000,000 words per second.

On a 2-core Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 12G RAM:
time elapsed: 19597, rate: 3140.428063 kb/s, words: 1158340.52/s
time elapsed: 20122, rate: 3058.491639 kb/s, words: 1128118.44/s
```
4.2 Using Jieba on Android
The Jieba (Java) version bundles its own dictionary, so adding it to an Android project will increase the APK size, and there is no good way around that. Device configuration will also affect segmentation efficiency.

Still, if you want to use it on an Android device, for example to preprocess search terms, it is perfectly feasible.

Jieba (Java) is managed with Maven, so a little Gradle configuration is needed to bring it in.
1. Configure build.gradle
```groovy
repositories {
    google()
    jcenter()
    mavenCentral()
}
```
2. Add the dependency
```groovy
api 'com.huaban:jieba-analysis:1.0.2'
```
Once the dependency is in place there is little more to say about usage; it is no different from the plain Java version.
References:
github.com/fxsjy/jieba
github.com/huaban/jieba-analysis/
http://www.infoq.com/cn/articles/nlp-word-segmentation
"Online Round Table" Recommended my knowledge Planet, a year 50 quality problems, on the table online learning.
Public number back to growth " growth ", will be prepared by my study materials, can also reply to " Dabigatran ", learning progress together, you can also reply to " ask questions " and ask me questions.
Recommended reading:
Writing is the core competency | Google Engineer Decrypts "guess Little song" | Illustration: HTTP Range Request | Android P Adaptation Experience | Technology Entrepreneurship Selection Checklist | HTTP Transfer Encoding | What is consuming you? | HTTP Content Encoding | Schematic HTTP Cache | Chat About HTTP Cookies | Auxiliary Mode Combat | Accessibility Auxiliary Mode | Small Program Flex Layout | Good PR makes you more reliable | The way of password management