Pangu Word Segmentation: Function Overview


Author: eaglet
Two years ago I developed KTDictSeg, a Chinese word segmentation component, which has been well received by many users since its release. However, KTDictSeg has many problems: I developed it in a hurry, did not lay a good foundation, and had only a superficial understanding of word segmentation at the time. I have long wanted to develop a better open-source word segmentation component, but never had the spare time. Last week I finally made up my mind to start, and after two weeks of development (in my spare time) I completed V1.0 of pangu word segmentation. Pangu word segmentation is completely different from KTDictSeg; I rewrote almost all of the algorithms. Segmentation is about five times faster than KTDictSeg (about 10 times faster with multiple threads), memory usage is only about half, accuracy is significantly higher, and there are many more functions. Below I briefly introduce the basic functions of the pangu word segmentation component, in the hope that it helps those who need it.
PanGuSegment
Project homepage: pangu word segmentation project homepage
Open source license: Apache License 2.0
Commercial use: free authorization for commercial applications
Same project Homepage

Features

Chinese Word Segmentation
    • Chinese unregistered word recognition

Pangu word segmentation can automatically recognize unregistered (out-of-vocabulary) words that are not in the dictionary.

    • Word frequency first

Pangu word segmentation can resolve word segmentation ambiguity based on word frequency.
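The idea of resolving ambiguity by word frequency can be sketched as follows. This is only an illustration of one common approach (dynamic programming over log frequencies), not pangu's actual algorithm; the toy dictionary, its counts, and the function name are all hypothetical.

```python
import math

# Hypothetical toy dictionary: word -> frequency count (not pangu's data).
WORD_FREQ = {
    "研究": 500,
    "研究生": 200,
    "生命": 400,
    "命": 50,
    "的": 5000,
    "起源": 150,
}


def segment_by_frequency(text, freq=WORD_FREQ):
    """Return the segmentation whose words have the highest summed log frequency."""
    total = sum(freq.values())
    max_len = max(len(word) for word in freq)
    n = len(text)
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif end - start == 1:
                count = 0.5  # unknown single characters get a small pseudo-count
            else:
                continue  # unknown multi-character strings are not candidates
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1]


print("/".join(segment_by_frequency("研究生命的起源")))
```

With these made-up counts, the ambiguous prefix is split as 研究/生命 rather than 研究生/命, because the first pair of words is more frequent overall.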

    • Multi-granularity word segmentation

Pangu word segmentation can output several segmentations of the same text, which addresses the trade-off between segmentation granularity and segmentation accuracy.
For details, refer to "Pangu word segmentation version function introduction - multi-granularity word segmentation".
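As a rough illustration of what multiple outputs can look like, the sketch below emits each segmented word together with any shorter dictionary words it contains, each tagged with its position. The function name and the sample dictionary are assumptions made for this example and do not reflect pangu's implementation.

```python
def multi_granularity(words, dictionary, min_len=2):
    """Emit each word plus the shorter dictionary words inside it, with positions."""
    tokens = []
    position = 0
    for word in words:
        tokens.append((word, position))
        for start in range(len(word)):
            for end in range(start + min_len, len(word) + 1):
                sub = word[start:end]
                if sub != word and sub in dictionary:
                    tokens.append((sub, position + start))
        position += len(word)
    return tokens


# Hypothetical dictionary of shorter words.
SUB_WORDS = {"中华", "人民", "共和", "共和国"}
for token in multi_granularity(["中华人民共和国"], SUB_WORDS):
    print(token)
```

A search index that stores all of these tokens can match both the long compound and its parts.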

    • Chinese name recognition

Pangu word segmentation is a major improvement over KTDictSeg in Chinese name recognition. Here is a brief demonstration.
Input: "Zhang San said it is true"
Word splitting result: Zhang San/said/Confirmed/handled/
However, if you enter "Li San bought a triangle table"
Word splitting result: Li San/bought/a/triangle/table/
In the first sentence, pangu word segmentation identifies Zhang San as a person's name and outputs it as a name. In the second sentence, it identifies that the characters of "Zhang San" occurring in the sentence do not form a person's name, and so does not output them as a Chinese name.
For details, see "Pangu word segmentation - Chinese Name Recognition".

    • Forced single-character output

Some projects need single Chinese characters to be output alongside the accurate segmentation result, so that the search component can match text at any granularity. Pangu word segmentation provides this function. In the segmentation result, the accurately segmented words carry a higher weight and the single characters carry a lower weight. You can use these weights to control how the search component ranks its matches.
For example, input: "Zhang San said it is true"
Word splitting result:Zhang ()/Zhang 3 ()/3 said ()/3 ()/said ()/confirmed) /indeed ()/real ()/In )/
In each pair of parentheses, the first number is the position of the word in the sentence and the second number is its weight; the same applies below.
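The combination of accurate words and forced single characters, each carrying a position and a weight, can be sketched like this. The `Token` type, the function name, and the weight values 5 and 1 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    position: int  # offset of the word's first character in the input
    weight: int


def with_single_characters(words, word_weight=5, single_weight=1):
    """Add low-weight single-character tokens next to each multi-character word."""
    tokens = []
    position = 0
    for word in words:
        tokens.append(Token(word, position, word_weight))
        if len(word) > 1:
            for offset, char in enumerate(word):
                tokens.append(Token(char, position + offset, single_weight))
        position += len(word)
    return tokens


for token in with_single_characters(["张三", "说"]):
    print(f"{token.word}({token.position},{token.weight})")
```

Because the single characters have a lower weight, a search component can rank documents that match the whole word above those that match only one character.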

    • Traditional Chinese word segmentation

Pangu word segmentation supports traditional Chinese word segmentation. Search on many Chinese sites does not segment traditional Chinese, and this includes the search on the Blog Garden (cnblogs) site. If you enter "My selections" in traditional Chinese in its search box, you will find only one matching record, in which "My selections" appears as one unbroken string. If you enter the two words separately, however, all records containing "my" and "selected" are found. From this test we can infer that when the Blog Garden search handles traditional Chinese, it simply splits on spaces or symbols and cannot break up a run of consecutive Chinese characters.
Pangu word segmentation can segment traditional Chinese.
Again, input: "My selections"
The word splitting result is: My/select/

    • Simultaneous simplified and traditional Chinese output

This feature is also quite interesting. If you search Google for "My choice", you will find that both the simplified and the traditional forms of "My choice" are matched. To achieve this, both simplified and traditional Chinese characters must be output during word segmentation.
Again, input: "My selections"
The word splitting result is )/
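Outputting both scripts can be sketched as emitting each word in its original form plus a converted form with a lower weight. The two-character conversion table, the function names, and the weight values below are assumptions for this example; a real converter needs a full traditional-to-simplified table.

```python
# Hypothetical, deliberately tiny traditional -> simplified table.
TRADITIONAL_TO_SIMPLIFIED = {"選": "选", "擇": "择"}


def to_simplified(word):
    """Convert a word character by character, leaving unknown characters as-is."""
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(char, char) for char in word)


def expand_variants(words, original_weight=5, converted_weight=1):
    """Return (word, weight) pairs: the original text plus any converted variant."""
    tokens = []
    for word in words:
        tokens.append((word, original_weight))
        converted = to_simplified(word)
        if converted != word:
            tokens.append((converted, converted_weight))
    return tokens


print(expand_variants(["我", "的", "選擇"]))
```

Giving the converted form a lower weight lets a search component prefer documents written in the same script as the query while still matching the other script.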

    • Chinese part-of-speech output

Pangu word segmentation can output the part of speech of registered (in-dictionary) Chinese words, so that users can process them further.

    • Support for full-width characters

Pangu word segmentation can recognize full-width letters and numbers

English Word Segmentation

    • English word segmentation

English words are usually separated by spaces and other symbols, so this is relatively simple, and pangu word segmentation naturally handles English without difficulty.

    • English special word recognition

Some English abbreviations mix letters with symbols, or letters with digits, and cannot simply be split on spaces and symbols. For words that mix letters and symbols (for example, "U.S.A"), pangu word segmentation outputs the whole word as long as the word has been entered into the dictionary. Words that mix letters and digits are automatically output as whole words.
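A minimal sketch of this behavior, assuming a regular-expression tokenizer rather than pangu's own matcher: dictionary entries that contain symbols are matched first as whole words, and any run of letters and digits is kept together. The `SPECIAL_WORDS` set and the function names are hypothetical.

```python
import re

# Hypothetical dictionary entries that contain symbols.
SPECIAL_WORDS = {"U.S.A.", "C++", ".NET"}


def build_pattern(special_words):
    """Match dictionary entries first (longest first), then letter/digit runs."""
    ordered = sorted(special_words, key=len, reverse=True)
    parts = [re.escape(word) for word in ordered]
    parts.append(r"[A-Za-z0-9]+")
    return re.compile("|".join(parts))


def tokenize_english(text, special_words=SPECIAL_WORDS):
    return build_pattern(special_words).findall(text)


print(tokenize_english("I use C++ and .NET in the U.S.A. on my MP3 player"))
```

Note that "MP3" stays whole without being in the dictionary, while "C++" is kept whole only because it was registered.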

    • English original word output (available in later versions)
    • English upper- and lower-case output (available in later versions)
Other functions
    • Stop word filtering

Some punctuation marks, hyphens, and auxiliary words sometimes need to be filtered out during word segmentation. Pangu word segmentation provides a stopword.txt file. You only need to add the words to be filtered to this file and enable the stop word filter, and these words are filtered out.
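Stop word filtering of this kind can be sketched in a few lines. The file format assumed here (one word per line, UTF-8) and the function names are assumptions for illustration; check the stopword.txt shipped with the release for its real format.

```python
from pathlib import Path


def load_stopwords(path):
    """Read stop words from a file, assuming one word per line in UTF-8."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_stopwords(words, stopwords, enabled=True):
    """Drop stop words from a segmentation result when the filter is enabled."""
    if not enabled:
        return list(words)
    return [word for word in words if word not in stopwords]
```

For example, with a file containing "的" and "，", calling `filter_stopwords(["我", "的", "选择"], load_stopwords("stopword.txt"))` keeps only "我" and "选择".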

    • Set the word segmentation weight

Pangu word segmentation allows you to set custom weights for the following:
Unregistered word weight
Best-match word weight
Second-best-match word weight
Third-best-match word weight
Weight of forcibly output single characters
Numeric weight
English word weight
Symbol weight
Weight of the non-original form when simplified and traditional Chinese are output at the same time

    • Dictionary management

Pangu word segmentation provides a dictionary management tool, dictmanage. With this tool, you can add, modify, and delete words in the dictionary.

    • Dynamic dictionary loading

After you use the dictionary tool to add, modify, or delete words in the dictionary, pangu word segmentation automatically loads the new dictionary file without needing to be restarted.
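One simple way to get this kind of reload without a restart is to compare the dictionary file's modification time before use. The sketch below assumes a plain-text dictionary with one word per line and a hypothetical `ReloadingDictionary` class; it does not describe how pangu detects changes internally.

```python
import os


class ReloadingDictionary:
    """Word set that reloads itself when the backing file's mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.words = set()
        self.refresh()

    def refresh(self):
        """Reload the file if it changed since the last load; return True if reloaded."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        with open(self.path, encoding="utf-8") as handle:
            self.words = {line.strip() for line in handle if line.strip()}
        self._mtime = mtime
        return True

    def __contains__(self, word):
        self.refresh()
        return word in self.words
```

Checking the modification time on every lookup is cheap but not free; a long-running service might instead call `refresh()` on a timer.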

    • Keyword highlighting component

Lucene provides a keyword highlighting component, but it does not support Chinese very well, and it does even worse with multi-granularity segmentation output. Pangu word segmentation provides its own keyword highlighting component, PanGu.HighLight, for Chinese and English. Its support for Chinese is better than Lucene's.

    • Synonym output (available in later versions)
    • Lucene.Net interface and example

In the pangu4lucene package, I made a simple news search web example program using pangu + Lucene. The release package contains instructions for use.

Performance indicators
On a Core Duo 1.8 GHz, single-thread word segmentation speed is 390K characters per second, and 2-thread word segmentation speed is 690K characters per second.

Other Instructions
Pangu word segmentation ships with a dictionary of 170,000 common Chinese words, but this dictionary is still incomplete. For more accurate word segmentation, you need to maintain this dictionary appropriately.
Chinese name recognition depends on three files: chssinglename.txt, chsdoublename1.txt, and chsdoublename2.txt, which respectively hold single-character given names, the first character of two-character given names, and the last character of two-character given names. If some names are not segmented correctly, you need to maintain these three files.

V1.0.0.2 demo

From this test, pangu word segmentation shows outstanding recognition of Chinese names, and recognition of unregistered words has also improved. However, the result has one problem, in the phrase "a one-time payment of one hundred yuan" (一次性交一百元). The normal reading is one-time/pay/one hundred/yuan (一次性/交/一百/元), but pangu word segmentation splits it as once/sex/one hundred/yuan (一次/性交/一百/元). Although this is a bit risqué, it is not completely wrong semantically.
