Big data text analysis: the Lingjiu natural-language Chinese semantic word segmentation system

Source: Internet
Author: User

Natural language usually refers to a language that evolves naturally along with a culture. English, Chinese, and Japanese are natural languages, whereas Esperanto is an artificial language, that is, one created for a specific purpose.

Natural language has two attributes: a "language" attribute and a "natural" attribute. The "language" attribute is the inherent regularity that comes from accepted conventions. The "natural" attribute means that no man-made system of strict grammatical rules constrains how people express themselves, which is what distinguishes natural language from a programming language. Natural language does follow certain internal laws, but to a greater extent it obeys the principle that "what exists is reasonable".

A natural language processing system must take into account a great deal of knowledge about the language itself and its structure: what a word is, how words combine into sentences, what a word means, and how the meaning of each word contributes to the meaning of the sentence. Even this is far from enough. For example, a system that answers questions or takes part directly in a conversation needs not only extensive knowledge of language structure but also general knowledge of the human world and the ability to reason as humans do. For this reason, linguists usually divide the analysis and understanding of language into the following main levels: lexical analysis, syntactic analysis, semantic analysis, and discourse analysis.

Measured against natural language, logical (formal) language falls short in several ways: its primitive terms come in few types, its quantifiers are limited in kind, the scope of an existential quantifier cannot be extended dynamically across a sequence of formulas, and, lacking context, it is inefficient for communication. The NLPIR text search and mining system is intended to address these problems. NLPIR is a software suite for processing raw text collections. It provides a visual display of the results of its middleware processing, as well as tools for processing small-scale data, so users can run the software on their own data.

Word segmentation in the NLPIR text search and mining system mainly draws on the following algorithms:

1. Chinese word segmentation based on dictionaries and rules

During segmentation, the string to be segmented is matched against the entries in a dictionary, and whenever a match succeeds, the matched substring is cut off as a word. Methods of this kind include maximum matching segmentation and the full segmentation algorithm.

1) Maximum matching segmentation method

Maximum matching segmentation is further divided into forward maximum matching, backward (inverse) maximum matching, and bidirectional maximum matching. Forward maximum matching scans from left to right and takes the longest dictionary word at each step; backward maximum matching scans from right to left and likewise takes the longest word at each step. Bidirectional matching runs both the forward and the backward pass and then applies a rule to disambiguate the places where the two results differ.

Maximum matching is simple and fast, but it may fail to handle certain cases of partial ambiguity and overlapping (cross) ambiguity.
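
The three matching strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not NLPIR's implementation: the toy dictionary, the `max_len` limit, and the tie-breaking rule in `bidirectional_max_match` (prefer fewer words, then fewer single-character words, then the backward result) are all assumptions chosen for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word each time."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both passes and pick one result with a simple (assumed) rule."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if fwd == bwd:
        return fwd

    def cost(seg):
        # Assumed heuristic: fewer words, then fewer single-character words.
        return (len(seg), sum(1 for w in seg if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


toy_dict = {"研究", "研究生", "生命", "的", "起源"}  # hypothetical dictionary
sentence = "研究生命的起源"
print(forward_max_match(sentence, toy_dict))        # ['研究生', '命', '的', '起源']
print(backward_max_match(sentence, toy_dict))       # ['研究', '生命', '的', '起源']
print(bidirectional_max_match(sentence, toy_dict))  # ['研究', '生命', '的', '起源']
```

The forward pass greedily takes "研究生" and leaves "命" stranded, which is the kind of overlapping ambiguity that a single greedy pass cannot resolve on its own.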

2) Full segmentation algorithm

This algorithm uses dictionary matching to obtain every possible segmentation of a sentence. Because the number of segmentations grows exponentially with sentence length, the method is costly in both time and space, and for long, highly ambiguous sentences it takes a long time to traverse all segmentation paths.
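
To make the cost concrete, here is a small sketch that enumerates every segmentation of a short string. It assumes that any single character may stand alone as a word in addition to the dictionary entries; the function name, the `max_len` limit, and the toy dictionary are hypothetical.

```python
from functools import lru_cache


def full_segmentation(text, dictionary, max_len=4):
    """Return every split of text into dictionary words or single characters."""

    @lru_cache(maxsize=None)
    def split_from(i):
        # All segmentations of the suffix text[i:].
        if i == len(text):
            return [[]]
        results = []
        for size in range(1, min(max_len, len(text) - i) + 1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                for rest in split_from(i + size):
                    results.append([piece] + rest)
        return results

    return split_from(0)


toy_dict = {"研究", "研究生", "生命"}  # hypothetical dictionary
for seg in full_segmentation("研究生命", toy_dict):
    print(" / ".join(seg))
```

Even this four-character string yields five segmentations. Caching the suffix results avoids recomputing them, but the output itself still grows exponentially with sentence length, which is why practical systems score paths instead of listing them all.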

3) Chinese word segmentation based on understanding

Resolving ambiguity during word segmentation is itself a process of understanding, one that requires not only lexical information but also syntactic and semantic information. Some researchers therefore try to simulate human understanding by adding syntactic and semantic analysis to the segmentation process in order to handle ambiguity. Because knowledge of the Chinese language is so complex, it is difficult to organize the various kinds of linguistic information into a form that a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.

2. Word segmentation based on statistical learning from large-scale corpora

Methods of this kind segment Chinese strings mainly by using probability information obtained from a large-scale corpus. They usually require neither manually maintained rules nor complex linguistic knowledge, and they extend well to new data, which makes them the more common approach in current word segmentation algorithms.
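
One simple instance of the statistical approach is a unigram model: estimate word probabilities from corpus counts, then use dynamic programming to find the segmentation with the highest total probability. The sketch below is an assumption-laden illustration rather than the method NLPIR uses; the counts are invented, and the add-one smoothing and `max_len` limit are choices made only for this example.

```python
import math


def unigram_segment(text, word_counts, max_len=4):
    """Pick the segmentation that maximizes the product of unigram probabilities."""
    total = sum(word_counts.values())
    vocab = len(word_counts)

    def log_prob(word):
        # Add-one smoothing so unseen single characters keep a small probability.
        return math.log((word_counts.get(word, 0) + 1) / (total + vocab + 1))

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that path
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Multi-character candidates must have been seen in the corpus.
            if len(piece) > 1 and piece not in word_counts:
                continue
            score = best[start] + log_prob(piece)
            if score > best[end]:
                best[end], back[end] = score, start

    # Follow the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    words.reverse()
    return words


# Hypothetical counts standing in for statistics from a large corpus.
counts = {"研究": 50, "研究生": 5, "生命": 40, "命": 2, "的": 200, "起源": 20}
print(unigram_segment("研究生命的起源", counts))  # ['研究', '生命', '的', '起源']
```

No disambiguation rule is written by hand here: the path through "研究" and "生命" wins simply because those words are more frequent in the assumed counts than "研究生" followed by "命".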

3. Chinese word segmentation combining rules and statistics

Most current word segmentation algorithms combine rules with statistics. This reduces the dependence of statistical methods on the corpus, makes full use of existing lexical information, and compensates for the weaknesses of purely rule-based methods. A common approach is to perform an initial segmentation with the dictionary, and then apply probabilistic methods and simple rules to resolve ambiguity and to identify out-of-vocabulary (unregistered) words.
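
The two-stage idea can be illustrated with a deliberately simple sketch: a dictionary pass produces the initial segmentation, and a rule then groups leftover characters into candidate out-of-vocabulary words. The merging rule used here (join consecutive single characters that are not dictionary entries) is a hypothetical heuristic picked for brevity; a real system would typically score such candidates with a statistical model.

```python
def initial_segment(text, dictionary, max_len=4):
    """Stage 1: forward maximum matching against the dictionary."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def merge_unknown_runs(words, dictionary):
    """Stage 2 (assumed rule): join runs of unknown single characters."""
    merged, run = [], ""
    for word in words:
        if len(word) == 1 and word not in dictionary:
            run += word  # extend the current run of unknown characters
        else:
            if run:
                merged.append(run)
                run = ""
            merged.append(word)
    if run:
        merged.append(run)
    return merged


toy_dict = {"在", "北京", "工作"}  # hypothetical dictionary without the name
first_pass = initial_segment("王小明在北京工作", toy_dict)
print(first_pass)                                # ['王', '小', '明', '在', '北京', '工作']
print(merge_unknown_runs(first_pass, toy_dict))  # ['王小明', '在', '北京', '工作']
```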

