Semantic analysis refers to translating a given natural language unit (a discourse, paragraph, or sentence) into a formal representation that reflects its meaning, that is, converting natural language that humans understand into a formal language that computers can process, so that the two can communicate. It is oriented to the whole sentence, covering not only the semantic relationship between the main predicate and its arguments, but also the semantic information carried by non-predicate elements, such as quantity, attribute, and frequency.
Semantic analysis is the bottleneck that keeps natural language processing technology from reaching deep applications. At the concept and relation level there are two main methods: the statistics-based feature-vector extraction method, and the semantic-similarity calculation method based on a semantic dictionary (WordNet, HowNet, etc.). Both have significant shortcomings in concrete applications: the former, owing to the nature of its statistical model, is suitable only for coarse-grained semantic analysis of paragraphs, documents, or multi-document collections, not for sentence- and word-level applications; the latter is essentially limited to handling relations between entity concepts.
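As a minimal sketch of the dictionary-based approach, the following computes WordNet semantic similarity with the open-source NLTK library (NLTK and the example words are illustrative choices here; the article does not prescribe a specific toolkit):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Path similarity over WordNet's hypernym hierarchy: values in (0, 1],
# where higher means the two concepts sit closer in the taxonomy.
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

print(dog.path_similarity(cat))  # related animals -> relatively high
print(dog.path_similarity(car))  # unrelated concepts -> low
```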
The NLPIR text search and mining system targets the needs of Internet content processing; it fuses natural language understanding, web search, and text mining technology, and provides a basic toolset for secondary development. It offers a visual display of the middleware's processing results and can also be used as a processing tool for small-scale data.
First, Chinese word segmentation
1. The word segmentation method based on string matching. According to different scanning strategies, this method looks the string up in a dictionary to segment it (see the sketch after this list).
2. The full segmentation method. It first cuts out all candidate words that match the dictionary, and then uses a statistical language model to decide the optimal segmentation result.
3. The character-based tagging method. Segmentation can be understood as a character classification problem, that is, a sequence labeling problem in natural language processing: each character is tagged, for example, as the beginning, middle, or end of a word, or as a single-character word.
4. Segmentation based on dictionaries and rules
During segmentation, the string to be segmented is matched against entries in the dictionary; when a match succeeds, the matched portion is cut out as a word.
5. Segmentation by statistical learning over large-scale corpora
This kind of method mainly uses various probability statistics obtained from a large-scale corpus to segment Chinese strings. It usually requires neither manually maintained rules nor complex linguistic knowledge, and it scales well, which makes it the most common approach among current segmentation algorithms.
6. Segmentation combining rules and statistics
Most current segmentation algorithms combine rules with statistics, which reduces the dependence of statistical models on the corpus, makes full use of existing lexical information, and compensates for the weaknesses of purely rule-based methods.
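As referenced in item 1, here is a minimal sketch of forward maximum matching, the classic string-matching segmentation scheme; the toy dictionary and maximum word length are made-up illustrations, not NLPIR's lexicon:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, greedily cutting the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking the window on failure.
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                # Fall back to a single character when nothing matches.
                words.append(candidate)
                i += j
                break
    return words

dictionary = {"自然", "语言", "自然语言", "处理", "自然语言处理"}
print(forward_max_match("我爱自然语言处理", dictionary, max_len=6))
# ['我', '爱', '自然语言处理']
```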
Second, word tagging
Besides segmentation, a text string also needs part-of-speech tagging, named entity recognition, new word discovery, and so on. There are usually two schemes: one segments first and then tags parts of speech; the other uses a single model to complete all these tasks jointly.
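A minimal sketch of a combined segment-and-tag pipeline, using the open-source jieba library as a stand-in (an illustrative assumption; the article's NLPIR toolkit provides comparable functionality through its own API):

```python
import jieba.posseg as pseg  # pip install jieba

# Segment a Chinese string and tag the part of speech of each word.
for w in pseg.cut("我爱自然语言处理"):
    print(w.word, w.flag)  # e.g. 我/r 爱/v ... (tags vary by version and dictionary)
```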
Third, the language model
A language model is a probabilistic model used to compute the probability that a sentence is generated, i.e. P(w_1, w_2, w_3, ..., w_m), where m is the total number of words.
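By the chain rule, this joint probability factorizes word by word, and an n-gram model approximates each conditional probability with a fixed-length history (the standard formulation, not anything specific to NLPIR):

```latex
P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```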
The n-gram language model is simple and effective, but it considers only the positions of words, ignoring their similarity in form and meaning, and it suffers from data sparsity; because of this, more sophisticated language models were gradually proposed.
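A minimal bigram sketch in Python makes the sparsity problem concrete: any bigram unseen in training would otherwise get probability zero, so a smoothing scheme (add-one here) is needed; the two-sentence corpus is a made-up illustration:

```python
from collections import Counter

# Toy corpus: each sentence is a token list with boundary markers.
corpus = [
    ["<s>", "我", "爱", "北京", "</s>"],
    ["<s>", "我", "爱", "自然", "语言", "处理", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing keeps unseen bigrams from scoring zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sentence_prob(words):
    # P(w_1 .. w_m) under the bigram approximation of the chain rule.
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["<s>", "我", "爱", "处理", "</s>"]))  # unseen bigram, still > 0
```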
The neural network language model builds on the n-gram setup: first, each context word w_{m-n+1}, w_{m-n+2}, ..., w_{m-1} is mapped into a word-vector space; the word vectors are then concatenated into one larger vector used as the neural network's input, and the output is P(w_m | context).
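A minimal sketch of such a feed-forward model in PyTorch (the layer sizes and the choice of PyTorch are illustrative assumptions; the article names no framework):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramNNLM(nn.Module):
    """Feed-forward n-gram language model in the style described above."""
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # word -> vector
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # scores over vocab

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of w_{m-n+1} .. w_{m-1}
        vecs = self.embed(context_ids)             # (batch, context_size, embed_dim)
        x = vecs.view(vecs.size(0), -1)            # concatenate the word vectors
        h = torch.tanh(self.hidden(x))
        return F.log_softmax(self.out(h), dim=-1)  # log P(w_m | context)

model = NGramNNLM(vocab_size=5000)
context = torch.randint(0, 5000, (2, 3))  # two dummy 3-word contexts
log_probs = model(context)                # shape (2, 5000)
```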