The basic unit of information is the sentence, which expresses a coherent and comprehensible meaning. The core of a sentence is usually a keyword or phrase; the other elements only modify or connect, refining and supplementing the basic information. Once these meaningful key strings have been identified, the basic information of the sentence is available. Discovering new feature words and meaningful strings is therefore important for Chinese natural language understanding. It improves the accuracy of word segmentation and has a wide range of other applications, such as discovering frequently used new words and phrases. The main application areas are as follows:
First, it supports the analysis of index terms in information retrieval. For example, "Computing Technology Institute" is a complete query term, whereas hardly anyone would query "computing" or "technology" on its own. When users want to search for Volkswagen, entering only "VW" may give less accurate results, because the short Chinese name for Volkswagen also means "the masses" and so pulls in a lot of unrelated information about "people", the "working class", and so on. The meaningful string "Shanghai Volkswagen" is semantically complete, eliminates this ambiguity, and generally describes the user's need more accurately. Meaningful strings can therefore be applied to query correction and related-search analysis in information retrieval.
Second, it has practical value for mining and tracking social hot topics. Meaningful strings are valuable clues to social phenomena: they often carry netizens' positions and views on current events. Mining new feature words and meaningful strings is therefore of great significance for hot-topic mining and public opinion monitoring.
Third, it can be used for information analysis and feature extraction. In addition to characters and words, commonly used text features include phrases, semantic concepts, meaningful strings, and so on. Extracting meaningful strings is of great significance for improving text classification and clustering performance.
Fourth, it is of great significance for expanding dictionaries and building corpora. New words are appearing ever faster and in ever more domains, and collecting them by traditional manual methods is time-consuming, laborious, and slow to keep up. If computing power and automatic detection methods can be used to quickly output new-word candidates for manual screening, the burden on people is greatly reduced. If new words can be extracted automatically as meaningful strings, this will promote the automatic construction of corpora. In addition, meaningful string mining can be used to mine key frequent patterns, which matters for higher-level tasks such as automatic content extraction, topic detection, and machine translation.
New feature words and meaningful strings are character sequences that are statistically significant. They fall into the following categories (the first two consist only of words; the last three include both words and phrases):
(1). Named entities, such as "DPP", "Brazil team" and so on;
(2). New words in the narrow sense, such as "blog", "House", "bump shirt" and so on;
(3). Domain terminology, that is, terms in common use within a particular field, such as "Computational Linguistics", "out-of-vocabulary words", "necrosis of the femoral head" and so on;
(4). Fixed collocations, mainly collocations that are frequently used in general corpora, such as "housing demand" and so on;
(5). Idioms, allegorical sayings, and other set phrases, such as "The Wise man, there will be a loss", "the benevolent See," and so on.
Many scholars use statistical methods to extract meaningful strings: statistics such as a string's frequency, mutual information (MI), and accessor variety (AV) are used to decide whether the string is meaningful. This approach works well for high-frequency meaningful strings, but it has difficulty extracting low-frequency ones effectively.
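To make these three statistics concrete, here is a minimal sketch in Python. It is an illustration only and not the method used by any particular system: the function name `candidate_stats`, the `max_len` parameter, and the boundary markers are assumptions introduced for this example. It uses pointwise mutual information taken as the minimum over all two-way splits of a string, and a simplified accessor variety that counts the distinct characters seen on each side and treats all text boundaries as a single marker.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(text, max_len=4):
    """Return {substring: (frequency, pmi, av)} for substrings of length 2..max_len.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    n = len(text)
    counts = Counter()
    left = defaultdict(set)   # distinct characters seen immediately before each substring
    right = defaultdict(set)  # distinct characters seen immediately after each substring

    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            s = text[i:i + size]
            counts[s] += 1
            # Simplification: every text boundary is the same marker.
            left[s].add(text[i - 1] if i > 0 else "<s>")
            right[s].add(text[i + size] if i + size < n else "</s>")

    def prob(s):
        # Relative frequency among all substrings of the same length.
        return counts[s] / (n - len(s) + 1)

    stats = {}
    for s, freq in counts.items():
        if len(s) < 2:
            continue
        # Pointwise mutual information at the weakest split point of the string.
        pmi = min(
            math.log2(prob(s) / (prob(s[:k]) * prob(s[k:])))
            for k in range(1, len(s))
        )
        # Accessor variety: the smaller of the left and right neighbour varieties.
        av = min(len(left[s]), len(right[s]))
        stats[s] = (freq, pmi, av)
    return stats


if __name__ == "__main__":
    stats = candidate_stats("xabyabzab", max_len=2)
    for s, (freq, pmi, av) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
        print(f"{s!r}: freq={freq} pmi={pmi:.3f} av={av}")
```

In the toy input, "ab" occurs three times with a different neighbour on each side every time, so it scores high on all three statistics, while a string such as "by" occurs once with a single neighbour on each side. A string that appears only once or twice gives the counts too little evidence to work with, which is the low-frequency weakness noted above. Because Python strings are Unicode, the same function can be applied to Chinese text without changes.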
The NLPIR text search and mining system was developed to meet the needs of Internet content processing. It integrates natural language understanding, web search, and text mining techniques, and provides a set of basic tools for secondary development.
NLPIR covers the full technology chain that applications need for processing big data text: web crawling, main-text extraction, Chinese and English word segmentation, POS tagging, entity extraction, word frequency statistics, keyword extraction, semantic information extraction, text classification, sentiment analysis, deep semantic expansion, Simplified/Traditional Chinese encoding conversion, automatic phonetic annotation, text clustering, and so on.
Ling JIU Software: Discovering New Feature Words in Big Data