Prospects for Language Information Processing Technology
Yu Shiwen
1. Significance of Developing Language Information Processing Technology
I think that Chinese information processing can be roughly divided into two levels. One is the text level, that is, Chinese character information processing; the other is the language level. This article discusses only the language level. Because the natural languages that humans use to exchange information, disseminate knowledge, and develop culture (such as Chinese and English) have deep similarities, Chinese information processing has much in common with information processing in other languages. Of course, Chinese information processing also has its own characteristics, and this article naturally gives more attention to them.
As society becomes increasingly informatized, people are more and more eager to communicate with computers in natural language. Natural language understanding is a fascinating and challenging topic in computer science. From the perspective of computer science, the task of natural language understanding is to establish a computing model that can understand natural language as humans do. Because of the inherent complexity of natural language, the mechanism of language is still not understood, and it is extremely difficult to give an essential definition of "understanding". Because language is the carrier of information, whether a computer understands natural language is generally judged from the practical viewpoint of information processing. If a computer can carry out (1) human-computer dialogue, (2) machine translation, (3) automatic summarization, or other language information processing functions, it is considered to have the capability of natural language understanding. Besides analyzing the articles or discourse input to the computer, these practical systems also need language generation functions. Therefore, in computer science, the terms "natural language processing" and "language information processing" are often used in addition to "natural language understanding". To implement the various functions of language information processing, researchers are developing technologies such as lexical analysis, syntactic analysis, semantic analysis, and context analysis for natural languages, and are accumulating language data resources such as electronic dictionaries and corpora. Some of these technologies and resources have already become products, and some will be integrated into new information processing systems. The development of Chinese information processing technology has great potential.
Because language is closely bound up with thinking and culture, language research has become a point of breakthrough in the development of modern Western philosophy and the humanities. Linguistics is a leading science within the humanities and a bridge between the humanities and the natural sciences. It holds the same status as philosophy and mathematics in the scientific system as a whole. With the introduction of mathematical methods and computer technology into contemporary linguistic studies, linguistics itself has made a leap, and many branches have emerged. Among them, computational linguistics is the most active. At present, linguistic research abroad revolves around one central topic: the linguistic issues related to the development of intelligent computers. Chinese linguistics lags significantly in this regard. If experts in the computer field and in linguistics join forces to carry out research on language information processing, they can not only narrow this gap but also drive the development of the humanities as a whole.
The nature of intelligence is one of the great problems of contemporary science. To realize natural language understanding, we must understand how people understand language and how children learn their mother tongue. Different linguistic theories interpret the phenomena of human language differently, and the disputes among them cannot be settled, because the brain, the material basis of intelligent activity (including language activity), has not yet been thoroughly understood. Establishing a cognitive model on the computer to simulate the process of language understanding (current natural language processing systems are prototypes of such a model) can provide a "window" for observing the activity of the "black box" of the brain. Computers have not only successfully simulated logical thinking; they are also being used to explore the simulation of imagistic thinking and inspiration. The study of natural language understanding can contribute to breakthroughs in the science of intelligence.
2. The Difficult Course of Language Information Processing Research
The application of digital computers to non-numerical data was first attempted in the field of language information processing. Machine translation experiments began soon after the advent of the electronic computer. However, compared with the pace of computer technology itself, or with the pace of computer applications in other fields, the development of language information processing has been quite slow and its road tortuous. In the late 1950s and early 1960s, the United States experienced the first wave of machine translation research. In 1966, the ALPAC report, issued by the Automatic Language Processing Advisory Committee of the U.S. National Academy of Sciences, poured cold water on machine translation, and language information processing went through a quiet period. Since the late 1970s, thanks to rapid advances in computer technology and developments in linguistic theory, some machine translation systems and natural language interfaces to databases have become practical. Driven further by social needs, language information processing research has entered a new boom, a notable sign of which is that a considerable number of language information processing products have entered the market. However, the road has not been smooth. The two major international machine translation research projects (Eurotra in the European Community and the ODA project of Japan and four neighboring countries), originally planned for completion in the early 1990s, failed to achieve their expected goals. The corpus-based statistical methods proposed by some scholars in the early 1990s have also encountered many obstacles. At home and abroad, a considerable number of experts are soberly reconsidering the current state, theoretical basis, and technical route of natural language processing. Some scholars believe that the field has not yet crossed the "semantic barrier"; at the same time, new breakthroughs are brewing.
In recent years, the Internet has been expanding rapidly and a flood of information is surging in. The main carrier of this information is still natural language. People are eager for natural language information processing technologies that can perform automatic text classification, literature retrieval, information extraction, language translation, automatic summarization, and automatic proofreading. Accelerating the exchange of information, knowledge, and culture, and thereby promoting social, economic, and scientific progress, is clearly a challenge facing every country. The development of language information processing technology has thus gained a powerful new impetus.
China was one of the first countries in the world to carry out machine translation research. However, large-scale, systematic research on natural language processing started only in the mid-1980s, which is relatively late. In view of China's national conditions, Chinese scholars have concentrated their main efforts on developing practical systems. The foundation of theoretical research is relatively weak, and there are few theoretical results. Although some systems have achieved considerable economic benefits, in general there is still a gap between our language information processing research and the current international level. This phenomenon may also exist in other fields of science and technology. What we need to focus on here are some questions specific to Chinese in the field of language information processing.
At the level of semantic analysis and context analysis, what I see is the commonality of the various natural languages. I cannot believe that, in semantic analysis, Chinese will reach the far shore of victory ahead of other languages. On the contrary, at the level of syntactic analysis I see more difficulties for Chinese.
Compared with English and Japanese, Chinese is a typical analytic language. Its outward characteristics are both a lack of morphological change and a lack of agglutinative elements serving as syntactic markers. I think that, among the existing systems of Chinese grammar, the phrase-based grammar proposed by Mr. Zhu Dexi best fits both the facts of Chinese and the needs of information processing. Phrase-based grammar reveals how the outward features of Chinese affect Chinese syntactic analysis, and this is the essential cause of the difficulty of automatic analysis of Chinese. The phrase-based grammar system can be summarized as follows: (1) The relationship between words and phrases is one of "composition", while the relationship between phrases and sentences is one of "realization"; the construction principles of Chinese phrases are basically the same as those of sentences. (2) The same word class in Chinese can serve as a variety of syntactic constituents without any morphological change. (3) The constituents of a phrase can themselves be phrases of various types. The subject-predicate construction has the same standing as other syntactic constructions, and a predicate can itself be a subject-predicate construction. (4) Although the internal word order of each type of phrase is fixed, the word order of Chinese sentences is quite flexible. (5) Although function words in Chinese have important syntactic functions, in many cases they can be omitted. (6) Written Chinese is set down sentence by sentence, with no spaces between words, so a good deal of linguistic information is lost. The translation quality of Chinese-English machine translation is far inferior to that of English-Chinese machine translation, which confirms this theoretical analysis in practice.
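Point (6) can be made concrete with a small sketch. Because written Chinese has no spaces between words, a program must first decide where the words are, and different strategies can give different answers. The code below uses greedy maximum matching in both directions, a simple dictionary-based technique chosen here only for illustration; it is not a method described in this article. The lexicon `LEXICON`, the function names, and the `max_len` parameter are all assumptions made for the example.

```python
# Illustrative sketch only: greedy dictionary-based segmentation.
# LEXICON is a toy word list invented for this example.
LEXICON = {"研究", "研究生", "生命", "起源", "的"}


def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word available."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word available."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


sentence = "研究生命的起源"
print(forward_max_match(sentence, LEXICON))
print(backward_max_match(sentence, LEXICON))
```

On this sentence the two scans disagree: the forward scan groups the first three characters as one word, while the backward scan splits them as 研究 and 生命. A human reader resolves such ambiguity from context; a program needs additional lexical, syntactic, or statistical knowledge to do so.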
Since syntactic analysis plays an important role in most language information processing systems, it is worthwhile to recognize the special difficulties of automatic analysis of Chinese. Only by recognizing the difficulties can we find ways to overcome them. Of course, our country's language information processing research also has its own advantages. Chinese scholars began this research under the guidance of advanced linguistic theories and in a relatively advanced computing environment, avoiding some of the detours that developed countries took in the early days. Chinese is one of the most important languages in the world. In China, the large numbers of specialists and the language data resources that language engineering requires are abundant and relatively inexpensive. Chinese scholars can give full play to their talents and take up the responsibility that social development has given them to act, to create, and to contribute.
3. Thoughts on the Development Strategy of Language Information Processing Technology
3.1 Basic Engineering--Build a large-scale integrated language knowledge base
It is not difficult for people to communicate with each other in natural language, because communication always takes place in a particular setting, the background knowledge of the two parties (including both language knowledge and knowledge of the real world) must overlap, and the purpose of the communication is generally settled in advance. Current computer systems do not have this knowledge. Knowledge of the real world is boundless, so it has to be handled domain by domain. Language knowledge, however, is shared across domains, and it is essential to build a large-scale integrated language knowledge base. This knowledge base includes lexical and syntactic knowledge as well as semantic and even pragmatic knowledge. Its basic language units include words, morphemes, and phrases. It contains both raw corpora and corpora processed at multiple levels. A dictionary database with rich knowledge content and a standardized storage format is an essential component. To support machine translation, the knowledge base must contain not only knowledge of Chinese but also knowledge about translation between Chinese and other languages. After more than a decade of hard work, our country has accumulated a great deal in this area, but the results are scattered and uneven in quality. Both integration and further development are now required. The "Modern Chinese Grammar Information Dictionary" developed by the Institute of Computational Linguistics at Peking University can serve both as building material for this integrated language knowledge base and as a reference for its construction.
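To illustrate what a dictionary database with a standardized storage format might look like, here is a minimal sketch. It is not the actual format of any existing dictionary: the `LexicalEntry` fields, the feature names such as `takes_object`, the part-of-speech codes, and the sample values are all hypothetical and chosen only to show the idea that one word form can have several entries, each carrying grammatical attributes that a parser could consult.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class LexicalEntry:
    """One (word, part-of-speech) entry; all field names are hypothetical."""
    word: str
    pos: str
    features: dict = field(default_factory=dict)


class LexiconDB:
    """A tiny in-memory dictionary database keyed by word form."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        # A word form may have several entries, one per part of speech.
        self._entries.setdefault(entry.word, []).append(entry)

    def lookup(self, word):
        # Return a copy so callers cannot modify the stored list.
        return list(self._entries.get(word, []))

    def to_json(self):
        # A uniform, machine-readable storage format for all entries.
        rows = [asdict(e) for group in self._entries.values() for e in group]
        return json.dumps(rows, ensure_ascii=False, sort_keys=True)


db = LexiconDB()
# Sample values below are illustrative, not taken from a real dictionary.
db.add(LexicalEntry("学习", "v", {"takes_object": True}))
db.add(LexicalEntry("学习", "n", {"countable": False}))
db.add(LexicalEntry("的", "u", {"omissible": True}))

for entry in db.lookup("学习"):
    print(entry.pos, entry.features)
```

Keeping every entry in the same record shape is what makes it feasible to merge dictionaries built by different groups, which is the integration problem raised above.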
3.2 Theoretical Exploration--Suitable theoretical system and computing model for Chinese
In this respect, we should learn from advanced foreign theories and methods and keep in step with international research. For example, the theoretical models of computational linguistics based on complex feature sets and unification algorithms proposed by foreign scholars are well worth drawing on. Some foreign scholars advocate semantic analysis and corpus-based statistical methods; but if we therefore neglect or belittle the study of Chinese grammar rules suited to computer processing, we fail to take the reality of Chinese into account. The author believes that, given the actual situation in China, by combining machine processing with expert verification, and rules with statistical methods, it is possible to complete multi-level processing of a selected, sufficiently large corpus in a fairly short time. On this point alone, if the work is properly organized, we may get ahead of others. Once we have a corpus whose depth and accuracy of processing meet the requirements, it becomes possible to construct a probabilistic grammar and put an end to the situation in which language rules fail to match language facts.
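The step from a processed corpus to a probabilistic grammar can be sketched as follows. The example assumes that the corpus has already been annotated with phrase-structure trees, and it estimates each rule's probability by relative frequency, that is, count(A -> β) / count(A). This is one common estimation scheme, used here as an assumption rather than as the method the author has in mind. The nested-tuple tree encoding, the category labels, and the two-sentence `TREEBANK` are invented for the illustration.

```python
from collections import Counter

# Toy treebank: each tree is (label, child, child, ...); leaves are plain strings.
# The sentences and labels are invented for this example.
TREEBANK = [
    ("S", ("NP", "我"), ("VP", ("V", "看"), ("NP", "书"))),
    ("S", ("NP", "他"), ("VP", ("V", "来"))),
]


def extract_rules(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, *children = tree
    # The right-hand side lists child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from extract_rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in extract_rules(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


probabilities = estimate_pcfg(TREEBANK)
for (lhs, rhs), p in sorted(probabilities.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

Because the probabilities come directly from counts in the corpus, a rule that linguists write down but that rarely occurs in real text receives a low weight, and a construction that occurs often but was overlooked shows up as a frequent rule. This is the sense in which a corpus of sufficient depth and accuracy can bring rules and facts into agreement.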
Considering the special difficulties of automatic analysis of Chinese, I think that research on restricted Chinese has great practical significance and value. Restricted Chinese is not a mere expedient. It can be a milestone in the history of language information processing. Restricted Chinese may become a common language for Chinese people and their descendants around the world, and it will promote exchanges and cooperation between the Chinese people and other peoples. The study of restricted Chinese will promote the standardization and modernization of Chinese and raise its international standing. Restricted Chinese may be a high-speed train carrying Chinese culture along the information highway.
3.3 Product Development--Mutual support with theoretical research and basic engineering
Although the theory and technology of language information processing are not yet mature, the existing technology and language data resources, if used properly, can already yield products suited to the market or raise the intelligence level of information technology products. Because investment in theoretical research and basic engineering always seems insufficient, some energy has to be devoted to product development, so that its returns can support theoretical research and basic engineering and a virtuous circle forms among theory, foundations, and applications. As a general matter, such a technical route is undoubtedly desirable. For a small organization, however, it is often hard to follow in the field of language information processing.
3.4 Talent Training--Vigorously cultivate talents in computational linguistics
To promote the development of natural language processing technology and to enhance China's competitiveness in this high-tech field, we must train people, and young people in particular, in computational linguistics, the interdisciplinary field that supports natural language processing technology. Many foreign universities have linguistics departments, and over the past ten years they have also established departments or majors in computational linguistics. Most major American universities grant doctoral degrees in computational linguistics. China has no such department or major, nor does it award master's or doctoral degrees in computational linguistics. At present, doctoral and master's students working on computational linguistics can only be trained within other disciplines (computer science or linguistics). The author hopes that master's and doctoral programs in computational linguistics can be established on a trial basis at some qualified universities, to accelerate the training of senior professionals in the field of language information processing.
(This article was published in Computer World, 1997, 1st, 127)