I have recently started studying natural language processing. My main references are "Statistical Natural Language Processing" by Zong Chengqing and "Natural Language Processing with Python", and I recommend both. This first article introduces the basic knowledge and concepts of NLP. It is really my own set of NLP reading notes, and I hope it helps you.
I. Concept Introduction
Natural Language Processing
Natural language processing (NLP) dates back to the 1950s. It is a broad interdisciplinary field that draws on linguistics, mathematics (algebra, probability), computer science, and cognitive science. Getting a computer to understand and process human language correctly and effectively, that is, to "understand what people say", remains an enormous theoretical and technical challenge. Applications in recent years include character recognition, speech synthesis, network content monitoring, filtering of and early warning about harmful content, image recognition, affective computing, language understanding, and question answering systems.
Chinese
Chinese information processing is an important branch of NLP. Today's internationally influential technology evaluations, including those for machine translation, information extraction, and syntactic parsing, are all closely related to Chinese. Chinese processing shares the problems common to all of NLP, such as new-word recognition and ambiguity resolution, and adds problems of its own, such as automatic word segmentation and the definition of parts of speech.
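To see why automatic word segmentation is hard, here is a minimal sketch that lists every way to split an unsegmented string into dictionary words. The function name `all_segmentations`, the toy dictionary, and the example string are my own illustrative assumptions, not taken from the book. Real segmenters use statistical models rather than brute-force enumeration.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # the empty string has exactly one segmentation: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word, then segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical toy dictionary: research, graduate student, life, fate, origin.
toy_dictionary = {"研究", "研究生", "生命", "命", "起源"}

for segmentation in all_segmentations("研究生命起源", toy_dictionary):
    print("/".join(segmentation))
```

Even this six-character string has more than one valid split ("research / life / origin" versus "graduate student / fate / origin"), so a segmenter needs some way to prefer one reading over the other.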
Book Introduction
"Statistical Natural Language Processing" describes in detail recent research by scholars in China on building Chinese corpora and lexical knowledge bases, automatic word segmentation (including segmentation methods and named entity recognition), POS tagging, syntactic parsing, and spoken language processing. It also covers some of the best papers from the international computational linguistics conference ACL (recently held in Beijing).
Chapters 1 to 9 introduce the theory of statistical natural language processing: preliminary knowledge, formal languages and automata, corpora and lexical knowledge bases, language models, hidden Markov models, Chinese automatic word segmentation and POS tagging, syntactic parsing, and semantic disambiguation. Chapters 10 to 15 mainly introduce applications: machine translation, speech translation, text classification, information retrieval and question answering, automatic summarization and information extraction, and spoken language processing and human-computer dialogue.
About "Understanding"
Any discussion of a standard for "understanding" brings to mind the Turing test, proposed by the British mathematician Turing in 1950: if a computer system acts, reacts, and interacts in the same way as a conscious individual, then that system should be regarded as conscious.
In natural language processing, Turing-style tests are often used as concrete criteria for judging whether a computer system "understands" a natural language. For example, a question answering system tests whether the system can correctly answer questions about an input text; a summarization system tests whether it can generate text summaries automatically; and a machine translation (MT) system tests whether it can translate one language into another.
II. Research Content and Basic Methods of Natural Language Processing
Research Content
Natural language processing covers a very wide range of research. The main directions are:
Machine translation (MT): automatically translates one language into another.
Automatic summarization (automatic summarizing or automatic abstracting): automatically condenses and extracts the main content and meaning of the original document to form a summary or abstract.
Information retrieval (IR): uses a computer system to find, in a large collection of documents, the ones relevant to a user's needs. IR across multiple languages is called cross-language information retrieval.
Document classification (document categorization): also called text classification or information classification. It uses a computer system to classify a large number of documents automatically according to given criteria (for example, by topic or content).
Question answering (question-answering system): a computer system understands the question a person asks, uses automatic inference and other means to find the answer in relevant knowledge resources, and responds accordingly. Question answering is sometimes combined with speech technology, multimodal input and output, and human-computer interaction technology to form a human-computer dialogue system.
Text editing and automatic proofreading: automatically checks, corrects, and formats text, covering spelling, word usage, and even grammar and document format.
Information filtering: uses a computer system to automatically identify and filter document information that meets specific criteria. It is mainly used for information security and protection.
Language teaching: uses computer-aided teaching tools for language teaching, practice, and tutoring.
Character recognition (optical character recognition, OCR): a computer system automatically recognizes printed or handwritten text and converts it into electronic text that computers can process. The core of character recognition is recognizing images of characters (including Chinese characters), but a high-performance system also needs language understanding techniques.
Speech recognition: converts a speech signal fed into the computer into a written-language representation. It is also called automatic speech recognition (ASR).
Text-to-speech conversion: automatically converts written text into the corresponding speech. It is also called speech synthesis.
Speaker recognition/identification/verification: analyzes the acoustics of a speaker's speech sample in order to determine (identify or verify) who the speaker is.
In fact, almost any research on human language that we can think of involves computational linguistics. The remaining areas are not listed here.
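To make one of the directions listed above concrete, here is a minimal information retrieval sketch that ranks documents by how often they contain the query terms. The names `tokenize` and `rank_documents`, the scoring rule, and the three sample documents are all my own assumptions for illustration. Real systems use far more refined weighting and indexing.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def rank_documents(query, documents):
    """Return (score, index) pairs for matching documents, best match first."""
    query_terms = set(tokenize(query))
    scored = []
    for index, document in enumerate(documents):
        counts = Counter(tokenize(document))
        # Score = total number of occurrences of the query terms.
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties keep the original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical mini collection.
documents = [
    "machine translation converts text from one language to another",
    "speech recognition converts a speech signal into text",
    "information retrieval finds relevant documents for a query",
]

print(rank_documents("speech text", documents))
```

The second document ranks first because it contains both query terms, and the third is dropped because it contains neither.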
Difficulties Faced
Natural language processing involves morphology, syntax, semantics, and pragmatics, and its end applications span widely used areas such as machine translation, information retrieval, and question answering. The key problems it faces are ambiguity resolution (disambiguation) and unknown language phenomena.
Ambiguity is pervasive in natural language. Whether at the lexical or the syntactic level, and whatever the linguistic unit, ambiguity is a constant source of trouble.
E.g. 1: Put the block in the box on the table.
"On the table" can describe either the box or the place where the block ends up, so two different syntactic structures are possible:
A. Put the block [in the box on the table].
B. Put [the block in the box] on the table.
Adding one more prepositional phrase, "in the kitchen", yields 5 possible parses. In fact, the number of parses for this kind of ambiguity grows exponentially with the number of prepositional phrases.
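The counts above (2 parses for the original sentence, 5 after adding a third prepositional phrase) match the Catalan numbers, the growth pattern commonly reported for this kind of attachment ambiguity. The sketch below assumes that pattern continues for longer sentences. The function name `catalan` is my own.

```python
from math import comb


def catalan(n):
    """Return the n-th Catalan number, C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


# Assumption: a sentence of this shape with k prepositional phrases
# has catalan(k) parses, matching the 2 and 5 given in the text.
for k in range(2, 7):
    print(k, "prepositional phrases ->", catalan(k), "parses")
```

Under this assumption, six prepositional phrases already give 132 parses, which is why a parser cannot simply enumerate every structure and needs a way to rank them.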
E.g. 2: About Lu Xun's works.
This can be read as "about [Lu Xun's works]" or as "[about Lu Xun] works", that is, works about Lu Xun. Chinese is full of such ambiguity: when we say "eat the canteen at noon", we do not mean eating the canteen itself, and when we praise someone by saying "this person is a real bull", we do not mean he is literally a cow.
E.g. 3: Ambiguity also has to be resolved in knowledge graphs.
The other problem is unexpected input, such as unknown words and unknown structures. Every language changes as society develops, and new words, senses, and sentence structures keep emerging. Strange internet slang and constructions are especially common in spoken dialogue and online chat (MSN, QQ).
A natural language processing system must therefore cope well with unknown language phenomena and tolerate the full range of possible inputs (system robustness). There are of course many other problems, such as handling differences between languages, extracting text features, scarce resources, low coverage, and the difficulty of knowledge representation.
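One simple way to keep a system from failing on unknown words is to map anything outside the known vocabulary to a placeholder token. This is only a sketch of that one idea: the `[UNK]` token, the function names `build_vocabulary` and `normalize`, and the sample sentences are my own assumptions, not a method taken from the book.

```python
UNKNOWN = "[UNK]"  # hypothetical placeholder for out-of-vocabulary words


def build_vocabulary(sentences, min_count=1):
    """Collect every word that occurs at least `min_count` times."""
    counts = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return {word for word, count in counts.items() if count >= min_count}


def normalize(sentence, vocabulary):
    """Replace words outside the vocabulary with UNKNOWN instead of failing."""
    return [word if word in vocabulary else UNKNOWN
            for word in sentence.lower().split()]


training = ["put the block in the box", "put the box on the table"]
vocabulary = build_vocabulary(training)
print(normalize("put the selfie on the table", vocabulary))
```

The new word "selfie" becomes `[UNK]` while the rest of the sentence is kept, so later processing steps still receive usable input.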
Basic Methods
Time flies. Nearly 30 years have rushed by. Back then I was a young man in my prime; now I am a white-haired old man. I have fought for this cause for most of my life, and its joys and hardships are hard to put into words. For more than 30 years, in good times and bad, I have always been bound to IMAG and GETA by countless threads of deep feeling, which of course comes mainly from our shared commitment to natural language processing. --Feng Zhiwei
Once again, I recommend reading this classic NLP book. I hope this article helps you, at least as a simple introduction. Later I may write a few more reading notes on the parts of the book that interest me. The passage above moved me deeply. I hope I too can hold on to my ideals and keep up blogging and teaching for ten years as if it were a single day! ^_^
(By: eastmount, 2016-08-04, 8 p.m., http://blog.csdn.net/eastmount/)
"Statistical Natural Language Processing" Reading Notes I: Introduction to Basic Knowledge and Concepts