This article is distributed under a CC license; for more information, see matrix67.com.
This article continues the discussion of Chinese word segmentation algorithms, picking up where the previous article left off: once a computer can automatically segment a sentence into words, can it go further, analyze the sentence's structure, and even understand its meaning? The two articles are closely related, so I have renamed the previous one "Automatic Chinese Word Segmentation and Semantic Recognition (Part 1)"; this article is naturally Part 2. I have given talks on this topic in many different places, and here I want to write it all down and share it with everyone.
What is syntactic structure? Let us look at some examples. The sentence 白天鹅在水中游 is ambiguous: it may mean "in the daytime, a goose swims in the water" (白天 / 鹅 / 在水中游), or "a white swan swims in the water" (白天鹅 / 在水中游). Different segmentations give different meanings. Is there a sentence whose segmentation is unique, yet which still has more than one meaning? Yes. Take 门没有锁: it may mean "the door has not been locked" or "there is no lock on the door". The sentence can only be segmented as 门 / 没有 / 锁, but the word 锁 ("lock") may be either a verb or a noun, so the sentence as a whole still has two readings. Is there a sentence whose segmentation is unique and whose words each keep a single meaning, yet which is still ambiguous? There is. Look at 咬死了猎人的狗: it may mean "(it) bit the hunter's dog to death", or "the dog that bit the hunter to death". Where does this ambiguity come from? Comparing the two readings, you will find that the same low-level components of a sentence can be combined in different orders, and different orders of combination produce different meanings.
In the previous article, we saw that a probabilistic transition model can segment a sentence into words quite effectively. In fact, the same model can also tag each word with its part of speech; better still, we can treat different part-of-speech uses of the same word as different words, unifying segmentation and part-of-speech tagging into one step. All of this, however, analyzes the sentence linearly from left to right, while real sentence structure is far more complex: words combine not just sequentially but hierarchically. To truly parse a sentence, after segmentation and tagging the computer must analyze the hierarchy of the syntactic structure.
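To make the linear model concrete before moving on, here is a minimal sketch of the kind of decoding involved: Viterbi search over a hidden Markov model that jointly scores tag transitions and word emissions. The tag set and all probabilities below are made-up toy numbers, not values from the previous article.

```python
# Viterbi decoding for part-of-speech tagging over a toy HMM.
# All probabilities are invented purely for illustration.

def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # best[t][s] = (probability, backpointer) of the best path ending in s
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-9), None)
             for s in states}]
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][r][0] * trans_p[r][s]
                 * emit_p[s].get(words[t], 1e-9), r)
                for r in states)
            best[t][s] = (prob, prev)
    tag = max(best[-1], key=lambda s: best[-1][s][0])
    path = [tag]                      # trace back from the best final state
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]

states = ["noun", "verb"]
start_p = {"noun": 0.6, "verb": 0.4}
trans_p = {"noun": {"noun": 0.3, "verb": 0.7},
           "verb": {"noun": 0.8, "verb": 0.2}}
emit_p = {"noun": {"door": 0.5, "lock": 0.5},
          "verb": {"door": 0.1, "lock": 0.9}}
print(viterbi(["door", "lock"], states, start_p, trans_p, emit_p))
# -> ['noun', 'verb']: after a noun, "lock" is best read as a verb
```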
How can the syntactic structure of a sentence be described to a computer? In 1957, Noam Chomsky published Syntactic Structures, which describes the hierarchical structure of language in a clean, formal way; this is the model known as "generative grammar". The book is one of the truly remarkable works of the 20th century: concise in wording and crystal clear in thought, it shook many fields, linguistics and the theory of computation among them. I remember someone asking on Quora who the greatest living minds are; one of the answers was Noam Chomsky.
Take a longer, more complex sentence, say "The car was repaired by the master who drives." We can always analyze its structure top-down. At the top level, this sentence says "the car was repaired". How was it repaired? The car was repaired by the master. Which master? Ah, the car was repaired by the master who drives. We could of course keep expanding indefinitely, replacing each bottom-level component with a more detailed, more complex description, just like the sentence-expansion exercises in primary-school language class. This is the core idea of generative grammar.
Those familiar with compiler theory will know of "context-free grammars". The expansion rules described above are, in essence, exactly a context-free grammar. For example, a sentence can take the form "something does something", so we record the rule:
sentence → noun phrase + verb phrase
A noun phrase is any component that functions as a noun. It may be a bare noun, or it may have internal structure: it may consist of an adjective phrase, the particle 的, and another noun phrase, as in "cheap car" (便宜的车); it may consist of a verb phrase plus 的 plus a noun phrase, as in "broken car" (坏了的车); it may even consist of a noun phrase plus 的 plus another noun phrase, as in "the teacher's car" (老师的车). Let us write down the generation rules for noun phrases as well:
noun phrase → noun
noun phrase → adjective phrase + 的 + noun phrase
noun phrase → verb phrase + 的 + noun phrase
noun phrase → noun phrase + 的 + noun phrase
……
Likewise, verb phrases have many specific forms:
verb phrase → verb
verb phrase → verb phrase + 了
verb phrase → prepositional phrase + verb phrase
……
The prepositional phrases mentioned above have their own generation rules:
prepositional phrase → preposition + noun phrase
……
Our sentence-construction task, then, is to start from the initial "sentence" node, repeatedly apply the rules to build an ever more complex sentence skeleton, and finally pick words of the matching parts of speech from the dictionary to fill in the frame.
The task of syntactic analysis is the reverse: given the parts of speech of a sentence's words from left to right, find a "syntax tree" that satisfies the rules. This can be done with the Earley parser.
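As an illustration, here is a minimal Earley recognizer in Python for a version of the toy grammar above. The encoding is my own sketch: the tags `de` and `le` stand in for the particles 的 and 了, and the grammar is just the handful of rules listed earlier, not a serious grammar of Chinese.

```python
# A minimal Earley recognizer over part-of-speech sequences.
# Grammar encoding and tag names are illustrative assumptions.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["noun"],
           ["AP", "de", "NP"],
           ["VP", "de", "NP"],
           ["NP", "de", "NP"]],
    "VP": [["verb"], ["VP", "le"], ["PP", "VP"]],
    "PP": [["prep", "NP"]],
    "AP": [["adj"]],
}
TERMINALS = {"noun", "verb", "adj", "prep", "de", "le"}

def earley_recognize(tags, start="S"):
    """Return True if the tag sequence `tags` can derive `start`."""
    n = len(tags)
    chart = [set() for _ in range(n + 1)]      # chart[i]: states ending at i
    for rhs in GRAMMAR[start]:                 # seed with the start rules
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:                         # predict/complete to fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] not in TERMINALS:
                    # Predictor: expand the nonterminal right after the dot.
                    for prod in GRAMMAR[rhs[dot]]:
                        state = (rhs[dot], tuple(prod), 0, i)
                        if state not in chart[i]:
                            chart[i].add(state); changed = True
                elif dot == len(rhs):
                    # Completer: advance states waiting for this `lhs`.
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            state = (l2, r2, d2 + 1, o2)
                            if state not in chart[i]:
                                chart[i].add(state); changed = True
        if i < n:
            # Scanner: consume the next part-of-speech tag.
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == tags[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

print(earley_recognize(["noun", "verb", "le"]))  # True: "the car broke (了)"
print(earley_recognize(["de", "noun"]))          # False: no rule starts with 的
```

A real parser additionally keeps back-pointers in each state so the syntax tree itself can be reconstructed; the recognizer above only decides whether a tag sequence is derivable.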
With this, the problem of syntactic structure seems solved. In fact, we are still far from done. Generative grammar has two big problems. First, a sentence with a correct syntactic structure is not necessarily a good sentence. Chomsky himself gave the classic example: colorless green ideas sleep furiously. Adjective + adjective + noun + verb + adverb is a sequence that fully satisfies the grammar, yet words pieced together at random produce jokes: what on earth is a "colorless green idea sleeping furiously"? (By the way, a little advertisement: if you happen to like the imagery of such sentences, you are welcome to try the idea generator I once made.) If we leave sentence generation aside and focus only on analyzing sentence structure, however, this defect does not seem to hurt us much. The second problem of generative grammar is far more troublesome: from the same part-of-speech sequence, different syntax trees can be constructed. Compare the following two examples:
The teacher was amused by the students who arrived late.
The room where the phone was tapped has been found.
They are all "Nouns + prepositions + Verbs + nouns + Verbs +", but their structures are not the same. The former is a pleasure for teachers, "late" indicates "Students", the latter is found in the room, and "phone eavesdroppers" are used together to modify the room. However, simply using the previous model, we cannot tell which sentence should be the syntax tree. How to Strengthen the model and algorithm of Syntactic Analysis and let the computer build a correct syntax tree has become a big problem.
Let us look at a simpler example, again of the form "verb + adjective + noun", for which there are two ways to build a syntax tree: attach the adjective to the verb as a complement, as in 点亮蜡烛 ("light the candle [until it is] bright"), or attach it to the noun as an attributive, as in 踢新皮球 ("kick the new ball").
A friend who has never been trained in Chinese grammar may ask: are the syntactic structures of 点亮蜡烛 and 踢新皮球 really different? We can prove that they are. Make the sentence 踢破皮球 ("kick broken ball") and you will find that for this sentence both structures are valid, so it is ambiguous: kick the ball until it breaks (the same structure as 点亮蜡烛), or kick a ball that is already broken (the same structure as 踢新皮球).
But why does 点亮蜡烛 have only one reading? Because we do not normally place 亮 ("bright") directly before a noun as an attributive: we do not say 亮蜡烛 ("bright candle") or 亮星星 ("bright star") in that pattern. And why does 踢新皮球 have only one reading? Because we do not normally place 新 ("new") directly after a verb as a complement: we do not say "the ball was kicked new" or "the clothes were washed new". The word 破 ("broken"), however, can serve both as an attributive and as a complement, so 踢破皮球 has two different readings. Then, if we record for every adjective whether it can serve as an attributive and whether it can serve as a complement, and add these restrictions to the generation rules, will the problem be solved perfectly?
Rule-based syntactic parsers do exactly this. Chinese linguists have catalogued the properties of all words:

bright (亮): part of speech = adjective, can be complement = true, can be attributive = false
new (新): part of speech = adjective, can be complement = false, can be attributive = true
……
Of course, every verb also has many properties:

light (点): part of speech = verb, can take object = true, can take complement = true
kick (踢): part of speech = verb, can take object = true, can take complement = true
pollute: part of speech = verb, can take object = true, can take complement = false
queue: part of speech = verb, can take object = false, can take complement = false
……
Nouns are no exception:

candle (蜡烛): part of speech = noun, can be subject = true, can be object = true, can take a quantifier = true
ball (皮球): part of speech = noun, can be subject = true, can be object = true, can take a quantifier = true
……
Some readers may find this strange: "can be subject" counts as a property? Are there really nouns that cannot be the subject? Not only do they exist, there are plenty: nouns like 剧毒 ("deadly poison"), 看头 ("watch-worthiness"), and 生路 ("way of survival") are simply not placed before a verb as its subject. Are there nouns that cannot be the object? Plenty as well, such as 芳龄 ("a young lady's age"), which is not placed after a verb. Given that, it is no surprise that some nouns cannot be modified by quantifiers; in fact, the odd nouns just listed all resist quantifiers.
Another important point is that these properties can be "passed upward". For example, we stipulate that when the rule
noun phrase → adjective phrase + noun phrase
is applied, the resulting noun phrase inherits from its second component whether it can be the subject, whether it can be the object, and whether it can take a quantifier; roughly speaking, if "ball" can be the subject, then "new ball" can also be the subject. With such a "word knowledge base", this knowledge is preserved at higher and higher levels, and we can attach constraints to the generation rules. For example, we stipulate that applying the rule
verb phrase → verb phrase + noun phrase
requires that the "can take object" property of the verb phrase be true and the "can be object" property of the noun phrase be true. Similarly, we stipulate that applying the rule
verb phrase → verb phrase + adjective phrase
The "band complement" attribute of a verb phrase must be true, and the "can complement" attribute of an adjective phrase must be true. This prevents the combination of the "kick" and "new" in the "Kick new ball", because the "new" cannot be used as a complement.
Finally, we stipulate that applying the rule
noun phrase → adjective phrase + noun phrase
requires that the "can be attributive" property of the adjective phrase be true. This prevents 亮 ("bright") and 蜡烛 ("candle") from combining in 点亮蜡烛, because 亮 is generally not used as an attributive. In this way we have solved the structural analysis of "verb + adjective + noun".
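Here is a small sketch of how such feature constraints can rule out the wrong bracketing of "verb + adjective + noun". The feature names and lexicon entries are my own encoding of the scheme just described, with English stand-ins for the Chinese words.

```python
# Choosing bracketings of "verb + adjective + noun" via word features.
# Lexicon entries and feature names are illustrative assumptions.

LEXICON = {
    "bright": {"pos": "adj", "complement": True,  "attributive": False},
    "new":    {"pos": "adj", "complement": False, "attributive": True},
    "broken": {"pos": "adj", "complement": True,  "attributive": True},
    "light":  {"pos": "verb", "takes_object": True, "takes_complement": True},
    "kick":   {"pos": "verb", "takes_object": True, "takes_complement": True},
    "candle": {"pos": "noun", "object": True},
    "ball":   {"pos": "noun", "object": True},
}

def bracketings(verb, adj, noun):
    """Return every bracketing licensed by the feature constraints."""
    v, a, n = LEXICON[verb], LEXICON[adj], LEXICON[noun]
    trees = []
    # Reading 1: [[verb adj] noun] -- adjective as complement of the verb.
    if v["takes_complement"] and a["complement"] \
            and v["takes_object"] and n["object"]:
        trees.append(f"[[{verb} {adj}] {noun}]")
    # Reading 2: [verb [adj noun]] -- adjective as attributive of the noun.
    if a["attributive"] and v["takes_object"] and n["object"]:
        trees.append(f"[{verb} [{adj} {noun}]]")
    return trees

print(bracketings("light", "bright", "candle"))  # only the complement tree
print(bracketings("kick", "new", "ball"))        # only the attributive tree
print(bracketings("kick", "broken", "ball"))     # both trees: ambiguous
```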
Of course, this is only a simple example. You can already see that a generation rule often carries a long list of restrictions, and that these restrictions are not limited to simple feature agreement; some are complicated enough to require if-then logic. You can also see that Chinese is full of odd word-by-word differences, and which word carries which property is pure knowledge-base material, with no rule to derive it from. A practical syntactic analysis system usually employs hundreds of attribute tags; the Grammatical Knowledge-base of Contemporary Chinese, compiled by the Institute of Computational Linguistics at Peking University, contains 579 attributes. The ideal goal is this: find every factor in Chinese that can affect syntactic structure and tag every word in the dictionary accordingly; list every generation rule of Chinese syntax, and for each rule work out its application conditions, which sub-components' attributes the whole inherits once the rule applies, and under what circumstances new attributes arise. From the standpoint of generative linguistics, the computer would then be able to parse every Chinese sentence correctly.
With all this, can a computer extract from a sentence everything needed to understand its meaning? The answer is still no. There remain sentences in which segmentation, word senses, and structure are all unambiguous, yet the sentence as a whole is still ambiguous. Consider 鸡不吃了. It has two readings: the chicken will not eat any more, or we will not eat the chicken any more. This ambiguity comes neither from segmentation nor from word senses nor from structure: under both readings the syntactic structure is identical, 鸡 plus 不吃了. Why, then, does ambiguity still arise? Because beneath the syntactic structure lies a deeper semantic structure, and the two are not the same thing.
Chinese is strange in that the thing in subject position may be either the initiator of the action or its receiver: we can say "I have finished eating" and just as well "the food has been finished". The ambiguity of 鸡不吃了 arises precisely because a chicken can both eat and be eaten.
Nor is the thing in object position necessarily the receiver of the action. In 来客人了 ("there came a guest") and 住了一个人 ("a person is staying here"), the noun sits in object position yet is the initiator of the action. I remember a logic teacher lamenting how lax Chinese predicates are: clearly the sun shines on me, so why do we say 我晒太阳, literally "I sun the sun"? The sheer range of Chinese verb-object combinations offers stranger examples still. With the verb 写 ("write"), the object may name the content ("write characters"), the result ("write a book"), the instrument ("write a brush", i.e., write with a brush), the manner, or the place ("write the ground", i.e., write on the ground); and then there is 写狗, "write a dog". What on earth is "writing a dog"? Can one even say that? Of course: it is simply what I am writing about. "What is this week's essay about?" "I wrote a dog." You can imagine how a foreigner learning Chinese feels on reading all this. Syntactic analysis can determine which words in a sentence attach to which verb, but we need a new model to see the semantics of each attachment.
Chinese linguists divide the semantic relations between things and verbs into 17 kinds, the 17 "semantic roles": agent, experiencer, theme, force, patient, result, relatum, instrument, material, manner, content, dative, object, location, goal, source, and time. As you can see, the roles are divided very finely. Even "sender of the action" splits four ways: the agent sends an action in the true sense, like "he" in "he eats"; the experiencer is the subject of a perception, like "he" in "he knows this"; the theme is the subject of a natural state, like "he" in "he is ill"; the force is the sender of a natural power, like the flood in "the flood drowned the village". Both the exact division of the roles and the number 17 are debatable, but the model itself answers the question "what is semantics" very well.
Chinese grammar has a projection principle: the structure of a sentence is projected from its predicate. Once a verb is given, the number of semantic roles it can take is essentially fixed, and with it the structure of the complete sentence. Hearing the verb "rest" (休息), for instance, you feel that it needs only an agent and nothing more: we can say "Old Wang rested", but not "Old Wang rested his hand" or "Old Wang rested the sofa". We therefore say that "rest" has only one "argument", and write its argument structure as:
rest <agent>
So whenever we see the word "rest" in a sentence, we must find, inside or outside the sentence, the agent it requires. This process goes by the handsome name of "valency assignment", and "rest" is a typical "monovalent verb". Most verbs we deal with are divalent, though their particular arguments differ:
eat <agent, patient>
go <agent, goal>
drown <force, patient>
There are also trivalent verbs, such as
give <agent, dative, patient>
and even zero-valent verbs, such as
rain < >
Next we want to teach the computer to assign valency. The syntactic analysis described earlier already tells the computer which words each verb relates to; the essence of semantic analysis is to determine the specific relation in each case, so semantic recognition reduces to semantic role labeling. But a semantic role's position is not fixed: it may appear after the verb as well as before it. How can the computer tell which role is which? Before answering, let us ask ourselves: how do we know that "I" in "I have finished eating" is the eater, while "apple" in "the apple has been finished" is the thing eaten? Everyone will say that "I" can only be the eater, since obviously I cannot be eaten, and "apple" can only be the thing eaten, since an apple obviously cannot eat. In other words, both arguments of "eat" carry semantic requirements. Let us write the argument structure of "eat" in more detail:
eat <agent [semantic class: human | animal], patient [semantic class: food | medicine]>
The argument structure of "drown" can be annotated likewise:
drown <force [semantic class: natural thing], patient [semantic class: building | space]>
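As a sketch of how such annotated frames could drive automatic role labeling, here is a toy semantic dictionary plus two argument frames; the greedy matching procedure is deliberately naive and purely illustrative.

```python
# Labeling semantic roles by matching nouns against a verb's frame.
# Both dictionaries are tiny toy examples, not real resources.

SEMANTIC_CLASS = {                     # a miniature "semantic dictionary"
    "I": "human", "chicken": "animal", "apple": "food",
    "flood": "natural thing", "village": "space",
}

FRAMES = {                             # frames with selectional restrictions
    "eat":   [("agent", {"human", "animal"}),
              ("patient", {"food", "medicine"})],
    "drown": [("force", {"natural thing"}),
              ("patient", {"building", "space"})],
}

def label_roles(verb, nouns):
    """Greedily assign each noun to a compatible, unused role of `verb`."""
    assignment, used = {}, set()
    for noun in nouns:
        cls = SEMANTIC_CLASS.get(noun)
        for role, allowed in FRAMES[verb]:
            if role not in used and cls in allowed:
                assignment[noun] = role
                used.add(role)
                break
    return assignment

print(label_roles("eat", ["I"]))                   # {'I': 'agent'}
print(label_roles("eat", ["apple"]))               # {'apple': 'patient'}
print(label_roles("drown", ["flood", "village"]))  # force and patient
# A chicken is an animal and also food; a richer dictionary would put it
# in both classes, which is exactly why 鸡不吃了 is ambiguous.
```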
To let the computer label semantic roles automatically, then, we need two huge databases: a semantic dictionary and a dictionary of argument structures. Such massive manual engineering has long been undertaken. A large-scale semantic engineering project at Beijing Language and Culture University built an enormous semantic tree by hand. It divides words into four top categories: things, motion, time-and-space, and attributes. Things split into objects and events; objects split into concrete and abstract; concrete objects split into biological and non-biological. The biological side has five sub-categories: humans, animals, plants, microorganisms, and biological parts; the non-biological side has natural objects, artifacts, discarded objects, geometric figures, and non-biological parts; and artifacts alone have seven sub-categories: facilities, vehicles, implements, raw materials, consumables, information objects, and money. The whole semantic tree has 414 nodes, of which 309 are leaves, with a maximum depth of 9 levels. On the argument-structure side, a machine dictionary of modern Chinese predicate verbs was completed jointly by Tsinghua University and Renmin University of China; for each verb it records syntactic and semantic information such as its pinyin, sense, classification, number of arguments, the semantic roles of those arguments, and the semantic restrictions on them.
Speaking of semantic engineering, I must mention Mr. Dong Zhendong's HowNet (知网). It is a knowledge base that combines semantic classification with semantic relations: the semantic tree captures what words have in common, while the semantic relations capture each word's individuality. It not only tells you that "doctor" and "patient" are both people, but also that a "doctor" can perform a "cure" action on a "patient". The idea behind HowNet is very similar to that of WordNet, the English lexical semantic database begun at Princeton in 1985, which is likewise built on a network of semantic relations: synonymy, antonymy, hypernym and hyponym, whole and part, subset and superset, material and product, and so on. If you have Mathematica installed, you can access WordNet data through the WordData function. As for the Chinese knowledge bases mentioned above, please don't ask me; I don't know where to get them either.
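If you use Python rather than Mathematica, the NLTK corpus reader exposes the same WordNet relations. A minimal example, assuming the nltk package is installed and its wordnet corpus has been downloaded:

```python
# Browsing WordNet relations with NLTK.
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

doctor = wn.synsets("doctor")[0]   # first sense: the physician
print(doctor.definition())         # gloss of this sense
print(doctor.hypernyms())          # more general concepts
print(doctor.hyponyms()[:3])       # a few more specific concepts

# Walk up the hypernym chain toward the root concept "entity".
node = doctor
while node.hypernyms():
    node = node.hypernyms()[0]
    print(node.name())
```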
At this point everyone may cheer: the syntactic and semantic problems of Chinese information processing are now effectively solved! Actually, no. The argument-structure model of semantic roles has plenty of problems. Metaphor comes to mind at once, as in "information drowned me" or "sorrow drowned me". Each time a new usage of a verb appears, all we can do is patch the argument structure:
drown <force [semantic class: natural thing | abstract thing], patient [semantic class: building | space | human]>
More troublesome are the following violations of semantic restrictions. One is negation, for example "James does not eat ideas". Another is questions, for example "How could James eat ideas?". Worse still are out-of-the-ordinary yet perfectly real usages: search any news site and you will find all sorts of sentences that break the selectional restrictions. I searched for "eats metal" and immediately saw the headline "A French old man eats metal for a living". To handle all these, the valency model needs patch upon patch.
Moreover, valency theory so far only addresses the semantics around verbs. What about other words? Fortunately, a similar valency theory can be developed for nouns. An ordinary noun is usually considered zero-valent, while "teacher" is a monovalent noun, because when we mention a teacher we normally say whose teacher it is. "Attitude" is a divalent noun, because a statement about attitude is complete only as "someone's attitude toward something". Adjectives have valency too: "excellent" is a monovalent adjective and "friendly" a divalent one ("friendly toward someone"), for similar reasons. Valency theory has many further complications that we will not detail here.
Even so, many problems remain that valency theory cannot solve. One is the semantic orientation of complements. "Chopped bare", "chopped tired", "chopped blunt", and "chopped too fast" are all verb + complement, yet the complement describes a different element each time: "chopped bare" describes the trees, "chopped tired" describes the person chopping, "chopped blunt" describes the axe, and "chopped too fast" describes the chopping itself. Apparently each argument of a verb is constrained not only by semantic class but also by which element a complement may semantically point to.
There is also the semantic relation between two verbs. In "holds on and won't let go", the two parts stand in a relation of repetition; in "bringing it up makes people angry", "bringing it up" and "makes people angry" stand in a conditional relation: each occurrence of the bringing-up event produces the making-angry result. You may ask: is there really a difference between the two cases? There is, and I can prove it. Make the sentence 留着没用 ("keeping it, no use") and you will find it ambiguous: it can be read like the repetition case, "it has been kept all along and never used", or like the conditional case, "if you keep it, it will be of no use". So verb + verb combinations do yield different semantic relations, and these call for yet another model.
The semantics of function words is more troublesome still. Do not assume that 了 simply marks completion: "this book has been read for three days" can perfectly well mean that it is not finished yet. How many senses 了 has is still an open question. Adverbs are function words too, and their meanings are just as elusive: compare "Zhang San and Li Si got married" with "Zhang San and Li Si both (都) got married", and you will see that the semantics of the little word 都 is not simple at all.
That said, in real product applications the problems above may not matter much. Everything in this article belongs to the rule-based school of language processing. At present, the more practical route is probabilistic statistics over large-scale real corpora combined with machine learning; this route can sidestep many specific linguistic problems, and its results are quite satisfactory. Maximum entropy models and conditional random fields are currently very common methods in natural language processing, well worth studying if you are interested. They have their own shortcoming, though: their behavior is hard to interpret and predict. Whichever route one takes, we still seem a long way from the goal. Let us look forward to a new generation of language models that solves all of the problems above.
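For readers who want to try the statistical route, here is a minimal conditional-random-field sketch using the third-party sklearn-crfsuite package. The two-sentence "corpus" and the feature function are toy assumptions, meant only to show the shape of the approach.

```python
# Toy CRF sequence labeling (pip install sklearn-crfsuite).
# The tiny training set and features are illustrative, not a real corpus.
import sklearn_crfsuite

def word_features(sent, i):
    """Features for the i-th word: the word itself and its neighbors."""
    feats = {"word": sent[i], "first": i == 0}
    if i > 0:
        feats["prev_word"] = sent[i - 1]
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1]
    return feats

train_sents = [["the", "door", "has", "no", "lock"],
               ["they", "lock", "the", "door"]]
train_tags = [["det", "noun", "verb", "det", "noun"],
              ["pron", "verb", "det", "noun"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

test = ["they", "lock", "the", "door"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))
# With real data, the context features let the model read "lock" as a
# noun or a verb depending on its neighbors.
```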