Requirements analysis
In natural language processing for human-machine dialogue, users express the same thing in many different ways. For example, "I like you" and "You are the one I like" mean the same thing. How can a computer understand such varied sentences and give the same answer to synonymous questions? This article tries to use syntax trees, dependency trees, and related tools to convert varied questions into a more uniform sentence form, so that the computer can recognize them more easily.
Feature representation
Our goal is to handle the diversity of Chinese sentences. If words are used directly as features, the number of possible word combinations makes the problem needlessly complex. For example, ① "You are liked by me" and ② "She is liked by me" can be converted into the same form: ① "I like you" and ② "I like her". Statements that share a form are grouped into one class, and the computer organizes its answers according to these classes.
In the example above, labeling can reduce the combinatorial complexity. The labels can be semantic labels based on domain knowledge, or grammatical part-of-speech tags. Because domain knowledge is hard to summarize, this article uses the second method. For example, ① "I like you" and ② "I like her" can both be labeled PN + V + PN.
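The labeling idea can be sketched in a few lines of Python. This is only an illustration: the toy lexicon, the tag names, and the helper functions `pos_pattern` and `group_by_pattern` are assumptions made for the example, and a real system would use a part-of-speech tagger instead of a lookup table.

```python
# Toy lexicon: the words and tags are illustrative assumptions, not project data.
TOY_LEXICON = {
    "I": "PN", "you": "PN", "her": "PN", "she": "PN",
    "like": "V", "love": "V",
}

def pos_pattern(sentence):
    """Replace every token by its part-of-speech tag and join the tags with ' + '."""
    tokens = sentence.rstrip(".").split()
    return " + ".join(TOY_LEXICON.get(tok, "UNK") for tok in tokens)

def group_by_pattern(sentences):
    """Group sentences that share the same tag pattern."""
    groups = {}
    for s in sentences:
        groups.setdefault(pos_pattern(s), []).append(s)
    return groups

groups = group_by_pattern(["I like you", "I like her"])
print(groups)  # both sentences fall under the single key "PN + V + PN"
```

Grouping by tag pattern rather than by words means one rule can serve every sentence in the class.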
Feature selection
Natural language processing offers two kinds of tools for this: the syntax tree and the dependency tree. We can use the part-of-speech labels, grammatical edges, and tree structure of a syntax tree or dependency tree to represent a class of statements, and then give each class a rule that converts its members into a unified form.
Model selection
Once the features are selected, how can a single rule be recorded for a class of statements so that they are converted into a uniform form? This article uses the synchronous tree substitution grammar (STSG) model.
The material below is compiled from "Sentence Compression as Tree Transduction" and "A Statistical Machine Translation Model Based on Synchronous Tree Substitution Grammar", with excerpts from "Principles and Applications of Artificial Intelligence (2nd edition)".
The abstract of "Sentence Compression as Tree Transduction" reads:
"This paper presents a tree-to-tree transduction method for sentence compression. Our model is based on synchronous tree substitution grammar, a formalism that allows local distortion of the tree topology and can thus naturally capture structural mismatches." The authors propose a tree-to-tree sentence compression algorithm based on synchronous tree substitution grammar (STSG).
The abstract of "A Statistical Machine Translation Model Based on Synchronous Tree Substitution Grammar" reads:
"We propose a machine translation model based on synchronous tree substitution grammar. Compared with phrase-based models, this model can model long-range structural reordering and the translation of discontinuous phrases; compared with models based on synchronous context-free grammar, it can model the reordering of tree nodes at any level."
Context-free grammar is a formalism proposed by Chomsky for describing the grammar of natural language. In this grammar, grammatical knowledge is represented by rewrite rules. Here is an example:
We analyze a sentence from a small subset of English, "The professor trains Jack.", with a context-free grammar.
The manually written rewrite rules are:
Statement → Sentence Terminator
Sentence → NounPhrase VerbPhrase
VerbPhrase → Verb NounPhrase
NounPhrase → Article Noun
NounPhrase → ProperNoun
Article → the
Noun → professor
Verb → wrote
Noun → book
Verb → trains
ProperNoun → Jack
Terminator → .
Using the rewrite rules above, the sentence can be rewritten into a parse tree (also called a syntax tree).
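The rewrite rules can be run directly as a small top-down parser. The sketch below is a minimal illustration, not the project's code. The English symbol names (Article, ProperNoun, and so on) are my own rendering of the rules, the input is lowercased before matching, and the functions `parse`, `expand`, and `parse_sentence` are hypothetical helpers. It relies on the grammar having no left recursion.

```python
# Rewrite rules from the example; a symbol that is not a key is a terminal.
RULES = {
    "Statement": [["Sentence", "Terminator"]],
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
    "NounPhrase": [["Article", "Noun"], ["ProperNoun"]],
    "Article": [["the"]],
    "Noun": [["professor"], ["book"]],
    "Verb": [["wrote"], ["trains"]],
    "ProperNoun": [["jack"]],
    "Terminator": [["."]],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` derives tokens from `pos`."""
    if symbol not in RULES:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in RULES[symbol]:
        yield from expand(rhs, tokens, pos, (symbol,))

def expand(rhs, tokens, pos, partial):
    """Match the right-hand-side symbols in order, growing the partial tree."""
    if not rhs:
        yield partial, pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(rhs[1:], tokens, nxt, partial + (child,))

def parse_sentence(text):
    """Return the first parse tree that covers the whole input, or None."""
    tokens = text.lower().replace(".", " .").split()
    for tree, end in parse("Statement", tokens, 0):
        if end == len(tokens):
            return tree
    return None

tree = parse_sentence("The professor trains Jack.")
print(tree)
```

Each tree node is a tuple whose first element is the symbol and whose remaining elements are its children, so the printed result mirrors the shape of the parse tree.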
Context-free grammar reflects the hierarchical nature of natural language structure. Using it to describe natural language grammar formally is both rigorous and convenient for computer implementation. Key points:
In the rewrite rules, the terminal symbols are the English words the, professor, wrote, book, trains, Jack, and "."; all other symbols are non-terminal symbols. Equivalently, the leaf nodes of the parse tree are terminals and the non-leaf nodes are non-terminals. "Statement" is a special non-terminal called the start symbol, and it can be seen as the root node. The grammar is called context-free because the left side of every rewrite rule is a single, isolated non-terminal, which can be replaced by the symbol string on the right regardless of the context in which it appears. For example, a preceding article does not affect whether "Noun" can be rewritten as "professor".
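These definitions can be checked mechanically. In the sketch below, which assumes my own English names for the rule symbols, the non-terminals are the symbols that appear on a left-hand side, the terminals are every other symbol, and the start symbol is the one non-terminal that never appears on a right-hand side.

```python
# The example rules as (left-hand side, right-hand side) pairs.
RULES = [
    ("Statement", ["Sentence", "Terminator"]),
    ("Sentence", ["NounPhrase", "VerbPhrase"]),
    ("VerbPhrase", ["Verb", "NounPhrase"]),
    ("NounPhrase", ["Article", "Noun"]),
    ("NounPhrase", ["ProperNoun"]),
    ("Article", ["the"]),
    ("Noun", ["professor"]),
    ("Verb", ["wrote"]),
    ("Noun", ["book"]),
    ("Verb", ["trains"]),
    ("ProperNoun", ["Jack"]),
    ("Terminator", ["."]),
]

rhs_symbols = {sym for _, rhs in RULES for sym in rhs}
nonterminals = {lhs for lhs, _ in RULES}      # symbols that can be rewritten
terminals = rhs_symbols - nonterminals        # symbols that are never rewritten
start_symbols = nonterminals - rhs_symbols    # never produced by another rule

print(sorted(terminals))
print(start_symbols)
```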
A context-free grammar reflects only the hierarchy and generation process of a single sentence; it cannot express how one sentence relates to another. Natural language, however, is context-dependent, and relationships between sentences exist objectively. Chomsky therefore proposed transformational grammar. Transformational grammar holds that the structure of an English sentence has two levels: deep and surface. For example, "She read me a story." and "She read a story to me." differ in surface structure but describe the same thing, so their deep structure is the same. Active and passive sentences are another example. In transformational grammar, the mapping between the deep structure and the surface structure of a sentence is carried out by transformation rules, which transform a sentence from one structure to another. Key points:
The main algorithmic idea of transformational grammar is to use a context-free grammar to build the deep structure of a sentence, and then apply transformation rules to turn the deep structure into a surface structure that matches how people usually speak. Many problems in bilingual translation and in the displacement of sentences and sentence components are currently solved with this idea.
In practice, a transformation rule records the syntax tree structures of a pair of sentences from a parallel corpus. When a new sentence arrives whose syntax tree structure exactly matches the source structure, the rule can be applied to convert it into the target syntax tree, from which the target sentence is obtained.
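A minimal sketch of matching and applying one such rule follows. It is an illustration under stated assumptions rather than the project's implementation: trees are nested tuples of the form (label, child, ...), strings starting with "?" are variables that bind whole subtrees, and the node labels, the rule itself, and the helper names `match`, `build`, `apply_rule`, and `leaves` are all hypothetical.

```python
def match(pattern, tree, bindings):
    """Return variable bindings if `tree` matches `pattern`, otherwise None."""
    if isinstance(pattern, str):
        if pattern.startswith("?"):  # variable: binds any subtree
            if pattern in bindings and bindings[pattern] != tree:
                return None
            return {**bindings, pattern: tree}
        return bindings if pattern == tree else None
    if isinstance(tree, str) or len(pattern) != len(tree):
        return None
    for p, t in zip(pattern, tree):
        bindings = match(p, t, bindings)
        if bindings is None:
            return None
    return bindings

def build(template, bindings):
    """Instantiate the target template with the bound subtrees."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return tuple(build(t, bindings) for t in template)

def apply_rule(rule, tree):
    """Apply (source, target) to `tree`; return None if the source does not match."""
    source, target = rule
    bindings = match(source, tree, {})
    return None if bindings is None else build(target, bindings)

def leaves(tree):
    """Read the words off a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the node label
        words.extend(leaves(child))
    return words

# Hypothetical rule: "V NP1 NP2" becomes "V NP2 to NP1".
RULE = (
    ("S", "?subj", ("VP", "?verb", "?io", "?do")),
    ("S", "?subj", ("VP", "?verb", "?do", ("PP", ("P", "to"), "?io"))),
)

source_tree = ("S", ("NP", ("PN", "she")),
               ("VP", ("V", "read"), ("NP", ("PN", "me")),
                ("NP", ("DT", "a"), ("N", "story"))))
target_tree = apply_rule(RULE, source_tree)
print(" ".join(leaves(target_tree)))  # she read a story to me
```

Because the variables bind whole subtrees, the same rule applies to any new sentence whose tree has the same top-level shape, which is the "exact structural match" described above.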
In my understanding, synchronous tree substitution grammar is a transformational grammar. With the points above in mind, we can look at the definition of synchronous tree substitution grammar:
The definition above refers to an "elementary tree", so let's look at the definition of an elementary tree:
From the definition and the examples, we can see that an elementary tree is either a complete subtree or a subtree with some parts missing. However, every node that is kept in the elementary tree must keep its first layer of children complete, so a fragment such as PP (to) is not legal. The sub-trees used in tree kernel vectors should refer to the same kind of fragment.
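The completeness constraint can be expressed as a short recursive check. The sketch below reflects my reading of the constraint, not a formal definition from the cited papers: a fragment node may be cut off (have no children), but if it has children they must be exactly the children of the corresponding node in the full tree. The tuple representation, the example labels, and the function name `is_elementary` are assumptions.

```python
def is_elementary(fragment, full):
    """Check whether `fragment` is a legal tree fragment rooted at `full`.

    Every node is a tuple (label, child, ...); a node with no children in the
    fragment is treated as a cut-off point (or a real leaf).
    """
    if fragment[0] != full[0]:
        return False
    frag_children, full_children = fragment[1:], full[1:]
    if not frag_children:                 # subtree cut off here: allowed
        return True
    if len(frag_children) != len(full_children):
        return False                      # first layer of children is incomplete
    return all(is_elementary(f, c) for f, c in zip(frag_children, full_children))

# Hypothetical full subtree for the phrase "to school".
full_pp = ("PP", ("TO", ("to",)), ("NP", ("NN", ("school",))))

print(is_elementary(("PP", ("TO", ("to",)), ("NP",)), full_pp))  # True: NP is cut off as a whole
print(is_elementary(("PP", ("TO", ("to",))), full_pp))           # False: PP is missing its NP child
```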
Why mention the elementary tree? Because it is what allows rules to generalize. For example, S (VBS WJ) can represent either S (VBA (P-NG VO) WJ) or S (VBA (xx xx) WJ), which keeps the number of extracted transformation rules from growing too large.
In the STSG-based sentence compression algorithm, rules are constrained so that each constituent of the tree pair is either part of an aligned pair or a deleted constituent. The algorithm for learning (extracting) the transformation rules is roughly:
Algorithm implementation
Chomsky proposed transformational grammar, which holds that the structure of a sentence has two levels, deep and surface. For example:
"She read me a story." and "She read a story to me."
The two sentences above differ in surface structure, but they refer to the same thing; that is, their deep structure is the same.
This project uses STSG (synchronous tree substitution grammar) and dependency trees (sentence backbone extraction) to transform Chinese sentences from surface structure to deep structure.
The main principle of "sentence backbone extraction" with a dependency tree is the same as that of STSG. The difference is that STSG extracts rules from syntax trees. A syntax tree represents the grammatical information of a sentence with part-of-speech nodes, so its structure is more complex, and when sentence components shift it is hard to extract transformation rules that generalize well. A dependency tree represents the grammatical information of a sentence with semantic edges, so its structure is simpler and transformation rules with better generalization can be extracted. At present, however, these rules can only be written by hand: there is no rule-learning algorithm like the one that supports STSG, and hand-written rules are difficult to make comprehensive.
The rules currently use 7-tuples: <edge label, dep word, dep part of speech, dep position, gov word, gov part of speech, gov position>. The 7-tuple is a feature; see the third section, on feature selection.
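One possible way to hold such a 7-tuple and match hand-written rules against it is sketched below. The field names follow my reading of the tuple above; the wildcard convention (None matches anything), the edge labels "nsubj" and "dobj", and the names `EdgeFeature`, `EdgePattern`, and `matches` are assumptions made for the example and are not taken from the project code.

```python
from typing import NamedTuple, Optional

class EdgeFeature(NamedTuple):
    """One dependency edge described by the 7-tuple."""
    edge_label: str
    dep_word: str
    dep_pos: str
    dep_position: int
    gov_word: str
    gov_pos: str
    gov_position: int

class EdgePattern(NamedTuple):
    """A hand-written rule condition; None means 'match anything'."""
    edge_label: Optional[str] = None
    dep_word: Optional[str] = None
    dep_pos: Optional[str] = None
    dep_position: Optional[int] = None
    gov_word: Optional[str] = None
    gov_pos: Optional[str] = None
    gov_position: Optional[int] = None

def matches(pattern, feature):
    """True if every non-wildcard field of the pattern equals the feature's field."""
    return all(p is None or p == f for p, f in zip(pattern, feature))

# Hypothetical edges for "I like you" (positions are 1-based token indices).
edges = [
    EdgeFeature("nsubj", "I", "PN", 1, "like", "V", 2),
    EdgeFeature("dobj", "you", "PN", 3, "like", "V", 2),
]

# Pattern: any pronoun that is the object of any verb.
object_pattern = EdgePattern(edge_label="dobj", dep_pos="PN", gov_pos="V")
objects = [e.dep_word for e in edges if matches(object_pattern, e)]
print(objects)  # ['you']
```

Leaving the word fields as wildcards is what lets one pattern cover "I like you" and "I like her" alike.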
Project code
http://git.oschina.net/Keyven/IKeyven
References
Principles and Applications of Artificial Intelligence (2nd edition)
Sentence Compression as Tree Transduction
A Statistical Machine Translation Model Based on Synchronous Tree Substitution Grammar
A Sentence Recognition Algorithm Based on Syntax Trees