When it comes to calculating the semantic similarity of sentences, the first approaches that come to mind are the vector space model (VSM) and edit distance. Take A: "My dad is Li Gang" and B: "My son is Li Gang". With VSM, the sentences become the word vectors A (my, dad, is, Li Gang) and B (my, son, is, Li Gang), and the similarity is the cosine of the angle between the two vectors, which needs no further explanation here. Edit distance is even simpler: replacing "dad" with "son" turns A into B, so D(A, B) = replace_cost.
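As a rough illustration of the two baselines, here is a minimal sketch that works on pre-tokenized sentences. The function names, the unit costs, and the English tokens are assumptions made for this example, not part of any particular library.

```python
from collections import Counter
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def edit_distance(tokens_a, tokens_b, replace_cost=1, indel_cost=1):
    """Word-level edit distance computed by dynamic programming."""
    n = len(tokens_b)
    # prev[j] holds the distance between the processed prefix of A and tokens_b[:j]
    prev = [j * indel_cost for j in range(n + 1)]
    for i in range(1, len(tokens_a) + 1):
        curr = [i * indel_cost] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if tokens_a[i - 1] == tokens_b[j - 1] else replace_cost
            curr[j] = min(prev[j] + indel_cost,      # delete a word from A
                          curr[j - 1] + indel_cost,  # insert a word from B
                          prev[j - 1] + sub)         # keep or replace
        prev = curr
    return prev[n]

A = ["my", "dad", "is", "Li Gang"]
B = ["my", "son", "is", "Li Gang"]
print(cosine_similarity(A, B))  # 3 shared words, both norms are 2 -> 0.75
print(edit_distance(A, B))      # one replacement -> 1
```

Both functions only compare surface tokens, which is exactly the limitation discussed next.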
Both methods are rather crude; they are the baseline of baselines. Consider three more examples. A: "How to build a building?", B: "How to play golf?", C: "How to build a house?". In the original wording, A and C use different words for "how" and say "building" versus "house", while B and C happen to share the same word for "how". With VSM, B and C therefore come out as more similar than A and C, even though A and C are the pair that mean nearly the same thing. Edit distance has the same problem.
This problem is not hard to solve. Expand every sentence with a synonym dictionary, so that the two words for "how", as well as "building" and "house", are treated as synonyms or near-synonyms, and then compute VSM or edit distance on the expanded sentences. This relieves the low-recall problem to some extent, but expansion inevitably introduces noise, especially when the original sentence contains polysemous words; for example, the Chinese phrases for buying "soy sauce" and knitting a "sweater" use the same verb character in different senses. In Chinese, a single character can carry many meanings. Mr. Dong Zhendong's HowNet gives a good account of the semantic relations for such characters, and its tree structure, running from words down to sememes, can be used to measure similarity at word granularity.
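The synonym expansion idea can be sketched as follows. Everything here is a toy assumption: the synonym groups are hand-written, and the English tokens "how" / "what-way" and "build" / "construct" are only stand-ins for different surface words with the same meaning.

```python
from collections import Counter
import math

# Hypothetical synonym groups; a real system would load a synonym dictionary.
SYNONYM_GROUPS = [
    {"how", "what-way"},
    {"build", "construct"},
    {"building", "house"},
]

def expand(tokens):
    """Replace each token with every member of its synonym group."""
    expanded = []
    for tok in tokens:
        group = next((g for g in SYNONYM_GROUPS if tok in g), {tok})
        expanded.extend(sorted(group))
    return expanded

def cosine(tokens_a, tokens_b):
    """Bag-of-words cosine similarity."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

A = ["how", "construct", "building"]
B = ["what-way", "play", "golf"]
C = ["what-way", "build", "house"]

# Before expansion: A and C share no surface word, B and C share one.
print(cosine(A, C), cosine(B, C))
# After expansion: A and C cover the same expanded vocabulary.
print(cosine(expand(A), expand(C)), cosine(expand(B), expand(C)))
```

Before expansion B and C look closer than A and C; after expansion the ranking flips. The same mechanism is where the noise comes from: a polysemous word that belongs to several groups would pull unrelated sentences together.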
The problem might seem solved, but this is far from enough. VSM treats the words in a sentence as independent features and ignores the influence of word order and position on the meaning of the sentence. Edit distance does take word order into account, but only through mechanical replacement, movement, deletion, and insertion. In reality, each word carries a different amount of information, and the same word can carry very different amounts of information or meaning in different word combinations. What about parsing the sentences and computing the similarity of their syntax trees? That should be more reliable than the first two methods, because a syntax tree describes the role each word plays in the sentence. How well it actually works needs to be confirmed by experiment.
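A small sketch makes the word-order point concrete. The two sentences below are a made-up example, not from the discussion above: they contain exactly the same words, so any bag-of-words model sees them as identical, while word-level edit distance only reports two mechanical replacements.

```python
from collections import Counter
from functools import lru_cache

def word_edit_distance(a, b):
    """Word-level edit distance with unit costs (memoized recursion)."""
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        sub = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + sub)

    return d(len(a), len(b))

A = "the dog bites the man".split()
B = "the man bites the dog".split()

same_bag = Counter(A) == Counter(B)  # identical word counts, so VSM cosine is 1
distance = word_edit_distance(A, B)  # "dog" and "man" are each replaced once

print(same_bag)   # True
print(distance)   # 2
```

Neither number says that the roles of "dog" and "man" have been swapped, which is the kind of information a syntax tree is expected to capture.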
By the way, there is also the translation-model approach, a major IBM innovation in the field of machine translation. It requires a large corpus for training to obtain good translation results, and word alignments are produced as an intermediate result. If we could use web resources to build a high-quality corpus of similar sentence pairs and align the words in each pair through EM iteration, sentence similarity could then be derived from the word alignments. It seems like a promising idea!
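To make the idea a little more concrete, here is a rough sketch in the spirit of IBM Model 1, simplified by leaving out the NULL word. The three-pair corpus is a toy assumption, and the scoring function `alignment_similarity` is only one hypothetical way to turn word-translation probabilities into a sentence score; the text above does not prescribe a specific formula.

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] with EM over sentence pairs."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform start for every word pair that co-occurs in some sentence pair.
    t = {(tw, sw): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for tw in tgt for sw in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c
        # M-step: renormalize the counts per source word.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

def alignment_similarity(a, b, t):
    """Average, over the words of b, of the best-aligned probability in a."""
    def p(tw, sw):
        # Assumption: identical words count as a perfect match.
        return 1.0 if tw == sw else t.get((tw, sw), 0.0)
    return sum(max(p(tw, sw) for sw in a) for tw in b) / len(b)

# Toy corpus of paraphrase pairs (hypothetical data).
pairs = [
    (["build", "house"], ["construct", "building"]),
    (["build", "bridge"], ["construct", "viaduct"]),
    (["repair", "bridge"], ["fix", "viaduct"]),
]
t = train_model1(pairs)

close = alignment_similarity(["build", "house"], ["construct", "building"], t)
far = alignment_similarity(["build", "house"], ["fix", "viaduct"], t)
print(close, far)  # the paraphrase pair should score clearly higher
```

With real data the corpus would have to be far larger, and the direction of the model (which sentence plays the source role) would need to be handled, for instance by averaging both directions.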