Calculation of similarity in short text

Source: Internet
Author: User

Methods for computing short-text similarity fall into two camps: deep-learning-based and non-deep-learning-based. Academic research mostly takes the deep learning route, though in my view it is hard to do anything novel with monolingual short-text similarity; cross-language similarity via deep learning offers somewhat more room. Engineering practice mostly avoids deep learning, mainly because labeled data is hard to obtain (unless the company pays people to annotate it). Below, drawing on similarity-calculation tasks I have worked on, I describe text similarity from both the deep learning and the non-deep-learning perspectives.

First, text similarity calculation based on deep learning. Let me describe the background before getting started. We wanted to build query recommendation for a search box (see Figure 1): the user types a query and the page offers some reasonable suggestions (these can come from earlier queries recorded in the site logs). To pick the most appropriate recommendations from the log, we compute the similarity between the user's query Q and each candidate recommendation in S = {S1, S2, S3, ..., SN}, and return candidates to the user ranked by that similarity.

Figure 1 Query recommendations in the search box

For this task we used a Siamese network (see Figure 2) to compute text similarity. Because the candidate set S is fairly large and the actual texts are fairly short, we used CNNs as the basic unit of the Siamese network, which improves speed while still preserving accuracy. One benefit of CNNs is that the two branches can share parameters, so we compared model performance under three settings: the two CNNs share no parameters, share some parameters, or share all parameters. The experiments showed that when the query and the recommendations are in the same language, the more parameters the two CNNs share, the better the model performs; when they are in different languages, the more parameters are shared, the worse the model performs.

Figure 2 Siamese Network

The difference comes down to whether the query language matches the candidate recommendation language. When they match, the vocabulary is the same, and a single set of parameters fits both distributions well; when they differ, the linguistic environments diverge, and it is hard for one set of parameters to model both distributions. CNN-based similarity still struggles to capture the semantics of a whole sentence, and using dilated CNNs (IDCNN) as the basic unit of the Siamese network should work better. In addition, treating the word-by-word similarity matrix of two sentences as an image and processing it accordingly is also well suited to query recommendation; according to a speaker at the NLPCC conference, this method is particularly practical when data is scarce.
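To make the shared-versus-unshared comparison concrete, here is a toy NumPy sketch of a Siamese branch: a 1-D convolution over a sentence's word vectors followed by max-over-time pooling, with cosine similarity between the two pooled representations. All shapes and the random weights are illustrative assumptions; the real model would be trained, but the structural point stands: passing the *same* weight tensor to both branches is the fully shared setting, and passing independent tensors is the unshared setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_encode(x, W):
    # x: (seq_len, emb_dim) word vectors for one sentence.
    # W: (window, emb_dim, n_filters) convolution weights.
    # Valid 1-D convolution over time, tanh, then max-over-time pooling.
    window, emb_dim, n_filters = W.shape
    seq_len = x.shape[0]
    feats = np.empty((seq_len - window + 1, n_filters))
    for t in range(seq_len - window + 1):
        patch = x[t:t + window]  # (window, emb_dim)
        feats[t] = np.tensordot(patch, W, axes=([0, 1], [0, 1]))
    return np.tanh(feats).max(axis=0)  # (n_filters,)

def siamese_similarity(x1, x2, W1, W2):
    # Pass W1 is W2 for the fully shared setting (same-language case),
    # or independent W1, W2 for the unshared setting (cross-language case).
    h1, h2 = conv1d_encode(x1, W1), conv1d_encode(x2, W2)
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

W_shared = rng.normal(size=(2, 8, 4))  # window=2, emb_dim=8, 4 filters
```

With shared weights, identical inputs are guaranteed to map to identical representations (similarity 1.0); with unshared weights there is no such guarantee, which is exactly why sharing helps when both sides live in the same vocabulary and hurts when they do not.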

There are many non-deep-learning methods for text similarity, mainly vocabulary-based, sentence-structure-based, and word-vector-based. Here we focus on sentence similarity computed from a weighted sum of word vectors. With or without deep learning, one of the hard parts of similarity calculation is representing a sentence reasonably, and the main non-deep-learning approaches to vectorizing a sentence are the vector space model (VSM) and summing the sentence's word vectors. The former suffers from sparse data and overly long sentence representations, so we represent a sentence directly as a weighted sum of its word vectors, where the weight depends mainly on whether the word is a keyword (jieba and pynlpir can compute this directly).
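The weighted-sum representation can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding table and the keyword set are supplied by the caller (in practice the keywords might come from jieba's TF-IDF extractor and the embeddings from a downloaded pretrained file), and `keyword_weight` is an arbitrary illustrative boost, not a value from the original post.

```python
import numpy as np

def sentence_vector(words, embeddings, keywords=(), keyword_weight=2.0):
    # words: segmented tokens of one sentence.
    # embeddings: dict mapping word -> np.ndarray vector.
    # keywords: words that should receive a higher weight.
    vecs, weights = [], []
    for w in words:
        if w not in embeddings:
            continue  # out-of-vocabulary words are skipped
        vecs.append(embeddings[w])
        weights.append(keyword_weight if w in keywords else 1.0)
    if not vecs:
        return None  # no known words: no representation
    weights = np.asarray(weights)[:, None]
    # Weighted average, normalized by the total weight.
    return (np.asarray(vecs) * weights).sum(axis=0) / weights.sum()
```

Two sentence vectors produced this way can then be compared with cosine similarity. Note how the `continue` branch silently drops out-of-vocabulary words, which is exactly where the segmentation-versus-embedding mismatch discussed below does its damage.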

Representing a sentence as the weighted sum of its word vectors and then computing similarity has several advantages over the deep learning approach: it does not require much manpower or material, and it generalizes relatively well (unlike a trained deep model, whose performance can vary greatly from one test set to another). But it has problems of its own, chiefly that its performance does not match deep-learning methods. Simply adding word vectors ignores sentence pattern, grammar, and structure, so the result can hardly capture the full semantics of a sentence. Errors in word segmentation (whose influence, in my experience, is particularly large) propagate into the downstream similarity calculation, and mismatches between the segmentation output and the available pretrained word vectors (downloaded online) also hurt the results.

In summary, text similarity calculation still presents many difficulties to overcome, and these difficulties often involve foundational NLP knowledge; the road ahead feels heavy and long.
