To compare the similarity between an article and other articles, you can take the following steps.
1. Definition: decide what counts as similarity between two articles.
A) There are several identical words or keywords.
B) There are several identical sentences.
C) There are several identical paragraphs.
2. Design the algorithm and coefficients: the formula has to be tuned based on test data and experience. As a starting point, assume the following.
A) Assume that subject-word (keyword) similarity accounts for 20% of the similarity of the entire article, that is, 0.2 * A.
B) Assume that sentence similarity accounts for 50% of the similarity of the entire article, that is, 0.5 * B.
C) Assume that paragraph similarity accounts for 30% of the similarity of the entire article, that is, 0.3 * C.
3. The similarity of the entire article is then 0.2 * A + 0.5 * B + 0.3 * C. What remains is how to calculate A, B, and C.
A) Subject-word (keyword) similarity: this can be calculated from the proportion of subject-word occurrences that the two articles have in common.
B) Sentence similarity: split each article into sentences at punctuation marks, count the sentences that are identical in both, and divide by the total number of sentences in the two articles.
C) Paragraph similarity: this can be calculated with a dynamic programming algorithm that finds the length of the longest string the two articles have in common. For details, see the dynamic programming chapter of Introduction to Algorithms.
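
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions that are not part of the original description: all function names are hypothetical, A is taken as the shared word occurrences divided by the total word occurrences, B is taken as identical sentences divided by all distinct sentences, and C is taken as the longest common substring length divided by the length of the shorter article. The weights 0.2, 0.5, and 0.3 are the starting values from step 2 and would need tuning on real data.

```python
import re
from collections import Counter


def keyword_similarity(text_a, text_b, keywords=None):
    """A: shared word occurrences / total word occurrences (assumed formula)."""
    def counts(text):
        words = re.findall(r"\w+", text.lower())
        if keywords is not None:
            # Restrict to a caller-supplied set of lowercase subject words.
            words = [w for w in words if w in keywords]
        return Counter(words)

    a, b = counts(text_a), counts(text_b)
    shared = sum((a & b).values())  # minimum count of each word
    total = sum((a | b).values())   # maximum count of each word
    return shared / total if total else 0.0


def sentence_similarity(text_a, text_b):
    """B: identical sentences / all distinct sentences (assumed formula)."""
    def sentences(text):
        # Break sentences at common Western and CJK punctuation marks.
        parts = re.split(r"[.!?;。！？；]+", text.lower())
        return {p.strip() for p in parts if p.strip()}

    a, b = sentences(text_a), sentences(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def longest_common_substring_length(s, t):
    """Dynamic programming: length of the longest substring shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common run that ended at s[i-2], t[j-2].
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def paragraph_similarity(text_a, text_b):
    """C: longest common substring / shorter article length (assumed formula)."""
    shorter = min(len(text_a), len(text_b))
    if shorter == 0:
        return 0.0
    return longest_common_substring_length(text_a, text_b) / shorter


def article_similarity(text_a, text_b):
    """Weighted sum 0.2 * A + 0.5 * B + 0.3 * C from step 3."""
    a = keyword_similarity(text_a, text_b)
    b = sentence_similarity(text_a, text_b)
    c = paragraph_similarity(text_a, text_b)
    return 0.2 * a + 0.5 * b + 0.3 * c


if __name__ == "__main__":
    first = "The cat sat on the mat. It was warm."
    second = "The cat sat on the mat. It was cold."
    print(article_similarity(first, second))
```

The dynamic programming step takes time proportional to the product of the two article lengths, so for long articles it may be more practical to run it per paragraph rather than on the whole text.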