When we want to find highly similar document pairs among 1 million documents, we need to compare every document with every other one, for a total of about half a trillion (roughly 5 × 10^11) comparisons. If each comparison takes 1 microsecond, the calculation takes about six days. Applications of this problem: 1. Plagiarism checking of papers. I have heard of this term since I was a university student, and it is one application of this problem,
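The brute-force cost can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the 1 microsecond per comparison is the figure assumed in the text, not a measurement.

```python
def pair_count(n: int) -> int:
    """Number of unordered document pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def brute_force_days(n: int, seconds_per_comparison: float = 1e-6) -> float:
    """Days needed to compare every pair once at the given cost per comparison."""
    return pair_count(n) * seconds_per_comparison / 86_400  # 86,400 seconds per day


pairs = pair_count(1_000_000)
days = brute_force_days(1_000_000)
print(f"{pairs:,} comparisons, about {days:.1f} days")
# prints: 499,999,500,000 comparisons, about 5.8 days
```

Because the pair count grows quadratically, doubling the collection roughly quadruples the running time, which is why exhaustive comparison does not scale.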
From: http://blog.chinaunix.net/uid-26548237-id-3541783.html
1. Vector space model. The vector space model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Documents and queries are both expressed as vectors. Each dimension corresponds to a separate term. If a term appears in the document, its value in the vector is nonzero.
Why is it called a vector space model? We can think of each word as a dimension and the frequency of the word as its value along that dimension, a quantity with a direction, that is, a vector. The words of each article and their frequencies then form a figure in an n-dimensional space, and the similarity of two documents is the proximity of their two figures in that space. If the article has only two dimensions, then the space map
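One common way to measure the "proximity" of two term-frequency vectors is the cosine of the angle between them. The source does not name a specific measure, so cosine similarity is an assumption here, and the whitespace tokenizer and sample sentences are purely illustrative.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Map each lowercase, whitespace-separated word to its frequency."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_vector("the cat sat on the mat")
doc2 = term_vector("the cat lay on the rug")
doc3 = term_vector("stock prices fell sharply")

print(cosine_similarity(doc1, doc2))  # shares "the", "cat", "on"
print(cosine_similarity(doc1, doc3))  # no shared words
```

Documents with no words in common score 0, and a document compared with itself scores 1. Raw frequencies give common words such as "the" a lot of weight, so practical systems usually reweight terms, for example with TF-IDF.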
Case: recommend similar articles while the user is reading an article. The use case is simple and blunt; especially when I read novels and run out of books, I really want such a feature. (PS: I work for a fiction company now.) So, how do you measure the similarity between articles? Before we start, a word about Elasticsearch. The index used by Elasticsearch is called an inverted index. Split the document into
containing the same word, as a similar document. The advantage of this method is that it is efficient; the disadvantage is that it is not accurate. This interface provides a number of parameters that you can configure to select interesting terms.
MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);
The usage is simple, and it gets you similar documents.
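The idea of treating documents that contain the same words as similar can be illustrated with a toy inverted index. This is only a sketch and not how Elasticsearch or Lucene implement it: the function names and sample documents are hypothetical, and it simply counts shared distinct terms rather than selecting and weighting interesting terms.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def similar_docs(doc_id: str, docs: dict, index: dict) -> list:
    """Rank the other documents by how many distinct terms they share with doc_id."""
    shared = Counter()
    for term in set(docs[doc_id].lower().split()):
        for other in index.get(term, ()):
            if other != doc_id:
                shared[other] += 1
    return shared.most_common()


docs = {
    "a": "dragon knight saves the kingdom",
    "b": "young knight fights a dragon",
    "c": "recipe for tomato soup",
}
index = build_inverted_index(docs)
print(similar_docs("a", docs, index))  # prints: [('b', 2)]
```

The lookup only touches the posting sets of the query document's own terms, which is what makes the approach fast. The inaccuracy mentioned above shows up as soon as frequent words such as "the" appear in many documents, since every shared word counts equally here.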
In text processing, for example product review mining, you sometimes need to know the similarity between each review and the description of the item in order to measure the objectivity of the review. Is there a Python package for calculating text similarity? Yes, and a very powerful one. Next we will try out gensim. pre_file.py
# -*- coding: utf-8 -*-
import MySQLdb
impor
Open-source: Calculate a fingerprint for each document and then use the fingerprint for similarity calculation
TextSimilarity textSimilarity = new TextSimilarity();
// Compute the similarity fingerprint of each article
int sourceFingerprint = textSimilarity.CalcTextFingerprint(sourceText);
int destFingerprint = textSimilarity.CalcTextFingerprint(destText);
// Compare the fingerprint to calculate the
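The source does not say which fingerprint algorithm this library uses. As an illustration of the general idea only, the sketch below uses SimHash, a well-known fingerprinting scheme in which similar texts tend to produce fingerprints that differ in few bits. All names here are hypothetical, and the 64-bit width and MD5 token hash are arbitrary choices.

```python
import hashlib

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of the whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_similarity(a: int, b: int) -> float:
    """Similarity in [0, 1]: 1.0 means the fingerprints are identical."""
    return 1.0 - hamming_distance(a, b) / BITS


source_fp = simhash("the quick brown fox jumps over the lazy dog")
dest_fp = simhash("the quick brown fox jumped over the lazy dog")
print(fingerprint_similarity(source_fp, dest_fp))
```

Comparing two integers is far cheaper than comparing two full texts, so the fingerprints can be computed once per document and then reused for every comparison.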