The TF-IDF algorithm: computing the relevance between keywords and each article in a text set

Source: Internet
Author: User
Tags: idf

Computing the relevance between keywords and each article in a text set: suppose the corpus contains tens of thousands of articles of varying length. You enter a keyword or a sentence, and the code uses TF-IDF values to retrieve the articles most similar to it.

    • 1. TF-IDF Overview

TF-IDF is a statistical method used to evaluate how important a word is to one document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how often it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or score of the relevance between documents and user queries.

The main idea of TF-IDF is this: if a word or phrase appears in an article with a high term frequency (TF) and is seldom seen in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification.

TF-IDF is simply TF * IDF. The TF-IDF value is proportional to how frequently the word occurs in the document, and inversely related to how many documents in the corpus contain it.
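The product can be sketched with made-up numbers (all counts below are hypothetical, not taken from the article's corpus):

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Score = TF * IDF for a single term in a single document."""
    tf = term_count / doc_length               # higher in-document frequency -> higher score
    idf = math.log(num_docs / docs_with_term)  # more documents containing the term -> lower score
    return tf * idf

# Same in-document frequency, but one word is common across the corpus
# and the other is rare:
common = tf_idf(term_count=3, doc_length=100, num_docs=1000, docs_with_term=500)
rare = tf_idf(term_count=3, doc_length=100, num_docs=1000, docs_with_term=5)
# the rare word scores higher even though its TF is identical
```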

    • 2. Term frequency (TF) and inverse document frequency (IDF)

TF stands for term frequency; IDF stands for inverse document frequency.

Term frequency (TF) = (number of occurrences of a term in an article) / (total number of words in the article). TF represents how frequently the term appears in document D.
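This definition fits in one line; the token list below is a toy, already-segmented article, not data from the corpus:

```python
# A hypothetical segmented article of six tokens.
words = ["ai", "development", "is", "fast", "ai", "wins"]

# occurrences of the term in the article / total words in the article
tf = words.count("ai") / len(words)
# "ai" occurs 2 times out of 6 words
```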

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)). The +1 in the denominator avoids division by zero. The main idea of IDF: the fewer documents contain the word (i.e., the smaller the denominator inside the log), the larger the IDF, and the better the word's class-distinguishing ability.

    • 3. Stop words and the corpus (already word-segmented)

Stop words fall broadly into two categories. The first is the functional words of human language. These are very common and, compared with other words, carry little concrete meaning: for example 'the', 'is', 'in', 'which', 'on', and so on. For search engines, however, removing stop words can cause problems when the query itself contains them, especially in compound names such as 'The Who'. The second category consists of lexical words such as 'want'. These words are used so widely that a search engine cannot guarantee truly relevant results for them; they do little to narrow the search scope and also reduce search efficiency. They are therefore usually removed from the query, which improves search performance.
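Filtering stop words out of a segmented query can be sketched as follows; the stop set here is a tiny stand-in for the full stop-word list used later:

```python
# A stand-in stop-word set (the real list is loaded from a file).
stop_list = {"the", "is", "in", "which", "on", "want"}

def remove_stop_words(tokens):
    # Build a new list instead of calling list.remove() while iterating,
    # which would silently skip elements.
    return [t for t in tokens if t not in stop_list]

tokens = ["the", "development", "trend", "of", "ai", "is", "fast"]
filtered = remove_stop_words(tokens)
# "the" and "is" are dropped; the content words survive
```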

Each line of the corpus is one article: a title followed by a summary. It contains more than 20,000 articles and is intended for testing purposes only.

The stop-word list and corpus are available via Baidu Cloud. Link: https://pan.baidu.com/s/1wNNUd0Pe20HFLAyuNcwDrg Password: 367d

    • 4. Python code implementation
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 10:57:03 2018

@author: Lenovo
"""
import jieba
import math
import numpy as np

filename = 'sentence similarity degree/title.txt'       # corpus
filename2 = 'sentence similarity degree/stopwords.txt'  # stop-word list (the HIT stop-word list)


def stopwordslist():
    """Load the stop-word list."""
    with open(filename2, encoding='utf-8') as f:
        return [line.strip() for line in f.readlines()]


stop_list = stopwordslist()


def get_dic_input(text):
    """Segment the input, drop stop words, and return a zeroed count dict."""
    cut = jieba.cut(text)
    list_word = (','.join(cut)).split(',')

    # Filter stop words with a comprehension rather than calling
    # list.remove() while iterating, which would skip elements.
    list_word = [key for key in list_word if key not in stop_list]

    length_input = len(list_word)
    dic = {key: 0 for key in list_word}
    return dic, length_input


def get_tf_idf(filename):
    s = input("Please enter the keyword sentence to retrieve: ")

    dic_input_idf, length_input = get_dic_input(s)
    list_tf = []
    list_idf = []
    word_vector1 = np.zeros(length_input)
    word_vector2 = np.zeros(length_input)

    with open(filename, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    length_essay = len(lines)

    # Store the IDF value of each input word, in order, in list_idf.
    for key in dic_input_idf:
        for line in lines:
            if key in line.split():
                dic_input_idf[key] += 1
        list_idf.append(math.log(length_essay / (dic_input_idf[key] + 1)))

    # Copy the IDF values into a vector.
    for i in range(length_input):
        word_vector1[i] = list_idf.pop()

    # For each article (one per line), store the TF of each input word in list_tf.
    for line in lines:
        length = len(line.split())
        dic_input_tf, length_input = get_dic_input(s)

        for key in line.split():
            if key in stop_list:   # ignore stop words in the article
                length -= 1
            if key in dic_input_tf:
                dic_input_tf[key] += 1

        for key in dic_input_tf:
            tf = dic_input_tf[key] / length
            list_tf.append(tf)

        # Copy this article's TF values into a vector.
        for i in range(length_input):
            word_vector2[i] = list_tf.pop()

        tf_idf = float(np.sum(word_vector2 * word_vector1))
        if tf_idf > 0.3:           # keep only articles with high similarity
            print("TF-IDF value:", tf_idf)
            print("Article:", line)


get_tf_idf(filename)
    • 5. Output results

Input: AI Development trend

The output is as follows: 12 articles with a TF-IDF value greater than 0.3 were retrieved from the corpus of more than 20,000 articles.
