[Python] Calculating the TF-IDF values of text with Scikit-learn


TF-IDF values often need to be calculated when clustering or classifying text, or when comparing the similarity of two documents. This article mainly covers Scikit-learn, an open-source, Python-based machine learning module.
I hope the article is helpful to you. Related articles:
        [Python crawler] Using Selenium to get infobox data for tourist attractions from Baidu Encyclopedia
        A simple Python implementation of VSM-based cosine similarity calculation
        VSM-based named entity recognition, disambiguation and coreference resolution
        [Python] Chinese word segmentation with the Jieba tool and text clustering concepts
    • I. Scikit-learn concept
      • 1. Conceptual knowledge
      • 2. Installing the software
    • II. TF-IDF basic knowledge
      • 1. TF-IDF
      • 2. Example
    • III. TF-IDF calculation
      • 1. CountVectorizer
      • 2. TfidfTransformer
      • 3. Other examples

I. Scikit-learn concept

1. Conceptual knowledge

Official website: http://scikit-learn.org/stable/
Scikit-learn is a simple and efficient tool for data mining and data analysis. It is a Python-based machine learning module released under the open-source BSD license.

The basic functions of Scikit-learn fall into six parts: classification, regression, clustering, dimensionality reduction, model selection and data preprocessing.
Scikit-learn offers a rich set of machine learning models, including SVM, decision trees, GBDT and KNN, so you can choose a model that suits the type of problem. For details, refer to the documentation on the official website; I recommend downloading resources, modules and documents from there.

Installing Scikit-learn requires numpy, scipy, matplotlib and other modules. Windows users can download precompiled installation packages and dependencies directly from http://www.lfd.uci.edu/~gohlke/pythonlibs/, or from this site: http://sourceforge.jp/projects/sfnet_scikit-learn/.
Reference article: Getting started with the open-source machine learning tool Scikit-learn - Xuan Sen

2. Installing the software

For Python 2.0 I recommend a fully automatic installation with "pip install scikit-learn" or "easy_install scikit-learn", followed by "from sklearn import feature_extraction" to import the module.
If the error "unknown encoding: cp65001" appears during installation, enter "chcp 936" to change the console encoding from UTF-8 to Simplified Chinese GBK.

II. TF-IDF basic knowledge

1. TF-IDF

Refer to the official documentation:
TF-IDF in Gensim: http://radimrehurek.com/gensim/models/tfidfmodel.html


TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information processing and data mining. It uses statistics to measure how important a term is, based on the number of times the term occurs in a text and the number of documents in the whole corpus that contain it. Its advantage is that it filters out words that are common but carry little meaning, while keeping the important words that characterize a text. It is calculated as follows:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

Here tfidf_{i,j} is the product of the term frequency tf_{i,j} and the inverse document frequency idf_i. The larger the TF-IDF value, the more important the feature word is to the text.

TF (Term Frequency) indicates how often a keyword appears in a given article.
IDF (Inverse Document Frequency) is calculated from the document frequency, which is the number of articles in the corpus that contain the keyword. The inverse document frequency is built from the reciprocal of the document frequency and is mainly used to reduce the weight of common words that appear in most documents and say little about any one of them.
The term frequency TF is calculated with the following formula.

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

Here n_{i,j} is the number of times the feature word t_i occurs in the text d_j, and \sum_k n_{k,j} is the total number of occurrences of all feature words in d_j. The result is the term frequency of the feature word.
The IDF is calculated with the following formula.

$$idf_i = \log \frac{|D|}{1 + |D_{t_i}|}$$

Here |D| is the total number of texts in the corpus and |D_{t_i}| is the number of texts that contain the feature word t_i. To guard against a term that does not occur in the corpus at all, which would make the denominator 0, 1 + |D_{t_i}| is used as the denominator.
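
The TF and IDF definitions can be tried out with a few lines of Python 3. This is a minimal sketch rather than any library's implementation: the function names, the toy corpus and the choice of a base-10 logarithm are assumptions made for illustration (the text does not fix the base of the logarithm).

```python
import math


def term_frequency(term, doc_tokens):
    """TF: occurrences of the term divided by the number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF = log(|D| / (1 + number of documents containing the term)).

    The base-10 logarithm is an assumption made for this sketch.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / (1 + containing))


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF is the product of TF and IDF."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)


# Hypothetical corpus that has already been split into words
corpus_tokens = [
    ["big", "data", "analysis", "of", "guizhou"],
    ["tourism", "in", "guizhou"],
    ["the", "history", "of", "beijing"],
    ["a", "guide", "to", "beijing"],
]

for term in ("data", "guizhou", "of"):
    print(term, tf_idf(term, corpus_tokens[0], corpus_tokens))
```

Because of the 1 + df denominator, a word contained in every document of such a small corpus gets a slightly negative IDF; on a realistically sized corpus the value is simply very close to 0.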

2. Example

The example below is modeled on one by Ruan Yifeng, whose article I recommend to everyone:
Application of TF-IDF and cosine similarity (I): automatic keyword extraction
Here is an example that explains how TF-IDF weights are calculated.
Suppose there is an article titled "Big Data Analysis of Guizhou" that contains 10000 words, in which "Guizhou", "Big Data" and "Analysis" each appear […] times and "the" appears […] times (assuming stop words are not removed). The TF formula above then gives the term frequency of each of these words.

Suppose the corpus contains […] articles in total, of which […] contain "Guizhou", […] contain "Big Data", […] contain "Analysis" and 899 contain "the". Their IDF values follow from the IDF formula above.

The IDF shows that the more documents of the corpus a word appears in, the lower its IDF value; when a word appears in almost every document, its IDF approaches 0. Such words are usually very frequent ones like "of" and "I", and they contribute almost nothing to the weight calculation of an article.
The TF-IDF value of each word is then its TF multiplied by its IDF.

In the TF-IDF calculation, "Big Data" receives the highest value of these words, which reflects that the topic of the article is "big data". If only one word is to be selected, "Big Data" is the keyword of this article. The TF-IDF method can therefore be used to extract the keywords of an article. In addition, when the TF-IDF values of "Guizhou", "Big Data" and "Analysis" are all calculated, they can be summed to give a value for the whole document, which can be used for information retrieval.
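
The arithmetic of this example can be traced in a few lines of Python 3. Only the article length (10000 words) and the 899 articles containing "the" come from the text; every other count below, the corpus size and the base-10 logarithm are assumed values chosen purely for illustration.

```python
import math

ARTICLE_LENGTH = 10000   # from the text
CORPUS_SIZE = 1000       # assumed number of articles in the corpus

# word: (occurrences in the article, number of articles containing the word)
# All counts are assumed for illustration, except the 899 for "the".
counts = {
    "Guizhou":  (100, 99),
    "Big Data": (100, 19),
    "Analysis": (100, 59),
    "the":      (500, 899),
}

tfidf = {}
for word, (occurrences, containing) in counts.items():
    tf = occurrences / ARTICLE_LENGTH
    idf = math.log10(CORPUS_SIZE / (1 + containing))
    tfidf[word] = tf * idf
    print("%-8s TF=%.4f IDF=%.4f TF-IDF=%.6f" % (word, tf, idf, tf * idf))

# The word with the largest TF-IDF value is the best single keyword
keyword = max(tfidf, key=tfidf.get)
print("keyword:", keyword)
```

Under these assumed counts the three topic words have the same TF, so the IDF decides the ranking: "Big Data" is contained in the fewest articles and gets the highest TF-IDF, while "the" ends up with the lowest value despite being the most frequent word.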
The advantages of the TF-IDF algorithm are that it is simple and fast and that its results match reality fairly well. The disadvantage is that measuring the importance of a word by frequency alone is not comprehensive enough: an important word may appear only a few times. Moreover, the algorithm cannot reflect the position of a word in the text.

III. TF-IDF calculation

Scikit-learn calculates TF-IDF weights mainly with two classes: CountVectorizer and TfidfTransformer.


1. CountVectorizer

The CountVectorizer class converts the words in the texts into a word frequency matrix, in which the element a[i][j] is the frequency of word j in text i. The fit_transform function counts the occurrences of each word, get_feature_names() returns all the keywords in the bag of words, and toarray() displays the word frequency matrix.
The code is as follows:

    # coding: utf-8
    from sklearn.feature_extraction.text import CountVectorizer

    # Corpus
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    # Convert the words in the texts into a word frequency matrix
    vectorizer = CountVectorizer()
    # Count the occurrences of each word
    X = vectorizer.fit_transform(corpus)
    # Get all the keywords in the bag of words
    word = vectorizer.get_feature_names()
    print word
    # View the word frequency result
    print X.toarray()
The output is as follows:

    >>>
    [u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
    [[0 1 1 1 0 0 1 0 1]
     [0 1 0 1 0 2 1 0 1]
     [1 0 0 0 1 0 1 1 0]
     [0 1 1 1 0 0 1 0 1]]
    >>>

As the results show, there are 9 feature words in total:
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
The output also gives the number of times each feature word occurs in each sentence. For example, the first sentence, "This is the first document.", corresponds to the word frequency vector [0, 1, 1, 1, 0, 0, 1, 0, 1]. Counting positions from 1, this means that the word at position 2, "document", occurs once, the word at position 3, "first", occurs once, the word at position 4, "is", occurs once, the word at position 7, "the", occurs once, and the word at position 9, "this", occurs once. Every sentence thus gets its own word frequency vector.
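
To make the structure of this matrix concrete, the counting can be re-created in plain Python 3. This is only a sketch of the idea, not scikit-learn's code: it assumes the documented defaults of CountVectorizer (lowercasing, tokens of at least two word characters, vocabulary sorted alphabetically), and the function name is made up for illustration.

```python
import re
from collections import Counter


def count_matrix(corpus):
    """Build a sorted vocabulary and one word-count row per document."""
    # Lowercase, then keep tokens made of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vocabulary, rows = count_matrix(corpus)
print(vocabulary)
for row in rows:
    print(row)
```

Column j of every row refers to vocabulary[j], which is why the first and fourth sentences, which contain the same words, get identical rows.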


2. TfidfTransformer

TfidfTransformer calculates the TF-IDF value of each word from the word frequency matrix produced by the vectorizer. It is used as follows:

    # coding: utf-8
    from sklearn.feature_extraction.text import CountVectorizer

    # Corpus
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    # Convert the words in the texts into a word frequency matrix
    vectorizer = CountVectorizer()
    # Count the occurrences of each word
    X = vectorizer.fit_transform(corpus)
    # Get all the keywords in the bag of words
    word = vectorizer.get_feature_names()
    print word
    # View the word frequency result
    print X.toarray()

    from sklearn.feature_extraction.text import TfidfTransformer

    # Instantiate the class
    transformer = TfidfTransformer()
    print transformer
    # Convert the word frequency matrix X into TF-IDF values
    tfidf = transformer.fit_transform(X)
    # View the data structure: tfidf[i][j] is the TF-IDF weight of word j in text i
    print tfidf.toarray()
The program prints the transformer object and then the matrix of TF-IDF weights, with one row per text.
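
The weights that TfidfTransformer returns do not follow the textbook formula given earlier exactly. As far as the scikit-learn documentation describes the defaults (smooth_idf=True, norm='l2'), the IDF is computed as ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit Euclidean length. The sketch below re-creates that calculation in plain Python 3 for the word frequency matrix of the English corpus; the function name is hypothetical and the code is an illustration, not the library's implementation.

```python
import math


def tfidf_default(counts):
    """Smoothed IDF plus L2 row normalization, applied to a count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(value * value for value in row))
        result.append([value / norm if norm else 0.0 for value in row])
    return result


# Word frequency matrix printed by the CountVectorizer example
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

for row in tfidf_default(counts):
    print(["%.4f" % value for value in row])
```

Because of the row normalization, the weights of one text cannot be compared directly with hand-calculated TF × IDF values; only their relative sizes within a row carry over.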

3. Other examples

If you need to count word frequencies and calculate TF-IDF values in one go, the core code is:
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
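
Scikit-learn also ships a TfidfVectorizer class that, according to its documentation, is equivalent to CountVectorizer followed by TfidfTransformer. The sketch below, written for Python 3 with scikit-learn and numpy installed, checks that both routes give the same matrix for the small English corpus used earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Two-step route: word counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(corpus)
)

# One-step route with TfidfVectorizer
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(corpus)

# Words ordered by their column index in the matrix
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(words)
print(np.allclose(one_step.toarray(), two_step.toarray()))
```

The vocabulary is read from the vocabulary_ attribute here because the name of the feature-name accessor differs between scikit-learn versions.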
Below is an example by liuxuejiang158 for everyone to learn from; I recommend reading the original article:
Python Scikit-learn calculates TF-IDF word weights - Liuxuejiang

    # coding: utf-8
    __author__ = "liuxuejiang"
    import jieba
    import jieba.posseg as pseg
    import os
    import sys
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    if __name__ == "__main__":
        # Each string is the result of word segmentation, with the words separated by spaces
        corpus = ["我 来到 北京 清华大学",          # text class 0: "I came to Tsinghua University in Beijing"
                  "他 来到 了 网易 杭研 大厦",      # text class 1: "He came to the NetEase Hangzhou Research building"
                  "小明 硕士 毕业 与 中国 科学院",  # text class 2: "Xiao Ming graduated with a master's degree from the Chinese Academy of Sciences"
                  "我 爱 北京 天安门"]              # text class 3: "I love Beijing Tian'anmen"
        # Converts the words in the texts into a word frequency matrix;
        # element a[i][j] is the frequency of word j in text i
        vectorizer = CountVectorizer()
        # Computes the tf-idf weight of each word
        transformer = TfidfTransformer()
        # The outer fit_transform computes tf-idf, the inner one converts the texts into a word frequency matrix
        tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
        # All the words in the bag-of-words model
        word = vectorizer.get_feature_names()
        # Extract the tf-idf matrix; element a[i][j] is the tf-idf weight of word j in text i
        weight = tfidf.toarray()
        # Print the tf-idf word weights of each text: the outer loop runs over the texts,
        # the inner loop over the word weights of one text
        for i in range(len(weight)):
            print u"-------Here are the tf-idf word weights of text class", i, u"------"
            for j in range(len(word)):
                print word[j], weight[i][j]
The output is as follows. The 12 words of the vocabulary are 中国 (China), 北京 (Beijing), 大厦 (building), 天安门 (Tian'anmen), 小明 (Xiao Ming), 来到 (came to), 杭研 (Hangzhou Research), 毕业 (graduated), 清华大学 (Tsinghua University), 硕士 (master's degree), 科学院 (Academy of Sciences) and 网易 (NetEase).

    -------Here are the tf-idf word weights of text class 0 ------    # original text: "I came to Tsinghua University in Beijing"
    中国 0.0
    北京 0.52640543361
    大厦 0.0
    天安门 0.0
    小明 0.0
    来到 0.52640543361
    杭研 0.0
    毕业 0.0
    清华大学 0.66767854461
    硕士 0.0
    科学院 0.0
    网易 0.0
    -------Here are the tf-idf word weights of text class 1 ------    # original text: "He came to the NetEase Hangzhou Research building"
    中国 0.0
    北京 0.0
    大厦 0.525472749264
    天安门 0.0
    小明 0.0
    来到 0.414288751166
    杭研 0.525472749264
    毕业 0.0
    清华大学 0.0
    硕士 0.0
    科学院 0.0
    网易 0.525472749264
    -------Here are the tf-idf word weights of text class 2 ------    # original text: "Xiao Ming graduated with a master's degree from the Chinese Academy of Sciences"
    中国 0.4472135955
    北京 0.0
    大厦 0.0
    天安门 0.0
    小明 0.4472135955
    来到 0.0
    杭研 0.0
    毕业 0.4472135955
    清华大学 0.0
    硕士 0.4472135955
    科学院 0.4472135955
    网易 0.0
    -------Here are the tf-idf word weights of text class 3 ------    # original text: "I love Beijing Tian'anmen"
    中国 0.0
    北京 0.61913029649
    大厦 0.0
    天安门 0.78528827571
    小明 0.0
    来到 0.0
    杭研 0.0
    毕业 0.0
    清华大学 0.0
    硕士 0.0
    科学院 0.0
    网易 0.0
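
Words such as "I" (我) and "love" (爱) are missing from the weights above. A likely explanation, based on the documented default token_pattern of CountVectorizer (r"(?u)\b\w\w+\b"), is that tokens shorter than two characters are dropped. The sketch below illustrates the effect with Python 3's re module alone; the sentences are assumed to be the pre-segmented Chinese originals of the corpus, and the second pattern is just one possible alternative that keeps single-character words.

```python
import re

DEFAULT_PATTERN = r"\b\w\w+\b"   # two or more word characters
KEEP_ALL_PATTERN = r"\b\w+\b"    # one or more word characters

# Assumed pre-segmented sentences, words separated by spaces
corpus = [
    "我 来到 北京 清华大学",    # I / came to / Beijing / Tsinghua University
    "我 爱 北京 天安门",        # I / love / Beijing / Tian'anmen
]

for sentence in corpus:
    # Tokens kept by the default-style pattern, then by the permissive one
    print(re.findall(DEFAULT_PATTERN, sentence))
    print(re.findall(KEEP_ALL_PATTERN, sentence))
```

Passing a pattern like the second one through the token_pattern parameter of CountVectorizer should keep one-character words in the vocabulary, at the cost of also admitting many function words.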

Recommended blog posts by experts in machine learning and NLP:
Text classification with Scikit-learn - Rachel-zhang
Python Scikit-learn calculates TF-IDF word weights - Liuxuejiang
Machine learning with Python (5: text feature extraction and vectorization) (highly recommended) - LSLDD
More on Word2vec - felven (highly recommended)
Clustering keywords with Word2vec - felven (highly recommended)
Processing document content with TF-IDF in Python
Python TF-IDF: calculating keyword weights for 100 documents - Chenbjin

Finally, I hope this article is helpful to you. If anything in it is lacking or wrong, please bear with me. As I have said before, I very much enjoy my current life as a teacher; research, projects and teaching all keep it full. Keep going!
Just do good deeds, and do not ask about the future.
When my students have spread far and wide, I will catch up on old classmates' stories.
(By: Eastmount, 2016-08-08, 5 p.m., http://blog.csdn.net/eastmount/)
