Computing keyword weights for 100 documents with TF-IDF in Python
1. Introduction to TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for assessing how important a word is to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in proportion to how often it appears across the corpus. Search engines often use some variant of TF-IDF weighting to score the relevance between a document and a user query.

The main idea of TF-IDF is: if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good discriminating power and to be suitable for classification. TF-IDF is simply TF * IDF.

(1) Term Frequency (TF) is the frequency with which a given word appears in a document, i.e. the ratio of the number of times word w appears in document d to the total number of words in document d:

    tf(w, d) = count(w, d) / size(d)

This normalizes the raw term count to prevent a bias towards long documents (a word may occur more often in a long document than in a short one, regardless of whether the word is important or not).

(2) Inverse Document Frequency (IDF) measures the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents n by the number of documents containing that word, and taking the logarithm of the quotient:

    idf(w) = log(n / docs(w, D))

Based on tf and idf, for each document d and a query string q consisting of the keywords w[1] ... w[k], a weight can be computed that indicates how well query q matches document d:

    tf-idf(q, d) = sum{ i = 1..k | tf-idf(w[i], d) }
                 = sum{ i = 1..k | tf(w[i], d) * idf(w[i]) }

A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones. For a more detailed introduction and examples of TF-IDF, interested readers can read this blog. The following describes how to compute TF-IDF in Python; a minimal plain-Python sketch of the formulas above is also given right after the installation notes below.

2. Computing TF-IDF in Python

In Python, the scikit-learn package provides an API for computing TF-IDF, and it works very well. First install scikit-learn; for installation on different systems, see http://scikit-learn.org/stable/install.html. Local environment: Linux (Ubuntu) 64-bit, Python 2.7.6.

(1) Install the scikit-learn package (install the dependency packages first, then sklearn):

    sudo apt-get install build-essential python-dev python-setuptools \
                         python-numpy python-scipy \
                         libatlas-dev libatlas3gf-base
    sudo apt-get install python-sklearn

Or install it through pip, which is a good installation tool for Python:

    sudo apt-get install python-pip
    sudo pip install -U scikit-learn

To check whether the installation succeeded, enter pip list in the terminal to list everything installed by pip; if sklearn appears in the list, the installation was successful.
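Before moving on to jieba and the scikit-learn vectorizers, here is a minimal plain-Python sketch of the tf, idf and tf-idf formulas from the introduction. The toy word lists and helper function names below are made up purely for illustration and are not part of scikit-learn:

import math

def tf(w, doc):
    # tf(w, d) = count(w, d) / size(d); doc is a list of words
    return doc.count(w) / float(len(doc))

def idf(w, docs):
    # idf(w) = log(n / docs(w, D)); assumes w occurs in at least one document
    n = len(docs)
    containing = sum(1 for d in docs if w in d)
    return math.log(float(n) / containing)

def tf_idf(query, doc, docs):
    # tf-idf(q, d) = sum over the query words of tf(w, d) * idf(w)
    return sum(tf(w, doc) * idf(w, docs) for w in query)

if __name__ == '__main__':
    # each "document" is already a list of words, as if segmented by jieba
    docs = [['apple', 'banana', 'apple'],
            ['banana', 'cherry'],
            ['banana', 'apple', 'cherry']]
    print tf_idf(['apple'], docs[0], docs)   # 'apple' is frequent in doc 0 and not in every doc
    print tf_idf(['banana'], docs[0], docs)  # 'banana' occurs in every doc, so idf = 0

Note how 'banana', which occurs in every document, ends up with a weight of 0: this is exactly the "filter out common words, retain important words" behaviour described above.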
(2) Install the jieba word segmentation package. Because TF-IDF is computed on word segmentation results, the jieba Chinese word segmenter is needed here. For the usage of jieba word segmentation, see the previous blog post: Python jieba word segmentation.

    sudo pip install jieba

(3) Calculate TF-IDF. For the TF-IDF weight calculation, the scikit-learn package mainly uses two classes: CountVectorizer and TfidfTransformer.

CountVectorizer converts the words in the text into a word frequency matrix through its fit_transform function. The matrix element a[i][j] is the frequency of word j in the i-th text, i.e. the number of times each word appears. get_feature_names() returns the keywords of all texts, and toarray() returns the word frequency matrix. For example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

TfidfTransformer is used to compute the tf-idf weight of each word in the matrix produced by CountVectorizer. It is used as follows:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf.toarray()
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])

For detailed descriptions of these functions, please refer to the official documentation: scikit-learn common-vectorizer-usage.

Here I segment 100 documents and then run the TF-IDF calculation on them; the results are quite good.

import os
import jieba
import jieba.posseg as pseg  # imported in the original post; not used below
import sys
import string
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

reload(sys)
sys.setdefaultencoding('utf8')

# get the file list (the directory contains the 100 documents)
def getFilelist(argv):
    path = argv[1]
    filelist = []
    files = os.listdir(path)
    for f in files:
        if f[0] == '.':
            pass
        else:
            filelist.append(f)
    return filelist, path

# segment one document with jieba
def fenci(argv, path):
    # directory for saving the word segmentation results
    sFilePath = './segfile'
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)
    # read the file
    filename = argv
    f = open(path + filename, 'r+')
    file_list = f.read()
    f.close()
    # segment the document (full mode)
    seg_list = jieba.cut(file_list, cut_all=True)
    # drop spaces and line breaks from the tokens
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        if seg != '' and seg != '\n' and seg != '\r\n':
            result.append(seg)
    # join the segmented words with single spaces and save them locally;
    # e.g. "I came to Beijing Tsinghua University" is written out as its
    # segmented words separated by spaces
    f = open(sFilePath + '/' + filename + '-seg.txt', 'w+')
    f.write(' '.join(result))
    f.close()
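# Note (illustrative addition, not in the original script): after fenci() has run,
# ./segfile contains one "<original filename>-seg.txt" file per input document,
# holding that document's jieba tokens separated by single spaces. The segmentation
# step can be sanity-checked from an interactive session with something like:
#
#     >>> import jieba
#     >>> print ' '.join(jieba.cut(u'我来到北京清华大学', cut_all=True))
#
# which prints the full-mode tokens of the "I came to Beijing Tsinghua University"
# example sentence (the exact token list depends on the jieba version and its
# dictionary). The Tfidf() function below reads these segmented files back and
# feeds them to scikit-learn.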
# read the 100 segmented documents and compute TF-IDF
def Tfidf(filelist):
    path = './segfile/'
    corpus = []  # word segmentation results of the 100 documents
    for ff in filelist:
        fname = path + ff + '-seg.txt'  # fenci() saved each document as "<name>-seg.txt"
        f = open(fname, 'r+')
        content = f.read()
        f.close()
        corpus.append(content)

    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    word = vectorizer.get_feature_names()  # keywords of all texts
    weight = tfidf.toarray()               # the corresponding tf-idf matrix

    sFilePath = './tfidffile'
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)

    # write the tf-idf of every document's words into the tfidffile folder
    for i in range(len(weight)):
        print u"-------- Writing all the tf-idf in the", i, u"file into", sFilePath + '/' + string.zfill(i, 5) + '.txt', "--------"
        f = open(sFilePath + '/' + string.zfill(i, 5) + '.txt', 'w+')
        for j in range(len(word)):
            f.write(word[j] + "    " + str(weight[i][j]) + "\n")
        f.close()

if __name__ == "__main__":
    (allfile, path) = getFilelist(sys.argv)
    for ff in allfile:
        print "Using jieba on " + ff
        fenci(ff, path)
    Tfidf(allfile)
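The script takes the document directory from sys.argv[1], so the whole pipeline is run as something like python tfidf.py ./docs/ (both the script name and the directory here are placeholders). As a side note, the CountVectorizer + TfidfTransformer pair used above can also be collapsed into scikit-learn's single TfidfVectorizer class. The sketch below only illustrates that alternative under the same assumptions as the script (segmented files already written to ./segfile); it is not part of the original code, and the top-5 printout at the end is made up for the example:

import os
from sklearn.feature_extraction.text import TfidfVectorizer

# collect the already-segmented documents written by fenci()
corpus = []
names = []
for fname in sorted(os.listdir('./segfile')):
    if fname.startswith('.'):
        continue
    with open('./segfile/' + fname) as f:
        corpus.append(f.read())
    names.append(fname)

# TfidfVectorizer = CountVectorizer followed by TfidfTransformer in one step
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names()   # keywords of all texts
weight = tfidf.toarray()                # tf-idf matrix, one row per document

# print the five highest-weighted keywords of each document
for i in range(len(weight)):
    top = sorted(zip(word, weight[i]), key=lambda x: x[1], reverse=True)[:5]
    print names[i], [w for w, s in top]

Whether to keep the two-class version or switch to TfidfVectorizer is mostly a matter of taste; the two-class version keeps the raw word frequency matrix available separately, which can be handy for inspection and debugging.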