Introduction to NLP (2) Exploring the principles of TF-IDF

About TF-IDF

TF-IDF is a common statistical method in NLP. It is used to evaluate how important a word is to a document in a collection or corpus, and it is often used to extract the features of a text, that is, its keywords. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in proportion to how often it appears across the corpus.
In NLP, the formula for computing TF-IDF is as follows:

\[ TFIDF = TF \times IDF. \]

Here, TF is the term frequency, and IDF is the inverse document frequency (Inverse Document Frequency).
TF is the term frequency, that is, how frequently a word appears in a document. Assume that a word appears i times in a document containing n words in total; then its TF value is i/n.
IDF is the inverse document frequency. Assume that the corpus contains N documents and that a given word appears in K of them; then its IDF value is

\[ IDF = \log_{2}\left(\frac{N}{K}\right). \]

Of course, the formula for the IDF value varies slightly from place to place. For example, in some places 1 is added to the K in the denominator to prevent division by zero, and in other places 1 is added to both the numerator and the denominator; these are smoothing techniques. In this article we stick with the original IDF formula, because it is consistent with the formula used in gensim.
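Written out, the two smoothed variants mentioned above look like this (shown only for illustration; they are not used in the rest of this article):

\[ IDF = \log_{2}\left(\frac{N}{K+1}\right) \qquad \text{or} \qquad IDF = \log_{2}\left(\frac{N+1}{K+1}\right). \]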
Assume that the corpus contains d documents in total; then the TF-IDF value of word i in document j is

\[ TFIDF_{i,j} = TF_{i,j} \times IDF_{i} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log_{2}\left(\frac{d}{d_{i}}\right), \]

where \(n_{i,j}\) is the number of times word i appears in document j and \(d_{i}\) is the number of documents that contain word i.
The above is how TF-IDF is calculated.
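As a quick numeric sanity check (with made-up numbers): suppose a word appears 3 times in a 100-word document and occurs in 2 of the 10 documents in the corpus. Then

\[ TF = \frac{3}{100} = 0.03, \qquad IDF = \log_{2}\left(\frac{10}{2}\right) \approx 2.322, \qquad TFIDF \approx 0.03 \times 2.322 \approx 0.070. \]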

Text introduction and preprocessing

We will use the following three sample texts:

text1 ="""Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes."""text2 = """Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender‘s hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated."""text3 = """Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net."""

These three articles are about football, basketball, and volleyball respectively; together they form our document collection.
Next comes the text preprocessing step.
First, remove line breaks from the text, then split it into sentences, tokenize it, and remove the punctuation. The complete Python code is as follows; the input parameter is text:

import nltk
import string

# Text preprocessing
# Function: split the text into sentences, tokenize, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word segmentation
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens
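As a quick check (illustrative only; the exact tokens can vary slightly with the NLTK version and tokenizer data), calling this function on the first sample text should begin roughly like this:

print(get_tokens(text1)[:8])
# roughly: ['Football', 'is', 'a', 'family', 'of', 'team', 'sports', 'that']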

Next, remove the stop words from the article and count the number of occurrences of each word. The complete Python code is as follows; the input parameter is text:

from nltk.corpus import stopwords   # stop words
from collections import Counter     # for counting word occurrences

# Remove stop words from the original text and generate a count dictionary,
# i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]  # remove stop words
    count = Counter(filtered)
    return count

Take text3 as an example. The generated count dictionary is as follows:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'touches': 2, 'back': 2, 'players': 2, 'across': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'teams': 1, 'opposing': 1, 'within': 1, 'To': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'bats': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})

TF-IDF in gensim

After preprocessing, we have a count dictionary for each of the three example texts above, recording how many times each word appears in that text. Below, we use the TF-IDF model implemented in gensim to output the top three TF-IDF words of each article together with their TF-IDF values. The complete code is as follows:

from nltk.corpus import stopwords   # stop words
from gensim import corpora, models, matutils

# Training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# Get the word lists of the three texts
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# Training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# Output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # a list of (word id, score) pairs
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

The output result is as follows:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

The output matches our expectations: the keywords football and rugby are extracted from the football article, play and cm from the basketball article, and net and teammate from the volleyball article.
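For readers new to gensim, here is a tiny, self-contained illustration (a toy example, not the corpus used in this article) of what Dictionary and doc2bow produce; the exact word ids may differ from run to run:

from gensim import corpora

toy_dictionary = corpora.Dictionary([['net', 'ball', 'net'], ['goal', 'ball']])
print(toy_dictionary.token2id)                         # a mapping such as {'ball': 0, 'net': 1, 'goal': 2}
print(toy_dictionary.doc2bow(['net', 'net', 'ball']))  # a list of (word id, count) pairs, e.g. [(0, 1), (1, 2)]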

Implementing the TF-IDF model yourself

With the understanding of the TF-IDF model gained above, we can also implement it ourselves; practicing by hand is the best way to learn an algorithm!
The following is the author's own TF-IDF implementation (it relies on the text preprocessing code above):

import math

# Calculate TF
def tf(word, count):
    return count[word] / sum(count.values())

# Calculate how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Calculate IDF (logarithm with base 2)
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))

# Calculate TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# Test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # a list of (word, score) pairs
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output result is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113
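As a quick aside, the idf function above behaves as expected on made-up counts: a word that appears in every document gets an IDF of 0 and therefore can never rank highly, while a word confined to a single document gets the largest possible IDF. The Counter values below are purely illustrative:

docs = [Counter({'net': 2, 'ball': 1}), Counter({'ball': 3}), Counter({'goal': 1, 'ball': 2})]
print(idf('net', docs))    # log2(3/1) ≈ 1.585 -- 'net' appears in only one of the three documents
print(idf('ball', docs))   # log2(3/3) = 0.0   -- 'ball' appears in all three documents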

We can see that the keywords extracted by our hand-written TF-IDF model are consistent with gensim's. The last two words for the basketball article differ only because several words there share the same TF-IDF value, so which of them are picked is effectively arbitrary. But there is a problem: the computed TF-IDF values themselves are different. Why?
Check the source code for computing TF-IDF values in gensim (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py).

It turns out that gensim normalizes the TF-IDF vector, converting it into a unit vector. Therefore, we need to add this normalization step to the code above. The code is as follows:

import numpy as np

# Normalize the vector (convert it into a unit vector)
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    l2norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / l2norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # a list of (word, score) pairs
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output result is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

The output result is consistent with that obtained by gensim!
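As an optional cross-check (assuming your gensim version exposes matutils.unitvec, which is what TfidfModel uses for normalization), gensim's own normalization of a toy vector matches the hand-written unitvec function above:

import numpy as np
from gensim import matutils

vec = np.array([0.3, 0.4])
print(matutils.unitvec(vec))               # ≈ [0.6 0.8], the L2-normalized vector
print(unitvec([('a', 0.3), ('b', 0.4)]))   # ≈ [('a', 0.6), ('b', 0.8)], the same ratios from our own function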

Summary

Gensim is one of the best-known NLP modules in Python; if you have time, it is well worth reading its source code! Later we will continue to introduce other applications of TF-IDF. Comments and discussion are welcome~

Note: I have now opened my public account: Python crawler and algorithm (ID: easy_web_scrape). Thank you for your attention~~

The complete code in this article is as follows:

import nltk
import math
import string
import numpy as np
from nltk.corpus import stopwords   # stop words
from collections import Counter     # for counting word occurrences
from gensim import corpora, models, matutils

text1 = """Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes."""

text2 = """Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated."""

text3 = """Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net."""

# Text preprocessing
# Function: split the text into sentences, tokenize, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word segmentation
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

# Remove stop words from the original text and generate a count dictionary,
# i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]  # remove stop words
    count = Counter(filtered)
    return count

# Calculate TF
def tf(word, count):
    return count[word] / sum(count.values())

# Calculate how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Calculate IDF (logarithm with base 2)
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))

# Calculate TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# Normalize the vector (convert it into a unit vector)
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    l2norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / l2norm) for item in sorted_words]
    return unit_vector

# TF-IDF test with the hand-written algorithm
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # a list of (word, score) pairs
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

# Training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# Get the word lists of the three texts
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# Training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# Output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # a list of (word id, score) pairs
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

Output result:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

