Keyword extraction and analysis with Python jieba word segmentation
1 Overview
Keyword extraction means pulling out of a text the words most relevant to the document's meaning. The practice dates back to the early days of literature retrieval: before full-text search was available, keywords served as the terms by which a paper could be found, which is why papers still list keywords today.
Keywords also play an important role in text clustering, classification, automatic summarization, and other fields. For example, treating documents with similar keywords as one cluster can greatly speed up the convergence of a clustering algorithm; extracting keywords from all of a day's news gives a rough picture of what happened that day; and combining several people's microblog posts over some period into one long text and extracting its keywords reveals the topics they mainly discuss.
In short, keywords are the words that best reflect the topic or meaning of a text. However, people writing articles on the Internet do not state their keywords the way paper authors do, so keywords must be extracted automatically by computer, and the quality of the extraction algorithm directly determines the effectiveness of every subsequent step.
There are two methods for keyword extraction:
- The first is keyword assignment: given a fixed keyword vocabulary, several words are chosen from that vocabulary to serve as the keywords of a new document;
- The second is keyword extraction: for a new document, words are extracted from the document itself to serve as its keywords;
Currently, most domain-independent keyword extraction algorithms (that is, algorithms applicable to text from any subject or field) and their libraries are based on the latter approach. Logically, the latter is also more meaningful in actual use.
From the algorithm perspective, there are two types of keyword extraction algorithms:
- Supervised learning algorithms treat keyword extraction as a binary classification problem: candidate words are extracted first, each candidate is labeled as either keyword or non-keyword, and a keyword-extraction classifier is trained on these labels. For a new document, all candidate words are extracted and classified with the trained classifier, and the candidates labeled as keywords are kept;
- Unsupervised learning algorithms extract candidate words, score each candidate, and output the top K highest-scoring candidates as keywords. Different scoring strategies yield different algorithms, such as TF-IDF and TextRank;
The jieba word segmentation system implements two keyword extraction algorithms, one based on TF-IDF and one based on TextRank; both are unsupervised. The following first shows how to use jieba's keyword extraction interfaces and then explains how they work through the source code.
2 Examples
This section demonstrates keyword extraction with the TF-IDF and TextRank interfaces of the jieba word segmentation system.
2.1 Keyword extraction based on the TF-IDF algorithm
The sample code for keyword extraction based on the TF-IDF algorithm is as follows:
from jieba import analyse
# import the TF-IDF keyword extraction interface
tfidf = analyse.extract_tags
# original text
text = "Thread is the smallest unit of program execution. It is an execution flow of a process. " \
       "It is the basic unit of CPU scheduling and dispatching. A process can be composed of many threads. " \
       "All resources of a process are shared between threads, and each thread has its own stack and local variables. " \
       "Threads are independently scheduled for execution by the CPU, allowing multiple threads to run simultaneously in a multi-CPU environment. " \
       "Multithreading can likewise achieve concurrent operations, with each request assigned to a thread for processing."
# keyword extraction based on the TF-IDF algorithm
keywords = tfidf(text)
print "keywords by tfidf:"
# output the extracted keywords
for keyword in keywords:
    print keyword + "/",
Console output:
keywords by tfidf:
thread/CPU/process/scheduling/multithreading/program execution/each/execution/stack/local variable/unit/concurrency/dispatch/one/share/request/minimum/allowed/allocate/
2.2 Keyword extraction based on the TextRank algorithm
The sample code for keyword extraction based on the TextRank algorithm is as follows:
from jieba import analyse
# import the TextRank keyword extraction interface
textrank = analyse.textrank
# original text
text = "Thread is the smallest unit of program execution. It is an execution flow of a process. " \
       "It is the basic unit of CPU scheduling and dispatching. A process can be composed of many threads. " \
       "All resources of a process are shared between threads, and each thread has its own stack and local variables. " \
       "Threads are independently scheduled for execution by the CPU, allowing multiple threads to run simultaneously in a multi-CPU environment. " \
       "Multithreading can likewise achieve concurrent operations, with each request assigned to a thread for processing."
print "\nkeywords by textrank:"
# keyword extraction based on the TextRank algorithm
keywords = textrank(text)
# output the extracted keywords
for keyword in keywords:
    print keyword + "/",
Console output:
keywords by textrank:
thread/process/scheduling/unit/operation/request/allocation/allowed/basic/shared/concurrent/stack/independent/execution/dispatch/composition/resource/implementation/operation/processing/
3 Theoretical Analysis
The principles of the TF-IDF algorithm and the TextRank algorithm are analyzed in turn below.
3.1 TF-IDF algorithm analysis
In information retrieval theory, TF-IDF is short for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects how important a word is to a document in a corpus, and it is often used as a weighting factor in information retrieval and text mining.
The main idea of TF-IDF is: if a word appears frequently in a document (high TF) but rarely appears in the other documents of the corpus (low DF and therefore high IDF), the word is considered to have good discriminating power for classification.
In practice, TF-IDF is simply the product of the two, TF * IDF. TF (Term Frequency) is the frequency with which word t appears in document d; IDF (Inverse Document Frequency) is based on the inverse of the number of documents in the corpus that contain word t.
TF formula:

$$tf(t, d_i) = \frac{count(t)}{count(d_i)}$$

where count(t) is the number of occurrences of word t in document $d_i$, and count($d_i$) is the total number of words in document $d_i$.

IDF formula:

$$idf(t) = \log\frac{num(corpus)}{num(t)}$$

where num(corpus) is the total number of documents in the corpus, and num(t) is the number of documents in the corpus that contain word t.
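To make the formulas concrete, here is a worked example with made-up numbers: if word t appears 5 times in a 100-word document, then $tf = 5/100 = 0.05$; if the corpus holds 1000 documents and 10 of them contain t, then $idf = \log(1000/10) = \log 100 \approx 4.61$ (natural logarithm), so the word's score is $tf \times idf \approx 0.05 \times 4.61 \approx 0.23$.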
Applying this to keyword extraction:
1. Preprocessing: perform word segmentation and part-of-speech tagging, and take the words with the specified parts of speech as candidate words;
2. Compute the TF-IDF value of each candidate word;
3. Sort the candidate words by TF-IDF value in descending order, and output the specified number of top words as possible keywords (a minimal end-to-end sketch follows below);
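To tie these steps together, here is a minimal, self-contained sketch of the whole pipeline. It is illustrative only: whitespace tokenization stands in for real word segmentation and POS-based candidate filtering, and the corpus and the function name tfidf_keywords are invented for this example.

import math
from collections import Counter

def tfidf_keywords(doc, corpus, topK=5):
    # step 1: toy "preprocessing" -- whitespace tokenization stands in for
    # real word segmentation, POS filtering, and stop word removal
    words = doc.lower().split()
    tf = Counter(words)
    total = float(len(words))
    scores = {}
    for t in tf:
        # step 2: tf * idf, with idf = log(N / number of documents containing t)
        df = sum(1 for d in corpus if t in d.lower().split())
        idf = math.log(len(corpus) / float(df or 1))
        scores[t] = (tf[t] / total) * idf
    # step 3: sort by tf-idf in descending order and output the top words
    return sorted(scores, key=scores.get, reverse=True)[:topK]

corpus = ["threads share process resources",
          "a process schedules many threads",
          "the cpu dispatches threads"]
print(tfidf_keywords("the cpu schedules threads and the cpu dispatches threads", corpus))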
3.2 TextRank algorithm analysis
Similar to PageRank, TextRank treats each syntactic unit in the text as a node in a graph. If two syntactic units have a certain relationship (such as co-occurrence), an edge connects the two units in the graph. After a certain number of iterations, different nodes converge to different weights, and the units with the highest weights can be used as keywords.
The weight of a node depends not only on how many inbound nodes it has, but also on the weights of those inbound nodes: the more inbound nodes, and the greater their weights, the higher the node's own weight;
The TextRank iteration formula is:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \times WS(V_j)$$
In words: the weight of node i collects a contribution from every inbound node j, equal to the weight of edge j-i divided by the total weight of all outbound edges of node j, multiplied by the weight of node j; these contributions are summed, multiplied by the damping factor d, and offset by (1 - d) to give the weight of node i;
The damping factor d is generally set to 0.85;
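To see the formula in action before looking at jieba's implementation, here is a minimal sketch that runs the iteration on a hand-built undirected weighted graph; the node names, edge weights, and adjacency layout are invented for illustration.

# adjacency: node -> {neighbor: edge weight}; undirected, so each edge is listed both ways
graph = {
    "thread":  {"process": 3, "cpu": 2},
    "process": {"thread": 3, "cpu": 1},
    "cpu":     {"thread": 2, "process": 1},
}
d = 0.85
ws = {n: 1.0 / len(graph) for n in graph}  # equal initial weights
for _ in range(10):
    for i in graph:
        # sum over inbound nodes j: w_ji / (total outbound weight of j) * WS(j)
        s = sum(graph[j][i] / float(sum(graph[j].values())) * ws[j]
                for j in graph[i])
        ws[i] = (1 - d) + d * s
print(ws)  # "thread" carries the most edge weight, so it ends up ranked highest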
General algorithm process:
1. Identify the text units and add them as vertices to the graph;
2. Identify the relations between text units and use them as edges between the vertices; the edges can be directed or undirected, weighted or unweighted;
3. Iterate the formula above until convergence;
4. Sort the vertex scores in descending order;
- This model uses the co-occurrence relation: if the units of two vertices appear together within a window (window sizes range from 2 to 10 words), the two vertices are connected;
- When adding vertices to the graph, syntactic filtering is applied, for example retaining only words of specific parts of speech (such as adjectives and nouns);
Applying this to key phrase extraction:
1. Preprocessing: perform word segmentation and part-of-speech tagging; each individual word is a candidate node for the graph;
2. Apply the syntactic filter, add the words that pass it to the graph as vertices, and connect words that appear together within a window with edges;
3. Iterate the formula above until convergence; typically the iteration threshold is set to 0.0001 and is reached within 20-30 iterations;
4. Sort the vertex scores in descending order and output the specified number of top words as possible keywords;
5. Post-processing: if two output words are adjacent in the original text, join them together as a key phrase (a small sketch of this step follows below);
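Step 5 can be sketched with a small helper (hypothetical, not part of jieba): it scans the segmented text in order and joins runs of adjacent keywords into key phrases.

def merge_key_phrases(segmented_words, keywords):
    # segmented_words: the document's words in their original order
    # keywords: the set of words output by TextRank
    phrases, run = set(), []
    for w in segmented_words:
        if w in keywords:
            run.append(w)
        else:
            if len(run) > 1:
                phrases.add(" ".join(run))  # adjacent keywords form a key phrase
            run = []
    if len(run) > 1:
        phrases.add(" ".join(run))
    return phrases

print(merge_key_phrases(["the", "cpu", "scheduling", "unit", "is", "small"],
                        {"cpu", "scheduling", "unit"}))
# -> a set containing "cpu scheduling unit"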
4 Source Code Analysis
The keyword extraction functionality of jieba is implemented in the jieba/analyse directory.
Specifically, __init__.py mainly encapsulates the keyword extraction interfaces for jieba;
tfidf.py implements keyword extraction based on the TF-IDF algorithm;
textrank.py implements keyword extraction based on the TextRank algorithm;
4.1 Source code analysis of TF-IDF keyword extraction
The main function for TF-IDF-based keyword extraction is TFIDF.extract_tags, implemented in jieba/analyse/tfidf.py.
TFIDF is the class defined for TF-IDF keyword extraction. During initialization it loads, by default, the tokenizer (tokenizer = jieba.dt), the part-of-speech tagger (postokenizer = jieba.posseg.dt), the stop word set (stop_words = self.STOP_WORDS.copy()), and the IDF loader (idf_loader = IDFLoader(idf_path or DEFAULT_IDF)), from which it obtains the IDF dictionary and the median IDF value (if a word does not appear in the IDF dictionary, the median IDF is used as that word's IDF value).
def __init__(self, idf_path=None):
    # load the default tokenizer, POS tagger, stop word set, and IDF dictionary
    self.tokenizer = jieba.dt
    self.postokenizer = jieba.posseg.dt
    self.stop_words = self.STOP_WORDS.copy()
    self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
    self.idf_freq, self.median_idf = self.idf_loader.get_idf()
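The idf_path parameter is also exposed at the package level through set_idf_path (see the wiring code in section 4.3), so IDF values computed from your own corpus can be swapped in. A brief sketch, assuming a file my_idf.txt (placeholder name) in the same format as jieba's bundled idf.txt, one "word idf_value" pair per line:

from jieba import analyse
# point the default TF-IDF extractor at a custom IDF file
analyse.set_idf_path("my_idf.txt")  # placeholder path
keywords = analyse.extract_tags("Threads are scheduled by the CPU.")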
Keyword extraction with the TF-IDF algorithm then proceeds as follows.
First, depending on whether a part-of-speech restriction set is passed in, either the POS tagging interface or the plain segmentation interface is called. For example, the restriction set ["ns", "n", "vn", "v", "nr"] means keywords may only be extracted from words tagged as place names, nouns, verbal nouns, verbs, or person names.
1) If a POS restriction set is passed in, the POS tagging interface is called to tag the input sentence, yielding the segmented words and their parts of speech. The results are traversed in order: a word is skipped if its part of speech is not in the restriction set, if its length is less than 2, or if it is a stop word; each word that passes is added to the term frequency dictionary with its count incremented by 1. The term frequency dictionary is then traversed: each word's IDF value is looked up in the IDF dictionary and multiplied by the word's count divided by the total count, giving its tf-idf value. Depending on the weight flag, the words are sorted in descending order of tf-idf value, and the topK words are output as keywords;
2) If no POS restriction set is passed in, the plain segmentation interface is called to segment the input sentence, and the rest of the procedure is the same: words shorter than 2 characters or in the stop word set are skipped, counts are accumulated in the term frequency dictionary, each word's tf-idf value is computed as its count times its IDF divided by the total count, the words are sorted in descending order of tf-idf value according to the weight flag, and the topK words are output as keywords;
def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
    # a POS restriction set was passed in
    if allowPOS:
        allowPOS = frozenset(allowPOS)
        # call the POS tagging interface
        words = self.postokenizer.cut(sentence)
    # no POS restriction set was passed in
    else:
        # call the plain word segmentation interface
        words = self.tokenizer.cut(sentence)
    freq = {}
    for w in words:
        if allowPOS:
            if w.flag not in allowPOS:
                continue
            elif not withFlag:
                w = w.word
        wc = w.word if allowPOS and withFlag else w
        # skip the word if it is shorter than 2 characters or is a stop word
        if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
            continue
        # add it to the term frequency dictionary, incrementing its count by 1
        freq[w] = freq.get(w, 0.0) + 1.0
    # total count over the term frequency dictionary
    total = sum(freq.values())
    for k in freq:
        kw = k.word if allowPOS and withFlag else k
        # compute the tf-idf value of each word
        freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
    # sort by tf-idf value
    if withWeight:
        tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(freq, key=freq.__getitem__, reverse=True)
    # output the topK words as keywords
    if topK:
        return tags[:topK]
    else:
        return tags
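Accordingly, the optional parameters discussed above can be passed straight through the package-level interface. A short usage sketch (text and parameter values chosen for illustration):

from jieba import analyse

text = "Thread is the smallest unit of program execution. It is the basic unit of CPU scheduling."
# withWeight=True returns (word, tf-idf weight) pairs for the top 10 words;
# to restrict candidates by part of speech, pass e.g. allowPOS=('ns', 'n', 'vn', 'v'),
# noting that those tags apply to Chinese input
for word, weight in analyse.extract_tags(text, topK=10, withWeight=True):
    print(word + " " + str(round(weight, 4)))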
4.2 Source code analysis of TextRank keyword extraction
The main function for TextRank-based keyword extraction is TextRank.textrank, implemented in jieba/analyse/textrank.py.
Specifically, TextRank is the class defined for TextRank keyword extraction. During initialization it loads, by default, the POS tagger (tokenizer = postokenizer = jieba.posseg.dt), the stop word set (stop_words = self.STOP_WORDS.copy()), the POS filter set (pos_filt = frozenset(('ns', 'n', 'vn', 'v')), i.e. place names, nouns, verbal nouns, and verbs), and the window size (span = 5).
First, an undirected weighted graph is defined and the sentence is segmented. The segmentation result is traversed in order: if word i satisfies the filter conditions (its part of speech is in the POS filter set, its length is at least 2, and it is not a stop word), then for each word j within the window after it (word j must also satisfy the filter conditions), the pair (word i, word j) is used as a key into the co-occurrence dictionary and its count is incremented;
Then the co-occurrence dictionary is traversed in order. Each element has key = (word i, word j) and value = the number of times word i and word j co-occur. Word i and word j are used as the two endpoints of an edge with the co-occurrence count as the edge weight, and the edge is added to the undirected weighted graph defined earlier;
Next, the TextRank algorithm iterates on this undirected weighted graph; after several iterations the algorithm converges and each word obtains a score;
Finally, depending on the weight flag, the words in the graph are sorted in descending order of score, and the topK words are output as keywords;
def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
    self.pos_filt = frozenset(allowPOS)
    # define the undirected weighted graph
    g = UndirectWeightedGraph()
    # define the co-occurrence dictionary
    cm = defaultdict(int)
    # segment the sentence
    words = tuple(self.tokenizer.cut(sentence))
    # traverse each word in turn
    for i, wp in enumerate(words):
        # word i must satisfy the filter conditions
        if self.pairfilter(wp):
            # traverse the words in the window after word i
            for j in xrange(i + 1, i + self.span):
                # word j must not run past the end of the sentence
                if j >= len(words):
                    break
                # skip word j if it does not satisfy the filter conditions
                if not self.pairfilter(words[j]):
                    continue
                # use (word i, word j) as the key and increment the co-occurrence count
                if allowPOS and withFlag:
                    cm[(wp, words[j])] += 1
                else:
                    cm[(wp.word, words[j].word)] += 1
    # traverse the co-occurrence dictionary: words i and j are the endpoints of an edge,
    # and the co-occurrence count is the edge weight
    for terms, w in cm.items():
        g.addEdge(terms[0], terms[1], w)
    # run the TextRank algorithm
    nodes_rank = g.rank()
    # sort by score
    if withWeight:
        tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
    # output the topK words as keywords
    if topK:
        return tags[:topK]
    else:
        return tags
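Note that, unlike extract_tags, textrank applies the POS filter ('ns', 'n', 'vn', 'v') by default. A short usage sketch of the weighted form; since those tags target Chinese input, this English stand-in adds 'eng' (jieba's tag for alphabetic tokens, an assumption worth checking against your jieba version):

from jieba import analyse

text = "Threads are independently scheduled for execution by the CPU."
# return (word, TextRank score) pairs instead of bare words
for word, score in analyse.textrank(text, topK=5, withWeight=True,
                                    allowPOS=('ns', 'n', 'vn', 'v', 'eng')):
    print(word + " " + str(round(score, 4)))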
The undirected weighted graph is defined and implemented in the UndirectWeightedGraph class. From the class's initialization function __init__ we can see that the so-called undirected weighted graph is just a dictionary: each key is a word that will be added later, and its value is a list of triples (start, end, weight) representing all edges that have this word as their starting point.
Adding an edge to the graph is done in the addEdge function. Because the graph is undirected, each edge must be recorded twice: once with start as the starting point and end as the end point, and once the other way around, with the same weight in both directions.
def addEdge(self, start, end, weight):
    # use a tuple (start, end, weight) instead of an Edge object
    self.graph[start].append((start, end, weight))
    self.graph[end].append((end, start, weight))
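For context, the rest of the class is tiny. The following skeleton is abridged from jieba's textrank.py (exact details may vary across versions): the graph is a defaultdict of edge lists, and d is the damping factor used by rank().

class UndirectWeightedGraph:
    d = 0.85  # damping factor used in rank()

    def __init__(self):
        # adjacency table: word -> list of (start, end, weight) triples
        self.graph = defaultdict(list)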
The iteration of the TextRank algorithm is carried out in the rank function.
First, every node is assigned the same initial weight, and the total weight of each node's outbound edges is computed;
The iteration is then run a fixed number of times to ensure stable results;
In each iteration, the nodes are traversed in order. For a node n, its inbound nodes are obtained from the undirected weighted graph (in an undirected graph, the inbound and outbound nodes are the same: all nodes connected to n). For each inbound node we have already computed the total weight of its outbound edges; its contribution to the weight of node n equals its own weight multiplied by the weight of its edge to n divided by the total weight of its outbound edges. The contributions from all inbound nodes are summed, multiplied by the damping factor, and offset by (1 - d) to obtain the weight of node n;
After the iterations, the weights are normalized, and each node's weight is returned.
def rank(self):
    ws = defaultdict(float)
    outSum = defaultdict(float)

    wsdef = 1.0 / (len(self.graph) or 1.0)
    # initialize the weight of each node and
    # sum the weights of each node's outbound edges
    for n, out in self.graph.items():
        ws[n] = wsdef
        outSum[n] = sum((e[2] for e in out), 0.0)

    # this line is for building a stable iteration
    sorted_keys = sorted(self.graph.keys())
    # iterate several times
    for x in xrange(10):  # 10 iters
        # traverse the nodes
        for n in sorted_keys:
            s = 0
            # traverse the inbound nodes of node n
            for e in self.graph[n]:
                # accumulate the weight contributed by each inbound node:
                # edge weight / total outbound weight of the inbound node * its weight
                s += e[2] / outSum[e[1]] * ws[e[1]]
            # update the weight of node n
            ws[n] = (1 - self.d) + self.d * s

    # sys.float_info[0] is the float maximum, sys.float_info[3] the float minimum
    (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
    # find the maximum and minimum weights
    for w in itervalues(ws):
        if w < min_rank:
            min_rank = w
        if w > max_rank:
            max_rank = w

    # normalize the weights
    for n, w in ws.items():
        # to unify the weights, don't *100.
        ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

    return ws
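As a quick sanity check, the class can be exercised on a toy graph. This is a sketch: UndirectWeightedGraph is an internal class, so the import path below is an assumption that may vary by jieba version.

from jieba.analyse.textrank import UndirectWeightedGraph  # internal class, path may vary

g = UndirectWeightedGraph()
g.addEdge("thread", "process", 3)
g.addEdge("thread", "cpu", 2)
g.addEdge("process", "cpu", 1)
ranks = g.rank()
# "thread" has the largest total edge weight, so it should rank highest
for word in sorted(ranks, key=ranks.get, reverse=True):
    print(word + " " + str(round(ranks[word], 3)))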
4.3 Using a custom stop word set
In jieba, keywords extracted by both the TF-IDF algorithm and the TextRank algorithm are filtered against a stop word set. The class TFIDF, which implements TF-IDF keyword extraction, and the class TextRank, which implements TextRank keyword extraction, are both subclasses of the class KeywordExtractor, and KeywordExtractor implements a method that loads a user-provided stop word set from a user-specified path.
The KeywordExtractor class is implemented in jieba/analyse/tfidf.py.
The KeywordExtractor class first provides a default stop word set STOP_WORDS.
It then implements the set_stop_words method, which loads a user-provided stop word set from the specified path (a sketch of the class follows below).
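The following is abridged from jieba's tfidf.py (the STOP_WORDS contents are truncated here, and the helper _get_abs_path lives in the same module); exact contents may vary by version.

class KeywordExtractor(object):

    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we",
        "for", "an", "are", "by", "be", "as", "on", "with",
        # ... the full default set lives in jieba/analyse/tfidf.py
    ))

    def set_stop_words(self, stop_words_path):
        abs_path = _get_abs_path(stop_words_path)  # helper defined in the same module
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        content = open(abs_path, 'rb').read().decode('utf-8')
        # add each line of the file to the stop word set
        for line in content.splitlines():
            self.stop_words.add(line)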
As an example, copy extra_dict/stop_words.txt and append the words "one" and "each" to the end of the file to serve as the user-provided stop word file. The sample code for keyword extraction with a user-provided stop word set is as follows:
from jieba import analyse
# import the TF-IDF keyword extraction interface
tfidf = analyse.extract_tags
# use the custom stop word set
analyse.set_stop_words("stop_words.txt")
# original text
text = "Thread is the smallest unit of program execution. It is an execution flow of a process. " \
       "It is the basic unit of CPU scheduling and dispatching. A process can be composed of many threads. " \
       "All resources of a process are shared between threads, and each thread has its own stack and local variables. " \
       "Threads are independently scheduled for execution by the CPU, allowing multiple threads to run simultaneously in a multi-CPU environment. " \
       "Multithreading can likewise achieve concurrent operations, with each request assigned to a thread for processing."
# keyword extraction based on the TF-IDF algorithm
keywords = tfidf(text)
print "keywords by tfidf:"
# output the extracted keywords
for keyword in keywords:
    print keyword + "/",
Keyword results:
keywords by tfidf:
thread/CPU/process/scheduling/multithreading/program execution/execution/stack/local variable/unit/concurrency/distribution/sharing/request/minimum/allowed/allocated/multiple/run/
Comparing with the keyword extraction result in section 2.1, we can see that the words "one" and "each" are no longer extracted:
keywords by tfidf:
thread/CPU/process/scheduling/multithreading/program execution/each/execution/stack/local variable/unit/concurrency/dispatch/one/share/request/minimum/allowed/allocate/
Implementation principle (taking TF-IDF-based keyword extraction as the example):
As described earlier, jieba/analyse/__init__.py mainly encapsulates the keyword extraction interfaces for jieba. In __init__.py, the class TFIDF is first instantiated as the object default_tfidf, and the stop word set is established during initialization. Since the class TFIDF is a subclass of the class KeywordExtractor, which provides the set STOP_WORDS, TFIDF first copies STOP_WORDS from KeywordExtractor during initialization as its own stop_words set.
# instantiate the TFIDF class
default_tfidf = TFIDF()
# instantiate the TextRank class
default_textrank = TextRank()

extract_tags = tfidf = default_tfidf.extract_tags
set_idf_path = default_tfidf.set_idf_path
textrank = default_textrank.extract_tags

# interface for setting a user-provided stop word set
def set_stop_words(stop_words_path):
    # update the stop word set in the default_tfidf object
    default_tfidf.set_stop_words(stop_words_path)
    # update the stop word set in the default_textrank object
    default_textrank.set_stop_words(stop_words_path)
To use your own stop word set, call the analyse.set_stop_words(stop_words_path) function; set_stop_words itself is implemented in the KeywordExtractor class. When it executes, the stop_words sets held by the default_tfidf and default_textrank objects are updated, and subsequent extractions use the updated sets. We can run an experiment to verify that the stop word set changes before and after calling analyse.set_stop_words(stop_words_path):
from jieba import analyse
import copy

# deep-copy the default STOP_WORDS set
stopwords0 = copy.deepcopy(analyse.default_tfidf.STOP_WORDS)
# deep-copy the stop word set before setting the user-provided one
stopwords1 = copy.deepcopy(analyse.default_tfidf.stop_words)
print stopwords0 == stopwords1
print stopwords1 - stopwords0

# set the user-provided stop word set
analyse.set_stop_words("stop_words.txt")
# deep-copy the stop word set after setting the user-provided one
stopwords2 = copy.deepcopy(analyse.default_tfidf.stop_words)
print stopwords1 == stopwords2
print stopwords2 - stopwords1
The result is as follows:
True
set([])
False
set([u'\u6bcf\u4e2a', u'\u8207', u'\u4e86', u'\u4e00\u500b', u'\u800c', u'\u4ed6\u5011', u'\u6216', u'\u7684', u'\u4e00\u4e2a', u'\u662f', u'\u5c31', u'\u4f60\u5011', u'\u5979\u5011', u'\u6c92\u6709', u'\u57fa\u672c', u'\u59b3\u5011', u'\u53ca', u'\u548c', u'\u8457', u'\u6211\u5011', u'\u662f\u5426', u'\u90fd'])
Note:
- Before the user-provided stop word set is loaded, the stop word set is a copy of STOP_WORDS from the KeywordExtractor class;
- After the user-provided stop word set is loaded, the stop word set has been extended on that basis;
This confirms our reasoning.
That is all for this article; I hope it is helpful for your study of jieba's keyword extraction.