1. How can we identify the features of language data that are most salient for classifying it?
2. How can we build language models that perform language processing tasks automatically?
3. What can we learn about language from these models?
6.1 Supervised Classification
Gender Identification
# The first step in creating a classifier is deciding which features of the input are relevant,
# and how to encode those features.
# The following feature extractor builds a dictionary containing information about a given name:
def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features('Shrek'))

# The dictionary returned by this function is called a feature set; it maps feature names to their values.
# Now that we have defined a feature extractor, we need to prepare a list of examples and their class labels:
import random

import nltk
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
print(labeled_names[:5])

# Next, we use the feature extractor to process the names data and divide the resulting
# list of feature sets into a training set and a test set.
# The training set is used to train a new naive Bayes classifier.
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
print(featuresets[:5])  # see what featuresets looks like
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Test some names that did not appear in the training data.
print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))

# Evaluate the classifier on the test set.
print(nltk.classify.accuracy(classifier, test_set))

# Examine the classifier to see which features it found most effective for distinguishing the gender of a name.
# (show_most_informative_features() prints its own output, so it is not wrapped in print().)
classifier.show_most_informative_features(5)
# The ratios in this listing are called likelihood ratios; they are useful for comparing different feature-outcome relationships.
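A likelihood ratio of this kind can be worked out by hand. The sketch below uses plain Python and invented counts; the helper name likelihood_ratio and all of the numbers are illustrative assumptions, not NLTK output, and NLTK's own estimates are smoothed, so its figures will differ slightly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Return P(feature | label A) / P(feature | label B) from raw counts."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: names ending in 'a' among 1000 female and 1000 male names.
female_ending_a, female_total = 360, 1000
male_ending_a, male_total = 10, 1000

ratio = likelihood_ratio(female_ending_a, female_total, male_ending_a, male_total)
print("last_letter='a'   female : male = %.1f : 1.0" % ratio)
```

With these made-up counts, a name ending in "a" is far more likely to be female than male, which is the kind of relationship the listing of informative features reports.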
Choosing the Right Features
# Much of the interesting work in building a classifier is deciding which features might be relevant, and how to represent them.
# Although a fairly simple and obvious set of features often gives decent performance,
# features built carefully on a thorough understanding of the task usually improve results significantly.
# A feature extractor that overfits gender features.
# The feature sets it returns contain a large number of specific features, which leads to overfitting on the relatively small names corpus.
def gender_features2(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    return features

print(gender_features2('John'))

# If you provide too many features, the algorithm is more likely to rely on idiosyncrasies of the training data that do not generalize well to new examples.
# This problem is known as overfitting, and it is especially troublesome with small training sets.
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# Once an initial feature set has been chosen, a very productive way to refine it is error analysis.
# First, we select a development set containing the corpus data used to create the model.
# This development set is then subdivided into a training set and a dev-test set.
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

# The training set is used to train the model, the dev-test set is used for error analysis,
# and the test set is used for the final evaluation of the system.
train_set = [(gender_features2(n), g) for (n, g) in train_names]
devtest_set = [(gender_features2(n), g) for (n, g) in devtest_names]
test_set = [(gender_features2(n), g) for (n, g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

# Organization of corpus data for training supervised classifiers: the corpus data is divided into two sets, the development set and the test set.
# The development set is often further subdivided into a training set and a dev-test set.
# Using the dev-test set, we collect the errors the classifier makes.
# The classifier was trained with gender_features2, so the same extractor is used here.
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

# We can then examine the individual error cases where the model predicted the wrong label,
# and try to work out what additional information would let it make the right decision.
# The feature set can then be adjusted accordingly. The name classifier we have built makes about 100 errors on the dev-test corpus:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
    break  # show only the first error; remove this line to list them all

# Adjust the feature extractor to include features for one- and two-letter suffixes:
def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

# Rebuilding the classifier with the new feature extractor, we see performance on the dev-test set improve by nearly 3%.
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
Document Classification
# To build a classifier that automatically assigns category labels to new documents, we first construct a list of documents labeled with the appropriate categories.
# For this example, we use the Movie Reviews Corpus, which categorizes each review as positive or negative.
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# For document topic identification, we can define a feature for each word, indicating whether the document contains that word.
# To limit the number of features the classifier has to process, we begin by building a list of the 2000 most frequent words in the corpus.
# We then define a feature extractor that simply checks whether each of these words is present in a given document.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common() guarantees frequency order; plain iteration over a FreqDist does not.
word_features = [w for (w, count) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

# Now that we have defined our feature extractor, we can use it to train a classifier to label new movie reviews.
# To check how reliable the resulting classifier is, we compute its accuracy on the test set,
# and use show_most_informative_features() to find out which features the classifier found most informative.
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

# Next, we train a classifier to work out which suffixes are most informative for part-of-speech tagging.
# First, let's find the most common suffixes:
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

# Keep the 100 most common suffixes (an assumed cutoff; with no limit, every suffix
# becomes a feature and decision tree training becomes very slow).
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes[:5])

# Next, we define a feature extractor that checks a given word for these suffixes;
# it can be used to train a new decision tree classifier.
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(pos_features('cats')))
Exploiting Context
# To take advantage of features that depend on a word's context, we must revise the pattern we used to define our feature extractor.
# Instead of passing in just the word to be tagged, we pass in a complete (untagged) sentence, along with the index of the target word.
# A part-of-speech classifier whose feature detector examines the context in which a word appears
# in order to decide which part-of-speech tag to assign.
# In particular, the identity of the previous word is included as a feature.
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

print(pos_features(brown.sents()[0], 8))

tagged_sents = brown.tagged_sents(categories='news')
# print(tagged_sents)
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
Sequence Classification
# To capture the dependencies between related classification tasks, we can use joint classifier models,
# which choose an appropriate labeling for a collection of related inputs.
# In the case of part-of-speech tagging, a variety of sequence classifier models can be used to jointly choose part-of-speech tags for all the words in a given sentence.
# One sequence classification strategy, known as consecutive classification or greedy sequence classification,
# is to find the most likely class label for the first input,
# then use that answer to help find the best label for the next input.
# The process is repeated until all of the inputs have been labeled.
# Part-of-speech tagging with a consecutive classifier:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))  # recent NLTK releases name this method accuracy()
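Other Methods for Sequence Classification
# Hidden Markov models are similar to consecutive classifiers in that they look at both the inputs and the history of predicted tags.
# However, rather than simply finding the single best tag for a given word, they generate a probability distribution over tags.
# These probabilities are then combined to calculate probability scores for tag sequences, and the tag sequence with the highest probability is chosen.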
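One common way to pick the highest-probability tag sequence is the Viterbi dynamic-programming algorithm. The sketch below is a minimal plain-Python version, not NLTK's implementation; the function name viterbi, the three-tag tagset and every probability in the tables are invented for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words, working in log space."""
    # best[i][t] = (log-probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (math.log(start_p[t]) + math.log(emit_p[t][words[0]]), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + math.log(emit_p[t][words[i]]), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the best final tag to recover the whole sequence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical model parameters (all values are assumptions for this example).
tags = ['DET', 'NOUN', 'VERB']
start_p = {'DET': 0.8, 'NOUN': 0.1, 'VERB': 0.1}
trans_p = {
    'DET':  {'DET': 0.05, 'NOUN': 0.9, 'VERB': 0.05},
    'NOUN': {'DET': 0.1,  'NOUN': 0.2, 'VERB': 0.7},
    'VERB': {'DET': 0.5,  'NOUN': 0.3, 'VERB': 0.2},
}
emit_p = {
    'DET':  {'the': 0.9,  'dog': 0.05, 'barks': 0.05},
    'NOUN': {'the': 0.05, 'dog': 0.6,  'barks': 0.35},
    'VERB': {'the': 0.05, 'dog': 0.15, 'barks': 0.8},
}

best_tags = viterbi(['the', 'dog', 'barks'], tags, start_p, trans_p, emit_p)
print(best_tags)
```

Unlike the greedy tagger, this search can revise an early choice: a tag that looks best for one word on its own may lose once the transition to the following tags is taken into account.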
6.2 Further Examples of Supervised Classification
Sentence Segmentation
# Sentence segmentation can be viewed as a classification task for punctuation:
# whenever we encounter a symbol that could end a sentence, such as a period or a question mark,
# we have to decide whether it terminates the preceding sentence.
# The first step is to obtain some data that has already been segmented into sentences
# and convert it into a form suitable for extracting features.
# tokens is a merged list of tokens from the individual sentences;
# boundaries is a set containing the indexes of all sentence-boundary tokens.
sents = nltk.corpus.treebank_raw.sents()
print(sents[:2])
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

# Next, we need to specify the features of the data that will be used to decide whether punctuation indicates a sentence boundary:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# Based on this feature extractor, we can create a list of labeled feature sets by selecting all the punctuation tokens
# and tagging whether or not they are boundary tokens:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens) - 1)
               if tokens[i] in '.?!']

# Using these feature sets, we can train and evaluate a punctuation classifier.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
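Once a punctuation classifier is available, it can drive a simple sentence segmenter. The sketch below shows the idea in plain Python. To stay self-contained it replaces the trained classifier with a hand-written rule (rule_based_classify, a hypothetical stand-in) and uses a bounds-safe variant of the feature extractor; both are assumptions made for this example.

```python
def punct_features(tokens, i):
    """Features for the punctuation token at index i (bounds-safe variant)."""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        'next-word-capitalized': next_word[:1].isupper(),
        'prev-word': tokens[i - 1].lower() if i > 0 else "",
        'punct': tokens[i],
        'prev-word-is-one-char': i > 0 and len(tokens[i - 1]) == 1,
    }

def rule_based_classify(features):
    """Stand-in for classifier.classify(): a fixed rule, not a trained model."""
    return features['next-word-capitalized'] and not features['prev-word-is-one-char']

def segment_sentences(words, classify):
    """Split a flat token list into sentences at the boundaries classify() accepts."""
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in {'.', '?', '!'} and classify(punct_features(words, i)):
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])  # keep any trailing tokens as a final sentence
    return sents

words = "I met J . Smith today . He left early .".split()
for sent in segment_sentences(words, rule_based_classify):
    print(' '.join(sent))
```

With a trained model, the classify argument would be a small wrapper around classifier.classify; the period after the initial "J" is rejected here because the preceding token is a single character.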
Identifying Dialogue Act Types
# Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation.
# The NPS Chat Corpus consists of over 10,000 posts from instant messaging sessions.
# These posts have all been labeled with one of 15 dialogue act types, such as "Statement", "Emotion", "ynQuestion", and "Continuer".
# We can therefore use this data to build a classifier that identifies the dialogue act types of new instant messaging posts.
import nltk

# The first step is to extract the basic messaging data. We call xml_posts() to get a data structure representing the XML annotation for each post:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
# print(posts)

# Next, we define a simple feature extractor that checks what words the post contains:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

# Finally, we construct the training and testing data by applying the feature extractor to each post
# (using post.get('class') to get a post's dialogue act type), and create a new classifier:
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
Recognizing Textual Entailment
# Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T
# entails another text called the "hypothesis".
# A feature extractor for recognizing textual entailment.
# The RTEFeatureExtractor class builds a bag of words for both the text and the hypothesis
# after removing some stopwords, then calculates overlaps and differences.
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]  # pair 33, the example used in the NLTK book
extractor = nltk.RTEFeatureExtractor(rtepair)
print(extractor.text_words)
print(extractor.hyp_words)
print(extractor.overlap('word'))
print(extractor.overlap('ne'))
print(extractor.hyp_extra('word'))
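The overlap and "hypothesis extra" counts are easy to picture with ordinary Python sets. The sketch below is a simplified approximation, not the RTEFeatureExtractor implementation: the stopword list, the whitespace tokenization and the names content_words and overlap_features are all assumptions made for illustration.

```python
# A small, assumed stopword list; NLTK's extractor uses its own.
STOPWORDS = {'a', 'an', 'the', 'of', 'in', 'is', 'was', 'to', 'and'}

def content_words(text):
    """Lowercase, strip trailing punctuation, and drop stopwords."""
    words = {w.strip('.,').lower() for w in text.split()}
    return words - STOPWORDS - {''}

def overlap_features(text, hypothesis):
    """Count hypothesis words found in the text and hypothesis words missing from it."""
    text_words = content_words(text)
    hyp_words = content_words(hypothesis)
    return {
        'word_overlap': len(hyp_words & text_words),
        'word_hyp_extra': len(hyp_words - text_words),
    }

text = "The cat sat on the mat in the kitchen."
supported = overlap_features(text, "A cat was in the kitchen.")
unsupported = overlap_features(text, "A dog sat in the garden.")
print(supported)
print(unsupported)
```

A hypothesis whose content words all appear in the text has no "extra" words, which is weak evidence for entailment; words that appear only in the hypothesis are evidence against it.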
Scaling Up to Large Datasets
# Python provides an excellent environment for basic text processing and feature extraction.
# If you try to use pure-Python machine learning implementations (such as nltk.NaiveBayesClassifier) on large datasets,
# you may find that the learning algorithm takes an unreasonable amount of time and memory.
6.3 Evaluation
The Test Set
# When building the test set, there is often a trade-off between the amount of data available for testing and the amount available for training.
# Another consideration when choosing the test set is the degree of similarity between instances in the test set and those in the development set.
# For example, consider the part-of-speech tagging task.
# At one extreme, we could create the training set and test set by randomly assigning sentences from a data source that reflects a single genre (such as news):
import random
from nltk.corpus import brown

tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)  # hold out 10% of the sentences for testing
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

# In this case, our test set will be very similar to our training set.
# Both sets are taken from the same genre, so we cannot be confident that the evaluation results would generalize to other genres.
# A somewhat better approach is to ensure that the training set and test set are taken from different documents:
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])

# For a more stringent evaluation, we can draw the test set from documents that are less closely related to those in the training set:
train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')
Accuracy
# The simplest metric for evaluating a classifier is accuracy, which measures the percentage of inputs in the test set that the classifier labels correctly.
# The nltk.classify.accuracy() function computes the accuracy of a classifier model on a given test set.
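The calculation itself is short enough to write out. The sketch below is a plain-Python illustration of the metric rather than NLTK's own code; the accuracy helper takes any classification function, and vowel_rule and the four test items are hypothetical.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

def vowel_rule(features):
    """Hypothetical stand-in classifier: names ending in a vowel are labeled female."""
    return 'female' if features['last_letter'] in 'aeiou' else 'male'

toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),   # the rule gets this one wrong
    ({'last_letter': 'n'}, 'male'),
]

score = accuracy(vowel_rule, toy_test_set)
print(score)
```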
Precision and Recall
# True positives are relevant items that we correctly identified as relevant.
# True negatives are irrelevant items that we correctly identified as irrelevant.
# False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
# False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
# Precision indicates how many of the items we identified were relevant: TP/(TP+FP).
# Recall indicates how many of the relevant items we identified: TP/(TP+FN).
# The F-measure (or F-score) combines precision and recall into a single score;
# it is defined as the harmonic mean of precision and recall: (2 × Precision × Recall)/(Precision + Recall).
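These definitions translate directly into code. The following sketch computes all three scores from two parallel label lists; the function name precision_recall_f and the example labels are assumptions chosen for illustration.

```python
def precision_recall_f(gold, predicted, positive):
    """Return (precision, recall, F-measure) for the label treated as 'relevant'."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure

# Hypothetical gold labels and system output for eight items.
gold      = ['rel', 'rel', 'rel', 'irr', 'irr', 'irr', 'irr', 'rel']
predicted = ['rel', 'rel', 'irr', 'rel', 'irr', 'irr', 'irr', 'irr']

precision, recall, f_measure = precision_recall_f(gold, predicted, 'rel')
print("precision=%.3f recall=%.3f F=%.3f" % (precision, recall, f_measure))
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 2/4; the harmonic mean sits closer to the lower of the two values than an arithmetic mean would.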
Confusion Matrix
# A confusion matrix is a table in which each cell [i,j] gives the number of times label j was predicted when the correct label was i.
# Thus the diagonal entries (cells[i,i]) are labels that were correctly predicted, and the off-diagonal entries are errors.
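A confusion matrix can be built with a Counter keyed on (correct label, predicted label) pairs. The sketch below is a minimal plain-Python version; the tag lists are invented, and NLTK's own nltk.ConfusionMatrix class offers a ready-made alternative.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (correct label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))

# Hypothetical gold tags and tagger output.
gold      = ['NN', 'NN', 'VB', 'JJ', 'NN', 'VB']
predicted = ['NN', 'VB', 'VB', 'NN', 'NN', 'VB']

cells = confusion_matrix(gold, predicted)
labels = sorted(set(gold) | set(predicted))

# Rows are correct labels, columns are predicted labels.
print("     " + " ".join("%4s" % j for j in labels))
for i in labels:
    print("%4s " % i + " ".join("%4d" % cells[i, j] for j in labels))

# The diagonal holds the correct predictions.
correct = sum(cells[i, i] for i in labels)
print("correct:", correct, "of", len(gold))
```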
Cross-Validation
# Performing multiple evaluations on different test sets, and then combining the scores from those evaluations, is a technique known as cross-validation.
# In particular, we subdivide the original corpus into N subsets called folds.
# For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold.
# Even though the individual folds may be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data and is therefore quite reliable.
# A second advantage of cross-validation is that it lets us examine how widely performance varies across different training sets.
# If we get very similar scores for all N training sets, then we can be fairly confident that the score is accurate.
# On the other hand, if the scores vary widely across the N training sets, then we should be skeptical about the accuracy of the evaluation score.
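The fold-by-fold procedure can be sketched without any library support. In the example below, cross_validate is a hypothetical helper that accepts any training and evaluation functions; to keep it self-contained it is demonstrated with a trivial majority-label "model" and ten invented data points, so the scores only illustrate the mechanics.

```python
from collections import Counter

def cross_validate(data, n_folds, train, evaluate):
    """Train on all folds but one, test on the held-out fold, and repeat for each fold."""
    scores = []
    for k in range(n_folds):
        test_fold = [item for i, item in enumerate(data) if i % n_folds == k]
        train_folds = [item for i, item in enumerate(data) if i % n_folds != k]
        model = train(train_folds)
        scores.append(evaluate(model, test_fold))
    return scores

def train_majority(labeled):
    """Toy 'model': remember the most frequent label in the training data."""
    return Counter(label for _, label in labeled).most_common(1)[0][0]

def evaluate_majority(model, labeled):
    """Accuracy of always predicting the remembered label."""
    return sum(1 for _, label in labeled if label == model) / len(labeled)

# Ten hypothetical items; items 0 and 5 are 'male', the rest 'female'.
data = [(i, 'male' if i % 5 == 0 else 'female') for i in range(10)]

scores = cross_validate(data, 5, train_majority, evaluate_majority)
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

The first fold happens to contain both 'male' items, so its score is very different from the others. A spread like this across folds is the signal to treat the combined score with caution; shuffling the data before splitting is a common way to avoid such unlucky folds.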
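6.4 Decision Trees
# A decision tree is a simple flowchart that selects labels for input values.
# The flowchart consists of decision nodes, which check feature values, and leaf nodes, which assign labels.
# To choose the label for an input value, we begin at the flowchart's initial decision node, known as its root node.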
Entropy and Information Gain
# Compute the entropy of a list of labels.
import math

import nltk

def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['male', 'male', 'male', 'male']))
print(entropy(['male', 'female', 'male', 'male']))
print(entropy(['female', 'male', 'female', 'male']))
print(entropy(['female', 'female', 'male', 'female']))
print(entropy(['female', 'female', 'female', 'female']))

# Inspect the relative frequency of each label in a FreqDist.
freqdist = nltk.FreqDist(['male', 'male', 'male', 'male'])
for l in freqdist:
    print(l)
    print(freqdist.freq(l))
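Information gain measures how much a feature reduces entropy: it is the entropy of the full label list minus the weighted average entropy of the groups produced by splitting on that feature. The sketch below is self-contained plain Python, so it redefines entropy with collections.Counter instead of nltk.FreqDist; the information_gain helper and the four labeled names are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labeled_featuresets, feature):
    """Entropy of all labels minus the weighted entropy after splitting on feature."""
    labels = [label for _, label in labeled_featuresets]
    groups = {}
    for features, label in labeled_featuresets:
        groups.setdefault(features[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: the last letter predicts the label perfectly, the first letter not at all.
toy = [
    ({'last': 'a', 'first': 'j'}, 'female'),
    ({'last': 'a', 'first': 'm'}, 'female'),
    ({'last': 'k', 'first': 'j'}, 'male'),
    ({'last': 'k', 'first': 'm'}, 'male'),
]

gain_last = information_gain(toy, 'last')
gain_first = information_gain(toy, 'first')
print(gain_last, gain_first)
```

A decision tree learner would split on the feature with the larger gain first, which here is the last letter.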
6.8 Summary
Modeling the linguistic data found in corpora can help us understand linguistic patterns, and can be used to make predictions about new language data.
Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input. Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and recognizing textual entailment.
When training a supervised classifier, you should divide the corpus into three datasets: a training set for building the classifier model, a dev-test set for helping select and tune the model's features, and a test set for evaluating the final model's performance.
When evaluating a supervised classifier, it is important that you use fresh data that was not included in the training set or the dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.
Decision trees are automatically constructed tree-structured flowcharts that assign labels to input values based on their features. Although they are easy to interpret, they are not well suited to cases where feature values interact in determining the proper label.
In naive Bayes classifiers, each feature contributes independently to the decision about which label should be used. This allows feature values to interact, but can be problematic when two or more features are highly correlated with one another.
Maximum entropy classifiers use a basic model similar to that of naive Bayes; however, they use iterative optimization to find the set of feature weights that maximizes the probability of the training set.
Most of the models that are automatically constructed from corpora are descriptive: they tell us which features are relevant to a given pattern or construction, but they give no information about the causal relationships between those features and patterns.