In the previous "Clustering System of Collective Smart Programming learning", I clustered some favorite e-books. At that time, without any prior knowledge, the K-means clustering algorithm was used to classify books into economics, psychology, photography, and so on. Today, I plan to classify books manually. First, I will sort out my favorite books and classify them. I am exhausted, but I find that there are a lot of things that I especially like, especially like, like, good, and dislike... I just divided a small part manually, and I don't want to do it again. I have already divided a small part. Can I distinguish other books based on this part? That is, if I have some preliminary knowledge, can I use it?
To classify an item, you need features whose presence or absence in its content can be checked. For e-book classification the content is a book's text, and the features are the words or phrases in it. As mentioned in the clustering post, books in different categories tend to contain different characteristic words.
First, let's extract these features:
import re

def getwords(doc):
    # Split the document on runs of non-alphanumeric characters
    # (the original '\W*' pattern no longer works for splitting in Python 3.7+)
    splitter = re.compile(r'\W+')
    print(doc)  # debug output: show the document being processed
    # Keep words between 3 and 19 characters, lowercased
    words = [s.lower() for s in splitter.split(doc) if len(s) > 2 and len(s) < 20]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])
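For example (a made-up sentence), getwords drops the short word "I", lowercases the rest, and returns each remaining word exactly once:

    print(getwords('I really like this photography book'))
    # returns {'really': 1, 'like': 1, 'this': 1, 'photography': 1, 'book': 1}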
Next, define a class to represent the classifier. It encapsulates everything the classifier has learned so far:
class classifier:
    def __init__(self, getfeatures, filename=None):
        # Counts of feature/category combinations
        self.fc = {}
        # Counts of documents in each category
        self.cc = {}
        self.getfeatures = getfeatures
        # Per-category thresholds/minimums used by the classify() methods below
        self.thresholds = {}
        self.minimums = {}

    # Increase the count of a feature/category pair
    def incf(self, f, cat):
        self.fc.setdefault(f, {})
        self.fc[f].setdefault(cat, 0)
        self.fc[f][cat] += 1

    # Increase the count of a category
    def incc(self, cat):
        self.cc.setdefault(cat, 0)
        self.cc[cat] += 1

    # The number of times a feature has appeared in a category
    def fcount(self, f, cat):
        if f in self.fc and cat in self.fc[f]:
            return self.fc[f][cat]
        return 0.0

    # The number of items in a category
    def catcount(self, cat):
        if cat in self.cc:
            return float(self.cc[cat])
        return 0.0

    # The list of all categories
    def categories(self):
        return self.cc.keys()

    # The total number of items
    def totalcount(self):
        return sum(self.cc.values())
Now we can train the model with a very simple function:
def train(self, item, cat):
    features = self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
        self.incf(f, cat)
    # Increment the count for this category
    self.incc(cat)
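For instance, training on a few hand-labeled snippets might look like this (the snippets and categories are invented purely for illustration):

    cl = classifier(getwords)
    cl.train('the lens aperture and shutter speed of a camera', 'photography')
    cl.train('supply and demand curves in the market', 'economics')
    cl.train('cognitive biases and human memory', 'psychology')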
After training, we can compute the probability of feature f given category cat: the number of items in cat that contain feature f, divided by the total number of items in cat:
def fprob(self, f, cat):
    if self.catcount(cat) == 0:
        return 0
    # Items in this category containing the feature,
    # divided by the total number of items in the category
    return self.fcount(f, cat) / self.catcount(cat)
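Continuing the toy training above, 'camera' appears in the single photography item and in no economics item, so:

    print(cl.fprob('camera', 'photography'))   # 1.0  (1 photography item out of 1 contains it)
    print(cl.fprob('camera', 'economics'))     # 0.0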
Now the model is built. How do we assign unclassified books to a category with it? We can use either the naive Bayes method or the Fisher method.
(1) Bayesian classification: the idea behind Bayes' theorem is simple. If a box holds ten balls, six green and four blue, it is easy to work out the probability of drawing a blue one; the harder question runs the other way, reasoning from what we observe back to what we do not know about the box. Bayes' formula is exactly the tool for flipping a conditional probability. From the training model it is easy to get the probability of a feature given a category; Bayes' rule turns that into the probability that an item with this feature belongs to the category.
The naive Bayes formula assumes that the phrases in a book are independent of one another. Under that assumption, the probability of the whole book given a category is simply the product of the probabilities of all its phrases:
def docprob(self, item, cat):
    features = self.getfeatures(item)
    # Multiply the probabilities of all the features together
    p = 1
    for f in features:
        p *= self.weightedprob(f, cat, self.fprob)
    return p
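docprob above (and fisherprob below) rely on a weightedprob helper that is not shown in this post. A minimal sketch, assuming the usual weighted-average smoothing toward a prior (the weight of 1.0 and assumed probability of 0.5 are assumptions, not taken from the post):

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # Current probability according to the supplied function (e.g. fprob or cprob)
        basicprob = prf(f, cat)
        # How many times this feature has appeared across all categories
        totals = sum([self.fcount(f, c) for c in self.categories()])
        # Weighted average of the assumed prior (ap) and the observed probability,
        # so rarely seen features are pulled toward the prior
        return ((weight * ap) + (totals * basicprob)) / (weight + totals)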
According to Bayes' formula, P(category | book) = P(book | category) * P(category) / P(book). We can ignore P(book): it is the same for every category, so it does not affect which category comes out on top.
def prob(self, item, cat):
    catprob = self.catcount(cat) / self.totalcount()
    docprob = self.docprob(item, cat)
    # P(category | book) is proportional to P(book | category) * P(category)
    return docprob * catprob
To predict which category a book belongs to, we compute its probability under every category and pick the highest. One point deserves a closer look: always taking the highest probability is not appropriate in every setting, for example classifying mail into spam and non-spam. We would much rather let a few spam messages slip into the inbox than misjudge a legitimate message as spam. In such cases we add a threshold to the decision: a message is confirmed as spam only if its spam probability is several times higher than its probability for any other category; otherwise we refuse to judge and return a default.
def setthreshold(self, cat, t):
    self.thresholds[cat] = t

def getthreshold(self, cat):
    if cat not in self.thresholds:
        return 1.0
    return self.thresholds[cat]

def classify(self, item, default=None):
    probs = {}
    # Find the category with the highest probability
    max = 0.0
    for cat in self.categories():
        probs[cat] = self.prob(item, cat)
        if probs[cat] > max:
            max = probs[cat]
            best = cat
    # Make sure the probability exceeds threshold*next best
    for cat in probs:
        if cat == best:
            continue
        if probs[cat] * self.getthreshold(best) > probs[best]:
            return default
    return best
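As a toy illustration (the training snippets and the threshold of 3.0 are made up), assuming the Bayesian methods above are gathered into the classifier class together with the weightedprob sketch:

    cl = classifier(getwords)
    cl.train('win a free prize, claim your money now', 'spam')
    cl.train('meeting notes from the project review', 'normal')
    # Only return 'spam' if it is at least 3 times more likely than any other category
    cl.setthreshold('spam', 3.0)
    print(cl.classify('claim your free prize', default='unknown'))   # with this toy data: spam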
(2) Fisher classification:
The Bayesian method decides which category a book belongs to by multiplying, for each category, the probabilities of all the phrases in the book given that category. The Fisher method works the other way around: for each phrase it estimates the probability that the phrase points to a category, computed as the phrase's frequency in that category divided by its frequency across all categories. These per-phrase probabilities are then combined, and the book is assigned to the category where the combined score is largest.
import math

def cprob(self, f, cat):
    # The frequency of this feature in this category
    clf = self.fprob(f, cat)
    if clf == 0:
        return 0
    # The frequency of this feature in all the categories
    freqsum = sum([self.fprob(f, c) for c in self.categories()])
    # The probability is the frequency in this category divided by
    # the overall frequency
    p = clf / freqsum
    return p

def fisherprob(self, item, cat):
    # Multiply all the probabilities together
    p = 1
    features = self.getfeatures(item)
    for f in features:
        p *= self.weightedprob(f, cat, self.cprob)
    # Take the natural log and multiply by -2
    fscore = -2 * math.log(p)
    # Use the inverse chi2 function to turn the score into a probability
    return self.invchi2(fscore, len(features) * 2)
def invchi2(self, chi, df):
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        sum += term
    return min(sum, 1.0)
Note that after the per-phrase probabilities are multiplied together, the code does not use the raw product directly: it takes -2 times the natural log of the product and runs the result through the inverse chi-squared function to turn it into a probability.
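For reference, this adjustment is the standard statement of Fisher's method for combining probabilities (background, not from the original post): if the individual probabilities p_1, ..., p_n are independent and uniformly distributed, then

    X = -2 * ( ln(p_1) + ln(p_2) + ... + ln(p_n) )

follows a chi-squared distribution with 2n degrees of freedom, which is why fisherprob calls invchi2 with len(features)*2 degrees of freedom.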
As with the Bayesian classifier, we add a threshold to avoid false positives. Here it is not a multiplier but a minimum probability value per category:
def setminimum(self, cat, min):
    self.minimums[cat] = min

def getminimum(self, cat):
    if cat not in self.minimums:
        return 0
    return self.minimums[cat]

# The Fisher version of classify()
def classify(self, item, default=None):
    # Loop through looking for the best result
    best = default
    max = 0.0
    for c in self.categories():
        p = self.fisherprob(item, c)
        # Make sure it exceeds its minimum
        if p > self.getminimum(c) and p > max:
            best = c
            max = p
    return best
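A toy usage sketch mirroring the Bayesian example above (assuming the Fisher methods are collected in their own classifier class, here called fisherclassifier, so its classify() does not clash with the Bayesian one; the minimum of 0.8 is made up):

    cl = fisherclassifier(getwords)
    cl.train('win a free prize, claim your money now', 'spam')
    cl.train('meeting notes from the project review', 'normal')
    # Only accept 'spam' if its Fisher probability exceeds 0.8
    cl.setminimum('spam', 0.8)
    print(cl.classify('claim your free prize', default='unknown'))   # with this toy data: spam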
That wraps up the classifier. Both the principles and the code are simple; I hope I have explained them clearly.