Learning Programming Collective Intelligence: A Classification System


In the previous post, "Learning Programming Collective Intelligence: A Clustering System", I clustered some of my favorite e-books. Back then, with no prior knowledge at all, the K-means algorithm grouped the books into economics, psychology, photography, and so on. Today I plan to classify books by hand. So I start sorting through my favorites and labeling them, and soon I am exhausted: the piles read "especially like", "like", "good", "dislike"... I have only labeled a small portion and have no desire to keep going. But since that small portion is already labeled, can I use it to classify the remaining books? In other words, now that I have some prior knowledge, can I put it to work?

To classify data, you need features whose presence or absence in an item can be checked. For e-book classification the items are book documents, and the features are the words and phrases in them. As mentioned in the clustering post, books in different categories are marked by different characteristic words.

First, we extract these features:

import re
import math

def getwords(doc):
  splitter=re.compile('\\W+')   # split on runs of non-word characters
  print(doc)
  # Split the words by non-alpha characters
  words=[s.lower() for s in splitter.split(doc)
         if len(s)>2 and len(s)<20]
  # Return the unique set of words only
  return dict([(w,1) for w in words])
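For a quick check of what getwords extracts (a made-up string; duplicate words collapse into a single dictionary key):

print(getwords('The Quick quick brown FOX'))
# (after the input is echoed) {'the': 1, 'quick': 1, 'brown': 1, 'fox': 1}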

Next, define a class representing the classifier. It encapsulates everything the classifier has learned so far:

class classifier:
  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures

  # Increase the count of a feature/category pair
  def incf(self,f,cat):
    self.fc.setdefault(f,{})
    self.fc[f].setdefault(cat,0)
    self.fc[f][cat]+=1

  # Increase the count of a category
  def incc(self,cat):
    self.cc.setdefault(cat,0)
    self.cc[cat]+=1

  # The number of times a feature has appeared in a category
  def fcount(self,f,cat):
    if f in self.fc and cat in self.fc[f]:
      return self.fc[f][cat]
    return 0.0

  # The number of items in a category
  def catcount(self,cat):
    if cat in self.cc:
      return float(self.cc[cat])
    return 0.0

  # The list of all categories
  def categories(self):
    return self.cc.keys()

  # The total number of items
  def totalcount(self):
    return sum(self.cc.values())

Training the model is then a very simple function:

  def train(self,item,cat):
    features=self.getfeatures(item)
    for f in features:
      self.incf(f,cat)
    self.incc(cat)
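For instance, here is a tiny hypothetical training run; the titles and category labels are invented, and getwords will echo each document because of its print call:

c=classifier(getwords)
c.train('The Intelligent Investor and the psychology of markets','economics')
c.train('Understanding Exposure: making photographs in any light','photography')
print(c.fcount('psychology','economics'))    # 1
print(c.catcount('photography'))             # 1.0
print(c.totalcount())                        # 2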

After training, we can calculate the probability that feature f appears under category cat: the number of items in cat that contain f, divided by the total number of items in cat:

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0
    return self.fcount(f,cat)/self.catcount(cat)
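Continuing the hypothetical run above:

print(c.fprob('psychology','economics'))     # 1.0: the only economics item contains it
print(c.fprob('psychology','photography'))   # 0.0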

Now the model is built. How do we use it to classify a book that has not been labeled yet? There are two options: naive Bayes and the Fisher method.


(1) Bayesian classification: the idea behind Bayes is simple. If a box holds ten balls, six green and four blue, it is easy to compute the probability of drawing a blue one. The inverse question is harder: if I only know the box holds ten balls, and I observe the colors of a few draws, how many of the balls are green? Bayes' formula is equally simple; its job is to invert conditional probabilities. From the training model it is easy to get the probability of a feature given a category; Bayes tells us how to compute the probability of the category given that feature.

Naive Bayes assumes that the words in a document are independent of one another. Under that assumption, the probability of the whole book given a category is simply the product of the probabilities of all its words:

  def docprob(self,item,cat):
    features=self.getfeatures(item)
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p
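Note that docprob calls self.weightedprob, which never appears in this excerpt. Here is a sketch of it, assuming the usual weighted-average smoothing (an assumed prior ap=0.5 with weight 1.0, so a word seen zero times contributes 0.5 rather than zeroing out the whole product):

  def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
    # Probability from the raw counts
    basicprob=prf(f,cat)
    # How many times this feature has appeared across all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])
    # Weighted average of the assumed prior and the observed probability
    return ((weight*ap)+(totals*basicprob))/(weight+totals)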

By Bayes' formula, P(category | book) = P(book | category) * P(category) / P(book). We can ignore P(book): it is the same for every category we score, so it never changes which category comes out on top.
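As a worked instance of the formula, with all counts invented:

# Suppose 30 books total, 10 of them 'economics', and the word
# 'market' appears in 8 of those 10 and in 12 of the 30 overall.
p_book_given_cat = 8/10    # P(book | category), here for a one-word book
p_cat            = 10/30   # P(category)
p_book           = 12/30   # P(book), the term prob() below drops
print(p_book_given_cat*p_cat/p_book)   # 0.666...: P(category | book)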

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob

To predict which category a book belongs to, we compute its probability under every category and pick the highest. One caveat is worth dwelling on: the highest probability is not always the right choice. Take email filtering into spam and non-spam: we would much rather let a few spam messages reach the inbox than mark legitimate mail as spam. For such cases we add a threshold to the decision: only if the spam probability exceeds the normal-mail probability by some multiple do we classify the message as spam; otherwise we return a default category.

  def setthreshold(self,cat,t):
    self.thresholds[cat]=t

  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]

  def classify(self,item,default=None):
    # Note: __init__ must also set self.thresholds={} for these methods to work
    probs={}
    # Find the category with the highest probability
    max=0.0
    best=default
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>max:
        max=probs[cat]
        best=cat
    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best
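A hypothetical end-to-end run, assuming the weightedprob sketch above has been added to the class; the training texts, categories, and threshold value are all invented, and thresholds is initialized by hand since the excerpt never does so:

nb=classifier(getwords)
nb.thresholds={}    # the excerpt never initializes this attribute
nb.train('cheap pills online now','spam')
nb.train('meeting notes for the project team','ham')
# Require spam to be 3x more likely than the runner-up
nb.setthreshold('spam',3.0)
print(nb.classify('cheap pills online',default='unknown'))   # 'spam'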

(2) Fisher classification:
The Bayesian method scores a category by multiplying together the probabilities that each of the book's words appears under that category. The Fisher method starts from the opposite direction: for each word, it estimates the probability that a document containing that word belongs to the category. This is computed as cprob(f, cat) = fprob(f, cat) divided by the sum of fprob(f, c) over all categories c. It then combines these per-word probabilities into a single score, and the book goes to the category where the combined score is largest.

  def cprob(self,f,cat):
    # The frequency of this feature in this category
    clf=self.fprob(f,cat)
    if clf==0: return 0
    # The frequency of this feature in all the categories
    freqsum=sum([self.fprob(f,c) for c in self.categories()])
    # The probability is the frequency in this category divided by
    # the overall frequency
    return clf/freqsum

  def fisherprob(self,item,cat):
    # Multiply all the probabilities together
    p=1
    features=self.getfeatures(item)
    for f in features:
      p*=self.weightedprob(f,cat,self.cprob)
    # Take the natural log and multiply by -2
    fscore=-2*math.log(p)
    # Use the inverse chi-squared function to get a probability
    return self.invchi2(fscore,len(features)*2)

  def invchi2(self,chi,df):
    # Upper-tail probability of the chi-squared distribution,
    # valid for even degrees of freedom df
    m=chi/2.0
    sum=term=math.exp(-m)
    for i in range(1,df//2):
      term*=m/i
      sum+=term
    return min(sum,1.0)

After the per-word probabilities are multiplied together, the code makes one adjustment: it takes -2 times the natural log of the product and runs the result through invchi2, the inverse chi-squared function. If the word probabilities were independent and random, this score would follow a chi-squared distribution, so a product that is improbably good yields a fisherprob close to 1.
As with Bayes, we add a threshold to avoid false positives. Here the threshold is not a multiplier but a minimum probability value:

  def setminimum(self,cat,min):
    self.minimums[cat]=min

  def getminimum(self,cat):
    if cat not in self.minimums: return 0
    return self.minimums[cat]

  def classify(self,item,default=None):
    # Note: __init__ must also set self.minimums={} for these methods to work
    # Loop through looking for the best result
    best=default
    max=0.0
    for c in self.categories():
      p=self.fisherprob(item,c)
      # Make sure it exceeds its minimum
      if p>self.getminimum(c) and p>max:
        best=c
        max=p
    return best
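And a matching hypothetical run for the Fisher side, with the same invented training data; minimums is initialized by hand for the same reason as thresholds above. (If you implement both approaches, the Bayesian and Fisher versions of classify should live in separate subclasses of classifier, or the later definition will silently replace the earlier one.)

fc=classifier(getwords)
fc.minimums={}      # the excerpt never initializes this attribute
fc.train('cheap pills online now','spam')
fc.train('meeting notes for the project team','ham')
# Only call something spam when fisherprob is at least 0.8
fc.setminimum('spam',0.8)
print(fc.classify('cheap pills online',default='unknown'))   # 'spam' (fisherprob ~0.94)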

And with that, the classification is done. The principles and the code are both simple; I hope I have described them clearly.
