newLISP Bayes algorithm

Source: Internet
Author: User

Understanding conditional probability

To use Bayes, you first need to understand conditional probability. See the previous article for an introduction to conditional probability.

Two-phase algorithm: training and query

Let's take a look at the well-known Bayes algorithm. In data mining, Bayes is a classification algorithm. It works in two stages: training and query. Training means processing a sample dataset to find its regularities. newLISP provides the bayes-train function for this.


Training:
syntax: (bayes-train list-M1 [list-M2 ... ] sym-context-D)
list-M1 [list-M2 ...]: the input parameters are one or more lists whose elements can be symbols or strings. In Bayes terminology these elements are called tokens, so list-M1 and the lists that follow it are called token lists. Each list also represents a different class, so the lists can also be called training set categories.
Before training, we analyze the sample space and divide it into several categories of data according to some criterion. Each category is represented by one token list. In other words, we partition the training data deliberately so that the training receives well-separated data. If, for example, we have already decided on three categories but a particular training run only has data for one of them, we can pass '() for the other two categories; the empty lists do not affect the result.

The bayes-train function counts how many times each token appears in each input list and stores the results as key/value pairs in the context named by sym-context-D.
Now let's look at an example. All tokens are symbols. There are two token lists, and the training result is saved in the context L.
> (bayes-train '(A A B C C) '(A B B C C C) 'L)
(5 6)
> (symbols L)
(L:A L:B L:C L:L L:total)
The symbols function shows which symbols the resulting context L contains. L:total holds the totals. The values are displayed one by one below.
> L:A
(2 1)
> L:B
(1 2)
> L:C
(2 3)
> L:total
(5 6)

We can see that token A appears twice in the first token list and once in the second. B and C have the counts (1 2) and (2 3) respectively, and the totals are (5 6). In other words, each key is a token, and its value is the frequency of that token in each token list.
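
The counting described above can be sketched in Python. This is only an illustration of the bookkeeping, not newLISP's implementation: the function name train and the use of a plain dict in place of a context are assumptions made for this sketch.

```python
def train(context, *token_lists):
    """Count tokens per category in `context` (a dict) and return the totals.

    Hypothetical stand-in for bayes-train: every key maps to a list with
    one counter per category, and the key "total" holds the category sizes.
    """
    n = len(token_lists)
    totals = context.setdefault("total", [0] * n)
    for category, tokens in enumerate(token_lists):
        for token in tokens:
            # Create the per-category counters the first time a token is seen.
            counts = context.setdefault(token, [0] * n)
            counts[category] += 1
            totals[category] += 1
    return tuple(totals)


L = {}
result = train(L, ["A", "A", "B", "C", "C"], ["A", "B", "B", "C", "C", "C"])
print(result)       # (5, 6)
print(L["A"])       # [2, 1]
print(L["total"])   # [5, 6]
```

The dict plays the role of the context L: each token maps to one counter per token list, which matches the (2 1), (1 2), and (2 3) values shown in the session above.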

Tokens can also be strings, but note that in the resulting context each key is then prefixed with an underscore (_). For example:
(bayes-train '("one" "two" "two" "three")
             '("three" "one" "three")
             '("one" "two" "three") 'S)

In S, the keys are _one, _two, and _three.
Each token list is simply a sequence of tokens. A token list can contain millions of tokens, for example when training on natural language.

Incremental training

It is worth noting that training can be performed repeatedly on the same context. If you call bayes-train again, you will find that the token frequencies increase. For example:
> (bayes-train '(A A B C C) '(A B B C C C) 'L)
(10 12)
> L:A
(4 2)
> L:total
(10 12)
This is good news. We can save the contents of L, for example in a database, after each training run; when new training samples arrive, we continue training instead of starting from scratch. If the token frequencies have already been obtained by other means, we can skip the training step and store them directly in the context, where they serve as the basis for subsequent training. New tokens can also be added, and bayes-train updates the result correctly. When the training set is very large or the training data grows over time, incremental training is the best approach.
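
The idea of accumulating counts, saving them, and resuming later can be sketched in Python. The train helper, the dict used in place of a context, and JSON as the storage format are all assumptions chosen for illustration; they are not how newLISP stores a context.

```python
import json


def train(context, *token_lists):
    """Hypothetical stand-in for bayes-train: add counts to an existing dict."""
    n = len(token_lists)
    totals = context.setdefault("total", [0] * n)
    for category, tokens in enumerate(token_lists):
        for token in tokens:
            context.setdefault(token, [0] * n)[category] += 1
            totals[category] += 1
    return tuple(totals)


first = ["A", "A", "B", "C", "C"]
second = ["A", "B", "B", "C", "C", "C"]

L = {}
train(L, first, second)
after_second_call = train(L, first, second)   # counts accumulate: (10, 12)

# Persist the counts (here as JSON text) and restore them later.
saved = json.dumps(L)
restored = json.loads(saved)

# Continue training on the restored counts with a token not seen before;
# the second category receives an empty list, like '() in newLISP.
after_new_token = train(restored, ["D"], [])  # (11, 12)
print(after_second_call, restored["A"], restored["D"], after_new_token)
```

Because only counts are stored, resuming training needs the saved frequencies but not the original samples.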
Query:
Training is ultimately done for the sake of querying. The prototype of the bayes-query function is as follows:
syntax: (bayes-query list-L context-D [bool-chain [bool-probs]])

list-L is a token list, just like the token lists used for training. context-D is the context produced by training and contains the training result. The last two parameters are not discussed here.
The following example is easy to understand. The three training sets correspond to three categories of data: colors, people, and programming languages. A token list containing one programming language and one person is then used for the query.
The result shows that the probability of the color category is 0, and the probabilities of the people category and the programming language category are 50% each.

> (bayes-train '("red" "blue" "yellow") '("girl" "boy" "uncle" "female") '("c++" "java" "newlisp") 'L)
(3 4 3)
> (bayes-query '("c++" "girl") L)
(0 0.5 0.5)

This example is easy to understand, but note that in real training a token usually appears in several token lists (training set categories); it is unlikely to appear in only one category, as it does in this example.
The inverse Chi² method is used by default here. With this method, the total numbers of tokens in the different training set categories should be equal or similar.



Bayes queries use the following conditional probability formula as the basis for calculation.
Assume there are two training set categories, A and B:
p(A|tkn) = p(tkn|A) * p(A) / ( p(tkn|A) * p(A) + p(tkn|B) * p(B) )

Note:
1. p(A|tkn) is the conditional probability of training set category A given that the token tkn appears.
2. p(tkn|A) * p(A) is the probability that the token and category A occur together.
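
As a worked instance of the two-category formula, the following Python sketch plugs in the counts of token C from the earlier training example: 2 of 5 tokens in the first list (treated here as category A) and 3 of 6 in the second (category B). The equal priors p(A) = p(B) = 1/2 are an assumption made for this illustration; the source does not state which priors are used.

```python
from fractions import Fraction

# Likelihoods taken from the training counts L:C = (2 3) and L:total = (5 6).
p_tkn_given_a = Fraction(2, 5)
p_tkn_given_b = Fraction(3, 6)

# Assumed equal priors for the two categories (not stated in the source).
p_a = Fraction(1, 2)
p_b = Fraction(1, 2)

# p(A|tkn) = p(tkn|A) * p(A) / (p(tkn|A) * p(A) + p(tkn|B) * p(B))
joint_a = p_tkn_given_a * p_a
joint_b = p_tkn_given_b * p_b
posterior_a = joint_a / (joint_a + joint_b)

print(posterior_a)   # 4/9
```

Exact fractions are used so that each step of the arithmetic can be checked by hand.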

Assume there are N training set categories:
p(Mc|tkn) = p(tkn|Mc) * p(Mc) / sum-i-N( p(tkn|Mi) * p(Mi) )
Here Mc is the category being evaluated, and sum-i-N sums over all N categories Mi.
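
The N-category formula for a single token can be sketched in Python as follows. The function name posteriors, the equal default priors, and the handling of empty categories are assumptions made for this sketch. It covers one token only and does not reproduce how bayes-query combines several tokens (for example with the inverse Chi² method).

```python
def posteriors(token_counts, totals, priors=None):
    """Return p(Mi|tkn) for every category Mi, given one token's counts.

    token_counts[i] is the token's frequency in category i and totals[i]
    is the number of tokens in category i (as stored under 'total').
    """
    n = len(totals)
    if priors is None:
        priors = [1.0 / n] * n              # assumed equal priors
    joints = []
    for count, total, prior in zip(token_counts, totals, priors):
        likelihood = count / total if total else 0.0   # empty category
        joints.append(likelihood * prior)   # p(tkn|Mi) * p(Mi)
    evidence = sum(joints)                  # the sum over all N categories
    if evidence == 0:
        return [0.0] * n                    # token was never seen in training
    return [joint / evidence for joint in joints]


# Token "c++" from the color / people / language example: counts (0 0 1).
cpp = posteriors([0, 0, 1], [3, 4, 3])
print(cpp)            # [0.0, 0.0, 1.0]

# Token C from the two-list example: counts (2 3), totals (5 6).
two = posteriors([2, 3], [5, 6])
print(two)
```

With N = 2 the function reduces to the two-category formula given earlier.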








