Your Prediction Is Only as Good as Your Data
May 5 by Kazem
In the past, we have seen software engineers and data scientists assume that they can keep increasing their prediction accuracy by improving their machine learning algorithms. Here, we want to approach the classification problem from a different angle: we recommend that data scientists first analyze the distribution of their data to measure how much information it contains. This approach gives us an upper bound on how far one can improve the accuracy of a predictive algorithm, and makes sure our optimization efforts are not wasted!
Entropy and information
In information theory, mathematicians have developed a few useful techniques, such as entropy, to measure the amount of information in a random process. Let's think of a biased coin with a head probability of 1%. If one flips such a coin, we get more information when we see the head event, since it is a rare event compared to the tail event, which is much more likely to happen. We can formulate the amount of information in an event as the negative logarithm of its probability, which captures this intuition. Mathematicians also formulated another measure, called entropy, by which they capture the average information in a random process in bits. Below we show the entropy formula for a discrete random variable:

$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$
For the first example, let's assume we have a coin with P(H) = 0% and P(T) = 100%. We can compute the entropy of the coin as follows (using the convention that $0 \log_2 0 = 0$):

$$H = -0 \log_2 0 - 1 \log_2 1 = 0 \text{ bits}$$
For the second example, let's consider a coin where P(H) = 1% and P(T) = 1 − P(H) = 99%. Plugging in the numbers, one can find that the entropy of such a coin is:

$$H = -0.01 \log_2 0.01 - 0.99 \log_2 0.99 \approx 0.08 \text{ bits}$$
Finally, if the coin has P(H) = P(T) = 0.5 (i.e. a fair coin), its entropy is calculated as follows:

$$H = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \text{ bit}$$
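The three calculations above can be checked with a short Python snippet (the helper name `coin_entropy` is ours, not from any library):

```python
import math

def coin_entropy(p_head):
    """Entropy (in bits) of a coin with head probability p_head."""
    total = 0.0
    for p in (p_head, 1.0 - p_head):
        if p > 0:  # convention: 0 * log2(0) = 0
            total -= p * math.log2(p)
    return total

print(coin_entropy(0.0))   # deterministic coin -> 0.0 bits
print(coin_entropy(0.01))  # heavily biased coin -> ~0.08 bits
print(coin_entropy(0.5))   # fair coin -> 1.0 bit
```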
Entropy and predictability
So, what do these examples tell us? If we have a coin with a head probability of zero, the coin's entropy is zero, meaning that the average information in the coin is zero. This makes sense, because flipping such a coin always comes up tails; thus, the prediction accuracy is 100%. In other words, when the entropy is zero we have the maximum predictability.
In the second example, the head probability is not zero but still very close to zero, which again makes the coin very predictable, with a low entropy.
Finally, in the last example we have a 50/50 chance of seeing the head/tail events, which maximizes the entropy and consequently minimizes the predictability. In other words, one can show that a fair coin has the maximum entropy of 1 bit, making any prediction as good as a random guess.
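We can make this inverse relationship concrete: for a coin, the best possible predictor simply guesses the more likely side, so its accuracy is max(p, 1 − p). A small sketch (the helper names are ours) shows entropy rising as the best achievable accuracy falls:

```python
import math

def coin_entropy(p):
    """Entropy (in bits) of a coin with head probability p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def best_accuracy(p):
    """Accuracy of the best predictor: always guess the likelier side."""
    return max(p, 1.0 - p)

for p in (0.0, 0.01, 0.25, 0.5):
    print(f"p={p:.2f}  entropy={coin_entropy(p):.3f} bits  "
          f"best accuracy={best_accuracy(p):.0%}")
```

At p = 0 the entropy is 0 bits and the best accuracy is 100%; at p = 0.5 the entropy is 1 bit and the best accuracy drops to 50%, a coin toss.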
Kullback–Leibler divergence
As a last example, it's worth showing how we can borrow ideas from information theory to measure the distance between probability distributions. Let's assume we are modeling two random processes by their PMFs, P(.) and Q(.). One can use the entropy measure to compute the distance between the PMFs as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)}$$
The above distance function is known as the KL divergence, which measures the divergence of Q's PMF from P's PMF. The KL divergence can come in handy in various problems, such as NLP problems where we'd like to measure the distance between sets of data (e.g. bags of words).
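A minimal sketch of the KL divergence in Python (the function name and the example distributions are ours):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits between two discrete PMFs.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0
    contribute nothing by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]   # fair coin PMF
biased = [0.9, 0.1]   # biased coin PMF

print(kl_divergence(fair, biased))  # ~0.737 bits
print(kl_divergence(fair, fair))    # 0.0 — identical PMFs
```

Note that the KL divergence is not symmetric: D(P‖Q) generally differs from D(Q‖P), so it is not a true metric even though it is often used as a distance measure.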
Wrap-up
In this post, we showed that entropy from information theory provides a way to measure how much information exists in our data. We also highlighted the inverse relationship between entropy and predictability. This shows that we can use the entropy measure to calculate an upper bound for the accuracy of the prediction problem at hand.
Feel free to share your comments or questions in the comment section below.
You can also reach us at [email protected]