Bayesian Decision Theory is the statistical approach to pattern classification. It leverages probability to make classifications, and measures the risk (i.e. cost) of assigning an input to a given class.
In this article we’ll start by taking a look at the prior probability, and why using it alone is not enough to make good predictions. Bayesian Decision Theory makes better predictions by using the prior probability, likelihood probability, and evidence to calculate the posterior probability. We’ll discuss all of these concepts in detail. Finally, we’ll map these concepts from Bayesian Decision Theory to their context in machine learning. The outline of this article is as follows:
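- Prior probability
- Likelihood probability
- Combining the prior and the likelihood
- Bayesian Decision Theory (the posterior probability)
- Sum of the priors, sum of the posteriors, and the evidence
- Bayesian Decision Theory in machine learning
- Conclusion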
After completing this article, stay tuned for Part 2, in which we’ll apply Bayesian Decision Theory to both binary and multi-class classification problems. To assess the performance of the classifier, we’ll discuss both the loss and the risk of making a prediction. If the classifier can only make a weak prediction, a sample with high uncertainty is assigned to a new class named “reject”. Part 2 will discuss when and why a sample is assigned to the reject class.
Let’s get started.
To discuss probability, we should start with how to calculate the probability that an event occurs. The probability is calculated according to the past occurrences of the outcomes (i.e. events). This is called the prior probability (“prior” meaning “before”). In other words, the prior probability is calculated purely from what happened in the past.
Assume that someone asks who will be the winner of a future match between two teams. Let $A$ and $B$ refer to the first and second team winning, respectively.
In the last 10 cup matches, $A$ occurred 4 times and $B$ occurred the remaining 6 times. So, what is the probability that $A$ occurs in the next match? Based on the experience (i.e., the events that occurred in the past), the prior probability that the first team ($A$) wins in the next match is:
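$$P(A) = \frac{4}{10} = 0.4$$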
But the past events may not always hold, because the situation or context may change. For example, team $A$ could have won only 4 matches because there were some injured players. When the next match comes, all of these injured players will have recovered. Based on the current situation, the first team may win the next match with a higher probability than the one calculated based on past events only.
The prior probability measures the probability of the next outcome without taking the current observation (i.e. the current situation) into consideration. It’s like predicting that a patient has a given disease based only on their past doctor’s visits.
In other words, because the prior probability is calculated solely from past events (with no present information), the quality of the prediction can suffer. The past occurrences of the two outcomes $A$ and $B$ may have happened while certain conditions were satisfied, but at the current moment those conditions may no longer hold.
This problem is solved using the likelihood.
The likelihood helps to answer the question: given some conditions, what is the probability that an outcome occurs? It is denoted as follows:
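$$P(X|C_i)$$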
Where $X$ refers to the conditions, and $C_i$ refers to the outcome. Because there may be multiple outcomes, the variable $C$ is given the subscript $i$.
The likelihood is read as follows:
Under a set of conditions $X$, what is the probability that the outcome is $C_i$?
According to our example of predicting the winning team, the probability that the outcome $A$ occurs does not only depend on past events, but also on current conditions. The likelihood relates the occurrence of an outcome to the current conditions at the time of making a prediction.
Assume the conditions change so that the first team has no injured players, while the second team has many injured players. As a result, it is more likely that $A$ occurs than $B$. Without considering the current situation and using only the prior information, the outcome would be $B$, which is not accurate given the current situation.
For the example of diagnosing a patient, this gives an understandably better prediction, as the diagnosis takes into account the patient’s current symptoms rather than only their prior condition.
A drawback of using only the likelihood is that it neglects experience (prior probability), which is useful in many cases. So, a better way to do a prediction is to combine them both.
Using only the prior probability, the prediction is made based on past experience. Using only the likelihood, the prediction depends only on the current situation. When either of these two probabilities is used alone, the result is not accurate enough. It is better to use both the experience and the current situation together in predicting the next outcome.
The new probability would be calculated as follows:
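$$P(X|C_i)P(C_i)$$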
For the example of diagnosing a patient, the outcome would then be selected based on their medical history as well as their current symptoms.
Using both the prior and likelihood probabilities together is an important step towards understanding Bayesian Decision Theory.
Bayesian Decision Theory (i.e. the Bayesian Decision Rule) predicts the outcome not only based on previous observations, but also by taking into account the current situation. The rule describes the most reasonable action to take based on an observation.
The formula for Bayesian (Bayes) decision theory is given below:
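$$P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}$$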
The elements of the theory are:
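- The prior probability $P(C_i)$: the probability of the outcome $C_i$ based on past occurrences.
- The likelihood probability $P(X|C_i)$: the probability of the conditions $X$ given that the outcome is $C_i$.
- The evidence $P(X)$: the probability of the conditions $X$ themselves.
- The posterior probability $P(C_i|X)$: the probability that the outcome is $C_i$ given the conditions $X$.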
Bayesian Decision Theory gives balanced predictions, as it takes into consideration the following:
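- Past experience, through the prior probability $P(C_i)$.
- The current situation (the observation $X$), through the likelihood probability $P(X|C_i)$.
- The evidence $P(X)$, which normalizes the result.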
If any of the previous factors was not used, the prediction would be hindered. Let’s explain the effect of excluding any of these factors, and mention a case where using each factor might help.
When there is information about the frequency of the occurrence of $C_i$ alone, $X$ alone, and both $C_i$ and $X$ together, then a better prediction can be made.
There are some things to note about the theory/rule:
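- The sum of all prior probabilities must be 1.
- The sum of all posterior probabilities must be 1.
- The evidence is calculated from the priors and the likelihoods of all outcomes.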
The next three sections discuss each of these points.
Assuming there are two possible outcomes, then the following must hold:
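$$P(C_1) + P(C_2) = 1$$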
The reason is that for a given input, its outcome must be one of these two. There are no uncovered outcomes.
If there are $K$ outcomes, then the following must hold:
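$$P(C_1) + P(C_2) + \dots + P(C_K) = 1$$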
Here is how it is written using the summation operator, where $i$ is the outcome index and $K$ is the total number of outcomes:
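$$\sum_{i=1}^{K} P(C_i) = 1$$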
Note that the following condition must hold for all prior probabilities:
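$$0 \leq P(C_i) \leq 1$$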
Similar to the prior probability, the sum of all posterior probabilities must be 1, according to the next equations.
If the total number of outcomes is $K$, here is the expanded sum:
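$$P(C_1|X) + P(C_2|X) + \dots + P(C_K|X) = 1$$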
Here is how to sum all the posterior probabilities for $K$ outcomes using the summation operator:
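$$\sum_{i=1}^{K} P(C_i|X) = 1$$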
Here is how the evidence is calculated when only two outcomes occur:
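$$P(X) = P(X|C_1)P(C_1) + P(X|C_2)P(C_2)$$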
For $K$ outcomes, here is how the evidence is calculated:
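$$P(X) = P(X|C_1)P(C_1) + P(X|C_2)P(C_2) + \dots + P(X|C_K)P(C_K)$$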
Here is how it is written using the summation operator:
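$$P(X) = \sum_{i=1}^{K} P(X|C_i)P(C_i)$$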
According to the latest equation of the evidence, the Bayesian Decision Theory (i.e. posterior) can be written as follows:
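$$P(C_i|X) = \frac{P(X|C_i)P(C_i)}{\sum_{k=1}^{K} P(X|C_k)P(C_k)}$$

As a concrete illustration, here is a minimal Python sketch of this calculation applied to the two-team example. The function name and the likelihood values are assumed purely for illustration:

```python
# Posterior probabilities computed from priors and likelihoods.
# The evidence P(X) is the sum of P(X|C_i) * P(C_i) over all outcomes.

def posteriors(priors, likelihoods):
    """Return P(C_i|X) for every outcome, given P(C_i) and P(X|C_i)."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # P(X)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Two-team example: the priors come from the last 10 matches, while the
# likelihoods are assumed values reflecting the current situation (injuries).
priors = [0.4, 0.6]       # P(A), P(B)
likelihoods = [0.7, 0.2]  # P(X|A), P(X|B) -- assumed for illustration
print(posteriors(priors, likelihoods))  # approximately [0.7, 0.3]: team A is now favored
```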
This section matches the concepts in machine learning to Bayesian Decision Theory.
First, the word outcome should be replaced by class. Rather than saying the outcome is $C_i$, it is more machine learning-friendly to say the class is $C_i$.
Here is a list that relates the factors in Bayesian Decision Theory to machine learning concepts:
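- The outcome $C_i$ is the class label.
- The conditions $X$ are the feature vector (the input sample).
- The prior probability $P(C_i)$ reflects how frequently the class $C_i$ occurs in the training data.
- The likelihood probability $P(X|C_i)$ captures how the feature vector $X$ relates to the class $C_i$, as learned from the training data.
- The evidence $P(X)$ is the normalizing term, calculated from the priors and likelihoods of all classes.
- The posterior probability $P(C_i|X)$ is the score used to assign the feature vector $X$ to the class $C_i$.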
When the following conditions apply, it is likely that the feature vector $X$ is assigned to the class $C_i$:
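- The class $C_i$ occurs frequently in the training data, i.e. the prior probability $P(C_i)$ is high.
- The feature vector $X$ is strongly related to the class $C_i$, i.e. the likelihood probability $P(X|C_i)$ is high.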
When a classification model is trained, it knows the frequency that a class $C_i$ occurs, and this information is represented as the prior probability $P(C_i)$. Without the prior probability $P(C_i)$, the classification model loses some of its learned knowledge.
Assuming that the prior probability $P(C_i)$ is the only probability to be used, the classification model classifies the input $X$ based on the past observations without even seeing the new input $X$. In other words, without even feeding the sample (feature vector) to the model, the model makes a decision and assigns it to a class.
The training data helps the classification model to map each input $X$ to its class label $C_i$. Such learned information is represented as the likelihood probability $P(X|C_i)$. Without the likelihood probability $P(X|C_i)$, the classification model cannot know if the input sample $X$ is related to the class $C_i$.
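To make this concrete, here is a small Python sketch contrasting a prior-only decision with a decision based on both the prior and the likelihood. All probability values are assumed for the sake of illustration:

```python
# Contrast a prior-only decision with a Bayesian (posterior-based) decision
# for a single input X. All probability values are assumed for illustration.

priors = {"C1": 0.7, "C2": 0.3}            # P(C_i): class frequencies in the training data
likelihoods_of_x = {"C1": 0.1, "C2": 0.9}  # P(X|C_i): how well this particular X matches each class

# Prior-only: the same class wins for every input, without looking at X.
prior_only_choice = max(priors, key=priors.get)  # "C1"

# Bayesian decision: combine prior and likelihood. The evidence P(X) is a
# shared denominator, so it does not change which class scores highest.
scores = {c: priors[c] * likelihoods_of_x[c] for c in priors}
bayes_choice = max(scores, key=scores.get)       # "C2" (0.27 vs 0.07)

print(prior_only_choice, bayes_choice)
```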
This article introduced Bayesian Decision Theory in the context of machine learning. It described all the elements of the theory, starting with the prior probability $P(C)$, the likelihood probability $P(X|C)$, the evidence $P(X)$, and finally the posterior probability $P(C|X)$.
We then discussed how these concepts build to Bayesian Decision Theory, and how they work in the context of machine learning.
In the next article we’ll discuss how to apply Bayesian Decision Theory to binary and multi-class classification problems, see how the loss and the risk are calculated, and finally, cover the idea of the “reject” class.