
Conditional Probability and Independence

In some cases, when performing a random experiment, we may already have some prior or otherwise incomplete information about the experiment. We may also be interested in knowing the probability of an outcome under certain conditions. This leads us to the concept of conditional probability.

The simplest example would be if we threw a die and a friend told us that it shows an even number. If we then wanted to know the probability of the die showing a 6, so that we land on Mayfair in Monopoly, we would have to take into account that the die can only show an even number. Here the prior information that the die shows an even number is the same as having incomplete information about the die, and we want to know a probability under that condition. This is a conditional probability.

Formally we want to know the probability of an event \(A\) given that another event \(B\) has already occurred. We denote this as \(\P(A | B)\), which is read as the probability of \(A\) given \(B\). We define the conditional probability of \(A\) given \(B\) as follows:

\[\P(A | B) = \frac{\P(A \cap B)}{\P(B)} \]

Importantly, to avoid division by zero, we need \(\P(B) \neq 0\). This formula makes sense and can also be understood intuitively. Let’s look at it in the case of the die example. In a Laplace space we have the following events:

  • \(A\): Die shows 6, so \(A=\{6\}\)
  • \(B\): Die shows an even number, so \(B=\{2,4,6\}\)
  • \(A \cap B\): Die shows 6 and an even number, so \(A \cap B=\{6\}\)

Because we have a Laplace space, the probabilities are simple. But what is the conditional probability actually telling us? In the numerator we have the outcomes that are in both \(A\) and \(B\), so the outcomes that interest us and are still possible. In the denominator we have the outcomes in \(B\), so all the outcomes that are still possible given our information. The conditional probability is therefore the ratio of the outcomes that interest us to the outcomes that remain possible. For the die example we then get:

\[\P(A | B) = \frac{\P(A \cap B)}{\P(B)} = \frac{1/6}{3/6} = \frac{1}{3} \]

This means that the probability of the die showing a 6 given that it shows an even number is \(\frac{1}{3}\). This is intuitive as we have 3 possible outcomes that are even and only one of them is a 6.
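To make this concrete, here is a minimal Python sketch (the helper names prob and cond_prob are just illustrative) that enumerates the Laplace space of the die and computes the conditional probability by counting outcomes:

```python
from fractions import Fraction

# Laplace space of a fair die: all six outcomes are equally likely.
omega = {1, 2, 3, 4, 5, 6}
A = {6}        # die shows a 6
B = {2, 4, 6}  # die shows an even number

def prob(event):
    """P(event) in a Laplace space: |event| / |omega|."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b), requires P(b) != 0."""
    return prob(a & b) / prob(b)

print(cond_prob(A, B))  # 1/3
print(cond_prob(B, A))  # 1 (conditioning is not symmetric, see below)
```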

Todo

This can also be shown nicely with a Venn diagram.

If it turns out that the prior information we are given happens to be the exact same as the event we are interested in, so \(A=B\), then we know that the event \(A\) will surely happen and we get the following:

\[\P(A | A) = \frac{\P(A \cap A)}{\P(A)} = \frac{\P(A)}{\P(A)} = 1 \]

This again makes intuitive sense: if we know that the die shows a 6, then the probability of the die showing a 6 is 1. There is a similar scenario where the sample space is the condition. You can think of this as your friend telling you that the die shows a number between 1 and 6. He might as well have told you nothing. In this case we have the following:

\[\P(A | \Omega) = \frac{\P(A \cap \Omega)}{\P(\Omega)} = \frac{\P(A)}{1} = \P(A) \]

We can also have the opposite case where the prior information makes it impossible for the event to happen, so \(A \cap B = \emptyset\). For example, if I told you that the die shows an odd number, then the probability of the die showing a 6 is 0. This is also intuitive, as an odd number cannot be a 6. In this case we get:

\[\P(A | B) = \frac{\P(A \cap B)}{\P(B)} = \frac{0}{\P(B)} = 0 \]

An important thing to notice is that conditioning is not symmetric. In other words, one event might “increase” the probability of another event, but the other event does not necessarily “increase” the probability of the first event by the same amount, if at all. For example, if \(A\) is the die showing a 6 and \(B\) is the die showing an even number, then we have:

\[\P(A | B) = \frac{\P(A \cap B)}{\P(B)} = \frac{1/6}{3/6} = \frac{1}{3} \text{ and } \P(B | A) = \frac{\P(B \cap A)}{\P(A)} = \frac{1/6}{1/6} = 1 \]

Therefore in general we have that \(\P(A | B) \neq \P(B | A)\). This is important to keep in mind as it can lead to confusion.

Conditional Probability as a Probability Measure

By our definition of a probability measure, it turns out that the conditional probability is also a probability measure. More precisely, if \(B\) is an event with \(\P(B) \neq 0\), then the conditional probability defines a probability measure on the event space \(\mathcal{F}\) of the sample space \(\Omega\) as follows:

\[\P( \cdot | B) : \mathcal{F} \to [0,1] \text{ with } \P(A | B) = \frac{\P(A \cap B)}{\P(B)} \]

To show it is a probability measure we need to show that it satisfies the three properties of a probability measure:

  1. Countable additivity: \(\P(\bigcup_{i=1}^{\infty} A_i | B) = \sum_{i=1}^{\infty} \P(A_i | B)\) for pairwise disjoint events \(A_i\). This holds because the events \(A_i \cap B\) are then also pairwise disjoint, so by the distributivity of set operations and the countable additivity of \(\P\) we get \(\P(\bigcup_{i=1}^{\infty} A_i | B) = \frac{\P(\bigcup_{i=1}^{\infty} (A_i \cap B))}{\P(B)} = \frac{\sum_{i=1}^{\infty} \P(A_i \cap B)}{\P(B)} = \sum_{i=1}^{\infty} \P(A_i | B)\).
  2. \(\P(\emptyset | B) = 0\). We have already seen this above in one of the examples and it is rather simple. We have \(\emptyset \cap B = \emptyset\) and \(\P(B) \neq 0\), so we have \(\P(\emptyset | B) = \frac{\P(\emptyset)}{\P(B)} = \frac{0}{\P(B)} = 0\).
  3. \(\P(\Omega | B) = 1\). This is also simple as we have \(\Omega \cap B = B\) and \(\P(B) \neq 0\), so we have \(\P(\Omega | B) = \frac{\P(B)}{\P(B)} = 1\).

Stochastic Independence

We have seen that for certain events, knowing that one event \(B\) has already occurred can influence the probability of another event \(A\). However, depending on the experiment, the prior information of \(B\) may have no influence on the event \(A\) at all, so the probability of \(A\) is the same as the probability of \(A\) given \(B\). This is called stochastic independence, as the two events do not influence each other. More formally we can write this as follows:

\[\P(A | B) = \P(A) = \frac{\P(A \cap B)}{\P(B)} \]

Where \(\P(B) \neq 0\). This means that the probability of \(A\) given \(B\) is the same as the probability of \(A\) without any prior information. If we multiply both sides by \(\P(B)\) we get:

\[\P(A \cap B) = \P(A) \cdot \P(B) \]

This formula is preferred as it avoids the division by zero and is more intuitive. The probability of \(A\) and \(B\) happening together is simply the probability of \(A\) times the probability of \(B\). From this formula we can also see that independence is symmetric, so if \(A\) and \(B\) are independent, then \(B\) and \(A\) are also independent, since dividing both sides by \(\P(A)\) (assuming \(\P(A) \neq 0\)) gives:

\[\frac{\P(A \cap B)}{\P(A)} = \P(B) \Leftrightarrow \P(B | A) = \P(B) \]

When reading literature on probability and statistics you will often come across the abbreviation “i.i.d.”. This stands for independent and identically distributed, meaning the events are independent of each other and have the same distribution, i.e. the same probability measure. This is often used in the context of random variables and stochastic processes, and the i.i.d. assumption is commonly made in statistics and machine learning to simplify the analysis of data and models.

Example

An example of this would be if we have two coins and we want to know the probability of the second coin showing heads given that the first coin shows heads. If we model our experiment as a Laplace space, we have the following events:

  • \(A\): Coin 1 shows heads, so \(A=\{(H,H),(H,T)\}\)
  • \(B\): Coin 2 shows heads, so \(B=\{(H,H),(T,H)\}\)
  • \(A \cap B\): Coin 1 shows heads and Coin 2 shows heads, so \(A \cap B=\{(H,H)\}\)

We know that the probability of the second coin showing heads is \(\frac{1}{2}\): either it shows heads or it shows tails. If we know that the first coin shows heads, this does not change the probability of the second coin showing heads. Intuitively this makes sense, as the two coins are physically independent of each other and we model them as such; we do not somehow become a supernatural being between the two coin tosses and influence the second toss. So we have:

\[\P(B | A) = \frac{\P(A \cap B)}{\P(A)} = \frac{1/4}{1/2} = \frac{1}{2} = \P(B) \]

Because we know coin tosses are independent of each other, we can then also easily calculate the probability of both coins showing heads. This is simply the probability of one coin showing heads times the probability of the other coin showing heads. So we have:

\[\P(A \cap B) = \P(A) \cdot \P(B) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \]

We could also have said that the two coins are i.i.d., as they are independent of each other and identically distributed, meaning the two coins have the same probability measure, so the same probability of showing heads or tails. The i.i.d. condition would also allow us to make the coins unfair, for example giving a coin a 60% chance of showing heads and a 40% chance of showing tails. However, both coins would need to have the same probability measure, so they would both need the 60% chance of heads and 40% chance of tails, hence they are identically distributed.
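As a small sketch of this, we can model two i.i.d. (possibly unfair) coins in Python by letting the joint probability factor, which is exactly what independence means, and then check the factorization. The 60% bias is just the example value from above:

```python
from fractions import Fraction
from itertools import product

# Two i.i.d. coins with an assumed 60% chance of heads each.
p_heads = Fraction(6, 10)
coin = {"H": p_heads, "T": 1 - p_heads}

# Independence is modeled by letting the joint probability factor.
joint = {(x, y): coin[x] * coin[y] for x, y in product(coin, coin)}

A = {("H", "H"), ("H", "T")}  # coin 1 shows heads
B = {("H", "H"), ("T", "H")}  # coin 2 shows heads

def P(event):
    return sum(joint[w] for w in event)

print(P(A & B))     # 9/25
print(P(A) * P(B))  # 9/25, so P(A ∩ B) = P(A) · P(B)
```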

Physical Independence vs Stochastic Independence

It is important to note that stochastic independence does not have to mean that the events are physically independent of each other. Two events can be stochastically independent even though they are physically linked through the same experiment. An interesting example of this is throwing two dice at the same time and looking at events that involve both dice. We have the following events:

  • \(A\): The first die is even, so \(A=\{(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(4,1),(4,2),(4,3),(4,4),(4,5),(4,6),(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)\}\). So \(A\) has 18 outcomes out of 36 possible outcomes.
  • \(B\): The sum of the two dice is 7, so \(B=\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}\). So \(B\) has 6 outcomes out of 36 possible outcomes.
  • \(A \cap B\): The first die is even and the sum is 7, so \(A \cap B=\{(2,5),(4,3),(6,1)\}\). So \(A \cap B\) has 3 outcomes out of 36 possible outcomes.

You might think that whether the first die is even matters for the sum being 7, but it does not. These two events are independent of each other despite there being a physical dependency. To show they are independent we just need to show that \(\P(A \cap B) = \P(A) \cdot \P(B)\), so we have:

\[\P(A \cap B) = \frac{3}{36} = \frac{1}{12} \text{ and } \P(A) \cdot \P(B) = \frac{18}{36} \cdot \frac{6}{36} = \frac{108}{1296} = \frac{1}{12} \]

So the two events are independent of each other. Interestingly if we define the event slightly differently we get a different result:

  • \(A\): The first die is even, as before. So \(A\) has 18 outcomes out of 36 possible outcomes.
  • \(C\): The sum of the two dice is 6, so \(C=\{(1,5),(2,4),(3,3),(4,2),(5,1)\}\). So \(C\) has 5 outcomes out of 36 possible outcomes.
  • \(A \cap C\): The first die is even and the sum is 6, so \(A \cap C=\{(2,4),(4,2)\}\). So \(A \cap C\) has 2 outcomes out of 36 possible outcomes.

Now we have:

\[\P(A \cap C) = \frac{2}{36} = \frac{1}{18} \text{ and } \P(A) \cdot \P(C) = \frac{18}{36} \cdot \frac{5}{36} = \frac{90}{1296} = \frac{5}{72} \]

So now clearly the two events are not independent of each other. What is happening here? Within the event \(B\) the first die is even in exactly half of the outcomes (3 out of 6), which matches \(\P(A) = \frac{1}{2}\), so knowing \(B\) tells us nothing about \(A\). Within the event \(C\), however, the first die is even in only 2 of the 5 outcomes, so knowing \(C\) changes the probability: the parity of the first die matters.
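Both calculations can be verified by brute force. Here is a short Python sketch enumerating all 36 outcomes:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}  # first die is even
B = {w for w in omega if sum(w) == 7}    # sum of the dice is 7
C = {w for w in omega if sum(w) == 6}    # sum of the dice is 6

print(prob(A & B), prob(A) * prob(B))  # 1/12 1/12 -> independent
print(prob(A & C), prob(A) * prob(C))  # 1/18 5/72 -> not independent
```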

Independence of Multiple Events
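For more than two events, independence is a stronger requirement than just pairwise independence. The events \(A_1, A_2, \ldots, A_n\) are called (mutually) independent if the product formula holds for every subcollection of them, so for every index set \(I \subseteq \{1, \ldots, n\}\):

\[\P\Big(\bigcap_{i \in I} A_i\Big) = \prod_{i \in I} \P(A_i) \]

In particular it is not enough that all pairs satisfy \(\P(A_i \cap A_j) = \P(A_i) \cdot \P(A_j)\); there are events that are pairwise independent but not mutually independent.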

Multiplication Rule

By rearranging the formula for conditional probability, we actually get a formula for the intersection of two events, in other words a formula for when two events happen at the same time. This is called the multiplication rule; why it is called that becomes clearer in the context of multi-stage random experiments below. We get the multiplication rule by multiplying both sides of the formula for conditional probability by \(\P(B)\):

\[\P(A | B) = \frac{\P(A \cap B)}{\P(B)} \Leftrightarrow \P(A \cap B) = \P(A | B) \cdot \P(B) \]

This is the multiplication rule for two events. However, this also works the other way around. So in other words if we have the probability of \(B\) given \(A\), we can also rearrange the formula for conditional probability to get the multiplication rule in the other direction:

\[\P(B | A) = \frac{\P(A \cap B)}{\P(A)} \Leftrightarrow \P(A \cap B) = \P(B | A) \cdot \P(A) \]

So from this we can see that the multiplication rule is symmetric or more formally we have:

\[\P(A \cap B) = \P(A | B) \cdot \P(B) = \P(B | A) \cdot \P(A) \]
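As a quick sanity check with the earlier die example, where \(A\) is the die showing a 6 and \(B\) is the die showing an even number, both directions give the same result:

\[\P(A | B) \cdot \P(B) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6} = 1 \cdot \frac{1}{6} = \P(B | A) \cdot \P(A) \]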

Multi-Stage Random Experiments

In a multi-stage random experiment, multiple random experiments are conducted sequentially. These are often represented using tree diagrams (event trees), distinguishing between final outcomes and intermediate outcomes.

We define the following rules:

  1. The probabilities along a path are multiplied together.
  2. If multiple paths lead to the same final outcome, their probabilities are added.

A good video explanation can be found here.

Info

An important note is that for a probabilistic model there is no difference between throwing two coins at the same time and throwing one coin twice (or two coins one after another). This is because we do not model any dependence between the two coins, for example the first coin influencing the second, which could in practice be the case if the two coins hit each other in the air. In other words, the two events are independent of each other, so it does not matter whether we throw the coins at the same time or one after another. This is important to keep in mind as it can lead to confusion. If you still find it confusing, draw the possible tree diagrams and you will see that they are the same.

Dependent Stages

An urn contains 6 balls: 2 white and 4 black. We randomly draw 2 balls one after another without replacement, meaning 2 stages. We ask: What is the probability of drawing 2 balls of the same color (\(A\)) or 2 balls of different colors (\(B\))?

Stage 1 (where \(W\) denotes drawing a white ball and \(S\) a black one):

  • \(P(W) = {2 \over 6} = {1 \over 3}\)
  • \(P(S) = {4 \over 6} = {2 \over 3}\)

Stage 2: After the first draw, only 5 balls remain. If a white ball was drawn:

  • \(P(W|W) = {1 \over 5}\)
  • \(P(S|W) = {4 \over 5}\)

If a black ball was drawn:

  • \(P(W|S) = {2 \over 5}\)
  • \(P(S|S) = {3 \over 5}\)

The results are:

For same-colored balls:

\[P(A)=P(WW) + P(SS) = {1 \over 3} \cdot {1 \over 5} + {2 \over 3} \cdot {3 \over 5} = {7 \over 15} \]

For differently-colored balls:

\[P(B)=P(WS) + P(SW) = {1 \over 3} \cdot {4 \over 5} + {2 \over 3} \cdot {2 \over 5} = {8 \over 15} \]

(Figure: tree diagram of the multi-stage random experiment)
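The tree computation can also be sketched in a few lines of Python (the helper path_prob is illustrative, not part of any library), multiplying along each path and adding the paths that lead to the same final outcome:

```python
from fractions import Fraction

def path_prob(first, second):
    """Probability of one path through the tree: draw `first`, then
    `second`, from an urn with 2 white (W) and 4 black (S) balls,
    without replacement."""
    counts = {"W": 2, "S": 4}
    p1 = Fraction(counts[first], 6)
    counts[first] -= 1  # the first ball is not put back
    p2 = Fraction(counts[second], 5)
    return p1 * p2

same = path_prob("W", "W") + path_prob("S", "S")       # P(A)
different = path_prob("W", "S") + path_prob("S", "W")  # P(B)
print(same, different)  # 7/15 8/15, and they sum to 1
```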

Birthday Problem

The birthday paradox is an example of how certain probabilities are often misestimated intuitively.

We ask: “What is the probability that at least two people in a group of \(k\) people share the same birthday?”

To answer this, we first consider the probability that no two people share a birthday:

For 2 people: \({365 \over 365} \cdot {364 \over 365}\)
For 3 people: \({365 \over 365} \cdot {364 \over 365} \cdot {363 \over 365}\)
etc.

This probability approaches 0 as \(k\) increases. Thus, we can answer our question as follows:

\[P(\text{same})=1-P(\text{different}) \Leftrightarrow P(A)=1- \frac{365 \cdot (365-1) \cdot \ldots \cdot (365-k+1)}{365^k} \]
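A small Python sketch of this calculation (the function name is just illustrative):

```python
def p_shared_birthday(k: int) -> float:
    """Probability that at least two of k people share a birthday,
    assuming 365 equally likely and independent birthdays."""
    p_all_different = 1.0
    for i in range(k):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

print(p_shared_birthday(23))  # ~0.507, already more likely than not
print(p_shared_birthday(50))  # ~0.970
```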

Law of Total Probability

In some cases we may have a number of conditional probabilities, so the probabilities of an event \(A\) given that some other event has already occurred, but we actually want to know the total probability of \(A\). This leads to the law of total probability.

More precisely, suppose we have exhaustive, mutually exclusive events \(B_1, B_2, \ldots, B_n\), in other words \(B_1 \cup B_2 \cup \ldots \cup B_n = \Omega\) and \(B_i \cap B_j = \emptyset\) for \(i \neq j\). You can also think of the events \(B_i\) as a partition of the sample space \(\Omega\). Then we can write the total probability of an event \(A\) as follows:

\[\P(A) = \sum_{i=1}^{n} \P(A | B_i) \cdot \P(B_i) = \sum_{i=1}^{n} \P(A \cap B_i) \]

This comes from the multiplication rule and the fact that the events \(B_i\) are mutually exclusive. Each term in the sum represents the probability of \(A\) occurring given that the event \(B_i\) has occurred, weighted by the probability of \(B_i\), which by the multiplication rule is the same as \(A\) and \(B_i\) happening at the same time, so \(A \cap B_i\). Because the events \(B_i\) are exhaustive, they cover the entire sample space \(\Omega\), so when we sum over all the events \(B_i\) we get the total probability of \(A\). We can also prove this formally using the distributivity of set operations. We have:

\[A = A \cap \Omega = A \cap (B_1 \cup B_2 \cup \ldots \cup B_n) = (A \cap B_1) \cup (A \cap B_2) \cup \ldots \cup (A \cap B_n) \]

And because the events \(B_i\) are mutually exclusive we can use countable additivity to get:

\[\P(A) = \P(A \cap B_1) + \P(A \cap B_2) + \ldots + \P(A \cap B_n) = \sum_{i=1}^{n} \P(A \cap B_i) = \sum_{i=1}^{n} \P(A | B_i) \cdot \P(B_i) \]

Example

As an example for the law of total probability, let’s consider a football match. We have the following events:

  • \(A\): The team wins.
  • \(B\): The team plays at home.
  • \(B^c\): The team plays away.

Now if we wanted to know the probability of the team winning, we could use the law of total probability. For this we need to know the probability of the team winning at home and away and the probability of the team playing at home and away. So for example we could have:

  • \(P(A | B) = 0.8\) for the probability of the team winning at home.
  • \(P(A | B^c) = 0.3\) for the probability of the team winning away.
  • \(P(B) = 0.6\) for the probability of the team playing at home.
  • \(P(B^c) = 1 - P(B) = 0.4\) for the probability of the team playing away.

Then we can use the law of total probability to get:

\[\P(A) = \P(A | B) \cdot \P(B) + \P(A | B^c) \cdot \P(B^c) = 0.8 \cdot 0.6 + 0.3 \cdot 0.4 = 0.48 + 0.12 = 0.6 \]

This means that the probability of the team winning is 0.6 or 60% with the given probabilities. This is also intuitive from the numbers: the team wins 80% of the time at home and 30% of the time away, and it plays 60% of the time at home and 40% of the time away, so the team is more likely to win than to lose.
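This weighted sum is easy to express in code. A minimal Python sketch with the numbers from the example (the function name is illustrative):

```python
from fractions import Fraction

def total_probability(cond_probs, priors):
    """Law of total probability: P(A) = sum_i P(A | B_i) * P(B_i),
    where the events B_i partition the sample space."""
    return sum(c * p for c, p in zip(cond_probs, priors))

# P(win | home) = 0.8, P(win | away) = 0.3; P(home) = 0.6, P(away) = 0.4
win_given = [Fraction(8, 10), Fraction(3, 10)]
venue = [Fraction(6, 10), Fraction(4, 10)]
print(total_probability(win_given, venue))  # 3/5, i.e. 0.6
```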

Bayes’ Theorem

Another common case is that we are given a conditional probability and we want to know the reverse conditional probability. So we have the probability of an event \(B\) given that another event \(A\) has already occurred, \(\P(B | A)\), and we want to know the probability of \(A\) given that \(B\) has already occurred, \(\P(A | B)\). This is what Bayes’ theorem gives us, and it follows from the multiplication rule and the law of total probability. More formally, if we again have the events \(B_1, B_2, \ldots, B_n\) which partition the sample space \(\Omega\), then we can write Bayes’ theorem as follows for the events \(A\) and \(B_i\):

\[\P(B_i | A) = \frac{\P(B_i \cap A)}{\P(A)} = \frac{\P(A | B_i) \cdot \P(B_i)}{\sum_{j=1}^{n} \P(A | B_j) \cdot \P(B_j)} \]

Importantly, we need \(\P(A) \neq 0\) and \(\P(B_i) \neq 0\). This is also called the inverse probability, as we are looking for the inverse of the conditional probability that we already have.

Sickness Test

One of the most common examples of Bayes’ theorem is a test for a sickness, as it can be quite counterintuitive and paradoxical/scary. We want to develop a test for a rare sickness. We know that the sickness occurs in 0.01% of the population, so \(\frac{1}{10000}\) people have the sickness. We also know that the test is very accurate: if the person really is sick, the test will show positive with a probability of 99%. So we can model this experiment.

We create our sample space \(\Omega\) as a set of tuples \((X,Y)\), where each tuple represents a person, \(X\) indicates whether the person is sick, and \(Y\) indicates whether the test shows positive. We use boolean values for \(X\) and \(Y\). So we have the following events:

  • \(S\): The person is sick, so \(S=\{(1,0),(1,1)\}\)
  • \(T\): The test shows positive, so \(T=\{(0,1),(1,1)\}\)
  • etc.

From the information we have we can get the following probabilities:

  • \(\P(S) = 0.0001\) for the probability of a person being sick.
  • \(\P(T | S) = 0.99\) for the probability of the test showing positive given that the person is sick.
  • \(\P(T | S^c) = 0.01\) for the probability of the test showing positive given that the person is not sick. Here we additionally assume that the test is just as accurate for healthy people, so it shows a false positive with a probability of only 1%.

What we now want to know is: if I take the test and it shows positive, what is the probability that I am actually sick? So we want to know \(\P(S | T)\). We can use Bayes’ theorem to get this:

\[\P(S | T) = \frac{\P(T | S) \cdot \P(S)}{\P(T | S) \cdot \P(S) + \P(T | S^c) \cdot \P(S^c)} = \frac{0.99 \cdot 0.0001}{0.99 \cdot 0.0001 + 0.01 \cdot 0.9999} \approx 0.0098 = 0.98\% \]

No, you are not seeing that wrong. Despite the test being 99% accurate, the probability of you actually being sick is only 0.98%. Why is this happening? We can split the people that will have a positive test result into two groups:

  1. People that are sick and have a correct positive test result. Because the sickness is so rare, this group represents roughly \(\frac{1}{10000}\) of the population, so 0.01% of the population.
  2. The other group is the people that are not sick and have a false positive test result. This group represents roughly \(\frac{1}{100}\) of the population, so 1% of the population.

Given that the test is positive, it is much more likely that you are in the second group than in the first group, because the sickness is far rarer than a test error. This is a common example of how Bayes’ theorem can lead to counterintuitive results, and it is important to keep this in mind when interpreting test results.
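The same calculation as a short Python sketch (the parameter names like sensitivity are descriptive choices, not a standard API):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(sick | positive) via Bayes' theorem, with the law of total
    probability over the partition {sick, not sick} in the denominator."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1 in 10000 are sick; 99% true positive rate; 1% false positive rate.
print(posterior(0.0001, 0.99, 0.01))  # ~0.0098, i.e. about 1%
```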
