After the last post discussing set theory here, the next logical step is probability theory. The theory of probability is notoriously counterintuitive and is not as precise as some other subjects in mathematics. To navigate this subject I will attempt to prove some theorems about probability from basic axioms. Theorems are not the most exciting material to read through, but they are needed to understand an idea carefully, so I will provide a running commentary where needed.

## Axioms of Probability

Before we can prove any theorems we need to discuss axioms, which act as a foundation or “rules of the game” as we construct proofs. Axiom is a rare word in common language, but it is defined by Google as “a statement or proposition which is regarded as being established, accepted, or self-evidently true”. In other words, axioms are statements we take for granted as we reason about an idea mathematically. The remarkable observation here is that from only 3 axioms (which emerge from set theory) we can derive a large number of theorems. So let’s begin:

Let $C$ be defined as some event, i.e. a subset of the sample space $S$ (check the last post to see how this was defined). If $Pr(C)$ is defined for such subsets of $S$, then we have the following axioms:

1. For any event $C$, $$Pr(C) \geq 0$$.
2. $$Pr(S) = 1$$
3. $$Pr(C_1 \cup C_2 \cup …) = Pr(C_1) + Pr(C_2) + … = \sum_{i=1}^{\infty} Pr(C_i)$$ where the sets $C_i$ are disjoint $(C_i \cap C_j = \emptyset)$ for $i \neq j$.

Axioms 1 and 2 seem pretty straightforward – essentially they state that the probability of some event $C$ is either 0 (it cannot happen) or greater than 0, and that the probability of the sample space is 1 – it is certain that some outcome in the sample space will occur. Axiom 3 needs to be looked at a little closer; it states that the probability of the union of disjoint events $$C_1,C_2,…$$ (that is, the probability that at least one of them occurs) is the sum of the probabilities of each event happening separately. Now that we have our building blocks in place, we can proceed to prove some basic theorems.
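As a quick sanity check (my own toy example, not part of the formal development), we can model a fair six-sided die as a finite sample space in Python and verify all three axioms numerically. The names `S`, `pr`, `evens`, and `odds` are my own:

```python
from fractions import Fraction

# A fair six-sided die: every outcome is equally likely.
S = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event), len(S))

evens, odds = {2, 4, 6}, {1, 3, 5}

assert pr(evens) >= 0                            # Axiom 1: non-negativity
assert pr(S) == 1                                # Axiom 2: Pr(S) = 1
assert pr(evens | odds) == pr(evens) + pr(odds)  # Axiom 3 for disjoint events
```

Using `Fraction` keeps every probability an exact rational number, so equality checks like these are reliable.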

## Basic Theorems of Probability

Theorem 1: $Pr(\emptyset) = 0$. The probability associated with the null set is 0.

Proof: The sets $\emptyset$ and $S$ are disjoint ($\emptyset \cap S = \emptyset$) and $\emptyset \cup S = S$, so by Axiom 3

$$Pr(S) = Pr(\emptyset \cup S) = Pr(\emptyset) + Pr(S)$$

Subtracting $Pr(S)$ from both sides gives

$$Pr(\emptyset) = Pr(S) - Pr(S) = 0$$

Theorem 2: For any finite sequence of events $$C_1, C_2,…, C_n$$

$$Pr(\mathop{\cup}_{i = 1}^{n} C_i) = \sum_{i=1}^{n} Pr(C_i)$$

Proof:

Extend the finite sequence to an infinite one by setting $C_i = \emptyset$ for $i > n$. All the events in the infinite sequence are still disjoint, and since the tail events are empty, $$\mathop{\cup}_{i = 1}^{\infty} C_i = \left(\mathop{\cup}_{i = 1}^{n} C_i\right) \cup \left(\mathop{\cup}_{i = n+1}^{\infty} C_i\right) = \mathop{\cup}_{i = 1}^{n} C_i$$

Therefore

$$Pr(\mathop{\cup}_{i = 1}^{n} C_i) = Pr(\mathop{\cup}_{i = 1}^{\infty} C_i)$$

$$Pr(\mathop{\cup}_{i = 1}^{n} C_i) = \sum_{i=1}^{\infty} Pr(C_i) = \sum_{i=1}^{n} Pr(C_i) + \sum_{i=n+1}^{\infty} Pr(C_i)$$

$$Pr(\mathop{\cup}_{i = 1}^{n} C_i) = \sum_{i=1}^{n} Pr(C_i) + 0$$

Note: $$Pr(C_i) = Pr(\emptyset) = 0$$ for $i \geq n+1$ by construction and by Theorem 1.

$$Pr(\mathop{\cup}_{i = 1}^{n} C_i) = \sum_{i=1}^{n} Pr(C_i)$$
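Finite additivity can be checked on the same kind of finite example (again a toy illustration of my own, with made-up events `C1`, `C2`, `C3`):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

# Three pairwise disjoint events:
C1, C2, C3 = {1}, {2, 3}, {4, 5}

lhs = pr(C1 | C2 | C3)          # probability of the union
rhs = pr(C1) + pr(C2) + pr(C3)  # sum of the individual probabilities
assert lhs == rhs == Fraction(5, 6)
```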

## Extended Theorems of Probability

Theorem 3:
For any event $C \subset S$,
$$Pr(C^c) = 1 - Pr(C)$$

Proof:
$$S = C \cup C^c$$
$$\emptyset = C \cap C^c$$

Since $C$ and $C^c$ are disjoint, Axioms 2 and 3 give

$$Pr(S) = Pr(C) + Pr(C^c) = 1$$

$$Pr(C^c) = Pr(S) - Pr(C)$$

$$Pr(C^c) = 1 - Pr(C)$$
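The complement rule is easy to see on the die example (my own illustration; `C` here is an arbitrary event I picked):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

C = {1, 2}
# The set difference S - C is exactly the complement of C within S.
assert pr(S - C) == 1 - pr(C)   # Pr(C^c) = 1 - Pr(C)
```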

Theorem 4:
For any event $C \subset S$, $0 \leq Pr(C) \leq 1$

Proof: By Axiom 1, $Pr(C) \geq 0$. By Theorem 3, $Pr(C) = 1 - Pr(C^c)$, and since $Pr(C^c) \geq 0$ by Axiom 1,

$$Pr(C) \leq 1$$

Combining the two bounds gives

$$0 \leq Pr(C) \leq 1$$

Theorem 5: If $C_1$ and $C_2$ are subsets of $S$ such that $C_1 \subset C_2$, then $Pr(C_1) \leq Pr(C_2)$

Proof: Write $C_2 = C_1 \cup [C_1^c \cap C_2]$ and note that the two pieces are disjoint: $C_1 \cap [C_1^c \cap C_2] = \emptyset$

Hence
$$Pr(C_2) = Pr(C_1) + Pr(C_1^c \cap C_2)$$

Note: $Pr(C^c_1 \cap C_2) \geq 0$ by Axiom 1

$$Pr(C_2) \geq Pr(C_1)$$

Theorem 6: If $C_1$ and $C_2$ are subsets of $S$ then

$$Pr(C_1 \cup C_2) = Pr(C_1) + Pr(C_2) - Pr(C_1 \cap C_2)$$

Proof:
$$C_1 \cup C_2 = C_1 \cup (C_1^c \cap C_2)$$
$$C_2 = (C_1 \cap C_2) \cup (C_1^c \cap C_2)$$

Both right-hand sides are unions of disjoint sets, so by Axiom 3

$$Pr(C_1 \cup C_2) = Pr(C_1) + Pr(C_1^c \cap C_2)$$

$$Pr(C_2) = Pr(C_1 \cap C_2) + Pr(C_1^c \cap C_2)$$

Solving the second equation for $Pr(C_1^c \cap C_2)$ and substituting into the first:

$$Pr(C_1^c \cap C_2) = Pr(C_2) - Pr(C_1 \cap C_2)$$

$$Pr(C_1 \cup C_2) = Pr(C_1) + Pr(C_2) - Pr(C_1 \cap C_2)$$
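Inclusion-exclusion also checks out numerically on overlapping events (my own example, with `C1` and `C2` sharing the outcome 3):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

C1, C2 = {1, 2, 3}, {3, 4}      # overlapping events
# Adding pr(C1) + pr(C2) counts the shared outcome 3 twice,
# so the intersection must be subtracted once.
assert pr(C1 | C2) == pr(C1) + pr(C2) - pr(C1 & C2)
```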

The extended theorems seemed heavy to me when I first looked at them, but I’ve tried to include all the relevant steps so the picture is clearer. Take some time and follow along with a pencil and paper and it should become apparent. I also find drawing pictures helps with seeing how sets interact with each other. It should be noted that most of the theorems listed here are the ones most people studying probability meet in school, and this is no accident. When these ideas were presented to me in school, they were presented as rules to apply and to be taken for granted. Now we see that these ideas have a firm basis in set theory.

## Conditional Probability

Often when we apply probability in life there is a question about how connected different events are. If I’m nice to a girl, tell a hilarious joke, and then ask her on a date, are my chances of a yes the same as if I had just asked her on a date without the social lubrication? The answer is no, but if we abstract from this painful experience we can ask a general question – what is the probability of some event $B$ given that another event $A$ has occurred? This is the type of question that conditional probability helps to answer.

Conditional Probability is defined as

$$Pr(C_2|C_1) = \frac{Pr(C_1 \cap C_2)}{Pr(C_1)}$$

In plain English, the identity above states that the probability of event $C_2$ occurring given that $C_1$ has occurred is the probability that both events occur (their intersection) divided by the probability of event $C_1$ (which must be greater than 0).

Furthermore we have the following properties:

1. $$Pr(C_2|C_1) \geq 0$$
2. $$Pr(C_2 \cup C_3 \cup … | C_1) = Pr(C_2|C_1) + Pr(C_3|C_1) + …$$ for disjoint events $C_2, C_3, …$
3. $$Pr(C_1|C_1) = 1$$
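The definition and property 3 can be checked directly on the die example (my own sketch; `pr_given` is a helper name I made up):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

def pr_given(c2, c1):
    """Pr(c2 | c1) = Pr(c1 ∩ c2) / Pr(c1); requires Pr(c1) > 0."""
    return pr(c1 & c2) / pr(c1)

evens, high = {2, 4, 6}, {4, 5, 6}
assert pr_given(high, evens) == Fraction(2, 3)  # two of the three evens are "high"
assert pr_given(evens, evens) == 1              # property 3: Pr(C1|C1) = 1
```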

## Law of Total Probability

There is no rule or idea which has helped me understand the basics of probability more than the law of total probability. When I first studied this topic it was practically all I remembered, and it got me pretty far. In other words – this is important.

Let the sample space $S$ be partitioned into $k$ mutually exclusive and exhaustive events $$C_1,…,C_k$$ such that $$S = C_1 \cup C_2 \cup … \cup C_k$$ The probability of each piece of the partition is denoted by $$Pr(C_i), \ i=1,…,k$$ Diagrammatically, picture $S$ as a rectangle sliced into $k$ pieces $C_1,…,C_k$.

Let $A$ be another event in the sample space such that $Pr(A) > 0$. Picture $A$ as an oval drawn inside the rectangle representing $S$, overlapping several pieces of the partition.

Therefore, the event $A$ intersects the sample space $S$ in the following way

$$A = A \cap S$$

$$A = A \cap [C_1 \cup … \cup C_k]$$

$$A = (A \cap C_1) \cup (A \cap C_2) \cup … \cup (A \cap C_k)$$

Since the events $A \cap C_i$ are mutually exclusive, Theorem 2 gives

$$Pr(A) = Pr(A \cap C_1) + Pr(A \cap C_2) + … + Pr(A \cap C_k) = \sum_{i=1}^{k} Pr(A \cap C_i)$$

Using the definition of conditional probability we decompose $Pr(A \cap C_i)$ as

$$Pr(A \cap C_i) = Pr(A|C_i)Pr(C_i)$$

Law of Total Probability: $Pr(A) = \sum_{i=1}^{k} Pr(A \cap C_i) = \sum_{i=1}^{k} Pr(A|C_i)Pr(C_i)$
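Here is the law of total probability verified on the die example (my own sketch; the partition and the event `A` are arbitrary choices):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

def pr_given(a, c):
    return pr(a & c) / pr(c)

partition = [{1, 2}, {3, 4}, {5, 6}]  # mutually exclusive and exhaustive
A = {2, 3, 5}

# Sum Pr(A|Ci)·Pr(Ci) over the partition and compare with Pr(A) directly.
total = sum(pr_given(A, Ci) * pr(Ci) for Ci in partition)
assert total == pr(A) == Fraction(1, 2)
```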

## Bayes Theorem

Now that we have the Law of Total Probability we can discuss one of the most powerful theorems in all of probability – Bayes Theorem. Its power lies in its exceptionally wide range of applications, and it can be derived by applying the Law of Total Probability to the definition of conditional probability.

$$Pr(C_j | C_i) = \frac{Pr(C_i \cap C_j)}{Pr(C_i)}$$

Note 1: $Pr(C_i) = \sum_{m=1}^{k} Pr(C_m)Pr(C_i|C_m)$ by the Law of Total Probability (the summation index is written as $m$ to avoid clashing with the fixed index $j$)

Note 2: We can also write $Pr(C_i \cap C_j) = Pr(C_i|C_j)Pr(C_j)$ by applying the definition of conditional probability with the roles of the two events swapped (move the denominator to the other side of the equals sign)

Hence Bayes Theorem is

$$Pr(C_j|C_i) = \frac{Pr(C_i|C_j)Pr(C_j)}{Pr(C_i)} = \frac{Pr(C_i|C_j)Pr(C_j)}{\sum_{m=1}^{k} Pr(C_m)Pr(C_i|C_m)}$$
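Bayes Theorem can be checked against the direct definition of conditional probability on the die example (my own sketch; `bayes` is a helper name I made up):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

def pr_given(a, c):
    return pr(a & c) / pr(c)

def bayes(Cj, A, partition):
    """Pr(Cj | A) with the denominator expanded by total probability."""
    denom = sum(pr_given(A, Cm) * pr(Cm) for Cm in partition)
    return pr_given(A, Cj) * pr(Cj) / denom

partition = [{1, 2}, {3, 4}, {5, 6}]
A = {2, 3, 5}
# Bayes must agree with computing Pr(Cj | A) directly from the definition:
assert bayes({1, 2}, A, partition) == pr_given({1, 2}, A) == Fraction(1, 3)
```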

## Independent Events

One final note to make is about independence. Not all events in the world are connected to or impact each other. There is a philosophical debate about this – some argue that at the micro scale all things are interconnected. Although this is an attractive proposition, it ruins the point I want to make, so I will ignore it. By definition, two events $C_1$ and $C_2$ are independent if

$$Pr(C_1 \cap C_2) = Pr(C_1) \times Pr(C_2)$$

If our events are independent then many of the results mentioned in this post simplify. For example, conditional probability between independent events becomes

$$Pr(C_1|C_2) = Pr(C_1)$$
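On a fair die, "the roll is even" and "the roll is 1 or 2" happen to be independent, which makes a nice numeric check (my own example):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # a fair six-sided die

def pr(event):
    return Fraction(len(event), len(S))

def pr_given(a, c):
    return pr(a & c) / pr(c)

evens, low = {2, 4, 6}, {1, 2}
# Independence: Pr(evens ∩ low) = Pr({2}) = 1/6 = (1/2)(1/3)
assert pr(evens & low) == pr(evens) * pr(low)
# So conditioning on low changes nothing:
assert pr_given(evens, low) == pr(evens)
```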

This concludes this post about the axioms of probability and some of the basic results that emerge from them. These results are important for developing the theory of random variables, which is essential when applying probability to real-world situations.