Probability theory

part I

Eva Freyhult

NBIS, SciLifeLab

April 24, 2023

Probability

Probability describes how likely an event, \(E\), is to happen.

Axioms of probability

  1. \(0 \leq P(E) \leq 1\)
  2. \(P(S) = 1\)
  3. If \(E\), \(F\) are disjoint events, then \(P(E \cup F) = P(E) + P(F)\)

A probability is always between 0 and 1, where 1 means that the event always happens, and 0 that it never happens.

The sample space, \(S\), is the set of all possible outcomes, and its total probability is always 1.

The probability of the union of two disjoint (non-overlapping) events is the sum of the probabilities of the two events.

Common rules of probability

Based on the axioms the following rules of probability can be proved.

  • Complement rule: let \(E'\) be the complement of \(E\), then \(P(E') = 1 - P(E)\)
  • Impossible event: \(P(\emptyset)=0\)
  • Probability of a subset: If \(E \subseteq F\) then \(P(F) \geq P(E)\)
  • Addition rule: \(P(E \cup F) = P(E) + P(F) - P(E \cap F)\)
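For example, the complement rule follows directly from the axioms: \(E\) and \(E'\) are disjoint and \(E \cup E' = S\), so \(P(E) + P(E') = P(S) = 1\).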

The urn model

[Figure: urn models illustrating a fair coin, age, and pollen allergy.]

By drawing balls from the urn, with or without replacement, probabilities and other properties of the model can be inferred.
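A minimal sketch in R of drawing from an urn (here a hypothetical urn with 3 red and 7 blue balls), with or without replacement:

    # Hypothetical urn with 3 red and 7 blue balls
    urn <- c(rep("red", 3), rep("blue", 7))
    # Draw 5 balls without replacement
    sample(urn, size = 5)
    # Draw 5 balls with replacement
    sample(urn, size = 5, replace = TRUE)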

Random variables

A random variable describes the outcome of a random experiment.

  • The weight of a random newborn baby, \(W\), \(P(W>4.0kg)\)
  • The smoking status of a random mother, \(S\), \(P(S=1)\)
  • The hemoglobin concentration in blood, \(Hb\), \(P(Hb<125 g/L)\)
  • The number of mutations in a gene, \(M\)
  • BMI of a random man, \(B\)
  • Weight status of a random man (underweight, normal weight, overweight, obese), \(W\)
  • The result of throwing a die, \(X\)

Random variables

A random variable describes the outcome of a random experiment.

  • Random variables: \(X, Y, Z, \dots\), in general denoted by a capital letter.

  • Probability: \(P(X=5)\), \(P(Z>0.34)\), \(P(W \geq 3.5 | S = 1)\)

  • Observations of the random variable, \(x, y, z, \dots\)

  • The sample space is the collection of all possible observation values.

  • The population is the collection of all possible observations.

  • A sample is a subset of the population.

Discrete random variables

A categorical random variable has nominal or ordinal outcomes, such as {red, blue, green} or {tiny, small, average, large, huge}.

A discrete random variable has a countable number of outcome values, such as {1, 2, 3, 4, 5, 6}, {0, 2, 4, 6, 8}, or all integers.

A discrete or categorical random variable can be described by its probability mass function (PMF).

The probability that the random variable, \(X\), takes the value \(x\) is denoted \(P(X=x) = p(x)\).

Example: a fair six-sided die

Possible outcomes: \(\{1, 2, 3, 4, 5, 6\}\)

Example: a fair six-sided die

The probability mass function:

x      1      2      3      4      5      6
p(x)   0.167  0.167  0.167  0.167  0.167  0.167

Example: Nucleotide at a given site

Table 1: Probability mass function of a nucleotide site.

x      A    C    T    G
p(x)   0.4  0.2  0.1  0.3

Example: Number of bacterial colonies

Expected value

The expected value is the average outcome of a random variable over many trials and is denoted \(E[X]\) or \(\mu\).

When the probability mass function is known, \(E[X]\) can be computed as follows:

\[E[X] = \mu = \sum_{i=1}^n x_i p(x_i),\] where \(n\) is the number of outcomes.

Alternatively, \(E[X]\) can be computed as the population mean, by summing over all \(N\) objects in the population:

\[E[X] = \mu = \frac{1}{N}\sum_{i=1}^N x_i\]
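As an illustration, the expected value of a fair six-sided die can be computed in R from its probability mass function:

    # Fair six-sided die: outcomes and probabilities
    x <- 1:6
    p <- rep(1/6, 6)
    # Expected value: sum of each outcome times its probability
    sum(x * p)  # 3.5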

Variance

The variance is a measure of spread and is defined as the expected value of the squared distance from the population mean:

\[var(X) = \sigma^2 = E[(X-\mu)^2] = \sum_{i=1}^n (x_i-\mu)^2 p(x_i)\]

Standard deviation

The standard deviation is the square root of the variance and is usually denoted \(\sigma\):

\[\sigma = \sqrt{E[(X-\mu)^2]} = \sqrt{\sum_{i=1}^n (x_i-\mu)^2 p(x_i)}\]

Alternatively, by summing over all \(N\) objects in the population:

\[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2}\]

The standard deviation is always non-negative and on the same scale as the outcome values.
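Continuing the die example, a sketch in R computing the variance and standard deviation from the same probability mass function:

    # Fair six-sided die: outcomes and probabilities
    x <- 1:6
    p <- rep(1/6, 6)
    mu <- sum(x * p)
    # Variance: expected squared distance from the mean
    sigma2 <- sum((x - mu)^2 * p)
    # Standard deviation: square root of the variance
    sigma <- sqrt(sigma2)
    c(sigma2, sigma)  # approximately 2.92 and 1.71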

Simulate distributions

Once a random variable's probability distribution is known, properties of interest can be computed, such as:

  • probabilities, e.g. \(P(X=a)\), \(P(X<a)\) and \(P(X \geq a)\)
  • expected value, \(E[X]\)
  • variance, \(\sigma^2\)
  • standard deviation, \(\sigma\)

If the distribution is not known, simulation might be the solution.

Simulate distributions

When rolling a single die, the probability of a six is 1/6.

When rolling 20 dice, what is the probability of at least 15 sixes?

The outcome of a single die roll is a random variable, \(X\), that can be described using an urn model.

Simulation in R!
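A minimal simulation sketch in R; the exact answer follows from the binomial distribution, introduced below:

    # Simulate 100000 rounds of rolling 20 dice, counting the sixes in each round
    set.seed(1)
    N <- 100000
    sixes <- replicate(N, sum(sample(1:6, size = 20, replace = TRUE) == 6))
    # Estimated probability of at least 15 sixes
    mean(sixes >= 15)
    # The event is so rare that the estimate is typically 0;
    # the exact probability is 1 - pbinom(14, size = 20, prob = 1/6)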

Parametric discrete distributions

  • Uniform
  • Bernoulli
  • Binomial
  • Poisson
  • Negative binomial
  • Geometric
  • Hypergeometric

Uniform

In a uniform distribution every possible outcome has the same probability.

With \(n\) different outcomes, the probability for each outcome is \(1/n\).
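For example, a roll of a fair six-sided die is uniform over \(\{1, \dots, 6\}\); a minimal sketch in R:

    # Ten draws from a discrete uniform distribution on 1..6
    sample(1:6, size = 10, replace = TRUE)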

Bernoulli

A Bernoulli trial is a random experiment with two outcomes: success (1) and failure (0).

The outcome of a Bernoulli trial is a discrete random variable, \(X\).

\[P(X=x) = p(x) = \left\{ \begin{array}{ll} p & \textrm{if } x=1 \textrm{ (success)}\\ 1-p & \textrm{if } x=0 \textrm{ (failure)} \end{array} \right.\]

Using the definitions of expected value and variance it can be shown that:

\[E[X] = p\\ var(X) = p(1-p)\]
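For example, directly from the definitions:

\[E[X] = 1 \cdot p + 0 \cdot (1-p) = p\]

\[var(X) = (1-p)^2 p + (0-p)^2 (1-p) = p(1-p)\]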

Binomial

The number of successes in a series of \(n\) independent and identical Bernoulli trials (\(Z_i\), with probability \(p\) for success) is a discrete random variable, \(X\).

\[X = \sum_{i=1}^n Z_i\]

Binomial

The probability mass function of \(X\), called the binomial distribution, is

\[P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\]

The expected value and variance:

\[E[X] = np\\ var(X) = np(1-p)\]

In R: pbinom to compute \(P(X \leq k)\) and dbinom to compute the pmf \(P(X=k)\).
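As a sketch, the earlier simulation question (at least 15 sixes in 20 rolls of a fair die) can be answered exactly with \(n=20\), \(p=1/6\):

    # P(X = 15): exactly 15 sixes in 20 rolls
    dbinom(15, size = 20, prob = 1/6)
    # P(X >= 15): at least 15 sixes in 20 rolls
    1 - pbinom(14, size = 20, prob = 1/6)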

Hypergeometric distribution

The hypergeometric distribution describes the number of successes in a series of \(n\) draws without replacement, from a population of size \(N\) with \(Np\) objects of interest (successes).

The probability mass function:

\[P(X=k) = \frac{\binom{Np}{k}\binom{N-Np}{n-k}}{\binom{N}{n}}\]

In R: phyper to compute \(P(X \leq k)\) and dhyper to compute the pmf \(P(X=k)\).
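A minimal example, assuming a population of size \(N=100\) with \(Np=30\) successes and \(n=20\) draws; note that R parameterizes the urn as m successes and n failures:

    # Urn with m = 30 successes and n = 70 failures; k = 20 draws without replacement
    dhyper(5, m = 30, n = 70, k = 20)   # P(X = 5)
    phyper(5, m = 30, n = 70, k = 20)   # P(X <= 5)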

Poisson distribution

The Poisson distribution describes the number of times a rare event (probability \(p\)) occurs in a large number (\(n\)) of trials.

Examples

  • A rare disease has a very low probability for a single individual. The number of individuals in a large population that catch the disease in a certain time period can be modelled using the Poisson distribution.
  • Number of reads aligned to a gene region.

Poisson distribution

The probability mass function:

\[P(X=k) = \frac{\lambda^k}{k!}e^{-\lambda}\]

\[E[X] = var(X) = \lambda = np\]

The Poisson distribution can approximate the binomial distribution if \(n\) is large and \(p\) is small (\(n>10\), \(p < 0.1\)).

In R: ppois to compute \(P(X \leq k)\) and dpois to compute the pmf \(P(X=k)\).
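A minimal sketch of the approximation, assuming \(n=100\) trials with \(p=0.05\), so \(\lambda = np = 5\):

    # Exact binomial probability of k = 3 events
    dbinom(3, size = 100, prob = 0.05)
    # Poisson approximation with lambda = n * p = 5
    dpois(3, lambda = 5)
    # Cumulative probability P(X <= 3)
    ppois(3, lambda = 5)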

Negative binomial

A negative binomial distribution describes the number of failures that occur before a specified number of successes (\(r\)) has occurred, in a sequence of independent and identically distributed Bernoulli trials.

\(r\) is also called the dispersion parameter.

In R: dnbinom, pnbinom, qnbinom
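A minimal example, assuming \(r=5\) required successes and success probability \(p=0.5\); in R, size is \(r\) and the random variable counts the failures:

    # P(X = 3): exactly 3 failures before the 5th success
    dnbinom(3, size = 5, prob = 0.5)
    # P(X <= 3): at most 3 failures before the 5th success
    pnbinom(3, size = 5, prob = 0.5)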

Geometric

The geometric distribution is a special case of the negative binomial distribution, where \(r=1\).

In R: dgeom, pgeom, qgeom
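For example, the number of non-sixes before the first six when rolling a fair die (\(p=1/6\)):

    # P(X = 2): exactly two failures before the first success
    dgeom(2, prob = 1/6)
    # P(X <= 2): at most two failures before the first success
    pgeom(2, prob = 1/6)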

Example PMFs

Figure 1: Probability mass functions for the binomial distribution (n=20, p=0.1, 0.3 or 0.5), hypergeometric distribution (N=100, n=20, p=0.1, 0.3 or 0.5), negative binomial distribution (n=20, r=n*p, p=0.1, 0.3 or 0.5) and Poisson distribution (n=20, p=0.1, 0.3 or 0.5).

In R

Probability mass functions, \(P(X=x)\): dbinom, dhyper, dpois, dnbinom and dgeom.

Cumulative distribution functions, \(P(X \leq x)\): pbinom, phyper, ppois, pnbinom and pgeom.

Also, quantile functions, which compute the smallest \(x\) such that \(P(X \leq x) \geq q\) for a probability \(q\) of interest, are available: qbinom, qhyper, qpois, qnbinom and qgeom.
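For example, the smallest \(k\) such that \(P(X \leq k) \geq 0.95\) for \(X \sim Bin(20, 1/6)\):

    # 95% quantile of a binomial distribution with n = 20, p = 1/6
    qbinom(0.95, size = 20, prob = 1/6)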