Probability theory, part II

Eva Freyhult

NBIS, SciLifeLab

April 24, 2023

Continuous random variable

A continuous random variable is not limited to discrete values; any value within one or several continuous ranges is possible.

Examples: weight, height, speed, intensity, …

A continuous random variable can be described by its probability density function, pdf.

Probability density function

The probability density function, \(f(x)\), is non-negative and the total area under its curve is 1:

\[ \int_{-\infty}^{\infty} f(x) dx = 1 \]

The area under the curve from \(a\) to \(b\) is the probability that the random variable \(X\) takes a value between \(a\) and \(b\).

\(P(a \leq X \leq b) = \int_a^b f(x) dx\)

Cumulative distribution function

The cumulative distribution function, cdf, \(F(x)\), is defined as:

\[F(x) = P(X \leq x) = \int_{-\infty}^x f(t) dt\]

As the total probability (over all \(x\)) is 1, it follows that \(P(X > x) = 1 - P(X \leq x) = 1 - F(x)\) and thus \(P(a < X \leq b) = F(b) - F(a)\).
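As a minimal sketch of these relations in R (a standard normal \(X\) is used purely as an example distribution):

```r
## P(a < X <= b) = F(b) - F(a), illustrated for a standard normal X
a <- -1; b <- 1
pnorm(b) - pnorm(a)                     # F(b) - F(a)
integrate(dnorm, lower = a, upper = b)  # same probability, via the pdf
1 - pnorm(b)                            # P(X > b) = 1 - F(b)
```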

Expected value and variance

Two important parameters of a distribution are the expected value, \(E(X) = \mu\), which describes the distribution's location, and the variance, \(\sigma^2\), which describes the spread.

The expected value, or population mean, is defined as:

\[E[X] = \mu = \int_{-\infty}^\infty x f(x) dx\]

The variance is defined as the expected value of the squared distance from the population mean:

\[\sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^\infty (x-\mu)^2 f(x) dx\]

The square root of the variance is called the standard deviation, \(\sigma\).
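A numerical check of these definitions in R; the exponential distribution with rate 2 is an arbitrary example, chosen because its mean (0.5) and variance (0.25) are known:

```r
## E[X] and sigma^2 via numerical integration of the pdf
f  <- function(x) dexp(x, rate = 2)                          # example pdf
mu <- integrate(function(x) x * f(x), 0, Inf)$value          # expected value
s2 <- integrate(function(x) (x - mu)^2 * f(x), 0, Inf)$value # variance
c(mean = mu, variance = s2, sd = sqrt(s2))                   # 0.5, 0.25, 0.5
```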

Expected value and variance

As for discrete random variables, if the population is finite, \(E(X)\) and \(\sigma^2\) can also be computed by summing over all \(N\) objects in the population.

\[E[X] = \mu = \frac{1}{N}\sum_{i=1}^N x_i\]

\[\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2\]
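A small made-up population of \(N = 8\) values illustrates the sums; note the division by \(N\), not \(N - 1\):

```r
## Population mean and variance for a finite population
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # all N = 8 objects (made-up values)
N  <- length(x)
mu <- sum(x) / N                  # population mean, here 5
s2 <- sum((x - mu)^2) / N         # population variance, here 4
```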

Normal distribution

The normal probability density function

\[f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\]

describes the distribution of a normal random variable, \(X\), with expected value \(\mu\) and standard deviation \(\sigma\). Here \(e\) and \(\pi\) are two common mathematical constants, \(e \approx 2.71828\) and \(\pi \approx 3.14159\).

In short we write \(X \sim N(\mu, \sigma)\).

The bell-shaped normal distribution is symmetric around \(\mu\), and \(f(x) \rightarrow 0\) as \(x \rightarrow \infty\) and as \(x \rightarrow -\infty\).
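A quick check in R that the built-in density function dnorm matches the formula above; the values of \(\mu\), \(\sigma\) and \(x\) are arbitrary:

```r
## The normal pdf written out, compared with dnorm
mu <- 2; sigma <- 0.5; x <- 1.3
1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)^2)
dnorm(x, mean = mu, sd = sigma)   # same value
```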

Normal distribution

As \(f(x)\) is well defined, values for the cumulative distribution function \(F(x) = \int_{-\infty}^x f(t) dt\) can be computed, although the integral has no closed-form solution.

Standard normal distribution

Using transformation rules we can define

\[Z = \frac{X-\mu}{\sigma}, \, Z \sim N(0,1)\]

Values for the cumulative standard normal distribution, \(F(z)\), are tabulated and easy to compute in R using the function pnorm.

Standard normal distribution

Properties of the standard normal distribution

\(P(Z \leq -z) = P(Z \geq z) = 1 - P(Z \leq z)\)

\(P(Z < z) = P(Z \leq z)\)

Some values of particular interest:

\[F(1.64) \approx 0.95\\ F(1.96) \approx 0.975\]

As the normal distribution is symmetric, \(F(-z) = 1 - F(z)\):

\[F(-1.64) \approx 0.05\\ F(-1.96) \approx 0.025\]

\[P(-1.96 < Z < 1.96) \approx 0.95\]
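The values above (and the table below) can be reproduced with pnorm, and inverted with qnorm; a minimal sketch:

```r
pnorm(1.96)                  # F(1.96), approx. 0.975
pnorm(-1.96)                 # F(-1.96) = 1 - F(1.96), approx. 0.025
pnorm(1.96) - pnorm(-1.96)   # P(-1.96 < Z < 1.96), approx. 0.95
qnorm(0.975)                 # the inverse: the z for which F(z) = 0.975
```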

Table 1: \(F(z) = P(Z \leq z)\) for the standard normal distribution. The first column gives \(z\) to one decimal place; the remaining columns give the second decimal.

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998

Sum of two normal random variables

If \(X \sim N(\mu_1, \sigma_1)\) and \(Y \sim N(\mu_2, \sigma_2)\) are two independent normal random variables, then their sum and difference are also normal random variables:

\[X + Y \sim N(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]

and

\[X - Y \sim N(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]
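A Monte Carlo sketch of the rule for the sum; the parameter values (\(\mu_1 = 3\), \(\sigma_1 = 2\), \(\mu_2 = 1\), \(\sigma_2 = 1\)) are arbitrary:

```r
## Simulated check: X + Y should be N(3 + 1, sqrt(2^2 + 1^2))
set.seed(1)
x <- rnorm(1e5, mean = 3, sd = 2)
y <- rnorm(1e5, mean = 1, sd = 1)
c(mean = mean(x + y), sd = sd(x + y))   # approx. 4 and sqrt(5) = 2.24
```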

Sum of many random variables

This can be extended to the case with \(n\) independent and identically distributed normal random variables \(X_i \sim N(\mu, \sigma)\):

\[\sum_{i=1}^n X_i \sim N(n\mu, \sqrt{n}\sigma)\]

If \(n\) is large enough, this holds approximately for any independent and identically distributed random variables, even if they are not normally distributed.

Theorem 1 (Central limit theorem) The sum of \(n\) independent and identically distributed random variables is approximately normally distributed, if \(n\) is large enough.

As a result of the central limit theorem, the distribution of proportions or mean values of a sample follows the normal distribution, at least if the sample is large enough (a rule of thumb is a sample size of \(n > 30\)).

Central limit theorem, example

A left-skewed distribution has a heavier left tail than right tail. An example might be age at death from natural causes.

Figure 1: A left-skewed distribution. It could, for example, show the distribution of age at death from natural causes in mice.

Randomly sample 3, 5, 10, 15, 20, 30 observations and compute their mean value, \(m\). Repeat many times to get the distribution of mean values.
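A sketch of this experiment in R; rpop is a hypothetical stand-in for the left-skewed population, and 10000 repetitions is an arbitrary choice:

```r
## Sampling distribution of the mean for increasing sample sizes
rpop  <- function(n) 2 - rexp(n, rate = 1)  # an arbitrary left-skewed population
sizes <- c(3, 5, 10, 15, 20, 30)
means <- sapply(sizes, function(n) replicate(10000, mean(rpop(n))))
## each column of `means` looks more normal as the sample size grows
```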

Central limit theorem, example

Figure 2: Distribution of sample means, where the means are computed based on random samples of sizes 3, 5, 10, 15, 20 and 30, respectively.

Note that the mean is just the sum divided by the number of observations, \(n\), so the central limit theorem applies to sample means as well.

\(\chi^2\)-distribution

The sum of the squares of \(n\) independent standard normal random variables, \(X_i \sim N(0,1)\), \[Y = \sum_{i=1}^n X_i^2\] is \(\chi^2\)-distributed with \(n\) degrees of freedom.

In short \(Y \sim \chi^2_{n}\).

Example: The sample variance \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar X)^2\) is such that \(\frac{(n-1)S^2}{\sigma^2}\) is \(\chi^2\) distributed with \(n-1\) degrees of freedom.
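A Monte Carlo check of the sample-variance example; \(n = 10\) and \(\sigma = 2\) are arbitrary choices:

```r
## (n - 1) S^2 / sigma^2 should be chi-squared with n - 1 df
set.seed(1)
n <- 10; sigma <- 2
y <- replicate(1e5, (n - 1) * var(rnorm(n, sd = sigma)) / sigma^2)
c(mean(y), n - 1)   # a chi-squared variable has mean equal to its df
```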

F-distribution

The ratio of two independent \(\chi^2\)-distributed variables, each divided by its degrees of freedom, is F-distributed.

Example: The ratio of two independent sample variances, computed from normal populations with the same variance, is F-distributed.

Figure 3: The F-distribution.
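A Monte Carlo sketch of the sample-variance example; the sample sizes are arbitrary, and both samples are drawn from the same \(N(0, 1)\) population so the variances are equal:

```r
## The ratio of two sample variances, compared with the F distribution
set.seed(1)
n1 <- 8; n2 <- 12
ratio <- replicate(1e5, var(rnorm(n1)) / var(rnorm(n2)))
quantile(ratio, 0.95)                 # simulated 95th percentile
qf(0.95, df1 = n1 - 1, df2 = n2 - 1)  # theoretical F quantile, should agree
```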

t-distribution

The ratio of a standard normal variable to the square root of an independent \(\chi^2\)-distributed variable divided by its degrees of freedom is t-distributed.

Example: The standardized sample mean, \(\frac{\bar X - \mu}{S/\sqrt{n}}\), is t-distributed with \(n-1\) degrees of freedom.
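A Monte Carlo check of the example; \(n = 5\), \(\mu = 10\) and \(\sigma = 3\) are arbitrary:

```r
## (xbar - mu) / (s / sqrt(n)) should be t-distributed with n - 1 df
set.seed(1)
n <- 5
tstat <- replicate(1e5, {
  x <- rnorm(n, mean = 10, sd = 3)
  (mean(x) - 10) / (sd(x) / sqrt(n))
})
quantile(tstat, 0.975)   # simulated quantile
qt(0.975, df = n - 1)    # theoretical t quantile, should agree
```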

Random sampling

In many (most) experiments it is not feasible to examine the entire population. Instead we study a random sample.

A random sample is a random subset of individuals from a population.

There are different techniques for performing random sampling, two common techniques are simple random sampling and stratified random sampling.

Simple random sampling

A simple random sample is a random subset of individuals from a population, where every individual has the same probability of being chosen.

Simple random sampling using an urn model:

Let every individual in the population be represented by a ball. The value on each ball is the measurement we are interested in, for example height, shoe size, hair color, healthy/sick, type of cancer/no cancer, blood glucose value, etc.

Draw \(n\) balls from the urn, without replacement, to get a random sample of size \(n\).
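The urn model translates directly to R's sample function; the healthy/sick population below is made up:

```r
## Simple random sampling without replacement from an urn
urn <- rep(c("healthy", "sick"), times = c(80, 20))  # hypothetical population
sample(urn, size = 10, replace = FALSE)              # one random sample, n = 10
```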

Stratified random sampling

In stratified random sampling the population is first divided into subpopulations based on important attributes, e.g. sex (male/female), age (young/middle aged/old) or BMI (underweight/normal weight/overweight/obese). Simple random sampling is then performed within each subpopulation.
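A base-R sketch of stratified sampling, assuming a made-up data frame pop with a sex column as the stratifying attribute:

```r
## Simple random sampling within each stratum
pop <- data.frame(id = 1:100,
                  sex = rep(c("male", "female"), each = 50))  # made-up population
strata  <- split(pop, pop$sex)                 # one subpopulation per sex
sampled <- lapply(strata,
                  function(s) s[sample(nrow(s), size = 5), ])
do.call(rbind, sampled)                        # 5 individuals from each stratum
```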


Sample properties

Summary statistics can be computed for a sample, such as the sum, proportion, mean and variance.

Sample proportion

The proportion of a population with a particular property is \(\pi\).

The number of individuals with the property in a simple random sample of size \(n\) is a random variable \(X\).

The proportion of individuals in a sample with the property is also a random variable:

\[P = \frac{X}{n}\] with expected value \[E[P] = \frac{E[X]}{n} = \frac{n\pi}{n} = \pi\]
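A simulated check that \(E[P] = \pi\); \(\pi = 0.3\) and \(n = 50\) are arbitrary, and the binomial draws approximate simple random sampling from a large population:

```r
## Sampling distribution of the sample proportion P = X / n
set.seed(1)
p <- replicate(1e5, mean(rbinom(50, size = 1, prob = 0.3)))
mean(p)   # approx. pi = 0.3
```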

Sample mean

For a particular sample of size \(n\), \(x_1, \dots, x_n\), the sample mean is

\[m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i.\]

Note that the mean of \(n\) independent identically distributed random variables, \(X_i\), is itself a random variable:

\[\bar X = \frac{1}{n}\sum_{i=1}^n X_i.\]

If \(X_i \sim N(\mu, \sigma)\), then \(\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\).

When we only have a sample, the sample mean \(m\) is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean, i.e. the average of the sample mean over many samples of size \(n\) is \(\mu\): \(E[\bar X] = \mu\).
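A simulated check of both results, with the arbitrary choices \(\mu = 5\), \(\sigma = 2\) and \(n = 25\):

```r
## Sampling distribution of the mean: N(mu, sigma / sqrt(n)) = N(5, 0.4)
set.seed(1)
m <- replicate(1e5, mean(rnorm(25, mean = 5, sd = 2)))
c(mean = mean(m), sd = sd(m))   # approx. 5 and 2 / sqrt(25) = 0.4
```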

Sample variance

The sample variance is computed as:

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2.\]

The sample variance, \(S^2\), is also a random variable, and it is possible to show that the sample variance is an unbiased estimate of the population variance, \(E[S^2] = \sigma^2\).
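A simulated illustration of the unbiasedness, and of why the denominator is \(n - 1\); \(\sigma = 2\) and \(n = 10\) are arbitrary:

```r
## var() divides by n - 1 and is unbiased; dividing by n underestimates
set.seed(1)
v <- replicate(1e5, var(rnorm(10, sd = 2)))
mean(v)                  # approx. sigma^2 = 4
mean(v) * (10 - 1) / 10  # biased version (divide by n): approx. 3.6
```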

Sampling distribution

A sampling distribution is a probability distribution of a sample property. A sampling distribution is obtained by sampling many times from the studied population.

Sample estimates of mean and variance are unbiased, but not perfect.

The standard deviation of the sampling distribution is called the standard error.

Standard error of the mean

For the sample mean, \(\bar X\), the variance is

\[E[(\bar X - \mu)^2] = \mathrm{var}(\bar X) = \frac{\sigma^2}{n}\] The standard error of the mean is thus:

\[SEM = \frac{\sigma}{\sqrt{n}}\] Replacing \(\sigma\) with the sample standard deviation, \(s\), we get an estimate of the standard error of the mean:

\[SEM \approx \frac{s}{\sqrt{n}}\]
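A sketch relating the two: the SEM estimated from a single sample should be close to the standard deviation of the simulated sampling distribution of the mean (here \(\sigma = 2\) and \(n = 25\), so \(SEM = 0.4\)):

```r
## Estimated SEM from one sample vs. the sd of many sample means
set.seed(1)
x <- rnorm(25, mean = 5, sd = 2)
sd(x) / sqrt(25)                           # SEM estimate from this sample
sd(replicate(1e5, mean(rnorm(25, 5, 2))))  # approx. sigma / sqrt(n) = 0.4
```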