NBIS, SciLifeLab
April 24, 2023
A continuous random variable is not limited to discrete values; it can take any value within one or several intervals.
Examples: weight, height, speed, intensity, …
A continuous random variable can be described by its probability density function, pdf, \(f(x)\). The pdf is non-negative and the total area under the curve is 1.
\[ \int_{-\infty}^{\infty} f(x) dx = 1 \]
The area under the curve from \(a\) to \(b\) is the probability that the random variable \(X\) takes a value between \(a\) and \(b\).
\(P(a \leq X \leq b) = \int_a^b f(x) dx\)
The cumulative distribution function, cdf, \(F(x)\), is defined as:
\[F(x) = P(X \leq x) = \int_{-\infty}^x f(t) dt\]
As the total probability (over all x) is 1, it follows that \(P(X > x) = 1 - P(X \leq x) = 1 - F(x)\) and thus \(P(a < X \leq b) = F(b) - F(a)\).
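The relations above can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical density \(f(x) = 2x\) on \([0, 1]\) (zero elsewhere), which is not part of the course material, and approximates the integrals with a simple midpoint rule. The helper names `integrate` and `F` are made up for this example.

```python
def f(x):
    """Hypothetical pdf used for illustration: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(func, lower, upper, steps=100_000):
    """Approximate the integral of func from lower to upper (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def F(x):
    """cdf, P(X <= x). The density is zero below 0, so integrate from 0."""
    if x <= 0.0:
        return 0.0
    return integrate(f, 0.0, min(x, 1.0))


total_area = integrate(f, 0.0, 1.0)   # total probability, should be close to 1
a, b = 0.2, 0.5
p_direct = integrate(f, a, b)         # P(a <= X <= b) as an area under the pdf
p_from_cdf = F(b) - F(a)              # the same probability via the cdf
p_upper = 1.0 - F(b)                  # P(X > b) = 1 - F(b)

print(total_area, p_direct, p_from_cdf, p_upper)
```

For this density \(F(x) = x^2\) on \([0, 1]\), so both routes to \(P(0.2 < X \leq 0.5)\) should agree with \(0.5^2 - 0.2^2 = 0.21\).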
Two important parameters of a distribution are the expected value, \(E(X) = \mu\), which describes the distribution's location, and the variance, \(\sigma^2\), which describes the spread.
The expected value, or population mean, is defined as:
\[E[X] = \mu = \int_{-\infty}^\infty x f(x) dx\]
The variance is defined as the expected value of the squared distance from the population mean:
\[\sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^\infty (x-\mu)^2 f(x) dx\]
The square root of the variance is called the standard deviation, \(\sigma\).
As for discrete random variables, if the population is finite, \(E(X)\) and \(\sigma^2\) can also be computed by summing over all \(N\) objects in the population:
\[E[X] = \mu = \frac{1}{N}\sum_{i=1}^N x_i\]
\[\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2\]
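As a small worked example of the two population formulas, the sketch below uses a made-up population of \(N = 8\) values (an assumption for illustration only) and compares the hand-written sums with the population functions in Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical finite population of N = 8 measurements (made-up values).
population = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(population)

mu = sum(population) / N                              # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance (divide by N)
sigma = math.sqrt(sigma2)                             # population standard deviation

# The standard library implements the same population formulas.
lib_mu = statistics.fmean(population)
lib_sigma2 = statistics.pvariance(population)

print(mu, sigma2, sigma)
print(lib_mu, lib_sigma2)
```

Note the divisor \(N\): these are population quantities, not estimates from a sample.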
The normal probability density function
\[f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\]
describes the distribution of a normal random variable, \(X\), with expected value \(\mu\) and standard deviation \(\sigma\). Here \(e\) and \(\pi\) are two common mathematical constants, \(e \approx 2.71828\) and \(\pi \approx 3.14159\).
In short we write \(X \sim N(\mu, \sigma)\).
The bell-shaped normal distribution is symmetric around \(\mu\), and \(f(x) \rightarrow 0\) as \(x \rightarrow \infty\) and as \(x \rightarrow -\infty\).
As \(f(x)\) is well defined, values for the cumulative distribution function \(F(x) = \int_{- \infty}^x f(t) dt\) can be computed.
Using transformation rules we can define
\[Z = \frac{X-\mu}{\sigma}, \, Z \sim N(0,1)\]
Values for the cumulative standard normal distribution, \(F(z)\), are tabulated and easy to compute in R using the function `pnorm`.
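To illustrate the standardisation step, here is a minimal Python sketch using `statistics.NormalDist` from the standard library, whose `cdf` method plays the role that `pnorm` plays in R. The parameter values (\(\mu = 170\), \(\sigma = 10\), \(x = 185\)) are hypothetical and chosen only for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical example: X ~ N(mu = 170, sigma = 10), e.g. a height in cm.
mu, sigma = 170.0, 10.0
x = 185.0

z = (x - mu) / sigma                       # standardise: Z = (X - mu) / sigma
p_via_z = standard_normal.cdf(z)           # F(z) for the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)    # P(X <= x) without standardising

print(z, p_via_z, p_direct)
```

Both routes give the same probability, which is why a single table for \(N(0,1)\) is enough for any normal distribution.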
Properties of the standard normal distribution
\(P(Z \leq -z) = P(Z \geq z) = 1 - P(Z \leq z)\)
\(P(Z < z) = P(Z \leq z)\)
Some values of particular interest:
\[F(1.64) \approx 0.95\\ F(1.96) \approx 0.975\]
As the normal distribution is symmetric, \(F(-z) = 1 - F(z)\):
\[F(-1.64) \approx 0.05\\ F(-1.96) \approx 0.025\]
\[P(-1.96 < Z < 1.96) = 0.95\]
z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
---|---|---|---|---|---|---|---|---|---|---|
0.0 | 0.5000 | 0.5040 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359 |
0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753 |
0.2 | 0.5793 | 0.5832 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141 |
0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517 |
0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879 |
0.5 | 0.6915 | 0.6950 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224 |
0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549 |
0.7 | 0.7580 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852 |
0.8 | 0.7881 | 0.7910 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133 |
0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389 |
1.0 | 0.8413 | 0.8438 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621 |
1.1 | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830 |
1.2 | 0.8849 | 0.8869 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015 |
1.3 | 0.9032 | 0.9049 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177 |
1.4 | 0.9192 | 0.9207 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9279 | 0.9292 | 0.9306 | 0.9319 |
1.5 | 0.9332 | 0.9345 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9429 | 0.9441 |
1.6 | 0.9452 | 0.9463 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545 |
1.7 | 0.9554 | 0.9564 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633 |
1.8 | 0.9641 | 0.9649 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9699 | 0.9706 |
1.9 | 0.9713 | 0.9719 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9761 | 0.9767 |
2.0 | 0.9772 | 0.9778 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817 |
2.1 | 0.9821 | 0.9826 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857 |
2.2 | 0.9861 | 0.9864 | 0.9868 | 0.9871 | 0.9875 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890 |
2.3 | 0.9893 | 0.9896 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916 |
2.4 | 0.9918 | 0.9920 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936 |
2.5 | 0.9938 | 0.9940 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952 |
2.6 | 0.9953 | 0.9955 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964 |
2.7 | 0.9965 | 0.9966 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974 |
2.8 | 0.9974 | 0.9975 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981 |
2.9 | 0.9981 | 0.9982 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986 |
3.0 | 0.9987 | 0.9987 | 0.9987 | 0.9988 | 0.9988 | 0.9989 | 0.9989 | 0.9989 | 0.9990 | 0.9990 |
3.1 | 0.9990 | 0.9991 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.9992 | 0.9992 | 0.9993 | 0.9993 |
3.2 | 0.9993 | 0.9993 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.9995 | 0.9995 |
3.3 | 0.9995 | 0.9995 | 0.9995 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
3.4 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9998 |
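The table gives \(F(z)\) for \(z\) equal to the row label plus the column label. As a hedged sketch of how such a row can be reproduced, the Python code below recomputes the row for 1.9 with `statistics.NormalDist` and also looks up the values of particular interest. The helper name `table_row` is made up for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)


def table_row(first_decimal):
    """Return F(z) for z = first_decimal + 0.00, 0.01, ..., 0.09 (4 decimals)."""
    return [round(Z.cdf(first_decimal + k / 100), 4) for k in range(10)]


row_19 = table_row(1.9)
f_196 = row_19[6]                        # row 1.9, column 0.06, i.e. F(1.96)
f_minus_196 = 1 - Z.cdf(1.96)            # symmetry: F(-z) = 1 - F(z)
central = Z.cdf(1.96) - Z.cdf(-1.96)     # P(-1.96 < Z < 1.96)
z_95 = Z.inv_cdf(0.95)                   # the z for which F(z) = 0.95

print(row_19)
print(f_196, f_minus_196, central, z_95)
```

The inverse lookup shows that the exact 95% quantile lies between the tabulated points 1.64 and 1.65.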
If \(X \sim N(\mu_1, \sigma_1)\) and \(Y \sim N(\mu_2, \sigma_2)\) are two independent normal random variables, then their sum is also a normal random variable:
\[X + Y \sim N(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]
and
\[X - Y \sim N(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]
This can be extended to the case with \(n\) independent and identically distributed normal random variables \(X_i \sim N(\mu, \sigma)\).
\[\sum_{i=1}^n X_i \sim N(n\mu, \sqrt{n}\sigma)\]
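The rules for sums and differences can be tried out directly. In this sketch (all parameter values are hypothetical), `statistics.NormalDist` objects are combined with `+` and `-`, which the standard library defines for independent normal variables, and the result for a sum of \(n\) variables is checked by simulation.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

# Hypothetical parameters, chosen only for illustration.
X = NormalDist(mu=10.0, sigma=3.0)
Y = NormalDist(mu=4.0, sigma=4.0)

# NormalDist supports + and - for independent normal variables.
total = X + Y         # N(14, sqrt(3^2 + 4^2)) = N(14, 5)
difference = X - Y    # N(6, 5): the means subtract, the variances still add

# Simulation check for the sum of n i.i.d. N(mu, sigma) variables.
rng = random.Random(1)
mu, sigma, n = 2.0, 1.5, 9
sums = [sum(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(20_000)]

print(total, difference)
print(fmean(sums), n * mu)                  # simulated vs theoretical mean
print(stdev(sums), math.sqrt(n) * sigma)    # simulated vs theoretical sd
```

Note that the standard deviation of the sum grows with \(\sqrt{n}\), not with \(n\).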
If \(n\) is large enough this holds approximately for any independent and identically distributed random variables with finite variance, even if they are not normally distributed.
Theorem 1 (Central limit theorem) The sum of \(n\) independent and identically distributed random variables with finite variance is approximately normally distributed, if \(n\) is large enough.
As a result of the central limit theorem, the distribution of proportions or mean values of a sample approximately follows the normal distribution, at least if the sample is large enough (a rule of thumb is that the sample size \(n>30\)).
A left skewed distribution has a heavier left tail than right tail. An example might be age at death of natural causes.
Randomly sample 3, 5, 10, 15, 20, 30 observations and compute their mean value, \(m\). Repeat many times to get the distribution of mean values.
Note that the mean is just the sum divided by the number of observations, \(n\).
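The sampling experiment described above can be sketched in a few lines of Python. The left-skewed distribution used here, Beta(5, 1), is an assumption made for illustration and is not the distribution used in the course. The sample skewness of the mean values is printed for each sample size; it should move towards 0 (the value for a symmetric distribution) as \(n\) grows.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Sample skewness: mean cubed deviation divided by the sd cubed."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(42)


def draw():
    """One observation from a hypothetical left-skewed distribution, Beta(5, 1)."""
    return rng.betavariate(5, 1)


skew_by_n = {}
for n in (3, 5, 10, 15, 20, 30):
    # Repeat many times: sample n observations and store their mean value.
    means = [fmean(draw() for _ in range(n)) for _ in range(3_000)]
    skew_by_n[n] = skewness(means)
    print(n, round(fmean(means), 3), round(skew_by_n[n], 2))
```

A histogram of `means` for each \(n\) would show the same thing visually: the distribution of mean values becomes more bell-shaped as the sample size increases.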
The sum of squares of \(n\) independent standard normal random variables, \(X_i \sim N(0,1)\), \[Y = \sum_{i=1}^n X_i^2\] is \(\chi^2\) distributed with \(n\) degrees of freedom.
In short \(Y \sim \chi^2_{n}\).
Example: The sample variance \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar X)^2\) is such that \(\frac{(n-1)S^2}{\sigma^2}\) is \(\chi^2\) distributed with \(n-1\) degrees of freedom.
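A \(\chi^2\)-distributed variable has an expected value equal to its degrees of freedom, which gives a simple simulation check. The sketch below (with hypothetical values \(n = 5\) and \(\sigma = 2\)) compares the sum of squares of \(n\) independent \(N(0,1)\) variables, whose mean should be close to \(n\), with \(\frac{(n-1)S^2}{\sigma^2}\), whose mean should be close to \(n-1\).

```python
import random
from statistics import fmean

rng = random.Random(7)
n, sigma, reps = 5, 2.0, 20_000

# Sum of squares of n independent N(0, 1) variables.
sum_sq = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(reps)]

# (n - 1) S^2 / sigma^2 for samples of size n from N(0, sigma).
scaled_var = []
for _ in range(reps):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    m = fmean(sample)
    s2 = sum((x - m) ** 2 for x in sample) / (n - 1)   # sample variance
    scaled_var.append((n - 1) * s2 / sigma**2)

mean_sum_sq = fmean(sum_sq)            # expected to be close to n
mean_scaled_var = fmean(scaled_var)    # expected to be close to n - 1
print(mean_sum_sq, mean_scaled_var)
```

One degree of freedom is lost in the second case because the deviations are taken from the sample mean \(\bar X\) rather than from the known \(\mu\).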
The ratio of two independent \(\chi^2\)-distributed variables, each divided by its degrees of freedom, is F-distributed.
Example: The ratio of the sample variances of two independent samples from normal distributions with equal variance is F-distributed.
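The ratio of a standard normal variable and the square root of an independent \(\chi^2\)-distributed variable divided by its degrees of freedom is t-distributed.
Example: For a sample of size \(n\) from a normal distribution, \(\frac{\bar X - \mu}{S/\sqrt{n}}\) is t-distributed with \(n-1\) degrees of freedom.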
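A practical consequence of the t-distribution is heavier tails than the standard normal when the sample is small. The following sketch (hypothetical values \(\mu = 50\), \(\sigma = 8\), \(n = 5\)) simulates the statistic \(T = \frac{\bar X - \mu}{S/\sqrt{n}}\) and reports how often \(|T| > 1.96\); for a standard normal variable that fraction would be about 0.05.

```python
import math
import random
from statistics import fmean, stdev

rng = random.Random(3)
mu, sigma, n, reps = 50.0, 8.0, 5, 20_000

t_values = []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    # Standardise the sample mean with the *estimated* standard deviation s.
    t = (fmean(sample) - mu) / (stdev(sample) / math.sqrt(n))
    t_values.append(t)

# Fraction of simulated statistics outside +/- 1.96.
tail_fraction = sum(abs(t) > 1.96 for t in t_values) / reps
print(tail_fraction)
```

The fraction comes out clearly above 0.05, reflecting the extra uncertainty from estimating \(\sigma\) with \(s\).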
In many (most) experiments it is not feasible to examine the entire population. Instead we study a random sample.
A random sample is a random subset of individuals from a population.
There are different techniques for performing random sampling; two common techniques are simple random sampling and stratified random sampling.
A simple random sample is a random subset of individuals from a population, where every individual has the same probability of being chosen.
Simple random sampling using an urn model:
Let every individual in the population be represented by a ball. The value on each ball is the measurement we are interested in, for example height, shoe size, hair color, healthy/sick, type of cancer/no cancer, blood glucose value, etc.
Draw \(n\) balls from the urn, without replacement, to get a random sample of size \(n\).
In stratified random sampling the population is first divided into subpopulations based on important attributes, e.g. sex (male/female), age (young/middle aged/old) or BMI (underweight/normal weight/overweight/obese). Simple random sampling is then performed within each subpopulation.
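Both sampling techniques can be sketched with Python's `random.sample`, which draws without replacement. The population below (100 individuals labelled by sex) is made up for illustration, and the proportional allocation of the stratified sample (stratum sample size proportional to stratum size) is an assumption of this sketch, not something prescribed above.

```python
import random

rng = random.Random(2023)

# Hypothetical population: (id, sex) pairs; 60 females and 40 males.
population = [(i, "female" if i < 60 else "male") for i in range(100)]

# Simple random sample of size n, drawn without replacement (the urn model).
n = 10
simple_sample = rng.sample(population, n)

# Stratified random sample: divide the population into strata first ...
strata = {}
for individual in population:
    strata.setdefault(individual[1], []).append(individual)

# ... then perform simple random sampling within each stratum,
# here with sample sizes proportional to the stratum sizes (an assumption).
stratified_sample = []
for sex, members in strata.items():
    k = round(n * len(members) / len(population))
    stratified_sample.extend(rng.sample(members, k))

print(simple_sample)
print(stratified_sample)
```

With simple random sampling the number of females in the sample varies from draw to draw, while the stratified sample always contains 6 females and 4 males.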
Summary statistics can be computed for a sample, such as the sum, proportion, mean and variance.
The proportion of a population with a particular property is \(\pi\).
The number of individuals with the property in a simple random sample of size \(n\) is a random variable \(X\).
The proportion of individuals in a sample with the property is also a random variable:
\[P = \frac{X}{n}\]
with expected value
\[E[P] = \frac{E[X]}{n} = \frac{n\pi}{n} = \pi\]
For a particular sample of size \(n\), \(x_1, \dots, x_n\), the sample mean is
\[m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i.\]
Note that the mean of \(n\) independent identically distributed random variables, \(X_i\), is itself a random variable:
\[\bar X = \frac{1}{n}\sum_{i=1}^n X_i.\]
If \(X_i \sim N(\mu, \sigma)\), then \(\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\).
When we only have a sample, the sample mean \(m\) is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean: the average of the sample mean over many samples of size \(n\) is \(\mu\), i.e. \(E[\bar X] = \mu\).
The sample variance is computed as:
\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2.\]
The sample variance, \(S^2\), is also a random variable and it is possible to show that the sample variance is an unbiased estimate of the population variance.
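Unbiasedness can be illustrated by simulation. The sketch below (hypothetical values \(\mu = 0\), \(\sigma = 2\), \(n = 5\)) draws many samples, computes the sample mean and the sample variance for each, and averages them. For comparison it also averages the variance computed with divisor \(n\) instead of \(n-1\), which systematically underestimates \(\sigma^2\).

```python
import random
from statistics import fmean

rng = random.Random(11)
mu, sigma, n, reps = 0.0, 2.0, 5, 20_000

means, var_unbiased, var_biased = [], [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    m = fmean(sample)
    ss = sum((x - m) ** 2 for x in sample)   # sum of squared deviations
    means.append(m)
    var_unbiased.append(ss / (n - 1))        # sample variance s^2
    var_biased.append(ss / n)                # divisor n, for comparison only

avg_mean = fmean(means)            # should be close to mu
avg_s2 = fmean(var_unbiased)       # should be close to sigma^2 = 4
avg_biased = fmean(var_biased)     # should be close to (n - 1) / n * sigma^2 = 3.2
print(avg_mean, avg_s2, avg_biased)
```

Any single sample still gives an estimate that differs from the population value; unbiased only means that the estimates are correct on average.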
A sampling distribution is a probability distribution of a sample property. A sampling distribution is obtained by sampling many times from the studied population.
Sample estimates of mean and variance are unbiased, but they vary from sample to sample.
The standard deviation of the sampling distribution is called the standard error.
For the sample mean, \(\bar X\), the variance is
\[E[(\bar X - \mu)^2] = \mathrm{var}(\bar X) = \frac{\sigma^2}{n}\]
The standard error of the mean is thus:
\[SEM = \frac{\sigma}{\sqrt{n}}\]
Replacing \(\sigma\) with the sample standard deviation, \(s\), we get an estimate of the standard error of the mean:
\[SEM \approx \frac{s}{\sqrt{n}}\]
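The three versions of the standard error can be compared in a short simulation. This is a sketch under assumed values (\(\mu = 100\), \(\sigma = 15\), \(n = 25\)): the theoretical SEM, the standard deviation of a simulated sampling distribution of the mean, and the estimate \(s/\sqrt{n}\) from a single sample.

```python
import math
import random
from statistics import fmean, stdev

rng = random.Random(5)
mu, sigma, n = 100.0, 15.0, 25

# Theoretical standard error of the mean, sigma / sqrt(n).
sem_theory = sigma / math.sqrt(n)

# Empirical sampling distribution of the mean: many samples of size n.
sample_means = [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(10_000)]
sem_simulated = stdev(sample_means)

# In practice there is only one sample; estimate the SEM from its sd, s.
one_sample = [rng.gauss(mu, sigma) for _ in range(n)]
sem_estimate = stdev(one_sample) / math.sqrt(n)

print(sem_theory, sem_simulated, sem_estimate)
```

The simulated value should lie close to the theoretical one, while the single-sample estimate is itself a random quantity and can deviate noticeably.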