Module 4 - Normal Approximation and Binomial Distribution

# Module 4 - Normal Approximation and Binomial Distribution

>[!Abstract]
>This module covers the empirical rule and normal approximation for data, a technique that is used in many statistical procedures. You will also learn about the binomial distribution and the basics of random variables.

```toc
```

## The Empirical Rule

If the data follow the normal curve, then
- about 2/3 (68%) of the data fall within 1 standard deviation of the mean
- about 95% fall within 2 standard deviations of the mean
- about 99.7% fall within 3 standard deviations of the mean

![[Module 1 - Introduction and Descriptive Statistics for Exploring Data#The Standard Deviation]]

## Standardising Data

A normal curve is completely determined by its parameters: the mean $\bar{x}$ and the standard deviation $\sigma$. To compute areas under the normal curve, we first **[[Standardisation|standardise]]** the data by subtracting $\bar{x}$ and then dividing by $\sigma$. For a data value such as a height:

$z = \frac{\text{height} - \bar{x}}{\sigma}$

$z$ is called the **standardised value** or ***z-score***:
- $z$ has no unit
- If $z=2$, the height is 2 standard deviations above the average

**Once the data are standardised, they have mean 0 and standard deviation 1.** This is the point of standardising. It is also a technique used in [[Feature Engineering]] to make sure all the features have the same scale.

## Normal Approximation

Using areas under the normal curve to approximate percentages of the data is called **normal approximation**. We can use it to answer questions like: what percentage of fathers have heights between 67.4 in and 71.9 in? To do so, we follow these steps:
1. Standardise both endpoints with $z = \frac{\text{height} - \bar{x}}{\sigma}$, which gives $z = -0.5$ and $z = 2$
2. Use a computer to compute the area under the normal curve between the two $z$ values

## The Binomial Setting and Binomial Coefficient

In the context of frequentist probabilities, the act of listing all possibilities is known as *total enumeration*.

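
Total enumeration can be sketched in code for a small case. The example below lists every success/failure sequence for a handful of experiments and counts the sequences with a given number of successes. The values `n = 4` and `k = 2` and the function names are illustrative assumptions, not part of the course material. The count is compared with Python's `math.comb`, which computes $\frac{n!}{k!(n-k)!}$ directly.

```python
from itertools import product
from math import comb


def enumerate_outcomes(n):
    """List every success/failure sequence for n experiments (1 = success, 0 = failure)."""
    return list(product([0, 1], repeat=n))


def count_with_k_successes(n, k):
    """Count the enumerated sequences that contain exactly k successes."""
    return sum(1 for outcome in enumerate_outcomes(n) if sum(outcome) == k)


# Illustrative values (assumed for this sketch, not taken from the notes)
n, k = 4, 2

outcomes = enumerate_outcomes(n)
print("number of sequences:", len(outcomes))                    # 2**n in total
print("sequences with k successes:", count_with_k_successes(n, k))
print("math.comb(n, k):", comb(n, k))                           # same count without listing
```

Listing all $2^n$ sequences quickly becomes impractical as $n$ grows, which is why a counting formula is preferable to enumeration beyond small cases.
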
A **binomial setting** involves $n$ independent repetitions of an experiment with two possible outcomes, success or failure, where the probability of success $p$ is the same in every repetition.

The **binomial coefficient** counts the number of ways in which one can arrange $k$ successes in $n$ experiments:

$\frac{n!}{k!(n-k)!}$

Applying this coefficient in the binomial setting gives the **binomial formula**:

$P(k \text{ successes in } n \text{ experiments}) = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}$

## Random Variables

The outcomes of repeated experiments are due to chance, so the number of successes is random: one set of 10 experiments might result in 4 successes, another set might result in 7. The number of successes $X$ is called a **random variable**, and we can write probabilities such as $P(X = 2) = 30.2\%$. $X$ has the [[Binomial Distribution]].

We can visualise the probabilities of the various outcomes of $X$ with a **probability histogram**:

![[prob_histogram.png]]

As the number of experiments grows, say to $n=50$, the probability histogram starts to resemble the normal curve. This means we can answer questions of a binomial nature using the normal approximation discussed above.

![[normalized_hist_prob.png]]

## Sampling Without Replacement

A simple random sample selects subjects without replacement. This is not the binomial setting because $p$ changes after a subject has been removed. But if the population is much larger than the sample, then sampling without replacement is about the same as sampling with replacement, i.e. the number of successes approximately follows the binomial distribution and, for a large enough sample, the normal curve.
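
The calculations in this module can be sketched with Python's standard library. This is a minimal sketch under stated assumptions, not the course's own code. The normal area is computed from `math.erf`. The binomial example assumes $p = 0.2$ with 10 experiments, which is not stated in the notes but reproduces the quoted $P(X = 2) = 30.2\%$. The normal approximation to the binomial uses mean $np$, standard deviation $\sqrt{np(1-p)}$, and a continuity correction of 0.5, none of which appear in the notes. The population size of 10,000 in the last part is purely illustrative.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return normal_cdf(z_high) - normal_cdf(z_low)


def binomial_pmf(k, n, p):
    """Binomial formula: P(k successes in n experiments)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_range_exact(k_low, k_high, n, p):
    """Exact P(k_low <= X <= k_high), summing the binomial formula."""
    return sum(binomial_pmf(k, n, p) for k in range(k_low, k_high + 1))


def binomial_range_normal(k_low, k_high, n, p):
    """Normal approximation to P(k_low <= X <= k_high).

    Assumption: mean n*p, SD sqrt(n*p*(1-p)), continuity correction of 0.5.
    """
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return area_between((k_low - 0.5 - mean) / sd, (k_high + 0.5 - mean) / sd)


def hypergeometric_pmf(k, n, successes, population):
    """P(k successes) when drawing n subjects WITHOUT replacement."""
    return (comb(successes, k) * comb(population - successes, n - k)
            / comb(population, n))


# Empirical rule: areas within 1, 2 and 3 standard deviations of the mean
for width in (1, 2, 3):
    print(f"within {width} SD: {area_between(-width, width):.4f}")

# Fathers' heights: the endpoints standardise to z = -0.5 and z = 2
print(f"area between z=-0.5 and z=2: {area_between(-0.5, 2):.4f}")

# Binomial formula with assumed p = 0.2 and n = 10
print(f"P(X = 2): {binomial_pmf(2, 10, 0.2):.4f}")

# Normal approximation for n = 50 (p = 0.2 is still an assumption)
print(f"exact  P(8 <= X <= 12): {binomial_range_exact(8, 12, 50, 0.2):.4f}")
print(f"approx P(8 <= X <= 12): {binomial_range_normal(8, 12, 50, 0.2):.4f}")

# Without replacement from a large population (illustrative sizes)
print(f"hypergeometric P(X = 2): {hypergeometric_pmf(2, 10, 2000, 10000):.4f}")
```

With these inputs the area between $z = -0.5$ and $z = 2$ comes out at roughly 66.9%, and the hypergeometric probability sits close to the binomial one because the sample of 10 is tiny relative to the assumed population.
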