Z-test for Proportions

# Z-test for Proportions

>[!Question] [Why do we use a z-test rather than a t-test with proportional data?](https://stats.stackexchange.com/questions/90893/why-use-a-z-test-rather-than-a-t-test-with-proportional-data#:~:text=The%20reason%20you%20can%20use,have%20to%20take%20into%20account.)
>The reason you can use a 𝑧-test with proportion data is because the standard deviation of a proportion is a function of the proportion itself. Thus, once you have estimated the proportion in your sample, you don't have an extra source of uncertainty that you have to take into account. As a result, you can use the normal distribution instead of the 𝑡 distribution as your sampling distribution.
>
>If you have more than 2 groups, you can use logistic regression, as you note. You do have to know the $n_j$s in each group however. If you just had a set of observed proportions, but didn't know how many trials had been observed to generate those proportions, you cannot run a proper test of whether the proportions differed.

>[!Abstract] Questions
>1. How can you compare the CTRs of an ad between two groups of users?
>2. How can you derive a confidence interval for the probability of getting heads from a series of coin tosses?

The typical one- and two-sample proportion tests are of the following form:

$T = \frac{d}{s}$

- Where,
	- $d$ is the difference between a proportion and a constant, or the difference between two proportions
	- $s$ is the estimated standard deviation of $d$

>[!Note] Slutsky's Theorem
>If $\frac{d}{\sigma_d}$ is asymptotically standard normal under $H_0$ (by the central limit theorem) and the denominator $s$ is a consistent estimate of the unknown standard deviation $\sigma_d$ (that is, $\frac{s}{\sigma_d} \to 1$ in probability), then $\frac{d}{s} \xrightarrow{d} N(0, 1)$

Therefore, we can treat $T$ as asymptotically normal. However, there is no justification for treating it as $t$-distributed, so we don't use t-tests for comparing proportions. It is worth mentioning, though, that for large samples the result of a z-test is very close to that of a t-test, so t-tests are sometimes used in practice instead of z-tests.
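
The asymptotic normality of $T = \frac{d}{s}$ can be checked empirically. The following is a minimal simulation sketch, not part of the original source: the true proportion `p_true`, the sample size `n`, and the number of simulated experiments `n_sims` are arbitrary values chosen for illustration. If $T$ is approximately $N(0, 1)$, its mean should be near 0, its standard deviation near 1, and roughly 5% of the values should exceed 1.96 in absolute value.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is reproducible

# Illustrative values (assumptions, not taken from the source)
p_true = 0.3   # true success probability
n = 400        # Bernoulli trials per simulated experiment
n_sims = 2000  # number of simulated experiments

t_values = []
for _ in range(n_sims):
    # Draw n Bernoulli(p_true) trials and count the successes
    successes = sum(random.random() < p_true for _ in range(n))
    p_hat = successes / n
    d = p_hat - p_true                 # difference from the true proportion
    s = sqrt(p_hat * (1 - p_hat) / n)  # plug-in estimate of the std. dev. of d
    t_values.append(d / s)

mean_t = fmean(t_values)
std_t = pstdev(t_values)
# Share of simulated statistics beyond the two-sided 5% critical value
rejection_rate = sum(abs(t) > 1.96 for t in t_values) / n_sims

print(f"mean={mean_t:.3f}, std={std_t:.3f}, rejection rate={rejection_rate:.3f}")
```

Here $s$ is computed from $\hat{p}$ rather than from the true $p$, which is exactly the situation Slutsky's theorem covers: the estimated denominator does not change the limiting distribution.
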
## One-proportion Z-test

>[!Abstract] When do you use one-proportion z-test?
>We use a one-proportion z-test when we want to compare a population proportion to a *constant*.

Let
- $p$ be the success rate of a large number $n$ of independent [[Binomial Distribution|Bernoulli]] trials
- $\hat{p}$ be the observed success rate, that is, the number of observed successes divided by the total number of trials

>[!Idea] When the sample contains at least 10 successes and 10 failures, it is reasonable to use the normal approximation of a [[Binomial Distribution|binomial distribution]]

$n\hat{p} \sim Binomial(n,p) \approx N(np, npq), \quad q = 1 - p$

$\hat{p} \approx N\bigg(p, \frac{p(1-p)}{n}\bigg)$

### Hypothesis
$H_0: p = p_0$

$H_1: p \neq p_0$

### Z-statistic
Under $H_0$ the statistic approximately follows a standard normal distribution:

$Z \sim N(0, 1)$

$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$

Based on this, we can calculate the observed z-statistic and compare it with the critical z value. Since $H_1$ is two-sided, we reject $H_0$ if the absolute value of the observed z-statistic is larger than the critical value $z_{\alpha/2}$.

### Confidence Interval
The confidence interval for $p$ is given by

$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

### Example: Estimate the click-through rate (CTR) $p \in (0, 1)$ of users on Ads
Suppose that we have an algorithm for Ad selection and we'd like to estimate the CTR $p \in (0, 1)$ of users on the Ads selected by this algorithm.
Assuming the following:
- 1000 users
- Observed CTR is $\hat{p} = 0.2$
- $\alpha = 5\%$

The null hypothesis in this case is (assuming this was the CTR before the algorithm was implemented):

$H_0: p = 0.15$

```python
from scipy import stats

p_0 = 0.15   # CTR under the null hypothesis
n = 1000     # number of users
p_hat = 0.2  # observed CTR

# Standard error of p_hat under H_0: sqrt(p_0 * (1 - p_0) / n)
sigma = (p_0 * (1 - p_0) / n)**0.5
observed_z_score = (p_hat - p_0) / sigma  # ≈ 4.428
critical_z_score = stats.norm.ppf(0.975)  # ≈ 1.9600
```

Since the observed z-score is greater than the critical z-score, we reject the null hypothesis $H_0$ at the 5% significance level.

The 95% confidence interval is computed as follows:

```python
# Confidence interval at 95%
# The standard error now uses p_hat, because no null value is assumed
sigma = (p_hat * (1 - p_hat) / n)**0.5
margin_of_error = critical_z_score * sigma
lower = p_hat - margin_of_error
upper = p_hat + margin_of_error
confidence_interval = (lower, upper)  # ≈ (0.1752, 0.2248)
```

Continue [here](https://www.youtube.com/watch?v=KCcu4BdgZGA&list=PLY1Fi4XflWStljP1tzfAfU_Qn0wHzhzYm&ab_channel=EmmaDing), read [this](https://stats.stackexchange.com/questions/81975/the-z-test-vs-the-chi2-test-for-comparing-the-odds-of-catching-a-cold-in-2/82556#82556)

## Source
- [Z-test for Proportions | Statistics Interview Q&A for Data Scientists](https://www.youtube.com/watch?v=KCcu4BdgZGA)

## Two-proportion z-test
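
For the two-sample case, $d$ in $T = \frac{d}{s}$ is the difference between two observed proportions. The following is a minimal sketch, assuming the common pooled-standard-error form of the test under $H_0: p_1 = p_2$. The function name `two_proportion_z_test` and the click counts are hypothetical and chosen only for illustration, and the standard library's `NormalDist` is used here in place of `scipy.stats`.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled standard error.

    x1, x2 are success counts; n1, n2 are the corresponding sample sizes.
    Returns the z-statistic and the two-sided p-value.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 both groups share one proportion, estimated from the pooled data
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 200 clicks from 1000 users vs. 150 clicks from 1000 users
z, p_value = two_proportion_z_test(200, 1000, 150, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With these made-up counts the statistic exceeds the 5% critical value of about 1.96, so the difference in CTR between the two groups would be judged significant at that level.
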