# Statistics of AB Testing
---
## Review Questions
### 1. Hypothesis Testing Basics
- What are the differences between a z-test and t-test?
- When should you use a z-test versus a t-test?
- Given the data, how would you calculate the t-statistic or z-statistic?
### 2. Hypothesis Testing + A/B Testing
- Given a test result, calculate if the result is significant
- How to make launch decisions?
- How would you use hypothesis testing in practice?
### 3. Hypothesis Testing + SQL
- Query average "likes" in control and treatment groups
- Compute the test statistic and determine whether it's significant
---
## What are the available tests?
- t-test
- z-test
- Welch's t-test (when the sample variances are not similar)
## What are the differences between the different tests? (t-distribution vs z-distribution)
- t-distribution is more spread out
- The z-test assumes the population standard deviation is known; the t-test estimates it from the sample
- t-distribution produces wider confidence interval than z-distribution
![[t_vs_z_distr.png]]
### Why don't we use t-tests for proportions?
In the usual t-tests, the t-statistic has the form $d/s$, where $s$ is an estimated standard error of $d$. The t-distribution arises when the numerator is normally distributed and $s^2$ is an independent, scaled chi-squared variable; proportion data doesn't satisfy this, so the test statistic does not have a t-distribution. For one-sample or two-sample proportion tests:
- The test statistic still has the form $d/s$
- It is only asymptotically normal (the variance of a proportion is determined by the proportion itself)
- So there is no justification for using a t-distribution
Why do people use t-tests for proportions?
- They are (at least academically) wrong
- But the t-distribution approximation on Bernoulli data works well in practice
- *For large samples, the t-distribution and z-distribution give similar results* (see the sketch below)
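A quick way to see this numerically (a sketch, not from the source; the 1.1% / 2.3% rates mirror the example later in these notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two large Bernoulli samples (click / no-click) with slightly different rates
a = rng.binomial(1, 0.011, size=10_000)
b = rng.binomial(1, 0.023, size=10_000)

# The "academically wrong" route: a t-test directly on the 0/1 data
t_stat, t_pval = stats.ttest_ind(a, b)

# The two-sample proportion z-test with a pooled standard error
p_pool = (a.sum() + b.sum()) / (len(a) + len(b))
se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(a) + 1 / len(b)))
z_stat = (b.mean() - a.mean()) / se
z_pval = 2 * stats.norm.sf(abs(z_stat))

print(f"t-test: stat={t_stat:.3f}, p={t_pval:.4f}")
print(f"z-test: stat={z_stat:.3f}, p={z_pval:.4f}")  # nearly identical for large n
```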
## How to know which test to use?
The following flowchart can be used to determine which test to use
![[hypothesis_test.png]]
- For small samples it is important to check if the data is normally distributed
- For larger samples, we can invoke [[Central Limit Theorem]] and assume the distribution of sample means is approximately normally distributed
- The z-test is less commonly used in practice since the population variance is rarely known (see the sketch below)
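A minimal sketch of this decision logic in code (the function name and the n < 30 rule of thumb are my own assumptions; the flowchart above is the authoritative version):

```python
def choose_test(n: int, is_proportion: bool, population_variance_known: bool) -> str:
    """Rough test-selection logic for a two-sided comparison (rule-of-thumb sketch)."""
    if is_proportion:
        # Proportion tests are asymptotically normal -> z-test
        return "z-test"
    if population_variance_known:
        # Rare in practice: the population variance is usually unknown
        return "z-test"
    if n < 30:
        # Small sample: check that the data is approximately normal, then t-test
        return "t-test (after checking normality)"
    # Large sample: CLT makes the sample mean approximately normal,
    # so t- and z-tests give nearly identical results
    return "t-test (z-test also acceptable)"
```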
### Note: [[Bernoulli Distribution]]
- Distribution of a binary (0/1) random variable
- $Pr(1) = p$
- $Pr(0) = 1-p$
- Example: Click-through probability (CTP) -> Pr(click), Pr(no click)
- This is also what you would use to understand changes in proportions, e.g., the percentage of users or pages affected
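For intuition, a tiny simulation of click-through as Bernoulli draws (the 3% rate is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

p_click = 0.03                                 # assumed click-through probability
clicks = rng.binomial(1, p_click, size=1_000)  # 1 = click, 0 = no click

print("observed CTP:", clicks.mean())          # sample estimate of p
print("sample variance:", clicks.var(), "vs p(1-p) =", p_click * (1 - p_click))
```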
## Two-Sample Test of Proportion
Experiment: test color of a button
- Click through probability: N(users who clicked) / N (total users)
- 1000 users in both control & treatment groups
Results:
- Control group: 1.1% CTP
- Treatment group: 2.3% CTP
Question:
- Can we conclude that the difference between the two groups is significant?
- Do you recommend launching this experiment?
Note:
- Practical significance boundary: 0.01
- Significance level $\alpha = 0.05$
Questions to answer:
1. Which hypothesis tests to use?
- Each user either clicks or doesn't click -> [[Bernoulli Distribution]]
- Control group: $n * p = 1000 * 1.1\% = 11$
- Treatment group: $n * p = 1000 * 2.3\% = 23$
- The test statistic for comparing two proportions follows a *z-distribution* (asymptotically normal)
- Measurements
- Users clicked $X_{ct}, X_{tr}$
- Total number of users $n_{ct}, n_{tr}$
$\hat{p}_{ct} = \frac{X_{ct}}{n_{ct}} = \frac{11}{1000}$
$\hat{p}_{tr} = \frac{X_{tr}}{n_{tr}} = \frac{23}{1000} $
2. What is the null hypothesis?
$d = \hat{p}_{tr} - \hat{p}_{ct}$
$H_0: p_{ct} = p_{tr}$, i.e., $d = 0$
- Test statistic
$TS = \frac{\hat{p}_{tr} - \hat{p}_{ct}}{SE}$
- Choose SE such that it can represent both groups -> Pooled standard error
1. First calculate the *pooled* proportion $\hat{p}$
$\hat{p} = \frac{X_{ct} + X_{tr}}{n_{ct} + n_{tr}} = \frac{11+23}{1000+1000} = 0.017$
2. Compute the *pooled* SE
$SE = \sqrt{\hat{p} (1-\hat{p}) (1/n_{ct} + 1/n_{tr})} = 0.00578$
$TS = \frac{0.023 - 0.011}{0.00578} \approx 2.076$
3. Is the result statistically significant?
- Critical z-score = 1.96
- If TS > 1.96 or TS < -1.96, reject the null hypothesis
4. Is the result practically significant?
- Check whether the confidence interval for $d$ stays above the practical significance boundary of 0.01 (see the sketch below)
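The full worked example as a Python sketch (numbers as in the notes; using the pooled SE for the confidence interval is a simplification, an unpooled SE is more standard there):

```python
import numpy as np
from scipy import stats

# Observed data
x_ct, n_ct = 11, 1000     # control: 1.1% CTP
x_tr, n_tr = 23, 1000     # treatment: 2.3% CTP
alpha, practical_boundary = 0.05, 0.01

p_ct, p_tr = x_ct / n_ct, x_tr / n_tr
d = p_tr - p_ct                                    # observed difference: 0.012

# Pooled proportion and pooled standard error
p_pool = (x_ct + x_tr) / (n_ct + n_tr)             # 0.017
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ct + 1 / n_tr))  # ~0.00578

ts = d / se                                        # ~2.076
z_crit = stats.norm.ppf(1 - alpha / 2)             # 1.96
print("TS =", round(ts, 3), "| statistically significant:", abs(ts) > z_crit)

# Practical significance: does the CI for d stay above the 0.01 boundary?
ci_low, ci_high = d - z_crit * se, d + z_crit * se
print(f"95% CI for d: ({ci_low:.4f}, {ci_high:.4f})",
      "| clears boundary:", ci_low > practical_boundary)
```

With these numbers the test is statistically significant, but the lower end of the 95% CI falls below the 0.01 practical boundary, so the launch decision is not clear-cut; a common recommendation is to gather more data before deciding.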
## Two-Sample Test of Means
Experiment: if a new feature changes average number of posts
- 30 users in both control & treatment groups
- Control: [1, 0, 1, 3, 2, ...]
- Treatment: [0, 2, 3, 1, 0, ...]
- Mean of control = 1.4
- Mean of treatment = 2
- $\alpha = 0.05$
- Practical significance boundary = $0.05$
- Question: should you launch this feature?
- Now we are dealing with a two-sample test of means with unknown but similar variances -> pooled two-sample *t-test*
- Compute *pooled variance*
- Goal: measure the difference
$d = \mu_t - \mu_c$
- Null hypothesis
$H_0: \mu_c = \mu_t, d = 0$
- Test statistic with *pooled variance* (see the sketch after this list)
$TS = \frac{\bar{x}_t - \bar{x}_c}{S_{pool}\sqrt{\frac{1}{n_c} + \frac{1}{n_t}}}$
- *Pooled standard deviation*, where $SS$ is the sum of squared deviations within each group and $df = n_c + n_t - 2$
$S_{pool} = \sqrt{\frac{SS_c + SS_t}{df}}$
- Continue [here](https://www.youtube.com/watch?v=6uw0A3aKwMc)
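A sketch of the pooled two-sample t-test in Python (the function is my own wrapper around the formulas above; `scipy.stats.ttest_ind` with `equal_var=True` does the same pooling and can be used as a cross-check):

```python
import numpy as np
from scipy import stats

def pooled_two_sample_t(control, treatment, alpha=0.05):
    """Two-sample t-test with pooled variance, following the formulas above."""
    c, t = np.asarray(control, float), np.asarray(treatment, float)
    n_c, n_t = len(c), len(t)

    d = t.mean() - c.mean()                  # observed difference in means
    ss_c = ((c - c.mean()) ** 2).sum()       # sum of squared deviations, control
    ss_t = ((t - t.mean()) ** 2).sum()       # sum of squared deviations, treatment
    df = n_c + n_t - 2

    s_pool = np.sqrt((ss_c + ss_t) / df)     # pooled standard deviation
    ts = d / (s_pool * np.sqrt(1 / n_c + 1 / n_t))
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    return ts, t_crit, abs(ts) > t_crit

# Cross-check: stats.ttest_ind(treatment, control, equal_var=True) gives the same
# t-statistic when fed the full 30-user control and treatment arrays.
```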
## Sources
- [Crack Hypothesis Testing Problems in Data Science Interviews | Binomial test, z-test and t-test](https://www.youtube.com/watch?v=IY7y-t30UJc)