# AB Testing
>[!Abstract]
>Online controlled experiments are the best scientific way to establish causality with high probability.
## TODO:
- [ ] What is the role of placebo/nocebo in AB testing?
- [ ] Maybe read what is written in [[Placebo Effect]]
```toc
```
## AB Testing Procedure
![[ab_testing_king.png]]
### 1. **The Problem Statement** - Prerequisites
---
- What is the problem statement?
- What is the nature of this product? (e-commerce, SAAS, subscriptions?)
- What are the business goals of this problem?
- What is the **User Journey?** What does it look like?
	- What is the core product? Sketch an outline of the user journey.
![[user_journey.png]]
### 2. **Hypothesis Testing**
---
- **Define the success and driver metrics.** This is closely related to what is called a **[[AB Testing#Metric Selection#What are driver metrics?|Driver Metric]]**, discussed further down in this document
- Consider the following
- Is it **measurable**?
- Is it **attributable**? Can you establish a clear link between the behaviour (effect) and the treatment (cause)?
	- Is it **sensitive**? Is it a good proxy for the user experience? Does it have low variability? (Lower variability makes an effect easier to detect)
- Is it **timely** (quick)?
- **State the hypothesis statements**
	- **Null Hypothesis ($H_0$):** The average revenue per day per user is the same for the baseline and variant ranking algorithms
	- **Alternative Hypothesis ($H_a$):** The average revenue per day per user differs between the baseline and variant ranking algorithms
- **Set the significance level**. Eg. $\alpha = 0.05$
- **Set the statistical power**. The probability of detecting an effect if the alternative hypothesis is true. It is common practice to design experiments for 80-90% power.
- **Set the minimum detectable effect (MDE).** The smallest effect that is worth detecting, i.e. the smallest change that is meaningful to the business (also called practical significance)
### 3. **Design the Experiment**
---
- Set the randomisation unit
- Eg. User. Randomise using the user unit
- **Set guardrail metrics**
- Which population do you want to randomise?
- Use the user funnel to decide on this
- Which users do you want to target?
- Determine sample size
$n = \frac{2\sigma^2(z_{\alpha /2} + z_\beta)^2}{\delta^2}$
- Where
- $\sigma^2$ is the *variance*
- This is usually estimated from previous A/B tests
- Historical logs
- Or simply run an A/A test when there is no historical data available
- $\alpha$ is the significance level, Type I error (false positive)
- $\beta$ is Type II error (false negative) which is 1 - *power*
- $\delta$ is the difference between the two groups
- Difference between control and treatment
- Use *minimum detectable effect* (aka practical significance)
- Eg. 10M increase in revenue, 10k increase in button click
- Smallest change that is meaningful to the business
- If $\alpha = 0.05$ and the power is $0.8$ (so $z_{\alpha/2} \approx 1.96$ and $z_\beta \approx 0.84$), the formula simplifies to the rule of thumb that is typically used for the sample size (a code sketch follows below)
$n =\frac{16 \sigma^2}{\delta^2}$
For further details, look into [[The Statistics of AB Testing]]
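A minimal sketch of this sample-size calculation in Python (SciPy is assumed to be available; the function name and example numbers are illustrative):
```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """n = 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2, per variant."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Illustrative inputs: sigma estimated from historical logs or an A/A test, MDE = 0.5
print(sample_size_per_group(sigma=10, delta=0.5))  # -> 6280 users per variant
```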
### 4. **Running the Experiment**
---
- Wait until the experiment has finished before drawing a conclusion; do not stop early based on interim p-values (see the peeking thought experiment further down)
### 5. **Validity Checks**
---
- Bias checks
- Instrumentation effects
- External factors, seasonality, economic conditions, Covid etc.
- Check for selection bias using A/A Test
- Check for sample ratio mismatch (SRM) using the [[Chi-squared Test|Chi-Square]] Goodness of Fit Test (a sketch follows below)
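A minimal sketch of an SRM check with a chi-square goodness-of-fit test; the counts and the alert threshold are illustrative assumptions:
```python
from scipy.stats import chisquare

# Observed assignment counts (illustrative) for a designed 50/50 split
observed = [50_312, 49_110]                       # [control, treatment]
expected = [sum(observed) / 2, sum(observed) / 2]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:  # a conservative alert threshold for SRM
    print(f"Possible sample ratio mismatch (p = {p_value:.2e}); "
          "investigate before trusting any results")
```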
### 6. **Interpret Results**
---
- Reject, or fail to reject, the null hypothesis $H_0$ by comparing the p-value against the chosen significance level (a sketch follows below)
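A minimal sketch of this step, assuming a continuous per-user metric (e.g. revenue per day per user) and a two-sample Welch t-test; the simulated data is illustrative:
```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=3.0, size=5000)    # baseline ranking algorithm
treatment = rng.normal(loc=10.2, scale=3.0, size=5000)  # variant ranking algorithm

stat, p_value = ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```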
### 7. **Launch Decision**
---
Consider the following
- Metric Trade-Offs: primary metric may improve but secondary metrics may decline
- Cost of Launching: consider the cost of launching and maintaining the change across all the users
---
## Power of an AB Test
>[!Quote]
>“A common mistake is to assume that just because a metric is not statistically significant, there is no Treatment effect. It could very well be that the experiment is *underpowered* to detect the effect size we are seeing, that is, there are not enough users in the test.”
>-- Trustworthy Online Controlled Experiments
How do you determine the number of users you need, or how long you'd need to wait, for your experiment?
The statistical power is the probability of finding a statistically significant result when the null hypothesis is false.
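A minimal sketch of a power calculation under a two-sample z-test approximation (function name and numbers are illustrative); it shows how a test with too few users is underpowered for a given effect size:
```python
import math
from scipy.stats import norm

def power_two_sample(n_per_group: int, sigma: float, delta: float,
                     alpha: float = 0.05) -> float:
    """Approximate probability of detecting a true effect of size delta."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = delta / (sigma * math.sqrt(2 / n_per_group))
    return norm.cdf(z_effect - z_alpha)  # the opposite tail is negligible

print(power_two_sample(n_per_group=1000, sigma=10, delta=0.5))  # ~0.20, badly underpowered
```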
### A Thought Experiment - Peeking
This experiment illustrates why you need to perform [[Power Analysis|power analysis]] before you run your A/B test.
Suppose 100 different individuals run [[AA Tests|A/A tests]]. Each waits for a significant result, i.e., p-value < 5% (up to a maximum of 10k visitors).
- How many find a significant result and stop early? -> **Over half!!**
- In A/B testing, "peeking" can dramatically inflate false positives.
![[false_positives_in_peeking.png]]
![[peekingABTestp2.png]]
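The thought experiment above can be reproduced with a small simulation; the parameters (100 A/A tests, peeking every 100 users up to 2,000 per group) are illustrative and smaller than the 10k-visitor example:
```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, max_n, check_every = 100, 2000, 100
early_stops = 0

for _ in range(n_experiments):
    a = rng.normal(size=max_n)  # both groups come from the same distribution (A/A)
    b = rng.normal(size=max_n)
    for n in range(check_every, max_n + 1, check_every):  # peek as data arrives
        if ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            early_stops += 1  # a false positive declared by stopping early
            break

print(f"{early_stops}/{n_experiments} A/A tests 'found' a significant result")
# Far above the nominal 5% false-positive rate we designed for.
```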
## Experimentation 101
- Primary Metric
	- Conversion is just one example of a primary metric. When you state a hypothesis, your primary metric has to be **conclusive** in order to accept that your observation is statistically significant
- Primary Metric -> **Conclusive Positive/Negative**
- Supporting/Secondary Metrics
	- The primary metric is the end goal, although most of the time you will not succeed at once. Supporting metrics give you an indication of whether the change you are suggesting is moving in the right direction.
- Some good examples of supporting metrics in e-commerce are:
- Customer service tickets
- Product returns/cancellations
- Health Metrics
	- Health metrics refer to the health of the platform/service you deliver; they make sure that your experiment is not impacting anything unexpected
- Some examples are
- Performance/Website Speed
- Javascript errors
- Backend errors/queries
	- You can use an [[AA Tests|A/A test]] here
- Binomial Goals
	- Binomial goals are all the little breadcrumbs that customers leave while navigating your product/service. They can be measured as a yes/no response and tracked accordingly.
- Bounce rate from page
- Bike category page visited
## Metric Selection
### What are driver metrics?
- aka surrogate, indirect, or predictive metrics
- Major metrics used for A/B testing
- Short-term, sensitive and actionable
- Eg. A simple A/B test of an ad campaign
- Goal: increase total revenue from sales of items
### What are the attributes of driver metrics?
- We can use three criteria to determine a driver metric:
1. Sensitive and timely
- *Good example: CTR*
- Immediately reflects Ads performance
- Bad example: DAU
		- Takes time for users to purchase and adapt to the product
- Important to the business but not suitable for A/B tests
- *DAU* is better suited as a *success metric* as opposed to a driver metric
2. Measurable
- *Good example: CTR*
- Counts can be obtained in real-time
- Bad example: MAU, user retention
- Out of the experiment timeframe
	3. Attributable: we must be able to attribute the change in the metric to an experiment variant. This requires us to measure the metric in the control and treatment groups separately (see the sketch after this list)
- *Good example: CTR*
- Good Ads campaign leads to a higher CTR and vice versa
- Bad example: DAU, user retention
- Many other things can cause a change in these metrics
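A minimal sketch of what "attributable" means in practice: the driver metric (CTR here) is computed separately per variant from logged events. The DataFrame columns are illustrative assumptions:
```python
import pandas as pd

events = pd.DataFrame({
    "variant": ["control", "control", "treatment", "treatment"],
    "impressions": [1000, 1200, 980, 1100],
    "clicks": [32, 41, 45, 52],
})

# Aggregate clicks and impressions per variant, then compute CTR per variant
ctr = (events.groupby("variant")[["clicks", "impressions"]].sum()
             .assign(ctr=lambda d: d["clicks"] / d["impressions"]))
print(ctr)
```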
### How to develop and select driver metrics?
- Combine qualitative and quantitative methods
	- **Qualitative:** user experience research, focus groups and surveys to understand users' needs
- **Quantitative:** data analysis
Most importantly, **understand the motivation** of the A/B test and define metrics specifically for measuring the change. For that, you can do the following:
1. Fully understand the goal of the test
- Is the goal user growth?
- Is it to improve engagement?
- Is it to increase revenue?
- Is the change about acquisition, activation, retention, referral or revenue?
- *Be as specific as possible to fully understand the goal*
Eg: YouTube hides dislike counts on videos
- Goal: protect creators, especially small creators, from harassment and make them feel safe
- What is expected?
- Small creators become more engaged
- Two driver metrics
- Average time spent on YouTube per creator
- Average number of videos published per creator
2. Analyse user experience
- Consider the steps users in each group need to take to use a feature or a product
- Most products or features have a *funnel* that moves users towards taking key actions or desired outcomes that are meaningful to the business
For our example above:
- **Desired outcome:** fewer dislikes on videos from smaller channels
- **Control:** viewers can see the number of dislikes
- **Treatment:** viewers cannot see dislike numbers, but feedback is possible
- **Metric:** Average number of dislikes per viewer
- Measure the average number of dislikes for smaller creators (ideally decreased)
- **Result:** reduction in dislike attacking behavior
## Choosing Randomization Units for A/B Tests
- What is a randomization unit?
- Unit of diversion
- Who or what is randomly assigned to each variant of an A/B Test
It is important that the **user experience** is consistent when it comes to selecting the randomization unit; a hash-based assignment sketch is given at the end of this section. Furthermore, the randomization unit must be **coarser** than the *unit of analysis*, see below for an example.
- What is commonly used?
- UserID
- Consistent user experience
- Allows for long-term measurements
- User retention
- User learning effect
- Cookie
- Anonymous
- Can be cleared
- Event
- Finer level of granularity (same user can be connected to many page views or sessions)
- Page-level Randomization
- Session-level Randomization
- Device ID for mobile experiments
- Eg. let's say you want to analyse the *click-through rate*
	- Unit of analysis: page view
	- Randomisation unit: user
	- The randomisation unit is coarser than the unit of analysis.
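A minimal sketch of user-level bucketing: hashing the randomization unit (user ID) together with an experiment salt gives every user a stable, deterministic variant, which keeps the experience consistent across sessions. The experiment name and split are illustrative:
```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ranking_v2",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user_42"))  # the same user always lands in the same variant
```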
---
## A/B Testing Pitfalls
1. Sample Ratio Mismatch
- Possible causes
- Issues with ramp-up plans that PMs use to avoid unexpected phenomena
- A user being assigned to multiple ongoing experiments
- When users are assigned based on certain segmentations that might be changing over time (eg. location)
- Bugs in the process of user assignment
- Debugging tools
- Analysing upstream data processing units in randomization
		- Checking variant assignments for bias (eg. an unexpected female vs male split)
2. Violation of the **Stable Unit Treatment Value Assumption** (SUTVA)
- Randomization units are *independent* and there is *no interaction* between them
	- Often happens in two-sided marketplaces where resources are shared between groups:
- eg. Ebay, Uber etc
- If the new feature increases demand on the treatment group, then the treatment group needs more supply to fulfil that demand. This will impact the supply of the control group and as a result violate SUTVA
- Solution
- Predict where the interference will happen and take it into account in the design phase
		- Isolate control and treatment units, for example, by isolating experiment groups in different geographical locations
3. Change in Users' Behaviors
	- Some users try a new feature heavily at first simply because it is new: *novelty effect*
		- Usage spikes in the beginning and then drops off as the novelty wears off; this effect should be monitored and quantified
	- Other users are used to the old experience and need time to adapt to the change: *primacy effect*
## Relations
- [[Causal Inference]]
- [[How to Compare Distributions?]]
- [[P-value]]
- [[Type I and Type II Errors]]
- [[Power Analysis|Statistical Power]]
- [[Novelty Effect]]
## Sources:
- [A/B Testing in Data Science - YouTube Playlist](https://www.youtube.com/playlist?list=PLY1Fi4XflWSvgsaD9eXng6N5kxcMtcxGK)
- Trustworthy Online Controlled Experiments - Book