# How to Compare Distributions
>[!Abstract]
>In [[Machine Learning]] and [[self.stats/Statistics]] we encounter several kinds of distributions. Comparing these distributions, especially across different groups, is a common problem in data science. When it comes to causal inference in particular, the problem arises as we try to assess the quality of randomisation between control and treatment groups.
>
>**How can we compare them?**
[[Probability Distributions]] describe how a particular type of phenomenon behaves. How can we compare one distribution to another?
## Boxplots
The boxplot is a good trade-off between summary statistics and data visualization. The line inside the **box** marks the _median_, while the borders of the box mark the first (Q1) and third [quartile](https://en.wikipedia.org/wiki/Quartile) (Q3), respectively. The **whiskers** extend to the most extreme data points that lie within 1.5 times the _interquartile range_ (Q3 − Q1) of the box. The points that fall outside of the whiskers are plotted individually and are usually considered [**outliers**](https://en.wikipedia.org/wiki/Outlier).
![[box_plots.png]]
A boxplot is a good indicator of the dispersion of the distribution within each group.
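A minimal sketch of such a plot with seaborn, assuming the same `df`, `Income`, and `Group` names used in the examples below:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to hold an 'Income' column and a 'Group' column, as in the later examples
sns.boxplot(data=df, x='Group', y='Income');
plt.title("Boxplot");

# The 1.5 * IQR rule behind the whiskers, computed by hand for the whole sample
q1, q3 = df['Income'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
```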
## Histogram
Whenever you have some sort of distribution of numbers, think of the histogram. It groups the data into equally wide bins and plots the number of observations that fall within each bin.
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=df, x='Income', hue='Group', bins=50);
plt.title("Histogram");
```
![[hist_plot_demo.png]]
Above you can see the histograms of two different distributions. It is rather difficult to compare them because the groups have different numbers of observations, and the number of bins is arbitrary. The first problem is solved by the **Density Histogram**.
### Density Histogram
Instead of just counting the number of occurrences in each bin, you can plot their density, so that each group's histogram sums to one:
```python
sns.histplot(data=df, x='Income', hue='Group', bins=50, stat='density', common_norm=False);
plt.title("Density Histogram");
```
![[density_histogram_demo.png]]
One issue remains: the number of bins is still arbitrary. This can be solved using a **Kernel Density Function**.
## Kernel Density Function
The **Kernel Density Function** approximates the histogram with a continuous function, using **[[Kernel Density Estimation (KDE)]]**.
```python
sns.kdeplot(x='Income', data=df, hue='Group', common_norm=False);
plt.title("Kernel Density Function");
```
![[kde_hist_plot_demo.png]]
> From the plot, it seems that the estimated kernel density of `income` has "fatter tails" (i.e. higher variance) in the `treatment` group, while the average seems similar across groups.
> The **issue** with kernel density estimation is that it is a bit of a black box and might mask relevant features of the data.
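One way to peek inside that black box is to vary the amount of smoothing: seaborn's `bw_adjust` parameter scales the default kernel bandwidth. A quick sketch, using the same assumed `df` as above:
```python
# Smaller bw_adjust follows the data more closely; larger values smooth more aggressively
sns.kdeplot(x='Income', data=df, hue='Group', common_norm=False, bw_adjust=0.5);
plt.title("Kernel Density Function (narrower bandwidth)");
```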
## Cumulative Distribution (Density)
This is a straightforward way of comparing two distributions.
At each point of the x-axis (`income`) we plot the percentage of data points that have an equal or lower value.
```python
sns.histplot(x='Income', data=df, hue='Group', bins=len(df), stat="density",
element="step", fill=False, cumulative=True, common_norm=False);
plt.title("Cumulative distribution function");
```
![[cumulative_plot_demo.png]]
**Interpretation:**
- Since the two lines cross more or less at 0.5 on the y-axis, the medians of the two groups are *similar*
- Since the orange line is above the blue line on the left and below the blue line on the right, it means that the distribution of the `treatment` group has fatter tails
## Q-Q Plot
A related method is the **Q-Q plot**, where _q_ stands for quantile. It plots the quantiles of one distribution against the quantiles of the other: if the two distributions are the same, the points fall on the 45-degree line.
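A minimal way to build such a plot by hand, assuming the same `df` and that `Group` takes the values `control` and `treatment` (an assumption here):
```python
import numpy as np
import matplotlib.pyplot as plt

# Group labels 'control' and 'treatment' are assumed; adjust to your data
income_a = df.loc[df['Group'] == 'control', 'Income']
income_b = df.loc[df['Group'] == 'treatment', 'Income']

# Quantiles of both groups at the same set of percentiles
percentiles = np.linspace(0, 100, 101)
q_a = np.percentile(income_a, percentiles)
q_b = np.percentile(income_b, percentiles)

plt.scatter(q_a, q_b, s=10)
# 45-degree reference line: points lying on it indicate identical distributions
lims = [min(q_a.min(), q_b.min()), max(q_a.max(), q_b.max())]
plt.plot(lims, lims, color='grey', linestyle='--')
plt.xlabel("Control quantiles")
plt.ylabel("Treatment quantiles")
plt.title("Q-Q plot");
```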
![[qq_plot_demo.png]]
### Sources:
- [How to Compare Two or More Distributions](https://towardsdatascience.com/how-to-compare-two-or-more-distributions-9b06ee4d30bf)