# Module 1 - Introduction and Descriptive Statistics for Exploring Data
```toc
```
## Pie Chart, Bar Graph, and Histograms
A graphical summary is often the best way to communicate information, because people generally find pictures easier to read than numbers. It can also be argued that a picture can hold more information than, for example, a list of numbers.
## The Box Plot
The boxplot conveys less information than a histogram, but it takes up less space and is well suited to compare several datasets.
![[the_box_plot.png]]
- The middle line of the boxplot represents the **median**; *the mean is not included in the boxplot*
- The median is where half of the data lies above and half below
- The lower whisker shows the smallest number in the data
- The upper whisker shows the largest number in the data
- The lower line of the box shows the **first quartile**, where 1/4 of the data are smaller and 3/4 are larger
- The upper line of the box shows the **third quartile**, where 3/4 of the data are smaller and 1/4 are larger
The boxplot makes it easier to compare different categories within a dataset. For example, the following diagram shows how efficiency differs between vehicles categorised by the number of cylinders:
![[box_plots_cats.png]]
The boxplot gives us a **five-number summary** of the data: *the smallest number, the 1st quartile, the median (50th percentile), the 3rd quartile and the largest number.*
The boxplot also gives us the **interquartile range** = 3rd quartile - 1st quartile. The interquartile range gives us a measure of how spread out the data is.
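As a rough illustration of the five-number summary and the interquartile range, here is a minimal sketch using Python's standard `statistics` module. The helper name `five_number_summary` and the `mpg` values are made up for this example, and the `"inclusive"` quartile method is just one of several conventions; other conventions can give slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    # method="inclusive" treats the data as the whole population; other
    # quartile conventions exist and can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


# Hypothetical values, invented purely for illustration.
mpg = [18, 21, 22, 24, 26, 30, 33]

smallest, q1, median, q3, largest = five_number_summary(mpg)
iqr = q3 - q1  # interquartile range = 3rd quartile - 1st quartile

print(smallest, q1, median, q3, largest)  # 18 21.5 24.0 28.0 33
print(iqr)  # 6.5
```

These five numbers are what the box, the middle line and the whiskers described above would show for this dataset.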
## The Standard Deviation
Although the interquartile range gives us an idea as to how spread out the data set is, the standard deviation is a more commonly used measure of spread.
The standard deviation is given by
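$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}$
where $\bar{x}$ is the mean of the $n$ data points. Sometimes $n-1$ is used instead of $n$, typically when the standard deviation is estimated from a sample rather than computed for a whole population.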
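The formula can be sketched directly in Python. The function `std_dev` and its `ddof` parameter (the amount subtracted from $n$ in the denominator) are hypothetical names chosen for this example, and the `values` list is invented.

```python
import math


def std_dev(xs, ddof=0):
    """Standard deviation of xs, dividing by (n - ddof).

    ddof=0 divides by n, as in the formula above; ddof=1 divides by n - 1.
    """
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / (n - ddof))


# Hypothetical data: mean is 5 and the squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sd_n = std_dev(values)                  # sqrt(32 / 8) = 2.0
sd_n_minus_1 = std_dev(values, ddof=1)  # sqrt(32 / 7), slightly larger

print(sd_n, sd_n_minus_1)
```

The standard library offers the same two variants as `statistics.pstdev` (divide by $n$) and `statistics.stdev` (divide by $n-1$); the difference between them shrinks as $n$ grows.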
The mean gives you a measure of *centre*, while the standard deviation gives you a measure of *spread*. Keep in mind that both can be sensitive to outliers (very large or very small data points), in which case it may be better to use the median and the interquartile range. I made this mistake while interviewing for [[DeepL Interview - Rejected|DeepL]].
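To make the sensitivity to outliers concrete, the sketch below compares the four measures on two small made-up datasets that differ only in their largest value. The helper `summarise` is a hypothetical name, and the quartiles again use the `"inclusive"` convention as an assumption.

```python
import statistics


def summarise(data):
    """Return centre and spread measures for a sequence of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(data),
        "sd": statistics.pstdev(data),  # divides by n
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }


# Hypothetical data: identical except that 14 is replaced by an outlier, 100.
clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 100]

summary_clean = summarise(clean)
summary_outlier = summarise(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(name, summary_clean[name], summary_outlier[name])
```

In this example the single outlier pulls the mean from 12 to 29.2 and inflates the standard deviation considerably, while the median (12) and the interquartile range (2) stay the same.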
## The Scatterplot
The **scatterplot** is used to depict data that come as *pairs*. It visualises the relationship between two variables.
![[scatter_plot.png]]
## Providing Context is Important
Statistical analyses typically compare the observed data to a reference. Therefore, context is essential for graphical integrity. One way to provide context is by using *small multiples*. The compact design of the boxplot makes it well suited to this task:
![[small_multiples_boxplot.png]]
## Numerical Summary Measures
For summarising data with one number, use the **mean** (=average) or the **median**. The median is the number that is larger than half the data and smaller than the other half.
The graph below includes a histogram plot with several [[Point Estimation|point estimates]] outlined and shown visually.
![[summarised_plot.png]]
Note: if the histogram is symmetric, then the mean and median are identical. This is approximately the case, for example, for people's heights, weights, etc., i.e. quantities that roughly follow the [[Normal Distribution]].
Which one is better?
- If the median sale price of 10 homes is $ 1 million, then we know that at least 5 homes sold for $ 1 million or more
- If we are told that the average sale price is $ 1 million, then we can't draw such a conclusion
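The home-price argument can be checked with a small sketch. Both price lists are invented for illustration (in thousands of dollars), and `count_at_or_above` is a hypothetical helper.

```python
import statistics


def count_at_or_above(prices, threshold):
    """Count how many prices are greater than or equal to threshold."""
    return sum(1 for price in prices if price >= threshold)


# Hypothetical sale prices in thousands of dollars (1000 = $1 million).
# One very expensive home drags the mean up to 1000.
skewed = [200, 200, 200, 200, 200, 200, 200, 200, 200, 8200]

# A different hypothetical list whose median is 1000.
balanced = [400, 500, 600, 800, 900, 1100, 1200, 1500, 2000, 3000]

print(statistics.mean(skewed), count_at_or_above(skewed, 1000))
print(statistics.median(balanced), count_at_or_above(balanced, 1000))
```

In the first list the average is $ 1 million although only one home sold for that much or more; in the second, a median of $ 1 million goes together with five homes at or above that price.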