# What is Statistics?

---

Statistics is the study of data and its properties. It is a mathematical endeavour in pursuit of understanding the structure of datasets.

Suppose we want to study the effect of a certain treatment (like a vaccine) on a response (like getting Covid). To do that we use the method of *comparison*: we compare the responses of a *treatment group* with those of a *control group*. It is usually quite hard to judge the effect of a treatment without comparing it to something else.

If the control group is comparable to the treatment group apart from the treatment itself, then a difference in response between the groups is likely due to the treatment. However, if the two groups also differ in other ways, the effects of these *other factors* are likely to be *confounded* with the effect of the treatment.

## Plots

### The Histogram

### The Box Plot

### The Scatter Plot

## Point Statistics

### Mean, Mode and Median

### Variance

### Standard Deviation

## Correlation

### Scatter Plots

## Regression

### Questions to be answered:

1. What exactly is the relationship between probability and statistics? Why are they always presented together?
    - Inverse probability
    - Uncertainty
2. Word cloud of statistics and roadmap
    1. → Compile a few PDFs of books on statistics and create a word cloud

> The roulette wheel has neither conscience nor memory. -- Joseph Bertrand

3. Principles
    1. Maximum Likelihood
    2. Bayesian Treatment

## Plots and Statistics

---

- How can plots be used as a starting point for studying data?
- How to compare distributions? [article reference]
- Here, talk about the histogram and box plots (?)
- Point statistics: mean and median (balancing act)

## Variance and Standard Deviation

---

- How to measure variance within the data?
- What does the variance tell you?
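
As a rough illustration of the two questions above, here is a minimal sketch of how spread could be measured. It assumes the population convention (divide by the number of values, not by one less), and the two small datasets are made up for the example. The function names `variance` and `standard_deviation` are just illustrative helpers, not part of any library.

```python
import math
import statistics


def variance(values):
    """Population variance: the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def standard_deviation(values):
    """Square root of the variance, expressed in the same units as the data."""
    return math.sqrt(variance(values))


# Hypothetical data: both lists have the same mean and median,
# but the second is spread much further from its centre.
tight = [9, 10, 10, 11]
wide = [2, 6, 14, 18]

for name, data in [("tight", tight), ("wide", wide)]:
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "variance:", variance(data),
        "SD:", round(standard_deviation(data), 3),
    )
```

Both lists share the same centre, so the mean and median alone cannot tell them apart. The variance and SD are what pick up the difference in spread, which is one way to read "what does the variance tell you?".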
## Correlation

---

- Scatter plots
- Scatter plot, `kind='hex'`

```python
import seaborn as sns

# Assumes `df` is a pandas DataFrame with `log_age` and `log_sales` columns.
s = sns.jointplot(x='log_age', y='log_sales', data=df, kind='hex')
s.ax_joint.grid(False)   # hide the grid on the central hexbin panel
s.ax_marg_y.grid(False)  # hide the grid on the right-hand marginal plot
s.fig.suptitle("Sales by firm's age", y=1.02)  # title just above the figure
```

- Binned scatter plots

> Ecological correlations are based on rates or averages. They are often used in political science and sociology. And they tend to overstate the strength of an association. So watch out.

- Confounder

### The Standard Deviation Line

> "The line that goes through the point of averages and climbs at the rate of one vertical SD for each horizontal SD"
> Freedman, D., Pisani, R., & Purves, R. (2007). *Statistics* (4th edn).

## Linear Regression

---

- The regression method
- RMS error
- Residual plots as a test for non-linearities
- Homoscedastic (football-shaped) vs. heteroscedastic
    - Levels of accuracy at different parts of the scatter diagram
    - RMS is a sort of average error in heteroscedastic distributions
    - How to fix heteroscedastic data → transformation (e.g. log scale) p197 [not for now perhaps]

> The primary goal of linear regression is to find the strength of correlation between your variables -- [Medium article](https://medium.com/bitgrit-data-science-publication/top-5-machine-learning-algorithms-explained-d15234b627f7)
>
> *The average regression of the offspring is a constant fraction of their respective mid-parental deviations.* -- Francis Galton

- Using the normal curve inside a vertical strip
- Estimating percentage scores over a certain value using another variable (p195)
- Observational studies, and why they cannot be used to predict how y changes if you change x
- Intellectual disaster p211
- Using OLS

[Understanding the Frisch-Waugh-Lovell Theorem](https://medium.com/towards-data-science/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299)

# Sources

[What is Statistics? (Michael I. Jordan) | AI Podcast Clips](https://www.youtube.com/watch?v=AQUAPiHahVY)

> Perhaps the single most important lesson I learned in the class that I took from Professor Freedman was to mentally separate the mathematical model of a situation from the reality of the data generating process that the model describes.