
# The R Programming Language

R is heavily used in [[self.stats/Statistics|statistics and data science]].

- dplyr's five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)

### Filter (AND, OR)

```R
# note: you can use a comma or an ampersand to represent an AND condition
filter(flights, Month==1, DayofMonth==1)
```

```R
# use the vertical bar `|` for an OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
```

### Select

```R
select(flights, DepTime, ArrTime, FlightNum)
```

```R
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
```

### Chaining | Pipelining

```R
# chaining method
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)
```

```R
# create two vectors and calculate the Euclidean distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
```

#### vs

```R
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
```

### Arrange

```R
# dplyr approach
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay)
```

### Mutate - adding new variables

```R
flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime*60)
```

### Aggregation - reducing variables

```R
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
```

#### `summarise_each` summarising for multiple columns

```R
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)
```

```R
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
```
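
The two `summarise_each` calls above rely on `summarise_each` and `funs`, which recent dplyr releases mark as deprecated. Below is a minimal sketch of the same per-carrier summaries written with `across()`, assuming dplyr 1.0.0 or later is installed. The `toy` data frame is a hypothetical stand-in for `flights` with made-up values, and the names `cancel_rates` and `delay_ranges` are only for illustration.

```R
library(dplyr)

# hypothetical toy data standing in for the `flights` table (values are made up)
toy <- data.frame(
    UniqueCarrier = c("AA", "AA", "UA", "UA"),
    Cancelled     = c(0, 1, 0, 0),
    Diverted      = c(0, 0, 1, 0),
    ArrDelay      = c(10, NA, -5, 30),
    DepDelay      = c(5, NA, 0, 45)
)

# for each carrier, the share of flights cancelled or diverted
cancel_rates <- toy %>%
    group_by(UniqueCarrier) %>%
    summarise(across(c(Cancelled, Diverted), mean))

# for each carrier, the minimum and maximum of every column whose name matches "Delay"
delay_ranges <- toy %>%
    group_by(UniqueCarrier) %>%
    summarise(across(matches("Delay"),
                     list(min = ~ min(.x, na.rm = TRUE),
                          max = ~ max(.x, na.rm = TRUE))))
```

With a named list of functions, `across()` names each output column after the input column and the list entry, for example `ArrDelay_min` and `ArrDelay_max`.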
```R
# for each day of the year, count the flights and sort in descending order with the `tally` function
flights %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE)
```

```R
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))
```

### Window Functions

- An aggregation function (like `mean`) takes n inputs and returns 1 value
- A window function takes n inputs and returns n values
- Window functions include ranking and ordering functions (like `min_rank`)

E.g., for each carrier, find the two days of the year with the longest departure delays:

```R
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay))
```

Or

```R
# rewrite more simply with the `top_n` function
# note: without an explicit ordering variable, `top_n` ranks by the last column (here DepDelay)
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    top_n(2) %>%
    arrange(UniqueCarrier, desc(DepDelay))
```

```R
# for each month, calculate the number of flights and the change from the previous month
flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count))
```

- `lag` returns the previous value in the vector
- `lead` returns the next value in the vector

### Transmute - mutate and select combined

`transmute` creates new variables like `mutate`, but keeps only the columns it creates.

---

## Reference

- [Introduction to dplyr](https://rpubs.com/justmarkham/dplyr-tutorial)