# The R Programming Language
R is heavily used in [[self.stats/Statistics|statistics and data science]].
- Five basic `dplyr` verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)
### Filter (AND, OR)
```R
library(dplyr)
# `flights` is assumed to be a data frame already loaded in the session
# note: a comma or an ampersand both express an AND condition
filter(flights, Month == 1, DayofMonth == 1)
```
```R
# use the vertical bar `|` for an OR condition
filter(flights, UniqueCarrier == "AA" | UniqueCarrier == "UA")
```
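An OR condition on a single column can also be written with `%in%`. The following is a minimal sketch, assuming `dplyr` is installed; `toy_flights` is a made-up data frame that only borrows the column names used above.

```R
library(dplyr)

# hypothetical stand-in for `flights`, with the same column names
toy_flights <- data.frame(
  UniqueCarrier = c("AA", "UA", "DL", "AA"),
  Month = c(1, 1, 2, 1),
  DayofMonth = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# OR written explicitly with the vertical bar
or_result <- filter(toy_flights, UniqueCarrier == "AA" | UniqueCarrier == "UA")

# the same rows selected with %in%
in_result <- filter(toy_flights, UniqueCarrier %in% c("AA", "UA"))

# comma and ampersand both mean AND
and_comma <- filter(toy_flights, Month == 1, DayofMonth == 1)
and_amp <- filter(toy_flights, Month == 1 & DayofMonth == 1)
```

`%in%` tends to read better once more than two values are being matched.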
### Select
```R
select(flights, DepTime, ArrTime, FlightNum)
```
```R
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
```
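The other name-matching helpers mentioned in the comment above work the same way as `contains`. A small sketch, assuming `dplyr` is installed; `toy_flights` is a hypothetical two-row data frame built only for illustration.

```R
library(dplyr)

# hypothetical stand-in for `flights`
toy_flights <- data.frame(
  Year = 2011, Month = 1, DayofMonth = 1:2,
  DepTime = c(1400, 1401), ArrTime = c(1500, 1501),
  TaxiIn = c(7, 6), TaxiOut = c(13, 9),
  DepDelay = c(0, 1), ArrDelay = c(-10, -9)
)

# columns whose names start with "Taxi"
taxi_cols <- select(toy_flights, starts_with("Taxi"))

# columns whose names end with "Delay"
delay_cols <- select(toy_flights, ends_with("Delay"))

# columns whose names match a regular expression (here: names ending in "Time")
time_cols <- select(toy_flights, matches("Time$"))
```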
### Chaining | Pipelining
```R
# chaining method
flights %>%
  select(UniqueCarrier, DepDelay) %>%
  filter(DepDelay > 60)
```
```R
# create two vectors and calculate the Euclidean distance between them (nesting method)
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
```
#### vs
```R
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
```
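The pipe passes the value on its left as the first argument of the call on its right, so the nested and chained forms compute the same thing. The sketch below checks this and also shows R's native pipe `|>`, assuming R 4.1 or later and that `dplyr` (which re-exports `%>%`) is installed.

```R
library(dplyr)

x1 <- 1:5; x2 <- 2:6

# nested form: read from the inside out
nested <- sqrt(sum((x1 - x2)^2))

# chained form with the magrittr pipe: read left to right
piped <- (x1 - x2)^2 %>% sum() %>% sqrt()

# same chain with the native pipe (assumes R >= 4.1)
native <- (x1 - x2)^2 |> sum() |> sqrt()
```

Both pipes bind less tightly than `^`, which is why `(x1 - x2)^2` is evaluated in full before being passed to `sum()`.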
### Arrange
```R
# sort by DepDelay in ascending order
flights %>%
  select(UniqueCarrier, DepDelay) %>%
  arrange(DepDelay)
```
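`arrange` sorts in ascending order by default; wrapping a column in `desc` reverses that. A minimal sketch, assuming `dplyr` is installed and using a hypothetical three-row data frame:

```R
library(dplyr)

# hypothetical stand-in for `flights`
toy_flights <- data.frame(
  UniqueCarrier = c("AA", "UA", "DL"),
  DepDelay = c(5, 90, -3)
)

# ascending order (the default)
sorted_asc <- toy_flights %>% arrange(DepDelay)

# descending order
sorted_desc <- toy_flights %>% arrange(desc(DepDelay))
```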
### Mutate - adding new variables
```R
flights %>%
  select(Distance, AirTime) %>%
  mutate(Speed = Distance / AirTime * 60)
```
### Aggregation - reducing variables to values
```R
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
  group_by(Dest) %>%
  summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
```
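The `na.rm = TRUE` argument matters because a single missing value otherwise makes the whole group mean `NA`. A small sketch of the difference, assuming `dplyr` is installed; `toy_flights` and its values are made up for illustration.

```R
library(dplyr)

# hypothetical stand-in for `flights`; destination "A" has one missing delay
toy_flights <- data.frame(
  Dest = c("A", "A", "B"),
  ArrDelay = c(10, NA, 4)
)

# without na.rm, the mean for "A" is NA
with_na <- toy_flights %>%
  group_by(Dest) %>%
  summarise(avg_delay = mean(ArrDelay))

# with na.rm = TRUE, missing values are dropped before averaging
without_na <- toy_flights %>%
  group_by(Dest) %>%
  summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
```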
#### `across` summarising for multiple columns (replaces `summarise_each`)
```R
# for each carrier, calculate the proportion of flights cancelled or diverted
# note: `summarise_each()` and `funs()` are deprecated; dplyr 1.0.0 and later use `across()`
flights %>%
  group_by(UniqueCarrier) %>%
  summarise(across(c(Cancelled, Diverted), mean))
```
```R
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
  group_by(UniqueCarrier) %>%
  summarise(across(matches("Delay"),
                   list(min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))
```
```R
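# for each day of the year, count the number of flights and sort in descending order
# (`tally` is shorthand for `summarise(n = n())` followed by sorting)
flights %>%
  group_by(Month, DayofMonth) %>%
  tally(sort = TRUE)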
```
```R
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
  group_by(Dest) %>%
  summarise(flight_count = n(), plane_count = n_distinct(TailNum))
```
### Window Functions
- Aggregation functions (like `mean`) take n inputs and return 1 value
- Window functions take n inputs and return n values
- Includes ranking and ordering functions (like `min_rank`)
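
The distinction in the list above can be seen on a plain vector. This is a minimal sketch, assuming `dplyr` is installed; the `delays` values are made up.

```R
library(dplyr)

# hypothetical departure delays in minutes
delays <- c(30, 5, 30, 120)

# aggregation: four inputs, one output
avg <- mean(delays)

# window function: four inputs, four outputs
# min_rank gives tied values the same (smallest) rank; desc ranks the largest delay first
ranks <- min_rank(desc(delays))
```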

E.g., for each carrier, find the two days of the year on which they had their longest departure delays:
```R
flights %>%
  group_by(UniqueCarrier) %>%
  select(Month, DayofMonth, DepDelay) %>%
  filter(min_rank(desc(DepDelay)) <= 2) %>%
  arrange(UniqueCarrier, desc(DepDelay))
```
Or
```R
# rewrite more simply with the `top_n` function
# note: `top_n` is superseded by `slice_max` in dplyr 1.0.0 and later
flights %>%
  group_by(UniqueCarrier) %>%
  select(Month, DayofMonth, DepDelay) %>%
  top_n(2, DepDelay) %>%
  arrange(UniqueCarrier, desc(DepDelay))
```
```R
# for each month, calculate the number of flights and the change from the previous month
flights %>%
  group_by(Month) %>%
  summarise(flight_count = n()) %>%
  mutate(change = flight_count - lag(flight_count))
```
- `lag` returns the previous value in the vector (`NA` for the first element)
- `lead` returns the next value in the vector (`NA` for the last element)
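
A short sketch of both functions on a plain vector, assuming `dplyr` is installed; the `flight_count` values are made up. The `dplyr::` prefix is used because base R's `stats` package also has a function named `lag`.

```R
library(dplyr)

# hypothetical monthly flight counts
flight_count <- c(100, 120, 90)

# shift values one position later; the first element has no predecessor
previous <- dplyr::lag(flight_count)

# shift values one position earlier; the last element has no successor
following <- dplyr::lead(flight_count)

# month-over-month change, as in the example above
change <- flight_count - previous
```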
### Transmute
- `transmute` combines `mutate` and `select`: it computes new variables and keeps only those
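
A minimal sketch comparing the two verbs, assuming `dplyr` is installed; `toy_flights` is a hypothetical data frame reusing the column names from the Mutate section.

```R
library(dplyr)

# hypothetical stand-in for `flights`
toy_flights <- data.frame(
  Distance = c(224, 1000),
  AirTime = c(40, 120)
)

# mutate keeps the existing columns and appends Speed
with_mutate <- toy_flights %>%
  mutate(Speed = Distance / AirTime * 60)

# transmute keeps only the newly computed Speed column
with_transmute <- toy_flights %>%
  transmute(Speed = Distance / AirTime * 60)
```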
---
## Reference
- [Introduction to dplyr](https://rpubs.com/justmarkham/dplyr-tutorial)