Introduction
As the nation’s capital, Washington D.C. stands as a vibrant and
densely populated metropolitan hub. To address the city’s need for
efficient transportation in such limited space, SmartBike DC, now known
as Capital Bikeshare, was established with an initial deployment of 10
stations and 120 bicycles. As of now, this bikeshare network extends
throughout the DC metro area, catering to over 3 million riders
annually. The system connects residents to workplaces, recreational
areas, Metro stations, and various points of interest across the region
to fill the gaps where alternative transportation may not be
available.
A significant challenge faced by bikeshare systems is the optimal
redistribution of bikes within the network. Ensuring the availability of
at least one bike and one open slot at every station is crucial, as a
station without bikes or bike parking inconveniences riders, compelling
them to seek alternative transportation. In this initiative, we focus on
developing a predictive model for Capital Bikeshare in the District of
Columbia. The model aims to facilitate bike redistribution by
forecasting demand in 15-minute intervals for the upcoming week.
Departing from traditional truck-based resupply methods, which are
resource-intensive, we propose a novel approach—offering dynamic
incentives to riders for redistributing bikes across the network.
Our analysis is restricted to the District of Columbia because of
computational limitations on model cross validation.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(ggplot2)
library(FNN)
library(modelr)
library(gganimate)
library(gifski)
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
Import Data
Our model development process involves utilizing a bike-trip dataset
from Capital Bikeshare. We extract information on all bikeshare trips, including the names and coordinates of pickup
and dropoff locations for each trip.
data <- read.csv("C:/Users/Owner/Documents/RStudio/PublicPolicyAnalytics/BikeShare/202105-capitalbikeshare-tripdata.csv")
data2 <- data %>%
mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
dcCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2021,
state = "DC",
geometry = TRUE,
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
dcTracts <-
dcCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
dat_census <- st_join(data2 %>%
filter(is.na(start_lat) == FALSE &
is.na(start_lng) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lng) == FALSE) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
dcTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(from_longitude = unlist(map(geometry, 1)),
from_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
st_join(., dcTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
filter(is.na(Origin.Tract) == FALSE,
is.na(Destination.Tract) == FALSE) %>%
as.data.frame() %>%
select(-geometry)
Engineering Feaures, Weather Data
Since climate conditions are important to whether a commuter decides
to travel by bicycle or not, we import 2021 data from the weather
station at Reagan National Airport to create temperature and
precipitation variables for our model. We visualize these data below. We
see a cyclical pattern in temperature and can expect bike ridership to
follow with these dips because of the time of day associated, however we
will also have to compare these to precipitation and temperatures
dropping below a certain amount, and how that effects ridership.
weather.Panel <-
riem_measures(station = "DCA", date_start = "2021-05-01", date_end = "2021-05-31") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
#glimpse(weather.Panel)
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
top="Weather Data - DC Reagan National Airport (DCA) - May 2021")

Import Metro Data
Proximity to Metro stations can indicate bikeshare demand. Therefore,
we extracted DC Metro stations and calculate the distance from each
bikeshare station to its nearest Metro station.
#length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station_id = unique(dat_census$start_station_id)) %>%
left_join(., dat_census %>%
select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude)%>%
distinct() %>%
group_by(start_station_id) %>%
slice(1))
#nrow(study.panel)
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station_id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, dcCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
metroStops <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Transportation_Rail_Bus_WebMercator/MapServer/52/query?outFields=*&where=1%3D1&f=geojson") %>%
dplyr::select(NAME, LINE) %>%
st_transform(crs=4326)
ride.panel <-
ride.panel %>%
mutate(X = from_longitude, Y = from_latitude )%>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")
ride.panel <-
ride.panel %>%
mutate(
station.dist = nn_function(st_coordinates(ride.panel), st_coordinates(metroStops), 1)) %>%
mutate(station.cat = case_when(
station.dist > 0 & station.dist <= 500 ~ "1",
station.dist > 500 & station.dist <= 1000 ~ "3",
station.dist > 1000 ~ "3"))
st_drop_geometry(ride.panel)
Data Exploration
We begin by examining the time and frequency components of our
data.
First, we look at the overall time pattern where there is clearly a
daily cyclical spikes and less activity on weekends. Notice that the
weekend near the 28th of May (Memorial Day) doesn’t have the same dip in
activity.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr., Washington, DC, May 2021",
x="Date",
y="Number of trips")+
plotTheme()

Here, we examine the distributions of trips per station at different
times of the day. They show strong right skews, meaning that the
majority of stations have smaller pickup numbers and a minority of
stations experience a disproportionately higher volume of pickups. There
are stronger skews for AM and overnight ridership, suggesting that
higher pickup volumes occur during mid-day and the PM rush hour.
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station_name, time_of_day) %>%
tally()%>%
group_by(start_station_name, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean # of Hourly Trips Per Station, Washington DC, May 2021",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme()

We also examine ridership by day of the week and weekend vs. weekday.
There appears to be large differences between days of the week,
especially between weekends and weekdays. Notably, weekends have their
spike consistently around late morning to midday (hours 11-16), while
weekdays have an consistent spike after 5pm (hours 17-18). Because of
the overall variation across days of the week, we will employ it as a
variable in our model.
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in DC, by day of the week, May 2021",
x="Hour",
y="Trip Counts")+
plotTheme()

ggplot(dat_census %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in DC - weekend vs weekday, May 2021",
x="Hour",
y="Trip Counts")+
plotTheme()
The map series below shows which bike stations are used more often
throughout the day of the week and time of day. Majority of these areas
are yellow representing little ridership, but when relating back to the
spikes on the weekends and weekdays, we see where these spikes are
occurring. Mid-day spikes on the weekday show red dots closer to the
central part of DC while the weekends presents slightly more activity on
the southern side throughout the day.
ggplot()+
geom_sf(data = dcTracts %>%
st_transform(crs=4326),
color = "gray80",
fill = "gray50")+
geom_point(data = dat_census %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
tally(),
aes(x=from_longitude, y = from_latitude, color = n),
fill = "transparent", alpha = 0.7, size = 0.8)+
scale_colour_viridis(direction = -1, end = 0.95,
discrete = FALSE, option = "B")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station, Washington, DC, May 2021",)+
mapTheme()

Generate time lags
Creating time lag variables provides additional insights into
rideshare demand by capturing variations in usage hours before and
during a given day. Analyzing ride usage in hours before and after can
forecast future activity at rideshare stations. Additionally, it’s
important to factor in holiday effects that disrupt expected demand
patterns on weekends or weekdays, treating them as outliers. For May, we
can use Memorial Day on May 28 as a three-day weekend and examine the
temporal proximity to the holiday. The correlations in these time lags
are notably strong, with a Pearson’s R of 0.92 for the lagHour,
indicating a nearly 1:1 correlation. This high correlation serves as
practical evidence linking ridership usage and holidays.
It aligns with the intuitive understanding that current demand should
resemble tomorrow’s and an hour from now, but expectations equally vary
twelve hours into the future in terms of demand, where we see a value of
-0.59, showing usage 12 hours after a measurement will have an opposite
effect.
ride.panel <-
ride.panel %>%
arrange(start_station_id, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 148,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 0.92
## 2 lag2Hours 0.78
## 3 lag3Hours 0.6
## 4 lag4Hours 0.42
## 5 lag12Hours -0.59
## 6 lag1day 0.78
Animated map
We select bikeshare trips on May 18th from the dataset to visualize
in the following animated map. We chose this date as a relatively
representative date on which there was no precipitation and moderate
temperatures (a high of 78 F). Majority of bike trips throughout the day
remain in the center of DC with frequent lone points on the edges of the
map.
week20 <- dat_census %>%
filter(week == 20 & dotw == "Tue")
week20.panel <-
expand.grid(
interval15 = unique(week20$interval15),
station = unique(dat_census$start_station_id))
ride.animation.data <-
mutate(week20, Trip_Counter = 1) %>%
select(interval15, start_station_id, from_longitude, from_latitude, Trip_Counter) %>%
group_by(interval15, from_longitude, from_latitude, start_station_id) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 1 ~ "1 trip",
Trip_Count > 1 & Trip_Count <= 2 ~ "2 trips",
Trip_Count > 2 & Trip_Count <= 5 ~ "3-5 trips",
Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
Trip_Count > 10 ~ ">10 trips")) %>%
mutate(Trips = fct_relevel(Trips, "0 trips","1 trip","2 trips",
"3-5 trips","6-10 trips",">10 trips"))
rideshare_animation <-
ggplot()+
geom_sf(data = dcTracts %>%
st_transform(crs=4326), color = 'gray80', fill = "gray50")+
geom_point(data = ride.animation.data,
aes(x = from_longitude, y = from_latitude, color = Trips),
size = 1.5,
alpha = 1.5) +
scale_colour_manual(values = palette5) +
labs(title = "Bikeshare trips on May 18th, 2021, Washington, DC",
subtitle = "Time: {current_frame}") +
transition_manual(interval15) +
mapTheme()
animate(rideshare_animation, duration=45, renderer = gifski_renderer())

Modeling
In order to train and test our prediction model, we first split our
data into a training and a test set. Then, we build five linear models
using our training data to predict trip counts. The models’ dependent
variables are as follows:
- Hour of day, day of week, temperature
- Pickup station, day of week, temperature
- Variables in models 1 and 2 + precipitation
- Variables in model 3 + time lags
- Variables in model 4 + distance from Metro station
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station_name + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + station.dist,
data=ride.Train)
Predict for test data
Next, we use our models to predict trip counts on the test data.
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(Model1_TimeOnly = map(.x = data, fit = reg1, .f = model_pred),
Model2_SpaceOnly = map(.x = data, fit = reg2, .f = model_pred),
Model3_Time_Space = map(.x = data, fit = reg3, .f = model_pred),
Model4_TimeSpaceLag = map(.x = data, fit = reg4, .f = model_pred),
Model5_All = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
## # A tibble: 10 × 8
## week data Regression Prediction Observed Absolute_Error MAE sd_AE
## <dbl> <list> <chr> <list> <list> <list> <dbl> <dbl>
## 1 18 <sf> Model1_TimeOnly <dbl> <dbl> <dbl [47,232]> 0.684 0.978
## 2 19 <sf> Model1_TimeOnly <dbl> <dbl> <dbl [55,104]> 0.612 0.929
## 3 18 <sf> Model2_SpaceOnly <dbl> <dbl> <dbl [47,232]> 0.636 0.892
## 4 19 <sf> Model2_SpaceOnly <dbl> <dbl> <dbl [55,104]> 0.608 0.826
## 5 18 <sf> Model3_Time_Spac… <dbl> <dbl> <dbl [47,232]> 0.643 0.871
## 6 19 <sf> Model3_Time_Spac… <dbl> <dbl> <dbl [55,104]> 0.606 0.809
## 7 18 <sf> Model4_Time_Spac… <dbl> <dbl> <dbl [47,232]> 0.532 0.811
## 8 19 <sf> Model4_Time_Spac… <dbl> <dbl> <dbl [55,104]> 0.509 0.769
## 9 18 <sf> Model5_All <dbl> <dbl> <dbl [47,232]> 0.532 0.811
## 10 19 <sf> Model5_All <dbl> <dbl> <dbl [55,104]> 0.509 0.769
Examining Model Errors
Overall, our models are accurate to less than an average of one ride
per hour. The best performing models are those that incorporate time
lags, with a negligible improvement from Model 4 to Model 5.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme()

After plotting predicted vs. observed pickups, we observe that the
correspondence between predicted and observed values is strongest in
models 4 and 5, which is consistent with the previous observations. Each
model does somewhat of a decent job predicting the spike seen at the end
off work-hours on weekdays, but what makes models 4 and 5 best suited is
because they are able to measure that spike. The predicted outcomes for
1,2, and 3 reflect similarly to our observations on weekends, where the
peak was gradual, longer, and more consistent.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id)) %>%
dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bikeshare pickup time series", subtitle = "Washington DC; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme()

Further examining Model 5, our best suited, we observe Mean Absolute
Error (MAE) values when mapped across stations. Notably, errors are more
pronounced for stations in close proximity to downtown DC and the
National Mall. Our next step involves a thorough examination of
space-time errors to gain further insights. Like all the other models,
it underperforms perdictions where the spikes post-workday occur, but
can accurately predict other times of the day. The MAE would have been
exacerbated if using any other model.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "Model5_All") %>%
group_by(start_station_id, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = dcCensus, color = "gray80", fill = "gray50")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", alpha = 0.7, size = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme()

Space-Time Error Evaluation
When contrasting predicted vs observed the pattern goes beyond under
performing the spike but also overpredictions in the overnight. Looking
at the graph, we see that predicting trips have a heavy lean towards the
left side, indicating that there is predictions of more trips happening
overnight when the observed outcomes are smaller. The model is not
capable of explaining the day-night patterns of people that are
reflected in bikeshare activity.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "Model5_All")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme()

We also need to look at these errors of predictions versus
observations spatially, the spikes that the model missed are marked as
the higher MAE values in the National Mall area during midday and PM
rushes. We then look at the overnight where it also has higher MAE in
that same center, where it is predicting small but consistent ride usage
throughout the night. The model tends to take a more conservative
approach throughout the cyclical day, under performing business and over
performing downtime. The model cannot explain these discrepancies alone,
and there may be factors not related to bikeshare specifically that can
show correlations in its predictions. For this analysis we will look
deeper into the AM rush and the attributes errors in terms of
socio-economic factors.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "Model5_All")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = dcCensus, color = "white", fill = "gray80")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size = 0.5, alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "B")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme()

We looked at three factors, median income, public transportation use,
and percentage of white people. Public transportation does not appear to
have a given effect on the predicted values but the white population
& median income show a steady incline where there are larger errors
when there is more white people or wealthier areas. Thus, our model
predicts higher rideshare use in areas with whiter populations or higher
income.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Median_Income = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
unnest() %>%
filter(Regression == "Model5_All")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station_id, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = dcCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="AM Rush errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme()

Cross Validation
To assess the goodness of our model, we use k-fold cross-validation,
normally we would run 100 folds for a proper evaluation. However, due to
computational constraints, we had to settle for 10 folds. The histogram
illustrates the average errors across each fold, providing a summary of
our cross-validation results. The distribution exhibits an approximate
normality, with a central tendency around 1.05. This value represents a
relatively low level of error, underscoring the precision of our
model.
set.seed(8)
cv2<- crossv_kfold(ride.Train, 10)
models <- map(cv2$train,~ lm(Trip_Count ~ start_station_name + hour(interval60)
+ dotw + Temperature
+ Precipitation + lagHour + lag2Hours +lag3Hours
+lag12Hours + lag1day + station.dist,
data=ride.Train))
errs <- map2_dbl(models, cv2$test, rmse)
ggplot()+
geom_histogram(aes(errs),bins=8)+
labs(title="Histogram of k-fold CV errors",
y="Frequency",
x="Mean fold error")+
plotTheme()

remove(models, cv2)
Conclusion
We developed a model forecasting bike demand across DC, factoring in
both time and location. Our goal is to keep every bike station stocked
with at least one bike and have 1 bike slot open at all times, ensuring
riders always find what they need. After analyzing the various factors
and observed patterns, we estimate the bikes riders will pick up,
enabling Capital Bikeshare to proactively plan for station and bike
availability.
In conjunction with data on bike station capacity, this model is
capable to be implemented in a dynamic incentive system that
redistributes bikes across stations by offering lower costs (or even
offsets to future rides) to riders in exchange for a different drop-off
location. The incentive should vary with distance to original
destination, number of bikes already at the target drop-off station as
well as predicted bike demand at that station. Capital Bikeshare can
experiment with the incentive structure to optimize for bike
availability across stations.
---
title: "Bike Share Monitoring in Washington D.C"
author: "Trevor Kapuvari"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: flatly
    toc: true
    toc_float: true
    code_folding: "hide"
    code_download: true
---

# Introduction

As the nation's capital, Washington D.C. stands as a vibrant and densely populated metropolitan hub. To address the city's need for efficient transportation in such limited space, SmartBike DC, now known as Capital Bikeshare, was established with an initial deployment of 10 stations and 120 bicycles. As of now, this bikeshare network extends throughout the DC metro area, catering to over 3 million riders annually. The system connects residents to workplaces, recreational areas, Metro stations, and various points of interest across the region to fill the gaps where alternative transportation may not be available.

A significant challenge faced by bikeshare systems is the optimal redistribution of bikes within the network. Ensuring the availability of at least one bike and one open slot at every station is crucial, as a station without bikes or bike parking inconveniences riders, compelling them to seek alternative transportation. In this initiative, we focus on developing a predictive model for Capital Bikeshare in the District of Columbia. The model aims to facilitate bike redistribution by forecasting demand in 15-minute intervals for the upcoming week. Departing from traditional truck-based resupply methods, which                are resource-intensive, we propose a novel approach—offering dynamic incentives to riders for redistributing bikes across the network.


Our analysis is restricted to the District of Columbia because of computational limitations on model cross validation.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

```{r setup_13, message=FALSE, warning = FALSE}
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(ggplot2)
library(FNN)
library(modelr)
library(gganimate)
library(gifski)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
```


```{r install_census_API_key, warning = FALSE, include=FALSE, eval = TRUE}
# Install Census API Key
tidycensus::census_api_key("811e1f6f8d1f299cb75a0f0c07e01aafd801fa79", overwrite = TRUE)
```

## Import Data

Our model development process involves utilizing a 5-week dataset from Capital Bikeshare. We extract information on all bikeshare trips that occurred in May 2021, including the names and coordinates of pickup and dropoff locations for each trip. 

```{r read_data}
data <- read.csv("C:/Users/Owner/Documents/RStudio/PublicPolicyAnalytics/BikeShare/202105-capitalbikeshare-tripdata.csv")
```

```{r time_bins, message=F, results='hide' }
data2 <- data %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

dcCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "DC",
          geometry = TRUE,
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

dcTracts <- 
  dcCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

dat_census <- st_join(data2 %>% 
          filter(is.na(start_lat) == FALSE &
                   is.na(start_lng) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lng) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        dcTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., dcTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  filter(is.na(Origin.Tract) == FALSE,
         is.na(Destination.Tract) == FALSE) %>%
  as.data.frame() %>%
  select(-geometry)
```

## Engineering Feaures, Weather Data

Since climate conditions are important to whether a commuter decides to travel by bicycle or not, we import May 2021 data from the weather station at Reagan National Airport to create temperature and precipitation variables for our model. We visualize these data below. We see a cyclical pattern in temperature and can expect bike ridership to follow with these dips because of the time of day associated, however we will also have to compare these to precipitation and temperatures dropping below a certain amount, and how that effects ridership. 

```{r import_weather, message = FALSE, warning = FALSE }
weather.Panel <- 
  riem_measures(station = "DCA", date_start = "2021-05-01", date_end = "2021-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#glimpse(weather.Panel)

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - DC Reagan National Airport (DCA) - May 2021")
```

## Import Metro Data

Proximity to Metro stations can indicate bikeshare demand. Therefore, we extracted DC Metro stations and calculate the distance from each bikeshare station to its nearest Metro station.

```{r create space time panel, message = FALSE, warning = FALSE}
#length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(., dat_census %>%
              select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

#nrow(study.panel)      

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, dcCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
```

```{r metrostops, results='hide'}
metroStops <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Transportation_Rail_Bus_WebMercator/MapServer/52/query?outFields=*&where=1%3D1&f=geojson") %>% 
    dplyr::select(NAME, LINE) %>%
  st_transform(crs=4326)

ride.panel <-
  ride.panel %>%
  mutate(X = from_longitude, Y = from_latitude )%>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

ride.panel <-
  ride.panel %>%
  mutate(
    station.dist = nn_function(st_coordinates(ride.panel), st_coordinates(metroStops), 1)) %>%
  mutate(station.cat = case_when(
    station.dist > 0 & station.dist <= 500 ~ "1",
    station.dist > 500 & station.dist <= 1000 ~ "3",
    station.dist > 1000 ~ "3"))

st_drop_geometry(ride.panel)

```

## Data Exploration

We begin by examining the time and frequency components of our data. 

First, we look at the overall time pattern where there is clearly a daily cyclical spikes and less activity on weekends. Notice that the weekend near the 28th of May (Memorial Day) doesn't have the same dip in activity.

```{r trip_timeseries }
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr., Washington, DC, May 2021",
       x="Date", 
       y="Number of trips")+
  plotTheme()
```

Here, we examine the distributions of trips per station at different times of the day. They show strong right skews, meaning that the majority of stations have smaller pickup numbers and a minority of stations experience a disproportionately higher volume of pickups. There are stronger skews for AM and overnight ridership, suggesting that higher pickup volumes occur during mid-day and the PM rush hour.

```{r mean_trips_hist, warning = FALSE, message = FALSE }
dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean # of Hourly Trips Per Station, Washington DC, May 2021",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()
```

We also examine ridership by day of the week and weekend vs. weekday. There appears to be large differences between days of the week, especially between weekends and weekdays. Notably, weekends have their spike consistently around late morning to midday (hours 11-16), while weekdays have an consistent spike after 5pm (hours 17-18). Because of the overall variation across days of the week, we will employ it as a variable in our model.

```{r trips_hour_dotw}
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in DC, by day of the week, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in DC - weekend vs weekday, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()
```
The map series below shows which bike stations are used more often throughout the day of the week and time of day. Majority of these areas are yellow representing little ridership, but when relating back to the spikes on the weekends and weekdays, we see where these spikes are occurring. Mid-day spikes on the weekday show red dots closer to the central part of DC while the weekends presents slightly more activity on the southern side throughout the day. 

```{r origin_map, fig.height=7}
ggplot()+
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326),
          color = "gray80",
          fill = "gray50")+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1, end = 0.95,
                       discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station, Washington, DC, May 2021",)+
  mapTheme()
```


## Generate time lags

Creating time lag variables provides additional insights into rideshare demand by capturing variations in usage hours before and during a given day. Analyzing ride usage in hours before and after can forecast future activity at rideshare stations. Additionally, it's important to factor in holiday effects that disrupt expected demand patterns on weekends or weekdays, treating them as outliers. For May, we can use Memorial Day on May 28 as a three-day weekend and examine the temporal proximity to the holiday. The correlations in these time lags are notably strong, with a Pearson's R of 0.92 for the lagHour, indicating a nearly 1:1 correlation. This high correlation serves as practical evidence linking ridership usage and holidays. 

It aligns with the intuitive understanding that current demand should resemble tomorrow's and an hour from now, but expectations equally vary twelve hours into the future in terms of demand, where we see a value of -0.59, showing usage 12 hours after a measurement will have an opposite effect. 


```{r time_lags , message = FALSE}
ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

```

```{r evaluate_lags , warning = FALSE, message = FALSE}
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
```

## Animated map

We select bikeshare trips on May 18th from the dataset to visualize in the following animated map. We chose this date as a relatively representative date on which there was no precipitation and moderate temperatures (a high of 78 F). Majority of bike trips throughout the day remain in the center of DC with frequent lone points on the edges of the map. 

```{r animated map, message=F, warning=F}
week20 <- dat_census %>%
  filter(week == 20 & dotw == "Tue")

week20.panel <-
  expand.grid(
    interval15 = unique(week20$interval15),
    station = unique(dat_census$start_station_id))

ride.animation.data <-
  mutate(week20, Trip_Counter = 1) %>%
  select(interval15, start_station_id, from_longitude, from_latitude, Trip_Counter) %>%
  group_by(interval15, from_longitude, from_latitude, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 1 ~ "1 trip",
                           Trip_Count > 1 & Trip_Count <= 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 5 ~ "3-5 trips",
                           Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
                           Trip_Count > 10 ~ ">10 trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trip","2 trips",
                              "3-5 trips","6-10 trips",">10 trips"))

rideshare_animation <-
  ggplot()+
  geom_sf(data = dcTracts %>%
            st_transform(crs=4326), color = 'gray80', fill = "gray50")+
  geom_point(data = ride.animation.data, 
             aes(x = from_longitude, y = from_latitude, color = Trips), 
             size = 1.5, 
             alpha = 1.5) +
  scale_colour_manual(values = palette5) +
  labs(title = "Bikeshare trips on May 18th, 2021, Washington, DC",
       subtitle = "Time: {current_frame}") +
  transition_manual(interval15) +
  mapTheme()

animate(rideshare_animation, duration=45, renderer = gifski_renderer())
```

# Modeling

In order to train and test our prediction model, we first split our data into a training and a test set. Then, we build five linear models using our training data to predict trip counts. The models' dependent variables are as follows:

1. Hour of day, day of week, temperature
2. Pickup station, day of week, temperature
3. Variables in models 1 and 2 + precipitation
4. Variables in model 3 + time lags
5. Variables in model 4 + distance from Metro station


```{r train_test - regressions}
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + station.dist, 
     data=ride.Train)
```

## 4.2. Predict for test data

Next, we use our models to predict trip counts on the test data. 

```{r nest data - generate predictions, warning = FALSE, message = FALSE}
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(Model1_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
           Model2_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
           Model3_Time_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           Model4_Time_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           Model5_Full = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
```

## Examining Model Errors

Overall, our models are accurate to less than an average of one ride per hour. The best performing models are those that incorporate time lags, with a negligible improvement from Model 4 to Model 5. 

```{r plot_errors_by_model }
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme()
```

After plotting predicted vs. observed pickups, we observe that the correspondence between predicted and observed values is strongest in models 4 and 5, which is consistent with the previous observations. Each model does somewhat of a decent job predicting the spike seen at the end off work-hours on weekdays, but what makes models 4 and 5 best suited is because they are able to measure that spike. The predicted outcomes for 1,2, and 3 reflect similarly to our observations on weekends, where the peak was gradual, longer, and more consistent. 

```{r error_vs_actual_timeseries , warning = FALSE, message = FALSE, fig.height=8}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bikeshare pickup time series", subtitle = "Washington DC; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme()
```

Further examining Model 5, our best suited, we observe Mean Absolute Error (MAE) values when mapped across stations. Notably, errors are more pronounced for stations in close proximity to downtown DC and the National Mall. Our next step involves a thorough examination of space-time errors to gain further insights. Like all the other models, it underperforms perdictions where the spikes post-workday occur, but can accurately predict other times of the day. The MAE would have been exacerbated if using any other model.

```{r errors_by_station, warning = FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "Model5_Full") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = dcCensus, color = "gray80", fill = "gray50")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme()
```

### Space-Time Error Evaluation

When contrasting predicted vs observed the pattern goes beyond under performing the spike but also overpredictions in the overnight. Looking at the graph, we see that predicting trips have a heavy lean towards the left side, indicating that there is predictions of more trips  happening overnight when the observed outcomes are smaller. The model is not capable of explaining the day-night patterns of people that are reflected in bikeshare activity. 


```{r obs_pred_all, warning=FALSE, message = FALSE, cache=TRUE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()
```

We also need to look at these errors of predictions versus observations spatially, the spikes that the model missed are marked as the higher MAE values in the National Mall area during midday and PM rushes. We then look at the overnight where it also has higher MAE in that same center, where it is predicting small but consistent ride usage throughout the night. The model tends to take a more conservative approach throughout the cyclical day, under performing business and over performing downtime. The model cannot explain these discrepancies alone, and there may be factors not related to bikeshare specifically that  can show correlations in its predictions. For this analysis we will look deeper into the AM rush and the attributes errors in terms of socio-economic factors. 

```{r station_summary, warning=FALSE, message = FALSE, fig.height=7 }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcCensus, color = "white", fill = "gray80")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme()
  
```

We looked at three factors, median income, public transportation use, and percentage of white people. Public transportation does not appear to have a given effect on the predicted values but the white population & median income show a steady incline where there are larger errors when there is more white people or wealthier areas. Thus, our model predicts higher rideshare use in areas with whiter populations or higher income.     

```{r station_summary2, warning=FALSE, message = FALSE, fig.width=10 }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Median_Income = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = dcCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="AM Rush errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme()
  
```



## Cross Validation

To assess the goodness of our model, we use k-fold cross-validation, normally we would run 100 folds for a proper evaluation. However, due to computational constraints, we had to settle for 10 folds. The histogram illustrates the average errors across each fold, providing a summary of our cross-validation results. The distribution exhibits an approximate normality, with a central tendency around 1.05. This value represents a relatively low level of error, underscoring the precision of our model.

```{r cv, cache=T}
set.seed(8)

cv2<- crossv_kfold(ride.Train, 10)

models <- map(cv2$train,~ lm(Trip_Count ~  start_station_name + hour(interval60) 
                             + dotw + Temperature 
                             + Precipitation + lagHour + lag2Hours +lag3Hours 
                             +lag12Hours + lag1day + station.dist, 
                            data=ride.Train))

errs <- map2_dbl(models, cv2$test, rmse)

ggplot()+
  geom_histogram(aes(errs),bins=8)+
  labs(title="Histogram of k-fold CV errors",
       y="Frequency",
       x="Mean fold error")+
  plotTheme()

remove(models, cv2)
```

## Conclusion


We developed a model forecasting bike demand across DC, factoring in both time and location. Our goal is to keep every bike station stocked with at least one bike and have 1 bike slot open at all times, ensuring riders always find what they need. After analyzing the various factors and observed patterns, we estimate the bikes riders will pick up, enabling Capital Bikeshare to proactively plan for station and bike availability. 

In conjunction with data on bike station capacity, this model is capable to be implemented in a dynamic incentive system that redistributes bikes across stations by offering lower costs (or even offsets to future rides) to riders in exchange for a different drop-off location. The incentive should vary with distance to original destination, number of bikes already at the target drop-off station as well as predicted bike demand at that station. Capital Bikeshare can experiment with the incentive structure to optimize for bike availability across stations.  


