Introduction

As the nation’s capital, Washington D.C. stands as a vibrant and densely populated metropolitan hub. To address the city’s need for efficient transportation in such limited space, SmartBike DC, now known as Capital Bikeshare, was established with an initial deployment of 10 stations and 120 bicycles. As of now, this bikeshare network extends throughout the DC metro area, catering to over 3 million riders annually. The system connects residents to workplaces, recreational areas, Metro stations, and various points of interest across the region to fill the gaps where alternative transportation may not be available.

A significant challenge faced by bikeshare systems is the optimal redistribution of bikes within the network. Ensuring the availability of at least one bike and one open slot at every station is crucial, as a station without bikes or bike parking inconveniences riders, compelling them to seek alternative transportation. In this initiative, we focus on developing a predictive model for Capital Bikeshare in the District of Columbia. The model aims to facilitate bike redistribution by forecasting demand in 15-minute intervals for the upcoming week. Departing from traditional truck-based resupply methods, which are resource-intensive, we propose a novel approach—offering dynamic incentives to riders for redistributing bikes across the network.

Our analysis is restricted to the District of Columbia because of computational limitations on model cross validation.

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(ggplot2)
library(FNN)
library(modelr)
library(gganimate)
library(gifski)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

Import Data

Our model development process involves utilizing a bike-trip dataset from Capital Bikeshare. We extract information on all bikeshare trips, including the names and coordinates of pickup and dropoff locations for each trip.

data <- read.csv("C:/Users/Owner/Documents/RStudio/PublicPolicyAnalytics/BikeShare/202105-capitalbikeshare-tripdata.csv")
data2 <- data %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

dcCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "DC",
          geometry = TRUE,
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

dcTracts <- 
  dcCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

dat_census <- st_join(data2 %>% 
          filter(is.na(start_lat) == FALSE &
                   is.na(start_lng) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lng) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        dcTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., dcTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  filter(is.na(Origin.Tract) == FALSE,
         is.na(Destination.Tract) == FALSE) %>%
  as.data.frame() %>%
  select(-geometry)

Engineering Feaures, Weather Data

Since climate conditions are important to whether a commuter decides to travel by bicycle or not, we import 2021 data from the weather station at Reagan National Airport to create temperature and precipitation variables for our model. We visualize these data below. We see a cyclical pattern in temperature and can expect bike ridership to follow with these dips because of the time of day associated, however we will also have to compare these to precipitation and temperatures dropping below a certain amount, and how that effects ridership.

weather.Panel <- 
  riem_measures(station = "DCA", date_start = "2021-05-01", date_end = "2021-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#glimpse(weather.Panel)

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - DC Reagan National Airport (DCA) - May 2021")

Import Metro Data

Proximity to Metro stations can indicate bikeshare demand. Therefore, we extracted DC Metro stations and calculate the distance from each bikeshare station to its nearest Metro station.

#length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(., dat_census %>%
              select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

#nrow(study.panel)      

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, dcCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
metroStops <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Transportation_Rail_Bus_WebMercator/MapServer/52/query?outFields=*&where=1%3D1&f=geojson") %>% 
    dplyr::select(NAME, LINE) %>%
  st_transform(crs=4326)

ride.panel <-
  ride.panel %>%
  mutate(X = from_longitude, Y = from_latitude )%>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

ride.panel <-
  ride.panel %>%
  mutate(
    station.dist = nn_function(st_coordinates(ride.panel), st_coordinates(metroStops), 1)) %>%
  mutate(station.cat = case_when(
    station.dist > 0 & station.dist <= 500 ~ "1",
    station.dist > 500 & station.dist <= 1000 ~ "3",
    station.dist > 1000 ~ "3"))

st_drop_geometry(ride.panel)

Data Exploration

We begin by examining the time and frequency components of our data.

First, we look at the overall time pattern where there is clearly a daily cyclical spikes and less activity on weekends. Notice that the weekend near the 28th of May (Memorial Day) doesn’t have the same dip in activity.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr., Washington, DC, May 2021",
       x="Date", 
       y="Number of trips")+
  plotTheme()

Here, we examine the distributions of trips per station at different times of the day. They show strong right skews, meaning that the majority of stations have smaller pickup numbers and a minority of stations experience a disproportionately higher volume of pickups. There are stronger skews for AM and overnight ridership, suggesting that higher pickup volumes occur during mid-day and the PM rush hour.

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean # of Hourly Trips Per Station, Washington DC, May 2021",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

We also examine ridership by day of the week and weekend vs. weekday. There appears to be large differences between days of the week, especially between weekends and weekdays. Notably, weekends have their spike consistently around late morning to midday (hours 11-16), while weekdays have an consistent spike after 5pm (hours 17-18). Because of the overall variation across days of the week, we will employ it as a variable in our model.

ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in DC, by day of the week, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in DC - weekend vs weekday, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

The map series below shows which bike stations are used more often throughout the day of the week and time of day. Majority of these areas are yellow representing little ridership, but when relating back to the spikes on the weekends and weekdays, we see where these spikes are occurring. Mid-day spikes on the weekday show red dots closer to the central part of DC while the weekends presents slightly more activity on the southern side throughout the day.

ggplot()+
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326),
          color = "gray80",
          fill = "gray50")+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1, end = 0.95,
                       discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station, Washington, DC, May 2021",)+
  mapTheme()

Generate time lags

Creating time lag variables provides additional insights into rideshare demand by capturing variations in usage hours before and during a given day. Analyzing ride usage in hours before and after can forecast future activity at rideshare stations. Additionally, it’s important to factor in holiday effects that disrupt expected demand patterns on weekends or weekdays, treating them as outliers. For May, we can use Memorial Day on May 28 as a three-day weekend and examine the temporal proximity to the holiday. The correlations in these time lags are notably strong, with a Pearson’s R of 0.92 for the lagHour, indicating a nearly 1:1 correlation. This high correlation serves as practical evidence linking ridership usage and holidays.

It aligns with the intuitive understanding that current demand should resemble tomorrow’s and an hour from now, but expectations equally vary twelve hours into the future in terms of demand, where we see a value of -0.59, showing usage 12 hours after a measurement will have an opposite effect.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.92
## 2 lag2Hours         0.78
## 3 lag3Hours         0.6 
## 4 lag4Hours         0.42
## 5 lag12Hours       -0.59
## 6 lag1day           0.78

Animated map

We select bikeshare trips on May 18th from the dataset to visualize in the following animated map. We chose this date as a relatively representative date on which there was no precipitation and moderate temperatures (a high of 78 F). Majority of bike trips throughout the day remain in the center of DC with frequent lone points on the edges of the map.

week20 <- dat_census %>%
  filter(week == 20 & dotw == "Tue")

week20.panel <-
  expand.grid(
    interval15 = unique(week20$interval15),
    station = unique(dat_census$start_station_id))

ride.animation.data <-
  mutate(week20, Trip_Counter = 1) %>%
  select(interval15, start_station_id, from_longitude, from_latitude, Trip_Counter) %>%
  group_by(interval15, from_longitude, from_latitude, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 1 ~ "1 trip",
                           Trip_Count > 1 & Trip_Count <= 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 5 ~ "3-5 trips",
                           Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
                           Trip_Count > 10 ~ ">10 trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trip","2 trips",
                              "3-5 trips","6-10 trips",">10 trips"))

rideshare_animation <-
  ggplot()+
  geom_sf(data = dcTracts %>%
            st_transform(crs=4326), color = 'gray80', fill = "gray50")+
  geom_point(data = ride.animation.data, 
             aes(x = from_longitude, y = from_latitude, color = Trips), 
             size = 1.5, 
             alpha = 1.5) +
  scale_colour_manual(values = palette5) +
  labs(title = "Bikeshare trips on May 18th, 2021, Washington, DC",
       subtitle = "Time: {current_frame}") +
  transition_manual(interval15) +
  mapTheme()

animate(rideshare_animation, duration=45, renderer = gifski_renderer())

Modeling

In order to train and test our prediction model, we first split our data into a training and a test set. Then, we build five linear models using our training data to predict trip counts. The models’ dependent variables are as follows:

  1. Hour of day, day of week, temperature
  2. Pickup station, day of week, temperature
  3. Variables in models 1 and 2 + precipitation
  4. Variables in model 3 + time lags
  5. Variables in model 4 + distance from Metro station
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + station.dist, 
     data=ride.Train)

Predict for test data

Next, we use our models to predict trip counts on the test data.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(Model1_TimeOnly = map(.x = data, fit = reg1, .f = model_pred),
           Model2_SpaceOnly = map(.x = data, fit = reg2, .f = model_pred),
           Model3_Time_Space = map(.x = data, fit = reg3, .f = model_pred),
           Model4_TimeSpaceLag = map(.x = data, fit = reg4, .f = model_pred),
           Model5_All = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
## # A tibble: 10 × 8
##     week data   Regression        Prediction Observed Absolute_Error   MAE sd_AE
##    <dbl> <list> <chr>             <list>     <list>   <list>         <dbl> <dbl>
##  1    18 <sf>   Model1_TimeOnly    <dbl>      <dbl>    <dbl [47,232]> 0.684 0.978
##  2    19 <sf>   Model1_TimeOnly    <dbl>      <dbl>    <dbl [55,104]> 0.612 0.929
##  3    18 <sf>   Model2_SpaceOnly   <dbl>      <dbl>    <dbl [47,232]> 0.636 0.892
##  4    19 <sf>   Model2_SpaceOnly   <dbl>      <dbl>    <dbl [55,104]> 0.608 0.826
##  5    18 <sf>   Model3_Time_Spac… <dbl>      <dbl>    <dbl [47,232]> 0.643 0.871
##  6    19 <sf>   Model3_Time_Spac… <dbl>      <dbl>    <dbl [55,104]> 0.606 0.809
##  7    18 <sf>   Model4_Time_Spac… <dbl>      <dbl>    <dbl [47,232]> 0.532 0.811
##  8    19 <sf>   Model4_Time_Spac… <dbl>      <dbl>    <dbl [55,104]> 0.509 0.769
##  9    18 <sf>   Model5_All       <dbl>      <dbl>    <dbl [47,232]> 0.532 0.811
## 10    19 <sf>   Model5_All       <dbl>      <dbl>    <dbl [55,104]> 0.509 0.769

Examining Model Errors

Overall, our models are accurate to less than an average of one ride per hour. The best performing models are those that incorporate time lags, with a negligible improvement from Model 4 to Model 5.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme()

After plotting predicted vs. observed pickups, we observe that the correspondence between predicted and observed values is strongest in models 4 and 5, which is consistent with the previous observations. Each model does somewhat of a decent job predicting the spike seen at the end off work-hours on weekdays, but what makes models 4 and 5 best suited is because they are able to measure that spike. The predicted outcomes for 1,2, and 3 reflect similarly to our observations on weekends, where the peak was gradual, longer, and more consistent.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bikeshare pickup time series", subtitle = "Washington DC; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme()

Further examining Model 5, our best suited, we observe Mean Absolute Error (MAE) values when mapped across stations. Notably, errors are more pronounced for stations in close proximity to downtown DC and the National Mall. Our next step involves a thorough examination of space-time errors to gain further insights. Like all the other models, it underperforms perdictions where the spikes post-workday occur, but can accurately predict other times of the day. The MAE would have been exacerbated if using any other model.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "Model5_All") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = dcCensus, color = "gray80", fill = "gray50")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme()

Space-Time Error Evaluation

When contrasting predicted vs observed the pattern goes beyond under performing the spike but also overpredictions in the overnight. Looking at the graph, we see that predicting trips have a heavy lean towards the left side, indicating that there is predictions of more trips happening overnight when the observed outcomes are smaller. The model is not capable of explaining the day-night patterns of people that are reflected in bikeshare activity.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_All")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()

We also need to look at these errors of predictions versus observations spatially, the spikes that the model missed are marked as the higher MAE values in the National Mall area during midday and PM rushes. We then look at the overnight where it also has higher MAE in that same center, where it is predicting small but consistent ride usage throughout the night. The model tends to take a more conservative approach throughout the cyclical day, under performing business and over performing downtime. The model cannot explain these discrepancies alone, and there may be factors not related to bikeshare specifically that can show correlations in its predictions. For this analysis we will look deeper into the AM rush and the attributes errors in terms of socio-economic factors.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_All")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcCensus, color = "white", fill = "gray80")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme()

We looked at three factors, median income, public transportation use, and percentage of white people. Public transportation does not appear to have a given effect on the predicted values but the white population & median income show a steady incline where there are larger errors when there is more white people or wealthier areas. Thus, our model predicts higher rideshare use in areas with whiter populations or higher income.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Median_Income = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
    unnest() %>%
  filter(Regression == "Model5_All")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = dcCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="AM Rush errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

Cross Validation

To assess the goodness of our model, we use k-fold cross-validation, normally we would run 100 folds for a proper evaluation. However, due to computational constraints, we had to settle for 10 folds. The histogram illustrates the average errors across each fold, providing a summary of our cross-validation results. The distribution exhibits an approximate normality, with a central tendency around 1.05. This value represents a relatively low level of error, underscoring the precision of our model.

set.seed(8)

cv2<- crossv_kfold(ride.Train, 10)

models <- map(cv2$train,~ lm(Trip_Count ~  start_station_name + hour(interval60) 
                             + dotw + Temperature 
                             + Precipitation + lagHour + lag2Hours +lag3Hours 
                             +lag12Hours + lag1day + station.dist, 
                            data=ride.Train))

errs <- map2_dbl(models, cv2$test, rmse)

ggplot()+
  geom_histogram(aes(errs),bins=8)+
  labs(title="Histogram of k-fold CV errors",
       y="Frequency",
       x="Mean fold error")+
  plotTheme()

remove(models, cv2)

Conclusion

We developed a model forecasting bike demand across DC, factoring in both time and location. Our goal is to keep every bike station stocked with at least one bike and have 1 bike slot open at all times, ensuring riders always find what they need. After analyzing the various factors and observed patterns, we estimate the bikes riders will pick up, enabling Capital Bikeshare to proactively plan for station and bike availability.

In conjunction with data on bike station capacity, this model is capable to be implemented in a dynamic incentive system that redistributes bikes across stations by offering lower costs (or even offsets to future rides) to riders in exchange for a different drop-off location. The incentive should vary with distance to original destination, number of bikes already at the target drop-off station as well as predicted bike demand at that station. Capital Bikeshare can experiment with the incentive structure to optimize for bike availability across stations.

---
title: "Bike Share Monitoring in Washington D.C"
author: "Trevor Kapuvari"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: flatly
    toc: true
    toc_float: true
    code_folding: "hide"
    code_download: true
---

# Introduction

As the nation's capital, Washington D.C. stands as a vibrant and densely populated metropolitan hub. To address the city's need for efficient transportation in such limited space, SmartBike DC, now known as Capital Bikeshare, was established with an initial deployment of 10 stations and 120 bicycles. As of now, this bikeshare network extends throughout the DC metro area, catering to over 3 million riders annually. The system connects residents to workplaces, recreational areas, Metro stations, and various points of interest across the region to fill the gaps where alternative transportation may not be available.

A significant challenge faced by bikeshare systems is the optimal redistribution of bikes within the network. Ensuring the availability of at least one bike and one open slot at every station is crucial, as a station without bikes or bike parking inconveniences riders, compelling them to seek alternative transportation. In this initiative, we focus on developing a predictive model for Capital Bikeshare in the District of Columbia. The model aims to facilitate bike redistribution by forecasting demand in 15-minute intervals for the upcoming week. Departing from traditional truck-based resupply methods, which                are resource-intensive, we propose a novel approach—offering dynamic incentives to riders for redistributing bikes across the network.


Our analysis is restricted to the District of Columbia because of computational limitations on model cross validation.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

```{r setup_13, message=FALSE, warning = FALSE}
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(ggplot2)
library(FNN)
library(modelr)
library(gganimate)
library(gifski)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
```


```{r install_census_API_key, warning = FALSE, include=FALSE, eval = TRUE}
# Install Census API Key
tidycensus::census_api_key("811e1f6f8d1f299cb75a0f0c07e01aafd801fa79", overwrite = TRUE)
```

## Import Data

Our model development process involves utilizing a 5-week dataset from Capital Bikeshare. We extract information on all bikeshare trips that occurred in May 2021, including the names and coordinates of pickup and dropoff locations for each trip. 

```{r read_data}
data <- read.csv("C:/Users/Owner/Documents/RStudio/PublicPolicyAnalytics/BikeShare/202105-capitalbikeshare-tripdata.csv")
```

```{r time_bins, message=F, results='hide' }
data2 <- data %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

dcCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "DC",
          geometry = TRUE,
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

dcTracts <- 
  dcCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

dat_census <- st_join(data2 %>% 
          filter(is.na(start_lat) == FALSE &
                   is.na(start_lng) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lng) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        dcTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., dcTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  filter(is.na(Origin.Tract) == FALSE,
         is.na(Destination.Tract) == FALSE) %>%
  as.data.frame() %>%
  select(-geometry)
```

## Engineering Feaures, Weather Data

Since climate conditions are important to whether a commuter decides to travel by bicycle or not, we import May 2021 data from the weather station at Reagan National Airport to create temperature and precipitation variables for our model. We visualize these data below. We see a cyclical pattern in temperature and can expect bike ridership to follow with these dips because of the time of day associated, however we will also have to compare these to precipitation and temperatures dropping below a certain amount, and how that effects ridership. 

```{r import_weather, message = FALSE, warning = FALSE }
weather.Panel <- 
  riem_measures(station = "DCA", date_start = "2021-05-01", date_end = "2021-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#glimpse(weather.Panel)

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - DC Reagan National Airport (DCA) - May 2021")
```

## Import Metro Data

Proximity to Metro stations can indicate bikeshare demand. Therefore, we extracted DC Metro stations and calculate the distance from each bikeshare station to its nearest Metro station.

```{r create space time panel, message = FALSE, warning = FALSE}
#length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(., dat_census %>%
              select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

#nrow(study.panel)      

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, dcCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
```

```{r metrostops, results='hide'}
metroStops <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Transportation_Rail_Bus_WebMercator/MapServer/52/query?outFields=*&where=1%3D1&f=geojson") %>% 
    dplyr::select(NAME, LINE) %>%
  st_transform(crs=4326)

ride.panel <-
  ride.panel %>%
  mutate(X = from_longitude, Y = from_latitude )%>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

ride.panel <-
  ride.panel %>%
  mutate(
    station.dist = nn_function(st_coordinates(ride.panel), st_coordinates(metroStops), 1)) %>%
  mutate(station.cat = case_when(
    station.dist > 0 & station.dist <= 500 ~ "1",
    station.dist > 500 & station.dist <= 1000 ~ "3",
    station.dist > 1000 ~ "3"))

st_drop_geometry(ride.panel)

```

## Data Exploration

We begin by examining the time and frequency components of our data. 

First, we look at the overall time pattern where there is clearly a daily cyclical spikes and less activity on weekends. Notice that the weekend near the 28th of May (Memorial Day) doesn't have the same dip in activity.

```{r trip_timeseries }
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr., Washington, DC, May 2021",
       x="Date", 
       y="Number of trips")+
  plotTheme()
```

Here, we examine the distributions of trips per station at different times of the day. They show strong right skews, meaning that the majority of stations have smaller pickup numbers and a minority of stations experience a disproportionately higher volume of pickups. There are stronger skews for AM and overnight ridership, suggesting that higher pickup volumes occur during mid-day and the PM rush hour.

```{r mean_trips_hist, warning = FALSE, message = FALSE }
dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean # of Hourly Trips Per Station, Washington DC, May 2021",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()
```

We also examine ridership by day of the week and weekend vs. weekday. There appears to be large differences between days of the week, especially between weekends and weekdays. Notably, weekends have their spike consistently around late morning to midday (hours 11-16), while weekdays have an consistent spike after 5pm (hours 17-18). Because of the overall variation across days of the week, we will employ it as a variable in our model.

```{r trips_hour_dotw}
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in DC, by day of the week, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in DC - weekend vs weekday, May 2021",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()
```
The map series below shows which bike stations are used more often throughout the day of the week and time of day. Majority of these areas are yellow representing little ridership, but when relating back to the spikes on the weekends and weekdays, we see where these spikes are occurring. Mid-day spikes on the weekday show red dots closer to the central part of DC while the weekends presents slightly more activity on the southern side throughout the day. 

```{r origin_map, fig.height=7}
ggplot()+
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326),
          color = "gray80",
          fill = "gray50")+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1, end = 0.95,
                       discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station, Washington, DC, May 2021",)+
  mapTheme()
```


## Generate time lags

Creating time lag variables provides additional insights into rideshare demand by capturing variations in usage hours before and during a given day. Analyzing ride usage in hours before and after can forecast future activity at rideshare stations. Additionally, it's important to factor in holiday effects that disrupt expected demand patterns on weekends or weekdays, treating them as outliers. For May, we can use Memorial Day on May 28 as a three-day weekend and examine the temporal proximity to the holiday. The correlations in these time lags are notably strong, with a Pearson's R of 0.92 for the lagHour, indicating a nearly 1:1 correlation. This high correlation serves as practical evidence linking ridership usage and holidays. 

It aligns with the intuitive understanding that current demand should resemble tomorrow's and an hour from now, but expectations equally vary twelve hours into the future in terms of demand, where we see a value of -0.59, showing usage 12 hours after a measurement will have an opposite effect. 


```{r time_lags , message = FALSE}
ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

```

```{r evaluate_lags , warning = FALSE, message = FALSE}
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
```

## Animated map

We select bikeshare trips on May 18th from the dataset to visualize in the following animated map. We chose this date as a relatively representative date on which there was no precipitation and moderate temperatures (a high of 78 F). Majority of bike trips throughout the day remain in the center of DC with frequent lone points on the edges of the map. 

```{r animated map, message=F, warning=F}
week20 <- dat_census %>%
  filter(week == 20 & dotw == "Tue")

week20.panel <-
  expand.grid(
    interval15 = unique(week20$interval15),
    station = unique(dat_census$start_station_id))

ride.animation.data <-
  mutate(week20, Trip_Counter = 1) %>%
  select(interval15, start_station_id, from_longitude, from_latitude, Trip_Counter) %>%
  group_by(interval15, from_longitude, from_latitude, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 1 ~ "1 trip",
                           Trip_Count > 1 & Trip_Count <= 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 5 ~ "3-5 trips",
                           Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
                           Trip_Count > 10 ~ ">10 trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trip","2 trips",
                              "3-5 trips","6-10 trips",">10 trips"))

rideshare_animation <-
  ggplot()+
  geom_sf(data = dcTracts %>%
            st_transform(crs=4326), color = 'gray80', fill = "gray50")+
  geom_point(data = ride.animation.data, 
             aes(x = from_longitude, y = from_latitude, color = Trips), 
             size = 1.5, 
             alpha = 1.5) +
  scale_colour_manual(values = palette5) +
  labs(title = "Bikeshare trips on May 18th, 2021, Washington, DC",
       subtitle = "Time: {current_frame}") +
  transition_manual(interval15) +
  mapTheme()

animate(rideshare_animation, duration=45, renderer = gifski_renderer())
```

# Modeling

In order to train and test our prediction model, we first split our data into a training and a test set. Then, we build five linear models using our training data to predict trip counts. The models' dependent variables are as follows:

1. Hour of day, day of week, temperature
2. Pickup station, day of week, temperature
3. Variables in models 1 and 2 + precipitation
4. Variables in model 3 + time lags
5. Variables in model 4 + distance from Metro station


```{r train_test - regressions}
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + station.dist, 
     data=ride.Train)
```

## 4.2. Predict for test data

Next, we use our models to predict trip counts on the test data. 

```{r nest data - generate predictions, warning = FALSE, message = FALSE}
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(Model1_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
           Model2_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
           Model3_Time_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           Model4_Time_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           Model5_Full = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
```

## Examining Model Errors

Overall, our models are accurate to less than an average of one ride per hour. The best performing models are those that incorporate time lags, with a negligible improvement from Model 4 to Model 5. 

```{r plot_errors_by_model }
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme()
```

After plotting predicted vs. observed pickups, we observe that the correspondence between predicted and observed values is strongest in models 4 and 5, which is consistent with the previous observations. Each model does somewhat of a decent job predicting the spike seen at the end off work-hours on weekdays, but what makes models 4 and 5 best suited is because they are able to measure that spike. The predicted outcomes for 1,2, and 3 reflect similarly to our observations on weekends, where the peak was gradual, longer, and more consistent. 

```{r error_vs_actual_timeseries , warning = FALSE, message = FALSE, fig.height=8}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bikeshare pickup time series", subtitle = "Washington DC; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme()
```

Further examining Model 5, our best suited, we observe Mean Absolute Error (MAE) values when mapped across stations. Notably, errors are more pronounced for stations in close proximity to downtown DC and the National Mall. Our next step involves a thorough examination of space-time errors to gain further insights. Like all the other models, it underperforms perdictions where the spikes post-workday occur, but can accurately predict other times of the day. The MAE would have been exacerbated if using any other model.

```{r errors_by_station, warning = FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "Model5_Full") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = dcCensus, color = "gray80", fill = "gray50")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.7, size = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme()
```

### Space-Time Error Evaluation

When contrasting predicted vs observed the pattern goes beyond under performing the spike but also overpredictions in the overnight. Looking at the graph, we see that predicting trips have a heavy lean towards the left side, indicating that there is predictions of more trips  happening overnight when the observed outcomes are smaller. The model is not capable of explaining the day-night patterns of people that are reflected in bikeshare activity. 


```{r obs_pred_all, warning=FALSE, message = FALSE, cache=TRUE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()
```

We also need to look at these errors of predictions versus observations spatially, the spikes that the model missed are marked as the higher MAE values in the National Mall area during midday and PM rushes. We then look at the overnight where it also has higher MAE in that same center, where it is predicting small but consistent ride usage throughout the night. The model tends to take a more conservative approach throughout the cyclical day, under performing business and over performing downtime. The model cannot explain these discrepancies alone, and there may be factors not related to bikeshare specifically that  can show correlations in its predictions. For this analysis we will look deeper into the AM rush and the attributes errors in terms of socio-economic factors. 

```{r station_summary, warning=FALSE, message = FALSE, fig.height=7 }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcCensus, color = "white", fill = "gray80")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme()
  
```

We looked at three factors, median income, public transportation use, and percentage of white people. Public transportation does not appear to have a given effect on the predicted values but the white population & median income show a steady incline where there are larger errors when there is more white people or wealthier areas. Thus, our model predicts higher rideshare use in areas with whiter populations or higher income.     

```{r station_summary2, warning=FALSE, message = FALSE, fig.width=10 }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Median_Income = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
    unnest() %>%
  filter(Regression == "Model5_Full")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Median_Income, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = dcCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="AM Rush errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme()
  
```



## Cross Validation

To assess the goodness of our model, we use k-fold cross-validation, normally we would run 100 folds for a proper evaluation. However, due to computational constraints, we had to settle for 10 folds. The histogram illustrates the average errors across each fold, providing a summary of our cross-validation results. The distribution exhibits an approximate normality, with a central tendency around 1.05. This value represents a relatively low level of error, underscoring the precision of our model.

```{r cv, cache=T}
set.seed(8)

cv2<- crossv_kfold(ride.Train, 10)

models <- map(cv2$train,~ lm(Trip_Count ~  start_station_name + hour(interval60) 
                             + dotw + Temperature 
                             + Precipitation + lagHour + lag2Hours +lag3Hours 
                             +lag12Hours + lag1day + station.dist, 
                            data=ride.Train))

errs <- map2_dbl(models, cv2$test, rmse)

ggplot()+
  geom_histogram(aes(errs),bins=8)+
  labs(title="Histogram of k-fold CV errors",
       y="Frequency",
       x="Mean fold error")+
  plotTheme()

remove(models, cv2)
```

## Conclusion


We developed a model forecasting bike demand across DC, factoring in both time and location. Our goal is to keep every bike station stocked with at least one bike and have 1 bike slot open at all times, ensuring riders always find what they need. After analyzing the various factors and observed patterns, we estimate the bikes riders will pick up, enabling Capital Bikeshare to proactively plan for station and bike availability. 

In conjunction with data on bike station capacity, this model is capable to be implemented in a dynamic incentive system that redistributes bikes across stations by offering lower costs (or even offsets to future rides) to riders in exchange for a different drop-off location. The incentive should vary with distance to original destination, number of bikes already at the target drop-off station as well as predicted bike demand at that station. Capital Bikeshare can experiment with the incentive structure to optimize for bike availability across stations.  


