Motivation

Trains are a timeless mode of transportation that has revolutionized the world. As we aim to improve their reliability and accessibility, it is crucial to understand their performance records and allow riders to stay informed on travel updates, train reliability, and enable them to plan their trips accordingly. Hence, we have dedicated this brief to developing a two-part prediction model to forecast train delays.

This analysis serves as an experimental use-case that can be replicated to other transit regions in the United States and further enable inter-state travel in order to efficiently utilize train networks.

Data Sources

Our initial dataset is derived from Kaggle’s NJ Transit & Amtrak Rail Performance. Due to the high volume of data, we narrowed the scope of the model to focus on April & May 2020. The data features Amtrak trains between New York & New Jersey, as well as recording lines from NJ Transit. This sample provides insight on performance of train-use between two major metropolitan areas, New York & Philadelphia. Within the data we are able to explore the individual train’s scheduled time, actual time, planned stations, and sequence in train line over the course of two months. That information allows our model to predict a delay based on its past performance and time of the week/day to detect any temporal patterns that may exist.

Source: https://www.kaggle.com/datasets/pranavbadami/nj-transit-amtrak-nec-performance

if (!dir.exists("data/Kaggle_transit&amtrak_data/")){
unzip("data/Kaggle_transit&amtrak_data.zip",exdir="data/Kaggle_transit&amtrak_data")
  
  # archive <- archive("data/testing/Kaggle_transit&amtrak_data.7z")
  # trial <- archive_read(archive=archive,file=archive$path[28],format='7zip')
}

datApr2020 <- read.csv("data/Kaggle_transit&amtrak_data/2020_04.csv")
datMay2020 <- read.csv("data/Kaggle_transit&amtrak_data/2020_05.csv")


datApr2020 <- datApr2020 %>%
  mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"),
         interval1 = floor_date(ymd_hms(scheduled_time), unit = "min"),
         week = week(interval60),
         dotw = wday(interval60,label=T),
         delayed = ifelse(delay_minutes > 0,TRUE,FALSE))

datMay2020 <- datMay2020 %>%
  mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"),
         interval1 = floor_date(ymd_hms(scheduled_time), unit = "min"),
         week = week(interval60),
         dotw = wday(interval60,label=T),
         delayed = ifelse(delay_minutes > 0,TRUE,FALSE))

nj_trans_stations <- st_read("data/Rail_Stations_of_NJ_Transit.geojson")%>% st_transform(crs="ESRI:102711")

nj_trans_lines <- st_read("data/Rail_Lines_of_NJ_Transit.geojson")%>% st_transform(crs="ESRI:102711")


may_w_stations <- merge(x= nj_trans_stations,y=datMay2020,by.x="STATION_ID",by.y="from")%>%
  rename(from_station = STATION_ID,
         fromLat = LATITUDE, fromLon = LONGITUDE, 
         from_atis_id = ATIS_ID, from_muni = MUNICIPALITY,
         from_rail_service = RAIL_SERVICE)%>%
  select(-OBJECTID_1,-LINE_CODE)%>%
  st_drop_geometry()%>%
  merge(y=nj_trans_stations,by.x="to",by.y="STATION_ID")%>%
  rename(to_station = to,
         toLat = LATITUDE, toLon = LONGITUDE, 
         to_atis_id = ATIS_ID, to_muni = MUNICIPALITY,
         to_rail_service = RAIL_SERVICE) %>%
  select(-OBJECTID_1)%>%
  st_drop_geometry() %>%
  merge(y=nj_trans_lines,by="LINE_CODE")%>%
  select(-OBJECTID)%>%
  st_drop_geometry() %>%
  st_set_geometry(value="geometry.x")

apr_w_stations <- merge(x= nj_trans_stations,y=datApr2020,by.x="STATION_ID",by.y="from")%>%
  rename(from_station = STATION_ID,
         fromLat = LATITUDE, fromLon = LONGITUDE, 
         from_atis_id = ATIS_ID, from_muni = MUNICIPALITY,
         from_rail_service = RAIL_SERVICE)%>%
  select(-OBJECTID_1,-LINE_CODE)%>%
  st_drop_geometry()%>%
  merge(y=nj_trans_stations,by.x="to",by.y="STATION_ID")%>%
  rename(to_station = to,
         toLat = LATITUDE, toLon = LONGITUDE, 
         to_atis_id = ATIS_ID, to_muni = MUNICIPALITY,
         to_rail_service = RAIL_SERVICE) %>%
  select(-OBJECTID_1)%>%
  st_drop_geometry() %>%
  merge(y=nj_trans_lines,by="LINE_CODE")%>%
  # rename(toLat = LATITUDE, toLon = LONGITUDE, 
  #        to_atis_id = ATIS_ID, to_muni = MUNICIPALITY,
  #        to_rail_service = RAIL_SERVICE,
  #        to_line_code = LINE_CODE) %>%
  select(-OBJECTID)%>%
  st_drop_geometry() %>%
  st_set_geometry(value="geometry.x")

delayprobability <- datMay2020 %>%
  mutate(delay_prob = delay_minutes * 0)

delayprobability <- glm(delay_prob ~ date + train_id + stop_sequence + from_id + to_id + week, data = delayprobability, family="binomial")
summary(delayprobability)

rm(datMay2020,datApr2020)

Additional data from the American Community Survey (ACS) provided us insight on commuter data for New York, New Jersey, and southeast Pennsylvania area. The data provided us the sense of urgency and importance these rail lines are to commuters. Areas surrounding New York City greatly rely on trains and public transit for commuting or everyday travel as seen with recent articles detailing NJ Transit approaching pre-COVID levels while other areas have struggled to reclaim riders. This example with NJ Transit emphasizes the importance of the regional train network and the need to monitor delays and statuses of each line.

census <- 
  get_acs(geography = "county", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          geometry = TRUE, 
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E)%>%
  st_transform(crs="EPSG:4326")

tracts <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          geometry = TRUE,
          state="NJ",
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E)%>%
  st_transform(crs="EPSG:4326")

threeStates = census%>%filter(grepl("New Jersey",NAME)|grepl("Pennsylvania",NAME)|grepl("New York",NAME))
nj = census%>%filter(grepl("New Jersey",NAME))

rm(census)

Means of Transport

This maps indicates the importance of using NYC and NJ Transit as our initial case-study area for our model. Commuting by train is underused in most areas of the country especially when distant from any major cities. New Jersey has a special case of being between two major metropolitan areas, New York and Philadelphia. This location makes the state optimal for long-distance train studies. Even when the majority of New York State and Pennsylvania have lower numbers of commuters, denoted by blue, New Jersey’s roughly purple palette proves the importance of trains in the regional economy.

ggplot()+
  geom_sf(data=threeStates,aes(fill = Means_of_Transport))+
  labs(title = "Daily Passenger Train Commuters", fill = "Average Daily Commuters by Rail")+
  scale_y_continuous(labels = scales::comma_format())+ #why not comma separation wtf 
  scale_fill_gradient(low = "royalblue", high = "red")+ 
  plotTheme()+
  theme(panel.background=element_rect(
        fill = "#fcfcf4",
        colour = NA),
        panel.grid.major=element_blank(),
        axis.text=element_blank())

# length(unique(may_w_stations$interval60)) * length(unique(may_w_stations$from_id))


study.panelM <- 
  expand.grid(interval60=unique(may_w_stations$interval60), 
              from_id = unique(may_w_stations$from_id)) %>%
  left_join(., may_w_stations %>%
              # changed fromLon & Lat to fromCountyLon & Lat
              select(train_id, from_id, line,delay_minutes)%>% # Origin.Tract,
              distinct() %>%
              group_by(from_id) %>%
              slice(1))

study.panelA <- 
  expand.grid(interval60=unique(apr_w_stations$interval60), 
              from_id = unique(apr_w_stations$from_id)) %>%
  left_join(., apr_w_stations %>%
              # changed fromLon & Lat to fromCountyLon & Lat
              select(train_id, from_id, line,delay_minutes)%>% # Origin.Tract,
              distinct() %>%
              group_by(from_id) %>%
              slice(1))


ride.panelM <- may_w_stations %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panelM) %>% 
  # changed fromLon & Lat to fromCountyLon & Lat
  group_by(interval60, from_id, to_id, from_station, to_station, delay_minutes,SHAPE_Length,status,from_rail_service) %>%
  rename(line_length = SHAPE_Length) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  ungroup() %>%
  filter(is.na(from_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  # filter(is.na(Origin.County) == FALSE) %>%
  # left_join(study.panel, threeStates %>%
  #             as.data.frame()) %>% # extra parenthesis
              # select(-geometry), by = c("Origin.Tract" = "GEOID")) %>%
  arrange(from_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(delay_minutes,1),
         lag2Hours = dplyr::lag(delay_minutes,2),
         lag3Hours = dplyr::lag(delay_minutes,3),
         lag4Hours = dplyr::lag(delay_minutes,4),
         lag12Hours = dplyr::lag(delay_minutes,12),
         lag1day = dplyr::lag(delay_minutes,24)) %>%
         # # Indigenous Peoples' (Columbus) Day is federally recognized
         # #  but not an official city holiday of Austin
         # ## I include it to account for holiday lag
         # holiday = ifelse(yday(interval60) == 282,1,0)) %>%
   mutate(day = yday(interval60)) # %>%
   # mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
   #                               dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
   #                               dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
   #                               dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
   #                               dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
   #                               dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
   #       holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

ride.panelA <- apr_w_stations %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panelA) %>% 
  # changed fromLon & Lat to fromCountyLon & Lat
  group_by(interval60, from_id, to_id, from_station, to_station, delay_minutes,SHAPE_Length,status,from_rail_service) %>%
  rename(line_length = SHAPE_Length) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  ungroup() %>%
  filter(is.na(from_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  # filter(is.na(Origin.County) == FALSE) %>%
  # left_join(study.panel, threeStates %>%
  #             as.data.frame()) %>% # extra parenthesis
              # select(-geometry), by = c("Origin.Tract" = "GEOID")) %>%
  arrange(from_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(delay_minutes,1),
         lag2Hours = dplyr::lag(delay_minutes,2),
         lag3Hours = dplyr::lag(delay_minutes,3),
         lag4Hours = dplyr::lag(delay_minutes,4),
         lag12Hours = dplyr::lag(delay_minutes,12),
         lag1day = dplyr::lag(delay_minutes,24)) %>%
         # # Indigenous Peoples' (Columbus) Day is federally recognized
         # #  but not an official city holiday of Austin
         # ## I include it to account for holiday lag
         # holiday = ifelse(yday(interval60) == 282,1,0)) %>%
   mutate(day = yday(interval60)) # %>%
   # mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
   #                               dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
   #                               dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
   #                               dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
   #                               dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
   #                               dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
   #       holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

New Jersey Transit Map

The map below depicts New Jersey Transit lines that we predicted delays for in the model. Predictions within the model ranged between 1 minute to almost 2 hours, for the purpose of the analysis and due to available data, we are focusing our attention of these lines.

ggplot()+
  geom_sf(data=nj%>%st_transform(st_crs(nj_trans_stations)),color = "white",fill="darkgray")+
  geom_sf(data=nj_trans_stations, color="black")+
  geom_sf(data=nj_trans_lines,aes(color = LINE_NAME))+
  labs(title = "New Jersey Transit Map", color = "Train Line")+
  plotTheme()+
  theme(panel.background=element_rect(
        fill = "#fcfcf4",
        colour = NA),
        panel.grid.major=element_blank(),
        axis.text=element_blank())

Exploratory Analysis

Temporal Train Cycle

The graph below shows a temporal cycle of train trips by NJ Transit. One important factor when accounting for delays is the time of day. The pattern between peak hours and when there are few trains can assist the model in detecting patterns in delays caused by time, and how greatly that effects the probability or length of the delay.

ggplot(may_w_stations %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Train Trips in NYC Area (NJ Transit), May 2020",
       x="Date",
       y="Number of Trips")+
  plotTheme()+
  coord_cartesian(ylim=c(0,300))

Delay Count

The value of trains to the region also is rooted in their reliability and people’s understanding of such. Recorded NJ Transit train trips in May 2020 shows that 60452 of the total 63406 trips did not arrive at scheduled times–roughly 70% of all trips. This graph records all delays regardless of severity, but still means there was a higher likelihood of a commuter’s trip being delayed than not. It is important to understand some delays recorded were negligible (less than 10 minutes), yet others were almost 2 hours. Distinguishing between negligible delays is what riders would find useful before they occurs, emphasizing the need for a reliable method of predicting the impact of a delay alongside its probability.

ggplot(may_w_stations,mapping=aes(x=delayed))+
  geom_bar(aes(fill=delayed,show.legend=F))+
  geom_text(aes(label=..count..),stat="count",color="white",vjust=1.5)+
  labs(title="Delayed Train Trip Count for NJ Transit, May 2020",
       y="Number of trips",
       x = "Delayed")+
  guides(fill=guide_legend(title = "Delayed Train?"))+
  scale_color_manual(labels=c("False","True","Amtrak"),
                     values = c(GRAY9,BLUE3,GREEN3) # did nothing?
                     )+
  plotTheme()+
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_blank())

Tracking Station Business

The graph below shows that the majority of stations each day only saw one or two trains every hour. This frequency seen at majority of stations indicates a level of importance the timing these trains have in terms of efficiency. Large delays for stops that do not have multiple trains can result in riders having few to no alternative route options and succumbing to the lateness brought by delays. Places that see more trains are also jeopardized by delays for different reasons. If large enough, delays can create rail congestion and in turn propagate the delay to other trains that would have otherwise been timely.

ggplot(may_w_stations %>%
         group_by(interval60, from_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1, color = GRAY8)+
  labs(title="Train Trips per Hour. NJ Transit & Amtrak Area, May 2020",
       x="Trip Counts", 
       y="Number of Stops")+
  plotTheme()+
  coord_cartesian(xlim=c(0,10))

Station Depature Tracking

The cleaned data provides a comprehensive list of all stations and the amount of departures they have every 60 minutes, we use these amounts to measure business and tracking individuals trains in the model, predicting if a delay on one station will consequently cause additional delays later in the sequence.

Predicting Delay Time and Significance

As data about the NJ Transit train infrastructure and specific trips were analyzed alongside introducing relevant outside information, the decision for two predictive models–one linear model and a binomial logistic regression model–was made followed by their constituent features. The aim of our two models is to predict the length of the delays and the probability of that delay occurring. Train riders desire information regarding delays when it can alter trip plans and is a likely scenario. For our metrics, we are looking at delays that are greater than 10 minutes and have a 50% chance or higher of being delayed. For our approach, we created multiple linear models that would assess features such as the time of the delay, location, temporal lags, and combinations of such.

Model A focused on previous delays, day of the week, and measuring time lags at 60 minute intervals.

Model B focused on station location and day of the week.

Model C focused on day of the week and time lags of 1, 2, 3, 12, and 24 hour intervals.

Model D incorporated Models A,B, and C into one.

  • The Linear Model

As stated before, the severity of a delay is important in helping riders make transportation decisions and preemptively evaluating the reliability of a specific NJ Transit train trip. For that reason, the linear model built hear will train on data for a month before the time period it is being used to predict for. Predictions on the number of minutes a specific trip is delayed are the main aim of this model with four different regression models being compared to determine the best combination of temporal and NJ Transit features to include.

We limited the inclusion of spatial data and features separated by areal units like spatial units with the belief that the individual rider and demographics would not have a direct effect on train timeliness with the exception of a major disruption or being in a position to enact some change in NJ Transit’s policy. Likewise, the brief window of data our model trains on led us to preclude weather data with its minimal immediate effect on speed or boardings and alightments. Any additional features were mainly derived from the existing collection of data such as the scheduled datetime for a trip, the start and stop station, and the line the trip would run on. Time based lags for given frames of time like an hour or day before were the set of newly introduced features.

  • The Logistic Model

This set of predictive models is incorporated into our user or rider-centric use case by notifying riders of significant delays in the TrainTracker application. Provided that a 5-minute delayed train is still delayed, we use a logistic model with the same features as the linear model to anticipate if a given trip is likely to be significant. We chose any delay of 10 minutes or more to be worthy of a notification to users alongside recommendation to request more detailed prediction in addition to the standard prediction done at the beginning of each week for all trips. That is to say that this model predicts if any trip will have a predicted delay of 10 minutes or more.

ride.Train <- ride.panelA
ride.Test <- ride.panelM

# ride.Full <- rbind(ride.Train,ride.Test)
# fullPartition <- createDataPartition(
#               y = paste(ride.Full$line), 
#               p = 0.6060172, list = FALSE)

The regression models created from the April 2020 training data include …

  • Model A - Simplest

  • This model includes the minimum of features. Namely, the hour of the trip’s scheduled arrival time and the day of the week (categorical).

  • Model B - Simple

  • This model includes the origin and destination station ideas alongside the day of the week only. Given the somewhat consistent number of trips across hours of the day as seen in the Train Trips in NYC Area line graph, we test the idea that the hour of the day is extraneous temporal data accounted for with the day of the week.

  • Model C - Temporal

  • This model includes the trip hour and day of the week alongside multiple time lags. Specifically, time lags for one hour, two hours, three hours, 12 hours, and 1 day are included providing only temporal predictors in this model.

  • Model D - Complex

  • This model includes the previously mentioned time lags and trip node IDs in addition to categorical data on the trip status–if it has departed or if its scheduled arrival time is estimated. With the potential for train unique train lines to have no data within specific months, the length of the line the trip uses is also added to incorporate consideration of the train line. The training and testing data presented occurs during the beginning of the COVID pandemic in the US (April and May 2020), but our proposed use of the model entails training data 28-35 days (1 month) before the section of time being predicted for to reduce overhead in the time taken to train the models. To that end, the process can be reproduced easily.

# uniqueLine <- ride.Test[ride.Test$line=="SILVER STAR  -R",]
# i <- sample(1:nrow(uniqueLine),size=1)
# ride.Train <- rbind(ride.Train,uniqueLine[i,])
# ride.All <- rbind(ride.Train,ride.Test)

reg1 <- 
  lm(delay_minutes ~  hour(interval60) + dotw,  data=ride.Train)

reg2 <- 
  lm(delay_minutes ~  from_id + to_id + dotw,  data=ride.Train) # + line_length,

# reg3 <- 
#   lm(Trip_Count ~  from_id + hour(interval60) + dotw + Temperature + Precipitation, 
#      data=ride.Train)

reg4 <- 
  lm(delay_minutes ~  hour(interval60) + dotw + 
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(delay_minutes ~  from_id + to_id + hour(interval60) + dotw + status + line_length +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day, 
     data=ride.Train)

Examining Accuracy and Generalizability

Linear Model Results

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

# predict function
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

Our models varied in their absolute errors between one another but stayed consistent throughout each week. Model B, having an emphasis on location (specifically Origin Destination) had the lowest error among all weeks, as our most accurate model. However, Model A, emphasizing lengths in previous delays, had the lowest standard deviation in its errors, being our most precise model. Both Models A (ATime_FE) and B (B_OriginDest_FE) had the most ideal results between standard deviation and mean absolute error, respectively.

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BOrigin_Dest_FE = map(.x = data, fit = reg2, .f = model_pred),
           # CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           CTime_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           DTime_Origin_Dest_FE_timeLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions[,c(1,3,7:8)] %>% filter(!is.na(week)) %>% kable() %>% kable_styling() %>%
  row_spec(c(1:3), color = "antiquewhite", background = GRAY2) %>%
    row_spec(c(4:6), color = "antiquewhite", background = GRAY3) %>%
    row_spec(c(7:9), color = "antiquewhite", background = GRAY4) %>%
    row_spec(c(10:12), color = "antiquewhite", background = GRAY5)
week Regression MAE sd_AE
18 ATime_FE 5.909408 0.0769389
19 ATime_FE 5.909866 0.0739356
20 ATime_FE 5.909554 0.0759943
18 BOrigin_Dest_FE 1.018740 1.0526795
19 BOrigin_Dest_FE 1.444092 1.1056186
20 BOrigin_Dest_FE 1.658286 1.0007127
18 CTime_FE_timeLags 3.805140 12.4506028
19 CTime_FE_timeLags 3.724359 12.2171291
20 CTime_FE_timeLags 3.723075 12.2170906
18 DTime_Origin_Dest_FE_timeLags 1.699053 5.5757934
19 DTime_Origin_Dest_FE_timeLags 2.114667 9.9848388
20 DTime_Origin_Dest_FE_timeLags 2.663422 11.2051183

Throughout the weeks, each model was consistent in their errors and reporting. Model A has the highest absolute error but was consistent throughout each week. Model B was the most accurate, and Models C and D were a middle-ground comparatively to A and B, but fluctuated slightly each week. For the purposes of a train timing, Model B appears to be the best model on the specific May 2020 testing set, the simple model (B) performs the best with consistently low error of 1 minute across each week and across the overall month. The complex model (D) performs second-best in the test data ranging from 1-2 minute errors in predictions in each week or about 2 minutes and 10 seconds for the month of May 2020. The bar charts below denote the MAE or average error of the predicted delay time in minutes by each regression model tested for the individual weeks and average over the month.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE,fill = Regression)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity",color=BLUE4) +
  geom_text(aes(label=round(MAE,2),group = Regression),color="white",vjust=1.5, position = position_dodge(width = .9),size=3)+
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme() + theme(axis.text.x = element_blank(),axis.text.y = element_blank())

week_predictions %>%
  group_by(Regression)%>%
  summarise(MAE = mean(MAE,na.rm=T))%>%
  dplyr::select(Regression, MAE) %>%
  gather(Variable, MAE, -Regression) %>%
  ggplot(aes(x=Regression,y=MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity",color=BLUE4) +
  geom_text(aes(label=round(MAE,2)),color="white",vjust=1.5)+
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification for May 2020") +
  plotTheme() + theme(axis.text.x = element_blank(),axis.text.y = element_blank())

Generalizability Measured through Temporal K-Fold

To ensure our model’s integrity we conducted 100 k-fold cross-validations among Model B and Model D. We used these two models because Model B was the most accurate while Model D was most generalized, incorporating all other models into its regression. For the cross validation, Model D has a consistently low absolute error with no error exceeding a very small fraction of a second. Model B’s cross validations has more outliers across the samples which was which is at odds with its performance shown in previous graphs. Overall this can indicate that Model B is more generalizable and can be scaled to larger regions, having larger error but still simple enough to cater to a scaled dataset. Model D, while precise for our use-case, can potentially be overfitted for New Jersey and the mid-COVID conditions surrounding the dataset. These cross validation results encourage tests with train data from other months to truly determine the best model.

# reg.vars <- c("checkout_kiosk_id", "hour(interval60)", "dotw", "Temperature", "Precipitation",
#                    "lagHour", "lag2Hours","lag3Hours","lag12Hours", "lag1day", "holidayLag", "holiday")
# 
# reg.CV <- crossValidate(
#   dataset = ride.Train %>% st_as_sf(sf_column_name = "Origin.Tract"),
#   id = "cvID",                           
#   dependentVariable = "Trip_Count",
#   indVariables = reg.vars) %>%
#     dplyr::select(cvID = checkout_kiosk_id, Trip_Count)

reg.cv <-
  train(delay_minutes ~  from_id + to_id + hour(interval60) + dotw, # + line_length +
                   # lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day, 
     data=ride.Train,
        method = "lm",
        trControl = trainControl(method="cv",number=100),
        na.action = na.omit)

reg.cvBest <-
  train(delay_minutes ~  from_id + to_id + hour(interval60) + dotw + status + line_length +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day,
     data=ride.Train,
        method = "lm",
        trControl = trainControl(method="cv",number=100),
        na.action = na.omit)

# ggplot(data = reg.cv$resample[1:nrow(reg.cv$resample),])+
#   geom_histogram(aes(x=MAE),fill="orange",color="pink",bins=35)+
#      labs(title = "Distribution of MAE",
#           subtitle = "k-fold cross-validation; k = 100",
#           xlab("Mean Absolute Error"))+
#   coord_cartesian(xlim = c(round(min(reg.cv$resample[3]),-3), round(max(reg.cv$resample[3]),-3)))
#      plotTheme()
     
options(scipen=8,digits=2)

  ggplot(reg.cv$resample,mapping=aes(x=MAE)) + 
    geom_histogram(fill = "lightgreen",color=BLUE4) +
    scale_fill_manual(values = palette5) +
    labs(title = "100-fold Cross Validation Sample MAEs - Model B") +
  plotTheme()+theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())+xlab(label = "Prediction Mean Absolute Error (min)")

   ggplot(reg.cvBest$resample,mapping=aes(x=(MAE %% 1 *60))) + 
    geom_histogram(fill = "lightgreen",color=BLUE4,bins=5) +
    scale_fill_manual(values = palette5) +
    labs(title = "100-fold Cross Validation Sample MAEs - Model D") +
  plotTheme()+theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())+xlab(label = "Prediction Mean Absolute Error (s)")

options(scipent=999)

Logistic Model Analysis

The purpose of the Logistical Model is to predict the probability of each delay occurring. Our dependent variable (delay of 10 minutes), has been compared to by all potential independent variables and features where we find which are most significant in determining the likelihood of the delay occurring. The logistical model is meant to compliment the linear regression model in that while the consequences of delays at specific stations vary, there is different likelihoods of each delay occurring because of the circumstances at each stations. An example can include stations further from major cities may have larger delays but a lower probability of them occurring.

Similarly to our Regression Model approach, we created two logistical models we classified as Simple and Complex.

The simple model focuses only on location and if the delays was 10 minutes or not. The complex model factors in everything given within the dataset including the delay length, location, distance between stations, etc.

ride.TrainLog <- ride.Train %>% 
  mutate(delay10 = ifelse(delay_minutes >=10,1,ifelse(!is.na(delay_minutes),0,NA)))

ride.TestLog <- ride.Test %>% 
  mutate(delay10 = ifelse(delay_minutes >=10,1,ifelse(!is.na(delay_minutes),0,NA)))

ride.AllLog <- rbind(ride.TrainLog,ride.TestLog)

reg2Model <- glm(delay10 ~ .,
                  data=as.data.frame(ride.TrainLog %>% 
                    dplyr::select(from_id,to_id,dotw,delay10) %>%
                      st_drop_geometry()),
                  family="binomial" (link="logit"))

reg5Model <- glm(delay10 ~ .,
                  data=as.data.frame(ride.TrainLog %>% 
                    dplyr::select(-from_station,-to_station,-delay_minutes,-line_length,
                                  -Trip_Count,-geometry.x,-week,-day,-from_rail_service)%>%
                      st_drop_geometry()),
                  family="binomial" (link="logit"))


stargazer(reg2Model,reg5Model,type="text")
## 
## ==============================================
##                       Dependent variable:     
##                   ----------------------------
##                             delay10           
##                       (1)           (2)       
## ----------------------------------------------
## interval60                        -0.00001    
##                                   (0.024)     
##                                               
## from_id             -0.002         0.0003     
##                     (0.008)       (4.900)     
##                                               
## to_id               -0.0001       0.00004     
##                     (0.001)       (4.100)     
##                                               
## statusdeparted                    -20.000     
##                                 (91,194.000)  
##                                               
## statusestimated                   -27.000     
##                                 (87,507.000)  
##                                               
## dotw.L               3.400         -0.600     
##                   (2,088.000)   (37,740.000)  
##                                               
## dotw.Q              -13.000        5.600      
##                   (2,113.000)   (48,436.000)  
##                                               
## dotw.C              -7.100         -6.600     
##                   (1,767.000)   (35,413.000)  
##                                               
## dotw4               -9.000         6.400      
##                    (890.000)    (38,943.000)  
##                                               
## dotw5                8.800        -14.000     
##                   (1,373.000)   (33,413.000)  
##                                               
## dotw6               -10.000        -6.600     
##                   (1,196.000)   (36,309.000)  
##                                               
## lagHour                           -112.000    
##                               (4,319,470.000) 
##                                               
## lag2Hours                         114.000     
##                               (4,318,285.000) 
##                                               
## lag3Hours                                     
##                                               
##                                               
## lag4Hours                          2.900      
##                                 (75,195.000)  
##                                               
## lag12Hours                         -0.420     
##                                 (3,273.000)   
##                                               
## lag1day                            0.290      
##                                 (1,211.000)   
##                                               
## Constant            -11.000      21,840.000   
##                    (618.000)  (38,218,827.000)
##                                               
## ----------------------------------------------
## Observations          457           454       
## Log Likelihood      -40.000        -0.000     
## Akaike Inf. Crit.   97.000         34.000     
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01

Outcomes

The two tables show the difference between what the simple and complex model predicted. A 0 represents no delay while a 1 represents a delay occurring. Both models have a 50% threshold where if the chance of a delay is 50% or greater, the model assumes the delay is occurring. The Simple Model (first results), has a high specificity (true negative) rate but failed to predict the 5 delays that occurred. The Complex Model performs much better, where it was capable of predicting 4 out of the 5 delays, meaning an 80% sensitivity (true positive) rate. Thus, we will use the complex model for any train delay prediction applications.

testProbs2 <- data.frame(Outcome = as.factor(ride.TestLog$delay10),
                        Probs = predict(reg2Model, ride.TestLog, type= "response"))

testProbs2_full <- data.frame(Outcome = as.factor(ride.AllLog$delay10),
                        Probs = predict(reg2Model, ride.AllLog, type= "response"))

testProbs5 <- data.frame(Outcome = as.factor(ride.TestLog$delay10),
                        Probs = predict(reg5Model, ride.TestLog, type= "response"))


testProbs5_full <- data.frame(Outcome = as.factor(ride.AllLog$delay10),
                        Probs = predict(reg5Model, ride.AllLog, type= "response"))

testProbs2 <- 
  testProbs2 %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0))
    )

testProbs2_full <- 
  testProbs2_full %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs2_full$Probs > 0.5 , 1, 0))
    )

testProbs5 <- 
  testProbs5 %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs5$Probs > 0.5 , 1, 0))
    )

testProbs5_full <- 
  testProbs5_full %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs5_full$Probs > 0.5 , 1, 0))
    )

simple <- caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome,
                       positive = "1")

complex <- caret::confusionMatrix(c(testProbs5$predOutcome), c(testProbs5$Outcome),
                       positive = "1")

simple$table
##           Reference
## Prediction   0   1
##          0 326   5
##          1   0   0
complex$table
##           Reference
## Prediction   0   1
##          0 321   1
##          1   0   4
AUC2 <- auc(testProbs2$Outcome, testProbs2$Probs)

AUC5 <- auc(testProbs5$Outcome, testProbs5$Probs)

Comparing the Model’s Goodness of Fit

ggarrange(nrow=2,
# kitchen sink densities
ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Program Entry", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
       subtitle = "Model B") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none"))

ggarrange(nrow=2,
# roc_curve_reg2
ggplot(testProbs2, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model B"),

# roc_curve_reg5
ggplot(testProbs5, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model D")
)

The distribution of prediction probabilities is very difficult to distinguish in part due to the small number of truly delayed samples as shown with the distribution plots being similar for model B and concentrated at zero for model D. The ROC curves give slightly more insight on their performance with the simple regression’s (model B’s) number of false negatives, where a train delayed by 10 minutes or more is not predicted as such, contributing to a smaller metric of 62.09% as opposed to model D’s 100%. The latter’s high value would be concern for overfitting despite the respective imperfect sensitivity and specificity of 0.8 and 1. Oversampling was also conducted to observe if the model’s probability densities were limited by having only 5 trips delayed beyond 10 minutes with tests on different months or timeframes of data being an alternative. With oversampling, model D performed the same and model B performed marginally better in performance given false positives only connote extra notifications to users when predicted early. Yet oversampling is overall not considered a necessary step for this logit model’s purpose with model D being considered the best.

ovunSample <- ovun.sample(delay10~.,data=ride.TrainLog %>% st_drop_geometry(),
                          N=nrow(ride.TrainLog)*1.5,method="both",p=0.5,seed=448)$data

reg2Model <- glm(delay10 ~ .,
                  data=as.data.frame(ovunSample %>% 
                    dplyr::select(from_id,to_id,dotw,delay10) %>%
                      st_drop_geometry()),
                  family="binomial" (link="logit"))

reg5Model <- glm(delay10 ~ .,
                  data=as.data.frame(ovunSample %>% 
                    dplyr::select(-from_station,-to_station,-delay_minutes,-line_length,
                                  -Trip_Count,-week,-day)%>%
                      st_drop_geometry()),
                  family="binomial" (link="logit"))

testProbs2ovun <- data.frame(Outcome = as.factor(ride.TestLog$delay10),
                        Probs = predict(reg2Model, ride.TestLog, type= "response"))

testProbs5ovun <- data.frame(Outcome = as.factor(ride.TestLog$delay10),
                        Probs = predict(reg5Model, ride.TestLog, type= "response"))

testProbs2ovun <- 
  testProbs2ovun %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs2ovun$Probs > 0.5 , 1, 0))
    )


testProbs5ovun <- 
  testProbs5ovun %>%
  mutate(
    # for 0.5 threshold
    predOutcome  = as.factor(ifelse(testProbs5ovun$Probs > 0.5 , 1, 0))
    )


simpleOvun <- caret::confusionMatrix(testProbs2ovun$predOutcome, testProbs2ovun$Outcome,
                       positive = "1")

complexOvun <- caret::confusionMatrix(c(testProbs5ovun$predOutcome), c(testProbs5ovun$Outcome),
                       positive = "1")

simpleOvun$table
complexOvun$table

auc(testProbs2ovun$Outcome, testProbs2ovun$Probs)

auc(testProbs5ovun$Outcome, testProbs5ovun$Probs)

ggarrange(nrow=2,
# kitchen sink densities
ggplot(testProbs2ovun, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Program Entry", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
       subtitle = "Model B, Oversampled") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none"),

# engineered densities
ggplot(testProbs5ovun, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Program Entry", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
       subtitle = "Model D, Oversampled") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none"))


ggarrange(nrow=2,
# roc_curve_reg2
ggplot(testProbs2ovun, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model B Oversampled"),

# roc_curve_reg5
ggplot(testProbs5ovun, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model D Oversampled")
)
r5p <- lapply(week_predictions[week_predictions$Regression=="DTime_Origin_Dest_FE_timeLags",]$Prediction,mean,na.rm=T)[c(1:3)]

r5o <- lapply(week_predictions[week_predictions$Regression=="DTime_Origin_Dest_FE_timeLags",]$Observed,mean,na.rm=T)[c(1:3)]

r5meanP <- mean(c(r5p[[1]],r5p[[2]],r5p[[3]]))
r5meanO <- mean(c(r5o[[1]],r5o[[2]],r5o[[3]]))

r2p <- lapply(week_predictions[week_predictions$Regression=="BOrigin_Dest_FE",]$Prediction,mean,na.rm=T)[c(1:3)]
r2o <- lapply(week_predictions[week_predictions$Regression=="BOrigin_Dest_FE",]$Observed,mean,na.rm=T)[c(1:3)]

r2meanP <- mean(c(r2p[[1]],r2p[[2]],r2p[[3]]))
r2meanO <- mean(c(r2o[[1]],r2o[[2]],r2o[[3]]))


## numbers for below threshold (10 minutes) is larger than valid row count, will stop working for now

# past_thresh <- nrow(ride.TestLog[ride.TestLog$delay10==1,])
# below_thresh <- nrow(ride.TestLog[ride.TestLog$delay10==0,])
# # print(past_thresh+below_thresh==nrow(ride.TestLog[is.na(ride.TestLog$delay10)==F,]))
# below_thresh/nrow(ride.TestLog[is.na(ride.TestLog$delay10)==F,])

Results

The simple model considering only the day of the week and the trip’s origin and destination has a mean predicted delay of 2 minutes and 19 seconds, and the complex model considering previous factors and time lags has a mean predicted delay of 2 minutes and 42 seconds while the true average observed delay is 0 minutes and 0.35 seconds for May.

This model and its results serves as an experiment for regional train delay predictions, and improved analyses of this model would require scaling data beyond the New Jersey region and potentially factoring in additional features that correlate with delays. This analysis also could have deeper evaluations regarding weather conditions altering delay patterns. Additionally, creating new models that blend the benefits and characteristics of each regression and logit can develop an ideal prediction for probability of delays and length of delays for New Jersey or the northeast.