Motivation of Analysis

Emil City’s Home Repair Tax Credit Program has consistently underperformed relative to its goals of reaching eligible homeowners for housing repairs and delivering direct community benefit. The current strategy of contacting residents by random sampling leaves the program’s success rate to chance.

The purpose of the tax credit as a whole is to redevelop neighborhoods without displacing people or changing their character, while providing direct community support. Our goal is to develop a model that increases the chances of identifying the individuals most likely to take the credit.

Recording Attributes of Campaigns, Residents, and External Metrics

Every campaign provides additional data that can be used to identify potential beneficiaries. Analyzing the instances where residents say yes, however few, enhances our model’s ability to predict future success.

Distribution by Percentage and Response

The density plots below represent the frequency of respondents accepting or declining the tax credit. The majority of these features provide little to no insight into when a respondent will accept the credit. Notably, for the unemployment rate, consumer confidence index, and inflation rate, there are ranges of values where “yes” responses outnumber “no” responses. Factors like these become indicators for our model later on: when a variable falls within a specific range, the respondent is more likely to take the credit.

It must be noted that these graphs can only display numeric data; nominal data is addressed using an alternative method.

ggplot()+
  geom_density(data=housing_data_numeric,aes(x=value,color=y))+
  facet_wrap(~variable,scales = "free")+
  scale_color_manual(values=c('orange','lightgreen'),name='Response')+
    labs(x="Recorded Factors of Contacted Resident", y="Density Distribution", 
      title = "Distribution Success by Feature")+
  theme_bw()

Distribution by Count

The histograms below display the number of people contacted across each measured feature. Each graph establishes the sample size behind each attribute and verifies that, when we examine a relative difference between response counts, the difference is less susceptible to random chance and more likely reflects a genuine correlation.

ggplot()+
  geom_histogram(data=housing_data_numeric,aes(x=value),bins=30)+
  facet_wrap(~variable,scales = "free")+
  labs(x="Recorded Factors of Contacted Resident", y="Count")+
  theme_light()

Correlations and Importance

Similar to the graphs above, understanding the mean value of each attribute between “yes” and “no” responses points us toward a “sweet spot” that can further indicate when a respondent will become a beneficiary. Graphs that show two similarly sized bars indicate that the attribute does not differentiate the resident’s response. We notice that the inflation rate, unemployment rate, and pdays (number of days since last contact) are factors the city can use to anticipate when people will apply to the program.

ggplot(data=housing_data_numeric_summary,aes(x=y,y=mean,fill=y))+
  geom_bar(stat='identity')+
  facet_wrap(~variable,scales = "free")+
  scale_fill_manual(values=c('orange','lightgreen'),name='Response')+
  labs(x="Recorded Factors of Contacted Resident", y="Mean Value", 
      title = "Measuring Likelihood to Enter the Housing Subsidy Program")+
  theme_bw()

Feature Engineering

For this nominal data, we first examined each category and compared the responses within it. We then feature engineer the categories by “binning” them into grouped features, some intuitively and others by their “yes” rate. Examples include binning jobs into “employed or unemployed” and grouping the most successful months as “high-rate months”. The purpose of the binning is to home in on where the success rate is largest, regardless of sample size.

Aggregating Correlated Features

housing_data <- housing_data %>% mutate(
    # group education and job into binary categories
    education_group = ifelse(education %in% c("university.degree", "professional.course", "unknown"), "degree", "no_degree"),
    job_group = ifelse(job %in% c("retired", "student", "unemployed", "unknown"), "un-employed", "employed"),
    # group months into seasons
    season = case_when(month %in% c("mar", "apr", "may") ~ "spring",
                       month %in% c("jun", "jul", "aug") ~ "summer",
                       month %in% c("sep", "oct", "nov", "dec") ~ "autumn"),
    # split continuous economic indicators at the cut points observed above
    inflation_bucket = case_when(inflation_rate > 2.5 ~ "More than 2.5",
                                 inflation_rate <= 2.5 ~ "Less than 2.5"),
    spent_on_repairs_bucket = case_when(spent_on_repairs >= 5100 ~ "More than 5100",
                                        spent_on_repairs < 5100 ~ "Less than 5100"),
    pdays_0 = ifelse(pdays == 0, "New Contact", "Not New Target"),
    campaign_log = log(campaign),   # log-transform the number of contacts
    unemploy_rate_bucket = ifelse(unemploy_rate < -0.5, "Less than -0.5", "More than -0.5"),
    # flag the months with the highest "yes" rates
    month_high = ifelse(month %in% c("mar", "dec", "oct", "sep"), "high_rate_month", "low_rate_month"),
    cons.conf.idx_bucket = ifelse(cons.conf.idx > -42, "More than -42", "Less than -42")
    )

Recorded Attributes by Percentage

After binning the categorical variables to highlight what we believe are the greatest discrepancies in responses, we visualize those discrepancies as percentages in the graph below. We observe noticeable differences when isolating the most successful months, grouping the unemployed together, and when inflation is less than 2.5.

We note that previously contacted people who said yes also have a higher success rate; while this factor is included in the model, relying on it may limit the program’s scalability and defeat the objective of reaching out to new residents.

ggplot(data=housing_data_cat_summ, aes(x = value, y = percent, fill = y)) + 
  geom_bar(stat = "identity")+
  facet_wrap(~variable,scales = "free")+
  labs(y = "Percentage", fill = "Response", title="Recorded Attributes for Contacted Homeowners")+
  scale_fill_manual(values=c('orange','lightgreen'))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Recorded Attributes by Count

Regardless of success rates by feature, arbitrary or calculated, sample size is important to factor in: a high success rate over a small sample can be a sampling artifact and provide no insight when scaled. Despite this consideration, we are working with few “yes” responses to begin with, so any assumptions, feature engineering, or experimental speculation may be useful.

We see that some samples are so small they cannot be displayed in the graphs, and the yes count on any bar is far overshadowed by the no count and the total. This lack of “successful” data points emphasizes that, in order to improve the outreach strategy at all, we need to understand the circumstances of the few times a respondent takes the tax credit, not to establish causation but correlation.

ggplot(data=housing_data_cat_summ, aes(x = value, y = nvalue_count, fill = y)) + 
  geom_bar(stat = "identity")+
  facet_wrap(~variable,scales = "free")+
  labs(y = "Total Number", fill = "Response", title="Recorded Attributes for Contacted Homeowners")+
  scale_fill_manual(values=c('orange','lightgreen'))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Kitchen Sink Logistic Regression Model

The purpose of the logistic regression model is to predict the probability of a binary variable being 1 or 0. In this case, it does not simply predict whether a respondent says yes or no, but the probability they will say yes (or no), expressed as a percentage. A kitchen sink model takes every available variable and feeds it into the regression. The purpose of including everything is experimental insight, regardless of statistical assumptions or cautions: we see which variables improve or degrade the model, and then build a new model from the filtered variables accordingly.
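
As a minimal sketch of the mechanics (the intercept, coefficient, and feature value here are made-up numbers, not estimates from our model), a logistic regression converts a linear score into a probability through the inverse-logit function:

# hypothetical linear predictor: intercept + coefficient * feature value
log_odds <- -1.5 + 2.0 * 0.7
plogis(log_odds)   # inverse logit, 1 / (1 + exp(-log_odds)), ~0.475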

The results below summarize the regression, primarily showing the relationship between the probability of a “yes” and each measured variable. The statistic we analyze most is the z value, the ratio of each coefficient to its standard error, which indicates how far the estimate is from zero relative to its uncertainty.
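
As a quick check of that definition, dividing the age coefficient by its standard error from the table below reproduces its reported z value:

# z value = beta coefficient / standard error, using the age row below
0.0186645 / 0.0081698   # ~2.2846, matching the reported Z value for age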

housing_model <- glm(y_numeric ~ .,
                     data = housing_train %>% 
                       select(-y, -...1, -job_group, -education_group, -season, -cons.conf.idx_bucket, -unemploy_rate_bucket, -month_high, -campaign_log, -pdays_0, -spent_on_repairs_bucket, -inflation_bucket),
                     family = binomial(link = "logit"))

 
summary(housing_model)$coefficients %>%
  kbl(col.names = c('Beta Coefficient','Standard Error','Z value','p-value')) %>%
  kable_classic()
Beta Coefficient Standard Error Z value p-value
(Intercept) -139.3133554 124.5460377 -1.1185691 0.2633240
age 0.0186645 0.0081698 2.2845739 0.0223378
jobblue-collar -0.3480401 0.2609945 -1.3335151 0.1823628
jobentrepreneur -0.4527138 0.4405705 -1.0275626 0.3041556
jobhousemaid -0.1505919 0.4507725 -0.3340751 0.7383229
jobmanagement -0.5743124 0.2962660 -1.9385026 0.0525619
jobretired -0.3899474 0.3576602 -1.0902733 0.2755928
jobself-employed -0.4012639 0.3956752 -1.0141245 0.3105234
jobservices -0.1539659 0.2766189 -0.5565993 0.5778012
jobstudent 0.0847638 0.3886936 0.2180736 0.8273718
jobtechnician -0.0363663 0.2302056 -0.1579730 0.8744781
jobunemployed 0.1515835 0.3745516 0.4047067 0.6856931
jobunknown -0.5037387 0.6657558 -0.7566419 0.4492645
maritalmarried 0.2671366 0.2417254 1.1051239 0.2691059
maritalsingle 0.3319164 0.2775839 1.1957339 0.2318004
maritalunknown -12.6458344 496.6473524 -0.0254624 0.9796861
educationbasic.6y 0.1651788 0.3852280 0.4287820 0.6680818
educationbasic.9y 0.0887448 0.3116651 0.2847442 0.7758401
educationhigh.school 0.0117097 0.3003195 0.0389907 0.9688978
educationilliterate -13.9458923 1455.3976333 -0.0095822 0.9923546
educationprofessional.course 0.1706240 0.3265777 0.5224607 0.6013496
educationuniversity.degree 0.0661469 0.3011001 0.2196842 0.8261171
educationunknown 0.1729066 0.3818325 0.4528335 0.6506686
taxLienunknown 0.1083271 0.2035672 0.5321441 0.5946262
taxLienyes -12.4143685 1455.3976227 -0.0085299 0.9931942
mortgageunknown -0.6967393 0.6041322 -1.1532894 0.2487916
mortgageyes -0.0428265 0.1363086 -0.3141876 0.7533786
taxbill_in_phlyes 0.0625151 0.1860637 0.3359876 0.7368802
contacttelephone -0.8823059 0.2774425 -3.1801397 0.0014720
monthaug 0.0632912 0.4147843 0.1525882 0.8787230
monthdec 0.3271233 0.6704837 0.4878915 0.6256267
monthjul 0.1687479 0.3505082 0.4814379 0.6302053
monthjun 0.2398657 0.4354699 0.5508203 0.5817569
monthmar 2.0808148 0.5314372 3.9154482 0.0000902
monthmay -0.2045716 0.2956646 -0.6919042 0.4889975
monthnov -0.3645804 0.4158487 -0.8767141 0.3806420
monthoct -0.1274885 0.5313998 -0.2399106 0.8103995
monthsep 0.2083822 0.6038580 0.3450848 0.7300306
day_of_weekmon -0.0337928 0.2155025 -0.1568091 0.8753953
day_of_weekthu 0.0235827 0.2175979 0.1083775 0.9136963
day_of_weektue 0.0722626 0.2167325 0.3334183 0.7388186
day_of_weekwed 0.2365578 0.2227694 1.0618952 0.2882833
campaign -0.0923050 0.0427933 -2.1569967 0.0310059
pdays -0.0283463 0.0395935 -0.7159338 0.4740322
previous 0.0782615 0.1683275 0.4649358 0.6419775
poutcomenonexistent 0.5456035 0.2966325 1.8393247 0.0658674
poutcomesuccess 1.8989076 0.3195301 5.9428131 0.0000000
unemploy_rate -0.8864430 0.4702282 -1.8851337 0.0594118
cons.price.idx 1.3932641 0.8218601 1.6952571 0.0900267
cons.conf.idx 0.0437257 0.0268727 1.6271420 0.1037069
inflation_rate 0.0347431 0.4240013 0.0819411 0.9346935
spent_on_repairs 0.0013935 0.0100794 0.1382484 0.8900441

Results

Here we look specifically at the McFadden score. This pseudo-R² compares the log-likelihood of our model against that of a null (intercept-only) model; a higher score is better, and values in the 0.2–0.4 range are commonly considered a good fit. Because of the data we are working with, however, we can prioritize other readings such as sensitivity and specificity.
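
The score can also be recomputed by hand from the log-likelihoods that pR2() reports below, a useful sanity check:

# McFadden pseudo-R^2 = 1 - llh(model) / llh(null model)
1 - (-792.4724 / -1047.0219)   # ~0.243, matching the McFadden row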

pR2(housing_model) %>%
  kbl(col.name=c("Value")) %>%
  kable_classic()
## fitting null model for pseudo-r2
Value
llh -792.4723584
llhNull -1047.0219127
G2 509.0991087
McFadden 0.2431177
r2ML 0.1654950
r2CU 0.3153108
testProbs <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                        Probs = predict(housing_model, housing_test, type= "response"))

Confusion Matrix

A confusion matrix tabulates, for each respondent in the test set, the model’s decision on whether to reach out against whether the respondent actually took (or would have taken) the tax credit.

Reference is the real outcome while Prediction is what the model predicts: the prediction represents whether we reach out, while the reference is whether the respondent takes the credit, all factors considered.

We see there are plenty of people we do not reach out to who do not take the credit (0,0). There are 19 people we reach out to who accept the tax credit (1,1); then we have those we do not reach out to but who need the tax credit (0,1), and those we reach out to but who say no (1,0).

We define (1,0) as a false positive and (0,1) as a false negative. Because of the overwhelming number of negatives, the model has a high specificity rate. But what matters for the program is minimizing false negatives and maximizing the sensitivity rate (true positives).

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

kitchensink_confuse <- caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

kitchensink_confuse$table %>%
  data.frame() %>%
  spread(key=Reference,value=Freq) %>%
  rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
  mutate(Total = Not_Take_Subsidy + Take_Subsidy,
         Prediction = c('Not Take Subsidy','Take Subsidy')) %>%
  kbl() %>%
  add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
  kable_minimal()
                     Actual
Prediction           Not_Take_Subsidy   Take_Subsidy   Total
Not Take Subsidy     1175               87             1262
Take Subsidy         24                 19             43
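
From these counts we can recompute the headline rates by hand; this is plain arithmetic on the table above, not part of the modeling pipeline:

# rates recomputed from the confusion matrix above
19 / (19 + 87)       # sensitivity: TP / (TP + FN), ~0.179
1175 / (1175 + 24)   # specificity: TN / (TN + FP), ~0.980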

Personalized Logistic Regression Model

We create our own logistic regression model that uses the grouped variables and removes the individual variables they replace. For example, instead of each month entering the model on its own, we use “high month vs. low month” as a single indicator, simplifying the categories. Taking each binned or grouped variable with the highest discrepancy in success rates helps the model predict the likelihood of a yes, even when the sample is small. The ultimate goal is to minimize the rate of false negatives, because those represent people who want the tax credit but are never contacted.

housing_model2 <- glm(y_numeric ~ .,
                     data = housing_data %>% select(-y, -...1),
                     family = binomial(link = "logit"))

summary(housing_model2)$coefficients %>%
  kbl(col.names = c('Beta Coefficient','Standard Error','Z value','p-value')) %>%
  kable_classic()
Beta Coefficient Standard Error Z value p-value
(Intercept) -185.0781298 109.8112351 -1.6854207 0.0919074
age 0.0159925 0.0068844 2.3230068 0.0201788
jobblue-collar -0.3280994 0.2258342 -1.4528329 0.1462701
jobentrepreneur -0.5664931 0.3995950 -1.4176680 0.1562877
jobhousemaid -0.2185839 0.4010634 -0.5450107 0.5857462
jobmanagement -0.4879077 0.2503855 -1.9486264 0.0513401
jobretired -0.3079498 0.3047764 -1.0104123 0.3122978
jobself-employed -0.6201963 0.3488899 -1.7776273 0.0754651
jobservices -0.2052016 0.2391887 -0.8579068 0.3909439
jobstudent -0.0203117 0.3504041 -0.0579665 0.9537753
jobtechnician -0.0462570 0.1921863 -0.2406882 0.8097968
jobunemployed 0.0921775 0.3350057 0.2751519 0.7831995
jobunknown -0.5551708 0.6448636 -0.8609120 0.3892865
maritalmarried 0.1716322 0.2022282 0.8487054 0.3960452
maritalsingle 0.2716739 0.2315568 1.1732491 0.2406959
maritalunknown 0.1633412 1.1826888 0.1381101 0.8901534
educationbasic.6y 0.2492921 0.3403870 0.7323785 0.4639376
educationbasic.9y 0.1370122 0.2730245 0.5018312 0.6157863
educationhigh.school 0.1137234 0.2628612 0.4326365 0.6652788
educationilliterate -12.0675015 535.4113730 -0.0225387 0.9820182
educationprofessional.course 0.2180084 0.2850961 0.7646842 0.4444596
educationuniversity.degree 0.1965484 0.2639039 0.7447725 0.4564093
educationunknown 0.2197850 0.3444588 0.6380589 0.5234354
taxLienunknown -0.0227896 0.1781410 -0.1279302 0.8982042
taxLienyes -10.3847652 535.4113674 -0.0193959 0.9845253
mortgageunknown -0.2814371 0.4445298 -0.6331119 0.5266606
mortgageyes -0.0947365 0.1175087 -0.8062081 0.4201229
taxbill_in_phlyes 0.0747261 0.1592315 0.4692924 0.6388607
contacttelephone -1.0171726 0.2375370 -4.2821657 0.0000185
monthaug 0.0253213 0.3851079 0.0657511 0.9475760
monthdec 0.8880403 0.6081733 1.4601765 0.1442416
monthjul -0.1075706 0.3452723 -0.3115528 0.7553804
monthjun 0.0243213 0.4111909 0.0591484 0.9528339
monthmar 1.9102596 0.4683499 4.0787019 0.0000453
monthmay -0.2957782 0.2604128 -1.1358051 0.2560381
monthnov -0.5343667 0.5092118 -1.0493998 0.2939941
monthoct -0.1895110 0.5125614 -0.3697333 0.7115812
monthsep -0.0932119 0.5254761 -0.1773856 0.8592055
day_of_weekmon -0.0581173 0.1828745 -0.3177990 0.7506374
day_of_weekthu 0.0122096 0.1838679 0.0664040 0.9470562
day_of_weektue -0.0196597 0.1876518 -0.1047667 0.9165610
day_of_weekwed 0.1715558 0.1877637 0.9136794 0.3608853
campaign -0.2183238 0.1017009 -2.1467237 0.0318153
pdays -0.0811551 0.0521652 -1.5557348 0.1197712
previous 0.1535239 0.1634915 0.9390325 0.3477140
poutcomenonexistent 0.6365348 0.2720416 2.3398435 0.0192918
poutcomesuccess 1.0152470 0.6056019 1.6764263 0.0936547
unemploy_rate -1.0373266 0.4302672 -2.4108891 0.0159137
cons.price.idx 1.7492001 0.7631590 2.2920520 0.0219026
cons.conf.idx 0.0633690 0.0511650 1.2385217 0.2155227
inflation_rate 0.0638506 0.4905384 0.1301643 0.8964365
spent_on_repairs 0.0040149 0.0087034 0.4612961 0.6445862
inflation_bucketMore than 2.5 -0.1044430 1.6115511 -0.0648090 0.9483261
pdays_0Not New Target 1.1529975 0.7721658 1.4931994 0.1353850
campaign_log 0.4286657 0.2666864 1.6073776 0.1079716
cons.conf.idx_bucketMore than -42 -0.1432138 0.5721987 -0.2502869 0.8023655

Results

Even though the McFadden score is lower, suggesting slightly less explanatory power, the difference of roughly 0.02 is negligible.
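
The gap is simply the difference between the two reported McFadden values:

# kitchen sink model vs. our model
0.2431177 - 0.2267749   # ~0.016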

pR2(housing_model2) %>%
  kbl(col.name=c("Value")) %>%
  kable_classic()
## fitting null model for pseudo-r2
Value
llh -1100.2386754
llhNull -1422.9216004
G2 645.3658502
McFadden 0.2267749
r2ML 0.1450226
r2CU 0.2906973
testProbs2 <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                        Probs = predict(housing_model2, housing_test, type= "response"))

Comparing Sensitivity and Specificity

Comparing the false positives (1,0), our model produces 11 fewer than the kitchen sink model (13 versus 24, a reduction of roughly 46%): residents who would have been contacted but declined the credit. The false negative count (0,1) is unchanged at 87 in both models. In other words, specificity improves while sensitivity holds steady.
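
Both figures come straight from the two confusion matrices:

# false positives: 24 (kitchen sink) vs. 13 (our model)
24 - 13       # 11 fewer unnecessary contacts
1 - 13 / 24   # ~0.46, a 46% reduction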

testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0)))

ownmodel_confuse <- caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")

ownmodel_confuse$table %>%
  data.frame() %>%
  spread(key=Reference,value=Freq) %>%
  rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
  mutate(Total = Not_Take_Subsidy + Take_Subsidy,
         Prediction = c('Not_Take_Subsidy','Take_Subsidy')) %>%
  kbl() %>%
  add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
  kable_minimal()
                    Actual
Prediction          Not_Take_Subsidy   Take_Subsidy   Total
Not_Take_Subsidy    1186               87             1273
Take_Subsidy        13                 19             32

Distribution of Predicted Observed Outcomes

The density distribution of predicted probabilities by observed outcome displays a large right skew for the “0” outcome, meaning those who did not accept the tax credit. Our model is therefore excellent at predicting when someone will say no. But this is only half of the ideal picture: we would also want a left skew for the “1” outcome distribution, meaning those who took the credit. That way we could find the optimal threshold to use as a cut-off for deciding whether to call, further increasing our odds.

Because this distribution does not hand us the best estimate on its own, we turn to alternative ways of finding the optimal threshold.

ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probability", y = "Density of Outcome",
       title = "Distribution of Predicted Probabilities by Observed Outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

Optimizing the Threshold: Cost-Benefit Analysis

It is important to understand the purpose of the housing subsidy program beyond people simply taking the tax credit to repair their home. The purpose of the subsidy is to inject funding back into communities, raising property values without gentrifying the area, supporting homeowners, and uplifting the quality of life in the city.

The purpose of eliminating false negatives is to ensure that those who want the subsidy are contacted. The effort it takes the city to reach those people inherently costs money too, so the goal is to be as efficient as possible in sorting who will or will not take the subsidy.

The optimal threshold tells us when the allocated credits benefit the community the most and create the largest return on investment. Every person who wants the credit but isn’t contacted is foregone benefit to the city. The city’s purpose for the program is not tangible profit but rebuilding communities efficiently, minimizing the net loss incurred when spending its budget. Now we look at where that line is drawn.

For every $5,000 credit subsidized, we create an average of $10,000 in value for the home directly, and an accumulated aggregate average of $56,000 for surrounding homes. That is a 1,320% return on investment. While the city itself loses tangible budget on searching, calling, and allocating, the “profits” accrue at the homeowner level; hence the importance of providing these allocations.
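
The return figure is straightforward arithmetic on the averages above:

# value created per $5,000 credit, using the averages quoted above
direct_value    <- 10000   # appreciation of the repaired home
aggregate_value <- 56000   # accumulated appreciation of surrounding homes
(direct_value + aggregate_value) / 5000   # 13.2x, i.e. a 1,320% return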

results_sens_10 <- vector()

# false positives (prediction = 1, reference = 0) at probability cutoffs from 0.9 down to 0
for (i in seq(1, 10, 1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1 - (i/10), 1, 0)))
  results_sens_10[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[2]
}
results_sens_10
##  [1]    0    2    7    9   24   36   58   87  217 1199
results_sens_01 <- vector()

# false negatives (prediction = 0, reference = 1) at each cutoff
for (i in seq(1, 10, 1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1 - (i/10), 1, 0)))
  results_sens_01[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[3]
}
results_sens_01
##  [1] 106 104  96  93  87  78  68  60  45   0
results_sens_11 <- vector()

# true positives (prediction = 1, reference = 1) at each cutoff
for (i in seq(1, 10, 1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1 - (i/10), 1, 0)))
  results_sens_11[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[4]
}
results_sens_11
##  [1]   0   2  10  13  19  28  38  46  61 106
# revenue credited for each resident who takes the credit
total_rev_11 <- results_sens_11*250 + results_sens_11*10000 + results_sens_11*20000
total_rev_11
##  [1]       0   60500  302500  393250  574750  847000 1149500 1391500 1845250
## [10] 3206500
# $2,850 outreach cost for each contacted resident who takes the credit
total_cost_11 <- -(results_sens_11*2850)
total_cost_11
##  [1]       0   -5700  -28500  -37050  -54150  -79800 -108300 -131100 -173850
## [10] -302100
# 01 (false negatives) represents foregone revenue, which we treat as a cost
total_cost_01 <- -(results_sens_01*250 + results_sens_01*10000 + results_sens_01*20000)
total_cost_01
##  [1] -3206500 -3146000 -2904000 -2813250 -2631750 -2359500 -2057000 -1815000
##  [9] -1361250        0
# same $2,850 contact cost for residents who are contacted but decline
total_cost_10 <- -(results_sens_10*2850)
total_cost_10
##  [1]        0    -5700   -19950   -25650   -68400  -102600  -165300  -247950
##  [9]  -618450 -3417150
net <- total_rev_11 + total_cost_11 + total_cost_01 + total_cost_10


cost <- total_cost_11 + total_cost_01 + total_cost_10
revenue <- total_rev_11
net_cost <- revenue + cost
total_credits_given <- results_sens_11
total_credits_given
##  [1]   0   2  10  13  19  28  38  46  61 106
data.frame(threshold_in_pct = seq(10,100,10), revenue, cost, net_cost) %>%
  kbl(col.name = c("Threshold Percentage","Revenue (USD)","Cost (USD)","Net Cost (USD)")) %>%
  kable_classic()
Threshold Percentage   Revenue (USD)   Cost (USD)   Net Cost (USD)
10                     0               -3206500     -3206500
20                     60500           -3157400     -3096900
30                     302500          -2952450     -2649950
40                     393250          -2875950     -2482700
50                     574750          -2754300     -2179550
60                     847000          -2541900     -1694900
70                     1149500         -2330600     -1181100
80                     1391500         -2194050     -802550
90                     1845250         -2153550     -308300
100                    3206500         -3719250     -512750

As shown in the table above, the 90% mark has the smallest net loss, at only $308,300. While revenue gained from the program continues to increase even at 100%, costs rise faster because the excess credits given no longer add to the city’s value. We will use the 90% threshold.

Foregone Benefit and Credit Allocation

With the public interest in mind, we see that 90% proves itself as the “sweet spot” because of the dip visible in the foregone benefit chart. The 90% mark (second bar from the right) is smaller than the 100% mark, even though the trend across the rest of the graph would place it between the $750,000 and $500,000 marks (between 80% and 90%). The pattern in credit allocations, by contrast, is roughly linear and predictable, with 90% providing 61 credits, in between its neighboring values. The outlier in foregone benefit combined with the uniformity in credit allocation suggests this threshold creates the most value per credit.

We understand that intuition, without analysis, would assume 50% to be optimal. We compare the two in order to test that assumption.

results_final <- data.frame(threshold_in_pct = seq(10,100,10),
                            net_public_loss = -1*net ,
                            total_credits_given = total_credits_given)
results_final
##    threshold_in_pct net_public_loss total_credits_given
## 1                10         3206500                   0
## 2                20         3096900                   2
## 3                30         2649950                  10
## 4                40         2482700                  13
## 5                50         2179550                  19
## 6                60         1694900                  28
## 7                70         1181100                  38
## 8                80          802550                  46
## 9                90          308300                  61
## 10              100          512750                 106
grid.arrange(ncol = 2,
ggplot(data = results_final, aes(x = threshold_in_pct, y = net_public_loss)) +
  geom_bar(stat = 'identity', fill = 'orange4') +
  labs(title = "Foregone Benefit") +
  theme_bw() +
  labs(y = 'Opportunity Cost', x = 'Threshold'),
ggplot(data = results_final, aes(x = threshold_in_pct, y = total_credits_given)) +
  geom_bar(stat = 'identity', fill = 'lightblue') +
  labs(title = "Credit Allocation") +
  theme_bw() +
  labs(y = 'Total Credits Given', x = 'Threshold')
)

Comparing 50% to the Optimized Threshold

The table below puts 50% and 90% side by side. When using 50% as the threshold to contact a resident, the city loses over $2.1 million in total, making each credit an average loss of $114,713.

Now look at 90%: 61 credits given while losing $308,300 means a net loss of $5,054 per credit, a significant improvement that is far more affordable for a housing subsidy.

This means we decrease net public loss roughly sevenfold while tripling the credits given. Moreover, the appreciation of the repaired homes and their surrounding houses alone comes to approximately $1.8 million if 61 credits are given.
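
Both per-credit figures are simple division on the net losses and credit counts:

# per-credit net loss at the two thresholds
2179550 / 19   # 50% threshold: ~$114,713 per credit
308300 / 61    # 90% threshold: ~$5,054 per credit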

results_final %>%
  dplyr::filter(threshold_in_pct %in% c(90,50)) %>%
  kbl(col.names = c("Threshold Percentage","Net Public Loss","Credits Given")) %>%
  kable_classic()
Threshold Percentage   Net Public Loss   Credits Given
50                     2179550           19
90                     308300            61

Optimized Threshold Confusion Matrix, Visualized

The histogram below shows the true positives and false predictions from the model. We removed true negatives because their overwhelming count would dominate the display. The graph shows that at the optimal threshold (90%), there is a modest uptick in true positives alongside a larger jump in false positives.

The graph, specifically the 90% mark, has the fewest false negatives*, meaning the most distressed homeowners, those who want the subsidy, are contacted under this model.

  • While the 100% mark technically has no recorded false negatives, that benefit is negated by the volume of false positives.

results <- data.frame(threshold_in_pct = seq(10,100,10),
                     True_positive=results_sens_11,
                     False_negative=results_sens_01,
                     False_positive=results_sens_10) %>%
  gather(variable,value,-threshold_in_pct)

ggplot(results, aes(x = threshold_in_pct, y= value, fill = variable)) + 
  labs(x = "Threshold Percentage", y = "Count", fill = "Confusion Matrix Outcomes")+
  geom_bar(stat='identity')+
  scale_fill_viridis_d()+
  theme_bw()

Optimal Threshold Confusion Matrix, By the Numbers

The table below shows the confusion matrix at our optimal threshold, a predicted-probability cutoff of 0.1.
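
That cutoff comes from how the loops above define the threshold:

# the loops use cutoff = 1 - (i/10); the 90% mark corresponds to i = 9
1 - (9 / 10)   # probability cutoff of 0.1 used below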

testProbs3 <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                        Probs = predict(housing_model2, housing_test, type= "response"))

testProbs3 <- 
  testProbs3 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.1, 1, 0)))

test3table <- caret::confusionMatrix(testProbs3$predOutcome, testProbs3$Outcome, 
                       positive = "1")

test3table$table %>%
  data.frame() %>%
  spread(key=Reference,value=Freq) %>%
  rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
  mutate(Total = Not_Take_Subsidy + Take_Subsidy,
         Prediction = c('Not_Take_Subsidy','Take_Subsidy')) %>%
  kbl() %>%
  add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
  kable_minimal()
                    Actual
Prediction          Not_Take_Subsidy   Take_Subsidy   Total
Not_Take_Subsidy    1028               47             1075
Take_Subsidy        171                59             230

Optimal Threshold Density Plot Distribution

The density distribution of predicted probabilities by observed outcome still shows a large right skew for negative outcomes, meaning our model remains very good at predicting true negatives.

In an ideal scenario, we would want our positive-outcome density distribution to be heavily left skewed, with its mass near high probabilities. We could then choose an even cleaner cutoff threshold and thereby further increase the quality of our model.

Given the quality of the data we had, we applied all relevant feature engineering techniques to isolate correct positive outcomes. Our biggest lever, however, as this graph displays, is the choice of threshold.

ggplot(testProbs3, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probability", y = "Density of Predicted Outcomes",
       title = "Distribution of Predicted Probabilities by Observed Outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

Receiver Operating Characteristic Curve

The area under the curve indicates how much better our model is than random chance. Since random chance corresponds to 0.50, our score of 0.74 is 24 points higher, showing the model is useful but imperfect. The imperfection also speaks to scalability: the model is not overfitted to this one situation and can generalize as needed.

auc(testProbs3$Outcome, testProbs3$Probs)
## Area under the curve: 0.743
ggplot(testProbs3, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "mediumorchid2") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey40') +
  labs(title = "Receiver Operating Characteristic Curve")

CV Fit

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(y ~ ., data = housing_data %>% select(-y_numeric, -...1), method="glm", family="binomial",  metric="ROC", trControl = ctrl)

cvFit
## Generalized Linear Model 
## 
## 4119 samples
##   29 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4077, 4078, 4077, 4078, 4077, 4077, ... 
## Resampling results:
## 
##   ROC        Sens       Spec 
##   0.7632462  0.9811712  0.219
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) +
    geom_histogram(bins = 35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x = "Goodness of Fit", y = "Count", title = "CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean represented as dotted lines")

The results above come from evaluating the model with 100-fold cross-validation. This emphasizes the importance of looking at the variation in performance across folds to assess the model’s generalizability.

Instead of focusing on the means, we focus on the across-fold distribution of each metric, because a tighter across-fold distribution implies a higher degree of generalizability.

The ROC metric, which measures the model’s ability to distinguish between positive and negative cases, shows a consistent performance around a score of 0.76.

Note that caret treats “no” as the positive class here, so the reported sensitivity (0.98) largely reflects the majority of negative cases, while the reported specificity captures the minority “yes” cases. That specificity shows much wider variation across folds, suggesting the model captures some trends in the data but not consistently across all cases. The small number of positive cases in the initial dataset makes this metric sensitive to small differences in how many positive cases the model predicts per fold.

Overall, the model performs reasonably well, but there are challenges in achieving consistent sensitivity due to the dataset’s imbalance.

Results & Recommendation

Our model successfully meets its intended objective of reducing false negatives while maintaining an acceptable level of false positives. Achieving this outcome required extensive feature engineering, which fine-tuned our logistic regression model to identify subsidy beneficiaries. Considering the limited number of beneficiaries in the dataset, our model can be deemed effective. This success is also evident in the significant increase in distributed credits and overall public benefits.

As a result, we recommend using this model over the current random-outreach technique, and advise that more data be gathered to continue enhancing the model.