Emil City’s Home Repair Tax Credit Program has consistently underperformed against its goal of reaching eligible homeowners for housing repairs and delivering direct community benefit. The current strategy of contacting residents by random sampling simply leaves the success rate to chance.
The purpose of the tax credit as a whole is to redevelop neighborhoods, without displacing people or changing the character of those neighborhoods, while providing direct community support. Our goal is to develop a model that increases the chances of identifying the individuals most likely to take the credit.
Every campaign provides additional data that can be used to identify potential beneficiaries. Analyzing the instances in which residents say yes, however few, enhances our model's ability to predict future success.
The density plots below show how each numeric feature is distributed among respondents who accepted versus declined the tax credit. The majority of these features provide little to no insight into when a respondent will accept the credit. Notably, for the unemployment rate, consumer confidence index, and inflation rate, there are value ranges where “yes” responses are relatively more common than “no” responses. Factors like these become indicators for our model later on: if a variable falls within such a range, the respondent is more likely to take the credit.
Note that these plots only accommodate numeric data; nominal data is addressed with an alternative method below.
# Density of each numeric feature, faceted by feature and colored by response
ggplot()+
  geom_density(data=housing_data_numeric,aes(x=value,color=y))+
  facet_wrap(~variable,scales = "free")+
  scale_color_manual(values=c('orange','lightgreen'),name='Response')+
  labs(x="Recorded Factors of Contacted Resident", y="Density Distribution",
       title = "Distribution of Success by Feature")+
  theme_bw()
The histograms below display the number of people contacted across the range of each measured feature. Each panel conveys the sample size behind each attribute and verifies that a relative difference between response counts rests on enough observations to be less susceptible to random chance, reflecting a reasonably observed correlation.
# Contact counts across the range of each numeric feature
ggplot()+
  geom_histogram(data=housing_data_numeric,aes(x=value),bins=30)+
  facet_wrap(~variable,scales = "free")+
  labs(x="Recorded Factors of Contacted Resident", y="Count")+
  theme_light()
# Correlations and Importance
Similar to the plots above, comparing the mean value of each attribute between “yes” and “no” responses points to a “sweet spot” that can further indicate when a respondent will become a beneficiary. Panels showing two similarly sized bars indicate that the attribute makes little difference to the resident's response. We notice that the inflation rate, unemployment rate, and pdays (the number of days since the resident was last contacted) are factors the city can use to anticipate when people will apply to the program.
# Mean of each numeric feature, split by response
ggplot(data=housing_data_numeric_summary,aes(x=y,y=mean,fill=y))+
  geom_bar(stat='identity')+
  facet_wrap(~variable,scales = "free")+
  scale_fill_manual(values=c('orange','lightgreen'),name='Response')+
  labs(x="Recorded Factors of Contacted Resident", y="Mean Value",
       title = "Measuring Likelihood to Enter the Housing Subsidy Program")+
  theme_bw()
For the nominal data, we first examined each category and compared the responses within it. We then feature engineered the categories by “binning” them into grouped features, some grouped intuitively and others by their “yes” rate; examples include binning jobs into “employed or unemployed” and grouping the most successful months as “high-rate months” (see the sketch below). The purpose of the binning is to home in on where the success rate is largest, regardless of sample size.
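A minimal sketch of this kind of binning, assuming the raw columns are named job and month as in the regression output below; the exact category lists shown here are illustrative, not the definitive groupings:

library(dplyr)

housing_data <- housing_data %>%
  mutate(
    # Intuitive bin: collapse job categories into employment status
    job_group = case_when(
      job %in% c("unemployed", "student", "retired") ~ "not employed",
      job == "unknown" ~ "unknown",
      TRUE ~ "employed"),
    # Empirical bin: flag months whose observed "yes" rate was highest
    month_high = ifelse(month %in% c("mar", "dec", "sep", "oct"),
                        "high-rate month", "low-rate month"))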
The purpose of a logistic regression model is to predict the probability that a binary variable is 1 or 0. In this case it does not simply predict whether a respondent says yes or no, but the probability that they will say yes (or no), expressed as a percentage. A “kitchen sink” model takes every available variable and feeds it into the regression. Including everything is meant for experimental insight, regardless of statistical assumptions or cautions: we see which variables improve or degrade the model, then build a new model from a filtered set of variables accordingly.
The results below summarize the regression, primarily showing the association between each measured variable and the probability of a “yes”. The statistic we analyze most is the z-value, the ratio of each coefficient to its standard error, which indicates how far an estimated effect sits from zero relative to its uncertainty.
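For reference, the model estimates the probability of a “yes” through the logit link, and each reported z-value is simply the coefficient scaled by its standard error:

$$
P(y = 1 \mid X) = \frac{1}{1 + e^{-X\beta}}, \qquad z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}
$$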
# Kitchen sink model: every raw variable, with the engineered/grouped
# features excluded so each raw variable enters on its own
housing_model <- glm(y_numeric ~ .,
                     data = housing_train %>%
                       select(-y, -...1, -job_group, -education_group, -season,
                              -cons.conf.idx_bucket, -unemploy_rate_bucket, -month_high,
                              -campaign_log, -pdays_0, -spent_on_repairs_bucket,
                              -inflation_bucket),
                     family="binomial" (link="logit"))

summary(housing_model)$coefficients %>%
  kbl(col.names = c('Beta Coefficient','Standard Error','Z value','p-value')) %>%
  kable_classic()
Variable | Beta Coefficient | Standard Error | Z value | p-value |
---|---|---|---|---|
(Intercept) | -139.3133554 | 124.5460377 | -1.1185691 | 0.2633240 |
age | 0.0186645 | 0.0081698 | 2.2845739 | 0.0223378 |
jobblue-collar | -0.3480401 | 0.2609945 | -1.3335151 | 0.1823628 |
jobentrepreneur | -0.4527138 | 0.4405705 | -1.0275626 | 0.3041556 |
jobhousemaid | -0.1505919 | 0.4507725 | -0.3340751 | 0.7383229 |
jobmanagement | -0.5743124 | 0.2962660 | -1.9385026 | 0.0525619 |
jobretired | -0.3899474 | 0.3576602 | -1.0902733 | 0.2755928 |
jobself-employed | -0.4012639 | 0.3956752 | -1.0141245 | 0.3105234 |
jobservices | -0.1539659 | 0.2766189 | -0.5565993 | 0.5778012 |
jobstudent | 0.0847638 | 0.3886936 | 0.2180736 | 0.8273718 |
jobtechnician | -0.0363663 | 0.2302056 | -0.1579730 | 0.8744781 |
jobunemployed | 0.1515835 | 0.3745516 | 0.4047067 | 0.6856931 |
jobunknown | -0.5037387 | 0.6657558 | -0.7566419 | 0.4492645 |
maritalmarried | 0.2671366 | 0.2417254 | 1.1051239 | 0.2691059 |
maritalsingle | 0.3319164 | 0.2775839 | 1.1957339 | 0.2318004 |
maritalunknown | -12.6458344 | 496.6473524 | -0.0254624 | 0.9796861 |
educationbasic.6y | 0.1651788 | 0.3852280 | 0.4287820 | 0.6680818 |
educationbasic.9y | 0.0887448 | 0.3116651 | 0.2847442 | 0.7758401 |
educationhigh.school | 0.0117097 | 0.3003195 | 0.0389907 | 0.9688978 |
educationilliterate | -13.9458923 | 1455.3976333 | -0.0095822 | 0.9923546 |
educationprofessional.course | 0.1706240 | 0.3265777 | 0.5224607 | 0.6013496 |
educationuniversity.degree | 0.0661469 | 0.3011001 | 0.2196842 | 0.8261171 |
educationunknown | 0.1729066 | 0.3818325 | 0.4528335 | 0.6506686 |
taxLienunknown | 0.1083271 | 0.2035672 | 0.5321441 | 0.5946262 |
taxLienyes | -12.4143685 | 1455.3976227 | -0.0085299 | 0.9931942 |
mortgageunknown | -0.6967393 | 0.6041322 | -1.1532894 | 0.2487916 |
mortgageyes | -0.0428265 | 0.1363086 | -0.3141876 | 0.7533786 |
taxbill_in_phlyes | 0.0625151 | 0.1860637 | 0.3359876 | 0.7368802 |
contacttelephone | -0.8823059 | 0.2774425 | -3.1801397 | 0.0014720 |
monthaug | 0.0632912 | 0.4147843 | 0.1525882 | 0.8787230 |
monthdec | 0.3271233 | 0.6704837 | 0.4878915 | 0.6256267 |
monthjul | 0.1687479 | 0.3505082 | 0.4814379 | 0.6302053 |
monthjun | 0.2398657 | 0.4354699 | 0.5508203 | 0.5817569 |
monthmar | 2.0808148 | 0.5314372 | 3.9154482 | 0.0000902 |
monthmay | -0.2045716 | 0.2956646 | -0.6919042 | 0.4889975 |
monthnov | -0.3645804 | 0.4158487 | -0.8767141 | 0.3806420 |
monthoct | -0.1274885 | 0.5313998 | -0.2399106 | 0.8103995 |
monthsep | 0.2083822 | 0.6038580 | 0.3450848 | 0.7300306 |
day_of_weekmon | -0.0337928 | 0.2155025 | -0.1568091 | 0.8753953 |
day_of_weekthu | 0.0235827 | 0.2175979 | 0.1083775 | 0.9136963 |
day_of_weektue | 0.0722626 | 0.2167325 | 0.3334183 | 0.7388186 |
day_of_weekwed | 0.2365578 | 0.2227694 | 1.0618952 | 0.2882833 |
campaign | -0.0923050 | 0.0427933 | -2.1569967 | 0.0310059 |
pdays | -0.0283463 | 0.0395935 | -0.7159338 | 0.4740322 |
previous | 0.0782615 | 0.1683275 | 0.4649358 | 0.6419775 |
poutcomenonexistent | 0.5456035 | 0.2966325 | 1.8393247 | 0.0658674 |
poutcomesuccess | 1.8989076 | 0.3195301 | 5.9428131 | 0.0000000 |
unemploy_rate | -0.8864430 | 0.4702282 | -1.8851337 | 0.0594118 |
cons.price.idx | 1.3932641 | 0.8218601 | 1.6952571 | 0.0900267 |
cons.conf.idx | 0.0437257 | 0.0268727 | 1.6271420 | 0.1037069 |
inflation_rate | 0.0347431 | 0.4240013 | 0.0819411 | 0.9346935 |
spent_on_repairs | 0.0013935 | 0.0100794 | 0.1382484 | 0.8900441 |
Here we look specifically at the McFadden score, a pseudo-R² computed as one minus the ratio of the fitted model's log-likelihood to that of an intercept-only null model; higher values indicate a better fit over the null. A higher score is ideal, but given the data we are working with, we can prioritize other readings such as sensitivity and specificity.
pR2(housing_model) %>%
  kbl(col.names = c("Value")) %>%
  kable_classic()
## fitting null model for pseudo-r2
Statistic | Value |
---|---|
llh | -792.4723584 |
llhNull | -1047.0219127 |
G2 | 509.0991087 |
McFadden | 0.2431177 |
r2ML | 0.1654950 |
r2CU | 0.3153108 |
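As a check, the McFadden value can be reproduced from the two log-likelihoods reported above:

$$
\text{McFadden} = 1 - \frac{\text{llh}}{\text{llhNull}} = 1 - \frac{-792.4724}{-1047.0219} \approx 0.2431
$$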
# Out-of-sample predicted probabilities from the kitchen sink model
testProbs <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                        Probs = predict(housing_model, housing_test, type= "response"))
A confusion matrix tallies the model's predictions against what actually happened: whether we would have reached out, and whether the respondent did (or would have) taken the tax credit.
In the output, Reference is the real outcome and Prediction is what the model predicts. In program terms, the prediction determines whether we reach out, while the reference tells us whether the resident takes the credit.
Most residents are correctly predicted not to take the credit (0,0). There are 19 residents we reach out to who accept the tax credit (1,1); there are also those we do not reach out to but who would take the credit (0,1), and those we reach out to who say no (1,0).
We define (1,0) as a false positive and (0,1) as a false negative. Because negatives overwhelmingly dominate, the model has a high specificity rate. But what matters most for the program is minimizing false negatives and maximizing the sensitivity rate (the share of true positives captured); the worked rates after the table below make this concrete.
# Classify at the default 0.5 probability threshold
testProbs <-
  testProbs %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5 , 1, 0)))

kitchensink_confuse <- caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
                                              positive = "1")
kitchensink_confuse$table %>%
data.frame() %>%
spread(key=Reference,value=Freq) %>%
rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
mutate(Total = Not_Take_Subsidy + Take_Subsidy,
Prediction = c('Not Take Subsidy','Take Subsidy')) %>%
kbl() %>%
add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
kable_minimal()
Prediction | Not_Take_Subsidy | Take_Subsidy | Total |
---|---|---|---|
Not Take Subsidy | 1175 | 87 | 1262 |
Take Subsidy | 24 | 19 | 43 |
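Plugging the counts from this table into the standard definitions confirms the imbalance described above:

$$
\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{19}{19 + 87} \approx 0.179, \qquad
\text{Specificity} = \frac{TN}{TN + FP} = \frac{1175}{1175 + 24} \approx 0.980
$$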
We create our own logistic regression model that uses the grouped variables and removes the individual variables they replace. For example, instead of each month entering the model separately, a “high month vs. low month” indicator simplifies the categories. Taking each binned or grouped variable with the largest discrepancy in success rates helps the model predict the likelihood of a yes, even when the sample is small. The ultimate goal is to minimize the rate of false negatives, because those represent people who want the tax credit but are never contacted.
# Engineered model: keeps the grouped/binned features created above
housing_model2 <- glm(y_numeric ~ .,
                      data = housing_data %>% select(-y, -...1),
                      family="binomial" (link="logit"))
summary(housing_model2)$coefficients %>%
kbl(col.names = c('Beta Coefficient','Standard Error','Z value','p-value')) %>%
kable_classic()
Variable | Beta Coefficient | Standard Error | Z value | p-value |
---|---|---|---|---|
(Intercept) | -185.0781298 | 109.8112351 | -1.6854207 | 0.0919074 |
age | 0.0159925 | 0.0068844 | 2.3230068 | 0.0201788 |
jobblue-collar | -0.3280994 | 0.2258342 | -1.4528329 | 0.1462701 |
jobentrepreneur | -0.5664931 | 0.3995950 | -1.4176680 | 0.1562877 |
jobhousemaid | -0.2185839 | 0.4010634 | -0.5450107 | 0.5857462 |
jobmanagement | -0.4879077 | 0.2503855 | -1.9486264 | 0.0513401 |
jobretired | -0.3079498 | 0.3047764 | -1.0104123 | 0.3122978 |
jobself-employed | -0.6201963 | 0.3488899 | -1.7776273 | 0.0754651 |
jobservices | -0.2052016 | 0.2391887 | -0.8579068 | 0.3909439 |
jobstudent | -0.0203117 | 0.3504041 | -0.0579665 | 0.9537753 |
jobtechnician | -0.0462570 | 0.1921863 | -0.2406882 | 0.8097968 |
jobunemployed | 0.0921775 | 0.3350057 | 0.2751519 | 0.7831995 |
jobunknown | -0.5551708 | 0.6448636 | -0.8609120 | 0.3892865 |
maritalmarried | 0.1716322 | 0.2022282 | 0.8487054 | 0.3960452 |
maritalsingle | 0.2716739 | 0.2315568 | 1.1732491 | 0.2406959 |
maritalunknown | 0.1633412 | 1.1826888 | 0.1381101 | 0.8901534 |
educationbasic.6y | 0.2492921 | 0.3403870 | 0.7323785 | 0.4639376 |
educationbasic.9y | 0.1370122 | 0.2730245 | 0.5018312 | 0.6157863 |
educationhigh.school | 0.1137234 | 0.2628612 | 0.4326365 | 0.6652788 |
educationilliterate | -12.0675015 | 535.4113730 | -0.0225387 | 0.9820182 |
educationprofessional.course | 0.2180084 | 0.2850961 | 0.7646842 | 0.4444596 |
educationuniversity.degree | 0.1965484 | 0.2639039 | 0.7447725 | 0.4564093 |
educationunknown | 0.2197850 | 0.3444588 | 0.6380589 | 0.5234354 |
taxLienunknown | -0.0227896 | 0.1781410 | -0.1279302 | 0.8982042 |
taxLienyes | -10.3847652 | 535.4113674 | -0.0193959 | 0.9845253 |
mortgageunknown | -0.2814371 | 0.4445298 | -0.6331119 | 0.5266606 |
mortgageyes | -0.0947365 | 0.1175087 | -0.8062081 | 0.4201229 |
taxbill_in_phlyes | 0.0747261 | 0.1592315 | 0.4692924 | 0.6388607 |
contacttelephone | -1.0171726 | 0.2375370 | -4.2821657 | 0.0000185 |
monthaug | 0.0253213 | 0.3851079 | 0.0657511 | 0.9475760 |
monthdec | 0.8880403 | 0.6081733 | 1.4601765 | 0.1442416 |
monthjul | -0.1075706 | 0.3452723 | -0.3115528 | 0.7553804 |
monthjun | 0.0243213 | 0.4111909 | 0.0591484 | 0.9528339 |
monthmar | 1.9102596 | 0.4683499 | 4.0787019 | 0.0000453 |
monthmay | -0.2957782 | 0.2604128 | -1.1358051 | 0.2560381 |
monthnov | -0.5343667 | 0.5092118 | -1.0493998 | 0.2939941 |
monthoct | -0.1895110 | 0.5125614 | -0.3697333 | 0.7115812 |
monthsep | -0.0932119 | 0.5254761 | -0.1773856 | 0.8592055 |
day_of_weekmon | -0.0581173 | 0.1828745 | -0.3177990 | 0.7506374 |
day_of_weekthu | 0.0122096 | 0.1838679 | 0.0664040 | 0.9470562 |
day_of_weektue | -0.0196597 | 0.1876518 | -0.1047667 | 0.9165610 |
day_of_weekwed | 0.1715558 | 0.1877637 | 0.9136794 | 0.3608853 |
campaign | -0.2183238 | 0.1017009 | -2.1467237 | 0.0318153 |
pdays | -0.0811551 | 0.0521652 | -1.5557348 | 0.1197712 |
previous | 0.1535239 | 0.1634915 | 0.9390325 | 0.3477140 |
poutcomenonexistent | 0.6365348 | 0.2720416 | 2.3398435 | 0.0192918 |
poutcomesuccess | 1.0152470 | 0.6056019 | 1.6764263 | 0.0936547 |
unemploy_rate | -1.0373266 | 0.4302672 | -2.4108891 | 0.0159137 |
cons.price.idx | 1.7492001 | 0.7631590 | 2.2920520 | 0.0219026 |
cons.conf.idx | 0.0633690 | 0.0511650 | 1.2385217 | 0.2155227 |
inflation_rate | 0.0638506 | 0.4905384 | 0.1301643 | 0.8964365 |
spent_on_repairs | 0.0040149 | 0.0087034 | 0.4612961 | 0.6445862 |
inflation_bucketMore than 2.5 | -0.1044430 | 1.6115511 | -0.0648090 | 0.9483261 |
pdays_0Not New Target | 1.1529975 | 0.7721658 | 1.4931994 | 0.1353850 |
campaign_log | 0.4286657 | 0.2666864 | 1.6073776 | 0.1079716 |
cons.conf.idx_bucketMore than -42 | -0.1432138 | 0.5721987 | -0.2502869 | 0.8023655 |
Even though the McFadden score is slightly lower for this model, the difference of roughly 0.02 is negligible.
pR2(housing_model2) %>%
  kbl(col.names = c("Value")) %>%
  kable_classic()
## fitting null model for pseudo-r2
Statistic | Value |
---|---|
llh | -1100.2386754 |
llhNull | -1422.9216004 |
G2 | 645.3658502 |
McFadden | 0.2267749 |
r2ML | 0.1450226 |
r2CU | 0.2906973 |
# Out-of-sample predicted probabilities from the engineered model
testProbs2 <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                         Probs = predict(housing_model2, housing_test, type= "response"))
Comparing confusion matrices at the same 0.5 threshold, our model produces 11 fewer false positives than the kitchen sink model, a reduction of roughly 46% (the count falls from 24 to 13). Those are residents the city would have contacted who would not take the credit, so outreach becomes that much more efficient; false negatives are unchanged at this threshold.
# Classify at the same 0.5 threshold (built from testProbs2, not testProbs)
testProbs2 <-
  testProbs2 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5 , 1, 0)))

ownmodel_confuse <- caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome,
                                           positive = "1")
ownmodel_confuse$table %>%
data.frame() %>%
spread(key=Reference,value=Freq) %>%
rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
mutate(Total = Not_Take_Subsidy + Take_Subsidy,
Prediction = c('Not_Take_Subsidy','Take_Subsidy')) %>%
kbl() %>%
add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
kable_minimal()
Prediction | Not_Take_Subsidy | Take_Subsidy | Total |
---|---|---|---|
Not_Take_Subsidy | 1186 | 87 | 1273 |
Take_Subsidy | 13 | 19 | 32 |
The density distributions of predicted probability by observed outcome show a heavy right skew for the “0” outcome, residents who did not accept the tax credit: our model is excellent at predicting when someone will say no. But this is only half of the ideal picture. We would also want the “1” outcome distribution (those who took the credit) concentrated near high probabilities, so that a clean cut-off would separate the two groups and tell us whom to call, further increasing our odds.
Since this distribution alone does not hand us that cut-off, we turn to alternative ways of finding the optimal threshold.
ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Probability", y = "Density of Outcome",
title = "Distribution of Predicted Probabilities by Observed Outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
It is important to understand the purpose of the housing subsidy program beyond people simply taking the tax credit to repair their homes. The purpose of the subsidy is to inject funding back into communities in a way that raises property values without gentrifying the area, supports homeowners, and uplifts quality of life across the city.
Eliminating false negatives ensures that those who want the subsidy are contacted. The effort it takes the city to reach those people costs money as well, so the goal is to sort as efficiently as possible between residents who will and will not take the subsidy.
The optimal threshold tells us when the allocated credits benefit the community the most and create the largest return on investment. Every person who wants the credit but was not contacted is foregone benefit to the city. The program's aim is not tangible profit but rebuilding communities efficiently, minimizing the net loss incurred when spending the budget. Now we look at where that line is drawn.
For every $5,000 credit subsidized, we create an average of $10,000 in value for the home directly, plus an accumulated aggregate average of $56,000 for surrounding homes: a value-to-cost ratio of 13.2, or 1,320%. While the city itself spends tangible budget on searching, calling, and allocating, the “profits” accrue at the homeowner level. Hence the importance of providing these allocations.
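The ratio works out as follows:

$$
\frac{\$10{,}000 + \$56{,}000}{\$5{,}000} = \frac{\$66{,}000}{\$5{,}000} = 13.2 \;\Rightarrow\; 1{,}320\%
$$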
# False positives (predicted take, actual not take) at ten descending thresholds
results_sens_10 <- vector()
for (i in seq(1,10,1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1-(i/10) , 1, 0)))
  results_sens_10[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[2]
}
results_sens_10
## [1] 0 2 7 9 24 36 58 87 217 1199
# False negatives (predicted not take, actual take) at the same thresholds
results_sens_01 <- vector()
for (i in seq(1,10,1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1-(i/10) , 1, 0)))
  results_sens_01[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[3]
}
results_sens_01
## [1] 106 104 96 93 87 78 68 60 45 0
# True positives (predicted take, actual take) at the same thresholds
results_sens_11 <- vector()
for (i in seq(1,10,1)){
  t <- testProbs2 %>% mutate(predOutcome = as.factor(ifelse(Probs > 1-(i/10) , 1, 0)))
  results_sens_11[i] <- caret::confusionMatrix(t$predOutcome, t$Outcome, positive = "1")$table[4]
}
results_sens_11
## [1] 0 2 10 13 19 28 38 46 61 106
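As a side note, the three loops can be collapsed into a single pass; a minimal sketch, assuming testProbs2 holds the Outcome and Probs columns used above:

# FP, FN, and TP counts at all ten thresholds in one sapply call
thresholds <- 1 - seq(1, 10, 1)/10
counts <- sapply(thresholds, function(th) {
  pred <- factor(ifelse(testProbs2$Probs > th, 1, 0), levels = c(0, 1))
  tab  <- table(Prediction = pred, Reference = testProbs2$Outcome)
  c(FP = tab[2, 1], FN = tab[1, 2], TP = tab[2, 2])
})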
# Modeled benefit generated per true positive (credit taken after contact)
total_rev_11 <- results_sens_11*250 + results_sens_11*10000 + results_sens_11*20000
total_rev_11
## [1] 0 60500 302500 393250 574750 847000 1149500 1391500 1845250
## [10] 3206500
# Outreach cost of $2,850 for each contacted resident who takes the credit
total_cost_11 <- -(results_sens_11*2850)
total_cost_11
## [1] 0 -5700 -28500 -37050 -54150 -79800 -108300 -131100 -173850
## [10] -302100
# (0,1) is the foregone benefit from residents never contacted, treated as a cost
total_cost_01 <- -(results_sens_01*250 + results_sens_01*10000 + results_sens_01*20000)
total_cost_01
## [1] -3206500 -3146000 -2904000 -2813250 -2631750 -2359500 -2057000 -1815000
## [9] -1361250 0
# Outreach cost for contacted residents who decline
total_cost_10 <- -(results_sens_10*2850)
total_cost_10
## [1] 0 -5700 -19950 -25650 -68400 -102600 -165300 -247950
## [9] -618450 -3417150
# Net public benefit at each threshold (negative values are a net loss)
revenue  <- total_rev_11
cost     <- total_cost_11 + total_cost_01 + total_cost_10
net      <- revenue + cost
net_cost <- net
# Credits actually distributed at each threshold (the true positives)
total_credits_given <- results_sens_11
total_credits_given
## [1] 0 2 10 13 19 28 38 46 61 106
data.frame(threshold_in_pct = seq(10,100,10), revenue, cost, net_cost) %>%
  kbl(col.names = c("Threshold Percentage","Revenue (USD)","Cost (USD)","Net Cost (USD)")) %>%
  kable_classic()
Threshold Percentage | Revenue (USD) | Cost (USD) | Net Cost (USD) |
---|---|---|---|
10 | 0 | -3206500 | -3206500 |
20 | 60500 | -3157400 | -3096900 |
30 | 302500 | -2952450 | -2649950 |
40 | 393250 | -2875950 | -2482700 |
50 | 574750 | -2754300 | -2179550 |
60 | 847000 | -2541900 | -1694900 |
70 | 1149500 | -2330600 | -1181100 |
80 | 1391500 | -2194050 | -802550 |
90 | 1845250 | -2153550 | -308300 |
100 | 3206500 | -3719250 | -512750 |
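As a check, the 90% row can be reproduced from the counts gathered above (61 true positives, 45 false negatives, 217 false positives):

$$
\underbrace{61 \times 30{,}250}_{\text{revenue}} - \underbrace{61 \times 2{,}850}_{\text{contact cost}} - \underbrace{45 \times 30{,}250}_{\text{foregone benefit}} - \underbrace{217 \times 2{,}850}_{\text{wasted outreach}} = -\$308{,}300
$$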
As the table above shows, the 90% mark has the smallest net loss, at only $308,300. (In this table, the 90% mark corresponds to a probability cutoff of 0.10: we contact anyone with more than a 10% predicted chance of accepting.) While the revenue gained from the program keeps increasing even at 100%, the cost grows faster because the excess credits given no longer add value for the city. We will use the 90% threshold.
With the public interest in mind, 90% proves itself the “sweet spot” because of the dip seen in the foregone benefit chart. The 90% bar (second from the right) is lower than the 100% bar, and lower than the declining pattern across the rest of the graph would suggest, which would place it between $750,000 and $500,000. The credit allocation chart, by contrast, grows in a smooth, predictable way, with 90% providing 61 credits, in between its neighboring values. The outlier in foregone benefit, paired with the uniformity of credit allocation, indicates that this threshold creates the most value per credit.
Intuition might suggest that 50% would be the optimal threshold, at least in theory and before any analysis. We compare the two directly to test that assumption.
results_final <- data.frame(threshold_in_pct = seq(10,100,10),
net_public_loss = -1*net ,
total_credits_given = total_credits_given)
results_final
## threshold_in_pct net_public_loss total_credits_given
## 1 10 3206500 0
## 2 20 3096900 2
## 3 30 2649950 10
## 4 40 2482700 13
## 5 50 2179550 19
## 6 60 1694900 28
## 7 70 1181100 38
## 8 80 802550 46
## 9 90 308300 61
## 10 100 512750 106
grid.arrange(ncol=2,
  ggplot(data = results_final, aes(x = threshold_in_pct, y = net_public_loss)) +
    geom_bar(stat='identity',fill='orange4')+
    labs(title = "Foregone Benefit")+
    theme_bw()+
    labs(y='Opportunity Cost',x='Threshold'),
  ggplot(data = results_final, aes(x = threshold_in_pct, y = total_credits_given)) +
    geom_bar(stat='identity',fill='lightblue')+
    labs(title = "Credit Allocation")+
    theme_bw()+
    labs(y='Total Credits Given',x='Threshold')
)
## Comparing 50% to the Optimized Threshold
The table below puts 50% and 90% side by side. Using 50% as the threshold for contacting residents, the city loses over $2.1 million in total, an average loss of $114,713 per credit given.
At 90%, 61 credits are given while losing $308,300, a net loss of $5,054 per credit: a significant improvement, and far more affordable for a housing subsidy.
This decreases the net public loss roughly 7-fold while increasing the credits given more than 3-fold. On top of that, the appreciation of the repaired homes and their surroundings alone comes to roughly $1.8 million when 61 credits are given.
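The per-credit figures follow directly from the loss totals and credit counts:

$$
\frac{\$2{,}179{,}550}{19} \approx \$114{,}713 \;\text{ at } 50\%, \qquad \frac{\$308{,}300}{61} \approx \$5{,}054 \;\text{ at } 90\%
$$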
results_final %>%
  dplyr::filter(threshold_in_pct %in% c(50,90)) %>%
  kbl(col.names = c("Threshold Percentage","Net Public Loss","Credits Given")) %>%
  kable_classic()
Threshold Percentage | Net Public Loss | Credits Given |
---|---|---|
50 | 2179550 | 19 |
90 | 308300 | 61 |
The stacked bar chart below shows true positives, false negatives, and false positives at each threshold. True negatives are omitted because their overwhelming count would dominate the display. At the optimal threshold (90%), there is a slight uptick in true positives alongside a larger increase in false positives.
The 90% mark also has the fewest false negatives of any threshold short of the degenerate 100% case (contacting everyone), meaning that under this model the most distressed homeowners, those who want the subsidy, are contacted.
# Confusion-matrix outcome counts by threshold (true negatives omitted)
results <- data.frame(threshold_in_pct = seq(10,100,10),
                      True_positive=results_sens_11,
                      False_negative=results_sens_01,
                      False_positive=results_sens_10) %>%
  gather(variable,value,-threshold_in_pct)
ggplot(results, aes(x = threshold_in_pct, y= value, fill = variable)) +
labs(x = "Threshold Percentage", y = "Count", fill = "Confusion Matrix Outcomes")+
geom_bar(stat='identity')+
scale_fill_viridis_d()+
theme_bw()
The confusion matrix below shows the results using our optimal threshold (a probability cutoff of 0.10).
# Predicted probabilities from the engineered model, classified at the 0.10 cutoff
testProbs3 <- data.frame(Outcome = as.factor(housing_test$y_numeric),
                         Probs = predict(housing_model2, housing_test, type= "response"))

testProbs3 <-
  testProbs3 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.1 , 1, 0)))

test3table <- caret::confusionMatrix(testProbs3$predOutcome, testProbs3$Outcome,
                                     positive = "1")
test3table$table %>%
data.frame() %>%
spread(key=Reference,value=Freq) %>%
rename('Not_Take_Subsidy' = '0', 'Take_Subsidy' = '1') %>%
mutate(Total = Not_Take_Subsidy + Take_Subsidy,
Prediction = c('Not_Take_Subsidy','Take_Subsidy')) %>%
kbl() %>%
add_header_above(header=c(" " = 1,"Actual" = 3)) %>%
kable_minimal()
Prediction | Not_Take_Subsidy | Take_Subsidy | Total |
---|---|---|---|
Not_Take_Subsidy | 1028 | 47 | 1075 |
Take_Subsidy | 171 | 59 | 230 |
The density distribution of predicted probabilities by observed outcome again shows a heavy right skew for negative outcomes, meaning our model still does a great job of predicting true negatives.
In an ideal scenario, we would also want the positive-outcome density concentrated near 1 rather than spread across low probabilities, because we could then choose an even cleaner cut-off and further increase the quality of our model.
Given the quality of the data we had, we applied all relevant feature engineering techniques to isolate correct positive outcomes. Our biggest lever, however, as this graph displays, is the choice of optimal threshold.
ggplot(testProbs3, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Probability", y = "Density of Predicted Outcomes",
title = "Distribution of Predicted Probabilities by Observed Outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
The area under the ROC curve indicates how much better our model is than random chance. Random chance corresponds to an AUC of 0.5, so at 0.743 we are about 24 points above it: a solidly useful, though imperfect, model. That imperfection also suggests the model is not overfitted to this particular dataset, and that it can generalize as the program expands.
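One common reading of the AUC, noted here for reference: it is the probability that a randomly chosen resident who took the credit receives a higher predicted probability than a randomly chosen resident who did not.

$$
\text{AUC} = P\left(\hat{p}_{\text{yes}} > \hat{p}_{\text{no}}\right) \approx 0.743
$$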
auc(testProbs3$Outcome, testProbs3$Probs)
## Area under the curve: 0.743
ggplot(testProbs3, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "mediumorchid2") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey40') +
  labs(title = "Receiver Operating Characteristic Curve")
# 100-fold cross-validation of the engineered model
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit <- train(y ~ ., data = housing_data %>% select(-y_numeric, -...1),
               method="glm", family="binomial", metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 4119 samples
## 29 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4077, 4078, 4077, 4078, 4077, 4077, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7632462 0.9811712 0.219
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) +
  geom_histogram(bins=35, fill = "#FF006A") +
  facet_wrap(~metric) +
  geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
       subtitle = "Across-fold mean represented as dotted lines")
The figure above summarizes the model's performance under 100-fold cross-validation. It emphasizes the importance of looking at the variation in performance across folds to assess the model's generalizability.
Rather than focusing on the means, we focus on the across-fold distribution of each metric: a tighter distribution implies a higher degree of generalizability. A quick way to quantify that spread is sketched below.
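This one-liner computes the across-fold standard deviation of each metric from the per-fold results that caret stores in cvFit$resample (the same columns plotted above); smaller values indicate more consistent performance:

# Standard deviation of each metric across the 100 folds
apply(dplyr::select(cvFit$resample, -Resample), 2, sd)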
The ROC metric, which measures the model's ability to distinguish between positive and negative cases, shows consistent performance around 0.76.
One caveat on labels: caret's twoClassSummary treats the first factor level, “no”, as the positive class, so the printed Sens column is effectively the true negative rate. That rate performs very well because the dataset is dominated by negative cases. The true positive rate, printed under Spec, varies much more widely across folds, with values ranging from roughly 0.25 to 0.75 in the histograms. This suggests that the model captures some trends in the data but not consistently across all cases; the small number of positive cases makes this metric sensitive to small differences in how many positive cases each fold contains.
Overall, the model performs reasonably well, but there are challenges in achieving consistent sensitivity due to the dataset’s imbalance.
Our model successfully meets its intended objective of reducing false negatives while maintaining an acceptable level of false positives. Achieving this outcome required extensive feature engineering, which fine-tuned our logistic regression model to identify subsidy beneficiaries. Considering the limited number of beneficiaries in the dataset, our model can be deemed effective. This success is also evident in the significant increase in distributed credits and overall public benefits.
As a result, we recommend adopting this model over the current technique of random outreach, and advise that more data be gathered to continue enhancing it.