This aim of this report is to develop a proactive approach for targeting home owners who qualify for a home repair tax credit program. This report will examine whether a model which can make the marketing materials result in a better response rate can be developed based on the currently available data.
In this section, all features in the dataset were visualized to explore the correlation between them and whether homeowners took the credit. Numerical features were visualized as histgrams and distribution plots. Categorical features were visualized as histgrams.
Useful features are those that exhibit significant differences across the homeowners who took the credit (yes) and those who did not (no) outcomes.
Among all numerical features, US inflation rates, # of contacts before this campaign and unemployment rate appears large difference between yes and no.
subsidy %>%
dplyr::select(y,age,previous,unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate,spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot(aes(y, value, fill=y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x="Likelihood of taking credit", y="Value",
title = "Feature associations with the homewoner took the credit",
subtitle = "(continous outcomes)") +
theme(legend.position = "none")
Among all distribution plots, inflation_rate and unemployment_rate’s distribution on ‘no’ is clearly larger than on ‘yes’. This is useful when considering which features should be included in the model as is showed in the next section.
subsidy %>%
dplyr::select(y,age,previous,unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate,spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(title = "Feature distributions take credit vs. not take credit",
subtitle = "(continous outcomes)")
Among all numerical features, most of them shows large difference of ‘no’ between each category except for ‘day of week’. Only a few, such as contact way/ education/ job, show difference on ‘yes’ between each category.
subsidy %>%
dplyr::select(y, job, marital,education,taxLien,mortgage,taxbill_in_phl,contact,month,day_of_week,campaign,pdays,poutcome,) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free",ncol=3) +
scale_fill_manual(values = palette2) +
labs(x="Likelihood of taking credit", y="Value",
title = "Feature associations with the likelihood of taking credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In this section, a model with all features and a model with engineered features.
A logistic regression model was run with dependent variable ‘y’ and independent variables all features. In the output, there are several features are significantly related to ‘y’: contact-telephone, month-mar, unemployed rate, cons.price.idx, cons.conf.idx.
In the 0.5 threshold, the confusion matrix shows the sensitivity of this model is 0.25.
set.seed(3456)
trainIndex <- createDataPartition(y=paste(subsidy$taxLien,subsidy$Model,subsidy$education), p = .65,
list = FALSE,
times = 1)
subsidyTrain <- subsidy[ trainIndex,]
subsidyTest <- subsidy[-trainIndex,]
subsidyModel <- glm(y_numeric ~ .,
data=subsidyTrain %>% dplyr::select(-y),
family="binomial" (link="logit"))
summary(subsidyModel)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = subsidyTrain %>% dplyr::select(-y))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1943 -0.4147 -0.3256 -0.2586 3.0926
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -278.85327760 139.96804153 -1.992 0.046342 *
## X -0.00001331 0.00006015 -0.221 0.824909
## age 0.01403305 0.00873220 1.607 0.108044
## jobblue-collar -0.32797503 0.27163327 -1.207 0.227271
## jobentrepreneur -0.41474972 0.43895099 -0.945 0.344727
## jobhousemaid 0.11787593 0.45568524 0.259 0.795883
## jobmanagement -0.52798701 0.31095067 -1.698 0.089512 .
## jobretired -0.41678128 0.38885628 -1.072 0.283804
## jobself-employed -0.85936919 0.45052464 -1.907 0.056458 .
## jobservices -0.15625210 0.28775184 -0.543 0.587123
## jobstudent -0.12256127 0.41895061 -0.293 0.769871
## jobtechnician -0.34064892 0.24896860 -1.368 0.171237
## jobunemployed 0.00698463 0.43362252 0.016 0.987149
## jobunknown -0.98454063 0.86728103 -1.135 0.256290
## maritalmarried -0.11230091 0.23346500 -0.481 0.630504
## maritalsingle 0.03104307 0.26773677 0.116 0.907695
## maritalunknown 1.60183882 1.23871333 1.293 0.195960
## educationbasic.6y 0.51798429 0.42188129 1.228 0.219523
## educationbasic.9y 0.44289711 0.33865468 1.308 0.190937
## educationhigh.school 0.24226477 0.33033899 0.733 0.463325
## educationilliterate -11.87185886 535.41147938 -0.022 0.982310
## educationprofessional.course 0.49448324 0.36306834 1.362 0.173212
## educationuniversity.degree 0.45266628 0.33045587 1.370 0.170742
## educationunknown 0.53513138 0.42052846 1.273 0.203188
## taxLienunknown -0.01495724 0.21407698 -0.070 0.944298
## taxLienyes -10.11713559 535.41149007 -0.019 0.984924
## mortgageunknown -0.31180939 0.50655531 -0.616 0.538193
## mortgageyes -0.10869431 0.14492166 -0.750 0.453242
## taxbill_in_phlyes 0.00211370 0.19493817 0.011 0.991349
## contacttelephone -1.11254297 0.31141354 -3.573 0.000354 ***
## monthaug 0.57341407 0.46743091 1.227 0.219922
## monthdec -0.25418106 0.85194737 -0.298 0.765434
## monthjul 0.20651187 0.37997601 0.543 0.586795
## monthjun -0.06145945 0.49278182 -0.125 0.900746
## monthmar 2.42531246 0.60735745 3.993 0.0000652 ***
## monthmay 0.17510806 0.32383613 0.541 0.588693
## monthnov -0.27510007 0.44063086 -0.624 0.532409
## monthoct 0.58010443 0.56615565 1.025 0.305534
## monthsep 0.32053540 0.66903917 0.479 0.631869
## day_of_weekmon -0.09048675 0.22277094 -0.406 0.684605
## day_of_weekthu 0.01702850 0.22152945 0.077 0.938729
## day_of_weektue -0.21626831 0.23145659 -0.934 0.350108
## day_of_weekwed -0.01629696 0.23302673 -0.070 0.944245
## campaign -0.06706867 0.03847471 -1.743 0.081301 .
## pdays -0.00048934 0.00080546 -0.608 0.543496
## previous 0.25188999 0.22915027 1.099 0.271666
## poutcomenonexistent 0.54030334 0.35407733 1.526 0.127023
## poutcomesuccess 1.19601053 0.80913495 1.478 0.139372
## unemploy_rate -1.38677195 0.54137687 -2.562 0.010420 *
## cons.price.idx 2.39574823 0.93242048 2.569 0.010188 *
## cons.conf.idx 0.05953345 0.02924320 2.036 0.041770 *
## inflation_rate -0.03039306 0.44901409 -0.068 0.946034
## spent_on_repairs 0.01057464 0.01109433 0.953 0.340510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.5 on 2686 degrees of freedom
## Residual deviance: 1465.4 on 2634 degrees of freedom
## AIC: 1571.4
##
## Number of Fisher Scoring iterations: 12
testProbs <- data.frame(Outcome = as.factor(subsidyTest$y_numeric),
Probs = predict(subsidyModel, subsidyTest, type= "response"))
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1248 114
## 1 32 38
##
## Accuracy : 0.898
## 95% CI : (0.8812, 0.9132)
## No Information Rate : 0.8939
## P-Value [Acc > NIR] : 0.3217
##
## Kappa : 0.2952
##
## Mcnemar's Test P-Value : 0.00000000002033
##
## Sensitivity : 0.25000
## Specificity : 0.97500
## Pos Pred Value : 0.54286
## Neg Pred Value : 0.91630
## Prevalence : 0.10615
## Detection Rate : 0.02654
## Detection Prevalence : 0.04888
## Balanced Accuracy : 0.61250
##
## 'Positive' Class : 1
##
A logistic regression model was run with dependent variable ‘y’ and independent variables engineered new features. According to outcomes before, categorical features ‘month’ and ‘contact’ are significantly related to ‘y’. The variable ‘contact’ is a binary feature, so only month was engineered. ‘mar’ in ‘month’ was kept and all other values were transferred as ‘other’.
There are several other features were excluded to improve the sensitivity of the model. As showed in the data exploration, ‘day_of_week’ shows no correlation with both ‘no’ and ‘yes’. As in this case the sensitivity of the model is much more important than specificity, though inflation rate and unemployment rate seems related to ‘y’ , the larger distribution on ‘no’ than ‘yes’ make these 2 features contribute little to the sensitivity of the model. So these 2 features are also excluded in this model to improve the sensitivity.
The confusion Matrix shows that the sensitivity was improved a little.
subsidy_new <-
mutate(subsidy, new_mon=ifelse(month == 'mar', 'mar', 'other'))
set.seed(3456)
trainIndex <- createDataPartition(y=paste(subsidy_new$taxLien,subsidy$Model,subsidy$education), p = .65,
list = FALSE,
times = 1)
subsidyTrain <- subsidy_new[ trainIndex,]
subsidyTest <- subsidy_new[-trainIndex,]
subsidyModel2 <- glm(y_numeric ~ .,
data=subsidyTrain %>% dplyr::select(-y,-day_of_week, -unemploy_rate, -inflation_rate,-month),
family="binomial" (link="logit"))
summary(subsidyModel2)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = subsidyTrain %>% dplyr::select(-y, -day_of_week, -unemploy_rate,
## -inflation_rate, -month))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4042 -0.4245 -0.3375 -0.2655 2.8305
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 18.385442502 12.394795404 1.483
## X 0.000005457 0.000059144 0.092
## age 0.014084099 0.008645903 1.629
## jobblue-collar -0.328321314 0.267491785 -1.227
## jobentrepreneur -0.416206160 0.431515179 -0.965
## jobhousemaid 0.032307809 0.455801885 0.071
## jobmanagement -0.525733977 0.303860730 -1.730
## jobretired -0.271528807 0.380269765 -0.714
## jobself-employed -0.848469458 0.445118140 -1.906
## jobservices -0.095920345 0.283965404 -0.338
## jobstudent 0.051353159 0.399288331 0.129
## jobtechnician -0.328945276 0.244979713 -1.343
## jobunemployed 0.013433207 0.427546916 0.031
## jobunknown -0.668413773 0.814625685 -0.821
## maritalmarried -0.116959798 0.230111689 -0.508
## maritalsingle 0.068101414 0.263025404 0.259
## maritalunknown 1.398697141 1.211485743 1.155
## educationbasic.6y 0.556756221 0.417777753 1.333
## educationbasic.9y 0.420065179 0.336513129 1.248
## educationhigh.school 0.215436101 0.326359668 0.660
## educationilliterate -11.708924743 535.411405761 -0.022
## educationprofessional.course 0.494074398 0.357294032 1.383
## educationuniversity.degree 0.464662554 0.324158513 1.433
## educationunknown 0.581991997 0.416360931 1.398
## taxLienunknown -0.030758380 0.212102407 -0.145
## taxLienyes -10.589970763 535.411417001 -0.020
## mortgageunknown -0.332695539 0.492142562 -0.676
## mortgageyes -0.135073012 0.142559013 -0.947
## taxbill_in_phlyes -0.052853490 0.190933028 -0.277
## contacttelephone -0.762340912 0.203868556 -3.739
## campaign -0.067315083 0.038413038 -1.752
## pdays -0.000383424 0.000796245 -0.482
## previous 0.255886681 0.230599269 1.110
## poutcomenonexistent 0.576825146 0.351800331 1.640
## poutcomesuccess 1.311396406 0.793617370 1.652
## cons.price.idx 0.322071403 0.131790297 2.444
## cons.conf.idx 0.034349685 0.013641773 2.518
## spent_on_repairs -0.009335729 0.000978786 -9.538
## new_monother -1.682384153 0.458118200 -3.672
## Pr(>|z|)
## (Intercept) 0.137990
## X 0.926485
## age 0.103315
## jobblue-collar 0.219670
## jobentrepreneur 0.334784
## jobhousemaid 0.943492
## jobmanagement 0.083598 .
## jobretired 0.475201
## jobself-employed 0.056629 .
## jobservices 0.735522
## jobstudent 0.897665
## jobtechnician 0.179355
## jobunemployed 0.974935
## jobunknown 0.411922
## maritalmarried 0.611261
## maritalsingle 0.795700
## maritalunknown 0.248283
## educationbasic.6y 0.182643
## educationbasic.9y 0.211926
## educationhigh.school 0.509178
## educationilliterate 0.982552
## educationprofessional.course 0.166719
## educationuniversity.degree 0.151731
## educationunknown 0.162171
## taxLienunknown 0.884698
## taxLienyes 0.984220
## mortgageunknown 0.499031
## mortgageyes 0.343390
## taxbill_in_phlyes 0.781921
## contacttelephone 0.000184 ***
## campaign 0.079705 .
## pdays 0.630133
## previous 0.267146
## poutcomenonexistent 0.101080
## poutcomesuccess 0.098447 .
## cons.price.idx 0.014533 *
## cons.conf.idx 0.011803 *
## spent_on_repairs < 0.0000000000000002 ***
## new_monother 0.000240 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.5 on 2686 degrees of freedom
## Residual deviance: 1497.2 on 2648 degrees of freedom
## AIC: 1575.2
##
## Number of Fisher Scoring iterations: 12
testProbs2 <- data.frame(Outcome = as.factor(subsidyTest$y_numeric),
Probs = predict(subsidyModel2, subsidyTest, type= "response"))
testProbs2 <-
testProbs2 %>%
mutate(predOutcome = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1256 110
## 1 24 42
##
## Accuracy : 0.9064
## 95% CI : (0.8901, 0.921)
## No Information Rate : 0.8939
## P-Value [Acc > NIR] : 0.06469
##
## Kappa : 0.3431
##
## Mcnemar's Test P-Value : 0.000000000000209
##
## Sensitivity : 0.27632
## Specificity : 0.98125
## Pos Pred Value : 0.63636
## Neg Pred Value : 0.91947
## Prevalence : 0.10615
## Detection Rate : 0.02933
## Detection Prevalence : 0.04609
## Balanced Accuracy : 0.62878
##
## 'Positive' Class : 1
##
In this section, both two models were cross validated with 100 folds to test the fit of goodness.
In the results, two models have very similar average AUC(ROC as showed,around 0.76) and Sensitivity (around 0.21), which means the model with new features is no better than the original one. The AUC around 0.76 shows that both models are generally good fit.
subsidy2 <- subsidy %>%
mutate(y_fixed =
case_when(y == "no" ~ 'no',
y == "yes" ~ 'a_yes')) %>%
dplyr::select(-y)
### all features
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit <- train(y_fixed~ .,
data=subsidy2 %>%
dplyr::select(-y_numeric),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 4119 samples
## 20 predictor
## 2 classes: 'a_yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4079, 4078, 4079, 4079, 4077, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7663551 0.217 0.9801652
### new features
subsidy2_new <- subsidy_new %>%
mutate(y_fixed =
case_when(y == "no" ~ 'no',
y == "yes" ~ 'a_yes')) %>%
dplyr::select(-y)
cvFit2 <- train(y_fixed~ .,
data=subsidy2_new %>%
dplyr::select(-y_numeric,-day_of_week, -unemploy_rate, -inflation_rate,-month),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit2
## Generalized Linear Model
##
## 4119 samples
## 17 predictor
## 2 classes: 'a_yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4078, 4077, 4077, 4077, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7616295 0.2155 0.983964
According to the facetted plots, two models have quite similar patterns for all sensitivity, specificity and ROC. The specificity is tight to its mean, which means this model generalizes well with respect to Specificity. It does not generalize as well with respect to Sensitivity.
##all features
dplyr::select(cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
##new features
dplyr::select(cvFit2$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
The ROC curve with 0.78 AUC also shows a generally good fit of the model.
auc(testProbs2$Outcome, testProbs2$Probs)
## Area under the curve: 0.7847
ggplot(testProbs2, aes(d = as.numeric(Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - clickModel")
The cost & benefit was calculated as below:
1.True Positive - Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. For each homeowner: -2850 - 0.25 * 5000 + 0.25 * (10000+56000) = $12400.
2.True Negative - Predicted correctly homeowner would not take the credit;no marketing resources were allocated, and no credit was allocated. For each homeowner: $0.
3.False Positive - Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated. For each homeowner: -$2850
4.False Negative - We predicted that a homeowner would not take the credit but they did. For each homeowner: $0.
cost_benefit_table <-
testProbs2 %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive", (12400) * Count,
ifelse(Variable == "False_Negative", 0 * Count,
ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted taking credit",
"We correctly predicted not taking credit",
"We predicted not taking credit and homeowner did",
"We predicted taking credit and homeowner did not")))
kable(cost_benefit_table,
caption = "Cost/Benefit Table") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Negative | 1256 | 0 | We correctly predicted taking credit |
True_Positive | 42 | 520800 | We correctly predicted not taking credit |
False_Negative | 110 | 0 | We predicted not taking credit and homeowner did |
False_Positive | 24 | -68400 | We predicted taking credit and homeowner did not |
In this section, threshold was iterated to find the one which can bring the largest total revenue.
Below is the plots of the results of confusion matrix count by threshold (the revenue of each confusion matrix type on every threshold).
whichThreshold <-
iterateThresholds(
data=testProbs, observedClass = Outcome, predictedProbs = Probs)
whichThreshold2 <-
whichThreshold %>%
dplyr::select(starts_with("Count"), Threshold) %>%
gather(Variable, Count, -Threshold) %>%
mutate(count =
case_when(Variable == "Count_TN" ~ Count * 1,
Variable == "Count_TP" ~ 1 * Count,
Variable == "Count_FN" ~ 1 * Count,
Variable == "Count_FP" ~ 1 * Count))
whichThreshold <-
whichThreshold %>%
dplyr::select(starts_with("Count"), Threshold) %>%
gather(Variable, Count, -Threshold) %>%
mutate(Revenue =
case_when(Variable == "Count_TN" ~ Count * 0,
Variable == "Count_TP" ~ (12400) * Count,
Variable == "Count_FN" ~ 0 * Count,
Variable == "Count_FP" ~ (-2850) * Count)) %>%
mutate(count =
case_when(Variable == "Count_TN" ~ Count * 0,
Variable == "Count_TP" ~ Count * 1,
Variable == "Count_FN" ~ 0 * Count,
Variable == "Count_FP" ~ 0 * Count))
whichThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5[c(5, 1:3)]) +
labs(title = "Revenue by confusion matrix type and threshold",
y = "Revenue") +
plotTheme() +
guides(colour=guide_legend(title = "Confusion Matrix"))
Below are plots that show threshold as a function of total revenue and total count of credits. Here, as revenue was not counted for ‘false negative’, the total count of credits also excluded homeowners who took the credit but the model failed to predicted.
The plot threshold as a function of total revenue show that the threshold which can bring the largest total revenue is o.15.
whichThreshold_revenue <-
whichThreshold %>%
group_by(Threshold) %>%
summarize(Revenue = sum(Revenue),
total_count = sum(count))
grid.arrange(
ggplot(whichThreshold_revenue)+
geom_line(aes(x = Threshold, y = Revenue))+
geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
labs(title = "Model Revenues By Threshold For Test Sample",
subtitle = "Vertical Line Denotes Optimal Threshold"),
ggplot(whichThreshold_revenue)+
geom_line(aes(x = Threshold, y = total_count))+
labs(title = "Total_credit By Threshold For Test Sample",
subtitle = "Vertical Line Denotes Optimal Threshold"))
compare_table <- whichThreshold_revenue %>%
filter(Threshold == 0.5 | Threshold == 0.15)
kable(compare_table,
caption = "Threshold Compare Table") %>% kable_styling()
Threshold | Revenue | total_count |
---|---|---|
0.15 | 789250 | 91 |
0.50 | 380000 | 38 |
This model can be put into industry. It is generally good fit, the best threshold 0.15 will lead HCD to give allocation and credit to much more homeowners.
There are also possibilities to improve the model, one way is to collect a data set including more related variables. To gain a better data set, social research on what reasons really determine whether people take the credits might be needed. Another way is to include as much features and pick those related through the process did in this report.