Motivation

The aim of this report is to develop a proactive approach for targeting homeowners who qualify for a home repair tax credit program. The report examines whether, based on the currently available data, a model can be developed that targets the marketing materials so they result in a better response rate.

Data Exploration

In this section, all features in the dataset were visualized to explore how they relate to whether homeowners took the credit. Numerical features were visualized as bar plots of group means and as density plots; categorical features were visualized as grouped bar charts.

Useful features are those that show clear differences between homeowners who took the credit ('yes') and those who did not ('no').

Numerical Data

Among the numerical features, the US inflation rate, the number of contacts before this campaign ('previous'), and the unemployment rate show the largest differences between 'yes' and 'no'.
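
A quick numeric check of the group means makes these gaps explicit before plotting. This is a minimal sketch, assuming the subsidy data frame and dplyr are already loaded:

# Mean of each continuous feature by outcome, as a numeric companion to the plots below
subsidy %>%
  dplyr::group_by(y) %>%
  dplyr::summarise(dplyr::across(
    c(age, previous, unemploy_rate, cons.price.idx,
      cons.conf.idx, inflation_rate, spent_on_repairs),
    ~ mean(.x, na.rm = TRUE)))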

subsidy %>%
  dplyr::select(y,age,previous,unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate,spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="Likelihood of taking credit", y="Value", 
           title = "Feature associations with the homewoner took the credit",
           subtitle = "(continous outcomes)") +
      theme(legend.position = "none")

Among the density plots, the 'no' distributions of inflation_rate and unemploy_rate are concentrated at clearly higher values than the 'yes' distributions. This is useful when deciding which features to include in the model, as shown in the next section.

subsidy %>%
    dplyr::select(y,age,previous,unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate,spent_on_repairs) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions: took credit vs. did not take credit",
         subtitle = "(continuous features)")

Categorical Data

Among the categorical features, most show large differences in the 'no' counts across categories, except for 'day_of_week'. Only a few, such as contact method, education, and job, show differences in the 'yes' counts across categories.

subsidy %>%
    dplyr::select(y, job, marital,education,taxLien,mortgage,taxbill_in_phl,contact,month,day_of_week,campaign,pdays,poutcome) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free",ncol=3) +
        scale_fill_manual(values = palette2) +
        labs(x="Likelihood of taking credit", y="Value",
             title = "Feature associations with the likelihood of taking credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Logistic Regression Model

In this section, two models are estimated: one with all features and one with engineered features.

Model with all features

A logistic regression model was run with 'y' as the dependent variable and all features as independent variables. In the output, several features are significantly related to 'y': contact (telephone), month (March), unemploy_rate, cons.price.idx, and cons.conf.idx.

At the 0.5 threshold, the confusion matrix shows that the sensitivity of this model is 0.25.

set.seed(3456)
# 65/35 train/test split, stratified on the pasted grouping variables
trainIndex <- createDataPartition(y=paste(subsidy$taxLien,subsidy$Model,subsidy$education), p = .65,
                                  list = FALSE,
                                  times = 1)
subsidyTrain <- subsidy[ trainIndex,]
subsidyTest  <- subsidy[-trainIndex,]
# Logistic regression on all features (drop the factor version of the outcome)
subsidyModel <- glm(y_numeric ~ .,
                  data=subsidyTrain %>% dplyr::select(-y),
                  family=binomial(link="logit"))

summary(subsidyModel)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = subsidyTrain %>% dplyr::select(-y))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1943  -0.4147  -0.3256  -0.2586   3.0926  
## 
## Coefficients:
##                                   Estimate    Std. Error z value  Pr(>|z|)    
## (Intercept)                  -278.85327760  139.96804153  -1.992  0.046342 *  
## X                              -0.00001331    0.00006015  -0.221  0.824909    
## age                             0.01403305    0.00873220   1.607  0.108044    
## jobblue-collar                 -0.32797503    0.27163327  -1.207  0.227271    
## jobentrepreneur                -0.41474972    0.43895099  -0.945  0.344727    
## jobhousemaid                    0.11787593    0.45568524   0.259  0.795883    
## jobmanagement                  -0.52798701    0.31095067  -1.698  0.089512 .  
## jobretired                     -0.41678128    0.38885628  -1.072  0.283804    
## jobself-employed               -0.85936919    0.45052464  -1.907  0.056458 .  
## jobservices                    -0.15625210    0.28775184  -0.543  0.587123    
## jobstudent                     -0.12256127    0.41895061  -0.293  0.769871    
## jobtechnician                  -0.34064892    0.24896860  -1.368  0.171237    
## jobunemployed                   0.00698463    0.43362252   0.016  0.987149    
## jobunknown                     -0.98454063    0.86728103  -1.135  0.256290    
## maritalmarried                 -0.11230091    0.23346500  -0.481  0.630504    
## maritalsingle                   0.03104307    0.26773677   0.116  0.907695    
## maritalunknown                  1.60183882    1.23871333   1.293  0.195960    
## educationbasic.6y               0.51798429    0.42188129   1.228  0.219523    
## educationbasic.9y               0.44289711    0.33865468   1.308  0.190937    
## educationhigh.school            0.24226477    0.33033899   0.733  0.463325    
## educationilliterate           -11.87185886  535.41147938  -0.022  0.982310    
## educationprofessional.course    0.49448324    0.36306834   1.362  0.173212    
## educationuniversity.degree      0.45266628    0.33045587   1.370  0.170742    
## educationunknown                0.53513138    0.42052846   1.273  0.203188    
## taxLienunknown                 -0.01495724    0.21407698  -0.070  0.944298    
## taxLienyes                    -10.11713559  535.41149007  -0.019  0.984924    
## mortgageunknown                -0.31180939    0.50655531  -0.616  0.538193    
## mortgageyes                    -0.10869431    0.14492166  -0.750  0.453242    
## taxbill_in_phlyes               0.00211370    0.19493817   0.011  0.991349    
## contacttelephone               -1.11254297    0.31141354  -3.573  0.000354 ***
## monthaug                        0.57341407    0.46743091   1.227  0.219922    
## monthdec                       -0.25418106    0.85194737  -0.298  0.765434    
## monthjul                        0.20651187    0.37997601   0.543  0.586795    
## monthjun                       -0.06145945    0.49278182  -0.125  0.900746    
## monthmar                        2.42531246    0.60735745   3.993 0.0000652 ***
## monthmay                        0.17510806    0.32383613   0.541  0.588693    
## monthnov                       -0.27510007    0.44063086  -0.624  0.532409    
## monthoct                        0.58010443    0.56615565   1.025  0.305534    
## monthsep                        0.32053540    0.66903917   0.479  0.631869    
## day_of_weekmon                 -0.09048675    0.22277094  -0.406  0.684605    
## day_of_weekthu                  0.01702850    0.22152945   0.077  0.938729    
## day_of_weektue                 -0.21626831    0.23145659  -0.934  0.350108    
## day_of_weekwed                 -0.01629696    0.23302673  -0.070  0.944245    
## campaign                       -0.06706867    0.03847471  -1.743  0.081301 .  
## pdays                          -0.00048934    0.00080546  -0.608  0.543496    
## previous                        0.25188999    0.22915027   1.099  0.271666    
## poutcomenonexistent             0.54030334    0.35407733   1.526  0.127023    
## poutcomesuccess                 1.19601053    0.80913495   1.478  0.139372    
## unemploy_rate                  -1.38677195    0.54137687  -2.562  0.010420 *  
## cons.price.idx                  2.39574823    0.93242048   2.569  0.010188 *  
## cons.conf.idx                   0.05953345    0.02924320   2.036  0.041770 *  
## inflation_rate                 -0.03039306    0.44901409  -0.068  0.946034    
## spent_on_repairs                0.01057464    0.01109433   0.953  0.340510    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1876.5  on 2686  degrees of freedom
## Residual deviance: 1465.4  on 2634  degrees of freedom
## AIC: 1571.4
## 
## Number of Fisher Scoring iterations: 12
testProbs <- data.frame(Outcome = as.factor(subsidyTest$y_numeric),
                        Probs = predict(subsidyModel, subsidyTest, type= "response"))
testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1248  114
##          1   32   38
##                                           
##                Accuracy : 0.898           
##                  95% CI : (0.8812, 0.9132)
##     No Information Rate : 0.8939          
##     P-Value [Acc > NIR] : 0.3217          
##                                           
##                   Kappa : 0.2952          
##                                           
##  Mcnemar's Test P-Value : 0.00000000002033
##                                           
##             Sensitivity : 0.25000         
##             Specificity : 0.97500         
##          Pos Pred Value : 0.54286         
##          Neg Pred Value : 0.91630         
##              Prevalence : 0.10615         
##          Detection Rate : 0.02654         
##    Detection Prevalence : 0.04888         
##       Balanced Accuracy : 0.61250         
##                                           
##        'Positive' Class : 1               
## 
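
As a sanity check on the reported rates, sensitivity and specificity can be recomputed directly from the matrix counts above (a minimal sketch using those counts):

# Recompute the headline rates from the confusion matrix above
TP <- 38; FN <- 114; TN <- 1248; FP <- 32
TP / (TP + FN)   # sensitivity = 0.25: share of actual credit-takers the model identifies
TN / (TN + FP)   # specificity = 0.975: share of actual non-takers the model identifies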

Model with new features

A logistic regression model was run with 'y' as the dependent variable and engineered features as independent variables. According to the previous output, the categorical features 'month' and 'contact' are significantly related to 'y'. 'contact' is already a binary feature, so only 'month' was engineered: the value 'mar' was kept and all other months were recoded as 'other'.

Several other features were excluded to improve the sensitivity of the model. As shown in the data exploration, 'day_of_week' shows no association with either outcome. Because sensitivity matters much more than specificity in this use case, and because the 'no' distributions of inflation_rate and unemploy_rate dominate the 'yes' distributions (so the two features contribute little to sensitivity even though they appear related to 'y'), these two features were also excluded.

The confusion matrix shows that sensitivity improved slightly (from 0.25 to 0.28).

# Engineer 'month': keep 'mar' and collapse every other month into 'other'
subsidy_new <- 
  mutate(subsidy, new_mon=ifelse(month == 'mar', 'mar', 'other'))

set.seed(3456)
trainIndex <- createDataPartition(y=paste(subsidy_new$taxLien,subsidy_new$Model,subsidy_new$education), p = .65,
                                  list = FALSE,
                                  times = 1)
subsidyTrain <- subsidy_new[ trainIndex,]
subsidyTest  <- subsidy_new[-trainIndex,]


# Logistic regression on the reduced, engineered feature set
subsidyModel2 <- glm(y_numeric ~ .,
                  data=subsidyTrain %>% dplyr::select(-y,-day_of_week, -unemploy_rate, -inflation_rate,-month),
                  family=binomial(link="logit"))

summary(subsidyModel2)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = subsidyTrain %>% dplyr::select(-y, -day_of_week, -unemploy_rate, 
##         -inflation_rate, -month))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4042  -0.4245  -0.3375  -0.2655   2.8305  
## 
## Coefficients:
##                                   Estimate    Std. Error z value
## (Intercept)                   18.385442502  12.394795404   1.483
## X                              0.000005457   0.000059144   0.092
## age                            0.014084099   0.008645903   1.629
## jobblue-collar                -0.328321314   0.267491785  -1.227
## jobentrepreneur               -0.416206160   0.431515179  -0.965
## jobhousemaid                   0.032307809   0.455801885   0.071
## jobmanagement                 -0.525733977   0.303860730  -1.730
## jobretired                    -0.271528807   0.380269765  -0.714
## jobself-employed              -0.848469458   0.445118140  -1.906
## jobservices                   -0.095920345   0.283965404  -0.338
## jobstudent                     0.051353159   0.399288331   0.129
## jobtechnician                 -0.328945276   0.244979713  -1.343
## jobunemployed                  0.013433207   0.427546916   0.031
## jobunknown                    -0.668413773   0.814625685  -0.821
## maritalmarried                -0.116959798   0.230111689  -0.508
## maritalsingle                  0.068101414   0.263025404   0.259
## maritalunknown                 1.398697141   1.211485743   1.155
## educationbasic.6y              0.556756221   0.417777753   1.333
## educationbasic.9y              0.420065179   0.336513129   1.248
## educationhigh.school           0.215436101   0.326359668   0.660
## educationilliterate          -11.708924743 535.411405761  -0.022
## educationprofessional.course   0.494074398   0.357294032   1.383
## educationuniversity.degree     0.464662554   0.324158513   1.433
## educationunknown               0.581991997   0.416360931   1.398
## taxLienunknown                -0.030758380   0.212102407  -0.145
## taxLienyes                   -10.589970763 535.411417001  -0.020
## mortgageunknown               -0.332695539   0.492142562  -0.676
## mortgageyes                   -0.135073012   0.142559013  -0.947
## taxbill_in_phlyes             -0.052853490   0.190933028  -0.277
## contacttelephone              -0.762340912   0.203868556  -3.739
## campaign                      -0.067315083   0.038413038  -1.752
## pdays                         -0.000383424   0.000796245  -0.482
## previous                       0.255886681   0.230599269   1.110
## poutcomenonexistent            0.576825146   0.351800331   1.640
## poutcomesuccess                1.311396406   0.793617370   1.652
## cons.price.idx                 0.322071403   0.131790297   2.444
## cons.conf.idx                  0.034349685   0.013641773   2.518
## spent_on_repairs              -0.009335729   0.000978786  -9.538
## new_monother                  -1.682384153   0.458118200  -3.672
##                                          Pr(>|z|)    
## (Intercept)                              0.137990    
## X                                        0.926485    
## age                                      0.103315    
## jobblue-collar                           0.219670    
## jobentrepreneur                          0.334784    
## jobhousemaid                             0.943492    
## jobmanagement                            0.083598 .  
## jobretired                               0.475201    
## jobself-employed                         0.056629 .  
## jobservices                              0.735522    
## jobstudent                               0.897665    
## jobtechnician                            0.179355    
## jobunemployed                            0.974935    
## jobunknown                               0.411922    
## maritalmarried                           0.611261    
## maritalsingle                            0.795700    
## maritalunknown                           0.248283    
## educationbasic.6y                        0.182643    
## educationbasic.9y                        0.211926    
## educationhigh.school                     0.509178    
## educationilliterate                      0.982552    
## educationprofessional.course             0.166719    
## educationuniversity.degree               0.151731    
## educationunknown                         0.162171    
## taxLienunknown                           0.884698    
## taxLienyes                               0.984220    
## mortgageunknown                          0.499031    
## mortgageyes                              0.343390    
## taxbill_in_phlyes                        0.781921    
## contacttelephone                         0.000184 ***
## campaign                                 0.079705 .  
## pdays                                    0.630133    
## previous                                 0.267146    
## poutcomenonexistent                      0.101080    
## poutcomesuccess                          0.098447 .  
## cons.price.idx                           0.014533 *  
## cons.conf.idx                            0.011803 *  
## spent_on_repairs             < 0.0000000000000002 ***
## new_monother                             0.000240 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1876.5  on 2686  degrees of freedom
## Residual deviance: 1497.2  on 2648  degrees of freedom
## AIC: 1575.2
## 
## Number of Fisher Scoring iterations: 12
testProbs2 <- data.frame(Outcome = as.factor(subsidyTest$y_numeric),
                        Probs = predict(subsidyModel2, subsidyTest, type= "response"))
testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1256  110
##          1   24   42
##                                            
##                Accuracy : 0.9064           
##                  95% CI : (0.8901, 0.921)  
##     No Information Rate : 0.8939           
##     P-Value [Acc > NIR] : 0.06469          
##                                            
##                   Kappa : 0.3431           
##                                            
##  Mcnemar's Test P-Value : 0.000000000000209
##                                            
##             Sensitivity : 0.27632          
##             Specificity : 0.98125          
##          Pos Pred Value : 0.63636          
##          Neg Pred Value : 0.91947          
##              Prevalence : 0.10615          
##          Detection Rate : 0.02933          
##    Detection Prevalence : 0.04609          
##       Balanced Accuracy : 0.62878          
##                                            
##        'Positive' Class : 1                
## 

Cross validation and ROC

In this section, both models were cross-validated with 100 folds to test goodness of fit.

In the results, the two models have very similar average AUC (the ROC column, around 0.76) and sensitivity (around 0.21), which means the model with new features is no better than the original one. An AUC of around 0.76 indicates that both models are a generally good fit.

# Recode the outcome so 'a_yes' sorts first and is treated as the positive class
subsidy2 <- subsidy %>% 
  mutate(y_fixed = 
    case_when(y == "no" ~ 'no',
              y == "yes" ~ 'a_yes')) %>% 
  dplyr::select(-y)
### all features
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(y_fixed~ .,
                  data=subsidy2 %>% 
                    dplyr::select(-y_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit
## Generalized Linear Model 
## 
## 4119 samples
##   20 predictor
##    2 classes: 'a_yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4079, 4078, 4079, 4079, 4077, ... 
## Resampling results:
## 
##   ROC        Sens   Spec     
##   0.7663551  0.217  0.9801652
### new features

subsidy2_new <- subsidy_new %>% 
  mutate(y_fixed = 
    case_when(y == "no" ~ 'no',
              y == "yes" ~ 'a_yes')) %>% 
  dplyr::select(-y)

cvFit2 <- train(y_fixed~ .,
                  data=subsidy2_new %>% 
                    dplyr::select(-y_numeric,-day_of_week, -unemploy_rate, -inflation_rate,-month), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit2
## Generalized Linear Model 
## 
## 4119 samples
##   17 predictor
##    2 classes: 'a_yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4078, 4077, 4077, 4077, 4078, ... 
## Resampling results:
## 
##   ROC        Sens    Spec    
##   0.7616295  0.2155  0.983964
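
Beyond the facetted histograms below, the fold-level results of the two fits can also be compared side by side with caret's resamples() helper. A minimal sketch, assuming the cvFit and cvFit2 objects above (both share the same 100-fold trControl, which resamples() expects):

# Fold-by-fold comparison of the two cross-validated models
resamps <- caret::resamples(list(all_features = cvFit, new_features = cvFit2))
summary(resamps)                   # min/median/max of ROC, Sens, Spec across folds
bwplot(resamps, metric = "ROC")    # lattice box plot of fold-level AUC per model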

According to the facetted plots, the two models show quite similar patterns across sensitivity, specificity, and ROC. Specificity is tightly clustered around its mean, which means the models generalize well with respect to specificity; they do not generalize as well with respect to sensitivity.

##all features
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

##new features
dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

ROC for model with new features

The ROC curve, with an AUC of 0.78, also indicates a generally good fit for the model.

auc(testProbs2$Outcome, testProbs2$Probs)
## Area under the curve: 0.7847
ggplot(testProbs2, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - clickModel")

Cost & benefit analysis

Cost & benefit table at 0.5 threshold

The cost & benefit was calculated as below:

1. True Positive - Predicted correctly that a homeowner would enter the credit program; marketing resources were allocated, and 25% of these homeowners ultimately took the credit. For each homeowner: -2850 - 0.25 * 5000 + 0.25 * (10000 + 56000) = $12,400 (see the sketch after this list).

2. True Negative - Predicted correctly that a homeowner would not take the credit; no marketing resources and no credit were allocated. For each homeowner: $0.

3. False Positive - Predicted incorrectly that a homeowner would take the credit; marketing resources were allocated, but no credit was taken. For each homeowner: -$2,850.

4. False Negative - Predicted that a homeowner would not take the credit, but they did; no marketing resources were allocated. For each homeowner: $0.
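
For reference, the true-positive payoff in item 1 can be decomposed as in the sketch below. The labels given to the cost components (marketing cost, credit amount, house value premiums) are assumptions inferred from the figures in the list, not values documented elsewhere in the data:

# Hypothetical breakdown of the per-homeowner true-positive payoff
marketing_cost <- 2850             # assumed outreach cost per predicted credit-taker
credit_amount  <- 5000             # assumed credit paid to the 25% who take it
value_premium  <- 10000 + 56000    # assumed home plus neighborhood value premium
take_rate      <- 0.25
-marketing_cost - take_rate * credit_amount + take_rate * value_premium
# [1] 12400, matching the figure used in the cost/benefit code below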

cost_benefit_table <-
   testProbs2 %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", (12400) * Count,
               ifelse(Variable == "False_Negative", 0 * Count,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted not taking credit",
              "We correctly predicted taking credit",
              "We predicted not taking credit and homeowner did",
              "We predicted taking credit and homeowner did not")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table") %>% kable_styling()
Cost/Benefit Table
Variable         Count   Revenue   Description
True_Negative     1256         0   We correctly predicted not taking credit
True_Positive       42    520800   We correctly predicted taking credit
False_Negative     110         0   We predicted not taking credit and homeowner did
False_Positive      24    -68400   We predicted taking credit and homeowner did not

Find best threshold

In this section, thresholds were iterated over to find the one that yields the largest total revenue.
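
iterateThresholds() is assumed to be a helper function defined earlier in the workflow; conceptually it tabulates the confusion matrix at every candidate threshold, roughly as in this simplified stand-in (an illustrative sketch, not the exact helper):

# Simplified sketch of what iterateThresholds() computes
iterate_thresholds_sketch <- function(probs, observed, thresholds = seq(0.01, 1, 0.01)) {
  purrr::map_dfr(thresholds, function(t) {
    pred <- ifelse(probs > t, 1, 0)
    data.frame(Threshold = t,
               Count_TP  = sum(pred == 1 & observed == 1),
               Count_FP  = sum(pred == 1 & observed == 0),
               Count_TN  = sum(pred == 0 & observed == 0),
               Count_FN  = sum(pred == 0 & observed == 1))
  })
}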

Below is a plot of the revenue contributed by each confusion matrix type at every threshold.

# Tabulate confusion matrix counts at every threshold, using the test-set probabilities
whichThreshold <- 
  iterateThresholds(
     data=testProbs, observedClass = Outcome, predictedProbs = Probs)

# Raw counts of each confusion matrix type at every threshold (kept for reference)
whichThreshold2 <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(count =
             case_when(Variable == "Count_TN"  ~ Count * 1,
                       Variable == "Count_TP"  ~ 1 * Count,
                       Variable == "Count_FN"  ~ 1 * Count,
                       Variable == "Count_FP"  ~ 1 * Count))
# Revenue per confusion matrix type, plus a 'count' column that tallies only
# true positives (the credits actually delivered through this targeting)
whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
             case_when(Variable == "Count_TN"  ~ Count * 0,
                       Variable == "Count_TP"  ~ (12400) * Count,
                       Variable == "Count_FN"  ~ 0 * Count,
                       Variable == "Count_FP"  ~ (-2850) * Count)) %>% 
    mutate(count =
             case_when(Variable == "Count_TN"  ~ Count * 0,
                       Variable == "Count_TP"  ~ Count * 1,
                       Variable == "Count_FN"  ~ 0 * Count,
                       Variable == "Count_FP"  ~ 0 * Count))

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

Below are plots of total revenue and of the total count of credits as functions of the threshold. Since no revenue is counted for false negatives, the total count of credits also excludes homeowners who took the credit but whom the model failed to predict.

The plot of total revenue by threshold shows that the threshold which brings the largest total revenue is 0.15.

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue),
              total_count = sum(count))

grid.arrange(
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Revenue))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
    labs(title = "Model Revenues By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold"),
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = total_count))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
    labs(title = "Total Credits By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold"))

compare_table <- whichThreshold_revenue %>%
  filter(Threshold == 0.5 | Threshold == 0.15)
  
kable(compare_table,
       caption = "Threshold Comparison Table") %>% kable_styling()
Threshold Comparison Table
Threshold   Revenue   total_count
     0.15    789250            91
     0.50    380000            38
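
The optimal threshold and its gain over the default can also be pulled out programmatically. A minimal sketch, assuming the whichThreshold_revenue table above:

# Revenue-maximizing threshold versus the default 0.5 threshold
best <- whichThreshold_revenue %>% dplyr::slice_max(Revenue, n = 1)
best$Threshold    # 0.15 in this run
best$Revenue - whichThreshold_revenue$Revenue[whichThreshold_revenue$Threshold == 0.50]
# about 409250: the additional revenue over thresholding at 0.5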

Conclusion

This model can be put into practice. It is a generally good fit, and the optimal threshold of 0.15 would lead HCD to allocate marketing resources and credits to many more homeowners.

There are also ways to improve the model. One is to collect a dataset with more relevant variables; building such a dataset may require social research into what actually determines whether people take the credit. Another is to include as many features as possible and then select the relevant ones through the process used in this report.