Analyzing Spending Habits With Logistic Regression

R
Exploratory Data Analysis
Logistic Regression
Shiny
Exploring indicators of spending habit behaviors using logistic regression.
Author

Ashley Russell

1 Introduction

This project investigates how various character traits influence spending behaviors in adults (18+). Using data from the National Financial Well-Being Survey (collected by the Consumer Financial Protection Bureau), I explored how these factors impact the likelihood of exhibiting “good” or “bad” spending habits across different age groups.

1.1 Rationale

I found this an interesting topic as it prompted me to reflect on my own spending habits.

1.2 About the Data

The data was collected from a diverse sample of adults (18 and older) across all 50 U.S. states and the District of Columbia, conducted between October 27 and December 5, 2016. The sample included 6,394 individuals (5,295 from the general population and 999 from an oversample of adults aged 62 and over). This dataset is randomized and aims to be representative of the adult population.

2 Data Wrangling

Preliminary steps included loading the necessary libraries, importing the dataset, and extracting the variables I wanted from it into a data frame.

Libraries Used (expand to view code)
library(tidyverse)
library(knitr)
library(ggformula)
library(GGally)
library(dplyr)
library(mosaic)
library(broom)
library(RColorBrewer)
library(ggplot2)
library(wacolors)
library(MASS)
library(car)
library(tidymodels)
library(naniar)
library(olsrr)
library(stringr)
Data Importation (expand to view code)
finance_data <- read.csv("data/NFWBS_PUF_2016_data.csv")
Variable Extraction (expand to view code)
finance <- data.frame(
  follow_commitment = finance_data$ACT1_1,
  frugality = finance_data$FRUGALITY,
  worded_probability = finance_data$SUBNUMERACY2,
  percentage_skill = finance_data$SUBNUMERACY1,
  goal_confidence = finance_data$GOALCONF,
  admire_luxury = finance_data$MATERIALISM_1,
  self_worth = finance_data$MATERIALISM_2,
  impress_people = finance_data$MATERIALISM_3,
  psych = finance_data$CONNECT,
  distress = finance_data$DISTRESS,
  impulsivity = finance_data$SELFCONTROL_1,
  resist_temptation = finance_data$SELFCONTROL_2,
  long_term_goals = finance_data$SELFCONTROL_3,
  economic_mobility = finance_data$PEM,
  spending_habit = finance_data$FS1_6,
  age_group = finance_data$agecat,
  wellbeing = finance_data$FWBscore)

The various age groups were using ordinal encoding, so I renamed them to reflect the ages they represented (e.g. 18-24 instead of 1, 25-34 instead of 2 and so forth).

Renaming Age Groups (expand to view code)
finance <- finance |>
 mutate(
   age_group = case_when(
   age_group == 1 ~ "18-24",
   age_group == 2 ~ "25-34",
   age_group == 3 ~ "35-44",
   age_group == 4 ~ "45-54",
   age_group == 5 ~ "55-61",
   age_group == 6 ~ "62-69",
   age_group == 7 ~ "70-74",
   age_group == 8 ~ "75+")
   )

I had two primary goals—to explore the data using graphs and to understand the relationship between our response variable and the predictor variables via statistical modeling, which would require different approaches (such as creating a category for missing values vs imputing them). Consequently, I made two copies of the data frame. As the names suggest, finance_analysis was used for the exploratory data analysis and finance_modeling for the modeling portion of this project.

Duplicating Data Frames (expand to view code)
finance_analysis <- finance
finance_modeling <- finance

Here’s a preview of the first 6 rows in the data frame:

Previewing Data (expand to view code)
head(finance) |> kable()
follow_commitment frugality worded_probability percentage_skill goal_confidence admire_luxury self_worth impress_people psych distress impulsivity resist_temptation long_term_goals economic_mobility spending_habit age_group wellbeing
4 6 3 3 3 3 5 4 80 4 2 3 3 3 3 75+ 55
4 5 5 5 3 3 4 3 95 3 2 4 3 4 3 35-44 51
3 5 4 4 3 4 4 3 50 2 3 3 3 6 3 35-44 49
-1 6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 3 35-44 49
3 4 4 4 3 3 3 3 0 3 3 3 3 4 3 25-34 49
5 5 5 6 4 3 3 2 80 4 1 3 3 7 4 25-34 67

3 Exploring Missing Responses

3.1 Counting Refused Questions

Each column, except for wellbeing, represents a question answered by a respondent (wellbeing is a score given to the respondent based on answers to particular questions). “-1” means that the respondent refused to answer the question. I opted to count the number of questions (out of a total of 16) unanswered by each respondent and calculated a “refusal rate” for enhanced readability (“this person refused to answer 75% of the included questions” sounds nicer than “this person refused to answer 12 of the included questions”).

This led to the creation of two additional columns: refused_questions for the raw number and refusal_rate for the percentage.

Counting Refusals (expand to view code)
finance_analysis <- finance_analysis |>
  mutate(
    refused_questions = rowSums(across(-c(wellbeing), ~ .x == -1))
    )
Calculating the Refusal Rate (expand to view code)
#I counted the number of (relevant) columns to avoid hard coding 
num_columns <- ncol(finance_analysis)
excluded_columns <- list("wellbeing", "refused_questions")
num_excluded <- length(excluded_columns)
sum_questions <- num_columns - num_excluded

finance_analysis <- finance_analysis |>
  mutate(
    refusal_rate = round((refused_questions / sum_questions) * 100,
                         2)
  )

3.2 Encoding Missing Responses

Every column except for age_group, refused_questions, and refusal_rate contained at least one -1 value.

Identifying Columns with Refusals (expand to view code)
contains_refusals <- finance_analysis |>
  summarise(across(everything(), ~ any(. < 0, na.rm = TRUE))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "has_refusal")

contains_refusals |> kable()
variable has_refusal
follow_commitment TRUE
frugality TRUE
worded_probability TRUE
percentage_skill TRUE
goal_confidence TRUE
admire_luxury TRUE
self_worth TRUE
impress_people TRUE
psych TRUE
distress TRUE
impulsivity TRUE
resist_temptation TRUE
long_term_goals TRUE
economic_mobility TRUE
spending_habit TRUE
age_group FALSE
wellbeing TRUE
refused_questions FALSE
refusal_rate FALSE

I handled this in multiple ways:

  • I replaced -1 with “refused” for the categorical variables, treating it as a separate category for data visualization purposes.

    Renaming Refusals (expand to view code)
    finance_analysis <- finance_analysis |>
      mutate(
        across(-c(psych, wellbeing), ~ ifelse(. == -1, "refused", .))
        )
  • Since psych and wellbeing are quantitative variables, I replaced -1 values with numbers 10% below their minimums to preserve visual consistency on graphs. Note that wellbeing also contained a -4 value, meaning “response not written to database”. Since there was only one instance of this, I used a value 30% lower than the minimum to replace it.

    Calculating Minimum and Replacement Values (expand to view code)
    psych_min <- min(finance_analysis$psych)
    psych_refuse <- psych_min - ((10/100) * psych_min)
    
    wellbeing_min <- min(finance_analysis$wellbeing)
    wellbeing_refuse <- wellbeing_min - ((10/100) * wellbeing_min)
    wellbeing_unwritten <- wellbeing_min - ((30/100) * wellbeing_min)
    Re-encoding Refusals for Quantitative Variables (expand to view code)
    finance_analysis <- finance_analysis |>
      mutate(
        psych = if_else(psych == -1, psych_refuse, psych),
        wellbeing = case_when(
          wellbeing == -4 ~ wellbeing_unwritten,
          wellbeing == -1 ~ wellbeing_refuse,
          TRUE ~ wellbeing)
        )

3.3 Wrapping Up

To wrap things up, I factored the categorical variables so that responses would be grouped accordingly on graphs.

Factoring Categorical Variables (expand to view code)
finance_analysis <- finance_analysis |>
  mutate(
    spending_habit = factor(spending_habit),
    follow_commitment = factor(follow_commitment),
    frugality = factor(frugality),
    worded_probability = factor(worded_probability),
    percentage_skill = factor(percentage_skill),
    goal_confidence = factor(goal_confidence),
    admire_luxury = factor(admire_luxury),
    self_worth = factor(self_worth),
    impress_people = factor(impress_people),
    distress = factor(distress),
    impulsivity = factor(impulsivity),
    resist_temptation = factor(resist_temptation),
    long_term_goals = factor(long_term_goals),
    economic_mobility = factor(economic_mobility),
    age_group = factor(age_group)
  )

Lastly, I saved the modified data frame as an .rds file to use it in the Shiny apps I created for visualization.

Saving Data Frame (expand to view code)
saveRDS(finance_analysis, file = "shiny-apps/data/finance_analysis.rds")

4 Data Visualization

4.1 Visual Overview: Response Variable

Most responses in our outcome variable fell in group 4, meaning respondents generally answered “very well” to the prompt “I know how to keep myself from spending too much.” Group 1 (“not at all”) had the fewest responses.

If we were to reimagine this variable as binary, i.e., grouping responses 1–3 as “lower spending control” and responses 4–5 as “higher spending control”, the majority of the responses would fall into the latter category.

Spending Habit Bar Chart (expand to view code)
spending_univariate <- ggplot(finance_analysis, aes(x = spending_habit)) +
        geom_bar(fill = "#412d5e", alpha = 0.9) +
        labs(
          title = "Distribution of Spending Habit Scores",
          x = "Scores (Ranging From '1: Not At All' to '5: Completely')",
          y = "Number of Respondents",
          caption = "Prompt: 'I know how to keep myself from spending too much.'") +
        theme_minimal() +
        theme(
          plot.caption = element_text(hjust = 0, size = 10),
          axis.text.x = element_text(angle = 45, hjust = 1)
          )

spending_univariate

4.2 Bivariate Patterns

Both frugality and follow_commitment show a positive trend: as spending_habit scores increase, the corresponding predictor scores also increase.

This relationship is visually evident in the increasing density of points in the upper right areas of both plots, suggesting that respondents with higher reported spending habit scores also report higher frugality and stronger follow-through on commitments.

Spending Habit & Frugality/Following Commitments Jitter Plots (expand to view code)
frugality_bivariate <- ggplot(finance_analysis, aes(x = frugality, 
                                                    y = spending_habit, 
                                                    color = spending_habit)) + 
  geom_jitter() + 
  labs(
    title = "Relationship between spending_habit & frugality",
    x = "Frugality Score",
    y = "Spending Habit Score",
    subtitle = "'I know how to keep myself from spending too much' & 'If I can reuse an item I already have, there's \nno sense in buying something new'",
    caption = "* 'Spending Habits' scores range from 'not at all' (1) to 'completely' (5) \n* 'Frugality' scores range from 'strongly disagree' (1) to 'strongly agree' (6)") +
  theme_minimal() +
  scale_color_wa_d("stuart") +
  theme(
    plot.caption = element_text(hjust = 0, size = 10),
    plot.subtitle = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

commitment_bivariate <- ggplot(finance_analysis, aes(x = follow_commitment, 
                                             y = spending_habit, 
                                             color = spending_habit)) + 
  geom_jitter() + 
  labs(
    title = "Relationship between spending_habit & follow_commitment",
    x = "Following Commitment Score",
    y = "Spending Habit Score",
    subtitle = "'I know how to keep myself from spending too much' & 'I follow-through on my financial \ncommitments to others'",
    caption = "* 'Spending Habits' scores range from 'not at all' (1) to 'completely' (5) \n* 'Follow Commitment' scores range from 'not at all' (1) to 'completely' (5)") +
  theme_minimal() +
  scale_color_wa_d("stuart") +
  theme(
    plot.caption = element_text(hjust = 0, size = 10),
    plot.subtitle = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

frugality_bivariate
commitment_bivariate

Both impress_people and impulsivity display the opposite pattern: a negative relationship with spending_habit. As spending_habit scores increase, these predictor scores decrease, which can be seen in the decreasing density of points across the x-axis.

This suggests that respondents with higher reported spending habit scores tend to report lower impulsivity and desire to impress others.

Spending Habit & Impressing People/Impulsivity Jitter Plots (expand to view code)
impress_bivariate <- ggplot(finance_analysis, aes(x = impress_people, 
                                                    y = spending_habit, 
                                                    color = spending_habit)) + 
  geom_jitter() + 
  labs(
    title = "Relationship between spending_habit & impress_people",
    x = "Impressing People Score",
    y = "Spending Habit Score",
    subtitle = "'I know how to keep myself from spending too much' & 'I like to own things that impress people'",
    caption = "* 'Spending Habits' scores range from 'not at all' (1) to 'completely' (5) \n* 'Impress People' scores range from 'strongly disagree' (1) to 'strongly agree' (5)") +
  theme_minimal() +
  scale_color_wa_d("stuart") +
  theme(
    plot.caption = element_text(hjust = 0, size = 10),
    plot.subtitle = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

impulsivity_bivariate <- ggplot(finance_analysis, aes(x = impulsivity, 
                                             y = spending_habit, 
                                             color = spending_habit)) + 
  geom_jitter() + 
  labs(
    title = "Relationship between spending_habit & impulsivity",
    x = "Impulsivity Score",
    y = "Spending Habit Score",
    subtitle = "'I know how to keep myself from spending too much' & 'I often act without thinking \nthrough all the alternatives'",
    caption = "* 'Spending Habits' scores range from 'not at all' (1) to 'completely' (5) \n* 'Impulsivity' scores range from 'not at all' (1) to 'completely well' (4)") +
  theme_minimal() +
  scale_color_wa_d("stuart") +
  theme(
    plot.caption = element_text(hjust = 0, size = 10),
    plot.subtitle = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

impress_bivariate
impulsivity_bivariate

5 Interactive Visual Overviews

5.1 Data Dictionary & Univariate Plots

This interactive app includes a data dictionary and visual summaries of how responses are distributed across each variable.

5.1.1 Calculating Highlighted Points for Density Plots

In two of the distribution graphs (the density plots for the quantitative predictors wellbeing and psych), specific points are highlighted to represent refused or unrecorded responses. These were plotted using density() to estimate the distribution of the variable and approx() to get the approximate y-value at the specified points.

Calculating Highlighted Points (expand to view code)
psych_dens <- density(finance_analysis$psych, na.rm = TRUE)
psych_r_dens <- approx(psych_dens$x, psych_dens$y, xout = psych_refuse)$y

wellbeing_dens <- density(finance_analysis$wellbeing, na.rm = TRUE)
wellbeing_r_dens <- approx(wellbeing_dens$x, wellbeing_dens$y, xout = wellbeing_refuse)$y
wellbeing_m_dens <- approx(wellbeing_dens$x, wellbeing_dens$y, xout = wellbeing_unwritten)$y

5.2 Bivariate Plots

This interactive app visually explores the relationship between each predictor and the response variable.

6 Model Management

6.1 Preparing Data for Modeling

As was done before, I created a new column to hold the number of refused questions per observation.

Counting Refusals (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
    refused_questions = rowSums(across(-c(wellbeing), ~ .x == -1))
    )

To prepare the missing (refused/unrecorded) values for imputation, I re-encoded them as NA.

Re-encoding Missing Values as NA (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
    across(-all_of("refused_questions"), ~ ifelse(. %in% c(-1, -4), NA, .))
  )

I used the median to impute missing values in the categorical variables, since, despite being categorical, they were also ordinal (i.e., there was a meaningful order between categories).

Imputing Missing Values in Qualitative Variables (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
      across(-c("psych", "wellbeing", "refused_questions", "age_group"), impute_median)
      )

I used histograms to determine how to approach imputation for the quantitative variables. Since psych was negatively skewed and wellbeing was approximately normally distributed, I imputed missing values using the median for psych and the mean for wellbeing.

Exploring Distribution of Quantitative Variables (expand to view code)
gf_histogram(~psych, data = finance_modeling, fill = "#412d5e", alpha = 0.9) +
  theme_minimal()
gf_histogram(~wellbeing, data = finance_modeling, fill = "#412d5e", alpha = 0.9) +
  theme_minimal()

Imputing Missing Values in Quantitative Variables (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
      psych = impute_median(psych)
      )

finance_modeling <- finance_modeling |>
  mutate(
      wellbeing = impute_mean(wellbeing)
      )

Since this is a binary logistic regression, I created a new variable representing a binary version of the original response.

Category 1 includes responses 4 and 5, indicating excellent reported spending habits, while Category 0 includes responses 1 through 3, indicating comparatively weaker reported spending habits.

Creating Binary Response Variable (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
     spending_binary = if_else(spending_habit >= 4, 1, 0)
  )

To tie it all together, I factored the categorical variables (again).

Factoring Categorical Variables (expand to view code)
finance_modeling <- finance_modeling |>
  mutate(
    spending_habit = factor(spending_habit),
    follow_commitment = factor(follow_commitment),
    frugality = factor(frugality),
    worded_probability = factor(worded_probability),
    percentage_skill = factor(percentage_skill),
    goal_confidence = factor(goal_confidence),
    admire_luxury = factor(admire_luxury),
    self_worth = factor(self_worth),
    impress_people = factor(impress_people),
    distress = factor(distress),
    impulsivity = factor(impulsivity),
    resist_temptation = factor(resist_temptation),
    long_term_goals = factor(long_term_goals),
    economic_mobility = factor(economic_mobility),
    age_group = factor(age_group)
  )

6.2 Models, Metrics, & Thoughts

First, I fitted a binary logistic regression model and included all of the predictor variables. This served as a baseline for feature selection.

Fitting General Model With All Predictors (expand to view code)
general_model <- glm(spending_binary ~ . - spending_habit, data = finance_modeling, family = "binomial")

6.2.1 Feature Selection

I used stepwise selection (both forward and backward) to identify the “best” set of predictors (i.e., predictors with a statistically meaningful relationship to the response variable). This was done twice: once using AIC (Akaike Information Criterion), and once using BIC (Bayesian Information Criterion).

AIC and BIC are model selection metrics, but they have different priorities. AIC prioritizes predictive performance by balancing model fit and model complexity, while BIC prioritizes simpler models by applying a stronger penalty for model complexity.

In other words, AIC generally favors more complex models that include predictors contributing meaningfully to accuracy, whereas BIC is stricter and prefers a simpler model with only the most impactful predictors.

Applied here, AIC is used to identify predictors that have some relationship with the response, while BIC narrows our focus to the strongest predictors.

Stepwise Selection with AIC (expand to view code)
model_AIC <- stepAIC(general_model, direction = "both", k = 2, trace = FALSE)
Stepwise Selection with BIC (expand to view code)
# I needed to get the number of observations
n <- nrow(finance_modeling)

model_BIC <- stepAIC(general_model, direction = "both", k = log(n), trace = FALSE)

6.2.1.1 Examining the AIC Model

According to the AIC model, the most indicative predictors of spending habits include:

  • Financial behaviors: ability to follow [financial] commitments, be frugal, resist temptation, and work toward long-term goals; financial well-being score

  • Cognitive preferences: preference for numbers over words and comfort with percentages

  • Psychological factors: confidence in financial goals, desire to impress others, psychological connectedness, stress, and impulsivity

  • Demographics: age group

AIC Model Formula (expand to view code)
formula(model_AIC)
spending_binary ~ follow_commitment + frugality + worded_probability + 
    percentage_skill + goal_confidence + impress_people + psych + 
    distress + impulsivity + resist_temptation + long_term_goals + 
    age_group + wellbeing

6.2.1.2 Examining the BIC Model

According to the BIC model, the strongest predictors related to one’s spending habits are

  • Ability to follow commitments

  • Frugality

  • Confidence in achieving financial goals

  • Impulsivity

  • Ability to resist temptation

  • Financial well-being score

BIC Model Formula (expand to view code)
formula(model_BIC)
spending_binary ~ follow_commitment + frugality + goal_confidence + 
    impulsivity + resist_temptation + wellbeing

6.2.2 Model Assumptions

Successful logistic regression requires a few assumptions, namely:

  1. Linearity of Log-Odds

    I assessed this by plotting the log-odds against the predictors. This required a few steps:

    Making a New Data Frame With Predictors and Response From Model (expand to view code)
    bic_finance <- finance_modeling |>
      dplyr::select(spending_binary, follow_commitment, frugality, goal_confidence, impulsivity, resist_temptation, wellbeing)
    Extracting Predicted Probabilities (expand to view code)
    predictors <- colnames(bic_finance) 
    
    bic_finance$probabilities <- model_BIC$fitted.values
    Calculating Logit Values (expand to view code)
    bic_finance <- bic_finance |>
      mutate(logit = log(probabilities/(1-probabilities))) |>
      dplyr::select(-probabilities) |>
      gather(key = "predictors", value = "predictor.value", -logit) 
    Displaying Log-Odds Against Predictors (expand to view code)
    log_plot <- ggplot(bic_finance, aes(y = logit, x = predictor.value))+
      geom_point(size = 0.5, alpha = 0.5) +
      geom_smooth(method = "loess") + 
      theme_bw() + 
      facet_wrap(~predictors, scales = "free_x")
    
    log_plot
    `geom_smooth()` using formula = 'y ~ x'

    The categorical (ordinal) predictors appeared stable across levels. The continuous variable wellbeing showed a roughly linear trend, suggesting the assumption of linearity on the logit scale was reasonably met.

    To avoid redundancy, I only did this for the BIC model. We can reasonably assume that the assumption holds true for the AIC model.

  2. Independence of Observations

    Each observation represents a unique, individual survey response. No groups (such as same households or repeat responses) are present. Therefore, the model meets this assumption.

  3. Absence of Multicollinearity

    I used vif() to look for potential multicollinearity.

    AIC Model:

    AIC Model Variance (expand to view code)
    vif_aic <- vif(model_AIC)
    vif_aic |> kable()
    GVIF Df GVIF^(1/(2*Df))
    follow_commitment 1.325389 4 1.035841
    frugality 1.273549 5 1.024475
    worded_probability 1.293823 5 1.026095
    percentage_skill 1.396970 5 1.033996
    goal_confidence 1.664631 3 1.088645
    impress_people 1.299022 4 1.033242
    psych 1.160909 1 1.077455
    distress 1.423641 4 1.045141
    impulsivity 1.294653 3 1.043980
    resist_temptation 1.455748 3 1.064587
    long_term_goals 1.651653 3 1.087226
    age_group 1.388850 7 1.023740
    wellbeing 1.676884 1 1.294945

    The \(GVIF^{(\frac{1}{2\times Df})}\) column contains adjusted VIF values. All values are below 2, indicating no serious multicollinearity.

    BIC Model:

    BIC Model Variance (expand to view code)
    vif_bic <- vif(model_BIC)
    vif_bic |> kable()
    GVIF Df GVIF^(1/(2*Df))
    follow_commitment 1.214668 4 1.024607
    frugality 1.131563 5 1.012437
    goal_confidence 1.421325 3 1.060349
    impulsivity 1.146548 3 1.023054
    resist_temptation 1.163430 3 1.025550
    wellbeing 1.354680 1 1.163907

    Similarly, all values in the \(GVIF^{(\frac{1}{2\times Df})}\) column are less than 2, once again indicating that there is no notable multicollinearity.

  4. No (Influential) Outliers I used standardized residuals to identify potential outliers. No observations exceeded the common threshold of ±3, suggesting the model fits individual data points well and meets the assumption of no extreme residuals.

    Augmenting Model and Filtering Residuals (expand to view code)
    model_info <- augment(model_BIC)
    
    model_info |>
      filter(abs(.std.resid) > 3)
    # A tibble: 0 × 13
    # ℹ 13 variables: spending_binary <dbl>, follow_commitment <fct>,
    #   frugality <fct>, goal_confidence <fct>, impulsivity <fct>,
    #   resist_temptation <fct>, wellbeing <dbl>, .fitted <dbl>, .resid <dbl>,
    #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
  5. Binary/Ordinal Dependent Variable

    • I used spending_binary as the response variable. Consequently, both models meet this assumption.

To wrap things up, I ranked the predictors based on their coefficients. The higher the coefficient is, the stronger the relationship.

Ranking Predictors Based On Effect (expand to view code)
coef_summary <- summary(model_BIC)$coefficients

# I had to turn the summary into a data frame so I could group by variable
coef_df <- as.data.frame(coef_summary) |>
  mutate(term = rownames(coef_summary)) |>
  filter(term != "(Intercept)") |>
  mutate(variable = str_remove(term, "\\d+$")) |>
  group_by(variable) |>
  summarise(total_effect = sum(abs(Estimate))) |>
  arrange(desc(total_effect))

coef_df |> kable()
variable total_effect
frugality 3.7410909
resist_temptation 2.9465421
follow_commitment 1.7087117
impulsivity 1.3665486
goal_confidence 1.2320587
wellbeing 0.0358059

Our takeaway here is that frugality and resist_temptation are the strongest factors contributing to a respondent’s spending_habit score.

I used a predicted probabilities plot to visualize how these two factors affected our model.

Predicted Probabilities Plot (frugality & resisting temptation) (expand to view code)
get_mode <- function(x) {
  ux <- na.omit(x)
  ux[which.max(tabulate(match(ux, ux)))]
}

grid <- expand.grid(
  impulsivity = get_mode(finance_modeling$impulsivity),
  resist_temptation = sort(unique(finance_modeling$resist_temptation)),
  follow_commitment = get_mode(finance_modeling$follow_commitment),
  frugality = sort(unique(finance_modeling$frugality)),
  goal_confidence = get_mode(finance_modeling$goal_confidence),
  wellbeing = mean(finance_modeling$wellbeing, na.rm = TRUE)
)

grid$predicted_prob <- predict(model_BIC, newdata = grid, type = "response")

predicted_probability_plot <- ggplot(grid, aes(x = frugality, y = predicted_prob, color = resist_temptation, group = resist_temptation)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme_minimal() +
  scale_color_wa_d("stuart") +
  theme(plot.subtitle = element_text(size = 9),
        plot.caption = element_text(hjust = 0))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Predicted Probabilities Plot (frugality & resisting temptation) (expand to view code)
predicted_probability_plot

As frugality and resist_temptation scores increased, so did the predicted probability of having a “good” spending_habit score.

7 Final Thoughts

In conclusion, discipline (ability to be frugal and resist temptation) was the biggest indicator of one’s spending habits.

The ability to follow through with commitments, curb impulsivity, and confidence in achieving financial goals were also notable factors.

The financial well-being score also had a noticeable impact, but it was weak compared to the aforementioned factors.

7.1 Limitations

Scores were self-reported, and, given the nature of the topic, likely unreliable.

8 Reflection

The first iteration of this project was for my Multiple Regression Analysis class and looked nothing like this final version. I spent a bit of time “improving” it, breaking it, starting over, and changing my approach along the way. Personally, this project represents growth; it was me asking: how can I take this thing I did and make it better?

At one point, I considered using a random forest to find the “best” predictors. At another time, I attempted multinomial logistic regression. These experiments, while fun, reminded me to “keep it simple, stupid!” where appropriate, especially since my main objective was to explore potential relationships.

I am not opposed to using what I’ve learned in my Statistical Machine Learning class to create a classification tool in the future. That is likely what I will do next, as I continue to build, break, learn, and repeat.

Back to top