library(xtable)
library(tidyverse)
library(pROC)
library(ggplot2)
R code simulating Bernoulli (0/1) random outcomes:
p1 <- 0.5
yi <- rbinom(5, 1, p1); yi
## [1] 0 1 0 0 0
R code simulating Binomial random outcomes:
p2 <- 0.7
yi <- rbinom(10, 5, p2); yi
## [1] 2 3 4 3 3 3 5 3 4 4
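A Binomial(size, p) outcome is the sum of `size` independent Bernoulli(p) trials. A minimal sketch checking this relationship and verifying that the Binomial probabilities sum to 1 (variable names here are illustrative):

```r
# A Binomial(5, p) draw and a sum of 5 Bernoulli(p) draws have the same distribution
set.seed(1)
p2 <- 0.7
bern_sum <- sum(rbinom(5, 1, p2))   # sum of 5 Bernoulli(p2) trials, in 0..5
binom_draw <- rbinom(1, 5, p2)      # one Binomial(5, p2) draw, in 0..5
# The Binomial pmf sums to 1 over its support 0..5
total_prob <- sum(dbinom(0:5, 5, p2))
total_prob
```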
Plot comparing a binary response variable with a continuous response variable
# ---- Simulate two types of responses ----
n <- 100
# Continuous response (e.g. blood pressure, weight)
y_cont <- rnorm(n, mean = 50, sd = 10)
# Binary response (e.g. disease: 0 = no, 1 = yes)
p <- 0.3 # probability of "1"
y_bin <- rbinom(n, 1, p) # 0/1 outcomes
# ---- Plot: continuous vs binary response ----
par(mfrow = c(1, 2))
# Left: continuous response
plot(1:n, y_cont,
     pch = 19,
     xlab = "Observation index",
     ylab = "Continuous response",
     main = "Continuous response")
# Right: binary response
plot(1:n, jitter(y_bin, amount = 0.05),
     pch = 19,
     xlab = "Observation index",
     ylab = "Binary response (0 or 1)",
     main = "Binary response",
     yaxt = "n")
axis(2, at = c(0, 1), labels = c("0", "1"))
abline(h = 0:1, lty = 2, col = "grey")
par(mfrow = c(1, 1))
Running logistic regression on simulated data and interpreting the coefficients.
The logistic regression model examining the relationship between
\(x\) and the binary outcome \(y\) shows that \(x\) is a statistically significant
predictor with \(z = 4.28\) and \(p < 0.001\). The estimated
coefficient for \(x\) is \(0.120\), which corresponds to an odds ratio
of \(1.13\): for every one-unit increase in \(x\), the odds of
the outcome being a success (1) increase by approximately 13%,
holding all else constant. The intercept estimate is \(-5.72\) (odds ratio \(\approx 0.0033\)), representing the
log-odds of the outcome being 1 when \(x = 0\).
Model fit statistics indicate improvement over the null model: the deviance drops from \(166.32\) (null) to \(140.66\) (residual), meaning the model explains more variation in the outcome than a model with no predictors. The AIC of \(144.66\) can be used to compare this model with alternative models (lower AIC indicates better fit). Overall, the results suggest strong evidence that higher values of \(x\) are associated with a greater probability of \(y = 1\).
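The percentage interpretation follows from exponentiating the coefficient. A minimal sketch of the conversion, using the slope estimate reported in the model summary below:

```r
# Convert a log-odds coefficient to an odds ratio and a % change in odds
beta_x <- 0.12036              # slope estimate from the model summary
odds_ratio <- exp(beta_x)      # multiplicative change in odds per unit of x
pct_change <- 100 * (odds_ratio - 1)
round(c(OR = odds_ratio, pct = pct_change), 2)
```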
# Simulated binary response
n <- 120
x <- rnorm(n, 50, 10)
p <- plogis((x - 50)/8)
y <- rbinom(n, 1, p)
# Logistic regression
logit_model <- glm(y ~ x, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = y ~ x, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.71715 1.35449 -4.221 2.43e-05 ***
## x 0.12036 0.02813 4.279 1.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 166.32 on 119 degrees of freedom
## Residual deviance: 140.66 on 118 degrees of freedom
## AIC: 144.66
##
## Number of Fisher Scoring iterations: 5
exp(coef(logit_model)) # Odds ratios
## (Intercept) x
## 0.003289071 1.127905924
Predicted probabilities that each observation has outcome = 1 can be derived from the fitted logistic regression.
pred_prob <- predict(logit_model, type = "response")
head(pred_prob)
## 1 2 3 4 5 6
## 0.94350071 0.49536301 0.46498910 0.51788760 0.08650772 0.43350920
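`predict(type = "response")` is simply the inverse logit applied to the linear predictor. A self-contained sketch verifying this equivalence on freshly simulated data (the simulation settings mirror those above but are re-seeded here for reproducibility):

```r
# type = "response" applies the inverse logit to the linear predictor
set.seed(42)
x <- rnorm(50, 50, 10)
y <- rbinom(50, 1, plogis((x - 50)/8))
m <- glm(y ~ x, family = binomial)
p_resp <- predict(m, type = "response")      # probabilities directly
p_link <- plogis(predict(m, type = "link"))  # plogis(eta) by hand
all.equal(p_resp, p_link)
```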
We turn probabilities into predicted classes (0 or 1) by choosing a cutoff (threshold), typically 0.5: if the probability is \(> 0.5\), predict 1; if \(\leq 0.5\), predict 0. The resulting classification table, printed below, is called the confusion matrix; it shows how many cases the model predicted correctly or incorrectly.
threshold <- 0.5
pred_class <- ifelse(pred_prob > threshold, 1, 0)
table(Actual = y, Predicted = pred_class)
## Predicted
## Actual 0 1
## 0 41 18
## 1 17 44
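From a 2x2 confusion matrix we can compute the standard classification summaries. A sketch using illustrative counts laid out as in the table above:

```r
# Summary measures from a 2x2 confusion matrix (counts are illustrative)
cm <- matrix(c(41, 17, 18, 44), nrow = 2,
             dimnames = list(Actual = 0:1, Predicted = 0:1))
accuracy    <- sum(diag(cm)) / sum(cm)        # overall correct rate
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```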
The ROC curve evaluates how well the model separates 0 from 1 across all possible thresholds, not just 0.5. An AUC of 0.5 means the model is no better than random guessing; an AUC closer to 1 indicates better classification.
roc_obj <- roc(y, pred_prob)
plot(roc_obj, col="darkgreen", lwd=3, main="ROC Curve")
auc(roc_obj)
## Area under the curve: 0.7638
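The AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (ties counted as one half). A minimal sketch of that equivalence on simulated scores:

```r
# AUC as a concordance probability, computed by hand over all pos/neg pairs
set.seed(7)
y <- rep(0:1, each = 40)      # 40 negatives, 40 positives
p <- plogis(rnorm(80) + y)    # scores mildly related to y
pos <- p[y == 1]; neg <- p[y == 0]
pairs <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_manual <- mean(pairs)     # fraction of concordant pairs
auc_manual
```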
We use the mtcars dataset, which contains information on
32 car models. For logistic regression, we create a data frame with the
following key variables to be included in the model.
am indicates transmission type (0 = automatic, 1 = manual), which is
our binary outcome.
wt is car weight, a continuous predictor measured in units of 1000
pounds.
hp is horsepower, another continuous predictor.
data(mtcars)
df <- within(mtcars, {
  am <- as.integer(am)  # ensure binary 0/1
  wt <- wt              # weight (1000 lbs)
  hp <- hp              # horsepower
})
dim(mtcars)
## [1] 32 11
The logistic regression model examines whether car weight
wt and horsepower hp predict the probability
of a car having a manual transmission (am = 1) rather than an
automatic one.
Both predictors are statistically significant:
wt has a negative effect with estimate = \(-8.08\), \(p =
0.008\), indicating heavier cars are less likely to have manual
transmission.
hp has a positive effect with estimate = \(0.036\), \(p =
0.041\), indicating cars with higher horsepower are more likely
to have manual transmission.
# 2) Fit logistic regression ---------------------------------
# logit(P(am=1)) = beta0 + beta1*wt + beta2*hp
fit <- glm(am ~ wt + hp, data = df, family = binomial(link = "logit"))
cat("\n=== Model summary (logit scale) ===\n")
##
## === Model summary (logit scale) ===
print(summary(fit))
##
## Call:
## glm(formula = am ~ wt + hp, family = binomial(link = "logit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.86630 7.44356 2.535 0.01126 *
## wt -8.08348 3.06868 -2.634 0.00843 **
## hp 0.03626 0.01773 2.044 0.04091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 10.059 on 29 degrees of freedom
## AIC: 16.059
##
## Number of Fisher Scoring iterations: 8
Additional interpretation of the coefficients.
# 3) Odds ratios and 95% CI ----------------------------------
# OR = exp(beta). For a one-unit increase in the predictor, odds multiply by OR.
coefs <- coef(fit)
OR <- exp(coefs); OR
## (Intercept) wt hp
## 1.561455e+08 3.085967e-04 1.036921e+00
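Because hp spans a wide range, the per-unit odds ratio (about 1.037) is easier to read per 10 horsepower. The rescaling is exp(k * beta); a sketch using the hp estimate from the summary above:

```r
# Odds ratio for a k-unit increase in a predictor is exp(k * beta)
beta_hp <- 0.03626                 # hp coefficient from the model summary
OR_per_10hp <- exp(10 * beta_hp)   # odds multiply by this factor per +10 hp
round(OR_per_10hp, 3)
```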
We can also derive confidence intervals for the odds ratios.
# Profile-likelihood CI for coefficients, then exponentiate to get OR CI
ci_beta <- suppressMessages(confint(fit))
OR_CI <- exp(ci_beta)
cat("\n=== Odds Ratios (with 95% CI) ===\n")
##
## === Odds Ratios (with 95% CI) ===
print(data.frame(
  term = names(OR),
  OR = OR,
  OR_low = OR_CI[, 1],
  OR_high = OR_CI[, 2],
  row.names = NULL
), digits = 3)
## term OR OR_low OR_high
## 1 (Intercept) 1.56e+08 4.34e+03 4.08e+17
## 2 wt 3.09e-04 3.36e-08 2.30e-02
## 3 hp 1.04e+00 1.01e+00 1.09e+00
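For comparison, Wald-type intervals exponentiate beta ± 1.96·SE; they generally differ somewhat from profile-likelihood intervals, especially in small samples. A self-contained sketch on the same mtcars model:

```r
# Wald-type 95% CI for odds ratios: exp(beta +/- 1.96 * SE)
data(mtcars)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
est <- coef(summary(fit))
wald_ci <- exp(cbind(OR   = est[, "Estimate"],
                     low  = est[, "Estimate"] - 1.96 * est[, "Std. Error"],
                     high = est[, "Estimate"] + 1.96 * est[, "Std. Error"]))
round(wald_ci, 4)
```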
Predicted probabilities of am = 1 for each
observation.
# 4) Predicted probabilities for each observation -------------
df$phat <- predict(fit, type = "response") # P(am=1 | wt, hp)
cat("\n=== First 6 predicted probabilities (P(manual)) ===\n")
##
## === First 6 predicted probabilities (P(manual)) ===
print(head(df[, c("wt", "hp", "am", "phat")], 6), digits = 3)
## wt hp am phat
## Mazda RX4 2.62 110 1 0.84234
## Mazda RX4 Wag 2.88 110 1 0.40478
## Datsun 710 2.32 93 1 0.97024
## Hornet 4 Drive 3.21 110 0 0.04173
## Hornet Sportabout 3.44 175 0 0.06939
## Valiant 3.46 105 0 0.00499
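The fitted model can also score cars not in the dataset. A sketch predicting P(manual) for a hypothetical car (the wt and hp values here are made up for illustration):

```r
# Predict P(am = 1) for a new, hypothetical car
data(mtcars)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
new_car <- data.frame(wt = 2.8, hp = 120)  # hypothetical: 2800 lbs, 120 hp
p_new <- predict(fit, newdata = new_car, type = "response")
p_new
```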
In logistic regression and other generalized linear models (GLMs) for binary outcomes, we do not model the probability directly. Instead, we model a transformed version of the probability using a link function. The code below plots the inverse link functions, showing how each link converts the linear predictor into a probability between 0 and 1. The x-axis represents the linear predictor \[ \eta = \beta_0 + \beta_1 x \] and the y-axis shows the corresponding predicted probability \(p\).
eta <- seq(-5, 5, length.out = 400)
curve_logit <- plogis(eta)
curve_probit <- pnorm(eta)
curve_cloglog <- 1 - exp(-exp(eta))
curve_cauchit <- pcauchy(eta)
plot(eta, curve_logit, type = "l", lwd = 2, xlab = expression(eta),
     ylab = expression(p == g^{-1}(eta)), main = "Inverse links: p(eta)")
lines(eta, curve_probit, lwd = 2, lty = 2)
lines(eta, curve_cloglog, lwd = 2, lty = 3)
lines(eta, curve_cauchit, lwd = 2, lty = 4)
legend("topleft", lwd = 2, lty = 1:4, bty = "n",
       legend = c("logit", "probit", "cloglog", "cauchit"))
lr.logit <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = df)
lr.probit <- glm(am ~ wt + hp, family = binomial(link = "probit"), data = df)
lr.cloglog <- glm(am ~ wt + hp, family = binomial(link = "cloglog"), data = df)
lr.cauchit <- glm(am ~ wt + hp, family = binomial(link = "cauchit"), data = df)
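Since the four fits above use the same data and predictors, their AICs are directly comparable. A sketch collecting them (models are refit here so the block runs on its own; convergence warnings from the non-logit links, if any, are suppressed):

```r
# Compare binomial link functions on the same model by AIC
data(mtcars)
links <- c("logit", "probit", "cloglog", "cauchit")
aics <- sapply(links, function(l)
  AIC(suppressWarnings(
    glm(am ~ wt + hp, data = mtcars, family = binomial(link = l)))))
round(sort(aics), 2)  # lower AIC indicates better fit, all else equal
```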