library(xtable)
library(tidyverse)
library(pROC)
library(ggplot2)

1 Bernoulli and Binomial random variables

R code simulating Bernoulli (0/1) random outcomes:

p1 <- 0.5
yi <- rbinom(5, 1, p1); yi
## [1] 0 1 0 0 0

R code simulating Binomial random outcomes:

p2 <- 0.7
yi <- rbinom(10, 5, p2); yi
##  [1] 2 3 4 3 3 3 5 3 4 4
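As a quick sanity check (a sketch, not part of the original output), the empirical mean and variance of many Binomial draws should approach \(np\) and \(np(1-p)\):

```r
# Sanity check (sketch): with size = 5 and prob = 0.7, the sample
# mean of many draws should approach n * p = 3.5 and the sample
# variance should approach n * p * (1 - p) = 1.05.
set.seed(1)
draws <- rbinom(1e5, size = 5, prob = 0.7)
mean(draws)  # close to 3.5
var(draws)   # close to 1.05
```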

Plot comparing a binary response variable to a continuous response variable

# ---- Simulate two types of responses ----
n <- 100

# Continuous response (e.g. blood pressure, weight)
y_cont <- rnorm(n, mean = 50, sd = 10)

# Binary response (e.g. disease: 0 = no, 1 = yes)
p <- 0.3                  # probability of "1"
y_bin <- rbinom(n, 1, p)  # 0/1 outcomes

# ---- Plot: continuous vs binary response ----
par(mfrow = c(1, 2))

# Left: continuous response
plot(1:n, y_cont,
     pch = 19,
     xlab = "Observation index",
     ylab = "Continuous response",
     main = "Continuous response")

# Right: binary response
plot(1:n, jitter(y_bin, amount = 0.05),
     pch = 19,
     xlab = "Observation index",
     ylab = "Binary response (0 or 1)",
     main = "Binary response",
     yaxt = "n")
axis(2, at = c(0, 1), labels = c("0", "1"))
abline(h = 0:1, lty = 2, col = "grey")
par(mfrow = c(1, 1))

Running logistic regression with simulated data and interpreting the coefficients.

The logistic regression model examining the relationship between \(x\) and the binary outcome \(y\) shows that \(x\) is a statistically significant predictor (\(z = 4.28\), \(p < 0.001\)). The estimated coefficient for \(x\) is \(0.120\), which corresponds to an odds ratio of \(\exp(0.120) \approx 1.13\): for every one-unit increase in \(x\), the odds of the outcome being a success (1) increase by approximately 13%, holding all else constant. The intercept estimate is \(-5.72\) (odds ratio \(\approx 0.0033\)), representing the log-odds of the outcome being 1 when \(x = 0\).

Model fit statistics indicate improvement over the null model: the deviance drops from \(166.32\) (null) to \(140.66\) (residual), meaning the model explains more variation in the outcome than a model with no predictors. The AIC of \(144.66\) can be used to compare this model with alternative models (lower AIC indicates better fit). Overall, the results provide strong evidence that higher values of \(x\) are associated with a greater probability of \(y = 1\).
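As a worked illustration of the odds-ratio interpretation (a sketch using rounded values close to the estimates quoted above, intercept \(\approx -5.7\) and slope \(\approx 0.12\)):

```r
# Worked illustration (sketch): rounded coefficient values, close to
# the estimates quoted above.
b0 <- -5.7   # intercept (log-odds at x = 0)
b1 <- 0.12   # slope (change in log-odds per unit of x)
exp(b1)               # odds ratio per one-unit increase in x (~1.13)
plogis(b0 + b1 * 50)  # P(y = 1) at x = 50
plogis(b0 + b1 * 60)  # P(y = 1) at x = 60
```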

# Simulated binary response
n <- 120
x <- rnorm(n, 50, 10)
p <- plogis((x - 50)/8)
y <- rbinom(n, 1, p)

# Logistic regression
logit_model <- glm(y ~ x, family = binomial)

summary(logit_model)
## 
## Call:
## glm(formula = y ~ x, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.71715    1.35449  -4.221 2.43e-05 ***
## x            0.12036    0.02813   4.279 1.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 166.32  on 119  degrees of freedom
## Residual deviance: 140.66  on 118  degrees of freedom
## AIC: 144.66
## 
## Number of Fisher Scoring iterations: 5
exp(coef(logit_model))   # Odds ratios
## (Intercept)           x 
## 0.003289071 1.127905924

Predicted probabilities that each observation has outcome = 1 can be derived from the fitted logistic regression.

pred_prob <- predict(logit_model, type = "response")
head(pred_prob)
##          1          2          3          4          5          6 
## 0.94350071 0.49536301 0.46498910 0.51788760 0.08650772 0.43350920

We turn the probabilities into predicted classes (0 or 1) by choosing a cutoff (threshold) value, typically 0.5: if the probability is \(> 0.5\), predict 1; if \(\leq 0.5\), predict 0. The resulting table is called the confusion matrix; it shows how many cases the model classified correctly or incorrectly.

threshold <- 0.5
pred_class <- ifelse(pred_prob > threshold, 1, 0)

table(Actual = y, Predicted = pred_class)
##       Predicted
## Actual  0  1
##      0 41 18
##      1 17 44
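From such a table, common classification metrics follow directly. A sketch, hard-coding the counts shown above for self-containment (in the analysis, the matrix comes from `table(Actual = y, Predicted = pred_class)`):

```r
# Confusion-matrix metrics (sketch): counts taken from the table above.
cm <- matrix(c(41, 17, 18, 44), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (TN + TP) / total
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN), true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])  # TN / (TN + FP), true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```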

The ROC curve helps us evaluate how well the model separates 0 from 1 across all possible thresholds, not just 0.5. An AUC of 0.5 means the model is no better than random guessing, while an AUC closer to 1 indicates better classification.

roc_obj <- roc(y, pred_prob)
plot(roc_obj, col="darkgreen", lwd=3, main="ROC Curve")

auc(roc_obj)
## Area under the curve: 0.7638
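The 0.5 cutoff is not the only option: pROC's coords() can report the threshold that maximizes Youden's J (sensitivity + specificity - 1). A sketch, re-simulating the data so the chunk is self-contained; on the analysis above, `coords(roc_obj, "best", best.method = "youden")` applies directly:

```r
# Threshold selection via Youden's J (sketch). The data here are
# re-simulated for self-containment, not the run shown above.
library(pROC)
set.seed(123)
x <- rnorm(120, 50, 10)
y <- rbinom(120, 1, plogis((x - 50) / 8))
m <- glm(y ~ x, family = binomial)
r <- roc(y, predict(m, type = "response"))
coords(r, "best", best.method = "youden")  # threshold, specificity, sensitivity
```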

2 Data analysis with logistic regression

We use the mtcars dataset, which contains information on 32 car models. For logistic regression, we create a data frame with the following key variables to be included in the model.

am is a variable indicating transmission type (0 = automatic, 1 = manual), which is our binary outcome.
wt is car weight, a continuous predictor measured in units of 1000 lbs.
hp is horsepower, another continuous predictor.

data(mtcars)
df <- within(mtcars, {
  am <- as.integer(am)        # ensure binary 0/1
  wt <- wt                    # weight (1000 lbs)
  hp <- hp                    # horsepower
})

dim(mtcars)
## [1] 32 11

The logistic regression model examines whether car weight wt and horsepower hp predict the probability of a car having a manual transmission (am = 1) rather than an automatic one.

Both predictors are statistically significant:

wt has a negative effect with estimate = \(-8.08\), \(p = 0.008\), indicating heavier cars are less likely to have manual transmission.  

hp has a positive effect with estimate = \(0.036\), \(p = 0.041\), indicating cars with higher horsepower are more likely to have manual transmission.

# 2) Fit logistic regression ---------------------------------
# logit(P(am=1)) = beta0 + beta1*wt + beta2*hp
fit <- glm(am ~ wt + hp, data = df, family = binomial(link = "logit"))

cat("\n=== Model summary (logit scale) ===\n")
## 
## === Model summary (logit scale) ===
print(summary(fit))
## 
## Call:
## glm(formula = am ~ wt + hp, family = binomial(link = "logit"), 
##     data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 18.86630    7.44356   2.535  0.01126 * 
## wt          -8.08348    3.06868  -2.634  0.00843 **
## hp           0.03626    0.01773   2.044  0.04091 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 10.059  on 29  degrees of freedom
## AIC: 16.059
## 
## Number of Fisher Scoring iterations: 8
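The deviance reduction itself can be tested. A sketch of the overall likelihood-ratio test, using the null and residual deviances from the summary above:

```r
# Likelihood-ratio test (sketch): null vs residual deviance from the
# summary above, on 31 - 29 = 2 degrees of freedom.
lr_stat <- 43.230 - 10.059
pchisq(lr_stat, df = 2, lower.tail = FALSE)  # ~6e-08: strong evidence
```

A term-by-term version of this test is available via anova(fit, test = "Chisq").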

Additional interpretation of the coefficients.

# 3) Odds ratios and 95% CI ----------------------------------
# OR = exp(beta). For a one-unit increase in the predictor, odds multiply by OR.
coefs <- coef(fit)
OR <- exp(coefs); OR
##  (Intercept)           wt           hp 
## 1.561455e+08 3.085967e-04 1.036921e+00
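Because wt is measured in units of 1000 lbs, exp(beta) is the odds ratio for a full 1000-lb increase. A sketch rescaling it to a smaller, more interpretable step (the coefficient value is taken from the model summary above):

```r
# Rescaling an odds ratio (sketch): exp(beta * delta) gives the OR for
# a delta-unit change in the predictor.
beta_wt <- -8.08348  # wt coefficient from the summary above
exp(beta_wt)         # OR per 1000-lb increase (~3.1e-04)
exp(beta_wt * 0.5)   # OR per 500-lb increase (~0.018)
```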

We can also derive confidence intervals for the odds ratios.

# Profile-likelihood CI for coefficients, then exponentiate to get OR CI
ci_beta <- suppressMessages(confint(fit))
OR_CI <- exp(ci_beta)

cat("\n=== Odds Ratios (with 95% CI) ===\n")
## 
## === Odds Ratios (with 95% CI) ===
print(data.frame(
  term = names(OR),
  OR = OR,
  OR_low = OR_CI[, 1],
  OR_high = OR_CI[, 2],
  row.names = NULL
), digits = 3)
##          term       OR   OR_low  OR_high
## 1 (Intercept) 1.56e+08 4.34e+03 4.08e+17
## 2          wt 3.09e-04 3.36e-08 2.30e-02
## 3          hp 1.04e+00 1.01e+00 1.09e+00

Predicted probabilities for am = 1 for each observation

# 4) Predicted probabilities for each observation -------------
df$phat <- predict(fit, type = "response")  # P(am=1 | wt, hp)

cat("\n=== First 6 predicted probabilities (P(manual)) ===\n")
## 
## === First 6 predicted probabilities (P(manual)) ===
print(head(df[, c("wt", "hp", "am", "phat")], 6), digits = 3)
##                     wt  hp am    phat
## Mazda RX4         2.62 110  1 0.84234
## Mazda RX4 Wag     2.88 110  1 0.40478
## Datsun 710        2.32  93  1 0.97024
## Hornet 4 Drive    3.21 110  0 0.04173
## Hornet Sportabout 3.44 175  0 0.06939
## Valiant           3.46 105  0 0.00499
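As a natural next step (a sketch mirroring Section 1, not part of the original analysis), the fitted probabilities can be dichotomized at 0.5 and compared with the observed transmission type:

```r
# Classify each car as manual (1) if P(am = 1) > 0.5, then cross-
# tabulate against the observed transmission. (The model is re-fit
# here so the chunk is self-contained.)
data(mtcars)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
phat <- predict(fit, type = "response")
pred_am <- ifelse(phat > 0.5, 1, 0)
table(Actual = mtcars$am, Predicted = pred_am)
```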