Applied Statistics for AI & Clinical Decision-Making — Lecture 6 of 10
Data InDeed | dataindeed.org
2026-01-01
Not all outcomes are continuous. Not all follow-up is complete. Not all distributions are Normal.
Post 14 ANOVA
Post 18 Survival Analysis
Post 19 Non-Parametric
Regression in disguise — with more than two groups
\[F = \frac{\text{Between-group variance (MS}_\text{between}\text{)}}{\text{Within-group variance (MS}_\text{within}\text{)}}\]
Large F → groups differ more than chance explains.
Df Sum Sq Mean Sq F value Pr(>F)
role 2 8613 4306 38.45 3.46e-15 ***
Residuals 237 26544 112
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Registry application: Is mean ISS significantly different across Role 2, 3, and 4 care settings? ANOVA tests this without running three separate t-tests (which would inflate Type I error).
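The F value in the ANOVA table above can be checked by hand from the sums of squares and degrees of freedom it reports. The sketch below uses only the rounded numbers printed in that table, so the result may differ slightly in the last digits from what R computed on the raw data.

```r
# Values copied from the printed ANOVA table (already rounded)
ss_between <- 8613;  df_between <- 2    # "role" row
ss_within  <- 26544; df_within  <- 237  # "Residuals" row

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within  / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("MS between:", ms_between, "\n")
cat("MS within: ", ms_within, "\n")
cat("F:         ", round(f_stat, 2), "\n")
cat("p-value:   ", signif(p_value, 3), "\n")
```

The printed F should agree with the 38.45 in the table, and the p-value should land close to the reported 3.46e-15.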
Pairwise comparisons using t tests with pooled SD
data: df_anova$iss and df_anova$role
Role 2 Role 3
Role 3 1.3e-05 -
Role 4 1.1e-15 2e-04
P value adjustment method: bonferroni
Why correction matters: Three groups give 3 pairwise comparisons. Testing each at α=0.05 gives a ~14% chance of at least one false positive by chance alone (1 − 0.95³ ≈ 0.143, assuming independent tests). Bonferroni multiplies each p-value by the number of comparisons, capped at 1.
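The arithmetic behind the ~14% figure, and the effect of the adjustment, can be sketched in a few lines of base R. The raw p-values here are hypothetical, chosen only to make the multiplication visible; they are not the registry results.

```r
alpha <- 0.05
m <- choose(3, 2)  # number of pairwise comparisons among 3 groups

# Chance of at least one false positive across m tests
# (treats the tests as independent, which is an approximation)
fwer <- 1 - (1 - alpha)^m
cat("Family-wise error rate:", round(fwer, 4), "\n")

# Hypothetical unadjusted p-values for the three comparisons
p_raw <- c(0.01, 0.02, 0.04)

# Bonferroni: multiply each p-value by m, cap at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: step-down version, never less powerful than Bonferroni
p_holm <- p.adjust(p_raw, method = "holm")

print(data.frame(p_raw, p_bonf, p_holm))
```

With these inputs, all three raw p-values are below 0.05, but only the first survives Bonferroni, while all three survive Holm.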
Other options: Tukey HSD (preferred for ANOVA follow-up), Holm, Benjamini-Hochberg.
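As a rough illustration of the Tukey HSD option, the sketch below fits a one-way ANOVA and runs `TukeyHSD()` on it. The data frame `df_sim`, its group means, and the seed are all assumptions made up for this example; only the group labels and the 240-patient total mirror the output shown earlier.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated ISS values for three care settings (hypothetical means and SD)
df_sim <- data.frame(
  role = factor(rep(c("Role 2", "Role 3", "Role 4"), each = 80)),
  iss  = c(rnorm(80, mean = 15, sd = 10),
           rnorm(80, mean = 22, sd = 10),
           rnorm(80, mean = 30, sd = 10))
)

# One-way ANOVA, then all pairwise differences with Tukey's adjustment
fit   <- aov(iss ~ role, data = df_sim)
tukey <- TukeyHSD(fit, "role")

# Each row is one pair: estimated difference, 95% CI, adjusted p-value
print(tukey)
```

Unlike the Bonferroni table, Tukey's output reports a confidence interval for each pairwise difference, which is often more useful than an adjusted p-value alone.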
When time matters and not everyone reaches the outcome
Why ordinary regression fails for time-to-event data:
Patient A: died at day 45 → complete observation
Patient B: still alive at day 90 → right-censored (we know survival > 90 days)
Patient C: lost to follow-up day 30 → right-censored
Ignoring censored patients biases results toward shorter survival times.
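The bias from dropping censored patients can be seen in a small simulation. This is a sketch under stated assumptions: exponential survival times with rate 0.02, a hypothetical 60-day end of follow-up, and an arbitrary seed. It compares the median among observed deaths only with the Kaplan-Meier median, which uses the censored patients too.

```r
library(survival)  # Surv(), survfit(); ships with standard R installs

set.seed(2026)                     # arbitrary seed
n         <- 1000
true_time <- rexp(n, rate = 0.02)  # true survival times in days
cutoff    <- 60                    # hypothetical end of follow-up

# What the registry would actually record
time   <- pmin(true_time, cutoff)          # observed time
status <- as.integer(true_time <= cutoff)  # 1 = death seen, 0 = right-censored

# Naive approach: throw away censored patients
naive_median <- median(time[status == 1])

# Kaplan-Meier keeps censored patients in the risk set until they drop out
km        <- survfit(Surv(time, status) ~ 1)
km_median <- unname(summary(km)$table["median"])

true_median <- log(2) / 0.02  # median of an exponential with rate 0.02

cat("Naive median (deaths only):", round(naive_median, 1), "\n")
cat("Kaplan-Meier median:       ", round(km_median, 1), "\n")
cat("True median:               ", round(true_median, 1), "\n")
```

The naive median should come out well below the other two, because every patient who survived past day 60 has been excluded.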
Survival function: \(S(t) = P(T > t)\)
Hazard function: \(h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t+\Delta t \mid T \geq t)}{\Delta t}\)
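The two functions are linked by the identity h(t) = f(t) / S(t), where f is the density of T. A quick numerical check for the exponential distribution, where the hazard is constant, may make this concrete. The rate of 0.02 per day and the three time points are illustrative choices.

```r
rate <- 0.02           # illustrative event rate per day
t    <- c(10, 45, 90)  # illustrative time points in days

S <- pexp(t, rate, lower.tail = FALSE)  # survival function S(t) = P(T > t)
f <- dexp(t, rate)                      # density f(t)
h <- f / S                              # hazard h(t) = f(t) / S(t)

print(data.frame(t, S = round(S, 3), h))
```

The survival probability falls as t grows, while the hazard stays at the rate parameter for every t. That constant hazard is what defines the exponential model.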
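library(survival)  # Surv(), survfit()
library(tibble)    # tibble()

# Simulated data: exponential event times; about 75% of patients have an observed event
df_surv <- tibble(
  time   = c(rexp(100, 0.02), rexp(100, 0.035)),
  status = rbinom(200, 1, 0.75),
  group  = rep(c("Standard", "Damage Control"), each = 100)
)
# Kaplan-Meier curves by group; strata and legend labels are both in alphabetical order
km_fit <- survfit(Surv(time, status) ~ group, data = df_surv)
plot(km_fit, col = c("#2563eb", "#e63946"), lwd = 2,
     xlab = "Days", ylab = "Survival probability",
     main = "Kaplan-Meier: Standard vs. Damage Control Resuscitation")
legend("topright", levels(factor(df_surv$group)),
       col = c("#2563eb", "#e63946"), lwd = 2)
Cox proportional hazards model:
\[h(t \mid X) = h_0(t) \cdot e^{\beta_1 X_1 + \beta_2 X_2 + \dots}\]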
Hazard ratio = \(e^\beta\): the multiplicative change in hazard per one-unit increase in X, holding the other covariates fixed.
# A tibble: 2 × 5
term estimate conf.low conf.high p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 groupStandard 0.549 0.392 0.768 0
2 rnorm(200) 0.897 0.771 1.04 0.161
HR < 1 → lower hazard (protective); HR > 1 → higher hazard (harmful).
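A table like the one above could be produced along these lines. This is a sketch, not the code behind the slide: it re-simulates the data with an arbitrary seed, fits group only, and uses base R instead of a tidying package, so the numbers will not match the output shown.

```r
library(survival)  # Surv(), coxph()

set.seed(42)  # arbitrary seed; the slide's seed is unknown

# Same simulation design as the Kaplan-Meier example
df_surv <- data.frame(
  time   = c(rexp(100, 0.02), rexp(100, 0.035)),
  status = rbinom(200, 1, 0.75),
  group  = rep(c("Standard", "Damage Control"), each = 100)
)

# Cox model; "Damage Control" is the reference level (alphabetical order)
cox_fit <- coxph(Surv(time, status) ~ group, data = df_surv)

# Exponentiate the coefficient and its Wald interval to get the hazard ratio
hr <- exp(coef(cox_fit))
ci <- exp(confint(cox_fit))

print(cbind(HR = hr, ci))
```

The row is labelled `groupStandard` because the coefficient compares Standard against the reference group, so an HR below 1 would mean a lower hazard for Standard.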
Registry use: What is the hazard ratio for 30-day mortality between patients who received TXA within 3 hours vs. those who did not, adjusting for ISS and shock index?
When distributions can’t be assumed
Use rank-based tests when:
| Parametric | Non-Parametric equivalent |
|---|---|
| One-sample t-test | Wilcoxon signed-rank |
| Two-sample t-test | Wilcoxon rank-sum (Mann-Whitney) |
| One-way ANOVA | Kruskal-Wallis |
| Pearson correlation | Spearman rank correlation |
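Each row of the table corresponds to a pair of base R calls. The sketch below runs them side by side on simulated right-skewed data; the variable names, the log-normal parameters, the reference value of 12, and the seed are all assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed

# Three hypothetical groups of right-skewed measurements
x <- rlnorm(30, meanlog = 2.5, sdlog = 0.6)
y <- rlnorm(30, meanlog = 2.8, sdlog = 0.6)
z <- rlnorm(30, meanlog = 3.0, sdlog = 0.6)
vals <- c(x, y, z)
g    <- factor(rep(c("A", "B", "C"), each = 30))

# One sample against a reference value
one_t <- t.test(x, mu = 12)
one_w <- wilcox.test(x, mu = 12)  # Wilcoxon signed-rank

# Two independent samples (t.test defaults to the Welch version)
two_t <- t.test(x, y)
two_w <- wilcox.test(x, y)        # Wilcoxon rank-sum / Mann-Whitney

# Three or more groups
aov_fit <- summary(aov(vals ~ g))
kw      <- kruskal.test(vals ~ g)

# Correlation
r_pearson  <- cor.test(x, y, method = "pearson")
r_spearman <- cor.test(x, y, method = "spearman")

print(two_w)
print(kw)
```

The non-parametric calls take the same inputs as their parametric counterparts, so switching between them is usually a one-line change.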
Wilcoxon W = 57.5 p = 0.596
Key property: Tests whether one distribution is stochastically larger than another. It does not require Normality. Reading the result as a shift in medians additionally assumes the two distributions have a similar shape and spread.
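Because the test uses only ranks, any monotone increasing transformation of the data leaves its result unchanged. The sketch below illustrates this on simulated skewed data (hypothetical values and seed): taking logs changes the t-test but not the Wilcoxon rank-sum test.

```r
set.seed(11)  # arbitrary seed

# Two small right-skewed samples (hypothetical)
a <- rlnorm(12, meanlog = 2.5, sdlog = 0.8)
b <- rlnorm(12, meanlog = 2.9, sdlog = 0.8)

# Rank-based test on the raw and log scales
w_raw <- wilcox.test(a, b)
w_log <- wilcox.test(log(a), log(b))

# t-test on the raw and log scales for comparison
t_raw <- t.test(a, b)
t_log <- t.test(log(a), log(b))

cat("Wilcoxon W  raw:", w_raw$statistic, " log:", w_log$statistic, "\n")
cat("Wilcoxon p  raw:", w_raw$p.value,   " log:", w_log$p.value, "\n")
cat("t-test p    raw:", t_raw$p.value,   " log:", t_log$p.value, "\n")
```

The two Wilcoxon lines should print identical values, since the ordering of the observations is the same on both scales.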
ANOVA
Survival Analysis
Non-Parametric
The meta-lesson: The choice of test follows from the outcome type, sample size, and distributional assumptions — not from habit.
Dimensionality Reduction & Clustering
Posts 15, 16, 28:
Read Before Lecture 7