That sounds obvious, but in practice it is one of the easiest places for analysts to go wrong.
A model can look excellent by one metric and disappointing by another. A classifier can achieve high accuracy while missing most of the cases that matter. A model can rank patients well but still produce poorly calibrated probabilities (Brier 1950; Steyerberg 2019). And a model can appear strong in development while failing in deployment because the chosen evaluation metric never matched the real decision problem.
This is why model evaluation matters so much, especially in clinical settings where model utility depends on more than discrimination alone (Vickers and Elkin 2006; Harrell 2015).
In machine learning, metrics are not just scorekeeping. They define what “good” means.
Along the way, I also connect these ideas to regression-style evaluation and to the challenge of imbalanced outcomes.
Model evaluation matters because a model is good not when it predicts well in the abstract, but when it performs well on the metric that matches the real decision problem.
Model Evaluation Begins with the Decision Context
A metric is never just a number.
It is a statement about what kind of error we care about.
For example:
in spam filtering, false positives may annoy users
in medical diagnostics, false negatives may miss disease
in fraud detection, a high false positive rate may overload reviewers
in triage models, ranking high-risk cases correctly may matter more than overall accuracy
This means model evaluation should not begin with:
which metric is standard?
It should begin with:
what kind of mistakes matter most in this setting?
That is why evaluation is not merely technical. It is tied to domain context, cost, and deployment consequences.
Accuracy Is Easy to Understand and Easy to Misuse
Accuracy is the proportion of predictions that are correct.
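A small sketch shows how accuracy can mislead when the outcome is rare. The data are hypothetical (1,000 cases at 5% prevalence), and the "classifier" is a deliberately useless one that labels every case negative.

```r
# Hypothetical data: 1,000 cases, 5% prevalence.
truth <- c(rep(1, 50), rep(0, 950))

# A trivial "classifier" that labels every case negative.
pred_all_negative <- rep(0, length(truth))

# Accuracy: share of predictions that match the truth.
accuracy <- mean(pred_all_negative == truth)

# Recall: share of true positives that were actually flagged.
recall <- sum(pred_all_negative == 1 & truth == 1) / sum(truth == 1)

accuracy  # 0.95
recall    # 0
```

The model is right 95% of the time and finds none of the cases that matter.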
ROC curves are especially useful for understanding classifier discrimination independently of one fixed cutoff.
AUC Summarizes Ranking Performance, Not Calibration
The area under the ROC curve, or AUC, is a summary of discrimination.
It can be interpreted as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case (Hanley and McNeil 1982).
pROC::auc(roc_obj)
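The pairwise interpretation can be checked by hand. The sketch below is a base-R illustration, not the pROC implementation: `pairwise_auc` is a hypothetical helper, and the scores and labels are made up. The trapezoidal area under the ROC curve is equivalent to this concordance statistic.

```r
# Pairwise (concordance) AUC: the share of positive/negative pairs in which
# the positive case has the higher score; ties count as one half.
pairwise_auc <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  diffs <- outer(pos, neg, "-")   # one cell per positive/negative pair
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Hypothetical scores for 3 positives and 4 negatives.
scores <- c(0.9, 0.8, 0.35, 0.7, 0.4, 0.3, 0.1)
labels <- c(1,   1,   1,    0,   0,   0,   0)

pairwise_auc(scores, labels)     # 10 of 12 pairs are concordant
pairwise_auc(scores^2, labels)   # same value: only the ordering matters
```

Squaring the scores changes every predicted value but leaves the ranking intact, so the AUC does not move.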
AUC is helpful because it summarizes threshold-free ranking performance.
But it also has important limitations.
A model can have:
a high AUC but poor calibration
a high AUC but weak precision at clinically relevant thresholds
or a respectable AUC while still being practically unhelpful in deployment
So AUC is useful, but not sufficient by itself.
Precision-Recall Curves Are Often Better for Rare Events
When the positive class is uncommon, precision-recall curves can be more informative than ROC curves (Davis and Goadrich 2006).
Why?
Because precision directly reflects the burden of false positives among predicted positives, which becomes especially important in low-prevalence settings.
In medical screening settings, PR curves are often especially valuable because they reflect the tradeoff between catching cases and flooding the system with false alarms.
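To make the prevalence effect concrete, here is a minimal sketch that holds sensitivity and specificity fixed at 0.90 (an arbitrary choice for illustration) and varies only prevalence. `precision_at` is a hypothetical helper that applies Bayes' rule.

```r
# Precision (positive predictive value) implied by sensitivity, specificity,
# and prevalence, via Bayes' rule.
precision_at <- function(prevalence, sensitivity, specificity) {
  true_pos  <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

prevalence <- c(0.50, 0.10, 0.01)
round(precision_at(prevalence, sensitivity = 0.90, specificity = 0.90), 3)
# 0.900 0.500 0.083
```

The ROC operating point is identical in all three scenarios, yet at 1% prevalence roughly eleven of every twelve alerts are false alarms. That is the information a PR curve surfaces and a ROC curve does not.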
Threshold Choice Should Reflect Clinical or Operational Costs
A classifier does not come with one universally correct threshold.
The appropriate threshold depends on the consequences of different errors.
For example:
in screening, missing a true case may be much worse than flagging a false alarm
in an intensive-care alert system, too many false positives may cause alarm fatigue
in triage, recall may be prioritized over precision
in confirmatory diagnostics, precision may become more important
This is why classification should often be treated as a decision problem, not just a probability-to-label routine.
A model may be strong, but the chosen threshold may still be inappropriate for the use case.
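One way to make this explicit is to attach costs to each error type and pick the threshold that minimizes total cost. The sketch below assumes hypothetical predicted probabilities, outcomes, and cost ratios; `total_cost` and `best_threshold` are illustrative helpers, not functions from any package.

```r
# Total misclassification cost at a given threshold.
total_cost <- function(threshold, prob, truth, cost_fn, cost_fp) {
  pred <- as.integer(prob >= threshold)
  false_neg <- sum(pred == 0 & truth == 1)
  false_pos <- sum(pred == 1 & truth == 0)
  cost_fn * false_neg + cost_fp * false_pos
}

# Pick the candidate threshold with the lowest total cost
# (the first one wins if several tie).
best_threshold <- function(prob, truth, cost_fn, cost_fp,
                           candidates = seq(0.1, 0.9, by = 0.1)) {
  costs <- vapply(candidates, total_cost, numeric(1),
                  prob = prob, truth = truth,
                  cost_fn = cost_fn, cost_fp = cost_fp)
  candidates[which.min(costs)]
}

# Hypothetical predicted probabilities and true outcomes.
prob  <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)
truth <- c(0,    0,    1,    0,    0,    1,    0,    1,    1,    1)

# Screening-style costs: a missed case is 10 times worse than a false alarm.
best_threshold(prob, truth, cost_fn = 10, cost_fp = 1)   # 0.2

# Confirmatory-style costs: a false alarm is 10 times worse than a miss.
best_threshold(prob, truth, cost_fn = 1, cost_fp = 10)   # 0.7
```

The same model and the same predictions yield very different operating points once the costs change.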
Calibration Matters Too, Not Only Discrimination
A model that ranks cases well is not necessarily well calibrated.
Calibration asks:
when the model predicts a probability of 0.70, do about 70% of those cases truly experience the event?
This matters because many real-world decisions depend on the probability itself, not only the ranking.
A highly discriminative model can still be miscalibrated if its probabilities are systematically too high or too low.
That is especially important in healthcare, where probability estimates may guide escalation, counseling, or treatment decisions.
Discrimination and calibration are related, but they are not the same.
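A simulation makes the difference visible. In this sketch the data are simulated, the seed is arbitrary, and the square-root distortion is just one convenient way to inflate probabilities without changing their order. `rank_auc`, `brier`, and `calibration_table` are hypothetical helpers written in base R.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated data: the outcome is drawn from the true risk, so `true_risk`
# is well calibrated by construction.
n <- 5000
true_risk <- runif(n)
outcome <- rbinom(n, size = 1, prob = true_risk)

# A monotone distortion that pushes every probability upward.
inflated_risk <- sqrt(true_risk)

# Rank-based (Mann-Whitney) AUC.
rank_auc <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Brier score: mean squared difference between probability and outcome.
brier <- function(prob, outcome) mean((prob - outcome)^2)

# Mean predicted probability and observed event rate within probability bins.
calibration_table <- function(prob, outcome, breaks = seq(0, 1, by = 0.2)) {
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin            = levels(bin),
    mean_predicted = as.vector(tapply(prob, bin, mean)),
    observed_rate  = as.vector(tapply(outcome, bin, mean))
  )
}

rank_auc(true_risk, outcome)
rank_auc(inflated_risk, outcome)   # same ranking, so the same AUC
brier(true_risk, outcome)
brier(inflated_risk, outcome)      # larger: the probabilities are too high
calibration_table(inflated_risk, outcome)
```

Discrimination is untouched by the distortion, while the Brier score and the binned table both expose the overestimated risks.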
Regression Models Need Metrics Too
Although this post focuses on classification, the broader lesson applies to regression as well.
Common regression metrics include:
mean squared error (MSE)
root mean squared error (RMSE)
mean absolute error (MAE)
$R^2$ (coefficient of determination)
These summarize different aspects of predictive error.
For example:
RMSE penalizes large errors heavily because errors are squared before averaging
MAE is more robust to outliers than RMSE
$R^2$ reflects explained variance, but not necessarily deployment usefulness
The general principle remains the same: the choice of metric should match the real modeling objective.
A Small Regression Metric Example Completes the Picture
To keep that broader perspective visible, here is a quick regression example.
# A tibble: 1 × 4
    mse  rmse   mae r_squared
  <dbl> <dbl> <dbl>     <dbl>
1  2.17  1.47  1.18     0.830
Even here, no single metric fully captures model usefulness. The right one depends on the error structure that matters most.
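For readers who want to compute these four metrics themselves, here is a minimal base-R sketch. The observed and predicted values are hypothetical, so the numbers will not match the output shown above, and `regression_metrics` is an illustrative helper rather than the code that produced that table.

```r
# Hypothetical observed and predicted values.
observed  <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 12)

# Returns one row with the same four columns as the output above.
regression_metrics <- function(observed, predicted) {
  residual <- observed - predicted
  mse <- mean(residual^2)
  data.frame(
    mse       = mse,
    rmse      = sqrt(mse),
    mae       = mean(abs(residual)),
    r_squared = 1 - sum(residual^2) / sum((observed - mean(observed))^2)
  )
}

regression_metrics(observed, predicted)
```

The single large miss (9 versus 12) dominates the squared-error metrics far more than it moves MAE, which is the sensitivity difference described earlier.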
Fair Model Assessment Requires More Than One Metric
A common mistake is to report only the metric that makes the model look best.
That is not robust evaluation.
In most serious applications, analysts should report multiple metrics because models can behave differently across dimensions such as:
discrimination
calibration
threshold-specific error
sensitivity to imbalance
subgroup performance
This is especially important in healthcare and other high-stakes settings, where a model can appear strong overall while performing poorly for a subgroup or on the specific error type that matters most.
Good evaluation is therefore multidimensional.
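Subgroup performance is the dimension most easily hidden by an aggregate number. The sketch below uses a small hypothetical evaluation set with two groups, A and B, and a hypothetical `recall` helper.

```r
# Hypothetical evaluation set: true outcome, predicted label, and subgroup.
eval_data <- data.frame(
  truth = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
  pred  = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  group = rep(c("A", "B"), each = 8)
)

# Recall: share of true positives that were flagged.
recall <- function(truth, pred) sum(pred == 1 & truth == 1) / sum(truth == 1)

overall_recall <- recall(eval_data$truth, eval_data$pred)

# Recompute the same metric separately within each subgroup.
by_group <- split(eval_data, eval_data$group)
recall_by_group <- vapply(by_group, function(d) recall(d$truth, d$pred), numeric(1))

overall_recall    # 0.625
recall_by_group   # A: 1.00, B: 0.25
```

An overall recall of 0.625 looks moderate, yet the model catches every case in group A and only one in four in group B.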
Metrics Should Match Deployment, Not Just Development
One of the most important practical questions is:
how will this model actually be used?
If the model will rank patients for review, AUC may be relevant. If the model will trigger an alert, threshold-based recall and precision matter. If the model will estimate risk for counseling or allocation, calibration becomes central.
This is why analysts should avoid evaluating a model only in the style most convenient for development.
The metrics should reflect the deployment logic.
Otherwise, a model may look impressive in development and still fail operationally.
A Practical Checklist for Applied Work
Before reporting a model’s performance, ask:
Is the outcome balanced or imbalanced?
Does accuracy actually mean anything useful here?
Are precision and recall more relevant than overall correctness?
Have I examined ROC and PR behavior?
Does the metric reflect threshold-free ranking, threshold-based decisions, or calibrated probability estimation?
Would subgroup-specific evaluation reveal hidden weaknesses?
Does the evaluation metric match the real deployment use case?
These questions usually matter more than squeezing out one more decimal point of AUC.
Note: Where This Shows Up in AI/ML
Epic’s sepsis prediction model (formerly the Sepsis Early Warning Tool, now embedded in the Deterioration Index) has been reported in multiple external validations to have AUCs in the 0.74–0.83 range. Calibration analyses at several institutions, however, showed that the model’s stated probabilities were systematically too high: a “60% risk” alert corresponded to observed event rates closer to 20–30%, directly misleading clinicians about how aggressively to escalate. The distinction between discrimination (ranking high-risk patients above low-risk patients) and calibration (outputting accurate absolute probabilities) is the difference between a tool that correctly sorts a triage line and a tool whose numerical output can guide treatment decisions safely. In mortality modeling based on the Department of Defense Trauma Registry (DoDTR), a model with strong AUC but poor calibration is particularly dangerous because downstream resource allocation decisions, such as blood product prepositioning and surgical team activation, depend on the probability number, not just the rank order.
Closing: Good Evaluation Means Measuring What Actually Matters
Model evaluation remains one of the most important parts of machine learning because a model is only useful if its performance is judged in the right way.
Accuracy is simple, but often insufficient. Precision and recall clarify different types of classification success. F1 balances them when both matter. ROC curves and AUC summarize discrimination across thresholds. PR curves become especially important when positives are rare.
And beyond all of these lies the broader principle:
the best metric is the one that matches the real decision problem.
That is why evaluating AI like a biostatistician is so valuable. It keeps the focus on consequences, not just scores.
Metrics matter because every performance number quietly encodes a judgment about what kind of model error we are willing to live with.
This post is part of the Calibration Toolkit — a companion reference with confusion matrix templates, ROC and PR curve code, calibration plot scaffolds, and threshold selection guidance for clinical prediction models.
Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” Proceedings of the 23rd International Conference on Machine Learning, 233–40. https://doi.org/10.1145/1143844.1143874.
Hanley, James A., and Barbara J. McNeil. 1982. “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.” Radiology 143 (1): 29–36. https://doi.org/10.1148/radiology.143.1.7063747.
Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. 2nd ed. Springer.
Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.
Vickers, Andrew J., and Elena B. Elkin. 2006. “Decision Curve Analysis: A Novel Method for Evaluating Prediction Models.” Medical Decision Making 26 (6): 565–74. https://doi.org/10.1177/0272989X06295361.