Evaluation report · out-of-sample comparison

Model Performance

This page explains the model comparison behind Betting Lab. We tested several model families on historical match outcomes and compared the quality of their predicted probabilities. The methodology is documented here so that the live lab can stay focused on model-vs-market edge rather than mixing evaluation details into the betting workflow.

ROC-AUC by model [chart not reproduced]

Rankings [not reproduced]

What we tested

Extra Trees / Random Forest: tree ensembles used to catch nonlinear interactions between team form, market-like team strength, roster inputs, and sport-specific tabular features.
Logistic L2: a regularized linear baseline. If this wins, the signal is mostly smooth and additive rather than heavily interaction-driven.
Hist GB: histogram gradient boosting, useful for nonlinear tabular signals with compact training time.
MLP: a neural tabular baseline that tests whether dense feature interactions beat the tree and linear baselines.
Graph Role / GraphRL: graph-style role features and policy-inspired features that represent matchup structure instead of only flat team rows.
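
As an illustration of the simplest family in the list above, here is a minimal sketch of an L2-regularized logistic model fitted by batch gradient descent. It is not the code behind this report: the feature names (form_diff, strength_diff), the toy rows, and the hyperparameters are all hypothetical, and a real pipeline would normally use a library implementation. The point is only to show why this baseline is "smooth and additive": each feature contributes one weighted term to the log-odds, and the penalty shrinks those weights toward zero.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic_l2(rows, labels, l2=0.1, lr=0.1, epochs=500):
    """Batch gradient descent on mean log loss + (l2 / 2) * ||w||^2.

    The bias term is left unpenalized.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Win probability is a sigmoid of an additive score: no interaction terms.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical toy rows: [form_diff, strength_diff]; label 1 = home win.
X = [[0.8, 0.5], [0.6, 0.9], [0.3, 0.2], [-0.2, -0.4], [-0.7, -0.1], [-0.9, -0.8]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic_l2(X, y)
p_strong = predict_proba(w, b, [0.7, 0.6])    # clearly stronger home side
p_weak = predict_proba(w, b, [-0.7, -0.6])    # clearly weaker home side
```

Raising l2 pulls the weights closer to zero and the predicted probabilities closer to 0.5, which is the trade-off the regularized baseline is meant to expose.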

How to read the metrics

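ROC-AUC: ranking quality. It is the probability that a randomly chosen winning side was given a higher win probability than a randomly chosen losing side; 0.5 is chance level and 1.0 is a perfect ranking.
Accuracy: share of winners picked correctly after converting each probability to a pick. Helpful, but it ignores how confident the model was, so it is less informative than the probability metrics.
Brier score: mean squared error between the predicted probability and the 0/1 outcome. Lower is better. It penalizes confident wrong probabilities most heavily, so it reflects calibration as well as ranking.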
Realism score: a simulation sanity score used internally to check whether the model stack produces believable sport-specific outcomes and distributions.
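
The first three metrics above can be sketched in a few lines of plain Python. This is an illustrative sketch using the standard definitions, not the evaluation code behind this report; the labels and probabilities are made-up toy values, and the internal realism score is not reproduced because its definition is not given here.

```python
def roc_auc(y_true, p):
    """Share of (winner, loser) pairs where the winner got the higher probability.

    Ties count as half a win.
    """
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, p, threshold=0.5):
    """Hit rate after turning each probability into a 0/1 pick."""
    picks = [1 if pi >= threshold else 0 for pi in p]
    return sum(int(a == b) for a, b in zip(picks, y_true)) / len(y_true)

def brier(y_true, p):
    """Mean squared error between probability and outcome; lower is better."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)

# Toy data (hypothetical): 1 = the side we scored actually won.
y_true = [1, 0, 1, 1, 0, 0]
p_model = [0.70, 0.40, 0.55, 0.80, 0.60, 0.20]

# Same ordering of games, but pushed toward 0 and 1 (overconfident).
p_over = [0.97, 0.05, 0.90, 0.99, 0.95, 0.01]

auc_model, auc_over = roc_auc(y_true, p_model), roc_auc(y_true, p_over)
acc_model, acc_over = accuracy(y_true, p_model), accuracy(y_true, p_over)
brier_model, brier_over = brier(y_true, p_model), brier(y_true, p_over)
```

Because p_over keeps the same ordering of games as p_model, the two share the same ROC-AUC and accuracy, yet the overconfident version has the worse Brier score. That is the gap between ranking quality and probability quality that these metrics are meant to expose.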

Data windows

Sport	Rows	Window	Source frame
EPL	2,561	2019-2026	EPL match-level history joined with FPL-style team/player context.
NBA	10,835	2018-2026	NBA tabular feature file with game outcomes, team context, and roster-derived features.
NFL	7,276	1999-2026	NFL tabular features covering a longer historical window because season sample sizes are smaller.
NHL	11,266	2018-2026	NHL tabular features with a wide column set, including team and game-context signals.
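
This report does not spell out how the out-of-sample split was made. One common way to get an out-of-sample comparison from windows like these is to train on earlier seasons and evaluate on the most recent ones. The sketch below assumes that approach; the "season" and "home_win" field names, the generated rows, and the choice of two holdout seasons are all hypothetical.

```python
def chronological_split(rows, holdout_seasons):
    """Train on earlier seasons, evaluate on the latest `holdout_seasons`.

    Splitting by time (rather than at random) keeps future games out of
    the training set.
    """
    seasons = sorted({r["season"] for r in rows})
    if not 0 < holdout_seasons < len(seasons):
        raise ValueError("holdout_seasons must leave at least one training season")
    cutoff = seasons[-holdout_seasons]
    train = [r for r in rows if r["season"] < cutoff]
    test = [r for r in rows if r["season"] >= cutoff]
    return train, test

# Hypothetical rows: three games per season over a 2019-2026 window.
rows = [
    {"season": s, "home_win": (s + g) % 2}
    for s in range(2019, 2027)
    for g in range(3)
]

train, test = chronological_split(rows, holdout_seasons=2)
latest_train_season = max(r["season"] for r in train)
earliest_test_season = min(r["season"] for r in test)
```

With a longer window such as the NFL one, the same function simply leaves more seasons on the training side.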

What this means for Betting Lab

Betting Lab should not blindly pick the model with the highest historical AUC for every sport. The live page uses model probability, market probability, expected value (EV), and the Kelly preview together. A model can rank games well and still need calibration checks before its probabilities are used for stake sizing. That is why this report separates evaluation from the live betting workflow.
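
To make the four quantities concrete, here is a minimal sketch of how they relate for a single bet. It uses the textbook formulas for decimal odds and is not the live page's implementation. The odds, the model probability, the proportional margin-removal step, and the half-Kelly scale are all illustrative assumptions.

```python
def implied_probs(decimal_odds):
    """Market probabilities with the bookmaker margin removed.

    Raw 1/odds values sum to more than 1; normalising them is one simple
    (assumed) way to strip the margin.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]

def expected_value(p_win, decimal_odds):
    """Expected profit per unit staked: p * (odds - 1) - (1 - p)."""
    return p_win * decimal_odds - 1.0

def kelly_fraction(p_win, decimal_odds, scale=1.0):
    """Kelly stake as a fraction of bankroll, floored at zero.

    `scale` below 1.0 gives fractional Kelly, a common hedge against
    miscalibrated probabilities.
    """
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    full = (p_win * decimal_odds - 1.0) / (decimal_odds - 1.0)
    return max(0.0, full) * scale

# Hypothetical two-way market: home at 2.10, away at 1.80.
p_market_home, p_market_away = implied_probs([2.10, 1.80])
p_model_home = 0.52  # hypothetical model probability

edge = p_model_home - p_market_home
ev = expected_value(p_model_home, 2.10)
full_kelly = kelly_fraction(p_model_home, 2.10)
half_kelly = kelly_fraction(p_model_home, 2.10, scale=0.5)
no_bet = kelly_fraction(0.40, 2.10)  # negative EV, so the stake is zero
```

Both EV and the Kelly fraction take the model probability at face value, so an overconfident model inflates the suggested stake even when its ranking of games is good.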

Current read: NBA and EPL show the strongest predictive separation. NFL is usable but more sensitive to season context and sample size. NHL has the weakest AUC, so NHL edges should be treated more conservatively until its calibration improves.