One of the hardest parts of training a sports model on UFC is the data drift. In 2016 the Vegas odds were only 61% accurate; by 2024 they were around 70% accurate. Right now we're seeing a small decline in Vegas accuracy again. This is all part of the normal ebb and flow of variance, layered on top of the general unpredictability of sports, and of UFC especially.
For the latest version of the model I spent a ton of time generalizing it so that I'm not overfitting to the specific circumstances of any small window in time. These are the parameters I settled on after hundreds of experiments:
train_size = 0.75
val_size = 0.15
test_size = 0.1
n_splits = 4
num_stack_levels = 2
use_recency_weights = True
use_bag_holdout = True # Must be true if we're using tuning_data (val split)
num_bag_sets = 2
decay_rate = 0.13
shuffle = True
start_date = '2014-04-01'
calibrate = True
Breaking Down the AutoGluon Parameters
Let me explain each of these settings in simple terms (a sketch of how they plug into an actual AutoGluon fit call follows the list):
- train_size/val_size/test_size: We use 75% for training, 15% for calibration validation, and 10% for final testing—larger validation than test because calibration requires substantial data to work properly.
- n_splits: Cross-fold validation uses 4 splits on the training data to ensure robust model selection.
- num_stack_levels: AutoGluon stacks models in 2 layers, where second-layer models learn from first-layer predictions.
- use_recency_weights: More recent fights are weighted heavier in training to capture current fighting trends.
- decay_rate: At 0.13, the earliest fight from 2014 weighs about 0.5 while the latest fight weighs about 1.5—making recent fights 3x more important.
- use_bag_holdout/num_bag_sets: Creates multiple model bags with holdout validation for better ensemble diversity.
- shuffle: Randomizes training order to prevent temporal bias during model training.
- calibrate: Applies post-hoc probability calibration to improve confidence score reliability.
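To make this concrete, here's roughly how those settings plug into an AutoGluon TabularPredictor call. Treat it as a sketch: the file and column names (`fights.csv`, `fight_date`, `winner`) are placeholders, and the recency-weight formula shown is just one plausible way to get a decay controlled by `decay_rate`, not necessarily the exact curve described above.

```python
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor

# Config values from above
start_date, decay_rate = "2014-04-01", 0.13
train_size, val_size = 0.75, 0.15
n_splits, num_bag_sets, num_stack_levels = 4, 2, 2
use_recency_weights, use_bag_holdout, calibrate, shuffle = True, True, True, True

# Illustrative columns -- the real feature set is far bigger than this.
df = pd.read_csv("fights.csv", parse_dates=["fight_date"])
df = df[df["fight_date"] >= start_date].sort_values("fight_date")

# One plausible recency-weight scheme (exponential decay away from the most
# recent fight); the exact formula I use differs in the details.
if use_recency_weights:
    years_ago = (df["fight_date"].max() - df["fight_date"]).dt.days / 365.25
    df["recency_weight"] = np.exp(-decay_rate * years_ago)

# Chronological 75/15/10 split.
n = len(df)
train_end, val_end = int(n * train_size), int(n * (train_size + val_size))
train_df, val_df, test_df = df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

if shuffle:
    train_df = train_df.sample(frac=1, random_state=42)  # shuffle the training rows

predictor = TabularPredictor(
    label="winner",  # hypothetical target column
    eval_metric="log_loss",
    sample_weight="recency_weight" if use_recency_weights else None,
)

predictor.fit(
    train_data=train_df.drop(columns=["fight_date"]),
    tuning_data=val_df.drop(columns=["fight_date"]),  # val split used for tuning/calibration
    use_bag_holdout=use_bag_holdout,  # required when passing tuning_data with bagging
    num_bag_folds=n_splits,           # the 4-way cross-fold validation
    num_bag_sets=num_bag_sets,
    num_stack_levels=num_stack_levels,
    calibrate=calibrate,
)
```

The main gotcha is the one noted in the config: once you pass tuning_data with bagging enabled, AutoGluon wants use_bag_holdout=True.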
For a long time I was doing cross-fold validation on 90% of the data, with the last 10% (about one year of the most recent fights) left unseen as an impartial test of how well the model generalizes to future fights.

Cross-validation ensures robust model evaluation by training and testing on different data splits, helping prevent overfitting and providing more reliable performance estimates.
The Evolution of Validation Strategy
The reason for this was that I wanted to maximize the amount of recent fight data available for predicting near-future results. As the years have gone by and I've gained a better overhead view of the general variance in this sport, I'm leaning toward the belief that an architecture built with generalization in mind will perform better over the long term than one that squeezes the most out of the latest fights. But like everything else in this journey, I'll experiment with this strategy and possibly revert depending on how things go in testing and in the real world.
Once I started working with calibration again, I had to go back to a train/val/test setup: we do CFV on the train set, calibrate the predictions on val, and get an unbiased evaluation of the model from the unseen test set.
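In practice that just means splitting by date instead of at random. A bare-bones version (the column name and exact boundaries are illustrative):

```python
import pandas as pd

df = pd.read_csv("fights.csv", parse_dates=["fight_date"])

# Date-based split: cross-fold validation happens inside the training window,
# calibration uses the validation window, and the test window stays untouched.
train_df = df[df["fight_date"] < "2024-01-01"]                                        # ~2014-2023
val_df = df[(df["fight_date"] >= "2024-01-01") & (df["fight_date"] < "2025-01-01")]   # 2024
test_df = df[df["fight_date"] >= "2025-01-01"]                                        # 2025 onward
```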
So 2014-2023 is training, 2024 is calibration, and 2025 is test (more or less). Over hundreds of tweaks to the parameters above, this final set is showing the most consistency. I'm careful not to simply pick the model with the best validation and test results, because that is its own form of overfitting: you end up tuning parameters to maximize performance on those particular splits. Instead, I make sure there isn't too large a gap between validation and test performance, and that training performance isn't wildly higher than both. Results:
Training accuracy: 0.7511
Training log loss: -0.5082
Test accuracy: 0.7008
Test log loss: -0.5949
Val accuracy: 0.7072
Val log loss: -0.5918
Very nice: sub-0.6 log loss (AutoGluon reports it as a negative, higher-is-better score, hence the minus signs above) and >70% accuracy on both the validation set and the unseen test set. Just to make sure we're not overfitting, I started running experiments with the same parameters but with the training data cut off at earlier dates: how would these parameters perform if the model didn't have access to the last 6 months of data, or the last 12?
The P-Hacking Trap
Same parameters using data with a cutoff of 6 months ago:
Training accuracy: 0.7874
Training log loss: -0.4849
Test accuracy: 0.7149
Test log loss: -0.5785
Val accuracy: 0.6542
Val log loss: -0.6044
Reasonable. So I feel pretty good about the generalizability of the parameters and the training method I'm using right now. I could p-hack the shit out of this, though. For those unfamiliar, p-hacking is a term from statistics for "massaging" the data or the process so it looks like you're measuring a metric in a reasonable way, when in fact you're just tweaking small details until the final benchmark metric, like accuracy or log loss, is maximized. It makes you look smart but doesn't reflect real-world performance. For instance, what happens if I make a tiny change like moving the initial data cutoff forward by three months, from 2014-01-01 to 2014-04-01?
Training accuracy: 0.8460
Training log loss: -0.4122
Val accuracy: 0.7056
Val log loss: -0.5895
Test accuracy: 0.7121
Test log loss: -0.5864
Wowie weewah, I improved the validation and test set log loss AND accuracy! I'm a motherfucking genius. Except this kind of improvement is unlikely to generalize to real-world gains in accuracy or log loss. Look at the training accuracy: it jumped by about 6%, widening the gap between the training accuracy/log loss and the val/test numbers. That's a negative "improvement," and it likely means the model is a bit more overfitted to the historical data.
You could argue this doesn't matter much since accuracy and log loss on the unseen fight data improved, and you wouldn't necessarily be wrong. But herein lies the difficulty of machine learning: you're always backtesting, and there's no way to look into the future and see how the model will actually perform in the real world. All you can do is decipher clues, and the clue that stands out to me is that a giant 6% leap in training accuracy produced an almost 15% gap between training accuracy and val/test accuracy. That's not good, because we will never see 85% accuracy on unseen data. Ever. We are straying further from Jesus with this minor change. Most likely we just eliminated a chunk of high-variance fights where a bunch of unpredicted big underdogs won, but by cutting those fights we may have harmed the model's ability to recognize patterns in the high-variance fights that are still likely to occur in the future, and with it the model's ability to generalize.
Why Not Include Odds as Features Anymore?
Because the accuracy of the odds varies wildly compared to more predictable sports like MLB, soccer, or the NFL. Again, the odds were 61% accurate in 2016 yet 70% accurate in 2024. Including the odds essentially makes the model subject to the whims of Vegas rather than concretely generalizable over the long term. Second, since we have such highly engineered features, including the odds barely increases the accuracy of the model, although it does improve the log loss quite a bit. All of my odds-included models are negative ROI on down-the-line AI predictions because they so heavily favor betting the Vegas favorite. Odds are still useful as a secondary measurement of risk-adjusted returns on favorites, but they don't add much to generalized risk-adjusted returns across all predictions.
Betting Strategy Performance Analysis
The real test of any model isn't just accuracy; it's profitability. Here's how different betting strategies performed over the latest backtest period, starting with $1,000 and betting $10 per pick. The key metrics are defined below, with a short sketch of how to compute them after the list:
Key Performance Metrics Explained
- ROI (%): Return on investment, measuring profit as a percentage of total amount wagered.
- Sharpe (ann.): Risk-adjusted return that accounts for volatility—higher values indicate better risk-adjusted performance.
- Sortino (ann.): Similar to Sharpe but only penalizes downside volatility, focusing on harmful risk.
- CAGR (%): Compound Annual Growth Rate, showing how fast your bankroll would grow annually.
- Max DD (%): Maximum drawdown, the largest peak-to-trough decline in bankroll value.
- Calmar: CAGR divided by maximum drawdown, measuring return per unit of downside risk.
- PF: Profit Factor, the ratio of gross profits to gross losses.
- ROI-Sharpe: A custom metric combining ROI and Sharpe ratio for overall strategy evaluation.
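If you want to sanity-check numbers like these yourself, the calculations are straightforward. Here's a rough sketch using made-up per-bet profits; the annualization factor (bets per year) is my own assumption rather than a standard, and CAGR/Calmar would additionally need the calendar length of the backtest.

```python
import numpy as np

# Hypothetical per-bet profits (in dollars) from a $10 flat-bet backtest:
# a win returns the net profit at the listed odds, a loss costs the full $10 stake.
profits = np.array([8.5, -10.0, 12.0, -10.0, 4.2, 15.0, -10.0, 6.8])
stake, start_bankroll = 10.0, 1000.0
bets_per_year = 250  # rough annualization assumption for a typical UFC slate

roi = profits.sum() / (stake * len(profits)) * 100              # profit as % of total wagered

returns = profits / stake                                       # per-bet returns on stake
sharpe_ann = returns.mean() / returns.std(ddof=1) * np.sqrt(bets_per_year)

downside_dev = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))  # only penalize losses
sortino_ann = returns.mean() / downside_dev * np.sqrt(bets_per_year)

bankroll = np.concatenate(([start_bankroll], start_bankroll + profits.cumsum()))
peak = np.maximum.accumulate(bankroll)
max_dd = ((peak - bankroll) / peak).max() * 100                 # largest peak-to-trough decline

profit_factor = profits[profits > 0].sum() / -profits[profits < 0].sum()

print(f"ROI {roi:.2f}% | Sharpe {sharpe_ann:.2f} | Sortino {sortino_ann:.2f} | "
      f"Max DD {max_dd:.2f}% | PF {profit_factor:.2f}")
```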
Backtest Summary (2024-08-03 to Present)
Test Period: Starting from August 3, 2024 with $1,000 initial bankroll and $10 bet size
🏆 Best Overall: ai_all_picks_sevenday
ROI: 10.87% | Sharpe: 2.11 | Final Bankroll: $1,287.02
📊 Closing Odds Strategy
ROI: 9.51% | Sharpe: 1.83 | Final Bankroll: $1,250.98
🎯 Edge Threshold Strategy
ROI: 3.68% | Sharpe: 0.47 | Final Bankroll: $1,086.01
Complete Performance Data: For detailed metrics including CAGR, maximum drawdown, Sortino ratios, and more, download the full backtest results:
Model Calibration Analysis
Beyond profitability metrics, it's crucial to understand how well our AI model's confidence levels align with actual outcomes. A calibration curve shows whether the model's predicted probabilities match real-world frequencies—if the AI says a fighter has a 70% chance of winning, do they actually win about 70% of the time?

Calibration curve showing the relationship between predicted probabilities and actual outcomes. With predicted probability on the x-axis and actual win frequency on the y-axis, a perfectly calibrated model follows the diagonal line; points below the line indicate overconfidence (the model predicts more wins than actually happen) and points above it indicate underconfidence.
The calibration curve reveals how reliable our model's confidence estimates are across different probability ranges. This is essential for betting strategies because misaligned confidence can lead to poor risk assessment—betting too heavily on picks the model is overconfident about, or missing value when the model is underconfident.
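If you want to build the same kind of plot from your own predictions, scikit-learn's `calibration_curve` does the binning for you. A minimal sketch with placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Placeholder data: y_prob = model's predicted win probabilities, y_true = actual outcomes.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, size=500)
y_true = rng.binomial(1, y_prob)   # simulated outcomes from a perfectly calibrated model

frac_wins, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(mean_pred, frac_wins, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted win probability")
plt.ylabel("Actual win frequency")
plt.legend()
plt.show()
```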
Key Insights from the Strategy Analysis
Lines generally move toward our picks: the fact that the seven-day strategy consistently outperforms closing-odds betting is a GREAT sign that the model is a sharp picker. When books adjust lines toward our picks by fight night, betting early captures the better price, which validates our edge.
Risk-adjusted returns matter most: For gamblers, the most important metric isn't ROI, accuracy, logloss, or even expected value alone—it's risk-adjusted returns. You want sustainable profit without excessive volatility that can bankrupt you during inevitable losing streaks.
Based on the comprehensive analysis, the optimal strategy appears to be betting all AI picks as long as the AI's win probability is within 16 percentage points of the Vegas implied probability for that fighter. For example, if the AI picks fighter1 at 70% and Vegas implies 84% for fighter1, the gap is 14%, so we still bet fighter1. Beyond the 16% threshold, we should consider placing half-unit bets AGAINST the fighter the AI picked, as sketched below.
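In code, the rule is simple: convert the Vegas line to an implied probability (1/decimal odds here, ignoring the vig for simplicity), compare it to the AI's probability for the same fighter, and flip to a half-unit fade only past the threshold. The function and variable names below are illustrative, not from my actual pipeline.

```python
def implied_prob(decimal_odds: float) -> float:
    """Implied win probability from decimal odds (no vig removal, for simplicity)."""
    return 1.0 / decimal_odds

def pick_bet(ai_prob_pick: float, decimal_odds_pick: float, threshold: float = 0.16):
    """Decide the side and bet size, given the AI's probability and odds for its own pick."""
    gap = implied_prob(decimal_odds_pick) - ai_prob_pick
    if gap <= threshold:
        return "ai_pick", 1.0      # within the threshold: bet the AI's pick at a full unit
    return "opponent", 0.5         # AI is far below the market: half-unit against its pick

# The example above: AI says 70%, Vegas implies ~84% (decimal odds around 1.19).
print(pick_bet(0.70, 1.19))        # -> ('ai_pick', 1.0); a 14% gap is inside the threshold
```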
Risk-Adjusted Strategy Performance
First, thanks to the community for helping me learn here (Shoutout to Jordan C. and the rest of you). To better understand the relationship between risk and reward across different betting strategies, we've analyzed the top 20 performing strategies based on their risk-adjusted returns. The scatter plot below shows each strategy's return on investment (ROI) versus its Sharpe ratio, which measures risk-adjusted performance.

Scatter plot showing ROI vs Sharpe ratio for the top 20 betting strategies. Strategies in the upper-right quadrant offer the best combination of high returns and good risk-adjusted performance.
The ideal strategies appear in the upper-right quadrant, offering both high ROI and strong risk-adjusted returns (high Sharpe ratio). This visualization helps identify strategies that not only generate profit but do so with manageable risk and volatility. Most of the "bet against the AI pick based on (AI win% minus Vegas implied probability)" strategies have a cutoff of about 16-20%: the right time to profitably bet against the AI is when its win probability is at least 16-20% lower than the Vegas implied probability, and even then it's probably not worth more than half a unit. The model appears to be very good at picking underdogs, hitting at roughly a 50% rate, but its general ability to estimate win probability is still not as good as Vegas, which means betting solely on +EV isn't necessarily the best strategy.
Conclusion and Call for Feedback
Two important points:
1) Please be critical of this and send me unfiltered opinions and help. I have nowhere near as much experience in picking betting strategies as I do in training models. The machine learning side I've got down, but translating statistical edges into optimal betting strategies is where I need the most improvement.
2) The data suggests our best approach is the nuanced edge-threshold strategy rather than blindly following all AI picks or only +EV selections. This makes intuitive sense—we trust the model most when it aligns reasonably well with market expectations, but we hedge when there's significant disagreement.
As always, this is an ongoing experiment. The beauty and curse of sports prediction is that the landscape constantly shifts. What works today might not work tomorrow, which is why building robust, generalizable systems matters more than chasing short-term optimization.
Get in Touch: The best place to reach me is on Patreon, where the community shares free UFC predictions and discusses model development. You don't need to subscribe to participate in the chats and direct messages; it's the most active community for discussing these predictions and sharing feedback. I'd love to hear your thoughts on the strategies and any suggestions for improvement!