First Model and Backtest
March 10, 2025This post covers the first complete model-and-backtest cycle. The results are not particularly encouraging, which is expected at this stage; the more useful output is a working evaluation framework and a catalog of mistakes made during the first pass.
Model
The model is an XGBoost binary classifier. The target variable is whether a stock's 10-day forward return exceeds the cross-sectional median on that date. Classification was chosen over regression because the signal-to-noise ratio in individual return prediction is very low, and a relative-ranking formulation is somewhat more forgiving.
Hyperparameters for this iteration:
params = {
"max_depth": 4,
"learning_rate": 0.05,
"n_estimators": 300,
"subsample": 0.8,
"colsample_bytree": 0.8,
}The training window is expanding, starting from 2 years of history, with monthly retraining.
Backtest procedure
Walk-forward evaluation: train on all available data up to month , generate predictions for month , advance. Each day, stocks are ranked by predicted outperformance probability, and the top decile is held long with equal weighting. Rebalancing is daily, though in practice most positions carry over; average daily turnover is around 15%.
Metrics tracked:
- Cumulative return relative to SPY
- Annualized Sharpe ratio (excess over risk-free rate)
- Maximum drawdown
- Realized transaction costs
Results (2020-01-01 to 2024-12-31)
| Metric | Strategy | SPY |
|---|---|---|
| Annualized return | 14.2% | 12.8% |
| Sharpe | 0.71 | 0.62 |
| Max drawdown | -28.1% | -33.7% |
| Avg daily turnover | 14.8% | — |
| Return after costs | 12.9% | 12.8% |
Before transaction costs, the strategy shows a modest edge over the benchmark. After the 5 bps-per-side cost model is applied, the advantage is negligible. This is not surprising for a first iteration with generic features, but it establishes that the pipeline is functional and the evaluation is, to the extent I can verify, free of lookahead bias.
Errors encountered
Three mistakes were caught and corrected during development:
- Feature leakage: the initial feature matrix inadvertently included the current day's closing price as a predictor. This produced a Sharpe ratio of approximately 2.3, which prompted investigation.
- Target leakage: the median used to define the binary target was initially computed over the full dataset rather than per-date, introducing a subtle form of lookahead.
- Execution timing: returns were computed close-to-close, but the backtest assumed trades could be executed at the closing price. Switching to next-open execution reduced annualized return by roughly 1%.
Next
The most straightforward path to improvement is the feature set. The next post will consider less standard features — order flow proxies, sector-relative momentum — and whether they produce a measurable change in out-of-sample performance.