Scope and Setup — Liam Cawley

The goal of this project is to determine whether a straightforward ML pipeline, trained on publicly available features, can outperform a passive equity benchmark after accounting for transaction costs. The setting is deliberately conventional — daily-frequency data on large-cap US equities — so that any signal (or lack thereof) is not attributable to exotic market structure or thin liquidity.

This thread will serve as a running log. Each post corresponds to a stage of the project.

Constraints

Universe: S&P 500 constituents, using point-in-time membership lists to avoid survivorship bias.
Horizon: daily rebalancing, with a target holding period of 5–20 days.
Benchmark: SPY buy-and-hold over the same evaluation window.
Cost model: 5 basis points per side. No market impact modeling; position sizes are assumed small enough to justify this.
No lookahead: all features are computed from strictly past data. The train/validation/test split is temporal.

Data

The pipeline draws from three sources:

Price and volume: adjusted closes from Yahoo Finance via yfinance.
Fundamentals: quarterly EPS, book value, and related items from SEC EDGAR bulk downloads.
Macro indicators: Fed funds rate, VIX, and the 10y–2y yield curve slope from FRED.

These are stored in a local DuckDB instance, one table per source, joined on (ticker, date) at query time.

Initial feature set

The first iteration uses a deliberately minimal feature set:

Returns over 1, 5, 20, and 60 trading days
20-day rolling standard deviation of returns
20-day average volume, and the ratio of current volume to that average
RSI(14)
A binary indicator for whether the 50-day moving average exceeds the 200-day
Most recent quarterly earnings surprise (actual minus consensus)

This gives roughly 12 features per ticker per day. The intent is to establish a working end-to-end pipeline and a credible evaluation framework before introducing more complex features.

The following post covers the choice of model and the backtesting procedure.

Constraints

Data

Initial feature set

Next