Backtesting: how to test trading strategies properly
4/25/2026 · R. B. Atai
Backtesting often looks like the most convincing stage of trading strategy development: load historical data, run the entry and exit rules, get an equity curve, calculate return and drawdown. If the chart rises, it is tempting to think the strategy has been proven.
In practice, a backtest proves much less. It shows how a specific set of rules would have behaved on a specific version of the past, under specific assumptions about data, fees, execution and available liquidity. That is a useful engineering test of a hypothesis, but not a guarantee of future trading results. A good backtest is not there to make past returns look attractive; it is there to expose fragility before capital is at risk.
What backtesting is
Backtesting is the replay of trading logic on historical data. In its simplest form, it asks: what would have happened if the strategy had made decisions according to predefined rules in the past?
But a proper backtest is more than a signal replayed on prices. It includes:
- the trading universe: which instruments the strategy could see at each point in time;
- data: candles, trades, order book, fundamental events, fees, trading calendar;
- position rules: size, leverage, stop, rebalance, risk limits;
- execution model: at what price and with what delay an order is considered filled;
- calculation log: data version, parameters, run date, metrics and exceptions.
If one of these layers is replaced by a convenient assumption, the result quickly becomes a research illustration rather than a test. For example, a strategy on daily candles may look robust if it enters at the closing price of the same bar that generated the signal. But if the signal is only known after the close, that trade is already using future information.
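The bar-alignment issue above can be made concrete. A minimal sketch in pandas, using made-up prices: the only difference between the two PnL series is a one-bar lag on the signal, but that lag is the difference between trading on information you had and information you did not.

```python
import pandas as pd

# Hypothetical daily closes; in practice these come from your data store.
closes = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0],
                   index=pd.date_range("2024-01-01", periods=5))

# Signal computed from the close of bar t (e.g. "price rose vs. yesterday").
signal = (closes > closes.shift(1)).astype(int)
returns = closes.pct_change()

# Look-ahead version: the signal "captures" the return of the very bar
# that produced it -- impossible if the signal is only known at the close.
lookahead_pnl = (signal * returns).dropna()

# Honest version: a signal known at the close of bar t can earn at most
# the return of bar t+1, so lag the signal by one bar before applying it.
honest_pnl = (signal.shift(1) * returns).dropna()
```

On this toy series the look-ahead version is profitable and the honest one is not, which is exactly the kind of gap the alignment bug hides.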
In-sample and out-of-sample
One basic defense against self-deception is splitting history into in-sample and out-of-sample periods.
In-sample is the part where the researcher formulates the idea, tunes parameters, compares filters and discards weak configurations. Out-of-sample is a separate segment that the strategy should not have seen during selection. Its purpose is not to check how well the model memorized the past, but whether its behavior survives on new data.
The problem is that the formal split is not enough by itself. If the researcher repeatedly returns to the out-of-sample period, reviews the result, changes parameters and tests the same period again, that segment gradually becomes a second in-sample. Serious research therefore needs experimental discipline: define the hypothesis in advance, limit the number of trials and keep a history of runs.
For strategies that are sensitive to market regimes, a walk-forward approach is often used: parameters are estimated on one window, tested on the next, and then the window is shifted. This does not make the result true, but it better imitates reality, where the strategy always acts with limited history and no access to the future.
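The windowing logic of a walk-forward run can be sketched in a few lines. This is a minimal illustration with made-up window sizes, not a full framework: each yielded pair is a fit window and the disjoint test window that immediately follows it.

```python
def walk_forward_windows(n_bars, train, test):
    """Yield (train_range, test_range) index pairs for a walk-forward run.

    Parameters are fitted on `train` bars, evaluated on the next `test`
    bars, then the whole window slides forward by `test` bars so that
    every bar is tested at most once and never used for its own fit.
    """
    start = 0
    while start + train + test <= n_bars:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test

# Example: 10 bars, fit on 4, test on 2 -> three non-overlapping test windows.
windows = list(walk_forward_windows(10, train=4, test=2))
```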
Overfitting
Overfitting is a situation where a strategy is fitted too closely to a particular historical sample and therefore transfers poorly to new data. In modeling terms, it has learned not a stable pattern, but the peculiarities of the period it has already seen: random spikes, a one-off trend, a specific sequence of events, local liquidity anomalies.
In backtesting, this usually does not happen through one obvious mistake, but through a series of small improvements. The researcher changes an indicator period, adds a volatility filter, chooses a different stop, excludes inconvenient hours, changes the instrument set, compares dozens of variants and keeps the one with the best Sharpe, the smaller drawdown or the prettiest equity curve. Each step may look rational, but together they can turn research into fitting a key to a history that is already known.
Importantly, overfitting is not the same as rare trading. A strategy may trade infrequently and still be valid if its logic is genuinely designed for rare events. The problem is not the number of trades by itself, but the fact that after many trials one can select a variant that happened to match the past almost perfectly. Such a backtest answers the question "what best fit this history", not "what is likely to survive outside it".
Bailey, Borwein, López de Prado and Zhu describe this problem as the probability of backtest overfitting: the more strategies and parameters are tested on the same history, the higher the chance of selecting a statistical illusion that will not survive live trading. [1] That is why it is important to look not only at the best result, but also at the distribution of results around it. If a strategy works only within a narrow parameter range and neighboring values quickly break the PnL, that is a sign of fragility, not tuning precision.
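The neighborhood check described above can be automated. A sketch with hypothetical Sharpe ratios (this is a heuristic fragility flag, not the formal probability-of-overfitting test from the cited paper): a parameter value whose immediate neighbors lose most of the score is treated as suspect.

```python
def neighborhood_fragility(results, best_param, tolerance=0.5):
    """Flag a parameter choice as fragile if an adjacent parameter value
    loses more than `tolerance` (as a fraction) of the best score.

    `results` maps a scalar parameter value -> backtest score (e.g. Sharpe).
    """
    params = sorted(results)
    i = params.index(best_param)
    neighbors = [results[params[j]] for j in (i - 1, i + 1)
                 if 0 <= j < len(params)]
    best = results[best_param]
    return any(score < best * (1 - tolerance) for score in neighbors)

# Hypothetical Sharpe ratios across an indicator period.
scores = {10: 0.3, 20: 0.4, 30: 2.1, 40: 0.2, 50: 0.1}
best = max(scores, key=scores.get)              # period 30 looks great...
fragile = neighborhood_fragility(scores, best)  # ...but its neighbors collapse
```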
Look-ahead bias and survivorship bias
Two errors are especially dangerous because they often do not show up directly in the final equity curve.
Look-ahead bias is the use of information that was not yet available at the time of the trade. It is not limited to the obvious case of "knowing tomorrow's price". It is enough to use the daily high/low for an intraday decision, use fundamental data revised later, apply the final index membership list to earlier periods, or normalize features over the full history at once. The result may look clean, but the strategy is trading with access to the future.
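The feature-normalization case is worth spelling out, because it looks harmless. A minimal sketch with made-up prices: full-sample z-scoring gives early bars access to the future mean and standard deviation, while an expanding window uses only data available at each bar.

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0])

# Look-ahead: mean and std are computed over the FULL history, so the
# earliest observations are scaled with statistics from the future.
z_full = (prices - prices.mean()) / prices.std()

# Point-in-time: each value is normalized only with data available up to
# and including that bar (expanding window; the first bar has no std yet).
z_pit = (prices - prices.expanding().mean()) / prices.expanding().std()
```

The two series disagree on early bars, and a model trained on `z_full` has implicitly seen the whole sample.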
Survivorship bias is another way of editing the past. If the test includes only instruments that survived until today, history loses bankruptcies, delistings, dead tokens, closed pairs and failed markets. Research on mutual funds shows that survivorship bias can materially distort estimates of average returns and persistence. [2] In crypto, this risk is even rougher: not only individual assets disappear, but also exchanges, pairs, liquidity bridges and entire market regimes.
An honest backtest should work with a point-in-time universe: the strategy sees only the instruments and data that actually existed at the moment of decision. This is technically harder, but without it the test often answers the question "how would the strategy have traded if it already knew who would survive?"
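A point-in-time universe can be as simple as filtering on listing and delisting dates. A sketch with invented instruments and dates (a real system would load these from a point-in-time reference database rather than hard-code them):

```python
from datetime import date

# Hypothetical listing/delisting records.
listings = {
    "AAA": (date(2018, 1, 1), None),               # still trading today
    "BBB": (date(2019, 6, 1), date(2022, 3, 15)),  # delisted
    "CCC": (date(2021, 1, 1), date(2021, 9, 1)),   # dead token
}

def universe_at(as_of):
    """Instruments the strategy was actually allowed to see on `as_of`."""
    return sorted(
        sym for sym, (listed, delisted) in listings.items()
        if listed <= as_of and (delisted is None or as_of < delisted)
    )

u_2021 = universe_at(date(2021, 2, 1))   # includes the later casualties
u_today = universe_at(date(2026, 4, 25)) # survivors only
```

A survivorship-biased test would run the 2021 period against `u_today`, quietly deleting every instrument that later failed.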
Transaction costs, slippage and latency
A clean equity curve without costs almost always overstates a strategy. This is especially true when the system trades frequently, works with a small edge or uses market orders.
Transaction costs include exchange fees, broker fees, funding, borrow costs, spread and other direct expenses. In crypto, even on a single exchange, fees depend on maker/taker status, symbol, account tier and special conditions; Binance, for example, separately documents commission types and how commission rates are calculated. [3] If a backtest assumes zero fees "for simplicity", it is testing an ideal frictionless environment, not the market.
Slippage is the difference between the expected trade price and the actual execution price. It comes from the spread, insufficient depth, price movement during execution and market impact. For a small order in a liquid pair, slippage may be almost invisible. For a large order or a thin book, it can consume the entire expected edge.
Latency is the delay between signal generation, order submission and processing by the venue. For medium-term strategies it may be secondary, but for intraday, arbitrage and market making, latency changes the nature of the test. If the backtest assumes instant execution at the best price, while the live system receives the book through WebSocket, processes the signal, passes risk checks and only then sends the order, the two will diverge unless the test models that delay and the order's position in the queue. Exchange documentation for WebSocket streams shows that market data arrives as a stream of trades and book updates, not as a perfect finished candle. [4]
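Even a crude cost model changes the picture. A sketch of a conservative fill-price function; the fee rate and slippage figure are placeholder values chosen for illustration, not any particular exchange's schedule:

```python
def fill_price(mid, side, fee_rate=0.001, slippage_bps=5):
    """Conservative execution model: assume slippage worsens the fill by
    `slippage_bps` basis points, then fold a taker fee into an effective
    price for PnL accounting. Both adjustments always hurt the trader.
    """
    slip = mid * slippage_bps / 10_000
    px = mid + slip if side == "buy" else mid - slip
    return px * (1 + fee_rate) if side == "buy" else px * (1 - fee_rate)

buy = fill_price(100.0, "buy")    # pay above mid: slippage plus fee
sell = fill_price(100.0, "sell")  # receive below mid
round_trip_cost = buy - sell      # cost of buying and immediately selling
```

With these placeholder numbers a single round trip costs about 0.3% of notional, which is enough to erase a strategy whose per-trade edge is a few basis points.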
Paper trading
Paper trading is useful as an intermediate layer between backtesting and live trading. It checks that the strategy runs on schedule, receives data, generates signals, creates orders, writes logs, survives restarts and calculates PnL correctly in a near-live mode.
But paper trading is not the same as trading capital. In a simulation, there is no real impact of the order on the market, no partial fill in a thin book, no rejection because margin changed sharply, no psychological pressure and often not the same queue in which a real order would stand. FINRA's required risk disclosure for day trading explicitly states that active trading can be extremely risky and requires readiness for substantial losses. [5] Paper trading helps verify mechanics, but it does not remove those risks.
The right role of paper trading is therefore not "the final proof of profitability", but a rehearsal of the production system: data, timing, orders, limits, monitoring and emergency stops.
Why a backtest is not real trading
A backtest works with the past, while real trading happens in the future, where liquidity, competition, volatility, fees, regulation and participant behavior change. Even if the historical test was built honestly, it remains a model.
The main reasons for divergence are usually these:
- the market changes regime, and the pattern found in history stops working;
- research data is cleaner and more complete than live data;
- execution in the test is simpler than real order routing, spread, queue position and partial fills;
- the strategy scales worse than it appears at small size;
- fees, funding, borrow costs and slippage change over time;
- the author changes behavior after launch: switches the system off after a drawdown, edits rules manually, expands risk, selectively stops trades;
- competitors find the same anomaly, and the edge compresses.
This is why regulators and exchange rules look at automated trading not only as a "signal algorithm", but as a system with controls, limits, monitoring and failure procedures. ESMA guidelines for automated trading separately describe requirements for systems, pre-trade and post-trade controls, resilience and risk management. [6] For a strategy developer, the implication is clear: a backtest is only one layer of validation, not the whole system.
What counts as a good backtest
A good backtest does not have to show maximum return. It should be reproducible, conservative and strict enough that a weak strategy does not pass simply because of convenient assumptions.
The minimum set of signs:
- strategy rules are fixed before the final test, not rewritten after viewing the result;
- data is point-in-time: no future corporate actions, future universe membership or retrospectively recalculated features;
- in-sample and out-of-sample are separated, and out-of-sample is not used as an endless parameter-tuning playground;
- fees, spread, slippage, funding and latency are included at least through a conservative model;
- parameters are checked for robustness: neighboring values should not completely destroy the strategy;
- metrics include not only CAGR, but also max drawdown, volatility, Sharpe/Sortino, turnover, hit rate, tail losses and recovery time;
- the test shows behavior across market regimes, not only one final chart;
- versions of data, code and parameters are preserved so the result can be reproduced.
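One of the metrics from the checklist, maximum drawdown, is worth showing because it is easy to get wrong. A minimal sketch; production code would also report drawdown duration and recovery time:

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction.

    Tracks the running peak and the worst ratio of current value to that
    peak; the result is negative (e.g. -0.25 means a 25% drawdown).
    """
    peak = float("-inf")
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst

# Equity peaks at 120 and falls to 90: drawdown = 90/120 - 1 = -25%.
dd = max_drawdown([100, 110, 120, 95, 90, 115])
```

The key detail is that the peak is updated before each comparison, so a new high resets the reference point rather than comparing against the global maximum of the whole series.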
For tools such as ai-trader, the value of a backtesting layer lies precisely here: not in a beautiful equity curve, but in a reproducible process where signal, data, risk limits, costs and live checks are tied into one verifiable system.
Bottom line
Backtesting is not meant to prove future profit. It is meant to separate strategies that can at least survive an honest historical test from strategies that depend on data errors, over-optimization and unrealistic execution.
A good backtest is always a little disappointing: it adds fees, spoils perfect entries with slippage, forbids looking into the future, puts dead instruments back into history and shows drawdowns one would rather not see. But that is exactly why it is useful. The earlier a strategy breaks in testing, the cheaper that failure is before real trading begins.