Data in trading: what data is used and where to get it
4/19/2026 · Rustam Atai
In algorithmic trading, the question "where do we get the data?" usually comes up too late. First comes the idea, then the search for quotes, then the backtest, and only after that does it become clear that the strategy in reality depends on data that was not actually present in the research setup. The result is a system that can look convincing on a chart and still be unusable in live execution.
The problem is that "market data" is not a single thing. Candles, ticks, order book data, corporate reporting, news, and sentiment are different layers of information with different update speeds, error structures, and storage costs. The layer a strategy relies on determines not only the signal, but also what distortions enter the model, how slippage is estimated, and how honest the backtest really is.
OHLCV: the most accessible layer, but not the most neutral one
OHLCV stands for open, high, low, close, and volume over a chosen interval. This format is convenient because it is compact, widely available, and suitable for a large range of tasks: market regime filters, medium-term strategies, volatility estimation, portfolio research, and simple intraday models that operate on bar closes.
But a candle is already a heavy compression of the market. Inside a single bar, the order of trades is lost, you do not know whether the high came before the low or vice versa, you do not see the structure of the spread or the queue in the book, and you cannot properly reconstruct the actual price path within the interval. That is why OHLCV answers the question "how did the instrument move on average?" quite well, but answers the question "could you really have executed the trade at that price and in the sequence assumed by the backtest?" much less well.
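The compression is easy to see by building a bar from ticks yourself. A minimal sketch with pandas, using four hypothetical tick prints for a single minute:

```python
import pandas as pd

# Hypothetical tick prints inside one minute: price and size per trade.
ticks = pd.DataFrame(
    {
        "price": [100.0, 100.4, 99.8, 100.1],
        "size":  [5, 2, 7, 3],
    },
    index=pd.to_datetime(
        ["2026-04-19 09:30:05", "2026-04-19 09:30:20",
         "2026-04-19 09:30:40", "2026-04-19 09:30:55"]
    ),
)

# Resample into a single 1-minute OHLCV bar.
bar = ticks["price"].resample("1min").ohlc()
bar["volume"] = ticks["size"].resample("1min").sum()

print(bar)
# One row: open=100.0, high=100.4, low=99.8, close=100.1, volume=17.
```

Note what the bar no longer contains: that the high printed before the low, how the 17 lots were sequenced, and what the spread looked like between prints. A backtest that only sees the bar has to guess all of that.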
That is an important boundary. If a strategy makes decisions once per day or once per hour and is not sensitive to market microstructure, OHLCV is often enough. But if the model depends on how price moved through a level inside the bar, how quickly the spread collapsed, or how much liquidity was available at the best prices, candle data is already hiding exactly the information the idea depends on.
Tick data and the order book: when microstructure matters
Tick data is the stream of individual market events: trades, quotes, or changes in quotes. This layer matters when a strategy lives inside the trading day and is sensitive to event sequencing: short-term alpha, execution algorithms, slippage estimation, news-driven momentum models, or realized volatility calculations at high frequencies.
The order book adds another level of detail: not just the fact that a trade occurred, but the structure of available liquidity on the bid and ask and how that liquidity changes over time. In U.S. equities, direct exchange feeds such as Nasdaq TotalView-ITCH contain true order-level events: adds, executions, cancels, and replacements. 1 In crypto, similar logic is usually implemented as a REST depth snapshot plus a stream of incremental deltas over the venue's WebSocket API. Binance, for example, publishes separate endpoints for depth, recent trades, historical trades, and klines; Coinbase publishes level2, market_trades, and candles channels. 2
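The snapshot-plus-deltas logic can be sketched in a few lines. The message shapes below are a simplified assumption, loosely modeled on exchange depth-diff streams rather than any venue's exact schema:

```python
# Minimal local order book: apply a depth snapshot, then a stream of deltas.
# Levels are price -> quantity maps; a delta quantity of 0 removes the level.

def apply_snapshot(book, snapshot):
    book["bids"] = {float(p): float(q) for p, q in snapshot["bids"]}
    book["asks"] = {float(p): float(q) for p, q in snapshot["asks"]}

def apply_delta(book, delta):
    for side in ("bids", "asks"):
        for p, q in delta.get(side, []):
            p, q = float(p), float(q)
            if q == 0:
                book[side].pop(p, None)   # level removed
            else:
                book[side][p] = q         # level added or resized

def best_bid_ask(book):
    return max(book["bids"]), min(book["asks"])

book = {"bids": {}, "asks": {}}
apply_snapshot(book, {"bids": [("100.0", "5"), ("99.9", "2")],
                      "asks": [("100.1", "3"), ("100.2", "4")]})
apply_delta(book, {"bids": [("100.0", "0")],   # best bid pulled
                   "asks": [("100.1", "1")]})  # ask size reduced
print(best_bid_ask(book))  # (99.9, 100.1)
```

A real implementation also has to check the update sequence numbers that venues attach to each delta and re-request a snapshot whenever a gap appears; silently skipping that step is a common way to end up with a book that drifts away from reality.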
The practical value of that extra detail is simple: it lets you model not only direction, but execution quality. That is critical for market making, arbitrage, intraday momentum, and any system where final PnL depends less on the signal itself than on queue position, spread, cancellation dynamics, and actual available liquidity.
That precision comes at a real cost. Tick data, and especially order book data, takes more storage, is harder to normalize, more often contains gaps and technical events, and is more tightly tied to a specific venue. So the question is not "is the order book better than OHLCV?" but "does the strategy actually earn its edge from information that does not exist in a candle?"
Fundamental data, news, and sentiment: the market beyond the tape
Not every strategy is built on the market as a sequence of prices alone. For stocks, fundamental data means financial statements, balance sheet items, revenue, margins, debt, guidance, and other corporate metrics that affect valuation over time rather than microsecond execution. The SEC provides direct programmatic access to filings and XBRL data through the EDGAR API, including submission history, company facts, and related structures that are updated in real time as disclosures are published. 4
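As a concrete sketch: the EDGAR XBRL "company facts" endpoint is keyed on a zero-padded 10-digit CIK, and the SEC asks automated clients to identify themselves with a User-Agent header. The helper names below are our own, not part of any SEC SDK:

```python
import json
import urllib.request

def companyfacts_url(cik: int) -> str:
    # EDGAR keys XBRL company facts on a zero-padded 10-digit CIK.
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"

def fetch_companyfacts(cik: int, user_agent: str) -> dict:
    # The SEC expects automated clients to declare who they are.
    req = urllib.request.Request(
        companyfacts_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(companyfacts_url(320193))  # Apple's CIK
# https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json
```

The same zero-padded CIK convention shows up again later when mapping tickers to filings, which is one reason to centralize it in a helper rather than formatting it ad hoc.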
This same layer is also where "news" becomes meaningful in a stricter sense. For systematic work, the useful input is often not a general stream of headlines, but primary disclosures: 8-Ks, 10-Ks, 10-Qs, issuer press releases, and exchange notices. In the SEC's framework, Form 8-K is a current report for material events that investors should learn about quickly, not after they have already been retold by the media stream. 5 For event-driven strategies, the difference between a primary source and a secondary retelling is often more important than any subtlety of the NLP model.
Sentiment data should be treated carefully. It is not a magical mood indicator, but a derived layer of features extracted from news text, research notes, transcripts, social media, or forums. Academic literature does show that the tone of media coverage can be related to price moves and trading volume; Paul Tetlock's work on the role of media in the stock market is a classic example. 6 But in practical trading, sentiment is usually most useful as a weak auxiliary signal or a regime filter, not as a standalone source of edge.
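"Weak auxiliary signal or regime filter" has a simple mechanical form: sentiment gates an existing signal instead of generating one. Both series below are hypothetical illustrative numbers:

```python
import pandas as pd

signal = pd.Series([1, 1, -1, 1], name="signal")          # base model output
sent_z = pd.Series([0.5, -2.1, 0.3, 1.8], name="sent_z")  # sentiment z-score

# Suppress long signals when the sentiment regime is strongly negative;
# the -1.5 threshold is an arbitrary placeholder, not a recommendation.
gated = signal.where(~((signal > 0) & (sent_z < -1.5)), 0)
print(gated.tolist())  # [1, 0, -1, 1]
```

The structure matters more than the numbers: sentiment never flips a position here, it only withholds one, which keeps a noisy derived feature from becoming the primary driver of PnL.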
Crypto vs. stocks: similar labels, different data structures
At the vocabulary level, the markets look similar: both have candles, trades, order book data, and news. At the infrastructure level, the differences run much deeper.
In equities, there is a formalized regime for market data, corporate actions, and disclosures. There is a regulatory framework for consolidated market data, official filings, issuer tickers, corporate actions, and exchange session calendars. In its Market Data Infrastructure rules, the SEC explicitly lays out the system for collecting, consolidating, and disseminating data for NMS stocks. 7 That is why the main engineering tasks in equity data revolve around correct adjustments, delisting handling, corporate action normalization, and precise mapping between ticker, CUSIP/CIK, and trading session.
Crypto is usually more fragmented. There is no single official consolidated tape, trading runs 24/7, symbols differ from venue to venue, and the same pair can have different depth, fee structures, and noticeably different microstructure depending on where it trades. The source of OHLCV, tick history, and order book data is therefore usually venue-native: you either take data from the specific exchange itself or from an aggregator that has already made the merging decisions for you. 2
There is also a substantive difference in what "fundamentals" means. For stocks, fundamentals are issuer financials and corporate events. For crypto assets, "fundamental" data more often means tokenomics, unlock schedules, issuance, on-chain activity, network fees, validator activity, treasury wallet flows, and dependence on a specific protocol. In other words, the object of analysis is different: instead of a company with regular disclosure, you often have a network, a token, and a set of public but heterogeneous data sources.
Where to get the data in practice
The most reliable rule is simple: whenever possible, take the data as close to the primary source as you can.
For market data, that means exchange feeds and official APIs. For equities, it means direct or consolidated feeds, historical archives, and official exchange message specifications. For crypto, it means the REST and WebSocket APIs of the specific venue when the strategy is sensitive to venue behavior. For stock fundamentals, it means SEC EDGAR and XBRL. For event data, it means primary corporate disclosures and exchange announcements. 1 3 5
Intermediate vendors and aggregators are useful too, but they always come with an abstraction cost. They speed up research, provide a unified format across many markets, and remove part of the infrastructure burden, but they also make decisions on your behalf: how to aggregate trades, how to build candles, how to reconstruct the book, how to label corrections, and how to handle delistings and renames. For medium-term models, that is often a reasonable trade-off. For execution-sensitive strategies, it can mean losing an important part of reality.
In practice, the choice usually looks like this:
- For daily and hourly models, OHLCV and corporate data can come from a reliable aggregator or official archive, as long as you have checked the adjustments and calendars.
- For intraday strategies based on trades and order book data, it is better to have either direct historical feeds or your own market data capture pipeline, so you do not have to guess how exactly the vendor reconstructed the event stream.
- For event-driven and cross-sectional models, value often lies less in "the more sources the better" than in discipline around timestamps, symbol mapping, and the link between the event and the trading universe.
Data quality, survivorship bias, and data cleaning
Most strategy errors are born not in the signal formula, but in data that looks "almost right". The same model can become profitable or unprofitable simply because candles were built on different sessions, corporate actions were applied only partially, or outlier prints were left unfiltered.
Survivorship bias is one of the most dangerous examples. If the historical universe contains only the stocks or tokens that survived until today, the backtest automatically becomes prettier: delistings, bankruptcies, dead projects, and weak instruments disappear from history even though they were once part of the real market. In investment research, this effect is well documented in academic literature; for example, Carhart, Carpenter, Lynch, and Musto show that survivorship bias distorts estimates of average performance and persistence. 8 The same logic applies in trading: if the historical universe has been cleaned of losers after the fact, you are not testing the market, but an edited version of history.
In crypto, that risk is often even higher than in stocks. It is not only individual tokens that die, but trading pairs, exchanges, liquidity bridges, and quoting regimes themselves. If you look only at today's liquid pairs with long histories, you can end up with a very tidy picture of a market that did not actually exist in that form in the past.
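The size of the distortion is easy to demonstrate with toy numbers. Below, a hypothetical five-name universe where two names were later delisted; all returns are made up for illustration:

```python
# Hypothetical annual returns for a five-name universe.
returns = {
    "AAA": 0.12, "BBB": 0.08, "CCC": 0.15,   # still listed today
    "DDD": -0.60, "EEE": -0.85,              # delisted / dead projects
}
survivors = {"AAA", "BBB", "CCC"}

full_mean = sum(returns.values()) / len(returns)
surv_mean = sum(returns[s] for s in survivors) / len(survivors)

print(f"full universe:  {full_mean:+.3f}")   # -0.220
print(f"survivors only: {surv_mean:+.3f}")   # +0.117
```

The same strategy, the same period, and the sign of the average flips purely because the losers were dropped from history after the fact.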
That is why data cleaning should not be treated as a boring operational stage that begins after loading a CSV. It is part of the model itself. A minimal checklist usually includes:
- aligning all timestamps to a single time zone and precision;
- accounting for trading sessions, holidays, and transitions between regular and extended hours where relevant;
- correctly handling corporate actions, renamings, ticker changes, and delistings;
- removing or flagging duplicates, gaps, negative volumes, impossible highs and lows, and extreme prints;
- defining an explicit policy for corrections, cancels, and rebuilding candles from tick data;
- maintaining stable symbol mapping between market data, fundamentals, news, and your trading universe.
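Several of the checklist items above translate directly into executable checks. A sketch over a hypothetical OHLCV DataFrame; the column names are assumptions, not a standard:

```python
import pandas as pd

bars = pd.DataFrame({
    "open":   [100.0, 100.1, 100.1],
    "high":   [100.5,  99.9, 100.4],   # second bar: high < low (impossible)
    "low":    [ 99.8, 100.0, 100.0],
    "close":  [100.1, 100.1, 100.2],
    "volume": [ 1200,   800,    -5],   # third bar: negative volume
}, index=pd.to_datetime(["2026-04-19 09:30", "2026-04-19 09:31",
                         "2026-04-19 09:32"], utc=True))

def flag_bad_bars(df: pd.DataFrame) -> pd.Series:
    bad = pd.Series(False, index=df.index)
    bad |= df["high"] < df["low"]                       # impossible range
    bad |= df["volume"] < 0                             # negative volume
    bad |= ~df["open"].between(df["low"], df["high"])   # open outside range
    bad |= ~df["close"].between(df["low"], df["high"])  # close outside range
    bad |= df.index.duplicated()                        # duplicate timestamps
    return bad

print(flag_bad_bars(bars).tolist())  # [False, True, True]
```

Flagging rather than silently dropping is deliberate: a bad bar may be a data error, a correction, or a genuine extreme print, and the policy for each of those is a modeling decision, not a cleaning default.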
That is the stage where it becomes clear whether a dataset is a research convenience layer or a foundation for a production strategy. In tools such as ai-trader, the value does not come from the number of connected APIs, but from a reproducible data pipeline: raw data, a normalized layer, quality checks, and the ability to rebuild any backtest from the same historical version.
Conclusion
In trading, there is no such thing as "just data". There is the information layer that matches the strategy's horizon, and there is the layer that makes a backtest look convincing while being useless for the real market.
OHLCV is enough for many tasks, but it hides microstructure. Tick data and order book data are needed when money is made or lost on execution. Fundamentals, news, and sentiment only become useful when you understand their source, latency, and the way they are mapped into the trading universe. And the real difference between stocks and crypto does not begin with volatility, but with the structure of the data itself: in one case, filings and corporate actions matter more; in the other, venue fragmentation and the quality of your own aggregation do.
So the better question for a trader is not "where can I download quotes?" but "which exact layer of the market does my algorithm need to see so that its research statistics do not fall apart when they meet real execution?"