9 Mistakes Quants Make that Cause Backtests to Lie


“I’ve never seen a bad backtest” — Dimitris Melas, head of research at MSCI.

About backtests

A backtest is a simulation of a trading strategy, used to evaluate how effective the strategy might have been had it been traded historically. Backtesting is used by hedge funds and other researchers to test strategies before real capital is applied. Backtests are valuable because they enable quants to quickly test and reject trading strategy ideas.

All too often, strategies look great in simulation but fail to live up to their promise in live trading. There are a number of reasons for these failures, some of which are beyond the control of a quant developer. But other failures are caused by common, insidious mistakes.

An overly optimistic backtest can cause a lot of pain. I’d like to help you avoid that pain by sharing 9 of the most common pitfalls in trading strategy development and testing that can result in excessively optimistic backtests:

1. In-sample backtesting

Many strategies require refinement, or model training of some sort. As one example, a regression-based model that seeks to predict future prices might use recent data to build the model. It is perfectly fine to build a model in that manner, but it is not OK to test the model over that same time period. Such models are doomed to succeed.

Don’t trust them.

Solution: Best practice is to build procedures that prevent you from testing over the same data you train on. As a simple example, you might use data from 2007 to train your model, but test over 2008 onward.

By the way, even though it could be called “out-of-sample” testing, it is not a good practice to train over later data, say 2014, then test over earlier data, say 2008-2013. This may permit various forms of look-ahead bias.
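The train-then-test-later rule can be enforced in code rather than by convention. The following is a minimal sketch, assuming your data is a list of (date, value) rows; the function name chronological_split and the sample rows are hypothetical, not part of any particular backtesting library.

```python
from datetime import date


def chronological_split(rows, train_end):
    """Split (date, value) rows so that every test row is later than every train row."""
    rows = sorted(rows, key=lambda r: r[0])
    train = [r for r in rows if r[0] <= train_end]
    test = [r for r in rows if r[0] > train_end]
    return train, test


# Hypothetical observations: (date, price). Values are made up for illustration.
rows = [
    (date(2008, 6, 2), 97.0),
    (date(2007, 6, 1), 100.0),
    (date(2008, 1, 2), 103.0),
    (date(2007, 12, 31), 104.0),
]

# Train on 2007, test on 2008 onward, as in the example above.
train, test = chronological_split(rows, train_end=date(2007, 12, 31))
print(len(train), "training rows;", len(test), "test rows")
```

Because the split is by date rather than by random sampling, the model never sees data from the period it is later tested on.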

2. Using survivor-biased data

 Suppose I told you I have created a fantastic new blood pressure medicine, and that I had tested it using the following protocol:

  1. Randomly select 500 subjects.
  2. Administer my drug to them every day for 5 years.
  3. Measure their blood pressure each day.

At the beginning of the study the average blood pressure of the participants was 160/110; at the end of the study the average BP was 120/80 (significantly lower and better).

Those look like great results, right? What if I told you that 58 of the subjects died during the study? Maybe it was the ones with the high blood pressure that died! This is clearly not an accurate study because it focused on the statistics of survivors at the end of the study.

This same sort of bias is present in backtests that use later lists of stocks (perhaps members of the S&P 500) as the basis for historical evaluations over earlier periods. A common example is to use the current S&P 500 as the universe of stocks for testing a strategy.

Why is this bad? See the two figures below for illustrative examples.

The green lines show historical performance of stocks that were members of the S&P 500 in 2012. Note that all of these stocks came out of the 2008/2009 downturn very nicely.


What really happened: If, instead, we use the members of the S&P 500 starting in 2008, we find that more than 10% of the listed companies failed.


In our work at Lucena Research, we see an annual 3% to 5% performance “improvement” with strategies using survivor-biased data.

Solution: Find datasets that include historical members of indices, then use those lists to sample for your strategies.
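One way to use historical membership lists is to look up the universe as of each simulated date. The sketch below assumes a membership table of (ticker, date added, date removed) records; the tickers, dates, and the function name universe_on are all hypothetical placeholders.

```python
from datetime import date

# Hypothetical membership records: (ticker, date_added, date_removed or None).
MEMBERSHIP = [
    ("AAA", date(2000, 1, 3), None),
    ("BBB", date(2001, 5, 1), date(2008, 9, 15)),  # dropped from the index in 2008
    ("CCC", date(2010, 3, 1), None),               # joined after the downturn
]


def universe_on(as_of, membership):
    """Return the tickers that were index members on the given date."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )


# The 2008 universe still contains the company that later failed;
# the 2012 universe silently omits it.
print(universe_on(date(2008, 1, 2), MEMBERSHIP))
print(universe_on(date(2012, 1, 3), MEMBERSHIP))
```

A backtest that samples from the 2012 list while simulating 2008 would never hold BBB, which is exactly the bias described above.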

3. Observing the close & other forms of look-ahead bias

In this failure mode, the quant assumes he can observe market closing prices in order to compute an indicator, and then also trade at the close. For example, one might use closing price/volume to calculate a technical factor used in the strategy, then trade based on that information.

This is a specific example of look-ahead bias, in which the strategy is allowed to peek a little bit into the future. In my work I have seen time and again that even a slight look-ahead bias can produce fantastic (and false) returns.

Other examples of look-ahead bias have to do with incorrect registration of data such as earnings reports or news. One example is assuming that you can trade on the same day earnings are announced, even though earnings are usually announced after the close.

Solution: Don’t trade until the open of the next day, after information becomes available.
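A simple way to enforce next-day execution is to let the signal see only data through day t's close and fill the trade at day t+1's open. This is a minimal sketch under that assumption; backtest_next_open, momentum_signal, and the price lists are hypothetical.

```python
def backtest_next_open(opens, closes, signal_fn):
    """Decide after day t's close, enter at day t+1's open, exit at day t+2's open."""
    returns = []
    for t in range(len(closes) - 2):
        # The signal only receives closes up to and including day t.
        position = signal_fn(closes[: t + 1])
        entry = opens[t + 1]
        exit_price = opens[t + 2]
        returns.append(position * (exit_price / entry - 1.0))
    return returns


def momentum_signal(closes_so_far):
    """Toy signal: long (1) if the latest close rose versus the prior close, else flat (0)."""
    if len(closes_so_far) < 2:
        return 0
    return 1 if closes_so_far[-1] > closes_so_far[-2] else 0


# Made-up prices for illustration only.
opens = [100.0, 101.0, 102.0, 101.0, 103.0]
closes = [100.5, 101.5, 101.0, 102.5, 104.0]
rets = backtest_next_open(opens, closes, momentum_signal)
print(rets)
```

The key design choice is that the slice passed to the signal can never include the bar at which the trade is filled.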

4. Ignoring market impact

The very act of trading affects price. Historical pricing data does not include your trades and is therefore not an accurate representation of the price you would get if you were trading.

Take a look at the chart below that describes the performance of a real strategy which I helped develop. Consider the region A, the first part of the upwardly sloping orange line. This region was the performance of our backtest. The strategy had a Sharpe Ratio over 7.0! Based on the information we had up until that time (the end of A), it looked great so we started trading it.

When we began live trading, we saw the real performance illustrated with the green “live” line in region B: essentially flat. The strategy was not working, so we halted trading it after a few weeks. After we stopped trading it, the strategy started performing well again in paper trading (Region C, Arg!).

Performance of a strategy that looked great in backtesting (region A). When traded live, it didn’t work well (region B). When we stopped trading it, it went back to working well (region C).


How can this be? We thought perhaps that the error was in our predictive model, so we backtested again over the “live” area and the backtest showed that same flat area. The only difference between the nice 7.0 Sharpe Ratio sections and the flat section was that we were engaged in the market in the flat region.

What was going on? The answer, simply, is that by participating in the market we were changing the prices to our disadvantage. We were not modeling market impact in our market simulation. Once we modeled that more accurately, our backtest appropriately showed a flat, no-return result for region A. If we had had that in the first place, we probably would never have traded the strategy.

Solution: Be sure to anticipate that price will move against you at every trade. For trades that are a small part of overall volume, a rule of thumb is about 5 bps for S&P 500 stocks and up to 50 bps for more thinly traded stocks. It depends of course on how much of the market your strategy is seeking to trade.
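The rule of thumb above can be applied as a fixed cost in basis points on every simulated fill. The following is a rough sketch, not a full impact model: it assumes a constant cost regardless of order size, and the function name fill_price and the example prices are hypothetical.

```python
LIQUID_IMPACT_BPS = 5.0    # rule of thumb above for S&P 500 stocks
ILLIQUID_IMPACT_BPS = 50.0  # upper rule of thumb for thinly traded stocks


def fill_price(quoted_price, side, impact_bps):
    """Move the price against the trader: buys fill higher, sells fill lower."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    sign = 1.0 if side == "buy" else -1.0
    return quoted_price * (1.0 + sign * impact_bps / 10_000.0)


# Made-up round trip: buy at a quoted 100.00, later sell at a quoted 101.00.
entry = fill_price(100.0, "buy", LIQUID_IMPACT_BPS)
exit_price = fill_price(101.0, "sell", LIQUID_IMPACT_BPS)
gross = 101.0 / 100.0 - 1.0
net = exit_price / entry - 1.0
print(f"gross return {gross:.4%}, net of impact {net:.4%}")
```

A more realistic simulator would scale the cost with the order's share of traded volume, as the last sentence of the solution suggests.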

5. Buy $10M of a $1M company

Naïve backtesters will allow a strategy to buy or sell as much of an asset as it likes. This may provide a misleadingly optimistic backtest because large allocations to small companies are allowed.

There often is a real alpha in thinly traded stocks, and data mining approaches are likely to find it. Consider for a moment why it seems there is an alpha there. The reason is that the big hedge funds aren’t playing there because they can’t execute their strategy with illiquid assets. There are perhaps scraps of alpha to be collected by the little guy, but check to be sure you’re not assuming you can buy $10M of a $1M company.

Solution: Have your backtester limit the strategy’s trading to a percentage of the daily dollar volume of the equity. Another alternative is to filter potential assets to a minimum daily dollar volume.
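Both suggestions are straightforward to sketch. The code below assumes you already have an average daily dollar volume figure per asset; the 1% participation cap, the dollar amounts, and the function names are illustrative assumptions, not recommended values.

```python
def cap_order(desired_dollars, avg_daily_dollar_volume, max_participation=0.01):
    """Clamp a signed dollar order to a fraction of average daily dollar volume."""
    limit = max_participation * avg_daily_dollar_volume
    return max(-limit, min(desired_dollars, limit))


def passes_liquidity_filter(avg_daily_dollar_volume, minimum_dollar_volume):
    """Keep only assets that trade at least the minimum dollar volume per day."""
    return avg_daily_dollar_volume >= minimum_dollar_volume


# A strategy that wants $10M of a stock trading about $1M a day
# is cut back to 1% of that daily volume.
print(cap_order(10_000_000, 1_000_000))
print(passes_liquidity_filter(1_000_000, minimum_dollar_volume=5_000_000))
```

The cap handles sells as well as buys because the order is treated as a signed dollar amount.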

6. Overfit the model

An overfit model is one that models in-sample data very well. It predicts the data so well that it is likely modeling noise rather than the underlying principle or relationship in the data that you are hoping it will discover.

Here’s a more formal definition of overfitting: As the degrees of freedom of the model increase, overfitting occurs when in-sample prediction error decreases and out-of-sample prediction error increases.

What do we mean by “degrees of freedom”? Degrees of freedom can take many forms, depending on the type of model being created: the number of factors used, the number of parameters in a parameterized model, and so on.

Degrees of freedom (X), versus error (Y). Overfitting is occurring in the region to the right of the yellow symbol as out of sample error increases.


Solution: Don’t repeatedly “tweak” and “refine” your model using in-sample data. And always compare in-sample error versus out-of-sample error.
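To make the in-sample versus out-of-sample comparison concrete, here is a toy sketch on synthetic data. It assumes a true relationship y = 2x plus noise and compares a one-parameter model with a "memorizer" that stores every training point; both models and the data generator are hypothetical illustrations, not a recommended modeling approach.

```python
import random


def mse(predict, data):
    """Mean squared prediction error over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def fit_slope(train):
    """One degree of freedom: y = b * x, least squares through the origin."""
    b = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: b * x


def fit_memorizer(train):
    """One degree of freedom per training point: return the y of the nearest training x."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample(n):
    """Synthetic data: y = 2x plus Gaussian noise (an assumption for illustration)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return data


train, test = sample(50), sample(50)
for name, fit in [("slope", fit_slope), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(name, "in-sample:", round(mse(model, train), 3),
          "out-of-sample:", round(mse(model, test), 3))
```

The memorizer reaches zero in-sample error by construction, because it has fit the noise; its out-of-sample error shows what that fit is actually worth.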

7. Trust complex models

Complex models are often overfit models. Simple approaches that arise from a basic idea that makes intuitive sense lead to the best models. A strategy built from a handful of factors combined with simple rules is more likely to be robust and less sensitive to overfitting than a complex model with lots of factors.

Solution: Limit the number of factors considered by a model, and use simple logic in combining them.

8. Trust stateful strategy luck

A stateful strategy is one whose holdings over time depend on which day in history it was started. As an example, if the strategy rapidly accrues assets, it may be quickly fully invested and therefore miss later buying opportunities. If the strategy had started one day later, its holdings might be completely different.

Sometimes the success of such strategies varies widely depending on the day they are started. I’ve seen, for instance, a 50% difference in return for the same strategy started on two different days in the same week.

Solution: If your strategy is stateful, be sure to test it starting on many different days. Evaluate the variance of the results across those days. If it is large you should be concerned.
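Start-date sensitivity can be measured by rerunning the same backtest from several start days and summarizing the spread. In this sketch, toy_stateful_backtest is a hypothetical strategy that holds for one day and then sits out one day, so the days it is in the market depend entirely on when it started; the price series is made up.

```python
from statistics import mean, pstdev


def toy_stateful_backtest(prices, start, hold_days=1, rest_days=1):
    """Hold for hold_days, sit out rest_days, repeat. Returns the total return."""
    equity = 1.0
    t = start
    while t + hold_days < len(prices):
        equity *= prices[t + hold_days] / prices[t]
        t += hold_days + rest_days
    return equity - 1.0


def start_date_sensitivity(run_backtest, start_days):
    """Run the backtest from each start day and summarize the spread of results."""
    results = [run_backtest(day) for day in start_days]
    return {
        "mean": mean(results),
        "stdev": pstdev(results),
        "min": min(results),
        "max": max(results),
    }


# Made-up prices that rise on one day and dip on the next.
prices = [100.0, 102.0, 101.0, 105.0, 104.0, 108.0, 107.0, 111.0]
summary = start_date_sensitivity(
    lambda day: toy_stateful_backtest(prices, day), start_days=[0, 1, 2, 3]
)
print(summary)
```

Here a one-day shift in the start date flips the strategy from catching every rise to catching every dip, which is the kind of spread the variance check is meant to expose.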

9. Data mining fallacy

Even if you avoid all of the pitfalls listed above, if you generate and test enough strategies you’ll eventually find one that works very well in a backtest. However, the quality of that strategy cannot be distinguished from that of a lucky random stock picker.

How can this pitfall be avoided? It can’t be avoided. However, you can and should forward test before committing significant capital.

Solution: Forward test (paper trade) a strategy before committing capital.
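The effect is easy to reproduce with strategies that have no edge at all. The sketch below assumes each "strategy" is just a series of random daily returns with zero mean, then reports the best annualized Sharpe Ratio among them; the function names, the 1% daily volatility, and the 252-day year are assumptions for illustration.

```python
import random
from statistics import mean, median, stdev


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe Ratio of daily returns, with a zero risk-free rate assumed."""
    return mean(daily_returns) / stdev(daily_returns) * periods_per_year ** 0.5


def sharpes_of_random_strategies(n_strategies, n_days, seed=0):
    """Simulate strategies with no edge: zero-mean Gaussian daily returns."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return sharpes


sharpes = sharpes_of_random_strategies(n_strategies=200, n_days=252)
print("median Sharpe:", round(median(sharpes), 2))
print("best Sharpe:  ", round(max(sharpes), 2))
```

The median strategy looks like what it is, while the best of the batch looks like a discovery. Only new data, such as a forward test, can tell the two apart.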

Summary

It is better to view backtesting as a method for rejecting strategies than as a method for validating strategies. One thing is for sure: If it doesn’t work in a backtest, it won’t work in real life. The converse is not true: Just because it works in a backtest does not mean you can expect it to work in live trading.

If you avoid the pitfalls listed above, your backtests stand a better chance of accurately representing real-life performance.

— By Tucker Balch from augmentedtrader

