Real-time head-to-head: Adaptive modeling of financial market data using XGBoost and CatBoost

Emergent Methods
13 min read · Jun 16, 2023


Which gradient boosted decision tree algorithm is superior for adaptive modeling of financial market data, XGBoost or CatBoost? There are plenty of articles comparing these algorithms on arbitrary static datasets, but how do they perform in a live, chaotic environment? How about resource usage like average training times, average inference times, CPU utilization, and RAM consumption? And finally, how well do these predictions translate into profit? To answer these questions, we designed a benchmark experiment, ran it live, and collected the results. Spoiler alert — XGBoost was way faster and won by almost 4x in terms of profitability.

Why do we need adaptive modeling?

Financial markets are inherently chaotic, with price action reacting to unforeseen news events, market manipulation, and herd mentality. Traditional modeling techniques often struggle to keep up with such unpredictability. This is where adaptive modeling comes into play: it provides a dynamic framework that can adjust to changing market conditions on the fly.

Figure 1. Adaptive modeling of a system that changes over time requires training new models when new data becomes available.

Adaptive modeling boils down to data management. Data is streamed into the model for both re-training and inference, and in live environments the two occur simultaneously. Meanwhile, the data processing pipeline also needs careful attention in terms of feature engineering, normalization, outlier removal, and any other manipulation techniques.
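To make the idea concrete, here is a minimal sketch of a retrain-while-inferencing loop. It is not FreqAI code: the class `SlidingWindowModel`, its parameters, and the toy "model" (the mean of a sliding window) are all hypothetical stand-ins chosen to keep the example short.

```python
from collections import deque
from statistics import mean


class SlidingWindowModel:
    """Toy adaptive model: predicts the mean of the most recent window."""

    def __init__(self, window_size, retrain_every):
        self.window = deque(maxlen=window_size)  # old data falls out automatically
        self.retrain_every = retrain_every
        self.steps_since_training = 0
        self.fitted_mean = None  # stands in for the "trained model"

    def update(self, value):
        """Ingest one new data point and retrain on schedule."""
        self.window.append(value)
        self.steps_since_training += 1
        if self.fitted_mean is None or self.steps_since_training >= self.retrain_every:
            self.fitted_mean = mean(self.window)
            self.steps_since_training = 0

    def predict(self):
        """Inference always uses the most recently trained model."""
        return self.fitted_mean


# A stream whose regime shifts from 10 to 50 halfway through
stream = [10.0] * 100 + [50.0] * 100

adaptive = SlidingWindowModel(window_size=20, retrain_every=5)
static_prediction = mean(stream[:100])  # trained once, never updated

for value in stream:
    adaptive.update(value)

print(f"static model:   {static_prediction:.1f}")
print(f"adaptive model: {adaptive.predict():.1f}")
```

The static model keeps predicting the old regime, while the adaptive one follows the stream because stale data drops out of its training window.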

FreqAI

With FreqAI [1], we have put together all of these different mechanics for you to run adaptive modeling on cryptocurrency market data via a user-friendly, community-tested, open-source interface, using any external ML library of your choosing.

FreqAI is built on top of the open-source algorithmic trading software Freqtrade, which allows access to a variety of open-source exchange APIs and provides a set of data analysis and visualization tools for evaluating both live and backtesting performance. On top of this, FreqAI currently provides 18 pre-configured prediction models for XGBoost, CatBoost, LightGBM, PyTorch, and Stable Baselines, and hosts a range of custom algorithms and methodologies aimed at improving computational and predictive performance.

To try the software, all you need to do is install Freqtrade and run the following command:

freqtrade trade --config config_examples/config_freqai.example.json --strategy FreqaiExampleStrategy --freqaimodel LightGBMRegressor --strategy-path freqtrade/templates

The ease of switching between regressors in FreqAI enabled us to pit XGBoost and CatBoost against each other and evaluate their performance in algorithmic trading of cryptocurrency. In fact, running them both simply boils down to:

freqtrade trade --strategy QuickAdapterV3 --freqaimodel XGBoostRegressor
freqtrade trade --strategy QuickAdapterV3 --freqaimodel CatboostRegressor

All code related to this experiment is open-source and available for inspection/reproduction. The underlying FreqAI source is available on GitHub at https://github.com/freqtrade/freqtrade. Meanwhile, the strategy and configuration files are available at the FreqAI Discord.

Details of the experiment

To showcase the abilities of FreqAI, we used two of the most popular ML regressors, XGBoost and CatBoost, to perform a 3-week-long study of their performance on live predictive modeling of chaotic time-series data from the cryptocurrency market.

Between February 16th and March 12th, two FreqAI instances, one for each regressor, were configured to train separate models for 19 coin pairs (/USDT). The instances were hosted on separate, identical, recycled servers (12-core Xeon X5660 2.8 GHz, 64 GB DDR3). All servers were benchmarked to confirm that they had identical performance.

The accuracy of the predictions produced by each regressor was assessed via two accuracy metrics: the balanced accuracy (the arithmetic mean of sensitivity and specificity) and a custom accuracy score (the normalized temporal distance between a prediction and its closest target).

To showcase their out-of-the-box behavior on this type of problem, the regressors were run with default settings, with one exception: the number of estimators for XGBoost was set to 1000 to match the CatBoost default (the XGBoost default is 100).

All in all, the cluster was actively generating 38 models (2 per coin x 19 coins) with ~3.3k features for each one, every 2 hours.

Feature engineering

The feature set was based on pair-specific price and volume data from a sliding window of 14 days leading up to the current time point, acquired from the cryptocurrency exchange Binance using the open-source CCXT trading library.

The feature set for each coin pair contained 42 base indicators computed for the base candle time frame (5 minutes) as well as for 15 minutes, 1 hour, and 4 hours using the open-source libraries TA-Lib, Pandas-TA, QTPyLib, and Freqtrade technical indicators. A subset of the indicators was calculated for multiple periods (8, 16, and 32 candles), and each was shifted by up to 3 candles to add recency information. We also added day-of-the-week and hour-of-the-day as features, and used BTC and ETH as correlated data for all other coin pairs. In total, this amounted to 3266 features for each coin pair, except for BTC and ETH, which had 2178 features each.
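The totals above can be reproduced with a little arithmetic. The sketch below is one decomposition that is consistent with the reported numbers. It rests on assumptions that are not stated in the article: that 13 of the 42 base indicators are expanded over the three periods, that shifting applies to every indicator, and that the two time features are added once per model.

```python
# Values stated in the article
BASE_INDICATORS = 42
TIMEFRAMES = 4          # 5m, 15m, 1h, 4h
INDICATOR_PERIODS = 3   # 8, 16, 32
SHIFTED_CANDLES = 3
TIME_FEATURES = 2       # day-of-the-week, hour-of-the-day

# ASSUMPTION (inferred, not stated in the article): 13 of the 42 base
# indicators are expanded over the periods, the other 29 are not.
PERIOD_INDICATORS = 13


def feature_count(num_pairs):
    """Features for a model that sees `num_pairs` pairs (itself + correlated)."""
    plain_indicators = BASE_INDICATORS - PERIOD_INDICATORS
    per_candle = PERIOD_INDICATORS * INDICATOR_PERIODS + plain_indicators
    # Each indicator appears once unshifted plus once per shifted candle
    per_pair = per_candle * TIMEFRAMES * (1 + SHIFTED_CANDLES)
    return per_pair * num_pairs + TIME_FEATURES


print(feature_count(3))  # an altcoin: itself + BTC + ETH -> 3266
print(feature_count(2))  # BTC or ETH: itself + the other one -> 2178
```

Under these assumptions each pair contributes 1088 features, which explains why BTC and ETH (which each see only one correlated pair besides themselves) end up with roughly two thirds of the features of the other coins.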

The feature engineering components of the code are available in the FreqAI Discord, and the configuration for feature engineering is rather simple:

"freqai": {
    ...
    "feature_parameters": {
        "include_corr_pairlist": [
            "BTC/USDT:USDT",
            "ETH/USDT:USDT"
        ],
        "include_timeframes": [
            "5m",
            "15m",
            "1h",
            "4h"
        ],
        "label_period_candles": 100,
        "include_shifted_candles": 3,
        "DI_threshold": 20,
        "weight_factor": 0.9,
        "indicator_periods_candles": [8, 16, 32],
        "noise_standard_deviation": 0.02,
        "buffer_train_data_candles": 100
    },
    ...
}

Targets

The training labels were defined as the extrema (minimum and maximum) points within a sliding window of 200 candles (1000 minutes). We defined the label as &s-extrema, with a value of -1 for minima and 1 for maxima, and applied a Gaussian filter to smooth the labels and improve the regression. The code to reproduce these labels in FreqAI can be found below:

import numpy as np
from scipy.signal import argrelextrema


def set_freqai_targets(self, dataframe, **kwargs):
    """
    Set targets for FreqAI model. Any column prepended with `&`
    will be treated as a training target.
    """
    # Start with every candle labeled as "not an extremum"
    dataframe["&s-extrema"] = 0
    # order=100 compares each candle with 100 candles on either side,
    # i.e. a window of 200 candles
    min_peaks = argrelextrema(
        dataframe["low"].values, np.less,
        order=100
    )
    max_peaks = argrelextrema(
        dataframe["high"].values, np.greater,
        order=100
    )
    for mp in min_peaks[0]:
        dataframe.at[mp, "&s-extrema"] = -1
    for mp in max_peaks[0]:
        dataframe.at[mp, "&s-extrema"] = 1
    # Binary flags marking the raw (unsmoothed) extrema
    dataframe["minima-exit"] = np.where(dataframe["&s-extrema"] == -1, 1, 0)
    dataframe["maxima-exit"] = np.where(dataframe["&s-extrema"] == 1, 1, 0)
    # Smooth the labels with a Gaussian window
    dataframe["&s-extrema"] = dataframe["&s-extrema"].rolling(
        window=5, win_type="gaussian", center=True).mean(std=0.5)
    return dataframe

This label definition means two things: 1) we have an unbalanced classification problem, and 2) we are using regression for a classification problem:

1. Identifying extrema points inside a sliding window of 200 candles means that each window contains only 2 positive labels (one maximum and one minimum) but 198 negative ones.

2. We want to predict if the incoming candle corresponds to a maximum or a minimum or neither, which means that we are dealing with a multi-class classification problem.
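The imbalance described in the first point is easy to see on synthetic data. The sketch below uses a simplified, dependency-free stand-in for the extrema search (the function `label_extrema` and the triangle-wave "price" are hypothetical and only serve as illustration).

```python
def label_extrema(prices, order):
    """Label local minima as -1, local maxima as 1, everything else as 0.

    A point counts as an extremum if it is strictly lower (or higher) than
    every neighbor within `order` candles on both sides.
    """
    labels = [0] * len(prices)
    for i, price in enumerate(prices):
        left = prices[max(0, i - order):i]
        right = prices[i + 1:i + 1 + order]
        if not left or not right:
            continue  # boundary candles cannot be confirmed as extrema
        neighbors = left + right
        if all(price < n for n in neighbors):
            labels[i] = -1
        elif all(price > n for n in neighbors):
            labels[i] = 1
    return labels


# Hypothetical triangle-wave "price": one peak and one trough per 200 candles
prices = [abs((i % 200) - 100) for i in range(2000)]
labels = label_extrema(prices, order=100)

positives = sum(1 for label in labels if label != 0)
print(f"{positives} extrema among {len(labels)} candles "
      f"({100 * positives / len(labels):.2f}% of labels)")
```

Roughly 1% of the candles carry a non-zero label, so a model that never predicts an extremum would still be "right" about 99% of the time. This is why plain accuracy is a poor yardstick for this problem.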

FreqAI gives you access to both the regression and classification versions of each of the ML libraries we used. While we need to determine a threshold to make the final classification when using either classification or regression, we found that the regressor version performs better than the classifier for these predictions.

Adaptive thresholding

Using a regressor for a classification task requires handling the output predictions after inferencing the trained model. Since a regressor returns a real-valued prediction, this needs to be converted into a decision regarding whether or not the candle is an extremum. For this, we used an adaptive threshold that was calculated using the mean of the three highest and the three lowest historical predictions (six in total) within the previous 50 hours for the maxima and minima, respectively. This was performed in FreqAI with:

import pandas as pd

num_candles = 600  # 50 hours of 5 minute candles
pred_df_full = self.dd.historic_predictions[pair].tail(num_candles).reset_index(drop=True)
pred_df_sorted = pd.DataFrame()

# sort each column independently, from highest to lowest prediction
for col in pred_df_full:
    if pred_df_full[col].dtype == object:
        continue  # skip non-numeric columns
    pred_df_sorted[col] = pred_df_full[col].sort_values(
        ascending=False, ignore_index=True)

# number of expected maxima (and minima) during the last 50 hours
frequency = num_candles / 200
# get the mean of the top and bottom candles, use this for the threshold
maxima_sort_threshold = pred_df_sorted.iloc[:int(frequency)].mean()
minima_sort_threshold = pred_df_sorted.iloc[-int(frequency):].mean()

Using a dynamic threshold ensures that the classification of our regressors' predictions adapts to the state of the market that the model was trained on. It also means that there is a 50-hour "warmup" period before enough historical predictions are available to threshold the real-time inference for classification.
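For readers who want to experiment with the idea outside of a running bot, here is a self-contained sketch of the same thresholding on a plain list of predictions. The function names and the synthetic prediction history are hypothetical; only the window of 600 candles and the label window of 200 candles come from the experiment.

```python
from statistics import mean


def adaptive_thresholds(predictions, num_candles=600, label_window=200):
    """Return (maxima_threshold, minima_threshold) from recent predictions."""
    recent = sorted(predictions[-num_candles:], reverse=True)
    # Expected number of maxima (and of minima) in the lookback period
    k = int(num_candles / label_window)
    return mean(recent[:k]), mean(recent[-k:])


def classify(prediction, maxima_threshold, minima_threshold):
    """Turn a real-valued prediction into a class using the thresholds."""
    if prediction > maxima_threshold:
        return "maximum"
    if prediction < minima_threshold:
        return "minimum"
    return "neither"


# Synthetic history: mostly flat predictions with a few pronounced spikes
history = [0.0] * 594 + [0.9, 0.8, 0.7, -0.6, -0.8, -1.0]
max_thr, min_thr = adaptive_thresholds(history)

for new_prediction in (0.85, 0.1, -0.9):
    print(new_prediction, "->", classify(new_prediction, max_thr, min_thr))
```

Because the thresholds are recomputed from recent predictions, a regressor whose outputs shrink or grow in magnitude over time is still classified relative to its own recent behavior.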

Outlier detection

As we touched upon in FreqAI — from price to prediction, our guide to feature engineering for algorithmic trading using machine learning, outlier detection is paramount to minimizing risk when using machine learning for algorithmic trading. In that article, we described a number of different techniques for outlier and novelty detection. For the experiment we are presenting here, we opted for the Dissimilarity Index — a custom metric available only in FreqAI.

The Dissimilarity Index (DI) aims to quantify the uncertainty associated with each prediction made by the model by comparing the incoming data used for the prediction to the training data. If the prediction data is far away from the training data, the model will not be able to properly assess it and the resulting prediction should not be acted upon.
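This article does not spell out the DI formula, so the sketch below assumes one plausible formulation: the distance from the prediction point to its nearest training point, divided by the average pairwise distance between training points. The function name and the toy data are hypothetical, and the real implementation may differ in its details.

```python
from itertools import combinations
from math import dist
from statistics import mean


def dissimilarity_index(point, training_points):
    """Distance to the nearest training point, relative to the typical
    spacing of the training data (ASSUMED formulation, see lead-in)."""
    avg_training_distance = mean(
        dist(a, b) for a, b in combinations(training_points, 2))
    nearest = min(dist(point, t) for t in training_points)
    return nearest / avg_training_distance


# Toy feature vectors; in practice these would be normalized features
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

print(dissimilarity_index((0.0, 0.0), training))    # inside the training data
print(dissimilarity_index((10.0, 10.0), training))  # far from anything seen
```

A value near zero means the model has seen very similar data before, while a large value signals that the prediction is an extrapolation and should be treated with suspicion.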

As with turning regressor predictions into binary decisions, the DI needs a threshold to be compared to in order to determine if the prediction data is close to the training data or not. Here, we fit the historical (previous 50 hours) DI data to a Weibull distribution and used the 0.999 quantile (the 99.9th percentile) as the threshold. The choice of cutoff value will affect how conservative the DI is, as it will allow more or less dissimilar prediction data to be acted upon.
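The thresholding step can be sketched without any dependencies. The code below is a stand-in, not the FreqAI implementation: it fits a two-parameter Weibull distribution by maximum likelihood using bisection and then evaluates the 0.999 quantile. In practice a library routine such as scipy.stats.weibull_min would likely be more convenient. The synthetic DI history is an assumption made purely for illustration.

```python
import math
import random


def fit_weibull(samples, iterations=100):
    """Maximum-likelihood fit of a two-parameter Weibull; returns (shape, scale)."""
    peak = max(samples)
    x = [s / peak for s in samples if s > 0]  # rescale to (0, 1] to avoid overflow
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # The root of this increasing function is the MLE of the shape
        powers = [v ** k for v in x]
        weighted = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
        return weighted - 1.0 / k - mean_log

    low, high = 0.01, 100.0  # assumed bracket for the shape parameter
    for _ in range(iterations):  # plain bisection
        mid = (low + high) / 2
        if score(mid) > 0:
            high = mid
        else:
            low = mid
    shape = (low + high) / 2
    scale = peak * (sum(v ** shape for v in x) / len(x)) ** (1 / shape)
    return shape, scale


def weibull_quantile(p, shape, scale):
    """Inverse CDF of the Weibull distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


# Synthetic DI history: 600 candles drawn from a Weibull(shape=2, scale=1)
random.seed(0)
di_history = [random.weibullvariate(1.0, 2.0) for _ in range(600)]

shape, scale = fit_weibull(di_history)
cutoff = weibull_quantile(0.999, shape, scale)

new_di = 4.0
print(f"cutoff = {cutoff:.2f}, outlier = {new_di > cutoff}")
```

Lowering the quantile (say, to 0.99) pulls the cutoff down and makes the filter more conservative, since more predictions are then flagged as too dissimilar to act upon.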

The cutoff is used to classify outliers as shown in the following figure:

Figure 2. Dissimilarity index (DI) and DI cutoff for outlier detection for BTC/USDT throughout the experiment, for each of the two regressors. The BTC/USDT close price is shown in the background to give some insight into the market conditions relative to the regressors' performance. We see how market regime changes (solid arrows) are identified by the DI crossing the DI cutoff (dashed arrows).

Trading strategy

The predictions from the regressors were used to determine whether to enter or exit a trade:
- If the regressor predicted that the incoming candle was a maximum or a minimum and it was not currently in a trade, it would enter a long (for a predicted minimum) or a short (for a predicted maximum).
- If the regressor predicted that the incoming candle was a maximum or a minimum and it was in a trade, it would exit a long if it predicted a maximum, or exit a short if it predicted a minimum.

The entry logic was defined in FreqAI using the populate_entry_trend() method in the strategy:

from functools import reduce

from pandas import DataFrame


def populate_entry_trend(self, df: DataFrame, metadata: dict) -> DataFrame:
    """
    Define the entry criteria for going long or short.
    """
    # Go long when a minimum is predicted and the DI accepts the prediction
    enter_long_conditions = [
        df["do_predict"] == 1,
        df["DI_catch"] == 1,
        df["&s-extrema"] < df["minima_sort_threshold"],
    ]

    if enter_long_conditions:
        df.loc[
            reduce(lambda x, y: x & y, enter_long_conditions),
            ["enter_long", "enter_tag"]
        ] = (1, "long")

    # Go short when a maximum is predicted and the DI accepts the prediction
    enter_short_conditions = [
        df["do_predict"] == 1,
        df["DI_catch"] == 1,
        df["&s-extrema"] > df["maxima_sort_threshold"],
    ]

    if enter_short_conditions:
        df.loc[
            reduce(lambda x, y: x & y, enter_short_conditions),
            ["enter_short", "enter_tag"]
        ] = (1, "short")

    return df

However, we also had other guardrails put in place to help improve the performance:
- In a previous test, we noticed that staying too long in a trade generally resulted in a low profit. Because of this, we limited the duration of trades to 24 hours, and any trade reaching this limit was exited.
- A stop loss of -4% was put in place to exit any trades that reached -4% profit.
- If the target calculation identified an extremum in the most recent candle and this was not already predicted by the regressor, an active long trade would be exited if the identified extremum was a minimum, and an active short trade would be exited if the identified extremum was a maximum.
- As discussed in the Outlier detection section above, the custom FreqAI outlier detection method, the Dissimilarity Index, was used to decide whether a predicted extremum would be disregarded or not.

These additional components were handled inside the custom_exit() method in the strategy:

from datetime import datetime, timedelta

from freqtrade.exchange import timeframe_to_prev_date
from freqtrade.persistence import Trade


def custom_exit(
    self, pair: str, trade: Trade, current_time: datetime,
    current_rate: float, current_profit: float, **kwargs
):
    """
    User defines custom trade exit criteria
    """
    dataframe, _ = self.dp.get_analyzed_dataframe(
        pair=pair, timeframe=self.timeframe)

    last_candle = dataframe.iloc[-1].squeeze()
    # Candle preceding the one in which the trade was opened
    trade_date = timeframe_to_prev_date(
        self.timeframe,
        (trade.open_date_utc - timedelta(minutes=int(self.timeframe[:-1])))
    )
    trade_candle = dataframe.loc[(dataframe["date"] == trade_date)]

    if trade_candle.empty:
        return None
    trade_candle = trade_candle.squeeze()

    entry_tag = trade.enter_tag

    # Trade duration in minutes; total_seconds() also counts full days,
    # whereas .seconds would wrap around after 24 hours
    trade_duration = (current_time - trade.open_date_utc).total_seconds() / 60

    if trade_duration > 1000:
        return "trade expired"

    # Exit if the incoming data is too dissimilar from the training data
    if last_candle["DI_catch"] == 0:
        return "Outlier detected"

    # A predicted minimum closes a short
    if (
        last_candle["&s-extrema"] < last_candle["minima_sort_threshold"]
        and entry_tag == "short"
    ):
        return "minima_detected_short"

    # A predicted maximum closes a long
    if (
        last_candle["&s-extrema"] > last_candle["maxima_sort_threshold"]
        and entry_tag == "long"
    ):
        return "maxima_detected_long"

Results

Model accuracy

Since we are dealing with an unbalanced classification problem, and we prioritize both negative and positive predictions (that is, we care just as much about our regressors not predicting an extremum when there is none as we do about them correctly predicting one), we assessed the performance of the models using the balanced accuracy score, the arithmetic mean of sensitivity and specificity:

balanced accuracy = (sensitivity + specificity) / 2

The balanced accuracy was calculated based on a sliding window of 600 candles (50 hours).
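A small example shows why the balanced accuracy is the better yardstick here. The sketch below is a binary simplification (extremum vs. no extremum) with hypothetical function names, and it assumes both classes are present in the window.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = extremum)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # share of real extrema that were found
    specificity = tn / (tn + fp)  # share of non-extrema correctly ignored
    return (sensitivity + specificity) / 2


def plain_accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# One window of 200 candles: 2 extrema, 198 ordinary candles
y_true = [1, 1] + [0] * 198
never_predicts = [0] * 200

print(plain_accuracy(y_true, never_predicts))     # 0.99
print(balanced_accuracy(y_true, never_predicts))  # 0.5
```

A model that never predicts an extremum scores 99% on plain accuracy but only 50% on balanced accuracy, which is the same as guessing.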

On top of the balanced accuracy, we also devised our own accuracy metric to address the fact that we are predicting targets in a time series and are therefore also interested in the temporal accuracy of the predictions.

Here, we are looking at the temporal difference between a prediction and its closest target, normalized to the sliding window of 200 candles (1000 minutes) that we used to identify the targets. A perfect match means we have a temporal accuracy of 1, whilst a prediction further away than 1000 minutes gives a negative value.
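One way to read this description is as a linear score, 1 minus the absolute time difference divided by the window length. The sketch below implements that reading. The linear form and the function name are assumptions reconstructed from the description, not the exact formula used in the experiment.

```python
def temporal_accuracy(prediction_minute, target_minutes, window_minutes=1000):
    """1 for a perfect match, 0 at one window length away, negative beyond.

    ASSUMPTION: the score falls off linearly with the time difference.
    """
    closest = min(target_minutes, key=lambda t: abs(t - prediction_minute))
    return 1 - abs(prediction_minute - closest) / window_minutes


targets = [500, 2000]  # minutes at which true extrema occurred (toy values)

print(temporal_accuracy(500, targets))   # perfect match
print(temporal_accuracy(750, targets))   # 250 minutes late
print(temporal_accuracy(3500, targets))  # 1500 minutes from the closest target
```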

The figure below shows the two accuracy metrics throughout the experiment for the XGBoost regressor, predicting extrema for BTC/USDT. The balanced accuracy was updated at each candle, whilst the temporal accuracy only updated when there was a predicted extremum.

Figure 3. Balanced (top) and temporal (bottom) accuracy scores for the XGBoost regressor predicting extrema for BTC/USDT. The sliding window of 50 hours used to calculate the balanced accuracy is indicated in the top plot by a shaded area. All regressors and coins were tracked similarly via our live dashboard.

From Table 1, we can see that the regressors performed similarly in terms of both balanced and temporal accuracy.

Table 1. Average plus/minus standard deviation of the model accuracy for the individual regressors.

With our target being defined as extrema points within a window of 1000 minutes, the models were trained to expect one minimum/maximum every 16.7 hours. As you can see from Table 2, the regressors predicted twice as many extrema as there were identified targets.

Table 2. Average plus/minus standard deviation of the ratio between predictions and targets for the individual regressors.

Resource usage

The resource usage (Table 3) shows that CatBoost was slower than XGBoost in both training and prediction. With new data incoming every 5 minutes, the models produced by CatBoost were not always trained on the most recently available data, since the regressor was too slow to complete the training in time. However, thanks to the parallelized architecture of FreqAI, there is always a model available for inference.

Table 3. Average plus/minus standard deviation of resource usage during the 3-week experiment for the individual regressors.

Profitability

Despite being given the exact same input features, the XGBoost and CatBoost regressors performed very differently in terms of profitability (Table 4). At the end of the experiment, both were in profit but XGBoost had clearly outperformed its competitor by ending up at a 7% profit compared to 2% for CatBoost.

Table 4. Average plus/minus standard deviation of close profit (% of stake amount), and final cumulative profit (% of total wallet) for the individual regressors.

Disruptive market events (some of which are indicated in Figure 4) clearly had effects on the profitability of the regressors. The regressors were trained on historical data, so when the market exhibited sudden changes that were not seen in the training data, the regressors performed poorly. This is, however, expected behavior, as no machine learning technique is able to predict behavior that has not been included in the data used for training. Instead, the key here is adaptability. Since the previously unseen data is incorporated in the next training of the regressors, they are imbued with new knowledge and should be able to better handle similar events occurring in the future. One clear example of this is event F in Figure 4: events A-E represent previously unseen or infrequent changes, and hence led to poor regressor performance. However, after enough of these events had been incorporated into the training data set, the regressors (especially XGBoost) managed to instead take advantage of event F.

Figure 4. Cumulative profit normalized to total wallet size throughout the experiment, for each of the two regressors. The BTC/USDT close price is shown in the background to give some insight into the market conditions relative to the regressors' performance. Keep in mind that the regressors traded on 19 coins each, many of which are only weakly correlated with BTC. Some disruptive market events (solid arrows), with the resulting profit behavior (dashed arrows), are indicated with labeled circles. The same market events were discussed regarding the Dissimilarity Index in Figure 2.

Conclusion

During the 3-week experiment, we tracked the balanced and temporal accuracies, and resource usage of each regressor for each coin, together with the profit for each regressor. Keep in mind, no hyper-parameter optimization was done to tune the regressors and so their performance in terms of accuracy and profit should only be interpreted as relative to each other. More than anything, this experiment is a proof-of-concept to show the potential of FreqAI for real-time adaptive modeling of streaming data.

Ongoing experiment

We are currently running a new experiment that you can check out by visiting our live dashboard. If you want to try to run your own bot, join our Discord server, where you will find a bunch of like-minded people to share your experience with.

DISCLAIMER: FreqAI is not affiliated with any cryptocurrency offerings. FreqAI is, and always will be, a not-for-profit, open-source project. FreqAI does not have a crypto token, FreqAI does not sell signals, and FreqAI does not have a domain besides the Freqtrade documentation. Please beware of imposter projects, and help us by reporting them to the official FreqAI Discord server.

References

1. Caulk, R. A., & others (2022). FreqAI: generalizing adaptive modeling for chaotic time-series market forecasts. Journal of Open Source Software, 7(80), 4864, https://doi.org/10.21105/joss.04864


Emergent Methods

A computational science company focused on applied machine learning for real-time adaptive modeling of dynamic systems.