Simple machine learning strategy


This is a slightly modified version of a machine learning strategy I found on the Quantopian forums; the changes are only what it needed to run properly. As you’ll see from backtesting, it doesn’t work particularly well, but I figured it would be of interest, and possibly a good starting point for building a decent machine learning strategy.

import math as _math
import numbers as _numbers
import numpy as np
import pandas as pd
import os
import tempfile
import time

from sklearn.ensemble import RandomForestRegressor
from catalyst import run_algorithm
from catalyst.api import (cancel_order, date_rules, get_open_orders, get_order,
                          order_target, order_target_percent, order_target_value,
                          record, schedule_function, symbol, time_rules)
from catalyst.exchange.utils.stats_utils import extract_transactions
from catalyst.utils.paths import ensure_directory
from logbook import Logger
from decimal import *

algo_namespace = 'BTCUSD'
log = Logger(algo_namespace)

def initialize(context):

    getcontext().prec = 8
    context.asset = symbol('btc_usd')  # attribute name assumed (usual Catalyst convention)
    context.candle_size = '1D'
    context.model = RandomForestRegressor()
    context.lookback = 7
    context.price = 'price'
    context.price_history = 180
    context.buy_pct = 1
    context.sell_pct = 0
    context.set_commission(maker=0.001, taker=0.002)
    context.start_time = time.time()
    context.current = 1

    schedule_function(rebalance, date_rules.every_day(), time_rules.market_close())

def rebalance(context, data):

    price = data.history(context.asset, context.price, context.price_history, context.candle_size)
    current_price = data.current(context.asset, context.price)
    pos_amount = context.portfolio.positions[context.asset].amount
    cash = context.portfolio.cash
    value = context.portfolio.portfolio_value

    price_change = np.diff(price.values).tolist()

    X = [] 
    Y = []

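    # Each X row is `lookback` consecutive price changes; Y is the change that follows.
    for i in range(context.price_history-context.lookback-context.current):
        X.append(price_change[i:i+context.lookback])
        Y.append(price_change[i+context.lookback])
    context.model.fit(X, Y)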

    if context.model:
        price = data.history(context.asset, context.price, context.lookback+context.current, context.candle_size)
        price_change = np.diff(price.values).tolist()
        prediction = context.model.predict([price_change])
        prediction = float(prediction[0])
        prediction_price = prediction + current_price
    # Zero thresholds assumed: buy on a predicted rise when flat, sell on a predicted drop when holding.
    if prediction > 0 and pos_amount == 0:
        order_target_percent(context.asset, context.buy_pct)
    if prediction < 0 and pos_amount > 0:
        order_target_percent(context.asset, context.sell_pct)

    record(cash=cash, price=current_price, prediction_price=prediction_price)

def analyze(context=None, perf=None):
    end = time.time()
    log.info('elapsed time: {}'.format(end - context.start_time))

    import matplotlib.pyplot as plt

    quote_currency = list(context.exchanges.values())[0].quote_currency.upper()

    ax1 = plt.subplot(611)
    perf.loc[:, 'portfolio_value'].plot(ax=ax1)

    ax2 = plt.subplot(612, sharex=ax1)
    perf.loc[:, ['price', 'prediction_price']].plot(ax=ax2, label='Price')

    ax2.set_ylabel('{asset}\n({base})'.format(asset=context.asset.symbol, base=quote_currency))

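    transaction_df = extract_transactions(perf)
    if not transaction_df.empty:
        buy_df = transaction_df[transaction_df['amount'] > 0]
        sell_df = transaction_df[transaction_df['amount'] < 0]
        # Mark buys and sells on the price chart (marker styles are assumed).
        ax2.scatter(
            buy_df.index.to_pydatetime(),
            perf.loc[buy_df.index.floor('1 min'), 'price'],
            marker='^', s=100, c='green', label='')
        ax2.scatter(
            sell_df.index.to_pydatetime(),
            perf.loc[sell_df.index.floor('1 min'), 'price'],
            marker='v', s=100, c='red', label='')

    plt.gcf().set_size_inches(18, 18)
    plt.show()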

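if __name__ == '__main__':
    # Backtest. Only start and end are certain; the other arguments are
    # assumptions to adjust for your own setup.
    run_algorithm(
        capital_base=10000,            # assumed starting capital
        data_frequency='daily',        # assumed, matches the '1D' candles
        initialize=initialize,
        analyze=analyze,
        exchange_name='bitfinex',      # assumed exchange
        algo_namespace=algo_namespace,
        quote_currency='usd',          # assumed from the btc_usd pair
        start=pd.to_datetime('2018-1-1', utc=True),
        end=pd.to_datetime('2018-7-22', utc=True),
    )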

Brilliant work, thanks for posting this. I’m struggling to integrate ML into my stat-arb strategy so this is a great help.


Thanks, let me know if you need any help. What I posted is basically garbage compared to what I’ve learned in the months since I started playing with ML.


Thanks for the offer. I’m coming at this with no previous ML experience but from what I’ve read I think a deep learning model seems like the best route forward, at least for an initial foray, based on articles I’ve found on the application of ML to statistical arbitrage.

Not sure what the rules are with links here but this article in particular

When it comes to providing the model with input data, is it best to go all out? For example, inter-asset cointegration stats & z-scores. Or is it better to feed in raw price & volume data and let the agent work it out for itself?


From what I understand, “more” data is better than “good” data. I try to use price change over a period rather than raw price, to make the data more stationary and possibly more familiar to the model. It obviously hasn’t mattered in the past year, but when we were constantly seeing new highs, the models I tested were going into uncharted territory and chose to sell, because they couldn’t predict a number higher than the past values they had been given. With something like a daily percent change, a 10-20% rise or drop isn’t necessarily unfamiliar.
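
To make the point about percent change concrete, here is a small sketch with made-up prices. The helper name pct_changes is hypothetical. The same relative moves at two very different price levels produce identical features, while the raw differences are a hundred times apart.

```python
def pct_changes(prices):
    """Fractional change between consecutive prices: (p[t] - p[t-1]) / p[t-1]."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def raw_changes(prices):
    """Absolute change between consecutive prices."""
    return [b - a for a, b in zip(prices, prices[1:])]


# Made-up series: +10% then -10%, at a low and a high price level.
low_regime = [100.0, 110.0, 99.0]
high_regime = [10000.0, 11000.0, 9900.0]

print(raw_changes(low_regime))   # [10.0, -11.0]
print(raw_changes(high_regime))  # [1000.0, -1100.0]
print(pct_changes(low_regime))   # [0.1, -0.1]
print(pct_changes(high_regime))  # [0.1, -0.1]
```

A tree-based model trained on the low regime has seen inputs like 0.1 before, but it has never seen a raw change of 1000, which is the uncharted-territory problem described above.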

I’d also play around somewhat with using volume * price instead of just volume. For example, the earliest prices for BTCUSD on Bitfinex that are available on Catalyst have volumes of around 60k, but in USD, that is equivalent to about 4200 BTC.
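
As a rough sketch of the volume * price idea, assuming the volume column is denominated in the base asset (if your data source already reports volume in the quote currency, this step is unnecessary). The helper name dollar_volume and the numbers are made up.

```python
def dollar_volume(prices, base_volumes):
    """Convert per-candle volume in the base asset to quote-currency volume."""
    return [price * volume for price, volume in zip(prices, base_volumes)]


# Made-up candles: the same 1,000-unit volume at very different price levels.
prices = [15.0, 6000.0]
base_volumes = [1000.0, 1000.0]
print(dollar_volume(prices, base_volumes))  # [15000.0, 6000000.0]
```

The two candles look identical by raw volume but differ by a factor of 400 in money traded, which is the distinction the feature is meant to capture.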