Does vanilla deep regression even work?

…because I can’t seem to get good results with it.

Regression vs. Classification

Why is regression so much less popular than classification in the machine learning community? It’s often just mentioned as an afterthought that you can change the output layer on a neural network to turn it into a regression model if you want to. Even Marcos Lopez de Prado, the recent face of machine learning in trading, seems to focus extensively (exclusively?) on classification in his popular 2018 book. But it seems obvious that we need regression in finance. We need to know the magnitude of our forecasts before we can optimize a portfolio.

I attended the Stanford Deep Learning School in the summer of 2016. During a break, I had a chance to chat with one of the speakers, Hugo Larochelle, and I asked him why I never hear about deep learning regression. He said maybe strongly varying errors cause problems during back-propagation, making it not work very well. Given the context, I was surprised he didn’t jump to the defense of deep learning.

I’m starting to think it just doesn’t work. Despite a broad hyper-parameter search, I’m still having poor luck on the multivariate interaction synthetic data. The MAE I get from the neural network (both in-sample and out-of-sample) is an order of magnitude larger than the out-of-sample MAE of XGBoost.

Universal Approximators

Hornik, Stinchcombe, and White (1989) showed that multilayer feedforward networks with as few as one hidden layer can approximate any Borel measurable function to any desired degree of accuracy, provided enough hidden units are available. “This implies that any lack of success in applications must arise from inadequate learning, insufficient numbers of hidden units or the lack of a deterministic relationship between input and target.”
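For reference, the continuous-function case of that result can be written roughly as follows. This is my paraphrase rather than the paper's exact wording, and the symbols ($K$, $\sigma$, $a_i$, $w_i$, $b_i$, $N$) are notation chosen here for illustration. Note that it is an existence statement: it says suitable weights exist, not that gradient descent will find them.

```latex
% Single-hidden-layer approximation, continuous case (paraphrased).
% Let $K \subset \mathbb{R}^n$ be compact, $f : K \to \mathbb{R}$ continuous,
% and $\sigma$ a squashing function (non-decreasing, with limits 0 and 1).
% Then for every $\varepsilon > 0$ there exist $N \in \mathbb{N}$,
% $a_i, b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^n$ such that
\[
  \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon .
\]
```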

The synthetic data is certainly deterministic, and I’ve experimented with far more nodes and layers than I imagine would be sufficient. So the learning must be inadequate in some way. It isn’t that the data has too much noise, because I get bad results even with the noise removed. A histogram of the output predictions shows that nearly all of them have the same value, which is likely an indication of network saturation.
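One way to probe the constant-prediction symptom is to measure how much an untrained network's output varies across inputs. The sketch below is not part of the original experiments: `prediction_spread` is a hypothetical helper, random Gaussian inputs stand in for the synthetic data, and the 40-layer, 200-node, Tanh configuration is copied from the settings listed in the code section. With weights drawn at a standard deviation of 0.01, each 200-node layer scales input-dependent variation by roughly 0.01 × √200 ≈ 0.14, so after 40 layers very little of the input may survive and the output would be set almost entirely by the biases. Whether this explains the poor results is an assumption, not something confirmed here; strictly speaking it would be a collapse of the signal rather than saturation.

```python
import torch
import torch.nn as nn


def prediction_spread(weight_std, layers=40, nodes=200, inputs=9, seed=0):
    """Std of an untrained network's outputs across random inputs.

    Hypothetical diagnostic helper. weight_std=None selects Xavier-uniform
    weights; otherwise weights are drawn from N(0, weight_std^2).
    """
    torch.manual_seed(seed)
    modules = []
    for i in range(layers):
        layer = nn.Linear(inputs if i == 0 else nodes, nodes)
        if weight_std is None:
            nn.init.xavier_uniform_(layer.weight)
        else:
            nn.init.normal_(layer.weight, mean=0.0, std=weight_std)
        modules += [layer, nn.Tanh()]
    modules.append(nn.Linear(nodes, 1))
    net = nn.Sequential(*modules)

    X = torch.randn(1024, inputs)  # stand-in inputs, not the real data set
    with torch.no_grad():
        return net(X).std().item()


spread_small = prediction_spread(0.01)   # the stdev used in the listing
spread_xavier = prediction_spread(None)  # Xavier-uniform for comparison
print('std=0.01 spread: %g' % spread_small)
print('xavier   spread: %g' % spread_xavier)
```

If the first number is orders of magnitude smaller than the second, the network starts out ignoring its inputs, and the gradients reaching the early layers are likely to be correspondingly tiny.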

PyTorch Code

import torch
import torch.nn as nn
from machine_learning.generate_data import multivariate_interaction

device = "cuda" if torch.cuda.is_available() else "cpu"

class Net(nn.Module):
    def __init__(self, options):
        super(Net, self).__init__()
        bias = True
        self.layers = nn.Sequential()
        for i in range(options['layers']):
            # the first hidden layer reads the raw features;
            # every later layer reads the previous layer's output
            if i == 0:
                inputs = options['inputs']
            else:
                inputs = options['nodes']
            layer = nn.Linear(inputs, options['nodes'], bias)
            nn.init.normal_(layer.weight, mean=0.0, std=options['stdev'])
            self.layers.append(layer)
            # look up the activation class by name, e.g. nn.Tanh
            self.layers.append(getattr(nn, options['activation'])())
        # single linear output unit for regression
        self.layers.append(nn.Linear(options['nodes'], 1, bias))

    def forward(self, X):
        return self.layers(X)

options = {}
options['inputs'] = 9
options['layers'] = 40 # tried lots of values
options['nodes'] = 200 # tried lots of values
options['activation'] = 'Tanh' # also tried: Sigmoid, ReLU, ELU, LeakyReLU
options['stdev'] = 0.01 # tried lots of values
options['init'] = 'normal_' # also tried: uniform_, xavier_uniform_

net = Net(options).to(device)

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
# also tried: Adam, Adadelta, AdamW, Adamax, ASGD, LBFGS,
#             NAdam, RAdam, RMSprop, Rprop

loss_func = nn.L1Loss() # also tried MSELoss

# generate synthetic data
m = 524288
data = multivariate_interaction(m)

X = torch.tensor(data['X_train'], dtype=torch.float).to(device)
y = torch.tensor(data['y_train'], dtype=torch.float).unsqueeze(1).to(device)

# full-batch training; the loop has no exit condition and runs until interrupted
last_loss = 1e6
while True:
    pred = net(X)
    loss_ = loss_func(pred, y)
    loss = loss_.item()
    if loss > last_loss:
        optimizer.param_groups[0]['lr'] /= 2 # reduce learning rate
        print('lr = %g' % optimizer.param_groups[0]['lr'])
    last_loss = loss
    optimizer.zero_grad()   # clear gradients from the previous step
    loss_.backward()        # backpropagation, compute gradients
    optimizer.step()        # apply gradients
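The manual learning-rate halving above can also be expressed with PyTorch's built-in `ReduceLROnPlateau` scheduler. The following is a minimal sketch under stated assumptions, not the code used for the experiments: the tiny two-feature data set (`y = x0 * x1`), the small network, and the `max_steps` and `min_lr` limits are all made up for illustration. One behavioral difference to keep in mind is that the scheduler compares each loss against the best loss seen so far, whereas the loop above compares against the previous step only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data (an assumption): a simple interaction target y = x0 * x1.
X = torch.randn(2048, 2)
y = (X[:, 0] * X[:, 1]).unsqueeze(1)

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
# factor=0.5 halves the learning rate; patience=0 reacts to the first
# step whose loss fails to improve on the best loss so far.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=0)
loss_func = nn.L1Loss()

max_steps = 200  # hard cap so the loop always terminates
min_lr = 1e-5    # stop once the learning rate has decayed this far
losses = []
for step in range(max_steps):
    loss = loss_func(net(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    scheduler.step(loss.item())  # may reduce the learning rate
    if optimizer.param_groups[0]['lr'] < min_lr:
        break

print('steps run: %d, final lr: %g, final MAE: %g'
      % (len(losses), optimizer.param_groups[0]['lr'], losses[-1]))
```

The explicit step cap and learning-rate floor give the loop a defined stopping point, which makes runs easier to compare across hyper-parameter settings.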


Neural network regression either doesn’t work in practice, or else requires black magic tuning that I haven’t figured out yet. Perhaps the answer is that I need a ridiculously large network, in which case both training and inference will take an unreasonable amount of time for trading applications. I will continue to study the saturation issue, but perhaps there is a reason I’m having trouble finding nontrivial internet examples that use regression instead of classification.

Update: Ayush Gupta appears to have had similar troubles with deep learning regression.