k-Nearest neighbors

KNN is simple. I’m interested in trying it because it seems like the way some traders think. In interesting situations, they think back to other stocks that were in a similar situation in the past and bet on a similar outcome. The trick is knowing what aspects of a situation are really useful for prediction.

In KNN, the main hyperparameters are the distance metric used to determine which points are nearest, and k, the number of nearest neighbors to consider. The prediction is just the average outcome of the k nearest neighbors. Some variants use a weighted average prediction, where the very closest neighbors get more weight.
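To make the prediction rule concrete, here is a minimal NumPy sketch of both the plain average and an inverse-distance weighted average. The function name knn_predict, the toy data, and the choice of inverse-distance weights are my own assumptions for illustration, not something taken from a library.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Predict the outcome for one query point from its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    if not weighted:
        # plain average of the neighbors' outcomes
        return train_y[nearest].mean()
    # inverse-distance weights; the small constant avoids division by zero
    weights = 1.0 / (dists[nearest] + 1e-12)
    return np.average(train_y[nearest], weights=weights)

# toy data (hypothetical): the outcome equals the single feature
toy_X = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_y = np.array([0.0, 1.0, 2.0, 10.0])

plain = knn_predict(toy_X, toy_y, np.array([0.4]), k=3)
weighted = knn_predict(toy_X, toy_y, np.array([0.4]), k=3, weighted=True)
print(plain, weighted)
```

On this toy data the weighted version lands closer to the query's true value of 0.4, because the neighbor at 2.0 counts for less than the two closer ones.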

Brute-force nearest-neighbor search requires calculating the distance from each query point to every point in the dataset, which is unreasonable for large datasets. Search algorithms that take shortcuts, such as tree-based indexes or approximate methods, are much faster, and they introduce their own set of hyperparameters that need to be specified and studied.
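In scikit-learn, the search strategy is chosen with the algorithm argument of KNeighborsRegressor, and leaf_size is the extra hyperparameter that comes with the tree-based options. As far as I understand it, the KD-tree and ball-tree searches there are exact rather than approximate, so leaf_size should change speed and memory use but not the predictions. The sketch below checks that on synthetic data; the data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
demo_X = rng.normal(size=(2000, 3))                       # synthetic features
demo_y = demo_X[:, 0] + rng.normal(scale=0.1, size=2000)  # synthetic outcome
query_X = rng.normal(size=(200, 3))

# exhaustive search: distance from each query to every training point
brute = KNeighborsRegressor(n_neighbors=25, algorithm='brute')
brute.fit(demo_X, demo_y)

# KD-tree search: skips regions that cannot contain a nearest neighbor
tree = KNeighborsRegressor(n_neighbors=25, algorithm='kd_tree', leaf_size=30)
tree.fit(demo_X, demo_y)

pred_brute = brute.predict(query_X)
pred_tree = tree.predict(query_X)

# both strategies should find the same neighbors, hence the same predictions
same = np.allclose(pred_brute, pred_tree)
print(same)
```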

KNN is non-parametric, in that it does not codify what it has learned into calibrated parameters. Instead, the entire dataset must be present to perform inference.

Sample code

The interesting bits from scikit-learn in Python:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

knn = KNeighborsRegressor(n_jobs=2)

# create a dictionary of all values we want to test for n_neighbors and leaf_size
param_grid = {}
param_grid['n_neighbors'] = np.arange(2000, 10001, 2000)
param_grid['leaf_size'] = np.arange(10, 31, 10)

# use grid search with 10-fold cross-validation to test every combination
knn_gscv = GridSearchCV(knn, param_grid, cv=10, n_jobs=4)

# fit model to data (train_X, train_y and test_X are prepared elsewhere)
knn_gscv.fit(train_X, train_y)

# predict in-sample and out-of-sample with the best model found
train_pred = knn_gscv.predict(train_X)
test_pred = knn_gscv.predict(test_X)
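After fitting, the grid search object keeps the winning combination and the cross-validated scores. Here is a small self-contained sketch of how to read them, using synthetic data and a much smaller grid than the one above; toy_X, toy_y and toy_grid are hypothetical stand-ins, not my real datasets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
toy_X = rng.normal(size=(300, 2))                        # synthetic features
toy_y = toy_X[:, 0] + rng.normal(scale=0.5, size=300)    # synthetic outcome

# small grid so the example runs quickly
toy_grid = {'n_neighbors': [5, 20, 50]}
toy_search = GridSearchCV(KNeighborsRegressor(), toy_grid, cv=5)
toy_search.fit(toy_X, toy_y)

# best hyperparameters and their mean cross-validated R^2
best_k = toy_search.best_params_['n_neighbors']
best_cv_r2 = toy_search.best_score_

# mean cross-validated score for every grid point, in grid order
cv_scores = toy_search.cv_results_['mean_test_score']
print(best_k, best_cv_r2, cv_scores)
```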

Dataset-1 model, all features

On dataset-1, even the training data FVR plot looks poor:

The out-of-sample results appear indistinguishable from noise:

Dataset-1 model, select features

KNN really gets messed up by the inclusion of poor features, since every feature contributes to the distance whether or not it is predictive. So the next test just uses the 5 features I ended up using in my handcrafted model. Here are the FVR plots:
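To illustrate why poor features hurt, here is a sketch on synthetic data (not dataset-1): two features carry the signal and thirty are pure noise. Everything in it, including the helper holdout_r2 and the sizes chosen, is an assumption made for the demonstration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
n = 1000
signal = rng.normal(size=(n, 2))    # features that drive the outcome
noise = rng.normal(size=(n, 30))    # features unrelated to the outcome
target = np.sin(signal[:, 0]) + signal[:, 1] + rng.normal(scale=0.1, size=n)

all_features = np.hstack([signal, noise])
split = 700  # first 700 rows for training, the rest held out

def holdout_r2(features):
    """Fit KNN on the training rows and return R^2 on the held-out rows."""
    model = KNeighborsRegressor(n_neighbors=10)
    model.fit(features[:split], target[:split])
    return model.score(features[split:], target[split:])

r2_select = holdout_r2(signal)        # only the useful features
r2_all = holdout_r2(all_features)     # useful features buried in noise
print(r2_select, r2_all)
```

With the noise columns included, the distances are dominated by dimensions that say nothing about the outcome, so the "nearest" neighbors are close to random and the held-out fit drops.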

Manhattan distance

KNN isn’t looking good so far. Let’s try using the Manhattan distance as our distance measure, instead of the Euclidean distance we’ve been using up until now.

I think this last plot looks better than the others, but still isn’t very good.
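For reference, here is a sketch of how the distance measure can be switched in scikit-learn. Passing metric='manhattan' and passing p=1 with the default Minkowski metric should be two spellings of the same thing. The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# the two distance measures on one pair of points
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
euclidean = np.sqrt(((a - b) ** 2).sum())   # straight-line distance
manhattan = np.abs(a - b).sum()             # sum of per-feature differences

rng = np.random.default_rng(7)
demo_X = rng.normal(size=(500, 4))                       # synthetic features
demo_y = demo_X[:, 0] + rng.normal(scale=0.1, size=500)  # synthetic outcome

# Manhattan distance requested by name
knn_l1 = KNeighborsRegressor(n_neighbors=15, metric='manhattan')
knn_l1.fit(demo_X, demo_y)

# Manhattan distance requested as Minkowski with p=1
knn_p1 = KNeighborsRegressor(n_neighbors=15, p=1)
knn_p1.fit(demo_X, demo_y)

pred_l1 = knn_l1.predict(demo_X[:50])
pred_p1 = knn_p1.predict(demo_X[:50])
print(euclidean, manhattan, np.allclose(pred_l1, pred_p1))
```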