Building a Neural Movie Recommender System

Despite its name (and the original purpose), timeserio is a general-purpose tool for rapid model development. In this example, we use it to train a state-of-the-art movie recommender system in a few lines of code.


The MovieLens dataset is commonly used to benchmark recommender systems - see

Our task is to learn to predict how user \(u\) would rate a movie \(m\) (\(r_{um}\)) based on an available dataset of ratings \(r_{ij}\). Importantly, each user has only given ratings to some of the movies, and each movie has only been rated by some of the users. This is a classic example of transfer learning, commonly known as collaborative filtering in the context of recommender systems.
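To make the sparsity concrete, here is a minimal sketch using a made-up toy ratings table (the values are illustrative and not taken from MovieLens). Pivoting the long-format table into a user × movie matrix leaves NaN wherever a user has not rated a movie; those are the entries a recommender has to predict.

```python
import pandas as pd

# Toy ratings in the same long format as MovieLens (values are made up).
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'movie_id': [10, 20, 10, 20, 30],
    'rating':   [4, 5, 3, 2, 5],
})

# Rows are users, columns are movies; unrated pairs become NaN.
rating_matrix = toy_ratings.pivot(index='user_id', columns='movie_id', values='rating')
n_missing = int(rating_matrix.isna().sum().sum())
print(rating_matrix)
print(f"{n_missing} of {rating_matrix.size} entries are unknown")
```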


We make use of keras to define three models:

  • a user embedder that learns to represent each user’s preference as a vector

  • a movie embedder that learns to represent each movie as a vector

  • a rating model that concatenates user and movie embedding networks, and applies a dense neural network to predict a (non-negative) rating
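The rating model described above amounts to two table lookups followed by a small dense network. The following NumPy sketch spells out that forward pass with randomly initialised weights. The sizes and variable names are illustrative assumptions, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, emb_dim, hidden = 5, 7, 2, 8  # illustrative sizes

# Embedding tables: one row (vector) per user / movie.
user_table = rng.normal(size=(n_users, emb_dim))
movie_table = rng.normal(size=(n_movies, emb_dim))
# Dense layers: a hidden ReLU layer, then a single linear output unit.
W1, b1 = rng.normal(size=(2 * emb_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

def predict_rating(user_ids, movie_ids):
    """Look up both embeddings, concatenate them, apply the dense layers."""
    x = np.concatenate([user_table[user_ids], movie_table[movie_ids]], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

ratings = predict_rating(np.array([0, 3]), np.array([2, 6]))
print(ratings.shape)  # (2, 1)
```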

By wrapping the three models in a multinetwork (of the MultiNetworkBase class), we can for example:

  • train the rating model end-to-end, then use one of the embedding models

  • freeze one or both of the embedding models and re-train the dense layers, or

  • freeze the dense layers, and re-train embeddings for new users only

To make our job even simpler, we further wrap our multinetwork in a MultiModel class, which allows us to take data directly from pandas DataFrames, and apply pre-processing pipelines if needed.

import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

Download data

First, we download the freely available dataset and define a few helper functions for importing data.

!mkdir -p datasets; cd datasets; wget; unzip -o; rm
def get_ratings(part=''):
    """Return a DataFrame of user-movie ratings."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', part), header=None, sep='\t',
        names=['user_id', 'item_id', 'rating', 'timestamp'],
    ).rename(columns={'item_id': 'movie_id'})

def get_users():
    """Return a DataFrame of all users."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', 'u.user'), header=None, sep='|',
        names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
    ).rename(columns={'item_id': 'movie_id'})

ITEM_PROPS = ['movie_id', 'movie_title', 'video_release_date', 'unknown', 'IMDb_URL']
GENRES = ['Action', 'Adventure', 'Animation',
          'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']

def get_movies():
    """Return a DataFrame of all movies."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', 'u.item'), header=None, index_col=False, sep='|', encoding="iso-8859-1",
        names=ITEM_PROPS + GENRES,
    )

   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116

   user_id  age  gender  occupation  zip_code
0        1   24       M  technician     85711
1        2   53       F       other     94043
2        3   23       M      writer     32067

   movie_id        movie_title  video_release_date  unknown  IMDb_URL  Action  Adventure  Animation  Childrens  Comedy  ...  Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0         1   Toy Story (1995)         01-Jan-1995      NaN                 0          0          0          1       1  ...        0          0       0        0        0        0       0         0    0        0
1         2   GoldenEye (1995)         01-Jan-1995      NaN                 0          1          1          0       0  ...        0          0       0        0        0        0       0         0    1        0
2         3  Four Rooms (1995)         01-Jan-1995      NaN                 0          0          0          0       0  ...        0          0       0        0        0        0       0         0    1        0

3 rows × 23 columns

Define the model architecture

We start by defining the network architecture. All we need to do is sub-class MultiNetworkBase, and define the _model method.

  • keyword arguments to _model are used to parametrise our network architecture, e.g. by specifying a settable number of neurons or layers

  • the _model method is expected to return a dictionary of keras.models.Model objects.

from keras.layers import Input, Embedding, Dense, Concatenate, Flatten
from keras.models import Model
Using TensorFlow backend.
from timeserio.keras.multinetwork import MultiNetworkBase

class MovieLensNetwork(MultiNetworkBase):
    def _model(self,
               user_dim=2, item_dim=2, max_user=10000, max_item=10000,
               hidden=8):
        user_input = Input(shape=(1,), name='user')
        item_input = Input(shape=(1,), name='movie')
        user_emb = Flatten(name='flatten_user')(Embedding(max_user, user_dim, name='embed_user')(user_input))
        item_emb = Flatten(name='flatten_movie')(Embedding(max_item, item_dim, name='embed_movie')(item_input))
        output = Concatenate(name='concatenate')([user_emb, item_emb])
        output = Dense(hidden, activation='relu', name='dense')(output)
        output = Dense(1, name='rating')(output)

        user_model = Model(user_input, user_emb)
        item_model = Model(item_input, item_emb)
        rating_model = Model([user_input, item_input], output)
        rating_model.compile(optimizer='Adam', loss='mse', metrics=['mae'])

        return {'user': user_model, 'movie': item_model, 'rating': rating_model}

The three models are initialized on-demand, e.g. when we access multinetwork.model. Note that the inputs and embedding layers are shared, and therefore changes made to e.g. the user model are instantly available in the rating model.
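Weight sharing of this kind can be illustrated without Keras. In the sketch below, two plain Python functions stand in for the user model and the rating model (both are hypothetical simplifications). Both read from the same NumPy array, so an update to that array is visible through either.

```python
import numpy as np

# One embedding table shared by two "models" (stand-ins for Keras models).
shared_user_table = np.zeros((4, 2))

def user_model(user_ids):
    """Return the embedding vector for each user id."""
    return shared_user_table[user_ids]

def rating_model(user_ids):
    """Toy rating head: sum of the embedding components."""
    return user_model(user_ids).sum(axis=1)

before = rating_model(np.array([1]))
shared_user_table[1] = [0.5, 1.5]  # e.g. the result of a training step on the user model
after = rating_model(np.array([1]))
print(before, after)
```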

multinetwork = MovieLensNetwork()
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils.layer_utils import print_summary
SVG(model_to_dot(multinetwork.model['user'], rankdir='LR').create(prog='dot', format='svg'))
SVG(model_to_dot(multinetwork.model['movie'], rankdir='LR').create(prog='dot', format='svg'))
SVG(model_to_dot(multinetwork.model['rating']).create(prog='dot', format='svg'))
print_summary(multinetwork.model['rating'])
Layer (type)                    Output Shape         Param #     Connected to
user (InputLayer)               (None, 1)            0
movie (InputLayer)              (None, 1)            0
embed_user (Embedding)          (None, 1, 2)         20000       user[0][0]
embed_movie (Embedding)         (None, 1, 2)         20000       movie[0][0]
flatten_user (Flatten)          (None, 2)            0           embed_user[0][0]
flatten_movie (Flatten)         (None, 2)            0           embed_movie[0][0]
concatenate (Concatenate)       (None, 4)            0           flatten_user[0][0]
dense (Dense)                   (None, 8)            40          concatenate[0][0]
rating (Dense)                  (None, 1)            9           dense[0][0]
Total params: 40,049
Trainable params: 40,049
Non-trainable params: 0
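
The parameter counts in the summary can be checked by hand: an embedding layer stores one vector per index, and a dense layer stores a weight matrix plus one bias per unit. A short calculation using the default arguments of _model and the hidden size of 8 shown in the summary:

```python
max_user, max_item = 10000, 10000
user_dim, item_dim, hidden = 2, 2, 8

embed_user = max_user * user_dim                  # 20000
embed_movie = max_item * item_dim                 # 20000
dense = (user_dim + item_dim) * hidden + hidden   # 32 weights + 8 biases = 40
rating = hidden * 1 + 1                           # 8 weights + 1 bias = 9

total = embed_user + embed_movie + dense + rating
print(total)  # 40049
```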

From Multinetwork to Multimodel

We can train a specific model by using its name. Note that we must provide numpy feature arrays to each input, and also an array of training labels:

multinetwork.fit([X_user, X_movie], y_rating, model='rating')

In our case, we could simply write X_user = df["user_id"].values etc.
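
Spelled out for all three arrays, the manual approach could look as follows. The toy frame reuses the three rating rows displayed earlier; the variable names match the ones used in the text.

```python
import pandas as pd

# Toy ratings frame with the same columns that get_ratings() returns.
df = pd.DataFrame({
    'user_id': [196, 186, 22],
    'movie_id': [242, 302, 377],
    'rating': [3, 3, 1],
})

# One feature array per model input, plus one array of labels.
X_user = df['user_id'].values
X_movie = df['movie_id'].values
y_rating = df['rating'].values
print(X_user.shape, X_movie.shape, y_rating.shape)  # (3,) (3,) (3,)
```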

However, we prefer different models to be fed from one data source, typically a pandas.DataFrame, with any details of feature pre-processing or input ordering taken care of by encapsulated pipelines, providing an interface of the form:

multimodel.fit(df=df, model='rating')

Let’s work through the necessary steps - these may seem trivial for a simple problem, but save a lot of headaches when developing and deploying complex models.

Define individual pipelines

We start by defining a pipeline (a scikit-learn transformer) for each of the model inputs and labels:

from timeserio.preprocessing import PandasValueSelector

user_pipe = PandasValueSelector('user_id')
item_pipe = PandasValueSelector('movie_id')
rating_pipe = PandasValueSelector('rating')

Group the pipelines in a MultiPipeline

The MultiPipeline object provides a container for all the pipelines, with convenience features such as easy parameter access. All we need is to provide a name for each pipeline:

from timeserio.pipeline import MultiPipeline

multipipeline = MultiPipeline({
    'user_pipe': user_pipe,
    'movie_pipe': item_pipe,
    'rating_pipe': rating_pipe,
})

Connect pipelines to models

To finish the plumbing exercise, we specify which pipeline connects to each input or output of each model using a manifold.

Each key-value pair in the manifold has the form model_name: (input_pipes, output_pipes), where input_pipes is either a single pipe name, or a list of pipe names (one per input). Similarly, output_pipes holds one or more pipe names, one per output of the model - we use None for models that we do not intend to train using supervised labels.

manifold = {
    'user': ('user_pipe', None),
    'movie': ('movie_pipe', None),
    'rating': (['user_pipe', 'movie_pipe'], 'rating_pipe')
}
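
To illustrate how a manifold of this shape can be turned into model inputs and labels, here is a minimal sketch. The helper resolve and the lambda "pipes" are hypothetical stand-ins written for this example and are not part of timeserio; the helper only handles a single output pipe name.

```python
import pandas as pd

# Stand-ins for fitted pipelines: each maps a DataFrame to an array.
pipes = {
    'user_pipe': lambda df: df['user_id'].values,
    'movie_pipe': lambda df: df['movie_id'].values,
    'rating_pipe': lambda df: df['rating'].values,
}
manifold = {
    'user': ('user_pipe', None),
    'movie': ('movie_pipe', None),
    'rating': (['user_pipe', 'movie_pipe'], 'rating_pipe'),
}

def resolve(df, model):
    """Hypothetical helper: build (inputs, labels) for one model from a DataFrame."""
    input_pipes, output_pipe = manifold[model]
    if isinstance(input_pipes, str):
        input_pipes = [input_pipes]  # a single pipe name means a single input
    x = [pipes[name](df) for name in input_pipes]
    y = None if output_pipe is None else pipes[output_pipe](df)
    return x, y

toy = pd.DataFrame({'user_id': [1, 2], 'movie_id': [10, 20], 'rating': [4, 5]})
x, y = resolve(toy, 'rating')
print(len(x), y)
```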

Put it all together

The MultiModel holds all three parts:

  • the multinetwork specifies the model architectures, and also training parameters and callbacks

  • the multipipeline specifies the feature processing pipelines

  • the manifold specifies which pipeline is plumbed to which input (or output) of which neural network model

from timeserio.multimodel import MultiModel

multimodel = MultiModel(
    multinetwork=multinetwork,
    multipipeline=multipipeline,
    manifold=manifold,
)

Fit the MultiModel

We load one train-test split and fit our neural recommender system.

df_train = get_ratings('u1.base')
df_val = get_ratings('u1.test')
len(df_train), len(df_val)
(80000, 20000)
from kerashistoryplot.callbacks import PlotHistory
# Note: `PlotHistory` callback is rather slow
multimodel.fit(
    df=df_train, model='rating', validation_data=df_val,
    batch_size=4096, epochs=50,
    callbacks=[PlotHistory(batches=True, n_cols=2, figsize=(15, 8))]
)
<timeserio.keras.callbacks.HistoryLogger at 0x13d5189e8>

The multimodel provides all the familiar methods such as fit, predict, or evaluate:

mse, mae = multimodel.evaluate(df_val, model="rating")
print(f"MSE: {mse}, RMSE: {np.sqrt(mse)}, MAE: {mae}")
MSE: 0.9139011481761933, RMSE: 0.9559817718849002, MAE: 0.7554579020500183
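
The two numbers returned by evaluate correspond to the loss (mse) and the metric (mae) that the rating model was compiled with. As a reminder of what they measure, here is a small NumPy sketch on made-up ratings and predictions:

```python
import numpy as np

y_true = np.array([3.0, 3.0, 1.0, 5.0])  # made-up ratings
y_pred = np.array([2.5, 3.5, 2.0, 4.0])  # made-up predictions

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in rating units
mae = np.mean(np.abs(errors))   # mean absolute error
print(f"MSE: {mse}, RMSE: {rmse}, MAE: {mae}")
```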

Cross-Validate our approach

To evaluate how well our recommender system performs, we run 5-fold cross-validation and compare the scores with established benchmarks.

from sklearn.metrics import mean_absolute_error, mean_squared_error
folds = [1, 2, 3, 4, 5]
folds_mse = []
folds_rmse = []
folds_mae = []
for fold in tqdm(folds, total=len(folds)):
    df_train = get_ratings(f'u{fold}.base')
    df_val = get_ratings(f'u{fold}.test')
    multimodel.fit(
        df=df_train, model='rating', validation_data=df_val,
        batch_size=4096, epochs=50, verbose=0,
    )
    y_pred = multimodel.predict(df=df_val, model='rating')
    mse = mean_squared_error(df_val['rating'], y_pred)
    mae = mean_absolute_error(df_val['rating'], y_pred)
    # Collect the per-fold scores for the summary below
    folds_mse.append(mse)
    folds_rmse.append(np.sqrt(mse))
    folds_mae.append(mae)
print(
    f"5-fold Cross-Validation results: \n"
    f"RMSE: {np.mean(folds_rmse):.2f} ± {np.std(folds_rmse):.2f} \n"
    f"MAE: {np.mean(folds_mae):.2f} ± {np.std(folds_mae):.2f} \n"
)
5-fold Cross-Validation results:
RMSE: 0.94 ± 0.01
MAE: 0.74 ± 0.00

Benchmarks for some modern algorithms can be seen e.g. at or - our approach is in fact competitive with the state of the art before any tuning! By using dense embeddings, we did not use any user features such as gender or age - all we need is to learn their preference embedding as part of our end-to-end model.

We are now free to experiment with embedding dimensions for users and movies, or tweak the dense layers.

Using multiple models

We now use a trained MultiModel to inspect the embeddings. Because we defined user and movie embedders as independent models, we can simply call .predict(..., model=...) with different model names.

User embeddings

user_df = get_users()
embeddings = multimodel.predict(user_df, model='user')
user_df['emb_0'] = embeddings[:, 0]
user_df['emb_1'] = embeddings[:, 1]
sns.scatterplot(x='emb_0', y='emb_1', hue='gender', size='age', data=user_df)
<matplotlib.axes._subplots.AxesSubplot at 0x143dac320>

And the movie embeddings…

movie_df = get_movies()
embeddings = multimodel.predict(movie_df, model='movie')
movie_df['emb_0'] = embeddings[:, 0]
movie_df['emb_1'] = embeddings[:, 1]

Out of curiosity, we can compute mean embeddings for movies tagged with each genre.

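# Collect one row per genre, then build the DataFrame in one go
# (DataFrame.append was removed in pandas 2.0)
genre_rows = []
for genre in GENRES:
    mean = movie_df.loc[movie_df[genre] == 1, ['emb_0', 'emb_1']].mean()
    genre_rows.append({'genre': genre, 'emb_0': mean['emb_0'], 'emb_1': mean['emb_1']})
genre_df = pd.DataFrame(genre_rows)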

Movie and Genre embeddings

fig, axes = plt.subplots(ncols=2, figsize=(20, 8))
sns.scatterplot(x='emb_0', y='emb_1', data=movie_df, ax=axes[0], palette='bright')
sns.scatterplot(x='emb_0', y='emb_1', hue='genre', data=genre_df, ax=axes[1], palette='bright')
<matplotlib.axes._subplots.AxesSubplot at 0x1443e4f98>

We can even consider similarity between genres by:

  • computing centroid for each genre

  • performing hierarchical clustering on genre centroids

  • plotting the distance matrix with a fancy colour scheme

from sklearn.metrics import pairwise_distances
from scipy.cluster import hierarchy
X = genre_df[['emb_0', 'emb_1']].values
Z = hierarchy.linkage(X)
order = hierarchy.leaves_list(hierarchy.optimal_leaf_ordering(Z, X))
genre_df_ordered = genre_df.iloc[order]
embs_ord = genre_df_ordered[['emb_0', 'emb_1']].values
dist_ord = pairwise_distances(embs_ord)
genres_ord = genre_df_ordered['genre'].values
sns.heatmap(dist_ord, xticklabels=genres_ord, yticklabels=genres_ord, cmap='plasma_r');
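
The matrix passed to the heatmap holds the Euclidean distance between every pair of genre centroids (Euclidean is the default metric of pairwise_distances). As a sketch of that computation, the same quantity can be written with NumPy broadcasting; the three points below are made up.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # made-up 2-D centroids

# diff[i, j] is the vector from point j to point i.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist)
```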

We see that the two genres furthest apart are Horror and Musical, while Romance and Mystery or Crime and Adventure evoke similar rating patterns!

Freezing and partial updating

Finally, we mention another key advantage of the MultiModel approach: partial re-training.

Imagine we have a powerful production system, but new users register with our service every day.

We don’t want to re-train the full model, only the embeddings for new users. This is trivial:
multimodel.fit(
    df=df_new, model='rating', **kwargs
)

This will ensure that only the user embeddings are updated (and only for users present in df_new), while dense layer weights and movie embeddings remain frozen.
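
To see what such a partial update means numerically, the sketch below performs gradient steps on a toy dot-product model while treating the movie embeddings as frozen. It is a hand-rolled NumPy illustration under simplifying assumptions (no dense layers, squared-error loss, hypothetical sizes and learning rate), not timeserio's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
user_table = rng.normal(size=(4, 2))    # user 3 plays the role of the new user
movie_table = rng.normal(size=(5, 2))   # treated as frozen
user_before = user_table.copy()
movie_before = movie_table.copy()

# New ratings as (user_id, movie_id, rating), all from user 3 (made up).
new_ratings = [(3, 0, 4.0), (3, 2, 1.0)]
learning_rate = 0.1

for u, m, r in new_ratings:
    error = user_table[u] @ movie_table[m] - r
    # Gradient of 0.5 * error**2 with respect to the user vector only.
    user_table[u] -= learning_rate * error * movie_table[m]

changed_users = np.where((user_table != user_before).any(axis=1))[0]
print(changed_users)
```

Only the row belonging to the user with new ratings is touched; every other user vector and the whole movie table keep their previous values.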