This page was generated from docs/source/examples/MovieLens100k.ipynb.

Building a Neural Movie Recommender System

Despite its name (and the original purpose), timeserio is a general-purpose tool for rapid model development. In this example, we use it to train a state-of-the-art movie recommender system in a few lines of code.

Dataset

The MovieLens dataset is commonly used to benchmark recommender systems - see https://grouplens.org/datasets/movielens/100k/.

Our task is to learn to predict how user \(u\) would rate a movie \(m\) (\(r_{um}\)) based on an available dataset of ratings \(r_{ij}\). Importantly, each user has only given ratings to some of the movies, and each movie has only been rated by some of the users. This is a classic example of transfer learning, commonly known as collaborative filtering in the context of recommender systems.

Introduction

We make use of keras to define three models: - a user embedder that learns to represent each user’s preference as a vector - a movie embedder that learns to represent each movie as a vector - a rating model that concatenates user and movie embedding networks, and applies a dense neural network to predict a (non-negative) rating

By wrapping the three models in a multinetwork (of the MultiNetworkBase class), we can for example - train the rating model end-to-end, then use one of the embedding models - freeze one or both of the embedding models and re-train the dense layers, or - freeze the dense layers, and re-train embeddings for new users only

To make our job even simpler, we further wrap our multinetwork in a MultiModel class, which allows us to take data directly from pandas DataFrames, and apply pre-processing pipelines if needed.

[73]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
plt.xkcd();

Download data

First, we download the freely available dataset and define a few helper functions for importing data.

[21]:
!mkdir -p datasets; cd datasets; wget http://files.grouplens.org/datasets/movielens/ml-100k.zip; unzip -o ml-100k.zip; rm ml-100k.zip
--2019-06-22 14:53:38--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’

ml-100k.zip         100%[===================>]   4.70M  2.00MB/s    in 2.3s

2019-06-22 14:53:41 (2.00 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl
  inflating: ml-100k/mku.sh
  inflating: ml-100k/README
  inflating: ml-100k/u.data
  inflating: ml-100k/u.genre
  inflating: ml-100k/u.info
  inflating: ml-100k/u.item
  inflating: ml-100k/u.occupation
  inflating: ml-100k/u.user
  inflating: ml-100k/u1.base
  inflating: ml-100k/u1.test
  inflating: ml-100k/u2.base
  inflating: ml-100k/u2.test
  inflating: ml-100k/u3.base
  inflating: ml-100k/u3.test
  inflating: ml-100k/u4.base
  inflating: ml-100k/u4.test
  inflating: ml-100k/u5.base
  inflating: ml-100k/u5.test
  inflating: ml-100k/ua.base
  inflating: ml-100k/ua.test
  inflating: ml-100k/ub.base
  inflating: ml-100k/ub.test
[22]:
def get_ratings(part='u.data'):
    """Return a DataFrame of user-movie ratings."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', part), header=None, sep='\t',
        names=['user_id', 'item_id', 'rating', 'timestamp'],
    ).rename(columns={'item_id': 'movie_id'})

def get_users():
    """Return a DataFrame of all users."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', 'u.user'), header=None, sep='|',
        names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
    ).rename(columns={'item_id': 'movie_id'})

ITEM_PROPS = ['movie_id', 'movie_title', 'video_release_date', 'unknown', 'IMDb_URL']
GENRES = ['Action', 'Adventure', 'Animation',
          'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']

def get_movies():
    """Return a DataFrame of all movies."""
    return pd.read_csv(
        os.path.join('datasets/ml-100k', 'u.item'), header=None, index_col=False, sep='|', encoding="iso-8859-1",
        names=ITEM_PROPS + GENRES,
    )
[23]:
get_ratings().head(3)
[23]:
user_id movie_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
[24]:
get_users().head(3)
[24]:
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
[25]:
get_movies().head(3)
[25]:
movie_id movie_title video_release_date unknown IMDb_URL Action Adventure Animation Childrens Comedy ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 0 0
1 2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(... 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%... 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

3 rows × 23 columns

Define the model architecture

We start by defining the network architecture. All we need to do is sub-class MultiNetworkBase, and define the _model method.

  • keyword arguments to _model are used to parametrise our network architecture, e.g. by specifying a settable number of neurons or layers

  • the _model method is expected to return a dictionary of keras.models.Model objects.

[26]:
from keras.layers import Input, Embedding, Dense, Concatenate, Flatten
from keras.models import Model
Using TensorFlow backend.
[36]:
from timeserio.keras.multinetwork import MultiNetworkBase

class MovieLensNetwork(MultiNetworkBase):
    def _model(self,
               user_dim=2, item_dim=2, max_user=10000, max_item=10000,
               hidden=8):
        user_input = Input(shape=(1,), name='user')
        item_input = Input(shape=(1,), name='movie')
        user_emb = Flatten(name='flatten_user')(Embedding(max_user, user_dim, name='embed_user')(user_input))
        item_emb = Flatten(name='flatten_movie')(Embedding(max_item, item_dim, name='embed_movie')(item_input))
        output = Concatenate(name='concatenate')([user_emb, item_emb])
        output = Dense(hidden, activation='relu', name='dense')(output)
        output = Dense(1, name='rating')(output)

        user_model = Model(user_input, user_emb)
        item_model = Model(item_input, item_emb)
        rating_model = Model([user_input, item_input], output)
        rating_model.compile(optimizer='Adam', loss='mse', metrics=['mae'])

        return {'user': user_model, 'movie': item_model, 'rating': rating_model}

The three models are initialized on-demand, e.g. when we access multinetwork.model. Note that the inputs and embedding layers are shared, and therefore changes made to e.g. the user model are instantly available in the rating model.

[44]:
multinetwork = MovieLensNetwork()
multinetwork.model
[44]:
{'user': <keras.engine.training.Model at 0x139c6c438>,
 'movie': <keras.engine.training.Model at 0x139c8eac8>,
 'rating': <keras.engine.training.Model at 0x139c8e5c0>}
[45]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils.layer_utils import print_summary
[46]:
SVG(model_to_dot(multinetwork.model['user'], rankdir='LR').create(prog='dot', format='svg'))
[46]:
../_images/examples_MovieLens100k_14_0.svg
[47]:
SVG(model_to_dot(multinetwork.model['movie'], rankdir='LR').create(prog='dot', format='svg'))
[47]:
../_images/examples_MovieLens100k_15_0.svg
[48]:
SVG(model_to_dot(multinetwork.model['rating']).create(prog='dot', format='svg'))
[48]:
../_images/examples_MovieLens100k_16_0.svg
[49]:
print_summary(multinetwork.model['rating'])
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
user (InputLayer)               (None, 1)            0
__________________________________________________________________________________________________
movie (InputLayer)              (None, 1)            0
__________________________________________________________________________________________________
embed_user (Embedding)          (None, 1, 2)         20000       user[0][0]
__________________________________________________________________________________________________
embed_movie (Embedding)         (None, 1, 2)         20000       movie[0][0]
__________________________________________________________________________________________________
flatten_user (Flatten)          (None, 2)            0           embed_user[0][0]
__________________________________________________________________________________________________
flatten_movie (Flatten)         (None, 2)            0           embed_movie[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 4)            0           flatten_user[0][0]
                                                                 flatten_movie[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 8)            40          concatenate[0][0]
__________________________________________________________________________________________________
rating (Dense)                  (None, 1)            9           dense[0][0]
==================================================================================================
Total params: 40,049
Trainable params: 40,049
Non-trainable params: 0
__________________________________________________________________________________________________

From Multinetwork to Multimodel

We can train a specific model by using its name. Note that we must provide numpy feature arrays to each input, and also an array of training labels:

multinetwork.fit([X_user, X_movie], y_rating, model='rating')

In our case, we could simply write X_user = df["user_id"].values etc.

However, we prefer different models to be fed from one data source, typically a pandas.DataFrame, with any details of feature pre-processing, or input ordering, taken care of by encapsulated pipelines, providing an interface of the form

multimodel.fit(df, model='rating')

Let’s work through the necessary steps - these may seem trivial for a simple problem, but save a lot of headaches when developing and deploying complex models.

Define individual pipelines

We start by defining a pipeline (a scikit-learn transformer) for each of the model inputs and labels:

[52]:
from timeserio.preprocessing import PandasValueSelector

user_pipe = PandasValueSelector('user_id')
item_pipe = PandasValueSelector('movie_id')
rating_pipe = PandasValueSelector('rating')

Group the pipelines in a MultiPipeline

The MultiPipeline object provides a container for all the pipelines, with convenience features such as easy parameter accesss. All we need is to provide a name for each pipeline:

[53]:
from timeserio.pipeline import MultiPipeline

multipipeline = MultiPipeline({
    'user_pipe': user_pipe,
    'movie_pipe': item_pipe,
    'rating_pipe': rating_pipe,
})

Connect pipelines to models

To finish the plumbing exercise, we specify which pipeline connects to each input or output of each model using a manifold.

Each key-value in the manifold has the form model_name: (input_pipes, output_pipes), where input_pipes is either a single pipe name, or a list of pipe names (one per input). Similarly, the output_pipe will have one ore more pipe names, one per output of the model - we use None for models that we do not intend to train using supervised labels.

[54]:
manifold = {
    'user': ('user_pipe', None),
    'movie': ('movie_pipe', None),
    'rating': (['user_pipe', 'movie_pipe'], 'rating_pipe')
}

Put it all together

The MultiModel holds all three parts: - the multinetwork specifies the model architectures, and also training parameters and callbacks - the multipipeline specifies the feature processing pipelines - the manifold specifies which pipelines is plumbed to which input (or output) of which neural network model

[56]:
from timeserio.multimodel import MultiModel

multimodel = MultiModel(
    multinetwork=multinetwork,
    multipipeline=multipipeline,
    manifold=manifold
)

Fit the MultiModel

We load one train-test split, and fitting our neural recommender system

[67]:
df_train = get_ratings('u1.base')
df_val = get_ratings('u1.test')
len(df_train), len(df_val)
[67]:
(80000, 20000)
[58]:
from kerashistoryplot.callbacks import PlotHistory
[68]:
# Note: `PlotHistory` callback is rather slow
multimodel.fit(
    reset_weights=True,
    df=df_train, model='rating', validation_data=df_val,
    batch_size=4096, epochs=50,
    callbacks=[PlotHistory(batches=True, n_cols=2, figsize=(15, 8))]
)
../_images/examples_MovieLens100k_30_0.png
[68]:
<timeserio.keras.callbacks.HistoryLogger at 0x13d5189e8>

The multimodel provides all the familiar methods such as fit, predict, or evaluate:

[70]:
mse, mae = multimodel.evaluate(df_val, model="rating")
print(f"MSE: {mse}, RMSE: {np.sqrt(mse)}, MAE: {mae}")
MSE: 0.9139011481761933, RMSE: 0.9559817718849002, MAE: 0.7554579020500183

Cross-Validate our approach

To evaluate how well recommender system performs, we perform 5-fold cross-validation and compare scores established benchmarks.

[71]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
[74]:
%%time
folds = [1, 2, 3, 4, 5]
folds_mse = []
folds_rmse = []
folds_mae = []
for fold in tqdm(folds, total=len(folds)):
    multimodel.multinetwork._init_model()
    df_train = get_ratings(f'u{fold}.base')
    df_val = get_ratings(f'u{fold}.test')
    multimodel.fit(
        df=df_train, model='rating', validation_data=df_val,
        batch_size=4096, epochs=50, verbose=0,
        reset_weights=True
    )
    y_pred = multimodel.predict(df=df_val, model='rating')
    mse = mean_squared_error(df_val['rating'], y_pred)
    mae = mean_absolute_error(df_val['rating'], y_pred)
    folds_mse.append(mse)
    folds_rmse.append(np.sqrt(mse))
    folds_mae.append(mae)
100%|██████████| 5/5 [01:00<00:00, 12.28s/it]
CPU times: user 1min 6s, sys: 4.14 s, total: 1min 10s
Wall time: 1min

[75]:
print(
    f"5-fold Cross-Validation results: \n"
    f"RMSE: {np.mean(folds_rmse):.2f} ± {np.std(folds_rmse):.2f} \n"
    f"MAE: {np.mean(folds_mae):.2f} ± {np.std(folds_mae):.2f} \n"
)
5-fold Cross-Validation results:
RMSE: 0.94 ± 0.01
MAE: 0.74 ± 0.00

Benchmarks for some modern algorithms can be seen e.g. at http://surpriselib.com/ or https://www.librec.net/release/v1.3/example.html - our approach is in fact competitive with the state of the art before any tuning! By using dense embeddings, we did not use any user features such as gender or age - all we need is to learn their preference embedding as part of our end-to-end model.

We are now free to experiment with embedding dimensions for users and movies, or tweak the dense layers.

Using multiple models

We now use a trained MultiModel to inspect the embeddings. Because we defined user and movie embedders as independent models, we can simple call .predict(..., model=...) with different model names.

User embeddings

[76]:
user_df = get_users()
[77]:
embeddings = multimodel.predict(user_df, model='user')
user_df['emb_0'] = embeddings[:, 0]
user_df['emb_1'] = embeddings[:, 1]
[78]:
sns.scatterplot(x='emb_0', y='emb_1', hue='gender', size='age', data=user_df)
[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x143dac320>
../_images/examples_MovieLens100k_42_1.png

And the movie embeddings…

[79]:
movie_df = get_movies()
[81]:
embeddings = multimodel.predict(movie_df, model='movie')
movie_df['emb_0'] = embeddings[:, 0]
movie_df['emb_1'] = embeddings[:, 1]

Out of curiosity, we can compute mean embeddings for movies tagged with each genre.

[85]:
genre_df = pd.DataFrame()
for genre in GENRES:
    mean = movie_df[movie_df[genre] == 1][['emb_0', 'emb_1']].mean()
    mean['genre'] = genre
    genre_df = genre_df.append(mean, ignore_index=True)

Movie and Genre embeddings

[88]:
fig, axes = plt.subplots(ncols=2, figsize=(20, 8))
sns.scatterplot(x='emb_0', y='emb_1', data=movie_df, ax=axes[0], palette='bright')
sns.scatterplot(x='emb_0', y='emb_1', hue='genre', data=genre_df, ax=axes[1], palette='bright')
[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1443e4f98>
../_images/examples_MovieLens100k_49_1.png

We can even consider similarity between genres by: - computing centroid for each genre - performing hierarchical clustering on genre centroids - plotting the distance matrix with a fancy colour scheme

[89]:
from sklearn.metrics import pairwise_distances
from scipy.cluster import hierarchy
[90]:
X = genre_df[['emb_0', 'emb_1']].values
Z = hierarchy.linkage(X)
order = hierarchy.leaves_list(hierarchy.optimal_leaf_ordering(Z, X))
genre_df_ordered = genre_df.iloc[order]
[91]:
embs_ord = genre_df_ordered[['emb_0', 'emb_1']].values
dist_ord = pairwise_distances(embs_ord)
genres_ord = genre_df_ordered['genre'].values
[92]:
sns.heatmap(dist_ord, xticklabels=genres_ord, yticklabels=genres_ord, cmap='plasma_r');
../_images/examples_MovieLens100k_54_0.png

We see that the two genres furthest apart are Horror and Musical, while Romance and Mystery or Crime and Adventure evoke similar rating patterns!

Freezing and partial updating

Finally, we mention another key advantage of the MultiModel approach: partial re-training.

Imagine we have a powerful production system, but new users register with our service every day.

We don’t want to re-train the full model, only the embeddings for new users. This is trivial:

multimodel.fit(
    trainable_models=['user'],
    df=df_new, model='rating', **kwargs
)

This will ensure that only the user embeddings are updated (and only for users present in df_new, while dense layer weights and movie embeddings remain frozen.