Personal recommendations with one matrix multiplication



Personal Recommendations with one single Matrix Multiplication

This blog post shows a simple example of creating a simple but effective personal recommendation engine using the method described in the paper Embarrassingly Shallow Autoencoders for Sparse Data from Harald Steck.

The challenge of meaningful automated recommendations

Finding good recommendations for your customers in an automated way is not an easy task. No matter whether you try to sell movies, songs or fashion, the idea is always the same: finding meaningful products which match your customers' demands. While a "human recommender" may simply ask the customer for their preferences and, based on their own experience, will pick a selection of suitable products, a computer lacks all that intuition.

Therefore, different strategies have been developed in order to help computers gain a deeper understanding of the customer-product relationship.

The method described here today uses an autoencoder approach: given a vector of products a customer bought, it will try to infer relevant products by filling in the empty elements in the product vector.

Autoencoders

The task of an autoencoder is straightforward: given some input vector $x$, the output $y$ of the autoencoder should be identical to $x$. This seemingly trivial task is made more complicated, however, by adding restrictions on the internal processing steps of the autoencoder.

A typical autoencoder might be built like a neural network, consisting of an input layer, a hidden layer, and an output layer. Here the restriction would be that the size of the hidden layer is smaller than that of the input layer, and therefore the autoencoder is forced to learn a more efficient representation of the data it is trained on.

Shallow Autoencoders

In our case, the autoencoder is actually much more simple, as it only consists of a single matrix $B \in \mathbb{R} ^{\left|I\right| x \left|I\right|}$, i.e. this matrix has as many rows/columns as there are unique items our customers might buy.

Given a customer-item matrix $X$, matrix multiplication with $B$ should hopefully return a very similar matrix.

In order to find meaningful parameters for $B$, we define a quadratic loss:

$$ \min_B ||X - XB ||^2 + \lambda ||B||^2 $$

Additionally, a regularization term is added to the loss in order keep the learned parameters small. What we are still lacking is a restriction which forces our autoencoder to learn something meaningful (in the current form the trivial solution would be simply an identity matrix which would not give us any new insight). The authors have the simple but ingenious idea to add a constraint that all diagonal elements of $B$ must be zero. What this means is that our autoencoder now has to learn whether item $i$ is present in the current basket while only given information about the presence or absence of all items except $i$.

The detailed solution procedure involves minimizing the loss function using Lagrangian multipliers to also consider the additional constraints. It can be found in the paper, so I will not discuss this further. Instead, let's write some Python code to see how well it actually works on some data!

Making some Movie Recommendations

Let's start by downloading some data of movie ratings.

In order to be able to use the code you need to install the following python libraries

pip install numpy pandas scipy tqdm
In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix
from scipy import linalg
from tqdm import tqdm_notebook as tqdm

If the next three lines don't work for you, you can also just download the file using your browser and unzip it manually.

In [2]:
movielens_url = "http://files.grouplens.org/datasets/movielens/ml-20m.zip"
In [ ]:
!wget {movielens_url}
In [ ]:
!unzip ml-20m.zip

Next, we need to read the movie list and the list of ratings into pandas dataframes:

In [3]:
movies = pd.read_csv("ml-20m/movies.csv")
ratings = pd.read_csv("ml-20m/ratings.csv")

Next, we write a helper function which creates a sparse customer item matrix from the ratings dataframe

In [4]:
# create a mapping between the movie/user id and 
# its index in the user item matrix
def create_mapping(unique_vals):
    mapping = {}
    for i, id_ in enumerate(unique_vals):
        mapping[id_] = i
    return mapping


def create_user_item_matrix(ratings, user_mapping, item_mapping):
    ui_mat = lil_matrix((max(user_mapping.values()) + 1, max(item_mapping.values()) + 1))
    for tup in tqdm(ratings.itertuples(), total=ratings.shape[0]):
        ui_mat[user_mapping[tup.userId], item_mapping[tup.movieId]] = 1
    return ui_mat.tocsr()

Creating the matrix should take a few minutes

In [5]:
user_mapping = create_mapping(ratings.userId.unique())
item_mapping = create_mapping(movies.movieId.unique())

ui_mat = create_user_item_matrix(ratings, user_mapping, item_mapping)

The next few lines are the actual optimization procedure. If we ignore the print lines, the whole algorithm needs only 6 lines of code!

In [6]:
def create_autoencoder(ui_mat, lambda_):
    print("Create G")
    G = (ui_mat.T @ ui_mat).toarray()
    diag_indices = np.diag_indices_from(G)
    G[diag_indices] += lambda_
    print("Invert G")
    # In order to save some memory, we overwrite our G matrix when 
    # calculating the inverse
    B = linalg.inv(G, overwrite_a=True)
    B /= (-np.diag(B))
    B[diag_indices] = 0
    return B

The whole procedure will take a few minutes. On my laptop with 8gb ram memory was sufficient. If you run into troubles, try and change the type of ui_mat from np.float32 to np.float16 and see if that helps

In [7]:
autoenc = create_autoencoder((ui_mat > 0).astype(np.float32), 500)
Create G
Invert G

In order to be able to user our small recommendation engine, we write another helper function which, given a movie id or list of movie ids returns movie ids of similar movies

In [38]:
def recommend(movie_ids, item_mapping, N=10, filter_already_liked=False):
    reverse_mapping = {v: k for k, v in item_mapping.items()}
    if isinstance(movie_ids, int):
        movie_ids = [movie_ids]
    movie_ids = [item_mapping[id_] for id_ in movie_ids]
    movievec = np.zeros(autoenc.shape[0], dtype=np.float32)
    scores = np.zeros_like(movievec)
    movievec[movie_ids] = 1
    scores = movievec @ autoenc
    #np.dot(movievec, autoenc, out=scores)
    if filter_already_liked:
        scores[movie_ids] = -100
    indices = np.argsort(-scores)[:N]
    return [reverse_mapping[idx] for idx in indices]

Let's see what kind of movies there are

In [39]:
movies.sample(20, random_state=3)
Out[39]:
movieId title genres
14249 71442 Plaisir, Le (1952) Comedy|Drama
8099 8782 Thunderbirds (2004) Action|Adventure|Fantasy|Sci-Fi
11389 48696 Little Children (2006) Drama|Romance
24420 115711 TINY: A Story About Living Small (2013) Documentary
15250 77837 You Don't Know Jack (2010) Drama
24588 116341 Dana Carvey: Squatting Monkeys Tell No Lies (2... Comedy
6267 6373 Bruce Almighty (2003) Comedy|Drama|Fantasy|Romance
22788 109019 Bewitched (Alter Ego) (1945) Crime|Drama|Film-Noir|Thriller
26781 128723 James Dean (2001) Drama
24274 115060 Haunted Castle, The (Hiroku kaibyô-den) (1969) Horror
21313 103626 Senotaji (2013) Comedy
7093 7205 Wind and the Lion, The (1975) Adventure
3160 3247 Sister Act (1992) Comedy|Crime
5465 5562 Snipes (2001) Drama|Thriller
18378 91749 Sophie's Revenge (Fei chang wan mei) (2009) Comedy|Romance
22160 106648 Guilty of Romance (Koi no tsumi) (2011) Crime|Drama|Horror
11081 46062 High School Musical (2006) Children|Comedy|Drama|Musical|Romance
1847 1931 Mutiny on the Bounty (1935) Adventure|Drama
6521 6631 Man's Best Friend (1993) Horror|Sci-Fi|Thriller
10231 34099 Julian Po (1997) Comedy|Drama

Let's see which movies are similar to High School Musical with id 46062

In [41]:
ids = recommend(46062, item_mapping)
movies.query("movieId in @ids")
Out[41]:
movieId title genres
1596 1653 Gattaca (1997) Drama|Sci-Fi|Thriller
8282 8965 Polar Express, The (2004) Adventure|Animation|Children|Fantasy|IMAX
11216 47382 Step Up (2006) Drama|Romance
11858 52975 Hairspray (2007) Comedy|Drama|Musical
11971 53993 Evan Almighty (2007) Comedy|Fantasy
12810 60397 Mamma Mia! (2008) Comedy|Musical|Romance
12923 61123 High School Musical 2 (2007) Comedy|Drama|Musical|Romance
13086 62912 High School Musical 3: Senior Year (2008) Musical
13174 63992 Twilight (2008) Drama|Fantasy|Romance|Thriller
14017 70334 Hannah Montana: The Movie (2009) Comedy|Drama|Musical|Romance

This looks like some reasonable recommendations.

In [ ]: