Personal Recommendations with one single Matrix Multiplication¶

This blog post shows a simple example of creating a simple but effective personal recommendation engine using the method described in the paper Embarrassingly Shallow Autoencoders for Sparse Data from Harald Steck.

The challenge of meaningful automated recommendations¶

Finding good recommendations for your customers in an automated way is not an easy task. No matter whether you try to sell movies, songs or fashion, the idea is always the same: finding meaningful products which match your customers' demands. While a "human recommender" may simply ask the customer for their preferences and, based on their own experience, will pick a selection of suitable products, a computer lacks all that intuition.

Therefore, different strategies have been developed in order to help computers gain a deeper understanding of the customer-product relationship.

The method described here today uses an autoencoder approach: given a vector of products a customer bought, it will try to infer relevant products by filling in the empty elements in the product vector.

Autoencoders¶

The task of an autoencoder is straightforward: given some input vector $x$ , the output $y$ of the autoencoder should be identical to $x$ . This seemingly trivial task is made more complicated, however, by adding restrictions on the internal processing steps of the autoencoder.

A typical autoencoder might be built like a neural network, consisting of an input layer, a hidden layer, and an output layer. Here the restriction would be that the size of the hidden layer is smaller than that of the input layer, and therefore the autoencoder is forced to learn a more efficient representation of the data it is trained on.

Shallow Autoencoders¶

In our case, the autoencoder is actually much more simple, as it only consists of a single matrix $B \in \mathbb{R} ^{\left|I\right| x \left|I\right|}$ , i.e. this matrix has as many rows/columns as there are unique items our customers might buy.

Given a customer-item matrix $X$ , matrix multiplication with $B$ should hopefully return a very similar matrix.

In order to find meaningful parameters for $B$ , we define a quadratic loss:

$\min_B ||X - XB ||^2 + \lambda ||B||^2$

Additionally, a regularization term is added to the loss in order keep the learned parameters small. What we are still lacking is a restriction which forces our autoencoder to learn something meaningful (in the current form the trivial solution would be simply an identity matrix which would not give us any new insight). The authors have the simple but ingenious idea to add a constraint that all diagonal elements of $B$ must be zero. What this means is that our autoencoder now has to learn whether item $i$ is present in the current basket while only given information about the presence or absence of all items except $i$ .

The detailed solution procedure involves minimizing the loss function using Lagrangian multipliers to also consider the additional constraints. It can be found in the paper, so I will not discuss this further. Instead, let's write some Python code to see how well it actually works on some data!

Making some Movie Recommendations¶

Let's start by downloading some data of movie ratings.

In order to be able to use the code you need to install the following python libraries

pip install numpy pandas scipy tqdm

In [1]:

import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix
from scipy import linalg
from tqdm import tqdm_notebook as tqdm

If the next three lines don't work for you, you can also just download the file using your browser and unzip it manually.

In [2]:

movielens_url = "http://files.grouplens.org/datasets/movielens/ml-20m.zip"

In [ ]:

!wget {movielens_url}

In [ ]:

!unzip ml-20m.zip

Next, we need to read the movie list and the list of ratings into pandas dataframes:

In [3]:

movies = pd.read_csv("ml-20m/movies.csv")
ratings = pd.read_csv("ml-20m/ratings.csv")

Next, we write a helper function which creates a sparse customer item matrix from the ratings dataframe

In [4]:

# create a mapping between the movie/user id and 
# its index in the user item matrix
def create_mapping(unique_vals):
    mapping = {}
    for i, id_ in enumerate(unique_vals):
        mapping[id_] = i
    return mapping


def create_user_item_matrix(ratings, user_mapping, item_mapping):
    ui_mat = lil_matrix((max(user_mapping.values()) + 1, max(item_mapping.values()) + 1))
    for tup in tqdm(ratings.itertuples(), total=ratings.shape[0]):
        ui_mat[user_mapping[tup.userId], item_mapping[tup.movieId]] = 1
    return ui_mat.tocsr()

Creating the matrix should take a few minutes

In [5]:

user_mapping = create_mapping(ratings.userId.unique())
item_mapping = create_mapping(movies.movieId.unique())

ui_mat = create_user_item_matrix(ratings, user_mapping, item_mapping)

The next few lines are the actual optimization procedure. If we ignore the print lines, the whole algorithm needs only 6 lines of code!

In [6]:

def create_autoencoder(ui_mat, lambda_):
    print("Create G")
    G = (ui_mat.T @ ui_mat).toarray()
    diag_indices = np.diag_indices_from(G)
    G[diag_indices] += lambda_
    print("Invert G")
    # In order to save some memory, we overwrite our G matrix when 
    # calculating the inverse
    B = linalg.inv(G, overwrite_a=True)
    B /= (-np.diag(B))
    B[diag_indices] = 0
    return B

The whole procedure will take a few minutes. On my laptop with 8gb ram memory was sufficient. If you run into troubles, try and change the type of ui_mat from np.float32 to np.float16 and see if that helps

In [7]:

autoenc = create_autoencoder((ui_mat > 0).astype(np.float32), 500)

Create G
Invert G

In order to be able to user our small recommendation engine, we write another helper function which, given a movie id or list of movie ids returns movie ids of similar movies

In [38]:

def recommend(movie_ids, item_mapping, N=10, filter_already_liked=False):
    reverse_mapping = {v: k for k, v in item_mapping.items()}
    if isinstance(movie_ids, int):
        movie_ids = [movie_ids]
    movie_ids = [item_mapping[id_] for id_ in movie_ids]
    movievec = np.zeros(autoenc.shape[0], dtype=np.float32)
    scores = np.zeros_like(movievec)
    movievec[movie_ids] = 1
    scores = movievec @ autoenc
    #np.dot(movievec, autoenc, out=scores)
    if filter_already_liked:
        scores[movie_ids] = -100
    indices = np.argsort(-scores)[:N]
    return [reverse_mapping[idx] for idx in indices]

Let's see what kind of movies there are

In [39]:

movies.sample(20, random_state=3)

Out[39]:

	movieId	title	genres
14249	71442	Plaisir, Le (1952)	Comedy\|Drama
8099	8782	Thunderbirds (2004)	Action\|Adventure\|Fantasy\|Sci-Fi
11389	48696	Little Children (2006)	Drama\|Romance
24420	115711	TINY: A Story About Living Small (2013)	Documentary
15250	77837	You Don't Know Jack (2010)	Drama
24588	116341	Dana Carvey: Squatting Monkeys Tell No Lies (2...	Comedy
6267	6373	Bruce Almighty (2003)	Comedy\|Drama\|Fantasy\|Romance
22788	109019	Bewitched (Alter Ego) (1945)	Crime\|Drama\|Film-Noir\|Thriller
26781	128723	James Dean (2001)	Drama
24274	115060	Haunted Castle, The (Hiroku kaibyô-den) (1969)	Horror
21313	103626	Senotaji (2013)	Comedy
7093	7205	Wind and the Lion, The (1975)	Adventure
3160	3247	Sister Act (1992)	Comedy\|Crime
5465	5562	Snipes (2001)	Drama\|Thriller
18378	91749	Sophie's Revenge (Fei chang wan mei) (2009)	Comedy\|Romance
22160	106648	Guilty of Romance (Koi no tsumi) (2011)	Crime\|Drama\|Horror
11081	46062	High School Musical (2006)	Children\|Comedy\|Drama\|Musical\|Romance
1847	1931	Mutiny on the Bounty (1935)	Adventure\|Drama
6521	6631	Man's Best Friend (1993)	Horror\|Sci-Fi\|Thriller
10231	34099	Julian Po (1997)	Comedy\|Drama

Let's see which movies are similar to High School Musical with id 46062

In [41]:

ids = recommend(46062, item_mapping)
movies.query("movieId in @ids")

Out[41]:

	movieId	title	genres
1596	1653	Gattaca (1997)	Drama\|Sci-Fi\|Thriller
8282	8965	Polar Express, The (2004)	Adventure\|Animation\|Children\|Fantasy\|IMAX
11216	47382	Step Up (2006)	Drama\|Romance
11858	52975	Hairspray (2007)	Comedy\|Drama\|Musical
11971	53993	Evan Almighty (2007)	Comedy\|Fantasy
12810	60397	Mamma Mia! (2008)	Comedy\|Musical\|Romance
12923	61123	High School Musical 2 (2007)	Comedy\|Drama\|Musical\|Romance
13086	62912	High School Musical 3: Senior Year (2008)	Musical
13174	63992	Twilight (2008)	Drama\|Fantasy\|Romance\|Thriller
14017	70334	Hannah Montana: The Movie (2009)	Comedy\|Drama\|Musical\|Romance

This looks like some reasonable recommendations.

In [ ]:

Personal Recommendations with one single Matrix Multiplication