Recommendation Algorithms using PySpark

02 Oct 2017

Introduction

GitHub

Background of the Problem:

For companies operating in domains such as media streaming and e-commerce, it is important to know what content their customers might like. A recommendation system helps provide a personalized experience by making tailored recommendations to each customer.

In this project, I leverage customer data from last.fm to build recommendation systems that provide personalized recommendations using listening history and/or user information such as age, location and gender.

Apart from building a recommendation algorithm, I decided to use Apache Spark for this problem because it is an appropriate tool when dealing with a large amount of data, and it lets me showcase my PySpark skills as well. I used the ALS implementation in PySpark to build a matrix-factorization-based recommendation algorithm. In addition to that algorithm, I use k-means clustering to aggregate data based on users' age, gender and country and make crude recommendations for users without any listening history.

Who is it useful for?

Music Streaming Company

Data

The data I’m using for this project was collected from the website last.fm and consists of anonymized user data. Here is an overview of the dataset.

The data comes in the form of two tab-separated (tsv) text files:

usersha1-artmbid-artname-plays.tsv

This file contains the number of times each artist was played by each user.

| Variable Name | Datatype | Example |
| --- | --- | --- |
| userId | string | 00000c289a1829a808ac09c00daf10bc3c4e223b |
| artistId | string | 3bd73256-3905-4f3a-97e2-8b341527f805 |
| artistName | string | betty blowtorch |
| plays | integer | 2137 |

usersha1-profile.tsv

This file contains profile information for each user.

| Variable Name | Datatype | Example |
| --- | --- | --- |
| userId | string | 00000c289a1829a808ac09c00daf10bc3c4e223b |
| gender | string | f |
| age | integer | 22 |
| country | string | Germany |
| signupDate | datetime | Feb 1, 2007 |

Uses and Limitations of the data

Uses

One data file contains the number of times a user has listened to songs, aggregated to the artist level; in other words, the number of times a user has listened to a particular artist. The other contains information about the user: gender, age, country and signup date.

Apart from the recommendation system I am building, such data could be used to gamify the music listening experience and improve engagement by rewarding listeners in some way (badges, points, etc.). It could also be used to create a social-network-like platform where you can follow and connect with people who have similar musical tastes.

Limitations

This data is aggregated to the artist level, so it lacks full granularity. It also does not contain a listening timeline for each user, so we cannot tell whether someone has stopped listening to an artist, and we cannot make recommendations based on recent listening history.

User information could be augmented with a user's broader digital footprint by linking their account to accounts on other major platforms that may have useful information available.

Data Cleaning

Users’ Data

The data is relatively clean already. The age column has unreasonable values (less than 0 or more than 122), which are replaced with nulls; all other columns are kept intact. After that, we delete all rows containing null values in any of the clustering columns (age, gender, country), losing approximately 5% of the data. As we do not have sufficient information to impute the unknown values, we do not have any other option. This is only done for the clustering-based recommendations, as ALS does not require this information for making recommendations.
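Here is a minimal sketch of this step in PySpark. The file path and column order are assumptions based on the table above, not taken from my actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lastfm-cleaning").getOrCreate()

# Assumed path and column order; adjust to the actual TSV layout.
users = (spark.read.option("sep", "\t")
         .csv("usersha1-profile.tsv")
         .toDF("userId", "gender", "age", "country", "signupDate"))

# Replace unreasonable ages with nulls, then drop rows with nulls in any
# of the three clustering features (age, gender, country).
users_clean = (users
               .withColumn("age", F.col("age").cast("int"))
               .withColumn("age", F.when((F.col("age") >= 0) & (F.col("age") <= 122), F.col("age")))
               .dropna(subset=["age", "gender", "country"]))
```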

Listening Data

This data needs two main cleaning tasks (both are sketched in the snippet after the list) -

  1. Missing artist IDs. Not all artists have an artistId (approximately 2%); for simplicity, the corresponding rows are dropped from the project.
  2. Duplicate artist names. There is more than one way in which a single artist appears in the artist-name column, so some normalization and text processing is needed to deal with this issue.
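A rough sketch of both steps, assuming the same Spark session and file naming as above (real artist-name normalization would need more elaborate text processing than this):

```python
from pyspark.sql import functions as F

plays = (spark.read.option("sep", "\t")
         .csv("usersha1-artmbid-artname-plays.tsv")
         .toDF("userId", "artistId", "artistName", "plays")
         .withColumn("plays", F.col("plays").cast("int")))

# 1. Drop the ~2% of rows that have no artistId.
plays = plays.filter(F.col("artistId").isNotNull() & (F.col("artistId") != ""))

# 2. Crude normalization: lower-case and trim artist names so trivially
#    different spellings of the same artist collapse together.
plays = plays.withColumn("artistName", F.trim(F.lower(F.col("artistName"))))
```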

Preliminary Data Exploration

Here are some interesting visualizations from the data -

Activity of 20 most active listeners

This plot displays the top 20 listeners with their respective play counts.

10-most-active-listeners

20 Most Popular Artists

The top 20 artists whose songs were played the greatest number of times (plays).

10-most-popular-artists

Top 20 countries

The top 20 countries where users are most active in terms of the cumulative number of plays (sum).

top-countries

Gender Proportions

Number of males and females registered as users on the website

gender-ratio

Age Distribution

Age distribution of the users registered on the website

age-distribution

The following visualizations show sign-ups at different aggregation levels: by year, by month of the year and by day of the week.

Sign-ups over the months and years

This plot shows the number of sign-ups per month and per year over the period for which the data was recorded. The y-axis on the left corresponds to the yearly aggregation (sum), while the one on the right corresponds to the monthly aggregation (sum).

signups-over-months-and-years

Sign-ups per Month-of-the-year

signups-per-month-of-the-year

Sign-ups per Day-of-the-week

signups-per-day-of-the-week

Apache Spark and Collaborative Filtering (ALS)

The ALS model in PySpark uses three columns from the data: userId, artistId and plays. In simple terms, we construct a matrix with plays as values, userIds as rows and artistIds as columns, and use the values that are already present to fill in the missing ones. The model assumes that there are ‘k’ latent attributes or features and that each user and each artist can be represented in terms of them. We then try to reconstruct this matrix while minimizing a loss function.

We use matrix factorization (in the spirit of SVD) for this purpose. In this method, we approximate the original userId-artistId matrix R by the product of two rank-k matrices, R' = U^T x M. Because many elements of the ratings matrix R are missing, standard SVD algorithms cannot find U and M. Alternating Least Squares (ALS) can be used to solve such low-rank matrix factorization problems as follows (a toy numeric illustration appears after the list):

  1. Initialize matrix M by assigning the average plays of each artist to its first row, and small random numbers to the remaining entries
  2. Fix M and solve for U by minimizing the objective function (the sum of squared errors)
  3. Fix U and solve for M by minimizing the objective function in the same way
  4. Repeat steps 2 and 3 until a stopping criterion is satisfied
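To make the alternating updates concrete, here is a toy NumPy illustration. It works on a small dense matrix without masking the missing entries, so it only shows the shape of steps 1-3, not the sparse computation that Spark actually performs:

```python
import numpy as np

np.random.seed(0)
n_users, n_artists, k, lam = 5, 4, 2, 0.1
R = np.random.rand(n_users, n_artists)   # toy "plays" matrix (fully observed here)

U = np.random.rand(k, n_users)           # user factors
M = np.random.rand(k, n_artists)         # artist factors
M[0, :] = R.mean(axis=0)                 # step 1: first row = average plays per artist

for _ in range(10):
    # Step 2: fix M, solve the regularized least-squares problem for U.
    U = np.linalg.solve(M @ M.T + lam * np.eye(k), M @ R.T)
    # Step 3: fix U, solve for M in the same way.
    M = np.linalg.solve(U @ U.T + lam * np.eye(k), U @ R)

print(np.linalg.norm(R - U.T @ M))       # reconstruction error after the alternating updates
```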

Defining a ‘normalized’ plays variable

The plays variable is the number of times an artist has been played by a particular user. By definition, this means an older account is likely to have a higher play count in general. To deal with this, I divided the variable by the number of days each corresponding user has been active, turning it into a more ‘normalized’ plays-per-day value. Here is a histogram of the result.

plays-per-day-hist
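The plays-per-day computation might look like this, reusing the DataFrames from the earlier snippets. The snapshot date is a made-up placeholder, the signup-date format is an assumption from the profile table, and the two-argument to_date needs Spark 2.2+:

```python
from pyspark.sql import functions as F

SNAPSHOT_DATE = "2009-05-01"   # assumed snapshot date, not taken from the dataset

profile = users.withColumn(
    "daysActive",
    F.datediff(F.lit(SNAPSHOT_DATE), F.to_date("signupDate", "MMM d, yyyy")))

plays_per_day = (plays.join(profile.select("userId", "daysActive"), "userId")
                 .filter(F.col("daysActive") > 0)
                 .withColumn("playsPerDay", F.col("plays") / F.col("daysActive")))
```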

Here is step-by-step information about my implementation of ALS; a code sketch follows the list -

  1. Used the following data - userId, artistId and plays (plays-per-day)
  2. Substituted userId and artistId by integer values using zipWithIndex() function
  3. Fed the data to the ALS function in Apache Spark which performs matrix factorization and uses ALS to minimize the loss function
  4. Tuned the following parameters - rank, regularization parameter, alpha, iterations
  5. Analyzed performance using 5-fold cross-validation and the root-mean-squared error (RMSE). (It is important to remember that the recommendations cannot be evaluated on the full test set because of how matrix factorization works; we can only evaluate users and artists that appear in both the training and test sets, so only that overlapping data is available for measuring RMSE)
  6. Used the best parameters to train on the full dataset
  7. Used the trained model to predict top 20 artists for every user
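Here is a rough sketch of steps 1-3 and 6-7 using the RDD-based pyspark.mllib ALS. The hyperparameter values are placeholders rather than the tuned ones, I show trainImplicit since an alpha parameter is involved, and plays_per_day comes from the earlier snippet:

```python
from pyspark.mllib.recommendation import ALS, Rating

rows = plays_per_day.select("userId", "artistId", "playsPerDay").rdd

# zipWithIndex assigns each distinct id a stable integer index, as ALS requires.
user_idx = rows.map(lambda r: r[0]).distinct().zipWithIndex()     # (userId, userInt)
artist_idx = rows.map(lambda r: r[1]).distinct().zipWithIndex()   # (artistId, artistInt)

ratings = (rows.map(lambda r: (r[0], (r[1], r[2])))
           .join(user_idx)
           .map(lambda x: (x[1][0][0], (x[1][1], x[1][0][1])))
           .join(artist_idx)
           .map(lambda x: Rating(x[1][0][0], x[1][1], float(x[1][0][1]))))

# Placeholder hyperparameters; in practice these come from the cross-validation step.
model = ALS.trainImplicit(ratings, rank=10, iterations=10, lambda_=0.01, alpha=40.0)

top20 = model.recommendProductsForUsers(20)   # (userInt, [Rating, ...]) for every user
```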

K-means Clustering

User-user collaborative filtering is not useful for a user who has no listening history. It is therefore important to be able to make some prediction about such users' musical preferences, which can be done using information about the user such as age, gender and country. One option would be to split the rating information along these criteria separately and use the mean rating to find the top artists for users who fit the same criteria. Instead, I used k-means clustering to group users into an optimal number of clusters and then aggregated the rating information within each cluster (mean) to generate a list of the top 20 artists for every user.
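A sketch of this pipeline with the Spark 2.x ML API, using the cleaned users' data and the plays-per-day DataFrame from earlier. The exact feature encoding and the top-20 aggregation shown here are my assumptions about reasonable defaults, not the exact code:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F, Window

# Encode gender and country, then assemble age + encoded columns into a feature vector.
feature_pipeline = Pipeline(stages=[
    StringIndexer(inputCol="gender", outputCol="genderIdx"),
    StringIndexer(inputCol="country", outputCol="countryIdx"),
    OneHotEncoder(inputCol="genderIdx", outputCol="genderVec"),
    OneHotEncoder(inputCol="countryIdx", outputCol="countryVec"),
    VectorAssembler(inputCols=["age", "genderVec", "countryVec"], outputCol="features"),
])
featurized = feature_pipeline.fit(users_clean).transform(users_clean)

# Cluster users; the cluster id is written to the "prediction" column.
clusters = KMeans(k=10, seed=42, featuresCol="features").fit(featurized).transform(featurized)

# Mean plays per artist within each cluster, keeping the 20 highest per cluster.
w = Window.partitionBy("prediction").orderBy(F.desc("meanPlays"))
cluster_top20 = (clusters.select("userId", "prediction")
                 .join(plays_per_day, "userId")
                 .groupBy("prediction", "artistName")
                 .agg(F.avg("playsPerDay").alias("meanPlays"))
                 .withColumn("rank", F.row_number().over(w))
                 .filter("rank <= 20"))
```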

Number of Clusters (k)

When using Apache Spark, the silhouette score is not a good metric, as it requires computing the distance between every pair of data points in the dataset, which is impractical with millions of points, as is usually the case with Spark. To draw the elbow curve, I instead used the Within Set Sum of Squared Errors (WSSSE): the sum of squared distances of every point from its cluster center. Using this, I created an elbow curve and chose the optimal number of clusters to be 10 by qualitative observation. Here is what the chart looks like -

elbow_curve
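The computation behind the curve might look like this, reusing the featurized DataFrame from the previous sketch (computeCost is the Spark 2.x API; later versions moved to ClusteringEvaluator):

```python
from pyspark.ml.clustering import KMeans

# WSSSE = sum of squared distances of every point to its nearest cluster center.
wssse = {}
for k in range(2, 21):
    model = KMeans(k=k, seed=42, featuresCol="features").fit(featurized)
    wssse[k] = model.computeCost(featurized)
```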

Thus, ALS can be used to recommend new artists to an existing user, whereas the clustering algorithm can give recommendations to new users with zero or little listening history.

Also, as future work for this project, we could create composite models using different linear combinations of these two models for different use cases.

Model Evaluation

Continuously evaluating a model's accuracy is arguably the most important consideration once it has been built. Because real-life recommender systems are dynamic in nature, unlike this static dataset, different techniques can be used to assess and improve their performance in production.

The commonly used method of A/B testing can be applied to evaluate a recommender system on a website, using measures such as click-through rate or conversion rate of the recommendations. The definition of conversion rate can be adapted to the specific context; in our case, we could define it based on how long a user listens to a recommended track, with one or more thresholds on that time. This can further be translated into economic outcomes such as generated income or return on investment.

In an “experimental” scenario such as the one in this project, we can use tools such as RMSE (root-mean-squared error) and recall score to benchmark the performance of the algorithm. RMSE was used as the sole evaluation criterion in the famous Netflix Prize competition, and I used the same criterion to measure the performance of my model; the details are in the step-by-step description of my ALS implementation above.
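As an illustration, RMSE on the overlapping portion of a held-out fold can be computed like this; train_ratings and test_ratings are hypothetical names for a split of the Rating RDD built earlier:

```python
from math import sqrt
from pyspark.mllib.recommendation import ALS

model = ALS.trainImplicit(train_ratings, rank=10, iterations=10, lambda_=0.01, alpha=40.0)

# Score only the (user, artist) pairs in the held-out fold that the model can predict.
pairs = test_ratings.map(lambda r: (r.user, r.product))
preds = model.predictAll(pairs).map(lambda r: ((r.user, r.product), r.rating))
actuals = test_ratings.map(lambda r: ((r.user, r.product), r.rating))

# The join keeps only the overlapping pairs, as described in the ALS steps above.
rmse = sqrt(actuals.join(preds).map(lambda x: (x[1][0] - x[1][1]) ** 2).mean())
```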

That being said, this method has a shortcoming: it does not work well for items with few ratings, since they contribute little to the loss. As a result, predictions for those items can be off, with some scoring much higher and some much lower than the actual values; the former will show up among the top recommended items, spoiling the results. Some regularization may help mitigate the problem, but it is unlikely to eliminate it.

Another approach would be to evaluate using ranking. For example, we can select the top 10 recommendations for each user and use a recall score (or a similar reasonable metric) to measure how many of them are accurately predicted by the recommendation algorithm. This remains future work for the model I have built.
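A toy sketch of what such a per-user recall-at-10 metric could look like (plain Python, with made-up artist ids):

```python
def recall_at_10(recommended, held_out):
    """Fraction of a user's held-out artists that appear in the top 10 recommendations."""
    if not held_out:
        return 0.0
    return len(set(recommended[:10]) & set(held_out)) / len(held_out)

# 2 of the 4 held-out artists appear in the top 10, so recall is 0.5.
print(recall_at_10(list("abcdefghij"), ["a", "c", "x", "y"]))
```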