Hivemall supports a neighborhood-learning scheme using SLIM. SLIM is a representative of neighborhood-learning recommendation algorithm introduced in the following paper:
- Xia Ning and George Karypis, SLIM: Sparse Linear Methods for Top-N Recommender Systems, Proc. ICDM, 2011.
Caution: SLIM is supported from Hivemall v0.5-rc.1 or later.
SLIM optimization objective
The optimization objective of SLIM is similar to Elastic Net (L1+L2 regularization) with additional constraints as follows:
Data preparation
Rating binarization
In this article, each user-movie matrix element is binarized to reduce training samples and consider only high rated movies whose rating is 4 or 5. So, every matrix element having a lower rating than 4 is not used for training.
SET hivevar:seed=31;
DROP TABLE ratings2;
CREATE TABLE ratings2 as
select
rand(${seed}) as rnd,
userid,
movieid as itemid,
cast(1.0 as float) as rating -- double is also accepted
from
ratings
where rating >= 4.
;
rnd
field is appended for each record to split ratings2
into training and testing data later.
Binarization is an optional step, and you can use raw rating values to train a SLIM model.
Splitting dataset
To evaluate a recommendation model, this tutorial uses two type cross validations:
- Leave-one-out cross validation
- -hold cross validation
The former is used in the SLIM's paper and the latter is used in Mendeley's slide.
Leave-one-out cross validation
For leave-one-out cross validation, the dataset is split into a training set and a testing set by randomly selecting one of the non-zero entries of each user and placing it into the testing set.
In the following query, the movie has the smallest rnd
value is used as test data (testing
table) per a user.
And, the others are used as training data (training
table).
When we select slim's best hyperparameters, different test data is used in evaluation section several times.
DROP TABLE testing;
CREATE TABLE testing
as
WITH top_k as (
select
each_top_k(1, userid, rnd, userid, itemid, rating)
as (rank, rnd, userid, itemid, rating)
from (
select * from ratings2
CLUSTER BY userid
) t
)
select
userid, itemid, rating
from
top_k
;
DROP TABLE training;
CREATE TABLE training as
select
l.*
from
ratings2 l
LEFT OUTER JOIN testing r ON (l.userid=r.userid and l.itemid=r.itemid)
where
r.itemid IS NULL -- anti join
;
-hold corss validation
When , the dataset is divided into training data and testing dataset. The numbers of training and testing samples roughly equal.
When we select slim's best hyperparameters, you'll first train a SLIM prediction model from training data and evaluate the prediction model by testing data.
Optionally, you can switch training data with testing data and evaluate again.
DROP TABLE testing;
CREATE TABLE testing
as
select * from ratings2
where rnd >= 0.5
;
DROP TABLE training;
CREATE TABLE training
as
select * from ratings2
where rnd < 0.5
;
Note
In the following section excluding evaluation section, we will show the example of queries and its results based on -hold cross validation case. But, this article's queries are valid for leave-one-out cross validation.
Pre-compute item-item similarity
SLIM needs top- most similar movies for each movie to the approximate user-item matrix. Here, we particularly focus on DIMSUM, an efficient and approximated similarity computation scheme.
Because we set k=20
, the output has 20 most-similar movies per itemid
.
We can adjust trade-off between training and prediction time and precision of matrix approximation by varying k
.
Larger k
is the better approximation for raw user-item matrix, but training time and memory usage tend to increase.
As we explained in the general introduction of item-based CF, following query finds top- nearest-neighborhood movies for each movie:
set hivevar:k=20;
DROP TABLE knn_train;
CREATE TABLE knn_train
as
with item_magnitude as (
select
to_map(j, mag) as mags
from (
select
itemid as j,
l2_norm(rating) as mag
from
training
group by
itemid
) t0
),
item_features as (
select
userid as i,
collect_list(
feature(itemid, rating)
) as feature_vector
from
training
group by
userid
),
partial_result as (
select
dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -int_feature')
as (itemid, other, s)
from
item_features f
CROSS JOIN item_magnitude m
),
similarity as (
select
itemid,
other,
sum(s) as similarity
from
partial_result
group by
itemid, other
),
topk as (
select
each_top_k(
${k}, itemid, similarity, -- use top k items
itemid, other
) as (rank, similarity, itemid, other)
from (
select * from similarity
CLUSTER BY itemid
) t
)
select
itemid, other, similarity
from
topk
;
itemid | other | similarity |
---|---|---|
1 | 3114 | 0.28432244 |
1 | 1265 | 0.25180137 |
1 | 2355 | 0.24781825 |
1 | 2396 | 0.24435896 |
1 | 588 | 0.24359442 |
... | ... | ... |
Caution
To run the query above, you may need to run the following statements:
set hive.strict.checks.cartesian.product=false;
set hive.mapred.mode=nonstrict;
Create training input tables
Here, we prepare input tables for SLIM training.
SLIM input consists of the following columns in slim_training_item
:
i
: axis item idRi
: the user-rating vector of the axis item expressed asmap<userid, rating>
.knn_i
: top- similar item matrix of item ; the user-item rating matrix is expressed asmap<userid, map<itemid, rating>>
.j
: an item id inknn_i
.Rj
: the user-rating vector of the item expressed asmap<userid, rating>
.
DROP TABLE item_matrix;
CREATE table item_matrix as
select
itemid as i,
to_map(userid, rating) as R_i
from
training
group by
itemid;
-- Temporary set off map join because the following query does not work well for map join
set hive.auto.convert.join=false;
-- set mapred.reduce.tasks=64;
-- Create SLIM input features
DROP TABLE slim_training_item;
CREATE TABLE slim_training_item as
WITH knn_item_user_matrix as (
select
l.itemid,
r.userid,
to_map(l.other, r.rating) ratings
from
knn_train l
JOIN training r ON (l.other = r.itemid)
group by
l.itemid, r.userid
),
knn_item_matrix as (
select
itemid as i,
to_map(userid, ratings) as KNN_i -- map<userid, map<itemid, rating>>
from
knn_item_user_matrix
group by
itemid
)
select
l.itemid as i,
r1.R_i,
r2.knn_i,
l.other as j,
r3.R_i as R_j
from
knn_train l
JOIN item_matrix r1 ON (l.itemid = r1.i)
JOIN knn_item_matrix r2 ON (l.itemid = r2.i)
JOIN item_matrix r3 ON (l.other = r3.i)
;
-- set to the default value
set hive.auto.convert.join=true;
Training
Build a prediction model by SLIM
train_slim
function outputs the nonzero elements of an item-item matrix.
For item recommendation or prediction, this matrix is stored into the table named slim_model
.
DROP TABLE slim_model;
CREATE TABLE slim_model as
select
i, nn, avg(w) as w
from (
select
train_slim(i, r_i, knn_i, j, r_j) as (i, nn, w)
from (
select * from slim_training_item
CLUSTER BY i
) t1
) t2
group by i, nn
;
Usage of train_slim
You can obtain information about train_slim
function and its arguments by giving -help
option as follows:
select train_slim("-help");
usage: train_slim( int i, map<int, double> r_i, map<int, map<int, double>> topKRatesOfI,
int j, map<int, double> r_j [, constant string options])
- Returns row index, column index and non-zero weight value of prediction model
[-cv_rate <arg>] [-disable_cv] [-help] [-iters <arg>] [-l1 <arg>] [-l2 <arg>]
-cv_rate,--convergence_rate <arg> Threshold to determine convergence
[default: 0.005]
-disable_cv,--disable_cvtest Whether to disable convergence check
[default: enabled]
-help Show function help
-iters,--iterations <arg> The number of iterations for
coordinate descent [default: 30]
-l1,--l1coefficient <arg> Coefficient for l1 regularizer
[default: 0.001]
-l2,--l2coefficient <arg> Coefficient for l2 regularizer
[default: 0.0005]
Prediction and recommendation
Here, we predict ratng values of binarized user-item rating matrix of testing dataset based on ratings in training dataset.
Based on predicted rating scores, we can recommend top-k items for each user that he or she will be likely to put high scores.
Predict unknown ratings of a user-item matrix
Based on known ratings and SLIM weight matrix, we predict unknown ratings in the user-item matrix. SLIM predicts ratings of user-item pairs based on top- similar items.
The predict_pair
table represents candidates for recommended user-movie pairs, excluding known ratings in the training dataset.
CREATE OR REPLACE VIEW predict_pair
as
WITH testing_users as (
select DISTINCT(userid) as userid from testing
),
training_items as (
select DISTINCT(itemid) as itemid from training
),
user_items as (
select
l.userid,
r.itemid
from
testing_users l
CROSS JOIN training_items r
)
select
l.userid,
l.itemid
from
user_items l
LEFT OUTER JOIN training r ON (l.userid=r.userid and l.itemid=r.itemid)
where
r.itemid IS NULL -- anti join
;
-- optionally set the mean/default value of prediction
set hivevar:mu=0.0;
DROP TABLE predicted;
CREATE TABLE predicted
as
WITH knn_exploded as (
select
l.userid as u,
l.itemid as i, -- axis
r1.other as k, -- other
r2.rating as r_uk
from
predict_pair l
LEFT OUTER JOIN knn_train r1
ON (r1.itemid = l.itemid)
JOIN training r2
ON (r2.userid = l.userid and r2.itemid = r1.other)
)
select
l.u as userid,
l.i as itemid,
coalesce(sum(l.r_uk * r.w), ${mu}) as predicted
-- coalesce(sum(l.r_uk * r.w)) as predicted
from
knn_exploded l
LEFT OUTER JOIN slim_model r ON (l.i = r.i and l.k = r.nn)
group by
l.u, l.i
;
Caution
When is small, slim predicted value may be null
. Then, $mu
replaces null
value.
The mean value of item ratings is a good choice for $mu
.
Top- item recommendation for each user
Here, we recommend top-3 items for each user based on predicted values.
SET hivevar:k=3;
DROP TABLE IF EXISTS recommend;
CREATE TABLE recommend
as
WITH top_n as (
select
each_top_k(${k}, userid, predicted, userid, itemid)
as (rank, predicted, userid, itemid)
from (
select * from predicted
CLUSTER BY userid
) t
)
select
userid,
collect_list(itemid) as items
from
top_n
group by
userid
;
select * from recommend limit 5;
userid | items |
---|---|
1 | [364,594,2081] |
2 | [2028,3256,589] |
3 | [260,1291,2791] |
4 | [1196,1200,1210] |
5 | [3813,1366,89] |
... | ... |
Evaluation
Top- ranking measures: Hit-Rate@K, MRR@K, and Precision@K
In this section, Hit-Rate@k
, MRR@k
, and Precision@k
are computed based on recommended items.
Precision@K
is a good evaluation measure for -hold cross validation.
On the other hand, Hit-Rate
and Mean Reciprocal Rank
(i.e., Average Reciprocal Hit-Rate) are good evaluation measures for leave-one-out cross validation.
SET hivevar:n=10;
WITH top_k as (
select
each_top_k(${n}, userid, predicted, userid, itemid)
as (rank, predicted, userid, itemid)
from (
select * from predicted
CLUSTER BY userid
) t
),
rec_items as (
select
userid,
collect_list(itemid) as items
from
top_k
group by
userid
),
ground_truth as (
select
userid,
collect_list(itemid) as truth
from
testing
group by
userid
)
select
hitrate(l.items, r.truth) as hitrate,
mrr(l.items, r.truth) as mrr,
precision_at(l.items, r.truth) as prec
from
rec_items l
join ground_truth r on (l.userid=r.userid)
;
Leave-one-out result
hitrate | mrr | prec |
---|---|---|
0.21517309922146763 | 0.09377752536606271 | 0.021517309922146725 |
Hit Rate and MRR are similar to ones in the result of Table II in Slim's paper
-hold result
hitrate | mrr | prec |
---|---|---|
0.8952775476387739 | 1.1751514972186057 | 0.3564871582435789 |
Precision value is similar to the result of Mendeley's slide.
Ranking measures: MRR
In this example, whole recommended items are evaluated using MRR.
WITH rec_items as (
select
userid,
to_ordered_list(itemid, predicted, '-reverse') as items
from
predicted
group by
userid
),
ground_truth as (
select
userid,
collect_list(itemid) as truth
from
testing
group by
userid
)
select
mrr(l.items, r.truth) as mrr
from
rec_items l
join ground_truth r on (l.userid=r.userid)
;
Leave-one-out result
mrr |
---|
0.10782647321821472 |
-hold result
mrr |
---|
0.6179983058881773 |
This MRR value is similar to one in the Mendeley's slide.