Our user guide for item-based collaborative filtering (CF) introduced how to make recommendation based on item-item similarities. Here, we particularly focus on DIMSUM, an efficient and approximated similarity computation scheme, and try to make recommendation from the MovieLens data.
Caution
In order to obtain ranked list of items, this page introduces queries using to_ordered_map()
such as map_values(to_ordered_map(rating, movieid, true))
. However, this kind of usage has a potential issue that multiple movieid
-s (i.e., values) which have the exactly same rating
(i.e., key) will be aggregated to single arbitrary movieid
, because to_ordered_map()
creates a key-value map which uses duplicated rating
as key.
Hence, if map key could duplicate on more then one map values, we recommend you to use to_ordered_list(value, key, '-reverse')
instead of map_values(to_ordered_map(key, value, true))
. The alternative approach is available from Hivemall v0.5-rc.1 or later.
Compute movie-movie similarity
As we explained in the general introduction of item-based CF, following query finds top- nearest-neighborhood movies for each movie:
drop table if exists dimsum_movie_similarity;
create table dimsum_movie_similarity
as
with movie_magnitude as ( -- compute magnitude of each movie vector
select
to_map(j, mag) as mags
from (
select
movieid as j,
l2_norm(rating) as mag
from
training
group by
movieid
) t0
),
movie_features as (
select
userid as i,
collect_list(
feature(movieid, rating)
) as feature_vector
from
training
group by
userid
),
partial_result as ( -- launch DIMSUM in a MapReduce fashion
select
dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -disable_symmetric_output')
as (movieid, other, s)
from
movie_features f
left outer join movie_magnitude m
),
similarity as (
-- reduce (i.e., sum up) mappers' partial results
select
movieid,
other,
sum(s) as similarity
from
partial_result
group by
movieid, other
),
topk as (
select
each_top_k( -- get top-10 nearest neighbors based on similarity score
10, movieid, similarity,
movieid, other -- output items
) as (rank, similarity, movieid, other)
from (
select * from similarity
CLUSTER BY movieid
) t
)
select
movieid, other, similarity
from
topk
;
movieid | other | similarity |
---|---|---|
1 | 2095 | 0.9377422722094696 |
1 | 231 | 0.9316530366756418 |
1 | 1407 | 0.9194745656079863 |
1 | 3442 | 0.9133853300741587 |
1 | 1792 | 0.9072960945403309 |
... | ... | ... |
Since we set k=10
, output has 10 most-similar movies per movieid
.
Note
Since we specified an option -disable_symmetric_output
, output table does not contain inverted similarities such as <2095, 1>
, <231, 1>
, <1407, 1>
, ...
Prediction
Next, we predict rating for unforeseen user-movie pairs based on the top- similarities:
drop table if exists dimsum_prediction;
create table dimsum_prediction
as
with similarity_all as (
-- copy (i1, i2)'s similarity as (i2, i1)'s one
select movieid, other, similarity from dimsum_movie_similarity
union all
select other as movieid, movieid as other, similarity from dimsum_movie_similarity
)
select
-- target user
t1.userid,
-- recommendation candidate
t2.movieid,
-- predicted rating: r_{u,i} = sum(s_{i,:} * r_{u,:}) / sum(s_{i,:})
sum(t1.rating * t2.similarity) / sum(t2.similarity) as rating
from
training t1 -- r_{u,<movieid>}
left join -- s_{i,<other>}
similarity_all t2
on t1.movieid = t2.other
where
-- do not include movies that user already rated
NOT EXISTS (
SELECT a.movieid FROM training a
WHERE a.userid = t1.userid AND a.movieid = t2.movieid
)
group by
t1.userid, t2.movieid
;
This query computes estimated rating as follows:
userid | movieid | rating |
---|---|---|
1 | 1000 | 5.0 |
1 | 1010 | 5.0 |
1 | 1012 | 4.246349332667371 |
1 | 1013 | 5.0 |
1 | 1014 | 5.0 |
... | ... | ... |
Theoretically, for the -th nearest-neighborhood item , prediction can be done by top- weighted sum of target user's historical ratings:
where is user 's rating for item (movie) , and is similarity of - (movieid
-other
) pair.
Caution
Since the number of similarities and users' past ratings are limited, we cannot say this output always contains prediction for every unforeseen user-item pairs; sometimes prediction for a specific user-item pair might be missing (i.e., NULL
).
In fact, our goal is to make recommendation, but we can evaluate the intermediate result as a rating prediction problem:
select
mae(t1.rating, t2.rating) as mae,
rmse(t1.rating, t2.rating) as rmse
from
testing t1
left join
dimsum_prediction t2
on t1.movieid = t2.movieid
where
t1.userid = t2.userid
;
mae | rmse |
---|---|
0.7308365821230256 | 0.9594799959251938 |
Rating of the MovieLens data is in [1, 5]
range, so this average errors are reasonable as a predictor.
Recommendation
By using the prediction table, making recommendation for each user is straightforward:
drop table if exists dimsum_recommendation;
create table dimsum_recommendation
as
select
userid,
map_values(to_ordered_map(rating, movieid, true)) as rec_movies
from
dimsum_prediction
group by
userid
;
userid | rec_movies |
---|---|
1 | ["2590","999","372","1380","2078",...] |
2 | ["580","945","43","36","1704",...] |
3 | ["3744","852","1610","3740","2915",...] |
4 | ["3379","923","1997","2194","2944",...] |
5 | ["998","101","2696","2968","2275",...] |
... | ... |
Note
Size of rec_movies
varies depending on each user's training
samples and what movies he/she already rated.
Evaluation
Eventually, you can measure the quality of recommendation by using ranking measures:
with truth as (
select
userid,
map_values(to_ordered_map(rating, cast(movieid as string), true)) as truth
from
testing
group by
userid
)
select
recall_at(t1.rec_movies, t2.truth, 10) as recall,
precision_at(t1.rec_movies, t2.truth, 10) as precision,
average_precision(t1.rec_movies, t2.truth) as average_precision,
auc(t1.rec_movies, t2.truth) as auc,
mrr(t1.rec_movies, t2.truth) as mrr,
ndcg(t1.rec_movies, t2.truth) as ndcg
from
dimsum_recommendation t1
join
truth t2 on t1.userid = t2.userid
where -- at least 10 recommended items are necessary to compute recall@10 and precision@10
size(t1.rec_movies) >= 10
;
measure | accuracy |
---|---|
Recall@10 | 0.027033598585322713 |
Precision@10 | 0.009001989389920506 |
Average Precision | 0.017363681149831108 |
AUC | 0.5264553136097863 |
MRR | 0.03507380742291146 |
NDCG | 0.15787655209987522 |
If you set larger value to the DIMSUM's -threshold
option, similarity will be more aggressively approximated. Consequently, while efficiency is improved, the accuracy is likely to be decreased.
Caution
Before Hivemall v0.5-rc.1, recall_at()
and precision_at()
are respectively registered as recall()
and precision()
. However, since precision
is a reserved keyword from Hive v2.2.0, we renamed the function names. If you are still using recall()
and/or precision()
, we strongly recommend you to use the latest version of Hivemall and replace them with the newer function names.