Cross-validation is a model validation technique for assessing how well a prediction model generalizes to an independent data set. This example shows one way to perform k-fold cross-validation to evaluate prediction performance.
Caution: Matrix factorization is supported in Hivemall v0.3 or later.
Creating the data set for 10-fold cross-validation
use movielens;
set hivevar:kfold=10;
set hivevar:seed=31;
-- Adding group id (gid) to each training instance
drop table ratings_groupded;
create table ratings_groupded
as
select
  floor(rand(${seed})*${kfold}) gid, -- generates a group id ranging from 0 to 9 (0 to kfold-1)
  userid, 
  movieid, 
  rating
from
  ratings
cluster by gid, rand(${seed});
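The group assignment above can be illustrated outside Hive. The following is a minimal Python sketch, not part of Hivemall: `assign_groups` is a hypothetical helper, and Python's random generator stands in for Hive's `rand`, so the individual values differ from what Hive produces. It shows that `floor(rand * kfold)` yields ids from 0 to kfold-1 and that group sizes are only approximately equal.

```python
import math
import random
from collections import Counter


def assign_groups(num_rows, kfold, seed):
    """Give each row a group id in 0..kfold-1, mirroring floor(rand(seed) * kfold)."""
    # Assumption: Python's generator replaces Hive's rand(); only the idea carries over.
    rng = random.Random(seed)
    return [math.floor(rng.random() * kfold) for _ in range(num_rows)]


gids = assign_groups(num_rows=10_000, kfold=10, seed=31)
counts = Counter(gids)
print(sorted(counts))                              # distinct group ids
print(min(counts.values()), max(counts.values()))  # smallest and largest group size
```

Because the ids are drawn at random, the folds are balanced only on average. That is usually acceptable for a data set of this size.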
Set training hyperparameters
-- latent factors
set hivevar:factor=10;
-- maximum number of iterations
set hivevar:iters=50;
-- regularization parameter
set hivevar:lambda=0.05;
-- learning rate
set hivevar:eta=0.005;
-- convergence rate (training stops once the change between iterations is less than or equal to ${cv_rate})
set hivevar:cv_rate=0.001;
Due to a bug in Hive, do not enter the comment lines (those starting with --) in the Hive CLI.
select avg(rating) from ratings;
3.581564453029317
-- mean rating value (Optional but recommended to set ${mu})
set hivevar:mu=3.581564453029317;
Note that it is not necessary to set an exact value for ${mu}.
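To make the role of each hyperparameter concrete, here is a minimal Python sketch of a generic biased matrix factorization trained with SGD. It is not Hivemall's implementation: the function name `train_mf`, the Gaussian initialization, and the stopping rule based on relative change in training loss are all assumptions made for illustration. The parameters `factor`, `iters`, `lam`, `eta`, `cv_rate`, and `mu` correspond to the Hive variables set above.

```python
import random


def train_mf(ratings, factor=10, iters=50, lam=0.05, eta=0.005,
             cv_rate=0.001, mu=0.0, seed=31):
    """Hypothetical sketch: ratings is a list of (user, item, rating) tuples.

    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    P, Q, bu, bi = {}, {}, {}, {}  # user factors, item factors, user bias, item bias

    def init_vec():
        return [rng.gauss(0.0, 0.1) for _ in range(factor)]

    prev_loss = None
    for _ in range(iters):
        loss = 0.0
        for u, i, r in ratings:
            pu = P.setdefault(u, init_vec())
            qi = Q.setdefault(i, init_vec())
            b_u, b_i = bu.get(u, 0.0), bi.get(i, 0.0)
            pred = mu + b_u + b_i + sum(a * b for a, b in zip(pu, qi))
            err = r - pred
            loss += err * err
            # SGD step: eta is the learning rate, lam the L2 regularization weight.
            bu[u] = b_u + eta * (err - lam * b_u)
            bi[i] = b_i + eta * (err - lam * b_i)
            for k in range(factor):
                pk, qk = pu[k], qi[k]
                pu[k] = pk + eta * (err * qk - lam * pk)
                qi[k] = qk + eta * (err * pk - lam * qk)
        # Assumed convergence test: stop when the relative loss change is <= cv_rate.
        if prev_loss and abs(prev_loss - loss) / prev_loss <= cv_rate:
            break
        prev_loss = loss

    def predict(u, i):
        pu = P.get(u, [0.0] * factor)
        qi = Q.get(i, [0.0] * factor)
        return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + sum(a * b for a, b in zip(pu, qi))

    return predict


toy = [("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0), ("u2", "m2", 2.0)]
mu = sum(r for _, _, r in toy) / len(toy)
predict = train_mf(toy, mu=mu)
print(predict("u1", "m1"), predict("u3", "m9"))  # an unseen user/item pair falls back to mu
```

In this sketch `mu` is only the baseline that the biases and factors adjust, which is consistent with the note that an approximate value is sufficient.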
SQL generation for 10-fold cross-validation
Run generate_cv.sh to create generate_cv.sql.
Then, issue the SQL queries in generate_cv.sql to get MAE/RMSE.
0.6695442192077673 (MAE)
0.8502739040257945 (RMSE)
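The evaluation loop behind these two numbers can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of generate_cv.sql: the helper names are hypothetical, a global-mean predictor stands in for the trained model, and the held-out predictions from all folds are pooled before computing the errors. Averaging per-fold metrics would be an equally plausible design.

```python
import math


def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def cross_validate(rows, kfold, fit):
    """rows: (gid, user, item, rating) tuples; fit(train) returns predict(user, item)."""
    pairs = []
    for fold in range(kfold):
        # Hold out one group for testing and train on the remaining groups.
        train = [(u, i, r) for g, u, i, r in rows if g != fold]
        test = [(u, i, r) for g, u, i, r in rows if g == fold]
        predict = fit(train)
        pairs.extend((predict(u, i), r) for u, i, r in test)
    return mae(pairs), rmse(pairs)


def fit_global_mean(train):
    """Baseline stand-in for the real model: always predict the training mean."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean


rows = [(0, "u1", "m1", 4.0), (0, "u2", "m1", 2.0),
        (1, "u1", "m2", 5.0), (1, "u2", "m2", 1.0)]
print(cross_validate(rows, kfold=2, fit=fit_global_mean))
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions, as with the two values reported above.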
We recommend using Tez to run queries that have many stages.