The field-aware factorization machine (FFM) is a factorization model that was used by the #1 solution of the Criteo competition.
This page guides you through trying this factorization technique with Hivemall's train_ffm and ffm_predict UDFs.
Note
This feature is supported in Hivemall v0.5.1 or later.
Preprocess data and convert into LIBFFM format
Since FFM is a relatively complex factor-based model that requires a significant amount of feature engineering, preprocessing the data outside of Hive can be a reasonable option.
You can again use the takuti/criteo-ffm repository cloned in the data preparation guide to preprocess the data in the same way the winning solution did:
cd criteo-ffm
# create the CSV files `tr.csv` and `te.csv`
make preprocess
The make preprocess task executes Python scripts that were originally taken from guestwalk/kaggle-2014-criteo and chenhuang-learn/ffm.
Eventually, you will obtain the following files in so-called LIBFFM format:
tr.ffm - Labeled training samples
tr.sp - 80% of the labeled training samples, randomly picked from tr.ffm
va.sp - Remaining 20% of the labeled samples, used for evaluation
te.ffm - Unlabeled test samples
<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.
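For illustration, a single labeled row in this format might look as follows (the field/feature indices and values here are made-up and purely illustrative, not taken from the actual Criteo data):
1 0:0:0.05 1:17:1 2:42:1 ...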
See the LIBFFM official README for details.
In order to evaluate the accuracy of prediction at the end of this tutorial, the later sections use tr.sp and va.sp.
Insert preprocessed data into tables
Create new tables used by the FFM UDFs:
hadoop fs -put tr.sp /criteo/ffm/train
hadoop fs -put va.sp /criteo/ffm/test
use criteo;
DROP TABLE IF EXISTS train_ffm;
CREATE EXTERNAL TABLE train_ffm (
label int,
-- quantitative features
i1 string,i2 string,i3 string,i4 string,i5 string,i6 string,i7 string,i8 string,i9 string,i10 string,i11 string,i12 string,i13 string,
-- categorical features
c1 string,c2 string,c3 string,c4 string,c5 string,c6 string,c7 string,c8 string,c9 string,c10 string,c11 string,c12 string,c13 string,c14 string,c15 string,c16 string,c17 string,c18 string,c19 string,c20 string,c21 string,c22 string,c23 string,c24 string,c25 string,c26 string
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/train';
DROP TABLE IF EXISTS test_ffm;
CREATE EXTERNAL TABLE test_ffm (
label int,
-- quantitative features
i1 string,i2 string,i3 string,i4 string,i5 string,i6 string,i7 string,i8 string,i9 string,i10 string,i11 string,i12 string,i13 string,
-- categorical features
c1 string,c2 string,c3 string,c4 string,c5 string,c6 string,c7 string,c8 string,c9 string,c10 string,c11 string,c12 string,c13 string,c14 string,c15 string,c16 string,c17 string,c18 string,c19 string,c20 string,c21 string,c22 string,c23 string,c24 string,c25 string,c26 string
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/criteo/ffm/test';
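Optionally, a quick sanity check is to compare the table row counts against the line counts of tr.sp and va.sp:
-- each count should match the number of lines in the corresponding .sp file
SELECT COUNT(1) FROM train_ffm;
SELECT COUNT(1) FROM test_ffm;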
Vectorize the LIBFFM-formatted features with rowid:
DROP TABLE IF EXISTS train_vectorized;
CREATE TABLE train_vectorized AS
SELECT
row_number() OVER () AS rowid,
array(
i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
) AS features,
label
FROM
train_ffm
;
DROP TABLE IF EXISTS test_vectorized;
CREATE TABLE test_vectorized AS
SELECT
row_number() OVER () AS rowid,
array(
i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13,
c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26
) AS features,
label
FROM
test_ffm
;
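Each vectorized row packs the 13 quantitative and 26 categorical columns into a single 39-element array of field:feature:value strings, which is the form consumed by train_ffm below. Optionally, you can peek at a few rows:
-- every features array should contain 39 entries (13 quantitative + 26 categorical)
SELECT rowid, size(features) AS num_entries, label
FROM train_vectorized
LIMIT 5;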
Training
DROP TABLE IF EXISTS criteo.ffm_model;
CREATE TABLE criteo.ffm_model (
model_id int,
i int,
Wi float,
Vi array<float>
);
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT
train_ffm(
features,
label,
'-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002'
)
FROM (
SELECT
features, label
FROM
criteo.train_vectorized
CLUSTER BY rand(1)
) t
;
The third argument of train_ffm accepts a variety of options:
hive> SELECT train_ffm(array(), 0, '-help');
usage: train_ffm(array<string> x, double y [, const string options]) -
Returns a prediction model [-alpha <arg>] [-auto_stop] [-beta
<arg>] [-c] [-cv_rate <arg>] [-disable_cv] [-enable_norm]
[-enable_wi] [-eps <arg>] [-eta <arg>] [-eta0 <arg>] [-f <arg>]
[-feature_hashing <arg>] [-help] [-init_v <arg>] [-int_feature]
[-iters <arg>] [-l1 <arg>] [-l2 <arg>] [-lambda0 <arg>] [-lambdaV
<arg>] [-lambdaW0 <arg>] [-lambdaWi <arg>] [-max <arg>] [-maxval
<arg>] [-min <arg>] [-min_init_stddev <arg>] [-no_norm]
[-num_fields <arg>] [-opt <arg>] [-p <arg>] [-power_t <arg>] [-seed
<arg>] [-sigma <arg>] [-t <arg>] [-va_ratio <arg>] [-va_threshold
<arg>] [-w0]
-alpha,--alphaFTRL <arg> Alpha value (learning rate)
of
Follow-The-Regularized-Reade
r [default: 0.2]
-auto_stop,--early_stopping Stop at the iteration that
achieves the best validation
on partial samples [default:
OFF]
-beta,--betaFTRL <arg> Beta value (a learning
smoothing parameter) of
Follow-The-Regularized-Reade
r [default: 1.0]
-c,--classification Act as classification
-cv_rate,--convergence_rate <arg> Threshold to determine
convergence [default: 0.005]
-disable_cv,--disable_cvtest Whether to disable
convergence check [default:
OFF]
-enable_norm,--l2norm Enable instance-wise L2
normalization
-enable_wi,--linear_term Include linear term
[default: OFF]
-eps <arg> A constant used in the
denominator of AdaGrad
[default: 1.0]
-eta <arg> The initial learning rate
-eta0 <arg> The initial learning rate
[default 0.1]
-f,--factors <arg> The number of the latent
variables [default: 5]
-feature_hashing <arg> The number of bits for
feature hashing in range
[18,31] [default: -1]. No
feature hashing for -1.
-help Show function help
-init_v <arg> Initialization strategy of
matrix V [random,
gaussian](default: 'random'
for regression / 'gaussian'
for classification)
-int_feature,--feature_as_integer Parse a feature as integer
[default: OFF]
-iters,--iterations <arg> The number of iterations
[default: 10]
-l1,--lambda1 <arg> L1 regularization value of
Follow-The-Regularized-Reade
r that controls model
Sparseness [default: 0.001]
-l2,--lambda2 <arg> L2 regularization value of
Follow-The-Regularized-Reade
r [default: 0.0001]
-lambda0,--lambda <arg> The initial lambda value for
regularization [default:
0.0001]
-lambdaV,--lambda_v <arg> The initial lambda value for
V regularization [default:
0.0001]
-lambdaW0,--lambda_w0 <arg> The initial lambda value for
W0 regularization [default:
0.0001]
-lambdaWi,--lambda_wi <arg> The initial lambda value for
Wi regularization [default:
0.0001]
-max,--max_target <arg> The maximum value of target
variable
-maxval,--max_init_value <arg> The maximum initial value in
the matrix V [default: 0.5]
-min,--min_target <arg> The minimum value of target
variable
-min_init_stddev <arg> The minimum standard
deviation of initial matrix
V [default: 0.1]
-no_norm,--disable_norm Disable instance-wise L2
normalization
-num_fields <arg> The number of fields
[default: 256]
-opt,--optimizer <arg> Gradient Descent optimizer
[default: ftrl, adagrad,
sgd]
-p,--num_features <arg> The size of feature
dimensions [default: -1]
-power_t <arg> The exponent for inverse
scaling learning rate
[default 0.1]
-seed <arg> Seed value [default: -1
(random)]
-sigma <arg> The standard deviation for
initializing V [default:
0.1]
-t,--total_steps <arg> The total number of training
examples
-va_ratio,--validation_ratio <arg> Ratio of training data used
for validation [default:
0.05f]
-va_threshold,--validation_threshold <arg> Threshold to start
validation. At least N
training examples are used
before validation [default:
1000]
-w0,--global_bias Whether to include global
bias term w0 [default: OFF]
Note that the debug log describes the change of cumulative loss over the iterations as follows:
Iteration #2 | average loss=0.5407147187026483, current cumulative loss=858.114258581103, previous cumulative loss=1682.1101438997914, change rate=0.48985846040280256, #trainingExamples=1587
Iteration #3 | average loss=0.5105058761578417, current cumulative loss=810.1728254624949, previous cumulative loss=858.114258581103, change rate=0.05586835626980435, #trainingExamples=1587
Iteration #4 | average loss=0.49045915570992393, current cumulative loss=778.3586801116493, previous cumulative loss=810.1728254624949, change rate=0.039268344174200345, #trainingExamples=1587
Iteration #5 | average loss=0.4752751205770395, current cumulative loss=754.2616163557617, previous cumulative loss=778.3586801116493, change rate=0.030958816766109738, #trainingExamples=1587
Iteration #6 | average loss=0.46308523885164105, current cumulative loss=734.9162740575543, previous cumulative loss=754.2616163557617, change rate=0.02564805351182389, #trainingExamples=1587
Iteration #7 | average loss=0.4529012395753083, current cumulative loss=718.7542672060143, previous cumulative loss=734.9162740575543, change rate=0.02199163009727323, #trainingExamples=1587
Iteration #8 | average loss=0.44411358945347845, current cumulative loss=704.8082664626703, previous cumulative loss=718.7542672060143, change rate=0.019403016273636577, #trainingExamples=1587
Iteration #9 | average loss=0.4363264696377158, current cumulative loss=692.450107315055, previous cumulative loss=704.8082664626703, change rate=0.017534072365012268, #trainingExamples=1587
Iteration #10 | average loss=0.4292753045556725, current cumulative loss=681.2599083298522, previous cumulative loss=692.450107315055, change rate=0.01616029641267912, #trainingExamples=1587
Iteration #11 | average loss=0.42277515600757143, current cumulative loss=670.9441725840159, previous cumulative loss=681.2599083298522, change rate=0.015142144165104322, #trainingExamples=1587
Iteration #12 | average loss=0.416689617663307, current cumulative loss=661.2864232316682, previous cumulative loss=670.9441725840159, change rate=0.014394266687126348, #trainingExamples=1587
Iteration #13 | average loss=0.4109140194740033, current cumulative loss=652.1205489052433, previous cumulative loss=661.2864232316682, change rate=0.013860672175351585, #trainingExamples=1587
Iteration #14 | average loss=0.4053667348634373, current cumulative loss=643.317008228275, previous cumulative loss=652.1205489052433, change rate=0.013499866998129951, #trainingExamples=1587
Iteration #15 | average loss=0.3999840450561501, current cumulative loss=634.7746795041102, previous cumulative loss=643.317008228275, change rate=0.013278568131893133, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
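The change rate reported on each line is the relative decrease of the cumulative loss from the previous iteration; when the convergence check is enabled, training should stop once this value falls below the -cv_rate threshold (0.005 by default; see -disable_cv above). For example, the change rate at iteration #2 can be reproduced as:
change rate = (previous cumulative loss - current cumulative loss) / previous cumulative loss
            = (1682.1101438997914 - 858.114258581103) / 1682.1101438997914
            ≈ 0.4899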
Prediction and evaluation
DROP TABLE IF EXISTS criteo.test_exploded;
CREATE TABLE criteo.test_exploded AS
SELECT
t1.rowid,
t2.i,
t2.j,
t2.Xi,
t2.Xj
FROM
criteo.test_vectorized t1
LATERAL VIEW feature_pairs(t1.features, '-ffm') t2 AS i, j, Xi, Xj
;
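Here, the feature_pairs UDTF with the -ffm option expands each test row into pairwise feature combinations (i, j) together with their values (Xi, Xj), which is the form ffm_predict consumes in the next query. Optionally, you can inspect a few exploded rows:
-- each original rowid expands into many (i, j) pairs
SELECT rowid, i, j, Xi, Xj
FROM criteo.test_exploded
LIMIT 10;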
WITH predicted AS (
SELECT
rowid,
avg(score) AS predicted
FROM (
SELECT
t1.rowid,
p1.model_id,
sigmoid(ffm_predict(p1.Wi, p1.Vi, p2.Vi, t1.Xi, t1.Xj)) AS score
FROM
criteo.test_exploded t1
JOIN criteo.ffm_model p1 ON (p1.i = t1.i) -- at least p1.i = 0 and t1.i = 0 exists
LEFT OUTER JOIN criteo.ffm_model p2 ON (p2.model_id = p1.model_id and p2.i = t1.j)
WHERE
p1.Wi is not null OR p2.Vi is not null
GROUP BY
t1.rowid, p1.model_id
) t
GROUP BY
rowid
)
SELECT
logloss(t1.predicted, t2.label)
FROM
predicted t1
JOIN
criteo.test_vectorized t2
ON t1.rowid = t2.rowid
;
0.47276208106423234
Note
The accuracy varies depending on the random separation of tr.sp and va.sp.
Notice that a LogLoss of around 0.45 is reasonable accuracy compared to the competition leaderboard and the output from LIBFFM.