Hivemall has a generic function for classification: train_classifier. Compared to the other functions we will see in the later chapters, train_classifier provides a simpler, configurable generic interface that can be used to build binary classification models in a variety of settings.
Here, we briefly introduce the usage of the function. Before trying the sample queries, you first need to prepare the a9a data. See our a9a tutorial page for further instructions.
Note
This feature is supported in Hivemall v0.5-rc.1 or later.
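Hivemall's trainer UDTFs typically accept a -help option that reports their supported hyperparameters. The query below is only a sketch under that assumption (the dummy feature array and label are placeholders); if supported in your version, it is a convenient way to discover which -loss, -opt, and -reg values are available:
select
  -- assumes the common Hivemall '-help' option; the arguments are dummies
  train_classifier(array('dummy'), 0, '-help');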
Training
create table classification_model as
select
  feature,
  -- each map task learns its own model in parallel; averaging merges the per-task weights
  avg(weight) as weight
from
  (
    select
      train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no') as (feature, weight)
    from
      a9a_train
  ) t
group by feature;
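The resulting classification_model table simply maps each feature to its learned weight. As a quick sanity check, you can peek at the largest-magnitude weights with plain HiveQL over the table created above:
select
  feature,
  weight,
  abs(weight) as abs_weight
from
  classification_model
order by
  abs_weight desc
limit 10;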
Prediction & evaluation
WITH test_exploded as (
  select
    rowid,
    label,
    -- split each exploded feature string (e.g. "123:0.5") into its index and value
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    a9a_test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    -- sigmoid of the weighted sum of feature values; prob >= 0.5 is mapped to the positive class
    sigmoid(sum(m.weight * t.value)) as prob,
    (case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as label
  from
    test_exploded t
    LEFT OUTER JOIN classification_model m
      ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    p.label as predicted,
    p.prob as probability
  from
    a9a_test t
    JOIN predict p
      on (t.rowid = p.rowid)
)
select
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
| accuracy |
|---|
| 0.8461396720103188 |
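The same WITH pipeline can be reused for other checks as well. For instance, replacing the final select above with the one below gives a simple confusion-matrix breakdown instead of plain accuracy, assuming the 0/1 label convention used in the a9a tutorial:
select
  sum(if(actual = 1 and predicted = 1.0, 1, 0)) as true_positives,
  sum(if(actual = 0 and predicted = 1.0, 1, 0)) as false_positives,
  sum(if(actual = 1 and predicted = 0.0, 1, 0)) as false_negatives,
  sum(if(actual = 0 and predicted = 0.0, 1, 0)) as true_negatives
from
  submit;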
Comparison with the other binary classifiers
In the next part of this user guide, our binary classification tutorials introduce a number of different training functions. All of them share the same interface, but their mathematical formulations and implementations differ. In particular, the sample queries above are almost the same as the a9a tutorial using Logistic Regression; the only difference is the choice of training function: logress() vs. train_classifier().
At the same time, the options -loss logloss -opt SGD -reg no tell train_classifier to behave just like logress: logistic loss, SGD optimizer, and no regularization. Hence, the prediction accuracy of logress and train_classifier would be (almost) the same under this configuration.
In addition, train_classifier supports the -mini_batch option in the same manner as logress. Thus, the following two training queries produce the same results:
select
  logress(add_bias(features), label, '-mini_batch 10') as (feature, weight)
from
  a9a_train
;

select
  train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -mini_batch 10') as (feature, weight)
from
  a9a_train
;
Likewise, you can build many different classifiers simply by changing the options.
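For example, switching the loss and regularization options turns the same pipeline into an SVM-like linear classifier. The option values below (hingeloss, AdaGrad, l2) are assumptions based on the option style shown above; check the values supported by your Hivemall version before running:
create table svm_like_model as
select
  feature,
  avg(weight) as weight
from
  (
    select
      -- hinge loss + L2 regularization, trained with AdaGrad (option values assumed)
      train_classifier(add_bias(features), label, '-loss hingeloss -opt AdaGrad -reg l2') as (feature, weight)
    from
      a9a_train
  ) t
group by feature;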