Hivemall has a generic function for classification: train_classifier. Compared to the other functions we will see in later chapters, train_classifier provides a simpler, configurable, generic interface that can be used to build binary classification models in a variety of settings.

Here, we briefly introduce the usage of the function. Before trying the sample queries, you first need to prepare the a9a data. See our a9a tutorial page for further instructions.

Note

This feature is supported in Hivemall v0.5-rc.1 and later.

Training

create table classification_model as
select
 feature,
 avg(weight) as weight
from
 (
  select
    train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no') as (feature, weight)
  from
     a9a_train
 ) t
group by feature;
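
To make the training query more concrete, the following Python sketch mimics what it computes conceptually: stochastic gradient descent on the logistic loss without regularization (the meaning suggested by `-loss logloss -opt SGD -reg no`), followed by averaging the weights that several learners emit for the same feature (the `avg(weight) ... group by feature` step). This is not Hivemall's implementation. The function names, the fixed learning rate `eta`, the 0/1 label encoding, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd_logloss(rows, eta=0.1, epochs=1):
    """Plain SGD on the logistic loss, no regularization (illustrative only).

    rows: list of (features, label); features is a dict feature -> value,
    label is assumed to be 0 or 1.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            grad = sigmoid(z) - label  # derivative of the log loss w.r.t. z
            for f, v in features.items():
                w[f] -= eta * grad * v
    return dict(w)


def average_models(models):
    """Mimics `avg(weight) ... group by feature` over several learners."""
    sums, counts = defaultdict(float), defaultdict(int)
    for m in models:
        for f, weight in m.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Hypothetical toy data; feature "0" plays the role of the bias term.
part_a = [({"0": 1.0, "1": 1.0}, 1), ({"0": 1.0, "2": 1.0}, 0)]
part_b = [({"0": 1.0, "1": 1.0}, 1), ({"0": 1.0, "3": 1.0}, 0)]

model = average_models([train_sgd_logloss(part_a), train_sgd_logloss(part_b)])
```

In this sketch each partition stands in for one learner that sees only part of the data, and the averaged dictionary corresponds to the rows of classification_model.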

Prediction & evaluation

WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    a9a_test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    (case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as label
  from
    test_exploded t
    LEFT OUTER JOIN classification_model m 
      ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    p.label as predicted,
    p.prob as probability
  from
    a9a_test t
    JOIN predict p
      on (t.rowid = p.rowid)
)
select 
  sum(if(actual = predicted, 1, 0)) / count(1) as accuracy
from
  submit;
accuracy
0.8461396720103188
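
The prediction and evaluation query above can be read as three steps: split each feature into a name and a value, sum weight times value over the features of a row and pass the sum through a sigmoid, then compare the thresholded result with the actual label. The Python sketch below reproduces that logic on toy data. It assumes every feature is a `name:value` string and that the bias feature is `0:1.0`. The helper names mirror the Hive functions only for readability, and the model weights and test rows are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def add_bias(features, bias_feature="0:1.0"):
    # Assumption: the bias is appended as an extra "name:value" feature.
    return features + [bias_feature]


def explode(features):
    """Split each "name:value" string into a (name, value) pair."""
    pairs = []
    for fv in features:
        name, _, value = fv.rpartition(":")
        pairs.append((name, float(value)))
    return pairs


def predict(features, model):
    # Like the LEFT OUTER JOIN, features absent from the model add nothing.
    z = sum(model.get(name, 0.0) * value
            for name, value in explode(add_bias(features)))
    prob = sigmoid(z)
    return prob, (1.0 if prob >= 0.5 else 0.0)


def accuracy(rows, model):
    hits = sum(1 for features, label in rows
               if predict(features, model)[1] == label)
    return hits / len(rows)


# Hypothetical model and test rows, not taken from the a9a data.
toy_model = {"0": -0.5, "1": 2.0, "2": -1.5}
test_rows = [
    (["1:1.0"], 1.0),  # predicted 1.0, correct
    (["2:1.0"], 0.0),  # predicted 0.0, correct
    (["9:1.0"], 1.0),  # unseen feature, only the bias applies, predicted 0.0
]
toy_accuracy = accuracy(test_rows, toy_model)
```

The third row shows why the join is an outer join: a test feature that never appeared in training still yields a prediction, driven by the remaining features and the bias.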

Comparison with the other binary classifiers

In the next part of this user guide, our binary classification tutorials introduce many different functions. All of them have the same interface, but their mathematical formulations and implementations differ from each other.

In particular, the above sample queries are almost the same as those in the a9a tutorial using Logistic Regression. The only difference is the choice of training function: logress() vs. train_classifier().

However, the options -loss logloss -opt SGD -reg no for train_classifier indicate that Hivemall uses the generic classifier as logress. Hence, the prediction accuracy of logress and train_classifier would be (almost) the same under this configuration.

In addition, train_classifier supports the -mini_batch option in the same manner as logress. Thus, the following two training queries produce the same results:

select
    logress(add_bias(features), label, '-mini_batch 10') as (feature, weight)
from
    a9a_train;

select
    train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -mini_batch 10') as (feature, weight)
from
    a9a_train;

Likewise, you can build many different classifiers by changing the options passed to train_classifier.
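
As a rough illustration of what a mini-batch option such as `-mini_batch 10` changes, the Python sketch below accumulates log-loss gradients over a batch and applies them in a single update. How Hivemall actually aggregates gradients within a batch is not described here, so dividing by the batch size is an assumption, as are the function name, the learning rate `eta`, and the toy rows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mini_batch(rows, batch_size=10, eta=0.1):
    """Mini-batch SGD on the logistic loss (illustrative only).

    rows: list of (features, label); features is a dict feature -> value,
    label is assumed to be 0 or 1.
    """
    w = {}
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        grads = {}
        # Accumulate gradients using the weights from before this batch.
        for features, label in batch:
            z = sum(w.get(f, 0.0) * v for f, v in features.items())
            g = sigmoid(z) - label
            for f, v in features.items():
                grads[f] = grads.get(f, 0.0) + g * v
        # Apply one averaged update per batch (assumed aggregation rule).
        for f, g in grads.items():
            w[f] = w.get(f, 0.0) - eta * g / len(batch)
    return w


# Hypothetical toy rows; feature "0" plays the role of the bias term.
toy_rows = [({"0": 1.0, "1": 1.0}, 1), ({"0": 1.0, "2": 1.0}, 0)]

w_batch = train_mini_batch(toy_rows, batch_size=2)   # one update for both rows
w_online = train_mini_batch(toy_rows, batch_size=1)  # one update per row
```

With `batch_size=1` the sketch reduces to plain per-example SGD, so the two settings generally yield slightly different weights on the same data.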
