The task is predicting the click through rate (CTR) of advertisement, meaning that we are to predict the probability of each ad being clicked.
https://www.kaggle.com/c/kddcup2012-track2
Caution: This example just shows a baseline result. Use token tables and amplifier to get better AUC score.
Logistic Regression
Training
-- Switch to the database holding the KDD Cup 2012 track 2 tables.
use kdd12track2;
-- set mapred.max.split.size=134217728; -- [optional] set if OOM caused at mappers on training
-- SET mapred.max.split.size=67108864;
-- Count the training rows; this number is used to size "-total_steps" below.
select count(1) from training_orcfile;
The table contains 235,582,879 rows. With 56 mappers, each mapper processes roughly 235582879 / 56 ≈ 4,206,837 examples, so total_steps is set a bit above that per-mapper count.
-- total_steps ≈ examples per mapper (235582879 rows / 56 mappers ≈ 4.2M), rounded up to 5M.
set hivevar:total_steps=5000000;
-- set mapred.reduce.tasks=64; -- [optional] set the explicit number of reducers to make group-by aggregation faster
-- Train a logistic-regression model with Hivemall's logress UDTF.
-- Each mapper emits (feature, weight) pairs for its data split; the outer
-- query averages the weights across mappers (simple model averaging).
DROP TABLE lr_model;
CREATE TABLE lr_model
AS
SELECT
    feature,
    CAST(AVG(weight) AS float) AS weight
FROM (
    SELECT
        -- "-total_steps" is optional; logress(features, label) also works
        logress(features, label, "-total_steps ${total_steps}") AS (feature, weight)
    FROM
        training_orcfile
) t
GROUP BY feature;
-- set mapred.max.split.size=-1; -- reset to the default value
Note
Setting the "-total_steps" option is optional.
Prediction
-- Score the test set: for each rowid, sum the learned weights of its active
-- features and map that margin to a click probability via the sigmoid.
-- LEFT OUTER JOIN keeps rows whose features never appeared in training.
DROP TABLE lr_predict;
CREATE TABLE lr_predict
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
STORED AS TEXTFILE
AS
SELECT
    t.rowid,
    sigmoid(SUM(m.weight)) AS prob
FROM
    testing_exploded t
    LEFT OUTER JOIN lr_model m
        ON (t.feature = m.feature)
GROUP BY
    t.rowid
ORDER BY
    rowid ASC;
Note
sigmoid(sum(m.weight))
not sigmoid(sum(m.weight * t.value)))
because t.value is always 1.0 for categorical variables, so multiplying by it would be a no-op.
Evaluation
You can download scoreKDD.py from the KDD Cup 2012, Track 2 site after logging in to Kaggle.
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/lr_predict lr_predict.tbl
gawk -F "\t" '{print $2;}' lr_predict.tbl > lr_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv lr_predict.submit
Measure | Score |
---|---|
AUC | 0.741111 |
NWMAE | 0.045493 |
WRMSE | 0.142395 |
Passive Aggressive
Training
-- Train a Passive-Aggressive regression model (PA1a variant).
-- As with the LR model, per-mapper weights are averaged across mappers.
DROP TABLE pa_model;
CREATE TABLE pa_model
AS
SELECT
    feature,
    CAST(AVG(weight) AS float) AS weight
FROM (
    SELECT
        train_pa1a_regr(features, label) AS (feature, weight)
    FROM
        training_orcfile
) t
GROUP BY feature;
Note
PA1a is recommended when using PA for regression.
Prediction
-- Score the test set with the PA model: the raw sum of feature weights is
-- emitted as-is (no sigmoid), so "prob" is a ranking score, not a probability.
DROP TABLE pa_predict;
CREATE TABLE pa_predict
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
STORED AS TEXTFILE
AS
SELECT
    t.rowid,
    SUM(m.weight) AS prob
FROM
    testing_exploded t
    LEFT OUTER JOIN pa_model m
        ON (t.feature = m.feature)
GROUP BY
    t.rowid
ORDER BY
    rowid ASC;
Caution
The "prob" of PA can be used only for ranking and can have a negative value. A higher weight means the ad is more likely to be clicked. Note that AUC is a measure of ranking accuracy, so it is still meaningful for these scores.
Evaluation
hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/pa_predict pa_predict.tbl
gawk -F "\t" '{print $2;}' pa_predict.tbl > pa_predict.submit
pypy scoreKDD.py KDD_Track2_solution.csv pa_predict.submit
Measure | Score |
---|---|
AUC | 0.739722 |
NWMAE | 0.049582 |
WRMSE | 0.143698 |