Feature Selection is the process of selecting a subset of relevant features for use in model construction.
It is a useful technique for 1) improving prediction results by omitting redundant features, 2) shortening training time, and 3) identifying which features are important for prediction.
Note: This feature is supported in Hivemall v0.5-rc.1 or later.
Supported Feature Selection algorithms
- Chi-square (Chi2)
- In statistics, the χ2 test is applied to test the independence of two events. Chi-square statistics between every feature variable and the target variable can be used as a feature selection criterion. Refer to this article for mathematical details.
- Signal Noise Ratio (SNR)
- The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as |μ1 − μ2| / (σ1 + σ2), where μk is the mean of the variable in class k and σk is its standard deviation in class k. Clearly, features with a larger SNR are more useful for classification.
Usage
Feature Selection based on Chi-square test
CREATE TABLE input (
  X array<double>, -- feature vector
  Y array<int>     -- one-hot class label
);
set hivevar:k=2;
WITH stats AS (
  SELECT
    transpose_and_dot(Y, X) AS observed,
    array_sum(X) AS feature_count,
    array_avg(Y) AS class_prob
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected
  FROM
    stats
),
chi2 AS (
  SELECT
    chi2(r.observed, l.expected) AS v
  FROM
    test l
    CROSS JOIN stats r
)
SELECT
  select_k_best(l.X, r.v.chi2, ${k}) AS features
FROM
  input l
  CROSS JOIN chi2 r;
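The pipeline above can be sketched in plain Python. This is an illustrative sketch with made-up data that mirrors the query's steps (observed/expected contingency counts and the chi-square statistic), not Hivemall's implementation:

```python
# Plain-Python sketch of the chi2 pipeline above (hypothetical data).
# X holds feature vectors; Y holds one-hot class labels.
X = [[6.0, 3.0], [5.0, 3.0], [1.0, 3.0], [2.0, 3.0]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]

n_classes, n_features = len(Y[0]), len(X[0])

# observed = transpose_and_dot(Y, X): shape (#classes, #features)
observed = [[sum(y[c] * x[f] for x, y in zip(X, Y))
             for f in range(n_features)] for c in range(n_classes)]

# feature_count = array_sum(X), class_prob = array_avg(Y)
feature_count = [sum(x[f] for x in X) for f in range(n_features)]
class_prob = [sum(y[c] for y in Y) / len(Y) for c in range(n_classes)]

# expected = transpose_and_dot(class_prob, feature_count)
expected = [[class_prob[c] * feature_count[f] for f in range(n_features)]
            for c in range(n_classes)]

# chi2 statistic per feature: sum over classes of (obs - exp)^2 / exp
chi2 = [sum((observed[c][f] - expected[c][f]) ** 2 / expected[c][f]
            for c in range(n_classes)) for f in range(n_features)]

# select_k_best keeps the k features with the largest chi2 values
k = 1
best = sorted(range(n_features), key=lambda f: chi2[f], reverse=True)[:k]
print(chi2, best)
```

Here the second feature is constant across classes, so its chi2 value is 0 and only the first feature is selected.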
Feature Selection based on Signal Noise Ratio (SNR)
CREATE TABLE input (
  X array<double>, -- feature vector
  Y array<int>     -- one-hot class label
);
set hivevar:k=2;
WITH snr AS (
  SELECT snr(X, Y) AS snr
  FROM
    input
)
SELECT
  select_k_best(X, snr, ${k}) AS features
FROM
  input
  CROSS JOIN snr;
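The SNR ranking can likewise be sketched in plain Python. This is an illustrative sketch with made-up data; it uses the population standard deviation and mirrors the formula |μ1 − μ2| / (σ1 + σ2), not Hivemall's implementation:

```python
# Plain-Python sketch of SNR-based feature ranking (hypothetical data).
# X holds feature vectors; Y holds one-hot labels for two classes.
X = [[6.0, 3.0], [5.0, 2.0], [1.0, 2.0], [2.0, 3.0]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]

def mean(vals):
    return sum(vals) / len(vals)

def std(vals):
    # population standard deviation
    m = mean(vals)
    return (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5

n_features = len(X[0])
snr = []
for f in range(n_features):
    c0 = [x[f] for x, y in zip(X, Y) if y[0] == 1]
    c1 = [x[f] for x, y in zip(X, Y) if y[1] == 1]
    # SNR = |mu1 - mu2| / (sigma1 + sigma2)
    snr.append(abs(mean(c0) - mean(c1)) / (std(c0) + std(c1)))

# keep the k features with the largest SNR
k = 1
best = sorted(range(n_features), key=lambda f: snr[f], reverse=True)[:k]
print(snr, best)
```

The first feature separates the classes well (high SNR), while the second has identical class means (SNR 0), so only the first is kept.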
Function signatures
[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
| array<number> X | array<number> Y |
| :-: | :-: |
| a row of matrix | a row of matrix |

Output

| array<array<double>> dot product |
| :-: |
| dot(X.T, Y) of shape = (X.#cols, Y.#cols) |
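As a sketch of what transpose_and_dot aggregates (plain Python with hypothetical values, not Hivemall code):

```python
# transpose_and_dot accumulates rows into dot(X.T, Y):
# the result has shape (#cols of X, #cols of Y).
X_rows = [[1.0, 2.0], [3.0, 4.0]]  # rows of matrix X fed to the UDAF
Y_rows = [[5.0, 6.0], [7.0, 8.0]]  # rows of matrix Y fed to the UDAF

n, m = len(X_rows[0]), len(Y_rows[0])
result = [[sum(x[i] * y[j] for x, y in zip(X_rows, Y_rows))
           for j in range(m)] for i in range(n)]
print(result)
```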
[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
| array<number> X | array<number> importance_list | int k |
| :-: | :-: | :-: |
| feature vector | importance of each feature | the number of features to be selected |

Output

| array<double> k-best features |
| :-: |
| top-k elements from feature vector X based on the importance list |
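A plain-Python sketch of this selection (hypothetical values; this sketch assumes the selected elements keep their original order):

```python
# select_k_best keeps the k elements of X whose importance is largest.
X = [0.1, 0.2, 0.3, 0.4]
importance = [5.0, 1.0, 4.0, 2.0]
k = 2

# indices of the top-k importances, restored to original order
top = sorted(sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)[:k])
features = [X[i] for i in top]
print(features)
```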
[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
| array<array<number>> observed | array<array<number>> expected |
| :-: | :-: |
| observed features | expected features: dot(class_prob.T, feature_count) |

Both observed and expected have a shape of (#classes, #features).

Output

| struct<array<double>, array<double>> importance_list |
| :-: |
| chi2-value and p-value for each feature |
[UDAF] snr(X::array<number>, Y::array<int>)::array<double>
| array<number> X | array<int> Y |
| :-: | :-: |
| feature vector | one-hot label |

Output

| array<double> importance_list |
| :-: |
| Signal Noise Ratio for each feature |