Feature binning is a method of dividing quantitative variables into categorical values. It groups quantitative values into a pre-defined number of bins.

If the number of bins is set to 3, the bin ranges become something like [-Inf, 1], (1, 10], (10, Inf].

Data Preparation

Prepare sample data (users table) first as follows:

CREATE TABLE users (
  rowid int, name string, age int, gender string
);
INSERT INTO users VALUES
  (1, 'Jacob', 20, 'Male'),
  (2, 'Mason', 22, 'Male'),
  (3, 'Sophia', 35, 'Female'),
  (4, 'Ethan', 55, 'Male'),
  (5, 'Emma', 15, 'Female'),
  (6, 'Noah', 46, 'Male'),
  (7, 'Isabella', 20, 'Female')
;

CREATE TABLE input as
SELECT
  rowid,
  array_concat(
    categorical_features(
      array('name', 'gender'),
      name, gender
    ),
    quantitative_features(
      array('age'),
      age
    )
  ) AS features
FROM
  users;

select * from input limit 2;
input.rowid input.features
1 ["name#Jacob","gender#Male","age:20.0"]
2 ["name#Mason","gender#Male","age:22.0"]

Usage

Custom rule for binning

You can provide a custom rule for binning as follows:

select 
  features as original,
  feature_binning(
    features,
    -- [-INF-10.0], (10.0-20.0], (20.0-30.0], (30.0-40.0], (40.0-INF]
    map('age', array(-infinity(), 10.0, 20.0, 30.0, 40.0, infinity()))
  ) as binned
from
  input;
original binned
["name#Jacob","gender#Male","age:20.0"] ["name#Jacob","gender#Male","age:1"]
["name#Mason","gender#Male","age:22.0"] ["name#Mason","gender#Male","age:2"]
["name#Sophia","gender#Female","age:35.0"] ["name#Sophia","gender#Female","age:3"]
["name#Ethan","gender#Male","age:55.0"] ["name#Ethan","gender#Male","age:4"]
["name#Emma","gender#Female","age:15.0"] ["name#Emma","gender#Female","age:1"]
["name#Noah","gender#Male","age:46.0"] ["name#Noah","gender#Male","age:4"]
["name#Isabella","gender#Female","age:20.0"] ["name#Isabella","gender#Female","age:1"]

Binning based on Quantiles

You can apply feature binning based on quantiles.

Suppose converting age values into 3 bins:

SELECT
  map('age', build_bins(age, 3)) AS quantiles_map
FROM
  users

{"age":[-Infinity,18.333333333333332,30.666666666666657,Infinity]}

In the above query result, you can find 4 values for age in quantiles_map. It's a threshold for 3 bins.

WITH bins as (
  SELECT
    map('age', build_bins(age, 3)) AS quantiles_map
  FROM
    users
)
select
  feature_binning(
    array('age:-Infinity', 'age:-1', 'age:0', 'age:1', 'age:18.333333333333331', 'age:18.333333333333332'), quantiles_map
  ),
  feature_binning(
    array('age:18.3333333333333333', 'age:18.33333333333334', 'age:19', 'age:30', 'age:30.666666666666656', 'age:30.666666666666657'), quantiles_map
  ),
  feature_binning(
    array('age:666666666666658', 'age:30.66666666666666', 'age:31', 'age:99', 'age:Infinity'), quantiles_map
  ),
  feature_binning(
    array('age:NaN'), quantiles_map
  ),
  feature_binning( -- not in map
    array('weight:60.3'), quantiles_map
  )
from
  bins

["age:0","age:0","age:0","age:0","age:0","age:0"] ["age:0","age:1","age:1","age:1","age:1","age:1"] ["age:2","a ge:2","age:2","age:2","age:2"] ["age:3"] ["weight:60.3"]

The following query shows more practical usage:

WITH bins AS (
  SELECT
    map('age', build_bins(age, 3)) AS quantiles_map
  FROM
    users
)
SELECT
  feature_binning(features, quantiles_map) AS features
FROM
  input
  CROSS JOIN bins;
features: array<features::string>
["name#Jacob","gender#Male","age:1"]
["name#Mason","gender#Male","age:1"]
["name#Sophia","gender#Female","age:2"]
["name#Ethan","gender#Male","age:2"]
...

Concrete Example

Here, we show a more practical usage of feature_binning UDF that applied feature binning for given feature vectors.

WITH extracted as (
  select 
    extract_feature(feature) as index,
    extract_weight(feature) as value
  from
    input l
    LATERAL VIEW explode(features) r as feature
  where
    instr(feature, ':') > 0 -- filter out categorical features
),
mapping as (
  select
    index, 
    build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
  from
    extracted
  group by
    index
),
bins as (
   select 
    to_map(index, quantiles) as quantiles 
   from
    mapping
)
select
  l.features as original,
  feature_binning(l.features, r.quantiles) as features
from
  input l
  cross join bins r
-- limit 10;
original features
["name#Jacob","gender#Male","age:20.0"] ["name#Jacob","gender#Male","age:2"]
["name#Isabella","gender#Female","age:20.0"] ["name#Isabella","gender#Female","age:2"]
... ...

Create a mapping table by Feature Binning

WITH bins AS (
  SELECT build_bins(age, 3) AS quantiles
  FROM users
)
SELECT
  age, feature_binning(age, quantiles) AS bin
FROM
  users CROSS JOIN bins;
age:int bin: int
20 1
22 1
35 2
55 2
15 0
46 2
20 1

Function Signatures

UDAF build_bins(weight num_of_bins [, auto_shrink=false])

Input

weight: int|bigint|float|double num_of_bins: int [auto_shrink: boolean = false]
weight greather than or equals to 2 behavior when separations are repeated: T=>skip, F=>exception

Output

quantiles: array<double>
thresholds of bins based on quantiles

Note

There is the possibility quantiles are repeated because of too many num_of_bins or too few data. If auto_shrink is set to true, skip duplicated quantiles. If not, throw an exception.

UDF feature_binning(features, quantiles_map)

Input

features: array<features::string> quantiles_map: map<string, array<double>>
feature vector a map where key=column name and value=quantiles

Output

features: array<feature::string>
binned features

UDF feature_binning(weight, quantiles)

Input

weight: int|bigint|float|double quantiles: array<double>
weight array of separation value

Output

bin: int
categorical value (bin ID)

results matching ""

    No results matching ""