What's One-hot encoding?
Ont-hot encoding is a method to encode categorical features by a 1-of-K (thus called 1-hot) encoding scheme.
Suppose the following table:
Company | Price |
---|---|
VW | 290 |
Toyota | 300 |
Honda | 190 |
Honda | 250 |
A one-hot encoding output is expected as follows:
Company_VW | Company_Toyota | Company_Honda | Price |
---|---|---|---|
1 | 0 | 0 | 290 |
0 | 1 | 0 | 300 |
0 | 0 | 1 | 190 |
0 | 0 | 1 | 250 |
The above one-hot table is a dense feature format and it can be expressed as follows by a sparse format:
Company | Price |
---|---|
{1} | 290 |
{2} | 300 |
{3} | 190 |
{3} | 250 |
The mapping for company name is {VW->1, Toyota->2, Honda->3}.
Now, suppose encoding two categorical variables as follows into a sparse vector.
category1 | category2 |
---|---|
cat | mammal |
dog | mammal |
human | mammal |
seahawk | bird |
wasp | insect |
wasp | insect |
cat | mammal |
dog | mammal |
human | mammal |
The one-hot encoded feature vector could be as follows:
category1 | category2 | encoded_features |
---|---|---|
cat | mammal | {1,6} |
dog | mammal | {2,6} |
human | mammal | {3,6} |
seahawk | bird | {4,7} |
wasp | insect | {5,8} |
We use this test
table for explaration.
drop table test;
create table test (species string, category string, count int);
truncate table test;
insert into table test values
('cat','mammal',9),
('dog','mammal',10),
('human','mammal',10),
('seahawk','bird',101),
('wasp','insect',3),
('wasp','insect',9),
('cat','mammal',101),
('dog','mammal',1),
('human','mammal',9);
One-hot encoding table
You can get one-hot encoding table for spieces as follows:
WITH t as (
select onehot_encoding(species) m
from test
)
select m.f1 from t;
f1 |
---|
{"seahawk":1,"cat":2,"human":3,"wasp":4,"dog":5} |
WITH t as (
select onehot_encoding(species, category) m
from test
)
select m.f1, m.f2 from t;
| f1 | f2 | | {"seahawk":1,"cat":2,"human":3,"wasp":4,"dog":5} | {"bird":6,"insect":7,"mammal":8} |
You can create a mapping table as follows:
create table mapping as
WITH t as (
select onehot_encoding(species, category) m
from test
)
select m.f1, m.f2 from t;
desc mapping;
col_name | data_type
------------|----------------
f1 | map<string,int>
f2 | map<string,int>
How to use One-hot encoding
The following query applies one-hot encoding using the mapping table.
select
t.species, m.f1[t.species],
t.category, m.f2[t.category]
from
test t
CROSS JOIN mapping m;
cat 2 mammal 8
dog 5 mammal 8
human 3 mammal 8
seahawk 1 bird 6
wasp 4 insect 7
wasp 4 insect 7
cat 2 mammal 8
dog 5 mammal 8
human 3 mammal 8
You can create a sparse feature vector as follows:
select
array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_feature
from
test t
CROSS JOIN mapping m;
sparse_feature
["2","8","count:9"]
["5","8","count:10"]
["3","8","count:10"]
["1","6","count:101"]
["4","7","count:3"]
["4","7","count:9"]
["2","8","count:101"]
["5","8","count:1"]
["3","8","count:9"]
It also can be achieved by a single query as follows:
WITH mapping as (
select
m.f1, m.f2
from (
select onehot_encoding(species, category) m
from test
) tmp
)
select
array(m.f1[t.species],m.f2[t.category],feature('count',count)) as sparse_features
from
test t
CROSS JOIN mapping m;
Note that one-hot encoding is required only for categorical variables. Feature hasing is another scalable way to encode categorical variables to numerical index.