Hivemall supports feature hashing (a.k.a. the hashing trick) through the feature_hashing and mhash functions. The examples below show how the two functions differ.

feature_hashing function

feature_hashing applies MurmurHash3 hashing to features.

select feature_hashing('aaa');

4063537

select feature_hashing('aaa','-features 3');

2

select feature_hashing(array('aaa','bbb'));

["4063537","8459207"]

select feature_hashing(array('aaa','bbb'),'-features 10');

["7","1"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'));

["4063537:1.0","4063537","8459207:2.0"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');

["4063537:1.0","4063537:1","8459207:2.0"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10');

["7:1.0","7","1:2.0"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');

["1:2.0","7:1.0","7:1"]

select feature_hashing(array(1,2,3));

["11293631","3322224","4331412"]

select feature_hashing(array('1','2','3'));

["11293631","3322224","4331412"]

select feature_hashing(array('1:0.1','2:0.2','3:0.3'));

["11293631:0.1","3322224:0.2","4331412:0.3"]

select feature_hashing(features), features from training_fm limit 2;

["1803454","6630176"]    ["userid#5689","movieid#3072"]
["1828616","6238429"]    ["userid#4505","movieid#2331"]

select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));

["1828616:3.3","6238429:4.999","6238429"]

select feature_hashing();

usage: feature_hashing(array<string> features [, const string options]) -
       returns a hashed feature vector in array<string> [-features <arg>]
       [-libsvm]
 -features,--num_features <arg>   The number of features [default:
                                  16777217 (2^24)]
 -libsvm                          Returns in libsvm format
                                  (<index>:<value>)* sorted by index
                                  ascending order

Note

Hash values start from 1; 0 is reserved by the system for the bias clause. The default number of features is 16777217 (2^24). You can control the number of features with the -features (or --num_features) option.
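The behaviour of the options can be pictured with a short Python sketch. It is an illustration only, not Hivemall's implementation: `zlib.crc32` stands in for MurmurHash3, so the indexes it produces will not match the outputs shown earlier, and the helper names `hash_index` and `hash_features` are hypothetical. It assumes that the index is the hash modulo the number of features plus 1, that a feature is split at its first colon, and that `-libsvm` fills in a value of 1 for features without one and sorts by index, as the examples suggest.

```python
import zlib


def hash_index(name: str, num_features: int) -> int:
    """Map a feature name to an index in [1, num_features]; 0 stays reserved."""
    raw = zlib.crc32(name.encode("utf-8"))  # stand-in hash, NOT Hivemall's MurmurHash3
    return raw % num_features + 1


def hash_features(features, num_features, libsvm=False):
    """Hash the name part of each 'name' or 'name:value' feature (assumed format)."""
    hashed = []
    for feature in features:
        name, sep, value = feature.partition(":")  # assumption: split at the first colon
        index = hash_index(name, num_features)
        if libsvm and not sep:
            sep, value = ":", "1"  # assumed default value, as in the -libsvm examples
        hashed.append((index, f"{index}{sep}{value}"))
    if libsvm:
        hashed.sort(key=lambda pair: pair[0])  # ascending index; ties keep input order
    return [text for _, text in hashed]


features = ["aaa:1.0", "aaa", "bbb:2.0"]
print(hash_features(features, num_features=10))
print(hash_features(features, num_features=10, libsvm=True))
```

With `num_features=10` every index falls between 1 and 10, and features that share a name share an index regardless of their values.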

mhash function

describe function extended mhash;

mhash(string word) returns a murmurhash3 INT value starting from 1

select mhash('aaa');

4063537

Note: The default number of features is 16777216 (2^24).

set hivevar:num_features=16777216;
select mhash('aaa',${num_features});

4063537

Note: mhash returns a MurmurHash3 value incremented by 1, so results start from 1. It never returns 0, which is reserved by the system.

set hivevar:num_features=1;
select mhash('aaa',${num_features});

1
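Why a single feature slot always yields 1 can be seen from a small Python sketch. It assumes, based on the notes above, that the index is the raw hash reduced modulo the number of features and then incremented by 1. `to_index` is a hypothetical helper, and the raw values are arbitrary examples, not real MurmurHash3 outputs.

```python
def to_index(raw_hash: int, num_features: int) -> int:
    """Reduce a (possibly negative) 32-bit hash to an index in [1, num_features]."""
    # Python's % yields a non-negative result for a positive modulus,
    # so negative raw hashes need no special handling here.
    return raw_hash % num_features + 1


for raw in (0, 12345, -987654321):
    # With one feature the remainder is always 0, so the index is always 1.
    print(raw, to_index(raw, 1), to_index(raw, 16777216))
```

Under this assumption the index can never be 0, whatever the raw hash is.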

Note: mhash does not consider feature values. The string is hashed as given, so 'aaa:2.0' and 'aaa' produce different values.

select mhash('aaa:2.0');

2746618
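The contrast with feature_hashing can be sketched in Python. This is an illustration under stated assumptions: `zlib.crc32` replaces MurmurHash3, so the numbers differ from Hivemall's, and `hash_whole` and `hash_name_only` are hypothetical names for the two behaviours suggested by the examples, not Hivemall functions.

```python
import zlib

NUM_FEATURES = 16777216  # the mhash default stated in the note above


def _index(text: str) -> int:
    # Stand-in hash, NOT Hivemall's MurmurHash3; result lies in [1, NUM_FEATURES].
    return zlib.crc32(text.encode("utf-8")) % NUM_FEATURES + 1


def hash_whole(feature: str) -> int:
    """mhash-like behaviour: hash the string exactly as given."""
    return _index(feature)


def hash_name_only(feature: str) -> str:
    """feature_hashing-like behaviour: hash the name, keep any ':value' suffix."""
    name, sep, value = feature.partition(":")
    return f"{_index(name)}{sep}{value}"


print(hash_whole("aaa"), hash_whole("aaa:2.0"))          # generally two unrelated integers
print(hash_name_only("aaa"), hash_name_only("aaa:2.0"))  # same index, value preserved
```

If the same index is needed for 'aaa' and 'aaa:2.0', the value has to be separated from the name before hashing, which is what the second helper does.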

Note: mhash always returns a scalar INT value.

select mhash(array('aaa','bbb'));

9566153

Note: The mhash value of an array is sensitive to element order.

select mhash(array('bbb','aaa'));

3874068
