Approximate Counting using HyperLogLog
count(distinct value)
can often cause memory exhausted errors where input data and the cardinality of value are large.
HyperLogLog is an efficient algorithm for approximating the number of distinct elements in a multiset.
Hivemall implements HyperLogLog++ in approx_count_distinct
.
Usage
approx_count_distinct
is less accurate than COUNT(DISTINCT expression), but performs better on huge input.
select
count(distinct rowid) as actual,
approx_count_distinct(rowid) as default_p
from
train;
actual | default_p |
---|---|
45840617 | 45567770 |
select
approx_count_distinct(rowid, '-p 4') as p4,
approx_count_distinct(rowid, '-p 6 -sp 6') as p6_sp6,
approx_count_distinct(rowid, '-p 14') as p14,
approx_count_distinct(rowid, '-p 15') as p15,
approx_count_distinct(rowid, '-p 16') as p16,
approx_count_distinct(rowid, '-p 24') as p24,
approx_count_distinct(rowid, '-p 25') as p25,
approx_count_distinct(rowid, '-p 15 -sp 15') as p15_sp15
from
train;
p4 | p6_sp6 | p14 | p15 | p16 | p24 | p25 | p15_sp15 |
---|---|---|---|---|---|---|---|
38033066 | 49332600 | 45051015 | 45567770 | 45614484 | 45831359 | 45832280 | 45567770 |
Note
p
controls expected precision and memory consumption tradeoff and default p=15
generally works well. Find More information on this paper.
Function Signature
You can find the function signature and options of approx_count_distinct
is as follows:
select
approx_count_distinct(rowid, '-help')
from
train;
usage: HLLEvaluator [-help] [-p <arg>] [-sp <arg>]
-help Show function help
-p <arg> The size of registers for the normal set. `p` MUST be in the
range [4,sp] and 15 by the default
-sp <arg> The size of registers for the sparse set. `sp` MUST be in the
range [4,32] and 25 by the defaul