In a context of anomaly detection, there are two types of anomalies, outlier and change-point, as discussed in this section. Hivemall has two functions which respectively detect outliers and change-points; the former is Local Outlier Detection, and the latter is Singular Spectrum Transformation.

In some cases, we might want to detect outlier and change-point simultaneously in order to figure out characteristics of a time series both in a local and global scale. ChangeFinder is an anomaly detection technique which enables us to detect both of outliers and change-points in a single framework. A key reference for the technique is:

Outlier and Change-Point Detection using ChangeFinder

By using Twitter's time series data we prepared in this section, let us try to use ChangeFinder on Hivemall.

use twitter;

A function changefinder() can be used in a very similar way to sst(), a UDF for Singular Spectrum Transformation. The following query detects outliers and change-points with different thresholds:

SELECT
  num,
  changefinder(value, "-outlier_threshold 0.03 -changepoint_threshold 0.0035") AS result
FROM
  timeseries
ORDER BY num ASC
;

As a consequence, finding outliers and change-points in the data points should be easy:

num result
... ...
16 {"outlier_score":0.051287243859365894,"changepoint_score":0.003292139657059704,"is_outlier":true,"is_changepoint":false}
17 {"outlier_score":0.03994335565212781,"changepoint_score":0.003484242549446824,"is_outlier":true,"is_changepoint":false}
18 {"outlier_score":0.9153515196592132,"changepoint_score":0.0036439645550477373,"is_outlier":true,"is_changepoint":true}
19 {"outlier_score":0.03940593403992665,"changepoint_score":0.0035825157392152134,"is_outlier":true,"is_changepoint":true}
20 {"outlier_score":0.27172093630215555,"changepoint_score":0.003542822324886785,"is_outlier":true,"is_changepoint":true}
21 {"outlier_score":0.006784031454620809,"changepoint_score":0.0035029441620275975,"is_outlier":false,"is_changepoint":true}
22 {"outlier_score":0.011838969816513334,"changepoint_score":0.003519599336202336,"is_outlier":false,"is_changepoint":true}
23 {"outlier_score":0.09609857927656007,"changepoint_score":0.003478729798944702,"is_outlier":true,"is_changepoint":false}
24 {"outlier_score":0.23927000145081978,"changepoint_score":0.0034338476757061237,"is_outlier":true,"is_changepoint":false}
25 {"outlier_score":0.04645945042821564,"changepoint_score":0.0034052091926036914,"is_outlier":true,"is_changepoint":false}
... ...

ChangeFinder for Multi-Dimensional Data

ChangeFinder additionally supports multi-dimensional data. Let us try this on synthetic data.

Data preparation

You first need to get synthetic 5-dimensional data from HERE and uncompress to a synthetic5d.t file:

$ head synthetic5d.t
0#71.45185411564131#54.456141290891466#71.78932846605129#76.73002575911214#81.71265594077099
1#58.374230566196786#57.9798651697631#75.65793151143754#73.76101930504493#69.50315805346253
2#66.3595943896099#52.866595973073295#76.7987325026338#78.95890786682095#74.67527753118893
3#58.242560151043236#52.449574430621226#73.20383710416358#77.81502394558085#76.59077723631032
4#55.89878019680371#52.69611781315756#75.02482987204824#74.11154526135637#75.86881583921179
5#56.93554246767561#56.55687136423391#74.4056583421317#73.82419594611444#71.3017150863033
6#65.55704393868689#52.136347983404974#71.14213602046532#72.87394198561904#73.40278960429114
7#56.65735280596217#57.293605941063035#75.36713340281246#80.70254745535183#75.32423746923857
8#61.22095211566127#53.47603728473668#77.48215321523912#80.7760107465893#74.43951386292905
9#52.47574856682803#52.03250504263378#77.59550963025158#76.16623830860391#76.98394610743863

The first column indicates a dummy timestamp, and the following four columns are values in each dimension.

Second, the following Hive operations create a Hive table for the data:

create database synthetic;
use synthetic;
CREATE EXTERNAL TABLE synthetic5d (
    num INT,
  value1 DOUBLE,
    value2 DOUBLE,
    value3 DOUBLE,
    value4 DOUBLE,
    value5 DOUBLE
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/dataset/synthetic/synthetic5d';

Finally, you can load the synthetic data to the table by:

$ hadoop fs -put synthetic5d.t /dataset/synthetic/synthetic5d

Detecting outliers and change-points of the 5-dimensional data

Using changefinder() for multi-dimensional data requires us to pass the first argument as an array. In our case, the data is 5-dimensional, so the first argument should be an array with 5 elements. Except for that point, basic usage of the function is same as the previous 1-dimensional example:

SELECT
  num,
  changefinder(array(value1, value2, value3, value4, value5), 
               "-outlier_threshold 0.015 -changepoint_threshold 0.0045") AS result
FROM
  synthetic5d
ORDER BY num ASC
;

Output might be:

num result
... ...
90 {"outlier_score":0.014014718350674471,"changepoint_score":0.004520174906936474,"is_outlier":false,"is_changepoint":true}
91 {"outlier_score":0.013145554693405614,"changepoint_score":0.004480713237042799,"is_outlier":false,"is_changepoint":false}
92 {"outlier_score":0.011631759675989617,"changepoint_score":0.004442031415725316,"is_outlier":false,"is_changepoint":false}
93 {"outlier_score":0.012140065235943798,"changepoint_score":0.004404170732687428,"is_outlier":false,"is_changepoint":false}
94 {"outlier_score":0.012555903663657997,"changepoint_score":0.0043670553008087355,"is_outlier":false,"is_changepoint":false}
95 {"outlier_score":0.013503247137325314,"changepoint_score":0.0043306667027628466,"is_outlier":false,"is_changepoint":false}
96 {"outlier_score":0.013896893553710932,"changepoint_score":0.004294969164345527,"is_outlier":false,"is_changepoint":false}
97 {"outlier_score":0.01322874844578159,"changepoint_score":0.004259994590721001,"is_outlier":false,"is_changepoint":false}
98 {"outlier_score":0.019383618511936707,"changepoint_score":0.004225604978710543,"is_outlier":true,"is_changepoint":false}
99 {"outlier_score":0.01121758589038846,"changepoint_score":0.004191881992962213,"is_outlier":false,"is_changepoint":false}
... ...

results matching ""

    No results matching ""