In a context of anomaly detection, there are two types of anomalies, outlier and change-point, as discussed in this section. Hivemall has two functions which respectively detect outliers and change-points; the former is Local Outlier Detection, and the latter is Singular Spectrum Transformation.
In some cases, we might want to detect outlier and change-point simultaneously in order to figure out characteristics of a time series both in a local and global scale. ChangeFinder is an anomaly detection technique which enables us to detect both of outliers and change-points in a single framework. A key reference for the technique is:
- K. Yamanishi and J. Takeuchi. A Unifying Framework for Detecting Outliers and Change Points from Non-Stationary Time Series Data. KDD'02.
Outlier and Change-Point Detection using ChangeFinder
By using Twitter's time series data we prepared in this section, let us try to use ChangeFinder on Hivemall.
use twitter;
A function changefinder()
can be used in a very similar way to sst()
, a UDF for Singular Spectrum Transformation. The following query detects outliers and change-points with different thresholds:
SELECT
num,
changefinder(value, "-outlier_threshold 0.03 -changepoint_threshold 0.0035") AS result
FROM
timeseries
ORDER BY num ASC
;
As a consequence, finding outliers and change-points in the data points should be easy:
num | result |
---|---|
... | ... |
16 | {"outlier_score":0.051287243859365894,"changepoint_score":0.003292139657059704,"is_outlier":true,"is_changepoint":false} |
17 | {"outlier_score":0.03994335565212781,"changepoint_score":0.003484242549446824,"is_outlier":true,"is_changepoint":false} |
18 | {"outlier_score":0.9153515196592132,"changepoint_score":0.0036439645550477373,"is_outlier":true,"is_changepoint":true} |
19 | {"outlier_score":0.03940593403992665,"changepoint_score":0.0035825157392152134,"is_outlier":true,"is_changepoint":true} |
20 | {"outlier_score":0.27172093630215555,"changepoint_score":0.003542822324886785,"is_outlier":true,"is_changepoint":true} |
21 | {"outlier_score":0.006784031454620809,"changepoint_score":0.0035029441620275975,"is_outlier":false,"is_changepoint":true} |
22 | {"outlier_score":0.011838969816513334,"changepoint_score":0.003519599336202336,"is_outlier":false,"is_changepoint":true} |
23 | {"outlier_score":0.09609857927656007,"changepoint_score":0.003478729798944702,"is_outlier":true,"is_changepoint":false} |
24 | {"outlier_score":0.23927000145081978,"changepoint_score":0.0034338476757061237,"is_outlier":true,"is_changepoint":false} |
25 | {"outlier_score":0.04645945042821564,"changepoint_score":0.0034052091926036914,"is_outlier":true,"is_changepoint":false} |
... | ... |
ChangeFinder for Multi-Dimensional Data
ChangeFinder additionally supports multi-dimensional data. Let us try this on synthetic data.
Data preparation
You first need to get synthetic 5-dimensional data from HERE and uncompress to a synthetic5d.t
file:
$ head synthetic5d.t
0#71.45185411564131#54.456141290891466#71.78932846605129#76.73002575911214#81.71265594077099
1#58.374230566196786#57.9798651697631#75.65793151143754#73.76101930504493#69.50315805346253
2#66.3595943896099#52.866595973073295#76.7987325026338#78.95890786682095#74.67527753118893
3#58.242560151043236#52.449574430621226#73.20383710416358#77.81502394558085#76.59077723631032
4#55.89878019680371#52.69611781315756#75.02482987204824#74.11154526135637#75.86881583921179
5#56.93554246767561#56.55687136423391#74.4056583421317#73.82419594611444#71.3017150863033
6#65.55704393868689#52.136347983404974#71.14213602046532#72.87394198561904#73.40278960429114
7#56.65735280596217#57.293605941063035#75.36713340281246#80.70254745535183#75.32423746923857
8#61.22095211566127#53.47603728473668#77.48215321523912#80.7760107465893#74.43951386292905
9#52.47574856682803#52.03250504263378#77.59550963025158#76.16623830860391#76.98394610743863
The first column indicates a dummy timestamp, and the following four columns are values in each dimension.
Second, the following Hive operations create a Hive table for the data:
create database synthetic;
use synthetic;
CREATE EXTERNAL TABLE synthetic5d (
num INT,
value1 DOUBLE,
value2 DOUBLE,
value3 DOUBLE,
value4 DOUBLE,
value5 DOUBLE
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/dataset/synthetic/synthetic5d';
Finally, you can load the synthetic data to the table by:
$ hadoop fs -put synthetic5d.t /dataset/synthetic/synthetic5d
Detecting outliers and change-points of the 5-dimensional data
Using changefinder()
for multi-dimensional data requires us to pass the first argument as an array. In our case, the data is 5-dimensional, so the first argument should be an array with 5 elements. Except for that point, basic usage of the function is same as the previous 1-dimensional example:
SELECT
num,
changefinder(array(value1, value2, value3, value4, value5),
"-outlier_threshold 0.015 -changepoint_threshold 0.0045") AS result
FROM
synthetic5d
ORDER BY num ASC
;
Output might be:
num | result |
---|---|
... | ... |
90 | {"outlier_score":0.014014718350674471,"changepoint_score":0.004520174906936474,"is_outlier":false,"is_changepoint":true} |
91 | {"outlier_score":0.013145554693405614,"changepoint_score":0.004480713237042799,"is_outlier":false,"is_changepoint":false} |
92 | {"outlier_score":0.011631759675989617,"changepoint_score":0.004442031415725316,"is_outlier":false,"is_changepoint":false} |
93 | {"outlier_score":0.012140065235943798,"changepoint_score":0.004404170732687428,"is_outlier":false,"is_changepoint":false} |
94 | {"outlier_score":0.012555903663657997,"changepoint_score":0.0043670553008087355,"is_outlier":false,"is_changepoint":false} |
95 | {"outlier_score":0.013503247137325314,"changepoint_score":0.0043306667027628466,"is_outlier":false,"is_changepoint":false} |
96 | {"outlier_score":0.013896893553710932,"changepoint_score":0.004294969164345527,"is_outlier":false,"is_changepoint":false} |
97 | {"outlier_score":0.01322874844578159,"changepoint_score":0.004259994590721001,"is_outlier":false,"is_changepoint":false} |
98 | {"outlier_score":0.019383618511936707,"changepoint_score":0.004225604978710543,"is_outlier":true,"is_changepoint":false} |
99 | {"outlier_score":0.01121758589038846,"changepoint_score":0.004191881992962213,"is_outlier":false,"is_changepoint":false} |
... | ... |