Isolation Forests for Outlier Detection

Isolation Forests and Outlier

Isolatino forest는 isolation trees를 이용한 앙상블 combination입니다. isolation tree에서는 data는 랜덤으로 선택된 attribute, 랜덤으로 선택된 partition point에서 axis-parallel cuts 하게 재귀적으로 한 node에 하나의 instance가 남을 때까지 나눠집니다. isolation forest의 train 단계에서 unsupervised equivaluents of decision trees인 multiple isolation trees를 생성하고 각 tree는 binary하게 나뉘어집니다. 이때 별도의 설정이 없다면 leaf node에 한 instance가 존재하도록 N개의 데이터에서 N개의 leaf node가 생기며, 이런 parameter free 특징은 unsupervisded 방식에 적합합니다. (height parameter를 통한 early termination도 가능합니다.) outlier data points는 sparse region에 위치해있기 때문에, outlier 의 tree branch는 깊지 않습니다. 이를 활용하기 위해 root node로부터 leaf 까지의 거리를 outlier score로 사용합니다. data points는 isolation forest의 다른 tree들에 대해 average path length 가 구해지며, average path lengh가 짧아질 수록 outlier로 볼 수 있습니다.

Isolation Forests and Subspace outlier detection

Isolation forest는 subspace outlier detection과 밀접한 관련이 있는데, different branches는 data 의 다른 local subspace region에 해당합니다. smaller path는 outlier 가 isolated되는 lower dimensionality에 해당합니다. point를 isolate하는 dimensionality가 작을 수록 더 강한 outlier가 존재할 가능성이 높습니다. 예를 들어 30대 이하이면서 알츠하이머 병을 가지고 있는 경우 두 가지 차원만으로도 이상치를 판단할 수 있습니다.

Isolation Forests Time complexity w/ and w/o subsampling

평균적으로 시간 복잡도는 이며 $\theta(Nlog(N))$ 이며 각 isolation tree마다의 공간 복잡도는 $O(N)$ 입니다. subsampling 을 통해 computational effciency, accuracy, diversity 향상시킬 수 있습니다. 보통 subsample size를 256로 정할 때 잘 작동하나 data set에 따라 다를 수 있습니다. subsample size가 일정하면 data set size와 관계없이 tree를 building 하는데 contant computational and memory requirements 소요됩니다. test phase에서 data set이 전부 사용됐다면 average-case로 각 데이터 마다 $\theta(log(N))$ 의 complexity가 필요합니다. size 256의 subsamples가 이용됐다면 각 point에 대해서 test phase는 constant time이 필요합니다. subsampling이 constant 샘플 수와 constant trial 수와 함께 이뤄졌다면 running time은 training phase일 때 constant, test phase일 때 $O(N)$ 으로 constant 하며, space complexity또한 constant합니다. train phase에서 subsample 을 통해 isolation tree생성 후 test phase 때 out-of-sample을 train phase 때 만든 isolation tree에 적용합니다. (decision tree와 비슷)

Isolation Tree Algorithm

모든 points 를 포함하는 root node 생성 & further splitting을 위한 후보 nodes 를 담은 list $C$ 생성하고 거기에 root node 추가 ( 현재 root node만 $C$ 에 존재 )
$C$ 에서 node $R$ 을 랜덤으로 선택하고 $R$ 을 $C$ 에서 제거
을 $R_1,R_2$ 로 분리할 랜덤 attribute $i$ 선택 및 그 안에서 random value $a$ 를 정하고 그 값을 기준으로 $x_1 \leq a$ 는 $R_1$ 으로, 그 외의 $x_i$ 는 $R_2$ 로 split. random value $a$ 는 attribute $i$ 에서 min value와 max value 사이에서 uniformly하게 랜덤으로 추출됨
만약 $R_i (i\in \{1,2\})$ 이 두 개이상의 points를 포함하고 있다면 $R_i$ 를 $C$ 에 포함하고 그렇지 않으면 $R_i$ 를 isolation-tree leaf로 designate.

Further Enhancements for Subspace Selection

Isolation forest 방식은 Kurtosis measure를 통한 feature pre-selection을 통해 성능이 더 좋아지는데, Kurtosis는 다음과 같이 계산됩니다.

non-uniform한 features는 높은 첨도를 지니고 첨도를 anomaly detection의 feature selection measure로도 볼 수 있습니다. univariate 첨도 값으로 ranking매긴 후 높은 값을 갖는 feature를 이용하여 preselect random forest construct하는 방식도 존재합니다. 이 방식은 global subspace selection을 반환하나 이 방식은 feature bagging 처럼 랜덤일지라도 different local subspaces를 explore가 여전히 가능한 random split approach입니다.

Kurtosis measure의 generalization을 multidimensional Kurtosis라 부릅니다. 이 measure는 위의 식을 이용하여 features들의 subsets을 jointly point의 마할라노비스 distance로 계산합니다. 일반적으로 univariate Kurtosis보다 성능이 좋지만 structured way로 feature subset을 exploration하는 식으로 묶여서 computationally expensive합니다. computational complexity가 문제가 되지 않을 때 generalized Kurtosis measure를 isolation forest에 적용해보는 것도 좋습니다.

Early Termination

tree가 full height 로 성장하지 못하게 하는 방법을 통해서도 enhance할 수 있습니다. node가 duplicate instances를 갖거나 특정 threshold height를 넘자마자 node의 성장을 stop 하는 방식을 통해 termination을 할 수 있습니다. 성장이 멈춘 node들의 path length를 추정하기 위해 additional credit이 이용되는데, node containing $r$ instances, its additional credit $c(r)$ 은 binary search tree에서의 expected path length로 다음과 같이 정의됩니다. 이 credit은 root node로부터의 path length에 더해집니다.

Relationship to Clustering Ensembles and Histograms

isolation forest는 clustering ensemble로 볼 수 있습니다. isolation tree는 data로 부터 계층적인 projected clusters를 생성하는데, 그 cluster는 bounding boxes로 정의됩니다. ⇒ cluster(node)의 a bounding box는 root 부터 해당 node로 axis-parallel split되는 순서에 의해 정의됩니다. However, compared to most projected clustering methods, the isolation tree is extremely randomized because of its focus on ensemble-centric performance.

basic isolation forest는 extremely randomized clustering forests (ERC-forest) 의 variation으로 볼 수 있습니다. 주요 차이는 ERC-Forests는 class label로 적은 supervision을 가능하기 위해 각 node에 대해 multiple trials 이용하지만 Isolation tree는 trial 수를 1로 정하고 full length로 자라게 하여 unsupervised isolation tree 이용 가능하게 하는 것입니다. ERC-Forests 와 isolation forests는 space-partitioning이기 때문에 histogram and density based methods에 쓰이는 intuitive similarities 공유 가능합니다.

isolation tree 는 계층적이고 randomized grid region을 생성하는데, 이 grid region의 expected volume은 각 split마다 factor 2로 감소합니다. ⇒ the path length는 a single data를 포함하는 maximal grid region (fractional) volume의 negative logarithm ( fractional 이므로 negative logarithm ) 에 대한 rough surrogate로 볼 수 있습니다. 이는 전통적인 histogram에 이용된 log-likelihood density의 notion과 비슷하나 histogram과 달리 isolation forests 는 parameterized되지 않아 varying distribution을 갖는 data를 유연하게 다룰 수 있습니다.

cluster의 subspace dimensionality 합과 해당 cluster의 point 수를 outlier score로 이용하는 방법도 있는데, 이때 cluster의 subspace dimensionality는 isolation tree의 rough proxy path length입니다. 또 다른 방법으로 isolation tree의 variation인 fixed-height trees를 이용하는 half-space trees을 사용하는 경우도 있습니다. 여기서 한 point의 relavant node에 있는 point수는 outlier score로 이용됩니다. ⇒ 이는 가장 가까운 cluster의 point 수가 outlier score의 중요한 부분으로 사용되는 clustering-based outlier detection methods와 비슷합니다.

* 이 글은 Charu C. Aggarwal 의 Outlier Analysis Second Edition을 정리한 글입니다.

Outlier Variance Reduction Methods in Ensemble (0)	2021.06.01
데이터 분석을 위한 기본 DATA TYPES (0)	2021.06.01
ROC Curve 설명(해석) 및 그리기(구현)-Python (2)	2021.02.28
Precision-Recall Curves 설명 및 그리기(구현)-Python (2)	2021.02.27
이상치 감지를 위한 Depth-Based Method (0)	2021.02.25

일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

All IS WELL