All IS WELL :: Multidimensional Streaming Outlier Detection - Introduction & Individual Data Points as Outliers

Multidimensional Streaming Outlier Detection - Introduction & Individual Data Points as Outliers

Introduction

이전의 section에서는 multiple correlated time series 가 있어도 각 time series는 한 단위로서 처리 되었습니다. 다차원 데이터의 경우 각 instance는 분할 할 수 없는 $d$-dimensions 으로 구성되며 각 individual time series 에서 높은 수준의 temporal continuity을 보입니다.
multidimensional data streams에서는 높은 수준의 temporal continuity가 반드시 존재한다 볼 수 없습니다.
- e.g. multidimensional test stream : text record의 각 attribute의 frequency로 다음 record를 예측할 수 없습니다.
- But, outlier를 예측하기 위해 document 에 있는 present words를 aggregate level로 stream의 history에서 비교 가능합니다.
각 상황에서 기대되는 temporal continuity 가 다르기 때문에 multidimensional data stream 와 multivariate time series에서의 outlier analysis 다릅니다.
multidimensional streaming scenario는 multidimensional outlier analysis 에 사용되는 traditional methods에 temporal component를 추가한 것 입니다. ( 그러나 time series data에 비해 temporal component는 약함 )
multidimensional data streams 에서는 outliers 가 빠르게 발견되어야 하기 때문에 efficiency가 주요 문제입니다.
multidimensional data streams 의 두 가지 type
- individual records 중심의 outlier detection
  - e.g. a first news story on a specific topic represent an outlier of this type.
  - 이런 outlier는 전에 발견된 점이 아니기 때문에 novelty로 불립니다.
- multidimensional data에서의 aggregate trends가 변하는 경우 중심
  - e.g. unusual event that lead to a burst to new stories on a specific topic.
  - 특정 time window에 기반한 aggregated outlier로 표현
  - 두 번째의 경우 change point는 거의 첫 번째 경우의 individual outlier로 시작합니다. 그러나 첫 번째 경우에서 각 individual outlier라해서 언제나 aggregate change point가 되는 것은 아닙니다.

9.4.1 Individual Data Points as Outliers

individual data points를 detecting 하는 문제는 특히 전체의 data stream이 이용될 때 unsupervised novelty detection과 깊은 관계있습니다.
이런 novelties는 종종 trend-setters이나, 결국 normal data의 일부분으로 될 수 있으나 individual record가 data points의 window로서 outlier로 declare 되면 이는 반드시 novelty로 볼 수 없습니다.
이런 경우 거의 직접적으로 알고리즘을 data points의 window에 적용하는 proximity-based algorithms이 incremental scenario를 generalize하는데 용이합니다.

9.4.1.1 Proximity-Based Algorithms

The original distnace-based 정의는 time-window를 추가하여 다음과 같이 수정
The outlier score of a data point is defined in terms of its k-nearest neighbor distance to data points in a time window of length W.
data points 의 the entire window가 main memory에 저장 가능할 때 outlier를 판별하기 쉽습니다.
반면 많은 경우에 entire window를 main memory에 저장할 수 없는 경우가 있는데, 이때 distance-based pruning을 위한 efficient indexes를 생성하기 어려워 이런 streaming scenario는 challenging. 이런 경우에 outlier의 정확한 determination은 computational too expensive하기 때문에 approximate outlier를 이용합니다. 이런 outlier detection process는 accuracy guarantees가 provided.
previous window에 대한 small subsamples 부터의 averaged ensemble score는 variance reduction effects 때문에 exact outlier score보다 효율적입니다.
[443]에서 LOF이 incremental scenario로 확장되었고, 이 방식은 data points의 삽입,삭제와 다룰 수 있음. 삽입은 다음 두 개의 과정을 따릅니다.
- existing LOF model관련 inserted data의 statistics 계산합니다
- existing LOF model is updated in terms of densities, reachability distance, and outlierness of the underlying data points. ⇒ 새로운 data 추가의 영향 때문에 parameters of many of the existing data points need to be updated. 추가된 데이터 주변만 update이므로 모든 점이 update되는 것 x
distance based methods는 computationally expensive하기 때문에 clustering based approach를 통해 complexity 향상 가능합니다.
clustering based는 충분한 데이터 수가 있을 때 효과적인데 data stream은 보통 data가 많아서 clustering이 available.
streaming clustering algorithm context에서 the formation of new clusters 는 unsupervised novelties와 associated. 그러나 entire history가 current set of cluster에 의한 limited-space summary에 반영되지 않으면 new cluster생성 x
- e.g. [28] incoming data가 data내에 존재하는 existing clusters의 특정 statistical radius에 있지 않으면 new cluster 생성에 regulate. ⇒ 이런 incoming data는 outlier로 고려됩니다.
- 많은 case에서 이러한 outlier는 algorithms이 진행되면서 더 많은 data가 그 cluster에 추가될 수록 new trend의 beginning이 됩니다.
- In some cases, 이러한 data points는 novelties에 해당되고 In other cases, such points는 오래전에 봤으나 지금의 clusters에 해당되지 않는 trend일 수 있습니다.
- data stream을 오래 저장하지 않는 이상 이러한 두 종류의 outliers를 구분할 수 없습니다.

9.4.1.2 Probabilistic Algorithms

proababilistic algorithms은 limited data에서 추천되지 않으나 일반적으로 data stream에서는 data가 충분하여 learning algorithm이 효율적으로 수행되는 한 limited data issue는 적음.
[578]에서 data sets의 mixed attribute를 통한 mixture models 생성하는 것 제안. 모델에 data point가 추가되기 전후 incoming data point가 fit 한지 계산. 이는 expectation-maximization algorithm 이용합니다.
optimization process를 향한 iterative approach때문에 Expectation-maximization algorithms은 naturally하게 streaming setting에 적합합니다.

9.4.1.3. High-Dimensional Scenario

high dimensional projected clustering algorithms 때문에 streaming scenario 와 high-dimensional data를 조합하는 것은 특히 challenging.
Data points which start new clustsers in the stream can typically be reported as outliers.

SPOT algorithms

high dimensional data streams 에서의 outlier detection하기 위해 고안
data의 sparse subspaces를 directly하게 찾도록 고안
underlying data stream의 summary에 기반한 decaying cell 이용. 이는 projected cell을 결정하기 위해 사용됨. data points가 속한 cell 이 sparse subspaces에 해당되면 outlier로 판별

Ensemble

ensemble이 많이 연구되지 않았지만, variable size의 subsamples과 rotated bagging 을 결합하여 실시간으로 outlier score를 제공할 수 있습니다. 이는 any base outlier detector와 함께 이용 가능하며 accuracy 향상에 도움이 될 가능성 높습니다.
ensemble 의 resulting data sets는 extremely compact하기 때문에 multiple data sets에 대한 small history maintain 가능합니다.

* 이 글은 Charu C. Aggarwal 의 Outlier Analysis Second Edition을 정리한 글입니다.

'인공지능 > ML' 카테고리의 다른 글

Position Outliers in Discrete Sequences (0)	2021.08.13
Outlier Detection in Discrete Sequences - Position & Combination Outliers (0)	2021.08.13
PCA 변환 과정 및 특성 (0)	2021.06.13
Outlier Variance Reduction Methods in Ensemble (0)	2021.06.01
데이터 분석을 위한 기본 DATA TYPES (0)	2021.06.01

일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Multidimensional Streaming Outlier Detection - Introduction & Individual Data Points as Outliers

Introduction

9.4.1 Individual Data Points as Outliers

9.4.1.1 Proximity-Based Algorithms

9.4.1.2 Probabilistic Algorithms

9.4.1.3. High-Dimensional Scenario

'인공지능 > ML' 카테고리의 다른 글

검색

티스토리툴바