Basic Data Types for Data Analysis

Real-world data usually mixes several attribute types, and the data points often depend on one another. This post covers the basic data types that come up in analysis.

1. Categorical, Text and Mixed Attributes

Many real-world data sets contain categorical attributes such as race, gender, or ZIP code, which take discrete, unordered values. The main challenge with such values is constructing a distance (or similarity) function that preserves the meaning of the discrete data. One approach, when a discrete attribute does not take too many distinct values, is to convert it into dummy variables and apply a regression-based model.
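The dummy-variable step can be sketched in plain Python. This is only an illustration: the helper names `one_hot` and `squared_distance` and the sample ZIP codes are made up for this post, and a real pipeline would more likely use a library encoder.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator (dummy) vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one indicator is set per value
        vectors.append(row)
    return categories, vectors


def squared_distance(u, v):
    """Squared Euclidean distance between two dummy vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical sample data, not taken from the book.
zip_codes = ["10001", "94105", "10001", "60601"]
categories, encoded = one_hot(zip_codes)

print(categories)                                # ['10001', '60601', '94105']
print(encoded[0])                                # [1, 0, 0]
print(squared_distance(encoded[0], encoded[2]))  # 0 (same ZIP code)
print(squared_distance(encoded[0], encoded[1]))  # 2 (different ZIP codes)
```

Every pair of different categories ends up at the same distance, so the encoding does not impose an artificial order on unordered values.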

Regression-based models also apply to the text domain, where correlations between word occurrences can be used to build the regression model. Some of the most successful models for text de-noising are based on latent semantic analysis, which is a form of linear regression analysis. Clustering, proximity-based, probabilistic, and frequent pattern mining methods can be applied as well.

2. When the Data Values have Dependencies

In practice, different data values are often related to each other through time, space, or network links, and these dependencies strongly affect the anomaly detection process, starting from how an anomaly is defined. The expected value of a data item is influenced by its contextual dependencies, and an outlier is defined as a deviation from this contextually expected value. Such outliers are called contextual anomalies or conditional anomalies. A collective anomaly (or collective outlier) is a set of data items that is anomalous as a group of points. Every anomaly in dependency-oriented data can be viewed as a contextual or collective anomaly, because the expected values used to identify unexpected patterns are computed from the relationships between neighboring data points.

2.1 Time-series Data and Data Streams

Time-series data is a set of values generated by continuous measurement over time. Under normal conditions, values at consecutive time stamps either do not change sharply or change smoothly. A sudden change can therefore be treated as an anomalous event, and finding anomalous points in a time series is closely tied to event detection. Such events show up as contextual or collective anomalies within the relevant time window. An example of time-series data and its outliers follows.

At time stamp 6 there is a sudden change, but the new level then becomes the new normal. At time stamp 12 the value is 3. This value occurred earlier in the series, but it is a sudden change relative to the consecutive data values, so it can be considered an outlier.
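A first-difference rule is one simple way to surface such sudden changes. The sketch below is an assumption-laden illustration: the series is a hypothetical one built to match the description (a level shift at time stamp 6, the value 3 at time stamp 12), and the function name `sudden_changes` and the threshold of 50 are arbitrary choices.

```python
def sudden_changes(values, threshold):
    """Return 1-based time stamps whose value jumps by more than
    `threshold` relative to the previous time stamp."""
    flagged = []
    for i in range(1, len(values)):
        if abs(values[i] - values[i - 1]) > threshold:
            flagged.append(i + 1)  # 0-based index -> 1-based time stamp
    return flagged


# Hypothetical series, constructed only to mirror the description in this post.
series = [3, 2, 3, 2, 3, 87, 86, 85, 87, 89, 86, 3, 84, 91, 86, 91, 88]
changes = sudden_changes(series, threshold=50)
print(changes)  # [6, 12, 13]
```

Time stamp 13 is flagged as well, because the series jumps back to its previous level. A rule this simple cannot tell the level shift at time stamp 6 apart from the isolated point at time stamp 12; that distinction needs more context than a single difference.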

A common challenge with temporal data is detecting outliers in real time, as soon as new data values arrive. Outlier detection in temporal data is closely related to change analysis, but the two are not the same. Concept drift, where a data stream changes slowly over time, is not used to identify outliers, so not every change in temporal data corresponds to an anomaly. Only abrupt changes in a data stream are used to identify outliers. In that case change analysis and anomaly detection in temporal data are so tightly linked that they are hard to separate, and a solution for one can be applied to the other.
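To make the real-time setting concrete, here is a minimal sketch of a detector that scores each value as it arrives against a sliding window of recent values. The class name `SlidingWindowDetector`, the window size, and the z-score threshold are assumptions for illustration, not a method taken from the book.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindowDetector:
    """Flag a new value if it deviates strongly from the recent window."""

    def __init__(self, window_size=5, z_threshold=3.0):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def update(self, value):
        is_outlier = False
        # Only score once the window is full, so early values are never flagged.
        if len(self.window) == self.window.maxlen:
            center = mean(self.window)
            spread = max(pstdev(self.window), 1e-9)  # avoid division by zero
            is_outlier = abs(value - center) / spread > self.z_threshold
        self.window.append(value)  # the window always moves forward
        return is_outlier


detector = SlidingWindowDetector(window_size=5, z_threshold=3.0)
stream = [10, 11, 10, 9, 10, 11, 30, 10, 9]  # hypothetical stream
flags = [detector.update(x) for x in stream]
print(flags)
# [False, False, False, False, False, False, True, False, False]
```

Because the window keeps sliding, a slow drift gradually moves the window mean and is not flagged, while an abrupt jump such as the value 30 is.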

In general, online analysis is suited to change detection, while offline analysis is suited to exploring unusual aspects of the data.

e.g. online analysis) A large change in trend in a time series can be treated as an anomaly.
e.g. offline analysis) A change in the aggregate distribution of a multidimensional data stream can be treated as an unusual event.

2.2 Discrete Sequences

In a discrete sequence, the value at a single position is predicted by comparison with the values at neighboring positions, and deviations from the predicted value are defined as contextual outliers. Discrete sequences and continuous sequences (e.g. time series) differ in the specific techniques, similarity functions, and representation data structures used, but they are similar at the conceptual level. When a discrete sequence has a temporal ordering, it can be viewed as a categorical (discrete) analog of time-series data, with a categorical value at each position. Markovian models are a typical choice for predicting discrete sequences.
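A first-order Markov model is about the simplest version of this idea: estimate transition probabilities from a reference sequence, then flag positions whose transition is rare. The sketch below assumes a hypothetical reference sequence and a hand-picked probability cutoff; `fit_transitions` and `rare_positions` are names invented for this example.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Estimate first-order transition probabilities P(cur | prev)."""
    counts = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(c.values()) for cur, n in c.items()}
        for prev, c in counts.items()
    }


def rare_positions(sequence, probs, min_prob=0.1):
    """Return 0-based positions whose transition from the previous symbol
    has estimated probability below `min_prob` (unseen transitions count as 0)."""
    flagged = []
    for i in range(1, len(sequence)):
        p = probs.get(sequence[i - 1], {}).get(sequence[i], 0.0)
        if p < min_prob:
            flagged.append(i)
    return flagged


probs = fit_transitions("ABCABCABCABC")         # hypothetical "normal" sequence
flagged = rare_positions("ABCABBABC", probs)    # hypothetical test sequence
print(flagged)  # [5, 6]
```

Positions 5 and 6 are flagged because the transitions B to B and B to A never occur in the reference sequence.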

2.3 Spatial Data

Non-spatial attributes (e.g. temperature, pressure, image pixel color intensity) are measured at spatial locations, and unusual local changes in these values are treated as outliers. Spatial data resembles time-series data in that it requires a certain level of continuity in the attribute of interest.

temporal data : continuity across adjacent time stamps
spatial data : continuity across adjacent locations
spatiotemporal data : temporal data + spatial data
- e.g. cyclone formation : sea surface temperature and pressure are expected to show both temporal and spatial continuity.
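Spatial continuity can be checked in the same spirit as temporal continuity: compare each location with the average of its neighbors. The following sketch assumes values laid out on a regular grid with 4-neighbor adjacency; the function name `spatial_outliers`, the temperature grid, and the threshold are all hypothetical.

```python
def spatial_outliers(grid, threshold):
    """Return (row, col) cells whose value differs from the mean of their
    up/down/left/right neighbors by more than `threshold`."""
    rows, cols = len(grid), len(grid[0])
    flagged = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                grid[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols  # stay inside the grid
            ]
            expected = sum(neighbors) / len(neighbors)
            if abs(grid[r][c] - expected) > threshold:
                flagged.append((r, c))
    return flagged


# Hypothetical temperature readings on a 3 x 4 grid.
temperature = [
    [20, 21, 20, 21],
    [21, 20, 35, 20],
    [20, 21, 20, 21],
]
hot_spots = spatial_outliers(temperature, threshold=10)
print(hot_spots)  # [(1, 2)]
```

The cells next to the hot spot also deviate from their neighborhood mean, though by less than the threshold, which shows how a single extreme value can distort the expected values around it.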

2.4 Network and Graph Data

Data values are represented as nodes and the relationships between them as edges, and irregularities in nodes or edges can be treated as outliers. Complex data such as graphs allows a great deal of flexibility, and complexity, in how outliers are defined. There is no unique way to define an outlier; the definition depends on the domain at hand, so a prior decision about which state counts as normal is needed to match the modeling goal. Other dependency types can be combined for outlier modeling (e.g. if the structure changes over time, it can be modeled as structural + temporal).

An example of network data and its outliers follows.

(a) Node 6 has an unusual locality structure and can be viewed as an outlier.
(b) An edge that connects otherwise distinct communities can be viewed as a relationship outlier or community outlier.
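One rough way to look for an edge like the one in (b) is to check whether its two endpoints share any neighbors. This is a heuristic chosen for illustration, not the book's method, and the graph below (two triangles joined by one edge) is hypothetical rather than the graph from the figure.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbor nodes."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def bridging_edges(edges):
    """Return edges whose endpoints have no common neighbor."""
    adjacency = build_adjacency(edges)
    return [(u, v) for u, v in edges if not (adjacency[u] & adjacency[v])]


# Hypothetical graph: communities {1, 2, 3} and {4, 5, 6}, linked by edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
suspects = bridging_edges(edges)
print(suspects)  # [(3, 4)]
```

The heuristic only makes sense when communities are dense. In a sparse or tree-like graph every edge would be flagged, which is one reason the definition of "normal" has to be decided per domain.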

* This post is a summary of Outlier Analysis, Second Edition, by Charu C. Aggarwal.
