ROC Curve: Explanation (Interpretation) and Plotting (Implementation) in Python

Goal

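This page looks at what an ROC Curve is and how, concretely, it is drawn. Until now, all I knew about the ROC Curve was that it is a metric used for classifiers, and that the more the curve bulges toward the upper left and the closer the AUC is to 1, the better. Realizing that I did not really understand such a commonly used evaluation metric, I felt the need to study it, and I am posting this to organize what I learned. Before drawing the ROC Curve, we first cover what I think is needed for it: the False Positive Rate (FPR) and True Positive Rate (TPR), and the relationship between the Threshold and FPR and TPR.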

FPR and TPR

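FPR and TPR are metrics for evaluating a classification model, and both are built from the Confusion Matrix. In a classification problem, the Confusion Matrix has as its elements the number of cases in which the model's prediction matches, or does not match, the actual label. Besides FPR and TPR, many other metrics such as Accuracy and F1-score are also based on the Confusion Matrix.

Each element of the Confusion Matrix is described below.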

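$TP$ = number of actual Positives that the model correctly predicted as Positive

$TN$ = number of actual Negatives that the model correctly predicted as Negative

$FP$ = number of actual Negatives that the model wrongly predicted as Positive

$FN$ = number of actual Positives that the model wrongly predicted as Negative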

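FPR and TPR can be expressed in terms of the Confusion Matrix elements as follows.

FPR is the proportion of actual Negatives that the model predicted as Positive, and is expressed as follows.
$FPR$ = $\cfrac{FP}{TN+FP}$
TPR is the proportion of actual Positives that the model predicted as Positive, and is expressed as follows.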
$TPR$ = $\cfrac{TP}{TP+FN}$
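To make the two formulas concrete, here is a minimal sketch that counts the four Confusion Matrix elements and then computes FPR and TPR. The function names and the labels/predictions are made up for illustration (1 = Positive, 0 = Negative); they do not come from any library.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = Positive, 0 = Negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def fpr_tpr(tp, tn, fp, fn):
    """Apply FPR = FP / (TN + FP) and TPR = TP / (TP + FN)."""
    return fp / (tn + fp), tp / (tp + fn)


# Hypothetical data, invented only for this illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
fpr, tpr = fpr_tpr(tp, tn, fp, fn)
print(tp, tn, fp, fn)  # 3 5 1 1
print(fpr, tpr)
```

Here one of the six actual Negatives is predicted as Positive, so FPR is 1/6, and three of the four actual Positives are found, so TPR is 0.75.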

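Relationship between the Threshold and FPR, TPR

When Positive / Negative is decided by a probability or by some score, FPR and TPR are functions of the Threshold. That is, FPR and TPR are quantities that change with the Threshold value, and can be written as follows. (Threshold is abbreviated as T.)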

$FPR(T)$ = $\cfrac{FP(T)}{TN(T)+FP(T)}$

$TPR(T)$ = $\cfrac{TP(T)}{TP(T)+FN(T)}$

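Let's use a figure to see how TPR(T) and FPR(T) change with the Threshold value. The following figure shows TPR(T) and FPR(T) when T (= Threshold) is 1, 1.5, and 2. The model judges a sample as Positive if its X-axis value is greater than or equal to the Threshold, and as Negative if it is below the Threshold.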

1) T = 1

$FPR(1)$ = $\cfrac{FP(1)}{TN(1)+FP(1)}$

$TPR(1)$ = $\cfrac{TP(1)}{TP(1)+FN(1)}$

$FPR(1)$ = $\cfrac{30}{70+30}$ = 0.3

$TPR(1)$ = $\cfrac{90}{90+10}$ = 0.9

2) T = 1.5

$FPR(1.5)$ = $\cfrac{FP(1.5)}{TN(1.5)+FP(1.5)}$

$TPR(1.5)$ = $\cfrac{TP(1.5)}{TP(1.5)+FN(1.5)}$

$FPR(1.5)$ = $\cfrac{20}{80+20}$ = 0.2

$TPR(1.5)$ = $\cfrac{80}{80+20}$ = 0.8

3) T = 2

$FPR(2)$ = $\cfrac{FP(2)}{TN(2)+FP(2)}$

$TPR(2)$ = $\cfrac{TP(2)}{TP(2)+FN(2)}$

$FPR(2)$ = $\cfrac{10}{90+10}$ = 0.1

$TPR(2)$ = $\cfrac{70}{70+30}$ = 0.7

In other words, FPR and TPR change with the Threshold value: raising T from 1 to 2 lowers FPR from 0.3 to 0.1 and TPR from 0.9 to 0.7.
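The arithmetic in the three cases can be re-checked with a few lines of Python. This is only a sketch: the counts are copied from the worked cases above, and the dictionary layout and the `rates` helper are my own naming.

```python
# Counts copied from the three worked cases: threshold -> (TP, FN, FP, TN).
counts_by_threshold = {
    1.0: (90, 10, 30, 70),
    1.5: (80, 20, 20, 80),
    2.0: (70, 30, 10, 90),
}


def rates(tp, fn, fp, tn):
    """Return (FPR, TPR) for one set of Confusion Matrix counts."""
    return fp / (tn + fp), tp / (tp + fn)


results = {t: rates(*c) for t, c in counts_by_threshold.items()}
for t, (fpr, tpr) in results.items():
    print(f"T={t}: FPR={fpr:.1f}, TPR={tpr:.1f}")
```

Both rates fall together as T rises, which is the trade-off that the ROC Curve visualizes.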

ROC Curve Plot

ROC Curve: Description and Characteristics

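An ROC Curve is the curve obtained by plotting TPR against FPR while varying the Threshold as a parameter. It has FPR on the X-axis and TPR on the Y-axis. Because TPR never decreases as FPR increases, the ROC Curve is monotonic, which makes it more intuitive to read than a Precision-Recall Curve.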

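At first, when the Threshold is very strict, no data is predicted as Positive, so FPR and TPR are both 0 and the ROC Curve passes through (0,0). As the Threshold is gradually relaxed, it eventually becomes so loose that all data is predicted as Positive, so FPR and TPR are both 1 and the ROC Curve passes through (1,1). A random classifier has an ROC Curve of the form Y = X, because as the Threshold is relaxed, FPR and TPR rise by the same amount.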
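These properties can be checked numerically. The sketch below is my own illustration, not part of the original post's code: it gives random scores to 10,000 made-up samples (an assumed stand-in for a "random classifier"), builds the ROC points, and measures the area under them with the trapezoid rule. The names `roc_points` and `area_under` are hypothetical helpers.

```python
import random


def roc_points(y_true, y_scores):
    """ROC points from the strictest threshold to the loosest."""
    pairs = sorted(zip(y_scores, y_true), reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # strictest threshold: nothing is predicted Positive
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid-rule area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


random.seed(0)
n = 10_000
y_true = [i % 2 for i in range(n)]               # half Positive, half Negative
y_scores = [random.random() for _ in range(n)]   # scores unrelated to the labels

points = roc_points(y_true, y_scores)
area = area_under(points)
print(points[0], points[-1])
print(area)  # close to 0.5 for a random classifier
```

The curve starts at (0,0), ends at (1,1), never moves down or to the left, and for random scores its area stays near 0.5, which is the area under Y = X.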

ROC Curve Plot Algorithm

Let's look at how to plot an ROC Curve concretely.

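1. Sort the data in descending order of Probability (or Score; it does not have to be a probability, any non-thresholded decision value will do).

2. Set the Threshold to the highest Probability (or Score).

3. Determine the Model's predictions for the current Threshold.

Predict Positive if the value is greater than or equal to the Threshold, and Negative if it is below the Threshold.

4. Compute FPR and TPR and plot the point on the ROC Curve.

5. Set the Threshold to the next Probability (or Score) below the current one and go back to step 3.

If there is no Probability (or Score) lower than the current Threshold, stop.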
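The five steps above can be sketched in plain Python as follows. This is a minimal illustration under my own assumptions: labels are 1 for Positive and 0 for Negative, the function name `roc_points_by_steps` is hypothetical, and the four-sample dataset is invented for the demo.

```python
def roc_points_by_steps(y_true, y_scores):
    """Follow steps 1-5 literally and return (threshold, FPR, TPR) triples."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)

    # Step 1: sort the distinct scores in descending order.
    thresholds = sorted(set(y_scores), reverse=True)

    points = []
    # Steps 2 and 5: start at the highest score, then move down one score at a time.
    for threshold in thresholds:
        # Step 3: predict Positive when score >= threshold, otherwise Negative.
        y_pred = [1 if s >= threshold else 0 for s in y_scores]
        # Step 4: compute FPR and TPR for this threshold.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        points.append((threshold, fp / n_neg, tp / n_pos))
    # The loop ends when no lower score remains.
    return points


# Hypothetical toy data: two Positives and two Negatives.
y_true = [1, 0, 1, 0]
y_scores = [0.9, 0.6, 0.4, 0.2]

points = roc_points_by_steps(y_true, y_scores)
for threshold, fpr, tpr in points:
    print(f"TH={threshold}: Plot ({fpr}, {tpr})")
```

Recomputing the predictions for every threshold is wasteful for large data, but it mirrors the steps one-to-one, which is the point of this sketch.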

ROC Curve Plot Example

The following example illustrates the procedure above. The True Labels contain 10 Positives and 10 Negatives.

1. Sort the data in descending order of Probability.

2. Set the Threshold to the highest Probability, 0.9.

3. Determine the Model's predictions for the current Threshold. (Currently TH=0.9)

If Probability >= Threshold, predict Positive.

If Probability < Threshold, predict Negative.

4. Compute FPR and TPR and plot the point on the ROC Curve.

$FPR$ = (actual Negatives predicted as Positive) / (actual Negatives)

$TPR$ = (actual Positives predicted as Positive) / (actual Positives)

FPR = $\cfrac{0}{10}$ = 0

TPR = $\cfrac{1}{10}$ = 0.1

Plot ( 0 , 0.1 )

5. Set the Threshold to the next Probability below the current one and go back to step 3. Threshold = 0.8

3. Determine the Model's predictions for the current Threshold. (Currently TH=0.8)

If Probability >= Threshold, predict Positive.

If Probability < Threshold, predict Negative.

4. Compute FPR and TPR and plot the point on the ROC Curve.

$FPR$ = (actual Negatives predicted as Positive) / (actual Negatives)

$TPR$ = (actual Positives predicted as Positive) / (actual Positives)

FPR = $\cfrac{0}{10}$ = 0

TPR = $\cfrac{2}{10}$ = 0.2

Plot ( 0 , 0.2 )

5. Set the Threshold to the next Probability below the current one and go back to step 3. Threshold = 0.7

3. Determine the Model's predictions for the current Threshold. (Currently TH=0.7)

If Probability >= Threshold, predict Positive.

If Probability < Threshold, predict Negative.

4. Compute FPR and TPR and plot the point on the ROC Curve.

$FPR$ = (actual Negatives predicted as Positive) / (actual Negatives)

$TPR$ = (actual Positives predicted as Positive) / (actual Positives)

FPR = $\cfrac{1}{10}$ = 0.1

TPR = $\cfrac{2}{10}$ = 0.2

Plot ( 0.1 , 0.2 )

5. Set the Threshold to the next Probability below the current one and go back to step 3. Threshold = 0.6

Repeat this process until Threshold = 0.1.

3. Determine the Model's predictions for the current Threshold. (Currently TH=0.1)

If Probability >= Threshold, predict Positive.

If Probability < Threshold, predict Negative.

4. Compute FPR and TPR and plot the point on the ROC Curve.

$FPR$ = (actual Negatives predicted as Positive) / (actual Negatives)

$TPR$ = (actual Positives predicted as Positive) / (actual Positives)

FPR = $\cfrac{10}{10}$ = 1

TPR = $\cfrac{10}{10}$ = 1

Plot ( 1 , 1 )

5. There is no Probability (or Score) lower than the current Threshold, so the procedure ends.

Plotting the FPR and TPR values obtained in the process above gives the following ROC Curve.
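The plotted points in the walkthrough can be reproduced without any plotting library. The sketch below assumes the example table holds the same labels and scores as the `y_true` and `y_scores` arrays in the code section of this post; under that assumption it prints the first three points and the last one.

```python
# Assumed to match the example table (same values as the post's code section).
y_true = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
            0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]

n_pos = sum(y_true)            # 10 actual Positives
n_neg = len(y_true) - n_pos    # 10 actual Negatives

points = []
for threshold in sorted(set(y_scores), reverse=True):
    # Count actual Positives / Negatives with score >= threshold.
    tp = sum(1 for t, s in zip(y_true, y_scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_scores) if t == 0 and s >= threshold)
    points.append((threshold, fp / n_neg, tp / n_pos))

# First three thresholds (0.9, 0.8, 0.7) and the last one (0.1).
for threshold, fpr, tpr in points[:3] + points[-1:]:
    print(f"TH={threshold}: Plot ({fpr}, {tpr})")
```

The output matches the hand-computed points (0, 0.1), (0, 0.2), (0.1, 0.2) and (1, 1).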

Code Implementation

import numpy as np
import matplotlib.pyplot as plt

# True labels (1 = Positive, 0 = Negative) and scores, sorted by score in descending order.
y_true = np.array([1,1,0,1,1,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0])
y_scores = np.array([0.9,0.8,0.7,0.6,0.55,0.54,0.53,0.52,0.51,0.505,0.4,0.39,0.38,0.37,0.36,0.35,0.34,0.33,0.3,0.1])

def get_tpr(y_true, y_scores, threshold):
    # Samples with score >= threshold are predicted Positive.
    predicted_positive = y_scores >= threshold
    tp = np.sum(predicted_positive & (y_true == 1))
    actual_positive = np.sum(y_true == 1)
    return tp / actual_positive

def get_fpr(y_true, y_scores, threshold):
    # Samples with score >= threshold are predicted Positive.
    predicted_positive = y_scores >= threshold
    fp = np.sum(predicted_positive & (y_true == 0))
    actual_negative = np.sum(y_true == 0)
    return fp / actual_negative

def roc_plot(y_true, y_scores):
    # Each score is used as a threshold; y_scores is assumed to be sorted
    # in descending order so that the line is drawn in threshold order.
    tpr = [get_tpr(y_true, y_scores, threshold) for threshold in y_scores]
    fpr = [get_fpr(y_true, y_scores, threshold) for threshold in y_scores]

    fig = plt.figure(figsize=(9, 6))

    # 3D container
    ax = plt.axes(projection='3d')
    # 3D line and scatter plot of (FPR, Threshold, TPR)
    ax.plot3D(fpr, y_scores, tpr)
    ax.scatter3D(fpr, y_scores, tpr)
    ax.plot3D([0, 1], [1, 0], [0, 1])
    # Axis labels
    ax.set_xlabel('False-Positive-Rate')
    ax.set_ylabel('Thresholds')
    ax.set_zlabel('True-Positive-Rate')
    ax.set_title('ROC Curve 3D')
    # Limit FPR and TPR to the range 0 to 1
    ax.set_xlim(0, 1)
    ax.set_zlim(0, 1)
    ax.view_init(26)  # Rotate the view (elevation angle)
    plt.show()

    fig = plt.figure(figsize=(9, 6))
    plt.plot(fpr, tpr)
    plt.scatter(fpr, tpr)
    plt.plot([0, 1], [0, 1])  # Y = X reference line
    plt.xlabel('False-Positive-Rate')
    plt.ylabel('True-Positive-Rate')
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.title('ROC Curve 2D')
    plt.show()

roc_plot(y_true, y_scores)

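Code Output and Explanation

The ROC Curve we usually see is a 2D curve with only the FPR and TPR axes. However, FPR and TPR are values determined by the Threshold parameter, so including the Threshold makes the curve 3D. To show this, I added a Threshold axis and drew the curve in 3D. Projecting this 3D curve onto the FPR-TPR plane gives the 2D curve we normally see.

This post covered the ROC Curve. The Precision-Recall Curve, which is similar in nature, is explained and plotted in the post at the following URL.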
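The projection can also be written out in code: each 3D point is an (FPR, Threshold, TPR) triple, and dropping the Threshold coordinate leaves the usual 2D curve. As a small extra, the sketch then computes the AUC mentioned in the Goal section with the trapezoid rule. Prepending (0, 0) as the strictest-threshold point and the variable names are my own assumptions; the data is the same as in the code above.

```python
y_true = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
            0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# 3D curve: one (FPR, Threshold, TPR) triple per threshold.
curve_3d = []
for threshold in sorted(set(y_scores), reverse=True):
    tp = sum(1 for t, s in zip(y_true, y_scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_scores) if t == 0 and s >= threshold)
    curve_3d.append((fp / n_neg, threshold, tp / n_pos))

# Projection onto the FPR-TPR plane: drop the Threshold coordinate.
curve_2d = [(fpr, tpr) for fpr, _threshold, tpr in curve_3d]

# AUC by the trapezoid rule, starting from (0, 0) (assumed strictest threshold).
path = [(0.0, 0.0)] + curve_2d
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(path, path[1:]))
print(round(auc, 2))  # 0.68 for this data
```

An AUC of 0.68 sits between the 0.5 of the Y = X reference line and the ideal value of 1.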

https://ardentdays.tistory.com/20
