[Paper Review] Mitigating Neural Network Overconfidence with Logit Normalization

래훈

|2025. 9. 29. 17:51

Paper link: LogitNorm
This is a review of the paper Mitigating Neural Network Overconfidence with Logit Normalization.

Introduction

When deployed in the open world, modern neural networks struggle with out-of-distribution (OOD) inputs, that is, samples drawn from a distribution not seen during training. Such samples should not be predicted with high confidence at test time.

A reliable classifier should not only classify in-distribution (ID) samples accurately but also identify OOD inputs as unknown.

A naive approach to OOD detection is to use the maximum softmax probability (MSP). It rests on the assumption that OOD data should yield lower softmax confidence than ID data. In practice, however, neural networks produce abnormally high softmax confidence even for inputs far from the training data. Many studies define alternative OOD scoring functions, but work on the cause of the overconfidence problem and on how to mitigate it remains limited.

The paper shows that a simple modification of the cross-entropy loss can mitigate the overconfidence problem. Specifically, it forces the norm of the logit vector to stay constant.

Logit Normalization (LogitNorm) is motivated by an analysis of the logit vector norm: even when most training samples are already classified correctly, the softmax cross-entropy loss can keep increasing the magnitude of the logit vector. To mitigate this, the method decouples the influence of the output norm from the optimization process by normalizing the logit vector so that its norm stays constant. A network trained on normalized outputs tends to make conservative predictions and shows strong separability between ID and OOD inputs in the softmax confidence score.

Overall, the LogitNorm loss achieves strong OOD detection performance while maintaining accuracy on ID data. It is easy to adopt, simple to implement, and requires no complex changes to the training procedure.

Preliminaries

$\mathcal{X}$: input space
$\mathcal{Y}=\{1, \dots, k\}$: label space with $k$ classes
$\mathcal{D}_\text{train}=\{(x_{i},y_{i})\}^{N}_{i=1}$: training dataset
$\mathcal{P}_{\mathcal{XY}}$: joint data distribution
$\mathcal{P}_\text{in}$: marginal distribution over $\mathcal{X}$
$f:\mathcal{X}\rightarrow \mathbb{R}^{k}$: classifier
- $\theta \in \mathbb{R}^{p}$: parameters

$$\mathcal{R}_{\mathcal{L}}(f) = \mathbb{E}_{(x,y) \sim \mathcal{P}_{\mathcal{XY}}} \left[ \mathcal{L}\big(f(x;\theta), y\big) \right]$$

$\mathcal{L}$: cross-entropy loss

$$\mathcal{L}_{\mathrm{CE}}\big(f(x;\theta), y\big) = - \log p(y \mid x) = - \log \frac{e^{f_{y}(x;\theta)}}{\sum_{i=1}^{k} e^{f_{i}(x;\theta)}}
$$

$f_{y}(x;\theta)$: $y$-th element of $f(x;\theta)$
$y$: ground-truth label
$p(y|x)$: softmax probability

The OOD detection task is formulated as a binary classification problem:
$$
g(x) = \begin{cases}
\text{in}, & \text{if } S(x) \geq \gamma, \\
\text{out}, & \text{if } S(x) < \gamma,
\end{cases}
$$

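$\gamma$: threshold, chosen so that a high fraction of ID data (e.g., 95%) is correctly classified as ID
$S(x)$: scoring function
- Inputs with higher scores are classified as ID, and those with lower scores as OOD.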
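The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the helper names (`threshold_at_tpr`, `false_positive_rate`) and the score values are made up, and the threshold is taken as a simple order statistic of the ID scores.

```python
import math

def threshold_at_tpr(id_scores, tpr=0.95):
    """Return gamma such that about `tpr` of the ID scores satisfy S(x) >= gamma."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon guards
    # against floating-point noise in tpr * len(ranked).
    keep = max(math.ceil(tpr * len(ranked) - 1e-9), 1)
    return ranked[keep - 1]

def g(score, gamma):
    """Decision rule from the text: 'in' if S(x) >= gamma, otherwise 'out'."""
    return "in" if score >= gamma else "out"

def false_positive_rate(ood_scores, gamma):
    """Fraction of OOD samples that are wrongly accepted as ID."""
    return sum(g(s, gamma) == "in" for s in ood_scores) / len(ood_scores)

# Hypothetical scores, made up for illustration only.
id_scores = [0.99, 0.97, 0.95, 0.93, 0.90, 0.88, 0.85, 0.80, 0.70, 0.40]
ood_scores = [0.96, 0.75, 0.60, 0.50, 0.30]

gamma = threshold_at_tpr(id_scores, tpr=0.9)  # accept 9 of the 10 ID samples
fpr = false_positive_rate(ood_scores, gamma)
print(gamma, fpr)
```

With `tpr=0.95` on a real ID test set, the same false-positive rate is the FPR95 metric that appears later in the experiments.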

Method: Logit Normalization

Motivation

The paper analyzes why neural networks trained with the standard softmax cross-entropy loss tend to make overconfident predictions. The analysis suggests that the large magnitude of the network output (the logits) may be the cause.

Let $f$ denote the network output $f(x;\theta)$ (the logit, or pre-softmax output).
$f$ can be decomposed into two components:
$$f=\Vert f\Vert \cdot \hat{f}$$

$\Vert f \Vert = \sqrt{f_{1}^{2}+f_{2}^{2}+ \cdots + f_{k}^{2}}$ : Euclidean norm of the logit vector (magnitude)
$\hat{f}$ : unit vector in the same direction as $f$ (direction)

At test time, the model predicts the class $c=\arg\max_{i}(f_{i})$.

Proposition 3.1.

For any constant $s > 1$, if $\arg\max_{i}(f_{i})=c$, then $\arg\max_{i}(sf_{i})=c$ always holds.

In other words, scaling the magnitude of the logits does not change the predicted class.

Proposition 3.2.

For the softmax cross-entropy loss, let $\sigma$ be the softmax activation function. For any scalar $s>1$, if $c = \arg\max_{i}(f_{i})$, then the following holds:
$$\sigma_{c}(sf)\ge \sigma_{c}(f)$$

Proposition 3.2 shows that increasing the magnitude of the logits raises the softmax confidence score while leaving the final prediction unchanged. To analyze the effect on the training objective, the loss can be rewritten as:

$$\mathcal{L}_{\text{CE}}(f(x;\theta), y)=-\log p(y|x)=-\log \frac{e^{\Vert f\Vert \cdot \hat{f}_{y}}}{\sum_{i=1}^{k}e^{\Vert f\Vert \cdot \hat{f}_{i}}}$$

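This shows that the training loss depends on both the magnitude and the direction of the logits. With the direction held fixed, we can ask how the magnitude affects the training loss. If $y = \arg \max_{i}(f_{i})$, increasing $\Vert f \Vert$ increases $p(y|x)$. So even for training samples that are already classified correctly, the optimization can keep growing $\Vert f \Vert$ to obtain a higher softmax confidence score and a smaller loss.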
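A small numeric check can make Propositions 3.1 and 3.2 and the loss argument concrete. The sketch below is illustrative only: the logit vector `f` and the scale factors are arbitrary values chosen here, not taken from the paper, and class 0 is assumed to be both the prediction and the ground-truth label.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

f = [2.0, 1.0, -1.0]  # hypothetical logits; class 0 is predicted and correct
rows = []
for s in (1.0, 2.0, 5.0, 10.0):
    scaled = [s * v for v in f]          # same direction, larger norm
    pred = max(range(len(scaled)), key=scaled.__getitem__)
    conf = max(softmax(scaled))
    loss = cross_entropy(scaled, 0)
    rows.append((s, pred, conf, loss))
    print(f"s={s:5.1f}  pred={pred}  confidence={conf:.6f}  loss={loss:.6f}")
```

In this sketch the predicted class stays at 0 for every scale, while the confidence rises and the loss falls as `s` grows, which is the incentive to inflate $\Vert f \Vert$ described above.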

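The paper plots the logit norm over the course of training. In practice, the softmax cross-entropy loss drives the model to produce logits with increasingly large norms for both ID and OOD samples. These large norms lead to overconfident softmax scores and make it hard to distinguish ID from OOD data.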

Method

The key idea of the paper is to decouple the influence of the logit magnitude from network optimization. That is, the goal is to keep the L2 norm of the logit vector at a fixed constant during training:

$$
\begin{aligned} & \underset{\theta}{\text{minimize}} & & \mathbb{E}_{\mathcal{P}_{\mathcal{XY}}}\left[\mathcal{L}_{\text{CE}}(f(x; \theta), y)\right] \\ & \text{subject to} & & \Vert f(x; \theta)\Vert_2 = \alpha. \end{aligned}
$$

However, solving this constrained optimization problem is not straightforward. To sidestep it, the paper converts the objective into a surrogate loss function that can be trained end to end and that strictly keeps the logit vector norm constant.

Logit Normalization

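Instead of optimizing the magnitude of the logits, this method encourages the direction of the logits to be consistent with the corresponding one-hot label. Specifically, the logit vector is normalized to a unit vector with constant magnitude. The softmax cross-entropy loss is then applied to the normalized logit vector rather than to the original output.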

$$\mathcal{R}_{\mathcal{L}}(f) = \mathbb{E}_{(x,y)\sim \mathcal{P}_{\mathcal{XY}}}[\mathcal{L}_{\text{CE}}(\hat{f}(x; \theta), y)]$$

$\hat{f}(x; \theta)=f(x;\theta) / \Vert f(x;\theta) \Vert$ : normalized logit vector

Equivalently, the new loss function can be defined as:
$$\mathcal{L}_{\text{logit-norm}}(f(x; \theta), y) = -\log \frac{e^{f_y / (\tau ||f||)}}{\sum_{i=1}^{k} e^{f_i / (\tau ||f||)}}$$

$\tau$ : temperature parameter that controls the magnitude of the logits
The loss can be interpreted as having an input-dependent temperature, namely $\tau \Vert f(x;\theta) \Vert$.

With logit normalization, the magnitude of the output vector is held constant at $\frac{1}{\tau}$.
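The loss above can be written out for a single sample as follows. This is a plain-Python sketch under stated assumptions rather than the authors' implementation: the function names are chosen here, the default `tau=0.04` mirrors the CIFAR-10 setting mentioned in the experiments, and the small `eps` added to the norm to avoid division by zero is an assumption.

```python
import math

def normalize_logits(logits, tau=0.04, eps=1e-7):
    """Return f / (tau * ||f||); eps (an assumption) avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in logits)) + eps
    return [v / (tau * norm) for v in logits]

def logitnorm_loss(logits, label, tau=0.04, eps=1e-7):
    """Softmax cross-entropy applied to the normalized logits."""
    z = normalize_logits(logits, tau, eps)
    m = max(z)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[label]

f = [2.0, 1.0, -1.0]                 # hypothetical logits
f_scaled = [10.0 * v for v in f]     # same direction, ten times the norm

z = normalize_logits(f)
z_norm = math.sqrt(sum(v * v for v in z))   # close to 1 / tau
loss_a = logitnorm_loss(f, 0)
loss_b = logitnorm_loss(f_scaled, 0)
print(z_norm, loss_a, loss_b)
```

Because `f` and `f_scaled` share a direction, the two losses agree up to the effect of `eps`, so scaling the logits no longer lowers the loss.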

The loss can then be minimized only by adjusting the direction of the logit output. As a result, the model tends to make conservative predictions for inputs far from $\mathcal{P}_{\text{in}}$.

The paper shows that ID and OOD samples are better separated when the network is trained with logit normalization.

In addition, a t-SNE visualization of the softmax outputs shows that LogitNorm separates ID and OOD samples more meaningfully.

Proposition 3.3. (Lower Bound of Loss)

For any input $x$ and any positive $\tau \in \mathbb{R}^{+}$, $\mathcal{L}_{\text{logit-norm}}$ has the following lower bound:
$$
\mathcal{L}_{\text{logit-norm}} \ge \log(1+(k-1)e^{-2/\tau})
$$

$k$ : number of classes

The LogitNorm loss thus has a lower bound that depends on $\tau$ and $k$. In particular, the lower bound grows as $\tau$ increases. The paper uses a relatively small value, $\tau < 1$.
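One way to see where the bound comes from (my reading, not quoted from the paper): the loss equals $\log\big(1+\sum_{i\neq y} e^{(\hat{f}_{i}-\hat{f}_{y})/\tau}\big)$, and since $\hat{f}$ is a unit vector, each component lies in $[-1, 1]$, so $\hat{f}_{i}-\hat{f}_{y} \ge -2$ for each of the $k-1$ terms. The sketch below evaluates the bound for a few temperatures; `k=10` and the list of `tau` values are illustrative choices.

```python
import math

def logitnorm_lower_bound(k, tau):
    """log(1 + (k - 1) * exp(-2 / tau)), computed with log1p for accuracy."""
    return math.log1p((k - 1) * math.exp(-2.0 / tau))

taus = [0.04, 0.1, 0.5, 1.0, 10.0]
bounds = [logitnorm_lower_bound(10, tau) for tau in taus]
for tau, bound in zip(taus, bounds):
    print(f"tau={tau:5.2f}  lower bound={bound:.3e}")
```

For small `tau` the bound is practically zero, while for very large `tau` it approaches $\log k$, the loss of a uniform prediction.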

Experiments

In-distribution datasets: CIFAR-10, CIFAR-100
Out-of-distribution datasets: Textures, SVHN, Places365, LSUN-Crop, LSUN-Resize, iSUN
$\tau$ is selected from {0.001, 0.005, 0.01, ... , 0.05}. For CIFAR-10 it is set to 0.04.

Results

How does logit normalization influence OOD detection performance?

Score function $S(x)=\max_{i} \frac{e^{f_{i}(x;\theta)}} {\sum_{j=1}^{k}{e^{f_{j}(x;\theta)}}}$ : softmax confidence score
Applying the LogitNorm loss substantially improves OOD detection performance.

With the cross-entropy loss, the softmax scores of both ID and OOD data concentrate at high values, whereas a network trained with the LogitNorm loss produces clearly separated scores for ID and OOD data.

Overall, the experiments show that training with the LogitNorm loss makes softmax scores more separable between ID and OOD data, which enables effective OOD detection.

Beyond the softmax score, LogitNorm also achieves strong OOD detection performance with other scoring functions, which were originally developed for models trained with the cross-entropy loss.

LogitNorm is effective across a range of model architectures, and it maintains classification accuracy.

Discussion

Logit normalization vs. Logit penalty

The paper discusses whether a similar effect can be obtained by penalizing the L2 norm of the logits.

It shows that directly constraining the logit norm through a Lagrangian multiplier does not work well.

$$\mathcal{L}_{\text{logit-penalty}}(f(x; \theta), y) = \mathcal{L}_{\text{CE}}(f(x;\theta),y)+\lambda \Vert f(x;\theta)\Vert_{2}$$

$\lambda$: Lagrangian multiplier
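For comparison with LogitNorm, the penalty objective can be sketched for a single sample as below. The function names and the value `lam=0.05` are hypothetical choices for illustration; the source does not state which $\lambda$ values were used.

```python
import math

def cross_entropy(logits, label):
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def logit_penalty_loss(logits, label, lam=0.05):
    """Cross-entropy plus lam times the L2 norm of the raw logits."""
    norm = math.sqrt(sum(v * v for v in logits))
    return cross_entropy(logits, label) + lam * norm

f = [2.0, 1.0, -1.0]  # hypothetical logits
ce = cross_entropy(f, 0)
penalized = logit_penalty_loss(f, 0, lam=0.05)
print(ce, penalized)
```

Unlike the normalized loss, this objective still sees the raw logit magnitude: the cross-entropy term rewards a larger norm while the penalty term pushes against it, and `lam` sets the trade-off between the two.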

Both logit penalty and logit normalization reduce the L2 norm of the logits. Unlike LogitNorm, however, the logit penalty method still produces large L2 norms for OOD data, which degrades OOD detection performance. The experiments also found that training fails to converge when $\lambda$ is too large.

Overall, constraining the logit norm alone does not solve the OOD detection task, whereas the LogitNorm loss improves performance substantially.

Relations to temperature scaling.

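The paper analyzes how $\tau$ affects OOD detection performance.
Average FPR95 is measured over the six OOD datasets with CIFAR-10 as the ID dataset. The results are consistent with the analysis in Proposition 3.3: as $\tau$ grows, the lower bound of the loss grows, which is undesirable from an optimization standpoint.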

Conclusion

The paper introduces Logit Normalization (LogitNorm), a simple alternative to the cross-entropy loss.

By decoupling the influence of the logit norm from the training objective and its optimization, the model makes conservative predictions for OOD inputs, which strengthens their separation from ID data.

The experiments show that the method improves both OOD detection and confidence calibration while maintaining classification accuracy on ID data. The proposed method is easy to apply in real-world settings and simple to implement.

License: Attribution-NonCommercial-NoDerivatives
