[QLoRA Review] Qlora: Efficient finetuning of quantized llms

yun_s 2023. 9. 6. 21:18

Introduction

Fine-tuning large models such as LLMs is prohibitively expensive.
Recent quantization methods still fall short:
- they can only be used at inference time

QLoRA

QLoRA quantizes a pre-trained model to 4-bit with a high-precision technique, then trains low-rank adapters on top of the frozen quantized weights.

With QLoRA, large models can be fine-tuned on a single GPU.

Main methods

4-bit NormalFloat
- A quantization data type tailored to normally distributed data
Double Quantization
- Reduces memory by quantizing the quantization constants themselves
Paged Optimizers
- Pages optimizer states between GPU and CPU memory

Background

Block-wise k-bit Quantization

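Problem
- If the input contains outliers, some quantization bins end up holding few or no values
Solution
- Split the input tensor into blocks and quantize each block independently with its own quantization constant $c$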
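
The effect of per-block constants can be seen in a small sketch. It assumes 8-bit absmax quantization, where each block's constant is $c = 127 / \max|x|$. The function names, the block size, and the sample values are illustrative assumptions and are not taken from the paper's code.

```python
def quantize_blockwise(values, block_size, bits=8):
    """Absmax-quantize `values` block by block; returns (codes, constants)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    codes, constants = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # guard against an all-zero block
        c = qmax / absmax  # quantization constant of this block
        constants.append(c)
        codes.extend(round(v * c) for v in block)
    return codes, constants


def dequantize_blockwise(codes, constants, block_size):
    """Invert the quantization: divide every code by its block's constant."""
    return [q / constants[i // block_size] for i, q in enumerate(codes)]


# Hypothetical weights with a single outlier (100.0) in the second half.
weights = [0.1, -0.2, 0.3, 0.05, 0.2, -0.1, 100.0, 0.15]

# One constant for the whole tensor: the outlier dominates the scale.
codes_tensor, c_tensor = quantize_blockwise(weights, block_size=8)

# One constant per block of 4: the first block keeps its resolution.
codes_block, c_block = quantize_blockwise(weights, block_size=4)
recon_block = dequantize_blockwise(codes_block, c_block, block_size=4)

print("per-tensor codes:", codes_tensor)
print("per-block codes: ", codes_block)
print("per-block reconstruction:", recon_block)
```

With a single constant, every small weight collapses to the same code because the outlier stretches the range. With one constant per block, only the block that contains the outlier loses resolution.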

Low-rank Adapters (LoRA)

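In short, $Y=XW=X(W_1+W_2)=X(W_1+L_1L_2)$: the weight update $W_2$ learned during fine-tuning is factored into two low-rank matrices $L_1L_2$, while the pre-trained weight $W_1$ stays frozen.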
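
As a minimal sketch of this decomposition, the snippet below evaluates $X(W_1+L_1L_2)$ both through the adapter path and through the merged weight, and compares parameter counts. The matrix sizes, the rank of 1, and all values are hypothetical, and the helpers are plain-Python stand-ins for a tensor library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w1, l1, l2):
    """Y = X W1 + (X L1) L2; the full update L1 L2 is never materialized."""
    return add(matmul(x, w1), matmul(matmul(x, l1), l2))


# Hypothetical 4x4 frozen weight and a rank-1 adapter (L1 is 4x1, L2 is 1x4).
x = [[1, 2, 3, 4]]
w1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
l1 = [[1], [0], [1], [0]]
l2 = [[1, 2, 0, -1]]

y_lora = lora_forward(x, w1, l1, l2)

# Same result when the adapter is merged into the weight first.
y_merged = matmul(x, add(w1, matmul(l1, l2)))

full_params = len(w1) * len(w1[0])                          # size of a dense update W2
lora_params = len(l1) * len(l1[0]) + len(l2) * len(l2[0])   # trainable adapter weights

print(y_lora, y_merged)
print("dense update:", full_params, "parameters; adapter:", lora_params)
```

The saving grows with the layer size: for a $d \times d$ weight and rank $r$, the adapter holds $2dr$ parameters instead of $d^2$.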

QLoRA Fine-tuning

4-bit NormalFloat Quantization

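Quantile Quantization
- Definition
  - Uses the empirical cumulative distribution function so that every quantization bin holds the same number of values
- Problem
  - Estimating the empirical cumulative distribution function is expensive
This cost can be avoided if the distribution of the input tensor is fixed
- Pre-trained neural network weights are usually distributed as $N(0, \sigma)$
- Rescale them into the range $[-1, 1]$

To represent 0 exactly, an asymmetric data type is used, with quantiles estimated separately for each side
- negative: $2^{k-1}$
- positive: $2^{k-1}+1$
Both sets contain 0, so merging them and dropping the duplicate zero leaves $2^k$ values.
This procedure yields the final NF4 values (16 of them for $k=4$), which the paper lists in its appendix.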
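
The asymmetric construction can be sketched with the standard normal quantile function. This is an illustration under assumptions, not the authors' implementation: the `offset` tail probability and the evenly spaced probabilities are choices made for this sketch, so the resulting levels should be read as an approximation of the published NF4 table.

```python
from statistics import NormalDist


def linspace(start, stop, num):
    """Return `num` evenly spaced points from `start` to `stop` inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def normal_float_levels(bits=4, offset=0.9677083):
    """Build 2**bits levels in [-1, 1] from N(0, 1) quantiles, with an exact 0.

    `offset` is the probability assigned to the outermost level (an assumption
    of this sketch); the probability 0.5 corresponds to the value 0.
    """
    inv_cdf = NormalDist().inv_cdf
    half = 2 ** (bits - 1)
    # Positive side: half + 1 points including zero; drop zero, add it once later.
    positive = [inv_cdf(p) for p in linspace(offset, 0.5, half + 1)[:-1]]
    # Negative side: half points including zero; drop zero here as well.
    negative = [-inv_cdf(p) for p in linspace(offset, 0.5, half)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]  # rescale into [-1, 1]


def nearest_level(x, levels):
    """Index of the level closest to x (x is assumed to be scaled into [-1, 1])."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))


levels = normal_float_levels()
for index, value in enumerate(levels):
    print(index, round(value, 4))
```

Storing a weight then means keeping only the 4-bit index returned by `nearest_level`, plus the per-block constant needed to undo the rescaling.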

Double Quantization

To save additional memory, the quantization constants (QC) are themselves quantized
- Example
  - Quantizing $W$ with block size 64 and 32-bit QCs ($C_2^{FP32}$) adds 32/64 = 0.5 bits per parameter
  - Quantizing $C_2^{FP32}$ to 8 bits with block size 256 and 32-bit QCs ($C_1^{FP32}$) adds 8/64 + 32/(64*256) = 0.127 bits per parameter
  - The overhead therefore drops from 0.5 to 0.127 bits per parameter
- This did not degrade performance
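
The arithmetic in the example above can be checked in a few lines. The variable names are illustrative; the block sizes and bit widths are the ones quoted in the example.

```python
# First-level quantization: one constant per block of 64 weights.
weight_block_size = 64
# Second-level quantization: one constant per block of 256 first-level constants.
constant_block_size = 256

# Without double quantization: a 32-bit constant shared by 64 weights.
single_overhead = 32 / weight_block_size

# With double quantization: an 8-bit constant per 64 weights,
# plus a 32-bit constant shared by 64 * 256 weights.
double_overhead = 8 / weight_block_size + 32 / (weight_block_size * constant_block_size)

saving = single_overhead - double_overhead

print(f"single: {single_overhead:.3f} bits per parameter")
print(f"double: {double_overhead:.3f} bits per parameter")
print(f"saving: {saving:.3f} bits per parameter")
```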

Paged Optimizer

Used to avoid GPU out-of-memory errors
Optimizer states are paged between GPU and CPU memory, the same way an operating system pages between CPU RAM and disk

QLoRA

Experiment

Going from 16-bit to 4-bit does not degrade performance

At the same bit width, QLoRA achieves better performance

Reference

Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).
