📦 [Review] DeepSpeed-MoE


[Paper Review] Kang Dong-gyu · Reviewed by Kade Kang (devkade12@gmail.com) · Reviewed: 13, 2024

The Purpose of This Study

In the three years since GPT appeared, the parameter counts of LLMs have grown enormously in pursuit of better quality. However, scaling model size further is becoming increasingly difficult because of computing cost. For example, the Megatron-Turing NLG 530B model (November 2021) took three months to train even on 2,000 A100 GPUs.

๋”ฐ๋ผ์„œ, ๋‹ค์Œ์˜ ์งˆ๋ฌธ์„ ๋˜์ง€๊ฒŒ ๋œ๋‹ค.

Can we achieve a meaningful quality improvement without increasing computing cost? Or, can we reach comparable quality at 3–5x lower cost?

⇒ Mixture-of-Experts (MoE)

Lit. Review

What is MoE?

Reference : Switch Transformer

์œ„ ๊ทธ๋ฆผ์€ Switch transformer ์˜ ๊ตฌ์กฐ๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. Mixture-of-Experts ๋Š” ์œ„ ๊ตฌ์กฐ์™€ ๊ฐ™์ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์ „๋ฌธ๊ฐ€ FFN์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ํŠน์ง•์— ๋งž๋Š” FFN์„ ์‚ฌ์šฉํ•˜์—ฌ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ์ด๋Œ์–ด ๋‚ธ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฐ MoE๋„ ๋ช‡ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ์ง€๋‹Œ๋‹ค.

  1. Limited Scope: In NLP, MoE-based models have mostly been limited to encoder-decoder architectures and seq2seq tasks (memory issues make them hard to apply to auto-regressive models).
  2. Massive Memory Requirements: They need far more parameters than a dense model of comparable quality, i.e., they are less parameter-efficient.
  3. Limited Inference Performance: The large memory footprint also slows down inference.

Large Scale Dense NLP Models

  • Hundreds of millions of parameters
    • BERT, XLNet, RoBERTa, ALBERT, and GPT, etc.
  • Billions to tens of billions of parameters
    • GPT-2, TuringNLG, Megatron-LM, T5, etc.
  • Extra-large models
    • GPT-3, Megatron-Turing NLG 530B model

Methods

DeepSpeed-MoE for NLG: Reducing the Training Cost of Language Models by 5 Times

Natural Language Generation (NLG) models give useful answers across a wide range of domains. Because they are so broadly applicable, improving NLG quality has been a major concern, and DeepSpeed-MoE improves quality at the same training cost, or equivalently reaches the same quality at a fraction of the cost.

MoE based NLG Model Architecture

  • MoE-based NLG models
    • 350M (24 layers, 1024 hidden size, 16 attention heads)
    • 1.3B (24 layers, 2048 hidden size, 16 attention heads)
    • 6.7B (32 layers, 4096 hidden size, 32 attention heads)
    • MoE-128: 128 experts applied at each FFN layer.
  • The study starts from GPT, a transformer-based NLG model, and uses the three model sizes above.
  • The number of parameters activated in the forward and backward passes is the same with and without MoE (e.g., for the 1.3B model, both 1.3B and 1.3B+MoE-128 activate 1.3B parameters per token).
  • Each token is routed to an expert by a gating function (see the sketch below).

Training and Evaluation Setting

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  • 128 NVIDIA Ampere A100 GPUs
  • Data parallelism + expert parallelism
  • Data: the same training data used for the MT-NLG model

MoE Leads to Better Quality for NLG Models

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  • MoE๋ฅผ ์ ์šฉํ•œ Loss๊ฐ€ Dense ๋ชจ๋ธ๋ณด๋‹ค ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ์Œ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.
  • 6.7B Dense ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ณด๋‹ค 5๋ฐฐ ์ ์€ 1.3B+MoE-128์ด ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
    • 4~5๋ฐฐ๋ฅผ ์ ˆ๊ฐํ•˜์—ฌ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ฆ๊ฐ€, ํ›ˆ๋ จ ์‹œ๊ฐ„ ๋ฐ ๋น„์šฉ ์ ˆ๊ฐ์œผ๋กœ ์ „ํ™˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  • Zero-Shot ํ‰๊ฐ€๋ฅผ ์ง„ํ–‰ํ–ˆ์„ ๋•Œ์—๋„ Dense์— ๋น„ํ•ด 4~5๋ฐฐ ์ ˆ๊ฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.#Zero-Shot

PR-MoE and MoS: Reducing the Model Size and Improving Parameter Efficiency

Table 1 ์˜ ๊ฐ ๋ชจ๋ธ๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ˆ˜๋ฅผ ํ™•์ธํ•ด๋ณด๋ฉด MoE๋ฅผ ์ ์šฉํ•œ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ Dense ๋ชจ๋ธ์— ๋น„ํ•ด ์•ฝ 8๋ฐฐ ์ •๋„ ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ๊ฐ–๋Š”๋‹ค. MoE ๋ชจ๋ธ์€ ๋” ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ•„์š”๋กœ ํ•˜๊ณ , ์ด๋Š” ๋‹ค์Œ์˜ ๋ฌธ์ œ๋ฅผ ๊ฐ–๋Š”๋‹ค.

  • ๋ชจ๋ธ์˜ ํ•™์Šต ์‹œ ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์š”๊ตฌ
  • ์ถ”๋ก ์—์„œ ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์ฝ๋Š” ๋ฐ ์†Œ๋น„๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ฃผ์š”ํ•œ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ ์›์ธ์ด๋‹ค. ์ฆ‰, MoE๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ๋งŽ์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋กœ ์ธํ•ด ์ถ”๋ก  ์†๋„๊ฐ€ ๋А๋ ค์ง„๋‹ค.

โ‡’ ์ „์ฒด ๋ชจ๋ธ ํฌ๊ธฐ๋ฅผ ์ตœ๋Œ€ 3๋ฐฐ๊นŒ์ง€ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” PR-MoE + Distillation์„ ํ™œ์šฉํ•œ Mixture-of-Student(MoS) ๋ฅผ ์ œ์‹œํ–ˆ๋‹ค.#Distillation

PR-MoE: Pyramid-Residual-MoE for Smaller Model Size and Fast Inference

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

  • ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ฅผ ์ค„์ด๋ฉด์„œ ์„ฑ๋Šฅ์€ ๋‚ผ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์œ„ ๊ทธ๋ฆผ์˜ ๊ตฌ์กฐ์™€ ๊ฐ™์€ PR-MoE ๊ตฌ์กฐ๋ฅผ ์ œ์‹œํ–ˆ๋‹ค.
  • PR-MoE ๋Š” ๋งˆ์ง€๋ง‰ ๋ช‡ ๊ฐœ์˜ ๊ณ„์ธต์—์„œ ๋” ๋งŽ์€ ์ „๋ฌธ๊ฐ€๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , MLP ๋ชจ๋“ˆ๊ณผ MoE ๋ชจ๋“ˆ์„ ๋™์‹œ์— ์‚ฌ์šฉํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ๊ฐ–๋Š”๋‹ค.
  • PR-MoE ๊ตฌ์กฐ๋Š” ์•„๋ž˜์˜ ๊ณผ์ •์„ ํ†ตํ•ด ์ œ์‹œ๋๋‹ค.
Phenomenon 1

In a standard MoE architecture, every MoE layer has the same number of experts and the same structure. Can this uniform structure be slimmed down? ⇒ In computer vision, shallow layers learn general features while deep layers learn more specific, data-dependent representations; this is why fine-tuning often freezes the shallow layers and updates only the deep ones. To see whether the same holds for MoE, the following two configurations were compared:

  • First-Half MoE (MoE applied from the first layers up to the middle)
  • Second-Half MoE (MoE applied from the middle to the last layers)

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

์œ„ Figure 2์˜ ์™ผ์ชฝ์„ ๋ณด๋ฉด Second-Half MoE๊ฐ€ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์ฆ‰, ๋ ๋ถ€๋ถ„์— MoE๋ฅผ ์ ์šฉํ•  ๊ฒฝ์šฐ ์ „๋ฌธ๊ฐ€์˜ ํšจ๊ณผ๊ฐ€ ๋” ๋›ฐ์–ด๋‚จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Phenomenon 2

MoE ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์€ ๋ฌด์—‡์ด ์žˆ์„๊นŒ?

  1. Keep the expert capacity (the number of experts each token passes through) fixed and increase the number of experts.
  2. Keep the number of experts fixed and increase the expert capacity.

1๋ฒˆ์˜ ๊ฒฝ์šฐ ์ „๋ฌธ๊ฐ€ ์ˆ˜๊ฐ€ ๋งŽ์•„์ง€๊ธฐ์— ํ•™์Šต์— ํ•„์š”ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋น„์šฉ์ด ์ฆ๊ฐ€ํ•œ๋‹ค. 2๋ฒˆ์˜ ๊ฒฝ์šฐ ์šฉ๋Ÿ‰์ด ์ปค์ง€๋ฉด์„œ ํ†ต์‹ ๋Ÿ‰๋„ ๋Š˜์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต, ์ถ”๋ก ์— ๋ณ‘๋ชฉ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

Is there a way to improve generalization while keeping training and inference efficient? Why increase expert capacity in the first place? Because when two experts weigh in, the model can pass on more generalized information: the second expert can correct the first expert's judgment. ⇒ But must the first expert be re-selected by routing for every token, or can one expert be fixed while a second, advisory expert is routed?

์ด๋ฅผ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด ์•„๋ž˜ 2๊ฐ€์ง€๋ฅผ ๋น„๊ตํ–ˆ๋‹ค.

  • ์šฉ๋Ÿ‰์„ 2๋ฐฐ๋กœ ๋Š˜๋ฆฌ๋Š” ๋ฐฉ๋ฒ•(Top2-MoE: 2๋ช…์˜ ์ „๋ฌธ๊ฐ€์—๊ฒŒ ์ „๋‹ฌ, ์ถœ๋ ฅ์„ ํ•ฉ์‚ฐ)
  • ํ•œ ์ „๋ฌธ๊ฐ€๋กœ ๊ณ ์ •ํ•˜๊ณ  ํ† ํฐ๋งˆ๋‹ค ๋‘ ๋ฒˆ์งธ ์ „๋ฌธ๊ฐ€๋กœ ๋ณ€๊ฒฝํ•˜๋Š” ๋ฐฉ๋ฒ•(Residual-MoE: MLP ๋ชจ๋“ˆ๋กœ ๊ณ ์ •, MoE ๋ชจ๋“ˆ์„ ํ†ตํ•ด ์ „๋ฌธ๊ฐ€ ๋ฝ‘์•„ ํ•ฉ์‚ฐ)

Figure 2์˜ ์˜ค๋ฅธ์ชฝ์ด ํ•ด๋‹น ์‹คํ—˜์˜ ๊ฒฐ๊ณผ์ด๋‹ค. Top2-MoE ์™€ Residual-MoE ๊ฐ€ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Residual-MoE ์˜ ๊ฒฝ์šฐ Top-1 gating๊ณผ ๋™์ผํ•œ ์–‘์˜ ํ†ต์‹ ๋Ÿ‰์œผ๋กœ ๋ ˆ์ด์–ด๋‹น 2๊ฐœ์˜ ์ „๋ฌธ๊ฐ€๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์ ์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ์‹คํ—˜์—์„œ๋Š” Residual-MoE ์˜ ์†๋„๊ฐ€ Top2-MoE ๋ณด๋‹ค 10% ์ด์ƒ ๋น ๋ฅด๋‹ค๊ณ  ํ•œ๋‹ค.

Efficient Training of an MoE Model

๊ฐ MoE ๋ชจ๋ธ์„ ํšจ์œจ์ ์œผ๋กœ ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ ์ „๋ฌธ๊ฐ€์— ์ˆœ์ „ํŒŒ๋กœ ํ†ต๊ณผํ•˜๋Š” ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ์ถฉ๋ถ„ํžˆ ์ปค์„œ, ํ›ˆ๋ จ์ด ์ž˜ ๋˜์–ด์•ผ ํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์ „๋ฌธ๊ฐ€์˜ ์ˆ˜๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ์ „๋ฌธ๊ฐ€ ํ•˜๋‚˜๊ฐ€ ์ฐจ์ง€ํ•  ์ˆ˜ ์žˆ๋Š” ํ† ํฐ ์ˆ˜๊ฐ€ ์ค„์–ด๋“ ๋‹ค. โ‡’ Data Parallel + Expert Parallel ์„ ํ†ตํ•ด ํ•ด๊ฒฐํ•œ๋‹ค.

If the number of experts matched the resources available for parallelization, training would be efficient, but more often than not they do not match (PR-MoE in particular uses a different number of experts per layer). That is, the following problems arise.

  • If the expert-parallel degree is set to the smallest number of experts in any layer, layers with more experts end up with multiple experts per GPU, giving low efficiency.
  • If the expert-parallel degree is set to the largest number of experts in the model, load-balancing problems limit efficiency.

To solve this, the authors built a flexible multi-expert, multi-data parallelism design into DeepSpeed-MoE. Pinning down the exact parallelization scheme would require reading the code.
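For a sense of how this might be wired up, the sketch below builds a PR-MoE-style stack with DeepSpeed's `deepspeed.moe.layer.MoE` wrapper. The expert counts, ep_size values, and helper names are assumptions for illustration, and the keyword arguments reflect my reading of recent DeepSpeed releases rather than the paper's training code:

```python
import torch.nn as nn
# DeepSpeed's MoE wrapper; argument names below may differ across versions.
from deepspeed.moe.layer import MoE

d_model, d_ff = 2048, 8192

def expert_ffn() -> nn.Module:
    # The expert sub-network that DeepSpeed replicates num_experts times.
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

def build_pr_moe_stack() -> nn.ModuleList:
    # Pyramid: more experts in later layers (values are made up for illustration).
    # ep_size = how many GPUs each layer's experts are sharded across (expert parallelism);
    # it can be chosen per layer so that it divides that layer's expert count.
    # Note: ep_size > 1 requires a distributed launch (e.g., via deepspeed.initialize).
    pyramid = [(64, 16), (64, 16), (128, 32), (128, 32)]  # (num_experts, ep_size) per MoE layer
    return nn.ModuleList(
        MoE(
            hidden_size=d_model,
            expert=expert_ffn(),
            num_experts=num_experts,
            ep_size=ep_size,
            k=1,                 # top-1 gating
            use_residual=True,   # Residual-MoE style fixed-MLP + expert path
        )
        for num_experts, ep_size in pyramid
    )
```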

Ablation Study of Different MoE Architectures

Reference : [2201.05596] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Mixture-of-Students: Distillation for Even Smaller Model Size and Faster Inference

๊ธฐ์กด์— LLM์„ ์ž‘์—…๋ณ„ ์ž‘์€ ๋ชจ๋ธ๋กœ ์ฆ๋ฅ˜ํ•˜๋Š” ๋ฐ KD๋ฅผ ์ ์šฉํ•œ ์—ฐ๊ตฌ๊ฐ€ ์žˆ์—ˆ์œผ๋‚˜, ์ž‘์€ ํŠธ๋ Œ์Šคํฌ๋จธ, ์ธ์ฝ”๋” ๊ธฐ๋ฐ˜ LM ๋ชจ๋ธ๋งŒ์„ ๊ณ ๋ คํ–ˆ๋‹ค. ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” KD๋กœ ์‚ฌ์ „ ํ•™์Šต๋œ ์ž‘์€ MoE ๋ชจ๋ธ์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์ž‘์—…์—์„œ zero-shot ํ‰๊ฐ€์™€ ๊ฐ™์€ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๊ณ , ๋” ๊ฐ€๋ณ๊ณ  ๋น ๋ฅธ ๋ชจ๋ธ์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์˜€๋‹ค.

Architecture Choice and Optimization Objective

  • ๊ต์‚ฌ MoE ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•œ๋‹ค.
  • ๊ต์‚ฌ ๋ชจ๋ธ์—์„œ ๊ฐ ์ „๋ฌธ๊ฐ€์˜ ๊นŠ์ด๋ฅผ ์ค„์—ฌ ํ•™์ƒ์„ ์–ป๋Š”๋‹ค.
  • ํ•ด๋‹น ํ•™์ƒ ๋ชจ๋ธ์„ MoS๋ผ ๋ถ€๋ฅธ๋‹ค.
  • MoS๋Š” ์•„๋ž˜ KD Loss๋ฅผ ํ†ตํ•ด ๊ต์‚ฌ๋ฅผ ๋ชจ๋ฐฉํ•˜๋„๋ก ํ•œ๋‹ค.
  • : ์˜ˆ์ธก๊ณผ ์ฃผ์–ด์ง„ Hard Label ์‚ฌ์ด ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค
  • : ์˜ˆ์ธก๊ณผ ๊ต์‚ฌ์˜ Soft Label ์‚ฌ์ด KL Divergence ์†์‹ค

  • ์ฒ˜์Œ์—๋Š” ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๋‚˜, ํ›ˆ๋ จ์ด ๋๋‚ ์ˆ˜๋ก ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค.

  • ํ•™์ƒ ๋ชจ๋ธ์ด ์ถฉ๋ถ„ํ•œ ์šฉ๋Ÿ‰์„ ๊ฐ€์ง€์ง€ ๋ชปํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ํ›ˆ๋ จ์˜ ๋๋ถ€๋ถ„์—์„œ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค์„ ํฌ์ƒ์‹œํ‚ค๋ฉด์„œ KL Divergence ์†์‹ค์„ ์ค„์ด๊ณ ์ž ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ ์ฐจ KL Divergence ์†์‹ค์˜ ์˜ํ–ฅ์„ ์ค„์ธ๋‹ค. ์ด๋•Œ์˜ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

DeepSpeed-MoE Inference

To be written up later.

Results & Discussion

Critique