What is an Optimizer?

  • An algorithm that performs optimization.

Optimization ์ด๋ž€?

  • Finding the maximum or minimum of a function, possibly subject to constraints.
  • An optimization problem seeks the best possible value within resource limits (memory, time, etc.).
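As a tiny worked example (not from the original): minimizing $f(x)=(x-3)^2$ with no constraints gives the minimum value $0$ at $x=3$; an optimizer searches for such a point numerically when no closed-form solution is available.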

์ตœ์ ํ™” ๊ณผ์ •

์ตœ์ ํ™”๊ณผ์ •{: .center}

  1. Pass the training set through the neural network to produce an output.
  2. Define a loss function between the output and the ground truth, and compute the loss.
  3. Compute the gradient that reduces the loss and use it to update the weights.
  4. Repeat steps 1–3 to find the best value within the given resources.

⇒ The process above is called the optimization process.
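As a minimal sketch of this loop, here is a toy linear model trained with steps 1–4; the data, model, and learning rate are illustrative assumptions, not part of the original:

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=(100, 1))

w, b = 0.0, 0.0   # weights to optimize
lr = 0.1          # learning rate

for step in range(200):
    y_hat = w * x + b                      # 1. forward pass: produce the output
    loss = np.mean((y_hat - y) ** 2)       # 2. loss between output and ground truth (MSE)
    grad_w = np.mean(2 * (y_hat - y) * x)  # 3. gradients of the loss ...
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w                       #    ... used to update the weights
    b -= lr * grad_b
                                           # 4. repeat within the step budget

print(w, b)  # approaches 2.0 and 1.0
```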

Types of Optimizers

Optimizer family tree{: .center}

Gradient Descent

  • The most basic optimizer.
  • It follows the slope downhill, updating the weights to find the minimum.

See the Gradient Descent page for details.

  • The update is the product of the learning rate and the gradient, as sketched below.
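A minimal sketch of that update rule in Python, assuming an illustrative one-parameter objective $f(w)=w^2$:

```python
def gradient_descent_step(w, grad, lr):
    # the update is the learning rate times the gradient, taken downhill
    return w - lr * grad

w = 5.0                  # start away from the minimum of f(w) = w**2
for _ in range(100):
    grad = 2 * w         # f'(w) = 2w
    w = gradient_descent_step(w, grad, lr=0.1)
print(w)                 # converges toward 0, the minimizer
```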

Drawbacks

  • Plain gradient descent can settle into a local minimum and never reach the global minimum, which is what Momentum below tries to fix.

Momentum

  • An optimizer that applies the law of inertia to help reach the global minimum.

The idea: when stuck in a local minimum, keep the speed built up on the way down and use it to escape.

  • It is applied as SGD combined with momentum.

Momentum{: width="600"}{: .center}

Source: https://ratsgo.github.io/deep learning/2017/04/22/NNtricks/

$$v_{t+1}=\rho v_t - \alpha\nabla f(x_t)$$

$$w_{t+1}=w_t+v_{t+1}$$

  • $\rho$ represents friction; 0.9 or 0.99 is typically used.
  • The momentum is computed from the gradient plus a damped portion of the previous velocity.
  • The weights are updated by applying the learning rate to the momentum.

Source: CS231n Lecture 7, Training Neural Networks II
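A minimal sketch of the momentum update above; the objective $f(w)=w^2$ and the hyperparameter values are illustrative assumptions:

```python
def momentum_step(w, v, grad, rho=0.9, alpha=0.01):
    v = rho * v - alpha * grad  # decayed previous velocity plus the gradient step
    w = w + v                   # move by the accumulated velocity
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    grad = 2 * w                # gradient of f(w) = w**2
    w, v = momentum_step(w, v, grad)
print(w)                        # oscillates, then settles near 0
```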

Nesterov Momentum

  • Momentum combines the gradient at the current position with the momentum it already carries.
  • Picture momentum as a ball rolling down the hill from −10 in the figure below: it rolls through the V-shaped valley, passes the lowest point at 0, climbs the far side, slows down, and heads back toward the lowest point.

![graph](https://user-images.githubusercontent.com/64977390/208243570-8e42a52e-75ac-430f-84a7-52e70e8cdbee.png){: width="400"}{: .center}

  • Nesterov Momentum's idea: rather than overshooting and coming back, slow down in advance. It estimates the future position and computes the gradient there:

$$v_{t+1}=\rho v_t - \alpha\nabla f(x_t+\rho v_t)$$

$$w_{t+1}=w_t+v_{t+1}$$

  1. Estimate the future position (current position plus momentum) and compute the gradient there (the gradient one step ahead).
  2. Subtract that gradient from the decayed current momentum to update the momentum.
  3. Add the new momentum to update the position.

![Nesterov method](https://user-images.githubusercontent.com/64977390/208243587-fc77a179-78ed-4ca8-8e84-af318d721de9.png){: width="600"}{: .center}

Source: https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc
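A minimal sketch of those three steps; `grad_fn` is a hypothetical helper standing in for the gradient of an illustrative objective $f(w)=w^2$:

```python
def nesterov_step(w, v, grad_fn, rho=0.9, alpha=0.01):
    lookahead = w + rho * v                    # 1. estimated future position
    v = rho * v - alpha * grad_fn(lookahead)   # 2. update momentum with its gradient
    w = w + v                                  # 3. move by the new momentum
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: 2 * x)  # f(w) = w**2
print(w)  # settles near 0, with less overshoot than plain momentum
```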

Adagrad (Adaptive Gradient)

  • An optimizer that applies the gradient adaptively by adjusting the learning rate per parameter.
  • The idea: parameters that have been updated a lot so far get small updates, while parameters that have been updated little get large updates.

$$G_{t+1}=G_t+(g_{t+1})^2$$

$$w_{t+1}=w_t-\frac{\alpha}{\sqrt{G_{t+1}+\epsilon}}\,g_{t+1}$$

  • $g_t$ : the gradient at step $t$.
  • $G_t$ : the accumulated sum of squared gradients.

  • The squared gradients keep accumulating; their square root is placed in the denominator of the update.
  • If the total amount updated so far is large, the denominator grows and the update shrinks; as iterations continue, $G$ keeps increasing and the weight changes become vanishingly small. A sketch follows below.
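A minimal sketch of the Adagrad update above; $\alpha$, $\epsilon$, and the objective are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, G, grad, alpha=0.1, eps=1e-8):
    G = G + grad ** 2                          # accumulate squared gradients
    w = w - alpha / np.sqrt(G + eps) * grad    # per-parameter scaled update
    return w, G

w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(500):
    grad = 2 * w                               # gradient of f(w) = ||w||^2
    w, G = adagrad_step(w, G, grad)
print(w)  # each coordinate shrinks at its own rate; steps keep getting smaller
```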

RMSProp

An optimizer that remedies Adagrad's drawback. In Adagrad the squared gradients accumulate without bound, so the update size converges to 0 as training continues. RMSProp fixes this with an **exponential moving average** that gradually reduces the weight of old values and emphasizes recent ones.

  • It adjusts the weights by considering both the accumulated update history and the amount updated recently.

$$G_{t+1}=\gamma G_t+(1-\gamma)(g_{t+1})^2$$

$$w_{t+1}=w_t-\frac{\alpha}{\sqrt{G_{t+1}+\epsilon}}\,g_{t+1}$$

  • $\gamma$ : the decay rate of the exponential moving average.
  • $\alpha$ : the learning rate.
  • $\epsilon$ : a small constant that keeps the denominator from reaching zero.
  • The accumulated update history is still considered, but recent updates count for more.
  • $\sqrt{G_{t+1}+\epsilon}$ : the per-parameter denominator that scales each update.
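A minimal sketch of the RMSProp update above; $\gamma=0.9$ matches a commonly used value, while the objective and the remaining values are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, G, grad, gamma=0.9, alpha=0.01, eps=1e-8):
    G = gamma * G + (1 - gamma) * grad ** 2   # exponential moving average of g^2
    w = w - alpha / np.sqrt(G + eps) * grad   # scaled mainly by recent gradients
    return w, G

w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(1000):
    grad = 2 * w                              # gradient of f(w) = ||w||^2
    w, G = rmsprop_step(w, G, grad)
print(w)  # near 0: unlike Adagrad, the step size does not decay to zero
```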

Adam (Adaptive Moment Estimation)

  • An optimizer built as RMSProp + Momentum, combining the strengths of both algorithms.
  • The most commonly used optimizer.
  • A blog post that explains Adam well is attached (to be summarized later):
  • https://velog.io/@yookyungkho/%EB%94%A5%EB%9F%AC%EB%8B%9D-%EC%98%B5%ED%8B%B0%EB%A7%88%EC%9D%B4%EC%A0%80-%EC%A0%95%EB%B3%B5%EA%B8%B0%EB%B6%80%EC%A0%9C-CS231n-Lecture7-Review
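Since the source defers the details to the linked post, here is only a minimal sketch of the standard Adam update (first and second moment estimates with bias correction); $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$ are the defaults suggested in the Adam paper, and the objective and $\alpha$ are illustrative:

```python
import numpy as np

def adam_step(w, m, v, grad, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (Momentum side)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp side)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):                      # t starts at 1 for bias correction
    grad = 2 * w                              # gradient of f(w) = ||w||^2
    w, m, v = adam_step(w, m, v, grad, t)
print(w)  # settles near 0
```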