Summary 🤙


Gradient Descent can be broadly classified as follows.

  • Stochastic Gradient Descent : compute the gradient from a single sample and update
  • Mini-batch Gradient Descent : compute the gradient from a subset of the data and update
  • Batch Gradient Descent : compute the gradient from the entire dataset and update

A large batch size tends to converge to sharp minimizers, while a small batch size tends to converge to flat minimizers (On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2017).
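Below is a minimal sketch of the distinction, using an assumed toy dataset and a simple mean-squared-error gradient (none of the names or values come from the post): the three variants differ only in how many samples feed the gradient computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)   # toy dataset (assumed)
W = np.zeros(3)

def gradient(W, X_batch, y_batch):
    """Gradient of the mean squared error 0.5 * ||X W - y||^2 / n."""
    err = X_batch @ W - y_batch
    return X_batch.T @ err / len(y_batch)

g_sgd   = gradient(W, X[:1],  y[:1])    # Stochastic: one sample
g_mini  = gradient(W, X[:16], y[:16])   # Mini-batch: a subset
g_batch = gradient(W, X,      y)        # Batch: all data
```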



Stochastic Gradient Descent


\[W_{t+1} = W_t - \eta g_t\]

This is the most basic form of Gradient Descent.

Here it is important to set the learning-rate hyperparameter $\eta$ to an appropriate value.
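
As a small illustration, here is a minimal sketch of the SGD step above; `eta` and the example gradient are assumed placeholder values, not taken from the post.

```python
import numpy as np

def sgd_step(W, g, eta=0.01):
    """W_{t+1} = W_t - eta * g_t"""
    return W - eta * g

W = np.zeros(3)
g = np.array([0.2, -0.1, 0.4])      # an example gradient
W = sgd_step(W, g)
```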



Momentum


\[\begin{aligned} a_{t+1} & \leftarrow \beta a_{t}+g_{t} \\ W_{t+1} & \leftarrow W_{t}-\eta a_{t+1} \end{aligned}\]

์ด์ „ ํ•™์Šต์˜ ๊ด€์„ฑ์„ ์œ ์ง€์‹œ์ผœ Gradient๊ฐ€ ๋ณ€ํ•˜๋”๋ผ๋„ ์ด์ „ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•œ๋‹ค.
$\beta$๋Š” Momentum์˜ ๊ฐ’์„ ๋‚˜ํƒ€๋‚ด๋Š” hyper parmeter์ด๋‹ค.
$a_t$๋Š” accumulation์œผ๋กœ์„œ ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ •ํ•˜๋Š” ๋ฐ ์‹ค์ œ๋กœ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค.

However, because of this inertia, it can fail to converge at the point where it should stop (a local minimum).
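
A minimal sketch of the momentum update above; `beta` and `eta` are assumed placeholder values.

```python
import numpy as np

def momentum_step(W, a, g, eta=0.01, beta=0.9):
    a = beta * a + g        # a_{t+1} = beta * a_t + g_t
    W = W - eta * a         # W_{t+1} = W_t - eta * a_{t+1}
    return W, a

W, a = np.zeros(3), np.zeros(3)     # the accumulation a_0 starts at zero
```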



Nesterov accelerated Gradient(NAG)


\[\begin{aligned} a_{t+1} & \leftarrow \beta a_{t}+\nabla \mathcal{L}\left(W_{t}-\eta \beta a_{t}\right) \\ W_{t+1} & \leftarrow W_{t}-\eta a_{t+1} \end{aligned}\]

Momentum์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. Momentum์€ ๊ด€์„ฑ๊ณผ Gradient๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜์ง€๋งŒ, NAG๋Š” Momentum๋งŒํผ ๋จผ์ € ์ด๋™ํ•œ ๋’ค Gradient๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

๋”ฐ๋ฆฌ์„œ Momentum์˜ ์žฅ์ ์€ ์œ ์ง€ํ•˜๋ฉด์„œ, ๋ฉˆ์ถฐ์•ผํ•  ์ง€์ ์— ๋ฉˆ์ถ”๋Š”๋ฐ์— ํ›จ์”ฌ ์šฉ์ดํ•˜๋‹ค.
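
A minimal sketch of the NAG update above; `grad_fn` is an assumed callable that returns the gradient at a given point.

```python
import numpy as np

def nag_step(W, a, grad_fn, eta=0.01, beta=0.9):
    g_lookahead = grad_fn(W - eta * beta * a)   # gradient after the momentum move
    a = beta * a + g_lookahead                  # a_{t+1}
    return W - eta * a, a                       # W_{t+1}, a_{t+1}
```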



Adagrad


\[W_{t+1}=W_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} g_{t}\]

$G_t$ : sum of Gradient squares
$\epsilon$ : for numerical stability

Adagrad (Adaptive Gradient) adjusts the learning rate according to how much each parameter has changed so far: parameters that have changed a lot get a smaller learning rate, while parameters that have changed little get a larger one.

However, as training progresses, $G_t$ keeps growing, so the learning rate approaches zero (it is monotonically decreasing), which can effectively stop learning.
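
A minimal sketch of the Adagrad update above; `eta` and `eps` are assumed defaults.

```python
import numpy as np

def adagrad_step(W, G, g, eta=0.01, eps=1e-8):
    G = G + g ** 2                          # accumulate squared gradients
    W = W - eta / np.sqrt(G + eps) * g      # per-parameter effective learning rate
    return W, G
```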



Adadelta


\[\begin{aligned} G_{t} &=\gamma G_{t-1}+(1-\gamma) g_{t}^{2} \\ W_{t+1} &=W_{t}-\frac{\sqrt{H_{t-1}+\epsilon}}{\sqrt{G_{t}+\epsilon}} g_{t} \\ H_{t} &=\gamma H_{t-1}+(1-\gamma)\left(\Delta W_{t}\right)^{2} \end{aligned}\]

$G_t$ : EMA of Gradient squares
$H_t$ : EMA of difference squares

Adagrad์˜ ๋ฌธ์ œ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด EMA(exponential moving average)๋ฅผ ์ทจํ•œ๋‹ค.

๊ฒฐ๊ณผ์ ์œผ๋กœ๋Š” ์ตœ๊ทผ์— Gradient์— ๋”ฐ๋ผ ๋ฐ˜๋Œ€๋กœ ํ•™์Šต๋ฅ ์„ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

ํŠน์ง•์€ ๋ช…์‹œ์ ์ธ Learning Rate๊ฐ€ ์—†๋‹ค๋Š” ์ ์ด๋‹ค.
๋”ฐ๋ผ์„œ ์ปค์Šคํ…€ ์˜์—ญ์ด ์ ์–ด ์‹ค์ œ๋กœ๋Š” ์ž˜ ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š”๋‹ค.



RMSprop


\[\begin{aligned} G_{t} &=\gamma G_{t-1}+(1-\gamma) g_{t}^{2} \\ W_{t+1} &=W_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} g_{t} \end{aligned}\]

$G_t$ : EMA of Gradient squares

๋…ผ๋ฌธ์„ ํ†ตํ•ด ์†Œ๊ฐœ๋œ ๋ฐฉ๋ฒ•๋ก ์€ ์•„๋‹ˆ๊ณ  Geoff Hinton์˜ ๊ฐ•์˜์—์„œ ์†Œ๊ฐœ๋œ ๋…ํŠนํ•œ ๋ฐฉ๋ฒ•์ด๋‹ค.

ํŠน์ง•์€ Adadelta์— Learning rate๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.



Adam


\[\begin{aligned} m_{t} &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_{t} \\ v_{t} &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right) g_{t}^{2} \\ W_{t+1} &=W_{t}-\frac{\eta}{\sqrt{v_{t}+\epsilon}} \frac{\sqrt{1-\beta_{2}^{t}}}{1-\beta_{1}^{t}} m_{t} \end{aligned}\]

$m_t$ : Momentum
$v_t$ : EMA of Gradient squares

Adaptive Moment Estimation

Adam combines the two ideas introduced above, Momentum and the adaptive learning rate.
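
A minimal sketch of the Adam step written exactly as the formula above, with the bias correction folded into the step size; the default values are assumptions.

```python
import numpy as np

def adam_step(W, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g             # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2        # EMA of gradient squares
    scale = eta * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    W = W - scale / np.sqrt(v + eps) * m        # bias-corrected step, t starts at 1
    return W, m, v
```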