CS231n Lecture 3: Loss Functions and Optimization


 

 

Now that we have defined a linear classifier, what should we do next?

 

์šฐ์„  ์ข‹์€ ๊ฐ€์ค‘์น˜(W)๋ฅผ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. 

๊ทธ๋ ‡๋‹ค๋ฉด ์šฐ๋ฆฌ์˜ ๊ฐ€์ค‘์น˜๊ฐ€ ์ข‹์€์ง€ ๋‚˜์˜์ง€๋ฅผ ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์„๊นŒ?

=> W๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„์„œ ์Šค์ฝ”์–ด๋ฅผ ํ™•์ธํ•˜๊ณ  ์šฐ๋ฆฌ์˜ W๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ข‹๊ณ  ๋‚˜์œ์ง€๋ฅผ ์ •๋Ÿ‰ํ™” ํ•ด์ฃผ๊ธฐ ์œ„ํ•œ ์†์‹คํ•จ์ˆ˜๊ฐ€ ํ•„์š”

=> ์†์‹คํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด์„œ  Loss๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ์ตœ์ ์˜ W๋ฅผ ์ฐพ์•„์•ผ ํ•จ(์ตœ์ ํ™”) 

 

  1. Define a loss function that quantifies our unhappiness with the scores across the training data.
  2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

In short, we define a loss function that measures the gap between our model's predictions and the ground-truth labels,

and we need an optimization method that minimizes this loss function.

 

 

 

 

Loss function

loss function์€ ์‹ค์ œ ์ •๋‹ต ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ์ด ์ •๋‹ต์ด๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๋ฐ์ดํ„ฐ ์‚ฌ์ด์˜ ์ฐจ์ด๋กœ, ์šฐ๋ฆฌ์˜ classifier์˜ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด์ค€๋‹ค.

 

Suppose we have a dataset like the one below, where x_i is an image and y_i is its label.

$$\{(x_i, y_i)\}_{i=1}^{N}$$

The loss over this dataset is then:

$$L = \frac{1}{N}\sum_{i} L_i(f(x_i, W), y_i)$$

 

์ด๋Š” N๊ฐœ์˜ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•˜์—ฌ ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ฐ’์ธ f(x, W)์™€ ์‹ค์ œ ์ •๋‹ต๊ฐ’์ธ y ์‚ฌ์ด์˜ Loss๋ฅผ ๊ตฌํ•˜์—ฌ ํ•ฉํ•œ ๊ฐ’์ด๋‹ค. ์ฆ‰ ์ „๋ฐ˜์ ์ธ dataset์— ๋Œ€ํ•œ Loss๋ฅผ ๊ตฌํ•œ ๊ฒƒ์ด๋‹ค. 

 

Multiclass SVM loss

 

The multiclass SVM loss is a generalization of the (binary) SVM to handle multiple classes.

 

The SVM loss for one example is written as follows.

$$L_i = \sum_{j \neq y_i}\max(0,\, s_j - s_{y_i} + 1)$$

 

SVM loss์—์„œ๋Š” ์ •๋‹ต label์˜ score ๊ฐ’๊ณผ ๋‚˜๋จธ์ง€ ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๊ฐ’๋“ค๊ฐ„์˜ ์ฐจ๋ฅผ ๊ตฌํ•˜๊ฒŒ ๋œ๋‹ค. 

๋งŒ์•ฝ ์ •๋‹ต label์— ํ•ด๋‹นํ•˜๋Š” score์˜ ๊ฐ’์ด ๋‹ค๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ score ๊ฐ’๋ณด๋‹ค ๋†’๊ณ  (= ์ž˜ ์˜ˆ์ธกํ•จ)

๋‘ score ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ safety margin(์˜ˆ์‹œ์—์„œ๋Š” 1) ์ด์ƒ์ด๋ผ๋ฉด 

์ •๋‹ต label์˜ score ๊ฐ’์ด ์ด์ƒ์ ์ธ ๊ฒƒ์œผ๋กœ Loss ๊ฐ’์€ 0์ด ๋œ๋‹ค.

 

์ด์™€ ๊ฐ™์ด 0๊ณผ ๋‹ค๋ฅธ ๊ฐ’์˜ ์ตœ๋Œ€๊ฐ’ Max(0, value)์™€ ๊ฐ™์€ ํ˜•์‹์˜ ์†์‹ค ํ•จ์ˆ˜๋ฅผ hinge loss ๋ผ๊ณ ๋„ ๋ถ€๋ฅธ๋‹ค. 

์šฐ์ธก ์ƒ๋‹จ์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด x์ถ•์ด s_{y_i}์ด๊ณ  y์ถ•์ด loss๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

์ •๋‹ต ์นดํ…Œ๊ณ ๋ฆฌ์˜ ์ ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•  ์ˆ˜๋ก loss๋Š” ์„ ํ˜•์ ์œผ๋กœ ์ค„์–ด๋“ค๋ฉฐ safety margin์„ ๋„˜์–ด์„œ๋ฉด loss๋Š” 0์ด ๋œ๋‹ค.

 

์ˆ˜์‹์œผ๋กœ ์‚ดํŽด๋ณด๋ฉด s_j ๋Š” ์ •๋‹ต์ด ์•„๋‹Œ label์˜ ๊ฐ’์ด๊ณ , s_{y_i}๋Š” ์ •๋‹ต label์˜ score ๊ฐ’์ด๋‹ค.

๊ณ ์–‘์ด ์‚ฌ์ง„์„ ์˜ˆ๋กœ ๋“ค๋ฉด ๊ณ ์–‘์ด๋ฅผ car์™€ frog๋ผ๊ณ  ํ•œ ๊ฐ’๋“ค(5.1, -1.7)์ด s_j์ธ ๊ฒƒ์ด๊ณ  3.2๊ฐ€ s_y_i ์ด๋‹ค.

 

๊ณ ์–‘์ด ์ด๋ฏธ์ง€์—์„œ cat์ด๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ score์ธ 3.2์™€ car๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ score 5.1๋กœ ๋ณด๋ฉด car๋ผ๊ณ  ์˜ˆ์ธกํ•œ score ์˜ ๊ฐ’์ด ํฌ๋‹ˆ loss ๊ฐ’์ด ์ปค์•ผํ•œ๋‹ค๊ณ  ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‹ค. loss ๊ฐ’์„ ๊ตฌํ•ด๋ณด๋ฉด max(0, 5.1-3.2+1) = max(0, 2.9) = 2.9

 

๋‹ค์Œ์œผ๋กœ cat์ด๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ score 3.2์™€ frog๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ score -1.7 ์„ ๋ณด๋ฉด cat ์˜ score๊ฐ€ ๋” ํฌ๋‹ˆ loss๋Š” 0์ด๋  ๊ฒƒ์ด๋ผ๊ณ  ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค. loss๊ฐ’์„ ๊ตฌํ•ด๋ณด๋ฉด max(0, -1.7-3.2+1) = max(0, -3.9) = 0

 

In other words, whenever a wrongly predicted score is larger than the score of the class that should have won, our loss becomes greater than 0.
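Putting those two terms together in code, a minimal numpy sketch using the cat column's scores from above:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # cat (correct), car, frog
y = 0                                  # index of the correct class (cat)

margins = np.maximum(0, scores - scores[y] + 1)  # [1.0, 2.9, 0.0] after the max
margins[y] = 0                                   # drop the j = y_i term
print(np.sum(margins))                           # 2.9 + 0 = 2.9, the SVM loss for this image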

 

So why do we include the safety margin at all?

 

If the gap between s_{y_i} and s_j is smaller than the safety margin, a loss is incurred. In other words, the correct class has to win by at least the safety margin before we count it as a properly confident, correct choice.

 

For example, suppose that for the cat image the score predicted for cat is 3 and the score predicted for car is also 3. Without a safety margin the loss would be max(0, 3 - 3) = 0. But if cat and car get exactly the same score, something is clearly wrong, so we demand that the correct score beat every other score by at least 1; that is the safety margin. Recomputing the loss with the margin: max(0, 3 - 3 + 1) = max(0, 1) = 1. Adding the safety margin turned a loss of 0 into a loss of 1.

 

To restate, the safety margin means the model tries to keep a sufficiently wide gap between the correct class and the other classes.

 

Then how do we pick the value of the safety margin? In fact we never care about the exact score values, only the relative differences between scores; what we want is for the correct class score to be some amount larger than the other label scores. If we later scale the matrix W, the particular value 1 stops mattering much because it gets absorbed by the scale of W, so for now it is enough to understand the loss and the safety margin and move on.

 

๊ฐ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ loss ๊ฐ’์„ ๊ตฌํ–ˆ๋‹ค๋ฉด ์ด๋ฅผ ๋ชจ๋‘ ๋”ํ•œ๋‹ค์Œ ํ‰๊ท ์„ ๋‚ด์ค€๊ฒŒ ์šฐ๋ฆฌ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ L ๊ฐ’์ด ๋œ๋‹ค.

 

 

Q: What happens to loss if car scores change a bit?

A: Even if the car score changes a bit, the car score is already higher than the other scores, so the loss will not change.

 

Q2: what is the min/max possible loss?

A2: The minimum is 0 and the maximum is infinity.

 

Q3: At initialization W is small so all s ≈ 0. What is the loss?

A3: ํด๋ž˜์ˆ˜์˜ ์ˆ˜ -1 ์ด๋‹ค. ์šฐ๋ฆฌ๋Š” loss๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ์ •๋‹ต์ด ์•„๋‹Œ ํด๋ž˜์Šค๋ฅผ ์ˆœํ™”ํ•˜๊ฒŒ ๋˜๋Š”๋ฐ margin์œผ๋กœ ์ธํ•ด ๊ฐ iteration๋งˆ๋‹ค 1์„ ์–ป๊ฒŒ ๋  ๊ฒƒ์ด๊ณ  ๊ฒฐ๋ก ์ ์œผ๋กœ C-1๋ฒˆ ์ˆœํšŒํ•˜๋ฏ€๋กœ ์šฐ๋ฆฌ์˜ socre๊ฐ€ 0์— ๊ฐ€๊น๋‹ค๋ฉด loss๋Š” ํด๋ž˜์ˆ˜์˜ ์ˆ˜ -1 ์˜ ๊ฐ’์„ ๊ฐ€์ง€๋ฐ ๋œ๋‹ค. ์ด๋Š” ๋””๋ฒ„๊น…์—์„œ ์œ ์šฉํ•˜๊ฒŒ ์“ธ ์ˆ˜ ์žˆ๋Š”๋ฐ ๋งŒ์•ฝ ์œ„์™€ ๊ฐ™์€ ์กฐ๊ฑด์—์„œ ํ•™์Šต์„ ์‹œ์ผฐ๋Š”๋ฐ loss๊ฐ€ C-1์ด ์•„๋‹ˆ๋ผ๋ฉด ๋ฒ„๊ทธ๊ฐ€ ์žˆ๋‹ค๊ณ  ์ง์ž‘ํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

Q4: What if the sum was over all classes? (including j = y_i)

A4: When computing the SVM loss we excluded the correct class and only looked at the differences (incorrect score - correct score). If we also included the correct class in the sum, the loss would simply increase by 1. The reason we leave the correct class out is that a minimum loss of 0 is the nicest interpretation of "doing well".

 

Q5: What if we used mean instead of sum?

A5: ํฌ๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๋Š”๋‹ค. ์™œ๋ƒํ•˜๋ฉด ํ‰๊ท ์„ ์ทจํ•˜๋Š”๊ฑด ๊ทธ๋ƒฅ ์†์‹คํ•จ์ˆ˜๋ฅผ resacling ํ•˜๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

Q6: What if we used 

$$L_i = \sum_{j \neq y_i}\max(0,\, s_j - s_{y_i} + 1)^2$$

A6: The result does change. Terms that were 0 stay 0, but when a wrong class scores higher, squaring the hinge term penalizes it much more heavily. Where the plain hinge loss does not care much about the difference between being slightly wrong and being very wrong, the squared hinge loss really does not want any example to be classified very badly.

 

Multiclass SVM Loss: Example code

import numpy as np

def L_i_vectorized(x, y, W):
    # scores for every class; W is (num_classes x D), x is a single example of length D
    scores = W.dot(x)
    # hinge terms against the correct class score, with safety margin 1
    margins = np.maximum(0, scores - scores[y] + 1)
    # the correct class should not contribute to the loss
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i
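For Q6's squared hinge loss, only the margin computation would change; a minimal sketch of that variant:

import numpy as np

def L_i_squared_hinge(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1) ** 2  # square each hinge term
    margins[y] = 0
    return np.sum(margins)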

 

 

 

 

So far we used the loss function to quantify which W we care about, but simply picking whatever W gives loss 0 is problematic: the loss only pays attention to the training data and says nothing about performance on test data.

 

If the blue circles are the training data and the green squares are the test data, the classifier we actually want is the green line. Regularization is one way to deal with the problem of fitting only the training data.

 

์šฐ๋ฆฌ๊ฐ€ ์•ž์—์„œ ์ •์˜ํ•œ ์†์‹คํ•จ์ˆ˜์— Regularization ์ด๋ผ๋Š” ํ•ญ์„ ํ•˜๋‚˜ ์ถ”๊ฐ€ํ•ด์ค€๋‹ค. loss fucntion์ด trainindg dataset์— fitํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ๋ชฉ์ ์ด์—ˆ๋‹ค๋ฉด, Regularization์€ ๋ชจ๋ธ์ด simple W๋ฅผ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค. 

 

Honestly, when I first saw the word regularization I immediately thought of normalization and got the two mixed up; the blog below is a good reference on the difference.

https://hichoe95.tistory.com/55

 

L2 Regularization (Weight Decay)

Regularization์—๋„ ์—ฌ๋Ÿฌ ์ข…๋ฅ˜๊ฐ€ ์žˆ๋Š”๋ฐ, L2 Regularization(Weight Decay)๊ฐ€ ๊ฐ€์žฅ ๋ณดํŽธ์ ์ด๋‹ค.

 

L2 regularization penalizes the Euclidean norm of the weight matrix W.

L1 regularization penalizes the absolute values of W, which pushes W toward being a sparse matrix.

Elastic net regularization is a mix of L1 and L2.
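Written as formulas (the standard forms from the lecture slides):

$$\text{L2: } R(W) = \sum_k\sum_l W_{k,l}^2 \qquad \text{L1: } R(W) = \sum_k\sum_l |W_{k,l}| \qquad \text{Elastic net: } R(W) = \sum_k\sum_l \beta W_{k,l}^2 + |W_{k,l}|$$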

 

๊ฒฐ๋ก ์ ์œผ๋กœ Regularization์€ ๋ชจ๋ธ์ด training dataset์— ์™„์ „ํžˆ fitํ•˜์ง€ ์•Š๊ฒŒ๋” ๋ชจ๋ธ์˜ ๋ณต์žก๋„์— ํŒจ๋„ํ‹ฐ๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

 

Then how does regularization measure the complexity of a model?

 

์œ„์˜ ์˜ˆ์‹œ์—์„œ x๋Š” train data์ด๊ณ  w๋Š” ๊ฐ€์ค‘์น˜์ด๋‹ค.

x์™€ w๋ฅผ ๊ฐ€์ง€๊ณ  dot product(๋‚ด์ )์„ ํ•ด๋ณด์ž. xw1, xw2๋ชจ๋‘ ๋‚ด์  ๊ฒฐ๊ณผ ๊ฐ’์€ 1์ด ๋‚˜์˜จ๋‹ค.

ํ•˜์ง€๋งŒ L2 Regularization์€ w2์˜ norm์ด ๋” ์ž‘์œผ๋ฏ€๋กœ w2๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. L2 Regularization์€ x์˜ ๋ชจ๋“  ์š”์†Œ๊ฐ€ W์— ์˜ํ–ฅ์„ ์ฃผ๊ธธ ๋ฐ”๋ผ๋ฉฐ ๋ณ€๋™์ด ์‹ฌํ•œ ํŠน์ • ์ž…๋ ฅ๋ณด๋‹ค๋Š” ๋ชจ๋“  x์˜ ์š”์†Œ๊ฐ€ ๊ณจ๊ณ ๋ฃจ ์˜ํ–ฅ์„ ๋ฏธ์น˜๊ธธ ์›ํ•  ๋•Œ ์‚ฌ์šฉ

 

L1, on the other hand, defines complexity differently: it measures model complexity through the number of zeros in the weight matrix W, and it generally prefers sparse solutions.
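A small sketch of that comparison; the vectors below are the ones I remember from the slide (x = [1, 1, 1, 1], w1 = [1, 0, 0, 0], w2 = [0.25, 0.25, 0.25, 0.25]), so treat the exact numbers as illustrative:

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])           # concentrated on one input dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])       # spread evenly over all dimensions

print(x.dot(w1), x.dot(w2))                   # both dot products are 1.0
print(np.sum(w1 ** 2), np.sum(w2 ** 2))       # L2 penalty: 1.0 vs 0.25 -> L2 prefers w2
print(np.sum(np.abs(w1)), np.sum(np.abs(w2))) # L1 penalty: 1.0 vs 1.0 here; in general L1 favors sparse, w1-style solutions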

 

๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ์ฃผ์–ด์ง„ ๋ฌธ์ œ์™€ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ์–ด๋–ค ๊ฒƒ์„ ์‚ฌ์šฉํ• ์ง€ ์„ ํƒํ•ด์•ผ ํ•œ๋‹ค.

 

Things to study further

Bayes' theorem, MLE/MAP, and how the regularization term actually exerts its influence in practice.

 

 

 

Softmax Classifier (Multinomial Logistic Regression)

SVM Loss์—์„œ ์šฐ๋ฆฌ๋Š” ๊ทธ์ € ์ •๋‹ต ํด๋ž˜์Šค๊ฐ€ ๋‹ค๋ฅธ ํด๋ž˜์Šค๋ณด๋‹ค ๋†’์€ ์Šค์ฝ”์–ด๋ฅผ ๋‚ด๊ธฐ๋งŒ ์›ํ–ˆ์„ ๋ฟ ์Šค์ฝ”์–ด ์ˆซ์ž ์ž์ฒด์— ๋Œ€ํ•œ ํ•ด์„์€ ํ•˜์ง€ ์•Š์•˜๋‹ค. 

 

Softmax (multinomial logistic regression) gives meaning to the scores themselves: softmax turns the scores into probabilities, and we want the probability of the correct class to be close to 1.

 

The loss is then -log(probability of the correct class). Log is a monotonically increasing function, and maximizing the log of the probability is easier than maximizing the probability itself. But maximizing log P means we want log P to be high, whereas a loss function, as the name suggests, measures "how badly" we are doing, so we attach a minus sign to the log.
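Written as a formula, the softmax (cross-entropy) loss for one example is:

$$L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$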

 

* Strictly speaking, the softmax function is an activation function; the loss we compute on top of it is the cross-entropy.

 

When computing the probabilities we exponentiate the scores, which prevents negative predictions and keeps the denominator from becoming 0. The natural base e is used because (1) it is convenient to differentiate and (2) it stretches large values apart from small ones, making them easier to tell apart.

 

 

Unlike the SVM, we do not use the scores directly: we exponentiate the scores, normalize them so they sum to 1,

and then apply -log only to the correct class's probability.
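A minimal sketch of that computation, reusing the cat scores [3.2, 5.1, -1.7] from earlier (subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result):

import numpy as np

def softmax_loss(scores, y):
    shifted = scores - np.max(scores)                  # shift so the largest score is 0 (stability)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # exponentiate, then normalize to sum to 1
    return -np.log(probs[y])                           # -log of the correct class probability

scores = np.array([3.2, 5.1, -1.7])   # cat (correct), car, frog
print(softmax_loss(scores, y=0))      # ~2.04 with the natural log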

 

Q: What is the min/max possible loss L_i?

A: The minimum is 0 and the maximum is infinity. If we get the answer completely right, the correct class's probability is 1; if we get it badly wrong, the probability approaches 0. A probability of 1 gives a loss of 0, while a probability of 0 gives -log(0), which blows up toward positive infinity, so the loss becomes extremely large.

 

Q2: Usually at initialization W is small so all s ≈ 0. What is the loss?

A2: It becomes -log(1/C) = log(C). For example, with C = 10 classes the initial loss should be around log(10) ≈ 2.3, which again makes for a useful sanity check.

 

 

Softmax vs SVM

 

SVM์€ ์ •๋‹ต ์Šค์ฝ”์–ด์™€ ์ •๋‹ต์ด ์•„๋‹Œ ์Šค์ฝ”์–ด ๊ฐ„์˜ margins์„ ์‹ ๊ฒฝ์“ฐ๊ณ  ์ผ์ • margins์„ ๋„˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋” ์ด์ƒ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ์‹ ๊ฒฝ์“ฐ์ง€ ์•Š์Œ

softmax๋Š” ํ™•๋ฅ ์„ ๊ตฌํ•ด์„œ -log P ์ฒ˜๋ฆฌํ•จ. ์ด๋ฏธ ์ •๋‹ต ํด๋ž˜์Šค์˜ ํ™•๋ฅ ์ด ๋‹ค๋ฅธ ํด๋ž˜์Šค๋ณด๋‹ค ๋†’์•„๋„ ์„ฑ๋Šฅ์„ ๋” ๋†’์ด๋ ค๊ณ  

 

 

 

 

 

Optimization

 

 

 

์šฐ๋ฆฌ์˜ ์ตœ์ข… ๋ชฉํ‘œ๋Š” ์•ž์—์„œ ์ •์˜ํ•œ loss fucntion์ด ์ตœ์†Œ๊ฐ€ ๋˜๊ฒŒ ํ•˜๋Š” ๊ฐ€์ค‘์น˜ W๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค.

๊ทธ๋ ‡๋‹ค๋ฉด ์‹ค์ œ Loss๋ฅผ ์ค„์ด๋Š” W๋ฅผ ์–ด๋–ป๊ฒŒ ์ฐพ์„ ์ˆ˜ ์žˆ์„๊นŒ?

 

Imagine being dropped somewhere in a valley without knowing where you are; somehow you have to find the valley floor.

Here your altitude is the loss, and the surrounding mountains and valley are W. The loss changes as W changes, and our goal is to find the lowest loss.

 

์ด ์ƒํ™ฉ์—์„œ ๊ฐ€์žฅ ๋จผ์ € ์‹œ๋„ํ•ด ๋ณผ ์ˆ˜ ์žˆ๋Š” ๋‹จ์ˆœํ•˜๊ณ ๋„ ๋น„ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ•์€ random search์ด๋‹ค.

 

Strategy #1: A first very bad idea solution: Random search

 

์ž„์˜๋กœ ์ƒ˜ํ”Œ๋งํ•œ W๋ฅผ ์—„์ฒญ ๋งŽ์ด ๋ชจ์€๋‹ค์Œ loss๋ฅผ ๊ณ„์‚ฐํ•ด์„œ ์–ด๋–ค W๊ฐ€ ์ข‹์€์ง€๋ฅผ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์ด๋‹ค.

 

import numpy as np
# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. 1D array of 50,000)
# assume the function L evaluates the loss function

bestloss = float("inf")  # Python assigns the highest possible float value
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)  # get the loss over the entire training set
    if loss < bestloss:
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

 

This gets about 15% accuracy.

 

Strategy #2: Follow the slope

Instead of plain random search, let's use the local geometry: feel the slope and walk downhill.

Here the slope is the derivative of the function.

 

 

To find out how steep the function is in a particular direction, take the dot product of the unit vector for that direction with the gradient vector.

The gradient gives the first-order linear approximation of the function at a point.

 

1. Derive an expression for the gradient (using calculus), and

2. compute the full gradient dW in one shot.

 

In summary:

- Numerical gradient: approximate, slow, easy to write

- Analytic gradient: exact, fast, error-prone

=> In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
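A rough sketch of the numerical side of such a check, using centered differences (f and W here stand for any scalar loss function and any parameter array):

import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # estimate dL/dW one element at a time: slow, approximate, but easy to write
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                          # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Compare this against the analytic gradient derived by hand; the relative error should be tiny.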

 

 

 

Gradient Descent

#Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad # perform parameter update

 

These three lines of code contain the core idea of how to train any neural network, no matter how big or complicated.

 

In gradient descent we first initialize W to random values.

Then we compute the loss and the gradient,

and update the weights in the direction opposite to the gradient. (The gradient points in the direction of increase, so stepping along -gradient takes us downhill.)

 

*The step size, also called the learning rate, is one of the most important hyperparameters we have to set for training.

 

 

Red: regions where the loss is low

Blue: regions where the loss is high

 

์šฐ๋ฆฌ๋Š” ์ž„์ด์˜ ์  W์—์„œ ์ถœ๋ฐœํ•˜์—ฌ -gradient๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด์„œ ์ด ๊ฐ’์œผ๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ์—…๋ฐ์ดํŠธ ํ•ด๊ฐ€๋ฉด์„œ

cost๊ฐ€ ๊ฐ€์žฅ ๋‚ฎ์€ ์ง€์ ์— ๋„๋‹ฌํ•˜๊ฒŒ ๋œ๋‹ค.

 

Besides plain gradient descent, there are also update rules that use momentum or the Adam optimizer to update W.

 

 

 

Stochastic Gradient Descent (SGD)

 

 

์šฐ๋ฆฌ๋Š” ์†์‹คํ•จ์ˆ˜ L์„ ์ •์˜ํ–ˆ๊ณ , training dataset์˜ ์ „์ฒด Loss๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„œ Loss ๋“ค์˜ ํ‰๊ท ์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

ํ•˜์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” N์ด ์—„์ฒญ๋‚˜๊ฒŒ ์ปค์งˆ ์ˆ˜ ์žˆ๊ณ  ์ด๋Š” ๊ณ„์‚ฐ์ ์œผ๋กœ ๋น„ํšจ์œจ์ ์ด๋‹ค.

 

The gradient is a linear operator, so if you look at how the gradient is actually computed,

you find that it is the sum of the gradients of the per-example losses.

 

So computing the gradient one more time means looping over all N training examples again, and a single update of W ends up taking a very long time.

=> In practice, we usually use SGD (Stochastic Gradient Descent) instead.

 

์ „์ฒด ๋ฐ์ดํ„ฐ ์…‹์˜ gradient์™€ loss๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ๋ณด๋‹ค๋Š” Minibatch๋ผ๋Š” ์ž‘์€ ํŠธ๋ ˆ์ด๋‹ ์ƒ˜ํ”Œ ์ง‘ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ ์„œ ํ•™์Šต์„ ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค. (Minibatch๋Š” ๋ณดํ†ต 2์˜ ์Šน์ˆ˜๋กœ ์ •ํ•˜๋ฉฐ 32, 64, 128์„ ์ฃผ๋กœ ์“ด๋‹ค)

 

The minibatch is used to compute an "estimate" of the full loss and an "estimate" of the true gradient.

 

 

#Vanilla Minibatch Gradient Descent

while True:
    data_batch = sample_training_data(data, 256) #sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad # perform parameter update

 

์ž„์˜์˜ minibatch๋ฅผ ๋งŒ๋“ค๊ณ , minibatch ๋‚ด์—์„œ loss์™€ gradient๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  W๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค.

(Loss์˜ ์ถ”์ •์น˜์™€ Gradient์˜ ์ถ”์ •์น˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ)

 

 

์›น ๋ฐ๋ชจ

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/

