
CS231n Lecture 4: Introduction to Neural Networks


Lecture 3 Recap

Define the classifier (network) as a function F (x: input data, W: weights, output: score vector).

Use a loss function to measure how poorly our function F is performing (e.g. SVM, BCE, etc.).

Add a regularization term to prevent F from fitting only the training dataset (our real goal is to do well on the test dataset).

Use gradient descent to find the W with the lowest loss.
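As a refresher, the update itself is only a few lines. Below is a minimal sketch, where grad_fn is a hypothetical function standing in for whatever analytic gradient of the loss we derived in Lecture 3:

# vanilla gradient descent (sketch; grad_fn is a placeholder name)
def train(W, grad_fn, learning_rate=1e-3, steps=1000):
    for _ in range(steps):
        dW = grad_fn(W)               # analytic gradient of the loss w.r.t. W
        W = W - learning_rate * dW    # step against the gradient
    return W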

 

Computational graphs

Ultimately, we will use the analytic gradient so that gradients can be computed automatically.

Let's use a computational graph to trace the steps of computing the analytic gradient.

 

์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ ๊ฐ ๋…ธ๋“œ๋Š” ์—ฐ์‚ฐ ๋‹จ๊ณ„๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

 

Looking at backpropagation through a computational graph: to obtain the gradients, we apply the chain rule recursively to every variable in the graph.

 

Let's check what this means in the example below.

 

 

Backpropagation: a simple example

์œ„์— ์˜ˆ์‹œ์—์„œ x, y, z๊ฐ€ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜ค๊ณ  ์ค‘๊ฐ„ ๋…ธ๋“œ๋“ค์—์„œ ๊ณ„์‚ฐ์ด ์ด๋ฃจ์–ด์ง„๋‹ค.

 

The final output of f is -12, and the derivative of that output with respect to itself (its effect on the loss) is 1.

Now, applying the chain rule backward through the graph, we can compute the derivative at each intermediate step.

์ค‘๊ฐ„ ๊ฐ’์˜ ๋ฏธ๋ถ„๊ฐ’์€ ๊ฐ ์š”์†Œ๊ฐ€ ์ตœ์ข… loss์— ์–ผ๋งŒํผ์˜ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€๋ผ๊ณ  ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

์šฐ๋ฆฌ๋Š” ๊ฐ ๋…ธ๋“œ์™€ ๊ฐ ๋…ธ๋“œ์˜ local ์ž…๋ ฅ์— ๋Œ€ํ•ด์„œ๋งŒ ์•Œ๊ณ  ์žˆ๋‹ค.

 

์œ„์—์„œ local ์ž…๋ ฅ์€ x, y์ด๊ณ  ์ถœ๋ ฅ์€ z์ด๋‹ค.

์šฐ๋ฆฌ๋Š” ์ด๋ฅผ ํ†ตํ•ด local gradient๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 

โญ ๋‹ค์‹œ ์ •๋ฆฌ

The reason we multiply the local gradient by the upstream gradient during backpropagation is the chain rule.

According to the chain rule, if we have a function f(g(x)) and let y = g(x), the derivative of f with respect to x is df/dx = (df/dy) · (dy/dx).

Here, df/dy is the upstream gradient and dy/dx is the local gradient.

 

๋”ฅ๋Ÿฌ๋‹์—์„œ๋Š” ์†์‹ค ํ•จ์ˆ˜๋ฅผ ๊ฐ ์ธต์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ฏธ๋ถ„ํ•  ๋•Œ backpropagation์€ ์—ฐ์‡„ ๋ฒ•์น™์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ธต์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

  • "Local gradient": the derivative of the current layer's operation (e.g. its activation function).
    It measures how the layer's output changes for a small change in its input.
  • "Upstream gradient": the gradient passed back from the layers after the current one.
    It measures how the loss changes with the current layer's output.

๋”ฐ๋ผ์„œ ์ด ๋‘ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณฑํ•จ์œผ๋กœ์จ, ํ˜„์žฌ ์ธต์˜ ๊ฐ€์ค‘์น˜์™€ ํŽธํ–ฅ์— ๋Œ€ํ•œ ์†์‹ค ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

 

The add gate has a local gradient of 1, so both inputs receive the same upstream gradient: it acts as a gradient distributor.

The max gate passes the gradient only to the input that was selected: a gradient router.

The mul gate scales each input's gradient by the value of the other input: a gradient switcher.
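The three behaviors in code, as a sketch of my own (function names are mine), where dout is the upstream gradient:

def add_backward(dout):
    return dout, dout                 # distributor: both inputs get dout

def max_backward(x, y, dout):
    # router: only the input that won the max receives the gradient
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    return y * dout, x * dout         # switcher: each side is scaled by the other input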

 

 

Now suppose a single node is connected to multiple downstream nodes.

In that case, the node receives the sum of the upstream gradients from all of its branches.
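A toy example of my own to check this: if x fans out into two branches that are then added, the gradients from both branches accumulate at x.

# f = (x + 3) + (x * 2), so analytically f = 3x + 3 and df/dx = 3
x = 4.0
a = x + 3                         # branch 1
b = x * 2                         # branch 2
f = a + b

dfda = dfdb = 1.0                 # the final add distributes df/df = 1
dfdx = 1.0 * dfda + 2.0 * dfdb    # sum over branches: 1 + 2 = 3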

 

 

 

์•ž์—์„œ๋Š” ์Šค์นผ๋ผ๋งŒ ๋‹ค๋ค˜๋Š”๋ฐ x, y, z๊ฐ€ ๋ฒกํ„ฐ๋ผ๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ๋ ๊นŒ?

=> ์•ž์—์„œ ํ•œ ๊ฒƒ๊ณผ ๋™์ผํ•˜๋‹ค. 

The only difference is that the gradient is now a Jacobian matrix (the matrix containing the derivative of each output element with respect to each input element).

ex) x์˜ ๊ฐ ์›์†Œ์— ๋Œ€ํ•ด z์— ๋Œ€ํ•œ ๋ฏธ๋ถ„์„ ํฌํ•จํ•˜๋Š”

 

 

 

Suppose the input is a 4096-dimensional vector.

์œ„์˜ ํšŒ์ƒ‰ ๋ฐ•์Šค(๋…ธ๋“œ)๋Š” ๋ฒกํ„ฐ์˜ ๊ฐ ์š”์†Œ์™€ 0์„ ๋น„๊ตํ•ด์„œ ์ตœ๋Œ€๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

์ด๋ ‡๊ฒŒ ๋˜๋ฉด ์ถœ๋ ฅ๊ฐ’์€ 4096์ฐจ์›์˜ ๋ฒกํ„ฐ๊ฐ€ ๋œ๋‹ค.

 

So, what is the size of the Jacobian matrix in this case?

Jacobian ํ–‰๋ ฌ์˜ ๊ฐ ํ–‰์€ ์ž…๋ ฅ์— ๋Œ€ํ•œ ์ถœ๋ ฅ์˜ ํŽธ๋ฏธ๋ถ„์ด ๋œ๋‹ค.

=> It is 4096 x 4096.

 

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

A: in practice, we process an entire minibatch (e.g. 100) of examples at one time:

    i.e. Jacobian would technically be a [409,600 x 409,600] matrix :\

 

Q2: what does it look like?

max(0,x)์—์„œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด๋ณด์ž.

We can reason about the partial derivatives:

์ž…๋ ฅ์˜ ์–ด๋–ค ์ฐจ์›์ด ์ถœ๋ ฅ์˜ ์–ด๋–ค ์ฐจ์›์— ์˜ํ–ฅ์„ ์ค„๊นŒ?

Jacobian ํ–‰๋ ฌ์—์„œ ์–ด๋–ค ์ข…๋ฅ˜์˜ ๊ตฌ์กฐ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‚˜? --> ๋Œ€๊ฐ์„ 

์ž…๋ ฅ์˜ ๊ฐ ์š”์†Œ, ์ฒซ ๋ฒˆ์งธ ์ฐจ์›์€ ์˜ค์ง ์ถœ๋ ฅ์˜ ํ•ด๋‹น ์š”์†Œ์—๋งŒ ์˜ํ–ฅ์„ ์ค€๋‹ค.

๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ์˜ Jacobian ํ–‰๋ ฌ์€ ๋Œ€๊ฐํ–‰๋ ฌ์ด ๋  ๊ฒƒ์ด๋‹ค.

However, we never need to write out or explicitly form that full Jacobian.

We only need to know the effect of x on the output, and fill the values we compute directly into the gradient.
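In code, this means the backward pass of elementwise max(0, x) is just a mask, never an explicit 4096 x 4096 matrix. A minimal numpy sketch:

import numpy as np

x = np.random.randn(4096)
out = np.maximum(0, x)          # forward: elementwise max(0, x)

dout = np.random.randn(4096)    # upstream gradient, same shape as the output
dx = dout * (x > 0)             # backward: the diagonal Jacobian applied as a mask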

 

 

 

 

 

 

 

 

  • Neural networks are usually very large, so computing the gradient for every parameter by hand is practically impossible.
  • Backpropagation recursively applies the chain rule along a computational graph to compute the gradients of all inputs, parameters, and intermediate values.
  • forward: compute the result of each operation and save in memory the intermediate values needed for the gradient computation.
  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.
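Putting the two passes together with the MultiplyGate sketch from above (the values here are my own), note that backward simply visits the nodes in reverse order:

x, y, z = 2.0, -3.0, 4.0
g1, g2 = MultiplyGate(), MultiplyGate()   # f = (x * y) * z

# forward pass: compute outputs and cache intermediates
h = g1.forward(x, y)       # h = -6
f = g2.forward(h, z)       # f = -24

# backward pass: start from df/df = 1 and walk the graph in reverse
dh, dz = g2.backward(1.0)  # dh = 4, dz = -6
dx, dy = g1.backward(dh)   # dx = -12, dy = 8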

 

 

Neural Networks

 

 

 

์‹ ๊ฒฝ๋ง์€ ํ•จ์ˆ˜๋“ค์˜ ์ง‘ํ•ฉ(class)์œผ๋กœ

๋น„์„ ํ˜•์˜ ๋ณต์žกํ•œ ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜๋“ค์„ ๊ณ„์ธต์ ์œผ๋กœ ์—ฌ๋Ÿฌ๊ฐœ ์Œ“์•„์˜ฌ๋ฆฌ๋Š” ํ˜•ํƒœ์ด๋‹ค.

 

 

์„ธํฌ์ฒด(cell body)๋Š” ๋“ค์–ด์˜ค๋Š” ์‹ ํ˜ธ(input)์„ ์ข…ํ•ฉํ•˜์—ฌ ์ถ•์‚ญ(axon)์„ ํ†ตํ•ด ๋‹ค์Œ ์„ธํฌ์ฒด์— ์ „๋‹ฌํ•œ๋‹ค.

์ด๋Š” ์•ž์—์„œ ์‚ดํŽด๋ณธ computational node์˜ ๋™์ž‘๊ณผ ๋น„์Šทํ•˜๋‹ค.

 

๋‰ด๋Ÿฐ์€ ์ด์‚ฐ spike ์ข…๋ฅ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์‹ ํ˜ธ๋ฅผ ์ „๋‹ฌํ•˜๋Š”๋ฐ ์ด๋Š” ํ™œ์„ฑํ™”ํ•จ์ˆ˜์™€ ์œ ์‚ฌํ•œ ์—ญํ• ์„ ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

 

 

In reality, neurons behave in far more complicated ways than this model suggests.

Rather than a single scalar weight like w0, they involve complex nonlinear dynamics.

 

 

์™ผ์ชฝ๊ณผ ๊ฐ™์€ 2๋ ˆ์ด์–ด ์‹ ๊ฒฝ๋ง์€ ๋‘ ๊ฐœ์˜ ์„ ํ˜• ๋ ˆ์ด์–ด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ฒƒ์œผ๋กœ

ํ•˜๋‚˜์˜ ํžˆ๋“  ๋ ˆ์ด์–ด๋ฅผ ๊ฐ€์ง„ ๋„คํŠธ์›Œํฌ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

 

 

Example feed-forward computation of a neural network

import math
import numpy as np

class Neuron:
    def neuron_tick(self, inputs):
        # assume inputs and weights are 1-D numpy arrays and bias is a number
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))  # sigmoid activation
        return firing_rate

 

We can efficiently evaluate an entire layer of neurons
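For example, a full forward pass of a small 3-layer network is just three matrix multiplies plus nonlinearities. A sketch (the layer sizes are my own choice, and the parameters are random placeholders):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))       # sigmoid activation function

# placeholder parameters: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x = np.random.randn(3, 1)                    # random input vector (3x1)
h1 = f(np.dot(W1, x) + b1)                   # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)                  # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                    # output neuron (1x1)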

 

 

 

 

 

Summary

- We arrange neurons into fully-connected layers

- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)

- Neural networks are not really neural

- Next time: Convolutional Neural Networks

