🚀 Mastering the Transformer with Python!

"Attention is All You Need" - Google's revolutionary 2017 paper 🎯
Now let's implement a Transformer from start to finish in Python! 💪


🎯 Table of Contents

  1. 🔍 Transformer Overview
    • 📚 A Brief History of Attention
    • 🔍 Early Attention (SHCA: Single-head Cross Attention)
    • ⚡ 2017: The Transformer Breakthrough
    • 🧠 How Self-Attention Works
    • 🚀 SHCA → MHCA: A Revolutionary Shift
    • 🔥 The Transformer's 3 Key Innovations
    • 📈 Evolution at a Glance
  2. 🏗️ Implementing a Transformer Block
  3. 🎯 Summary and Wrap-up
  4. 📚 References

🔍 Transformer Overview

🤔 What is a Transformer?

The Transformer is a neural network architecture first introduced in Google's 2017 paper "Attention is All You Need"!

📚 A Brief History of Attention

The idea of attention itself existed well before the Transformer! 🕰️

However, the Q / K / V formulation we know today did not exist yet!!
Q/K/V is terminology that was only clearly pinned down with the Transformer and afterwards!!

🔍 Early Attention (SHCA: Single-head Cross Attention)

  • LSTM encoding: a classic LSTM reads the sentence in order and produces a hidden state (to be used later)!!
  • LSTM decoding: the output words are generated from that hidden state!
  • The problem!! The longer the sentence, the more the model forgets what came at the beginning!!
  • So Attention was proposed!! → When generating each output word, instead of relying only on the final hidden state, the decoder looks back over the entire source sentence and decides where to focus (a minimal sketch of the attention-free baseline follows below)!!
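
To make that bottleneck concrete, here is a small sketch of an attention-free seq2seq encoder-decoder (my own addition, not from the original post; the class name VanillaSeq2Seq and the dimensions are made up): the decoder only ever sees the encoder's final hidden state, which is exactly the limitation early attention was invented to fix.

import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    """Attention-free encoder-decoder: the whole source sentence is squeezed
    into a single fixed-size vector (the encoder's final hidden/cell state)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.encoder_rnn = nn.LSTM(input_size, hidden_size)
        self.decoder_rnn = nn.LSTM(input_size, hidden_size)

    def forward(self, src, tgt):
        # src, tgt: (seq_len, batch, input_size)
        _, (h_n, c_n) = self.encoder_rnn(src)          # keep only the final state
        # The decoder has NO access to the per-word encoder outputs:
        # everything it knows about the source lives in (h_n, c_n).
        decoder_outputs, _ = self.decoder_rnn(tgt, (h_n, c_n))
        return decoder_outputs

# Quick shape check
model = VanillaSeq2Seq(input_size=16, hidden_size=32)
src = torch.randn(40, 1, 16)   # a "long" 40-word source sentence
tgt = torch.randn(10, 1, 16)
print(model(src, tgt).shape)   # torch.Size([10, 1, 32])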

So!! The key characteristics of early Attention (as summarized with the help of AI):

  • 🔄 Used together with RNNs: combined with LSTM/GRU encoder-decoders
  • 📏 One direction only: the decoder "looks at" the encoder (Cross Attention)
  • 🎯 Built to fix translation: prevents information loss on long sentences
  • 🎪 Single-Head only: the Multi-Head idea did not exist yet!
  • 👥 Representative work: Bahdanau (2015), Luong (2015)
import torch
import torch.nn as nn
import torch.nn.functional as F

class OldAttention(nn.Module):
    """🔍 2015-style Seq2Seq + Attention example (Bahdanau-style)"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # 🔹 Encoder: LSTM that processes the input sequence (hidden_size = output dim)
        self.encoder_rnn = nn.LSTM(input_size, hidden_size)

        # 🔹 Decoder: LSTM that generates the output step by step
        self.decoder_rnn = nn.LSTM(input_size, hidden_size)

        # 🔸 Linear layer that computes the attention score
        # Encoder and decoder hidden states are concatenated and scored (concat → score)
        self.attention = nn.Linear(hidden_size * 2, 1)

    def forward(self, encoder_outputs, decoder_hidden):
        """
        Args:
            encoder_outputs: (seq_len, hidden_size) - full encoder output sequence
            decoder_hidden: (1, hidden_size) - current decoder hidden state
        Returns:
            context: (1, hidden_size) - weighted sum of the encoder outputs
            attention_weights: (seq_len,) - softmax attention weights
        """

        attention_scores = []

        # 🔍 Compare each encoder output vector with the current decoder state to get a score
        for encoder_output in encoder_outputs:
            # 🧩 Concatenate the encoder output and the decoder state
            # decoder_hidden plays the Q role, encoder_output plays the K and V roles!!
            ## Why? decoder_hidden asks "what am I looking for?" (Query),
            ## while encoder_output provides both the feature of that encoder position (K)
            ## and the actual information stored there (V)!
            combined = torch.cat([encoder_output.unsqueeze(0), decoder_hidden], dim=1)  # (1, 2*hidden_size)

            # 📐 Pass through the linear layer to get a scalar score
            score = self.attention(combined)  # (1, 1)
            attention_scores.append(score.squeeze())

        # 🔃 attention_scores: list of scalars → (seq_len,)
        attention_scores = torch.stack(attention_scores, dim=0)

        # 📊 Softmax turns the scores into attention weights
        attention_weights = F.softmax(attention_scores, dim=0)  # (seq_len,)

        # 🧮 Multiply each encoder output by its attention weight and sum (weighted sum)
        # encoder_outputs: (seq_len, hidden_size)
        # attention_weights: (seq_len, 1) → broadcasting
        context = torch.sum(attention_weights.unsqueeze(1) * encoder_outputs, dim=0, keepdim=True)  # (1, hidden_size)

        return context, attention_weights  # context is the "summary" handed to the decoder

2017: The Transformer Breakthrough ⚡

🎭 Enter Multi-Head Attention (MHCA)!

That famous title: "Attention is All You Need!"
No RNN at all, attention alone. This is where the Q, K, V formulation was pinned down,
where Self-Attention (Q = K = V) was introduced, and where the Multi-Head idea of
looking at the sequence from several perspectives at once came in!

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# The Transformer's Multi-Head Self-Attention (2017)
class MultiHeadSelfAttention_2017(nn.Module):
    """The revolutionary Multi-Head Self-Attention"""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension of each head
        
        # Projections for all attention heads at once!
        self.W_q = nn.Linear(d_model, d_model)  # 8 heads in one matrix
        self.W_k = nn.Linear(d_model, d_model)  # 8 heads in one matrix
        self.W_v = nn.Linear(d_model, d_model)  # 8 heads in one matrix
        self.W_o = nn.Linear(d_model, d_model)  # output projection
        
    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        
        # Split Q, K, V into multiple heads
        Q = self.W_q(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        
        # Each head computes attention independently!
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        attention_weights = F.softmax(attention_scores, dim=-1)
        
        # Apply multi-head attention
        context = torch.matmul(attention_weights, V)
        
        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(
            batch_size, seq_len, d_model
        )
        
        # Final output projection
        output = self.W_o(context)
        
        return output, attention_weights

# Comparison: Single-Head vs Multi-Head
def compare_attention_mechanisms(x):
    """SHCA vs MHCA comparison (conceptual)"""
    d_model = x.size(-1)

    # Single-Head (2015 style): one set of projections → one perspective
    single_head = MultiHeadSelfAttention_2017(d_model, n_heads=1)
    single_head_output, _ = single_head(x)

    # Multi-Head (2017 style): 8 sets of projections → 8 different perspectives,
    # all computed in parallel and concatenated inside the module
    multi_head = MultiHeadSelfAttention_2017(d_model, n_heads=8)
    multi_head_output, _ = multi_head(x)

    return single_head_output, multi_head_output
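
A quick usage check (my own addition, just a sketch): feed a random batch through the module above and look at the shapes. The output keeps the input shape, and the attention weights come back as one (seq_len × seq_len) map per head.

import torch

x = torch.randn(2, 10, 512)                       # (batch, seq_len, d_model)
mhsa = MultiHeadSelfAttention_2017(d_model=512, n_heads=8)
out, weights = mhsa(x)
print(out.shape)      # torch.Size([2, 10, 512])  - same shape as the input
print(weights.shape)  # torch.Size([2, 8, 10, 10]) - one attention map per head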

🧠 How Self-Attention Works

Self-Attention is the mechanism that computes "how relevant each word is to every other word"!

# Example: "The cat sat on the mat"
# How relevant is the word "cat" to each of the other words?
# cat -> The (0.1), cat (1.0), sat (0.8), on (0.2), the (0.1), mat (0.3)

🔑 Query, Key, Value

Think of it like a search engine! 🔍

  • Query (Q): "what I'm looking for" - the current word's question
  • Key (K): "the search keyword" - what each other word offers
  • Value (V): "the actual content" - the information each word carries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def simple_attention_example():
    """A tiny Attention example"""
    
    # Example sentence: "I love AI"
    # Each word is represented as a 3-dimensional vector
    sentence = torch.tensor([
        [1.0, 0.0, 0.0],  # "I"
        [0.0, 1.0, 0.0],  # "love"
        [0.0, 0.0, 1.0]   # "AI"
    ])
    
    # Query, Key, Value (simplified: no learned projections yet)
    Q = sentence  # Query: what is each word looking for?
    K = sentence  # Key: each word's features
    V = sentence  # Value: each word's actual information
    
    # Attention scores
    attention_scores = torch.matmul(Q, K.transpose(-2, -1))
    print("Attention Scores:")
    print(attention_scores)
    
    # Softmax to turn the scores into probabilities
    attention_weights = F.softmax(attention_scores, dim=-1)
    print("\nAttention Weights:")
    print(attention_weights)
    
    # Final output
    output = torch.matmul(attention_weights, V)
    print("\nFinal Output:")
    print(output)

# Run it
simple_attention_example()
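
Because the three word vectors are one-hot and Q = K = V, the score matrix is the 3×3 identity, so each row of the attention weights is the softmax of [1, 0, 0], roughly [0.576, 0.212, 0.212] with the large value on the diagonal: every word mostly attends to itself.

In real attention, Q, K and V are not the raw embeddings: they are learned linear projections of the same input, and the scores are scaled by √d_k. Here is a minimal single-head sketch of that (my own addition; the class name SingleHeadSelfAttention and the dimensions are made up):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention with learned Q/K/V projections."""
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x):
        # x: (seq_len, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_model)  # (seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)
        return weights @ V, weights

# "I love AI" as 3 one-hot vectors, projected into a learned space
sentence = torch.eye(3)
attn = SingleHeadSelfAttention(d_model=3)
output, weights = attn(sentence)
print(weights)  # (3, 3) attention map, no longer forced to favor the diagonal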

🚀 SHCA → MHCA: A Revolutionary Shift

SHCA (2015)               | MHCA (2017)
🎯 1 perspective          | 🎭 8 perspectives
📏 Cross Attention only   | 🔄 Self + Cross
🔄 Tied to RNNs           | 🚫 No RNN
🐌 Sequential processing  | ⚡ Parallel processing
📊 Simple weighted sum    | 🧠 Richer representations

🎪 What Multi-Head Buys You

Each head learns a different kind of relationship:

# Example: analyzing "The cat sat on the mat"
sentence = "The cat sat on the mat"

# Head 1: learns grammatical relations
head_1_attention = [
    # "cat" → "The" (article-noun)
    # "sat" → "cat" (subject-verb)
    # "on" → "sat" (verb-preposition)
]

# Head 2: learns semantic relations
head_2_attention = [
    # "cat" → "mat" (the cat relates to the mat)
    # "sat" → "on" (the sitting action and its location)
]

# Head 3: learns positional relations
head_3_attention = [
    # relations between adjacent words
    # "The" → "cat", "cat" → "sat", etc.
]

# Heads 4-8: other, more abstract relations...

🔥 The revolutionary advantages of Multi-Head:

  • 🎯 Multiple perspectives: grammar, semantics, position, all learned at once
  • 🧠 Richer representations: captures complex language patterns
  • ⚡ Parallel computation: every head is processed at the same time
  • 🚀 Better performance: large real gains in translation and understanding

💡 Result: the limits of Single-Head attention are completely overcome! 🎉 (See the per-head inspection sketch just below.)
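
To see "different heads, different relations" for yourself, you can slice the attention weights returned by the MultiHeadSelfAttention_2017 module defined earlier, one (seq_len × seq_len) map per head. A small sketch, assuming that module is already in scope and using random embeddings as a stand-in for the sentence:

import torch

x = torch.randn(1, 6, 512)                   # pretend "The cat sat on the mat" is embedded
mhsa = MultiHeadSelfAttention_2017(d_model=512, n_heads=8)
_, weights = mhsa(x)                         # (1, 8, 6, 6)

for h in range(weights.size(1)):
    head_map = weights[0, h]                 # (6, 6) attention pattern of head h
    print(f"Head {h}: strongest attention per word →", head_map.argmax(dim=-1).tolist())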

🔥 The Transformer's 3 Key Innovations

Attention before (2015)   | Transformer Attention (2017)
🔄 Needs an RNN           | 🚫 No RNN
📏 Encoder→Decoder only   | 🔄 Self-Attention
🎯 Single Head            | 🎭 Multi-Head
🐌 Sequential processing  | ⚡ Parallel processing

The key innovations (a small self- vs. cross-attention sketch follows this list):

  1. 🧠 Self-Attention: every position in the same sequence learns relations to every other position
  2. 🎭 Multi-Head: attention is computed from several perspectives at once
  3. ⚡ Parallelization: sequences can be processed without an RNN
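
Here is a minimal sketch (my own addition) of the difference between self-attention and the older cross-attention, written as plain scaled dot products; enc and dec are made-up stand-ins for encoder and decoder states:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product(Q, K, V):
    """Plain scaled dot-product attention."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return F.softmax(scores, dim=-1) @ V

enc = torch.randn(10, 64)   # 10 encoder positions
dec = torch.randn(4, 64)    # 4 decoder positions

# Self-attention (Transformer): Q, K, V all come from the SAME sequence
self_attn = scaled_dot_product(enc, enc, enc)    # (10, 64)

# Cross-attention (2015-style, and still used in the Transformer decoder):
# Q comes from the decoder, K and V come from the encoder
cross_attn = scaled_dot_product(dec, enc, enc)   # (4, 64)

print(self_attn.shape, cross_attn.shape)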

📈 Evolution at a Glance

timeline
    title History of Attention
    2014 : Neural Machine Translation
         : Encoder-Decoder architectures appear
    2015 : Bahdanau Attention
         : the first attention mechanism
         : Luong Attention
    2017 : Transformer
         : "Attention is All You Need"
         : the Self-Attention revolution
    2018+ : BERT, GPT
          : Transformer-based models everywhere

💡 Bottom line: attention itself was not new, but the Transformer took it to a completely new level! 🚀

Key ideas:

  • 🚫 Sequence data can be processed without any RNN/LSTM
  • ⚡ Parallel processing makes training dramatically faster
  • 🎯 Self-Attention handles long-range dependencies directly

๐Ÿ—๏ธ Transformer ๋ธ”๋ก ๊ตฌํ˜„

์ด์ œ ์™„์ „ํ•œ Transformer ๋ธ”๋ก์„ ๋งŒ๋“ค์–ด๋ด…์‹œ๋‹ค! ๐ŸŽ‰

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-Head Attention
        # (PyTorch's built-in nn.MultiheadAttention is used here so the block runs
        #  as-is; it plays the same role as the module we built above.)
        self.attention = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        
        # Feed-Forward Network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer Normalization
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            mask: Optional attention mask
        Returns:
            output: Transformer block output
        """
        # 1. Multi-Head Attention + Residual Connection + Layer Norm
        attn_output, attn_weights = self.attention(x, x, x, attn_mask=mask)
        x = self.ln1(x + self.dropout(attn_output))
        
        # 2. Feed-Forward + Residual Connection + Layer Norm
        ff_output = self.feed_forward(x)
        x = self.ln2(x + self.dropout(ff_output))
        
        return x, attn_weights

# The complete Transformer model! 🚀
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6, 
                 d_ff=2048, max_seq_len=5000, dropout=0.1):
        super().__init__()
        
        self.d_model = d_model
        
        # Embedding layers
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        
        # Transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        # Final layer norm
        self.ln_final = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        """
        Args:
            x: Input token ids (batch_size, seq_len)
            mask: Optional attention mask
        Returns:
            output: Transformer output (batch_size, seq_len, d_model)
            attention_weights: List of attention weights from each layer
        """
        batch_size, seq_len = x.size()
        
        # Token embeddings
        token_emb = self.token_embedding(x)
        
        # Position embeddings
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        pos_emb = self.position_embedding(positions)
        
        # Combine embeddings
        x = self.dropout(token_emb + pos_emb)
        
        # Apply transformer blocks
        attention_weights = []
        for block in self.transformer_blocks:
            x, attn_weights = block(x, mask)
            attention_weights.append(attn_weights)
        
        # Final layer norm
        x = self.ln_final(x)
        
        return x, attention_weights

# A real usage example! 🎯
def test_transformer():
    """Test the Transformer model"""
    
    # Build the model
    vocab_size = 10000
    model = SimpleTransformer(vocab_size)
    
    # Create dummy data
    batch_size, seq_len = 2, 50
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    
    # Forward pass
    output, attention_weights = model(input_ids)
    
    print(f"🚀 Transformer Results:")
    print(f"Input shape: {input_ids.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Number of layers: {len(attention_weights)}")
    
    # Count model parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total parameters: {total_params:,}")
    
    return model, output, attention_weights

# Run it
model, output, attention_weights = test_transformer()
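
If you run this, the input shape prints as torch.Size([2, 50]), the output as torch.Size([2, 50, 512]), and there are 6 attention-weight tensors (one per layer); with these default hyperparameters the parameter count should come out to roughly 26-27 million, dominated by the feed-forward layers and the two embedding tables.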

🎯 Summary and Wrap-up

💡 Key Takeaways

"Attention is All You Need" - attention really was enough! 🎯

Sequence modeling works without any RNN/LSTM, and parallel processing makes it far faster and more efficient.

🎉 Congratulations! You can now understand and implement a Transformer end to end! 💪


📚 References


This post is licensed under CC BY 4.0 by the author.