
🔤 Understanding Tokenizers?!!


๐Ÿ” ํ…์ŠคํŠธ๋ฅผ AI๊ฐ€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์˜๋ฏธ ์žˆ๋Š” ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๊ธฐ!!!

์šฐ๋ฆฌ๊ฐ€ ๋ฌธ์žฅ์„ ๋‹จ์–ด๋กœ ๋‚˜๋ˆ„์–ด ์ดํ•ดํ•˜๋“ฏ์ด,
AI ๋ชจ๋ธ๋„ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ํ†ตํ•ด ํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅํ•œ ๋‹จ์œ„๋กœ ๋ณ€ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค!


💡 Tokenizer characteristics at a glance!!

  1. A bridge between text and AI
    • Converts human-readable text into numeric tokens that a model can process (a quick round-trip example follows below)
  2. Solves the vocabulary problem
    • Subword tokenization handles out-of-vocabulary (OOV) words
  3. Language independence
    • Modern tokenizers work across many languages and writing systems
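Here is a quick round-trip sketch of that text → token IDs → text bridge, assuming the Hugging Face `transformers` package is installed (the model name is just a common pretrained example):

```python
from transformers import AutoTokenizer

# Download a pretrained tokenizer and run a simple round trip.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers turn text into numbers."
ids = tokenizer.encode(text, add_special_tokens=False)  # text -> token IDs
print(ids)                    # a short list of integer IDs
print(tokenizer.decode(ids))  # token IDs -> (approximately) the original text
```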

🧠 Why tokenizers exist

People read words, but computers need numbers!
The tokenizer is the essential bridge that turns text into a form AI can digest!!

The evolution of text processing:

  • 1990s: word-level tokenization → simple, but a huge vocabulary
  • 2000s: character-level → a tiny vocabulary, but meaning is lost
  • 2010s: subword tokenization → combines the best of both worlds!

🚨 Limitations of earlier approaches

1️⃣ Problems with word-level tokenization 📚

  • Vocabulary explosion: "달리다", "달리고", "달렸다" (run / running / ran) → each becomes a separate token
  • OOV problem: any new word missing from the training vocabulary = <UNK>
  • Memory intensive: the vocabulary can easily exceed 100,000 entries
  • Language dependent: every language needs its own rules

Example:

Input: "달리는 사람이 빠르게 달린다" (a running person runs fast)
Word tokens: ["달리는", "사람이", "빠르게", "달린다"]
Problem: "달려가는" → <UNK> (it is not in the existing vocabulary!)
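A minimal toy sketch of this failure mode (my own illustration, not from any library): a word-level tokenizer built from one training sentence maps every unseen word to <UNK>.

```python
# Build a word-level vocabulary from a single training sentence.
train_sentence = "달리는 사람이 빠르게 달린다"
vocab = {word: i for i, word in enumerate(train_sentence.split(), start=1)}
vocab["<UNK>"] = 0  # fallback ID for out-of-vocabulary words

def word_tokenize(text):
    """Map each whitespace-separated word to its ID, or to <UNK> (0)."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]

print(word_tokenize("사람이 빠르게 달린다"))  # all words known -> [2, 3, 4]
print(word_tokenize("달려가는 사람이"))       # "달려가는" is unseen -> [0, 2]
```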

2๏ธโƒฃ ๋ฌธ์ž ์ˆ˜์ค€์˜ ํ•œ๊ณ„ ๐Ÿ”ค

  • ์˜๋ฏธ ์ƒ์‹ค: โ€œ๊ณ ์–‘์ดโ€ โ†’ [โ€œ๊ณ โ€, โ€œ์–‘โ€, โ€œ์ดโ€] ๋‹จ์–ด ์˜๋ฏธ ์†์‹ค
  • ๊ธด ์‹œํ€€์Šค: ๋” ๊ธด ์‹œํ€€์Šค = ๋” ๋งŽ์€ ์—ฐ์‚ฐ
  • ๋งฅ๋ฝ ์–ด๋ ค์›€: ๋ชจ๋ธ์ด ๋‹จ์–ด ์ˆ˜์ค€ ํŒจํ„ด ํ•™์Šตํ•˜๊ธฐ ์–ด๋ ค์›€

์˜ˆ์‹œ:

1
2
3
์ž…๋ ฅ: "์•ˆ๋…• ์„ธ์ƒ"
๋ฌธ์ž ํ† ํฐ: ["์•ˆ", "๋…•", " ", "์„ธ", "์ƒ"]
๋ฌธ์ œ: 2๊ฐœ ๋‹จ์–ด์— 5๊ฐœ ํ† ํฐ!

💡 The fix: subword tokenization balances vocabulary size against semantic meaning!


🔧 Modern tokenization algorithms

🏗️ 1. BPE (Byte Pair Encoding) 🧩

Core idea: start from characters and repeatedly merge the most frequent adjacent pair.

Algorithm steps:

  1. Initialize: split the text into characters (character-level tokenization)
  2. Count pairs: find the most frequent adjacent pair
  3. Merge: replace that pair with a new token
  4. Repeat: until the target vocabulary size is reached

Example walkthrough:

Since this is the most fundamental method and worth knowing well, I worked through an example myself!

# A simple mini BPE (Byte Pair Encoding) exercise
import re
from collections import Counter, defaultdict
import pandas as pd

# Sample word list
corpus = ["low", "lower", "newest", "widest"]

# Step 1: split each word into characters and append an end-of-word marker
def split_word(word):
    return list(word) + ["</w>"]  # marks the end of a word

# Initial vocabulary
tokens = [split_word(word) for word in corpus]

# Helper: count each token sequence as a space-joined string
def get_vocab(tokens):
    vocab = Counter()
    for token in tokens:
        vocab[" ".join(token)] += 1
    return vocab

# Find the most frequent adjacent symbol pair
def get_most_common_pair(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] += freq
    return max(pairs, key=pairs.get), pairs

# Perform the merge (whole-symbol match only, so e.g. "w e" never matches inside "low e")
def merge_pair(pair, tokens):
    new_tokens = []
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    for token in tokens:
        token_str = pattern.sub(replacement, " ".join(token))
        new_tokens.append(token_str.split())
    return new_tokens

# Run BPE training (only 5 merges)
merge_history = []
for i in range(5):
    vocab = get_vocab(tokens)
    pair, all_pairs = get_most_common_pair(vocab)
    merge_history.append((pair, all_pairs[pair]))
    tokens = merge_pair(pair, tokens)

# Summarize the results
merge_df = pd.DataFrame(merge_history, columns=["Merged Pair", "Frequency"])
tokens_final = [" ".join(token) for token in tokens]
print("Final tokens:", tokens_final)
# Output: ['low </w>', 'low e r </w>', 'n e w est</w>', 'w i d est</w>']

# Show the merges step by step
print("\nMerge steps:")
for i, (pair, freq) in enumerate(merge_history):
    print(f"Step {i+1}: {pair[0]} + {pair[1]} → {pair[0]+pair[1]} (frequency: {freq})")
# Output:
# Step 1: l + o → lo (frequency: 2)
# Step 2: lo + w → low (frequency: 2)
# Step 3: e + s → es (frequency: 2)
# Step 4: es + t → est (frequency: 2)
# Step 5: est + </w> → est</w> (frequency: 2)

BPE advantages:

  • ✅ Handles rare words through subword decomposition
  • ✅ Tokenizes similar words consistently
  • ✅ Language-agnostic algorithm

๐Ÿ—๏ธ 2. WordPiece ๐Ÿงฉ

ํ•ต์‹ฌ ์•„์ด๋””์–ด: BPE์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ ๋ณ‘ํ•ฉ ๊ธฐ์ค€์— ์šฐ๋„(Likelihood) ์ตœ๋Œ€ํ™” ์‚ฌ์šฉ

  • WordPiece๋Š” BPE์™€ ๋‹ฌ๋ฆฌ ๋‹จ์ˆœํžˆ ๊ฐ€์žฅ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ์Œ์„ ๋ณ‘ํ•ฉํ•˜์ง€ ์•Š๊ณ , ๋ณ‘ํ•ฉํ–ˆ์„ ๋•Œ ์ „์ฒด ๋ง๋ญ‰์น˜์˜ ์šฐ๋„๋ฅผ ์–ผ๋งˆ๋‚˜ ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ฒฐ์ •
  • ์šฐ๋„๊ฐ€๋ญ์ง€? ์–ด๋ ค์šฐ๋‹ˆ ์˜ˆ์‹œ๋กœ ๋ณด์Ÿˆ!!
    • "un"์ด๋ผ๋Š” ์„œ๋ธŒ์›Œ๋“œ๋ฅผ ๋ณ‘ํ•ฉํ•˜๋ฉด:
    • unhappiness
    • unhappy
    • unusual
    • unfit
    • unseen

    • ์ฆ‰, "un"์ด๋ผ๋Š” subword๊ฐ€ ์—ฌ๋Ÿฌ ๋‹จ์–ด์—์„œ ์˜๋ฏธ ์žˆ๋Š” ๋ฐ˜๋ณต ํŒจํ„ด์ด ๋œ๋‹ค๋ฉด
      โ†’ ๋ณ‘ํ•ฉ์„ ํ†ตํ•ด ์ „์ฒด ๋ฌธ์žฅ์—์„œ ๋ชจ๋ธ ์˜ˆ์ธก ์„ฑ๋Šฅ ํ–ฅ์ƒ
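A toy sketch of the merge criterion as it is commonly described (my own illustration with made-up counts, not the original WordPiece code): instead of raw pair frequency, each adjacent pair is scored as count(a, b) / (count(a) · count(b)), which favors pairs whose parts rarely occur apart.

```python
from collections import Counter

# Tiny corpus: symbol sequences with their word frequencies (made-up numbers).
words = {
    ("u", "n", "h", "a", "p", "p", "y"): 5,
    ("u", "n", "f", "i", "t"): 3,
    ("f", "u", "n"): 8,
}

unit_counts = Counter()   # how often each symbol occurs
pair_counts = Counter()   # how often each adjacent pair occurs
for symbols, freq in words.items():
    for s in symbols:
        unit_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# WordPiece-style score: pair frequency normalized by its parts' frequencies.
scores = {
    pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
    for pair, count in pair_counts.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))  # the pair WordPiece would merge next
```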

๐Ÿ—๏ธ 3. SentencePiece ๐ŸŒ

ํ•ต์‹ฌ ์•„์ด๋””์–ด: ์‚ฌ์ „ ํ† ํฌ๋‚˜์ด์ œ์ด์…˜ ์—†๋Š” ์–ธ์–ด ๋…๋ฆฝ์  ํ† ํฌ๋‚˜์ด์ œ์ด์…˜

  • ๊ธฐ์กด BPE๋Š” ๋จผ์žฌ ๊ณต๋ฐฑ์œผ๋กœ ์ž๋ฅด๊ณ , ๊ทธ ์•ˆ์—์„œ ์ชผ๊ฐœ๊ธฐ์— ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์—†์œผ๋ฉด? ๋ฌธ์ œ๊ฐ€๋จ!
  • ๊ทธ๋ž˜์„œ! ๊ณต๋ฐฑ๋„ ๋ฌธ์ž์ฒ˜๋Ÿผ ๋ณด๊ณ , ๋ชจ๋“  ๊ธ€์ž๋ฅผ ์ชผ๊ฐœ์„œ ์–ธ์–ด์™€ ์ƒ๊ด€์—†์ด ํ•™์Šต
  • ๊ณต๋ฐฑ์„ โ–๋กœ ์น˜ํ™˜!

์ฃผ์š” ์žฅ์ :

  • โœ… ์‚ฌ์ „ ํ† ํฌ๋‚˜์ด์ œ์ด์…˜ ๋ถˆํ•„์š” (๋‹จ์–ด ๊ฒฝ๊ณ„ ์—†์Œ)
  • โœ… ๋ชจ๋“  ์–ธ์–ด ์ฒ˜๋ฆฌ ์ค‘๊ตญ์–ด, ์ผ๋ณธ์–ด, ์•„๋ž์–ด ํฌํ•จ
  • โœ… ๊ฐ€์—ญ์ : ์›๋ณธ ํ…์ŠคํŠธ ์™„๋ฒฝ ์žฌ๊ตฌ์„ฑ ๊ฐ€๋Šฅ
  • โœ… T5, mT5, ALBERT์—์„œ ์‚ฌ์šฉ

SentencePiece characteristics:

# Treat whitespace as a regular character
Input: "안녕 세상"
Tokens: ["▁안녕", "▁세상"]  # ▁ marks a space

# Handling a language without spaces
Input: "こんにちは世界"  # Japanese: "Hello world"
Tokens: ["こんに", "ちは", "世界"]
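For a hands-on feel, here is a minimal training sketch assuming the `sentencepiece` package is installed and that a small plain-text corpus has been saved to "corpus.txt" (the file name, vocab size, and model prefix are arbitrary choices for this demo):

```python
import sentencepiece as spm

# Train directly on raw text: no pre-tokenization, spaces become ▁ pieces.
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # raw sentences, one per line
    model_prefix="sp_demo",   # writes sp_demo.model / sp_demo.vocab
    vocab_size=100,           # keep small for a tiny demo corpus
    model_type="bpe",         # "unigram" is the library default
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
pieces = sp.encode("Hello world", out_type=str)
print(pieces)             # word-initial pieces carry the ▁ marker
print(sp.decode(pieces))  # reversible: recovers the original string
```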

📊 Tokenizer comparison

| Method | Vocab size | OOV handling | Language support | Used by |
| --- | --- | --- | --- | --- |
| Word-level | 50K-100K+ | ❌ Poor | 🔤 Language-specific | Traditional NLP |
| Character-level | 100-1K | ✅ Perfect | 🌐 Universal | Early NMT |
| BPE | 30K-50K | ✅ Good | 🌐 Universal | GPT, RoBERTa |
| WordPiece | 30K-50K | ✅ Good | 🌐 Universal | BERT, DistilBERT |
| SentencePiece | 30K-50K | ✅ Excellent | 🌐 Universal | T5, mT5, ALBERT |

💻 Hands-on implementation

🔧 Using Hugging Face tokenizers

You can download the tokenizers of these pretrained models directly!

from transformers import AutoTokenizer

# BPE (GPT-2)
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The runner runs quickly and efficiently"
bpe_tokens = bpe_tokenizer.tokenize(text)
print(f"BPE: {bpe_tokens}")
# Output: ['The', 'Ġrunner', 'Ġruns', 'Ġquickly', 'Ġand', 'Ġefficiently']

# WordPiece (BERT)
wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
wp_tokens = wp_tokenizer.tokenize(text)
print(f"WordPiece: {wp_tokens}")
# Output: ['the', 'runner', 'runs', 'quickly', 'and', 'efficiently']

# SentencePiece (T5)
sp_tokenizer = AutoTokenizer.from_pretrained("t5-small")
sp_tokens = sp_tokenizer.tokenize(text)
print(f"SentencePiece: {sp_tokens}")
# Output: ['▁The', '▁runner', '▁runs', '▁quickly', '▁and', '▁efficiently']

🔧 Training a custom BPE tokenizer

  • The approach below starts from a freshly initialized tokenizer!!
  • The tokenizer is built from the english_texts given in the code!!

In the example below, the vocabulary is capped at 5,000 tokens and a token must appear at least twice!

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# English text data examples
english_texts = [
    "Hello world, how are you today?",
    "The weather is beautiful and sunny", 
    "We are building a custom tokenizer for English",
    "Natural language processing is a fascinating field",
    "Deep learning and machine learning are powerful technologies"
]

# Save texts to file
with open("english_corpus.txt", "w", encoding="utf-8") as f:
    for text in english_texts:
        f.write(text + "\n")

# Initialize BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer for English
trainer = BpeTrainer(
    vocab_size=5000,  # Larger vocab for English
    min_frequency=2,  # Minimum frequency threshold
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=True
)

# Train on English text
tokenizer.train(["english_corpus.txt"], trainer)

# Test the tokenizer
test_text = "Natural language processing is fascinating!"
tokens = tokenizer.encode(test_text)
print(f"Input: {test_text}")
print(f"Token IDs: {tokens.ids}")
print(f"Tokens: {tokens.tokens}")

# Save tokenizer
tokenizer.save("my_english_bpe_tokenizer.json")
print("English BPE tokenizer saved successfully!")

And what gets printed!?

Since the exclamation mark was never seen during training, it comes out as [UNK]~!

Input: Natural language processing is fascinating
Token IDs: [10, 39, 31, 28, 13, 23, 23, 38, 19, 31, 13, 19, 17, 27, 28, 26, 15, 46, 29, 37, 41, 18, 13, 29, 15, 35, 39, 37]
Tokens: ['N', 'at', 'u', 'r', 'a', 'l', 'l', 'an', 'g', 'u', 'a', 'g', 'e', 'p', 'r', 'o', 'c', 'es', 's', 'ing', 'is', 'f', 'a', 's', 'c', 'in', 'at', 'ing', 'w', 'e', 'at', 'h', 'er', '[UNK]']
English BPE tokenizer saved successfully!

🧩 Digging a little deeper into tokenization!

1️⃣ Special tokens 🎯

# Example usage with BERT tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The quick brown foxใ…"

encoded = tokenizer(text, 
                   add_special_tokens=True,
                   max_length=10,
                   padding="max_length",
                   truncation=True)

print(f"Input: {text}")
print(f"Token IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")

# Expected output:
# Input: The quick brown foxㅠ
# Token IDs: [101, 1996, 4248, 2829, 4419, 102, 0, 0, 0, 0]
# Tokens: ['[CLS]', 'the', 'quick', 'brown', '[UNK]', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

  • What the result shows!!

    Because max_length = 10, four [PAD] tokens are appended to fill out the length! ㅠ is Korean the tokenizer does not know, so that word becomes [UNK].

| Token | Description |
| --- | --- |
| `[PAD]` | Padding token for batching (equalizes sequence lengths) |
| `[UNK]` | Token for unknown words (OOV words missing from the vocabulary) |
| `[CLS]` | Classification/start token placed at the beginning of the sequence (used by BERT) |
| `[SEP]` | Separator token (between sentences, or marking the end of a sentence) |
| `[MASK]` | Mask token (for masked language modeling) |
| `<s>` | Start-of-sequence token (used by GPT-style models) |
| `</s>` | End-of-sequence token (used by GPT-style models) |

2๏ธโƒฃ Tokenization vs Encoding ๐Ÿ”„

from transformers import AutoTokenizer

# Load a pretrained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Input sentence
text = "Hello, world!"

# 1. Tokenization: convert the sentence into a list of tokens
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# Example output: ['hello', ',', 'world', '!']

# 2. Encoding: convert the sentence into a list of token IDs
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(f"Token IDs: {token_ids}")
# Example output: [7592, 1010, 2088, 999]

# 3. Full encoding with special tokens
encoded = tokenizer(
    text,
    add_special_tokens=True,   # add [CLS] and [SEP]
    padding=True,              # add padding (works even for a single sentence)
    truncation=True,           # truncate sentences that are too long
    return_tensors="pt"        # return PyTorch tensors
)

# Print the results
print("Full Encoding (with special tokens, padding):")
print(encoded)
print("Input IDs:", encoded['input_ids'])
print("Attention Mask:", encoded['attention_mask'])

And the result!?

In the full encoding, [CLS], [SEP], and friends have been added!

Tokens (split into pieces!): ['hello', ',', 'world', '!']
Token IDs: [7592, 1010, 2088, 999]
Full Encoding (with special tokens, padding):
{'input_ids': tensor([[ 101, 7592, 1010, 2088,  999,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
Input IDs: tensor([[ 101, 7592, 1010, 2088,  999,  102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1]])

3๏ธโƒฃ ์—ฌ๋Ÿฌ ๊ตญ๊ฐ€์˜ ์–ธ์–ด์ฒ˜๋ฆฌ! ๐ŸŒ

from transformers import AutoTokenizer

# Multilingual examples
texts = [
    "Hello world",           # English
    "Bonjour le monde",      # French
    "Hola mundo",           # Spanish  
    "Guten Tag Welt",       # German
    "Ciao mondo"            # Italian
]

# Different tokenizers handle multilingual text differently
multilingual_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
english_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("=== Multilingual Tokenizer ===")
for text in texts:
    tokens = multilingual_tokenizer.tokenize(text)
    print(f"{text} โ†’ {tokens}")

print("\n=== English-only Tokenizer ===")
for text in texts:
    tokens = english_tokenizer.tokenize(text)
    print(f"{text} โ†’ {tokens}")

And the results!?

The plain English tokenizer and the multilingual one behave differently, right!?

=== Multilingual Tokenizer ===
Hello world โ†’ ['Hello', 'world']
Bonjour le monde โ†’ ['Bon', '##jou', '##r', 'le', 'monde']
Hola mundo โ†’ ['Ho', '##la', 'mundo']
Guten Tag Welt โ†’ ['Gut', '##en', 'Tag', 'Welt']
Ciao mondo โ†’ ['Ci', '##ao', 'mondo']

=== English-only Tokenizer ===
Hello world โ†’ ['hello', 'world']
Bonjour le monde โ†’ ['bon', '##jou', '##r', 'le', 'monde']
Hola mundo โ†’ ['ho', '##la', 'mundo']
Guten Tag Welt โ†’ ['gut', '##en', 'tag', 'we', '##lt']
Ciao mondo โ†’ ['cia', '##o', 'mon', '##do']

📈 Impact on performance

🏆 Impact of vocabulary size

| Vocab size | Pros | Cons | Best for |
| --- | --- | --- | --- |
| Small (1K-5K) | 💾 Memory efficient, ⚡ fast training | 🔄 Many subwords, 📏 long sequences | Resource-constrained settings |
| Medium (10K-30K) | ⚖️ Balanced performance, ✅ good coverage | – | 📊 Standard choice, most applications |
| Large (50K+) | 🎯 Better semantic units, 📏 shorter sequences | 💾 Memory intensive, ⏱️ slower training | Large-scale models |
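To see the trade-off concretely, here is a small sketch that reuses the english_corpus.txt file written above and trains the same BPE tokenizer with a few different vocabulary budgets (the sizes are toy values chosen for this demo); smaller vocabularies should yield longer token sequences:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def train_bpe(vocab_size):
    """Train a tiny BPE tokenizer on english_corpus.txt with the given budget."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, min_frequency=1,
                         special_tokens=["[UNK]"], show_progress=False)
    tokenizer.train(["english_corpus.txt"], trainer)
    return tokenizer

sample = "Natural language processing is a fascinating field"
for vocab_size in (60, 120, 500):
    encoding = train_bpe(vocab_size).encode(sample)
    print(f"vocab_size={vocab_size:4d} -> {len(encoding.tokens):2d} tokens")
```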

🔧 Tokenization speed comparison!

import time
from transformers import AutoTokenizer

text = "This is a sample text " * 1000  # Long text

tokenizers = {
    "BPE (GPT-2)": AutoTokenizer.from_pretrained("gpt2"),
    "WordPiece (BERT)": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "SentencePiece (T5)": AutoTokenizer.from_pretrained("t5-small")
}

for name, tokenizer in tokenizers.items():
    start_time = time.time()
    tokens = tokenizer.tokenize(text)
    end_time = time.time()
    
    print(f"{name}:")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Time: {end_time - start_time:.4f}s")
    print(f"  Speed: {len(tokens)/(end_time - start_time):.0f} tokens/s")

And the results!?

BPE is the fastest, and T5 produces the most tokens!

| Tokenizer Type | Tokens | Time (s) | Speed (tokens/s) |
| --- | --- | --- | --- |
| BPE (GPT-2) | 5001 | 0.0125 | 401,257 |
| WordPiece (BERT) | 5000 | 0.0142 | 352,380 |
| SentencePiece (T5) | 6000 | 0.0145 | 413,069 |

⚠️ Limitations & open challenges

1️⃣ Inconsistent tokenization 🔄

from transformers import AutoTokenizer

# Same words in different contexts
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text1 = "run"
text2 = "running" 
text3 = "I like to run fast"
text4 = "The running water is clean"

print(f"'{text1}' โ†’ {tokenizer.tokenize(text1)}")
print(f"'{text2}' โ†’ {tokenizer.tokenize(text2)}")
print(f"'{text3}' โ†’ {tokenizer.tokenize(text3)}")
print(f"'{text4}' โ†’ {tokenizer.tokenize(text4)}")

# Output examples:
# 'run' โ†’ ['run']
# 'running' โ†’ ['running'] 
# 'I like to run fast' โ†’ ['I', 'ฤ like', 'ฤ to', 'ฤ run', 'ฤ fast']
# 'The running water is clean' โ†’ ['The', 'ฤ running', 'ฤ water', 'ฤ is', 'ฤ clean']
  • ์œ„์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด run ์—๋Œ€ํ•˜์—ฌ ์—ฌ๋Ÿฌ๋ฐฉ์‹์œผ๋กœ Tokenization ๋œ๋‹ค!!

2๏ธโƒฃ ์„œ๋ธŒ์›Œ๋“œ ๊ฒฝ๊ณ„ ๋ฌธ์ œ โœ‚๏ธ

# Examples of problematic subword splitting
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

problematic_words = [
    "unhappiness",
    "preprocessor", 
    "antidisestablishmentarianism",
    "biocompatibility"
]

for word in problematic_words:
    tokens = tokenizer.tokenize(word)
    print(f"'{word}' โ†’ {tokens}")

# Output examples:
# 'unhappiness' → ['un', '##hap', '##piness']  # Loses "happy" semantic unit
# 'preprocessor' → ['pre', '##proc', '##ess', '##or']  # Splits "process"
# 'antidisestablishmentarianism' → ['anti', '##dis', '##esta', '##blish', '##ment', '##arian', '##ism']
# 'biocompatibility' → ['bio', '##com', '##pat', '##ibility']  # Loses "compatibility"

๋ฌธ์ œ์ : ์˜๋ฏธ์žˆ๋Š” ๋‹จ์œ„๊ฐ€ ์ž˜๋ชป ๋ถ„ํ• ๋จ (pre / processor๋กœ ๊ตฌ๋ถ„๋˜๋Š”๊ฒŒ ์ œ์ผ ์ข‹๊ฒ ์ง€๋งŒ!?..)
์˜ํ–ฅ: ๋ชจ๋ธ์ด ๋‹จ์–ด ๊ด€๊ณ„์™€ ํ˜•ํƒœํ•™์„ ์ดํ•ดํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์„ ์ˆ˜ ์žˆ์Œ

3๏ธโƒฃ ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ ์ง€์‹ ๋ถ€์กฑ์œผ๋กœ ํ•œ๊ณ„! ๐Ÿฅ

from transformers import AutoTokenizer

# Medical text examples
medical_texts = [
    "Patient presents with acute myocardial infarction and requires immediate intervention",
    "Blood pressure elevated, prescribing ACE inhibitors for hypertension management",
    "CT scan reveals suspicious pulmonary nodules, scheduling biopsy procedure"
]

# General vs Medical tokenizer comparison
general_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# medical_tokenizer = AutoTokenizer.from_pretrained("clinical-bert")  # Hypothetical

for text in medical_texts[:1]:  # First example only
    general_tokens = general_tokenizer.tokenize(text)
    print(f"General Tokenizer:")
    print(f"  Input: {text}")
    print(f"  Tokens: {general_tokens}")
    print(f"  Token count: {len(general_tokens)}")
    
    # Medical tokenizer would handle medical terms as single tokens
    print(f"\nMedical Tokenizer (Expected):")
    print(f"  Tokens: ['patient', 'presents', 'with', 'acute_myocardial_infarction', 'and', 'requires', 'immediate', 'intervention']")
    print(f"  Token count: 8 (significant reduction)")
    
# General: ['patient', 'presents', 'with', 'acute', 'my', '##oc', '##ard', '##ial', 'in', '##far', '##ction', 'and', 'requires', 'immediate', 'intervention']
# Medical: ['patient', 'presents', 'with', 'acute_myocardial_infarction', 'and', 'requires', 'immediate', 'intervention']
  • ์˜ํ•™์šฉ์–ด๋ฅผ ์ž˜ ์ชผ๊ฐœ์ง€ ๋ชปํ•œ๋‹ค!!

4๏ธโƒฃ ์–ธ์–ด๋ณ„ ํŠนํ™”ํ•˜๋Š”๊ฒƒ์— ๋Œ€ํ•œ ๋ฌธ์ œ ๐ŸŒ

  • ํ˜•ํƒœํ•™์  ๋ณต์žก์„ฑ: ํ’๋ถ€ํ•œ ํ˜•ํƒœํ•™์„ ๊ฐ€์ง„ ์–ธ์–ด๋“ค (๋…์ผ์–ด, ํ„ฐํ‚ค์–ด, ํ•€๋ž€๋“œ์–ด)
  • ๊ต์ฐฉ์–ด์  ํŠน์„ฑ: ํ˜•ํƒœ์†Œ ๊ฒฐํ•ฉ์œผ๋กœ ๋‹จ์–ด๊ฐ€ ํ˜•์„ฑ๋˜๋Š” ์–ธ์–ด๋“ค (์ผ๋ณธ์–ด, ํ•œ๊ตญ์–ด, ํ—๊ฐ€๋ฆฌ์–ด)
  • ๋ฌธ์ž ์ฒด๊ณ„ ํ˜ผ์šฉ: ํ•œ ํ…์ŠคํŠธ์—์„œ ์—ฌ๋Ÿฌ ๋ฌธ์ž ์ฒด๊ณ„ ์‚ฌ์šฉ (์ผ๋ณธ์–ด: ํžˆ๋ผ๊ฐ€๋‚˜ + ๊ฐ€ํƒ€์นด๋‚˜ + ํ•œ์ž)
  • ๋ณตํ•ฉ์–ด: ๋…์ผ์–ด โ€œDonaudampfschifffahrtsgesellschaftskapitรคnโ€ (๋‹ค๋‰ด๋ธŒ ์ฆ๊ธฐ์„  ํšŒ์‚ฌ ์„ ์žฅ)

🔮 Future directions

1️⃣ Neural tokenization 🧠

  • Concept: tokenization that is learnable end to end
  • Advantage: task-specific optimization
  • Challenge: computational complexity

2️⃣ Multimodal tokenization 🖼️

Text gets turned into numbers through tokenization,
but images skip that step and are vectorized directly by a ViT!!

# Future concept: unified text-image tokenization
# (illustrative pseudocode: multimodal_tokenizer and cat_image_tensor are hypothetical)
multimodal_input = {
    "text": "A cat sitting on a chair",
    "image": cat_image_tensor
}

unified_tokens = multimodal_tokenizer.tokenize(multimodal_input)
# Outputs both text and visual tokens in same space

# Example output:
# [
#   ("A", "text_token"),
#   ("cat", "text_token"), 
#   ("<img_patch_1>", "visual_token"),
#   ("sitting", "text_token"),
#   ("<img_patch_2>", "visual_token"),
#   ("on", "text_token"),
#   ("chair", "text_token")
# ]

  • This is roughly how Google built its multimodal large model, Gemini!!
  • How do you tokenize an image!? Study ViT first and it will make sense!
  • For language, no matter how many words exist, the vocabulary stays bounded,
  • but for images!?
    • Problem 1: an image can produce an unbounded number of tokens!! What do we do about that!?
    • Problem 2: how do we align image tokens with text tokens!?

Problem 1: Token diversity 🌈

  • Text tokens: a fixed vocabulary (e.g., 50,000 entries)
  • Image tokens: near-unlimited diversity (every patch can be a completely different value)!!!!
  • Result: it is very hard for the model to learn and generalize over image tokens, and the compute cost balloons
    💡 Possible fixes:

    Either keep the number of image tokens small (1, 2, 3) or make the computation more efficient (4, 5)!

    1. 🔲 Fixed patch splitting: ViT-style, a constant token count regardless of image size (224×224 → 196 tokens; see the sketch right after this list)
    2. 📉 Adaptive compression: compress the image down to a target token budget (512×512 → 100 tokens)
    3. 🎯 Selective tokenization: tokenize only the important regions (e.g., keep just 128 tokens based on regions of interest)
    4. ⚡ Efficient attention: Flash Attention and sparse attention cut the compute by 2-4x
    5. 🏗️ Hierarchical processing: handle the modalities separately at first, then gradually introduce cross-modal attention

Problem 2: Aligning image and text tokens!! The cross-modal matching problem 🔗

  • Building a semantic link between the text "고양이" (cat) and a set of image patches is hard
  • It is unclear which text token each image patch relates to
    💡 Possible fixes:

    This is less about tokenization itself; we will dig into it properly when we cover Transformers!

    1. 🎯 Unified embedding space: project image patches into the same dimension as text tokens (Gemini: a unified 2048-dim space)
    2. 🔄 Cross-attention mechanisms: let image tokens interact directly with text tokens
    3. 📚 Large-scale multimodal training: learn the alignment from billions of image-text pairs
    4. 🧩 Token-level alignment: match image regions to text words directly, CLIP-style (see the toy sketch after this list)
    5. 🎨 Semantic grouping: group image patches with similar meaning and match the group to text
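A toy sketch of fix 4 (my own illustration, with random numbers standing in for learned weights): project patch embeddings and token embeddings into one shared space, then match them by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_shared = 8

patch_embeddings = rng.normal(size=(4, 32))  # 4 image patches, dim 32
token_embeddings = rng.normal(size=(3, 16))  # 3 text tokens, dim 16

W_img = rng.normal(size=(32, d_shared))      # projection matrices (learned in
W_txt = rng.normal(size=(16, d_shared))      # a real model, random here)

img = patch_embeddings @ W_img
txt = token_embeddings @ W_txt

# L2-normalize and compute cosine similarity (text tokens x image patches).
img = img / np.linalg.norm(img, axis=1, keepdims=True)
txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
similarity = txt @ img.T
print(similarity.shape)           # (3, 4)
print(similarity.argmax(axis=1))  # best-matching patch for each text token
```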

🌟 How real models tackle these problems:

| Model | Fix for Problem 1 | Fix for Problem 2 | Notes |
| --- | --- | --- | --- |
| GPT-4V | Adaptive tokenization | Hierarchical cross-attention | Adjusts to image complexity |
| Gemini Ultra | High-efficiency compression | Unified embedding space | Text and images fully unified |
| Claude 3 | Selective processing | Semantic grouping | Focuses compute on the important regions |
| LLaVA | Fixed patch splitting | CLIP-based alignment | Fixed at 576 tokens |

🎯 Key takeaways:

  • Problem 1 is an efficiency problem → solved with smart token management
  • Problem 2 is an alignment problem → solved with large-scale training and unified embeddings
  • Both are solved with "smarter methods", not just "more data"! 🚀

🧩 Tokenizers used by today's major models

| Model | Tokenizer type | Notes |
| --- | --- | --- |
| GPT-2 / GPT-3 / GPT-4 | BPE (OpenAI GPT tokenizer) | Uses tiktoken; optimized for English |
| LLaMA / LLaMA2 / LLaMA3 | SentencePiece + BPE | SentencePiece-based tokenizer trained by Meta |
| Gemini (Google) | Presumed SentencePiece-based | Structure similar to the PaLM/Flan family; details not public |
| Claude (Anthropic) | BPE variant | Details not public; uses its own tokenizer |
| Qwen (Alibaba) | GPT-style BPE | Optimized for Chinese, also supports English; tokenizer is public |
| Mistral / Mixtral | SentencePiece | Open-source models; follow the Hugging Face tokenizer format |
| Qwen-VL (multimodal) | GPT-style BPE + vision | Text side identical to Qwen; images use CLIP-style patch splitting |
| Gemini (multimodal) | SentencePiece + vision | Exact structure not public; presumed Flamingo-like |
| Grok (xAI) | Not disclosed | Model and tokenizer largely undisclosed; presumed English-centric |
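As a quick check of the first row, the GPT-style BPE can be tried directly, assuming the `tiktoken` package is installed ("cl100k_base" is the encoding used by GPT-4 and GPT-3.5):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenizers turn text into numbers.")
print(ids)              # integer token IDs
print(enc.decode(ids))  # losslessly reconstructs the original string
```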
This post is licensed under CC BY 4.0 by the author.