
๐Ÿ–ฅ๏ธ Image classification using ViT with Python - ํŒŒ์ด์ฌ์œผ๋กœ ViT ๋ชจ๋ธ์„ ํ™œ์šฉ, ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ํ•˜๊ธฐ

๐Ÿ–ฅ๏ธ Image classification using ViT with Python - ํŒŒ์ด์ฌ์œผ๋กœ ViT ๋ชจ๋ธ์„ ํ™œ์šฉ, ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ํ•˜๊ธฐ

Exploring Image Classification with the ViT Model in Python

Hello everyone! 😊

In the previous post, we delved into the theory behind ViT based on the original paper! Today, we will actually download this ViT model and perform image classification in a Python environment!!

1. Importing the ViT Model from torchvision! (The Simplest Way)

You can easily import the Vision Transformer (ViT) model through torchvision, a core library for image-related tasks in the PyTorch ecosystem.

What kind of package is torchvision, and why does it provide models?

torchvision is a package developed and maintained by the PyTorch team, providing commonly used datasets, image transformations (transforms), and pre-trained model architectures in the field of computer vision.

torchvision provides models for the following reasons:

  • Convenience: It supports researchers and developers in easily utilizing models with verified performance without the hassle of implementing image-related deep learning models from scratch.
  • Rapid Prototyping: Pre-trained models allow for quick experimentation with new ideas and development of prototypes.
  • Saving Learning Resources: Using models pre-trained on large-scale datasets saves the time and computational resources required for direct training.
  • Leveraging Learned Representations: Pre-trained models have already learned general image features, enabling good performance on specific tasks with less data (transfer learning) — a small head-swap sketch follows this list.
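For torchvision's ViT, transfer learning usually just means replacing the classification head on top of the pretrained backbone. A minimal sketch (assuming a recent torchvision with the weights-enum API; the 10-class task is hypothetical):

import torch.nn as nn
import torchvision.models as models

# Reuse the pretrained ViT-B/16 backbone and swap only the final classification layer
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 10)  # hypothetical 10-class task
# From here you would fine-tune on your own dataset as usual.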

Types and Features of ViT Models Provided by torchvision

torchvision provides various CNN-based models as well as ViT models. Currently (as of April 28, 2025), the main types and features of ViT models provided by torchvision are as follows:

| Name | Patch Size | Model Name | Features |
| --- | --- | --- | --- |
| ViT-Base | 16x16 | vit_b_16 | Offers a balanced size and performance. |
| ViT-Base | 32x32 | vit_b_32 | Larger patch size can reduce computation but may miss fine-grained features. |
| ViT-Large | 16x16 | vit_l_16 | Has more layers and a larger hidden dimension than the Base model, aiming for higher performance. Requires more computational resources. |
| ViT-Large | 32x32 | vit_l_32 | A Large model with a larger patch size. |
| ViT-Huge | 14x14 | vit_h_14 | One of the largest ViT models, aiming for top-level performance but requires very significant computational resources. |

These models all come with pre-trained weights on the ImageNet dataset, allowing for immediate use in image classification tasks.
The letters 'b', 'l', and 'h' in the model names indicate the Base, Large, and Huge model sizes, respectively, and the number following indicates the image patch size.
A larger patch size means the model looks at the image in larger chunks, which can lead to faster processing but potentially lower accuracy.
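For reference, each variant in the table can be constructed in one line; a quick sketch (again assuming the weights-enum API, and note that the Large and Huge checkpoints are much larger downloads):

import torchvision.models as models

# ImageNet-pretrained weights for a few of the variants above
vit_b_16 = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit_b_32 = models.vit_b_32(weights=models.ViT_B_32_Weights.IMAGENET1K_V1)
vit_l_16 = models.vit_l_16(weights=models.ViT_L_16_Weights.IMAGENET1K_V1)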


2. Today's Image!! 🐶 Let's Start Classifying!

[Image: dog]

Today, we will use a cute dog image to see how the ViT model classifies it. The ViT model we will use today is pre-trained on the ImageNet dataset!

What is imagenet_classes?

imagenet_classes is a list of 1000 image classes used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The pre-trained ViT models provided by torchvision are trained on this ImageNet dataset, so the model's output will be prediction probabilities for these 1000 classes. imagenet_classes serves to map these numerical prediction results to human-readable class names (e.g., "golden retriever", "poodle").

imagenet_classes.json: A JSON file containing imagenet_classes information.

Since torchvision itself does not directly include the ImageNet class name list, you need to prepare a separate JSON file containing this information. You can obtain the imagenet_classes.json file in the following way:

import requests
import json

# Read JSON file directly from URL
url = "https://raw.githubusercontent.com/anishathalye/imagenet-simple-labels/master/imagenet-simple-labels.json"


response = requests.get(url)
response.raise_for_status()  # Raise an error for bad status codes

# Load JSON data
imagenet_labels = response.json()

with open("imagenet_classes.json", "w") as f:
    json.dump(imagenet_labels, f)
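
Once saved, it's worth a quick sanity check that the file really contains the 1000 class names (index 207 should come back as the golden retriever label):

# Sanity check: reload the file and inspect one entry
with open("imagenet_classes.json", "r") as f:
    classes = json.load(f)
print(len(classes))   # expected: 1000
print(classes[207])   # expected: the golden retriever label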

3. Let's Begin the Code!!

import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image
import json

# 1. Load ViT model (ViT-Base, patch size 16)
vit_b_16 = models.vit_b_16(pretrained=True)
vit_b_16.eval()  # Set the model to evaluation mode

# 2. Define image preprocessing
# Resize images to 256 and then center crop to 224.
# Normalize using the mean and standard deviation of the ImageNet dataset.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 3. Load the dog image (replace with your image file path)
image_path = "dog.jpg"
try:
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0) # Add batch dimension
except FileNotFoundError:
    print(f"Error: Image file '{image_path}' not found.")
    exit()

# 4. Perform prediction
with torch.no_grad():
    output = vit_b_16(input_tensor)

# 5. Post-process the prediction results and print the class names
_, predicted_idx = torch.sort(output, dim=1, descending=True)
top_k = 5
try:
    with open("imagenet_classes.json", "r") as f:
        imagenet_classes = json.load(f)

    print(f"Top {top_k} prediction results:")
    for i in range(top_k):
        class_idx = predicted_idx[0, i].item()
        confidence = torch.softmax(output, dim=1)[0, class_idx].item()
        print(f"- {imagenet_classes[class_idx]}: {confidence:.4f}")
except FileNotFoundError:
    print("Error: 'imagenet_classes.json' file not found. Please prepare the file as shown in section 2.")
    print("Predicted class indices:", predicted_idx[0, :top_k].tolist())
except Exception as e:
    print(f"Error during prediction processing: {e}")

When you run the code above, you can see the Top 5 prediction results like this:

Top 5 Prediction Results:
- Golden Retriever: 0.9126
- Labrador Retriever: 0.0104
- Kuvasz: 0.0032
- Airedale Terrier: 0.0014
- tennis ball: 0.0012

We can see that the Golden Retriever is predicted with the highest probability of 91.26%.

4. Getting and Running the Model Directly from Hugging Face! + Analysis (Less Simple, But Customizable)

This time, let's load the model directly from Hugging Face and run the same classification! (The snippet below is a minimal sketch using the transformers library's ViTForImageClassification and ViTFeatureExtractor with the google/vit-base-patch16-224 checkpoint.)

import torch
from transformers import ViTForImageClassification, ViTFeatureExtractor
from PIL import Image

# 1. Load the ViT model and feature extractor from the Hugging Face Hub
#    (sketch using the ImageNet-1k checkpoint 'google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.eval()  # Set the model to evaluation mode

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')

# 2. Load the dog image (replace with your image file path)
image_path = "dog.jpg"
try:
    image = Image.open(image_path).convert('RGB')
except FileNotFoundError:
    print(f"Error: Image file '{image_path}' not found.")
    exit()

# 3. Preprocess: no need to crop and resize manually
inputs = feature_extractor(images=image, return_tensors="pt")

# 4. Perform prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (1, 1000)

# 5. Post-process: print the Top 5 classes using the model's own id2label mapping
probs = torch.softmax(logits, dim=1)
top_probs, top_idxs = probs.topk(5, dim=1)
print("Top 5 prediction results:")
for prob, idx in zip(top_probs[0], top_idxs[0]):
    print(f"- {model.config.id2label[idx.item()]}: {prob.item():.4f}")

Again it was classified as class 207, Golden Retriever!!!
But now, let's look at how this differs from the torchvision approach and what we can customize!

a. Image Preprocessing Method!!

Looking at the preprocessing part below, ViTFeatureExtractor already knows the preprocessing method used when the model was trained, allowing you to perform image preprocessing simply without writing a complex transforms.Compose process directly!

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')

# 3. preprocess : no need to  crop and resize
inputs = feature_extractor(images=image, return_tensors="pt")
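
As a quick check, the returned dict already contains a pixel_values tensor in the shape the model expects:

print(inputs['pixel_values'].shape)  # torch.Size([1, 3, 224, 224]) for ViT-B/16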

b. Viewing the CLS Token!!

In the previous theoretical post, we learned that the input consists of 196 patch tokens + 1 CLS token, totaling 197 tokens! We confirmed that the overall information of the image is contained in this first CLS token! You can inspect the CLS token with the following code!

from transformers import ViTModel, ViTImageProcessor
import torch
from PIL import Image

# 1. ViTModel (Pure model without classification head)
model = ViTModel.from_pretrained('google/vit-base-patch16-224')
model.eval()

# Feature Extractor โ†’ Updated to ViTImageProcessor
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')

# 2. Load Image
image = Image.open("dog.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# 3. Model Inference
with torch.no_grad():
    outputs = model(**inputs)

# 4. Extract CLS Token
last_hidden_state = outputs.last_hidden_state  # (batch_size, num_tokens, hidden_dim)
cls_token = last_hidden_state[:, 0, :]  # The 0th token is CLS

# 5. Print CLS Token
print("CLS token shape:", cls_token.shape)  # torch.Size([1, 768])
print("CLS token values (first 5):", cls_token[0, :5])

If you run the code above, you can see the 768-dimensional CLS token as expected! Later research often reuses this token as a compact, image-level representation for other tasks!

CLS token shape: torch.Size([1, 768])
CLS token values (first 5): tensor([-0.5934, -0.3203, -0.0811,  0.3146, -0.7365])
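
As one illustration of reusing this embedding, CLS vectors from two images can be compared directly, for example with cosine similarity. A hypothetical sketch (assumes a second image "cat.jpg" and reuses model and processor from the code above):

# Hypothetical example: compare CLS embeddings of two images
image2 = Image.open("cat.jpg").convert('RGB')
inputs2 = processor(images=image2, return_tensors="pt")
with torch.no_grad():
    cls_token2 = model(**inputs2).last_hidden_state[:, 0, :]

similarity = torch.nn.functional.cosine_similarity(cls_token, cls_token2)
print("CLS cosine similarity:", similarity.item())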

c. ViT's CAM!! Attention Rollout

In traditional CNN-based image classification, a CAM (Class Activation Map) could be attached at the end of the model to visualize which parts of the image were important!!!

CAM Theory Summary!!
CAM Practice!!

Our ViT model works differently from CAM, so the same approach is hard to apply directly! However, with a method called Attention Rollout, you can visualize which of the other 196 patches the all-important CLS token paid attention to!

Looking at the structure!!

As shown below, attention is the process by which [CLS] assigns a weight to each patch, roughly saying "you're important" or "you're not important," and Attention Rollout multiplies these attention maps across the layers so the result can be visualized!

[CLS]   โ†’ Patch_1   (Attention weight: 0.05)
[CLS]   โ†’ Patch_2   (Attention weight: 0.02)
[CLS]   โ†’ Patch_3   (Attention weight: 0.01)
...
[CLS]   โ†’ Patch_196 (Attention weight: 0.03)

In the end!! You can see a visualization of which patches were considered important as below!

  • Red areas โ†’ Patches that [CLS] paid much attention to.
  • Blue areas โ†’ Patches that [CLS] paid less attention to.

Looking at the code:

from transformers import ViTModel, ViTFeatureExtractor
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
import numpy as np

# 1. Load model and Feature Extractor
model = ViTModel.from_pretrained('google/vit-base-patch16-224', output_attentions=True)
model.eval()

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')

# 2. Load Image
image = Image.open("dog.jpg").convert('RGB')
inputs = feature_extractor(images=image, return_tensors="pt")

# 3. Model Inference (output attention)
with torch.no_grad():
    outputs = model(**inputs)
    attentions = outputs.attentions  # list of (batch, heads, tokens, tokens)

# 4. Calculate Attention Rollout
def compute_rollout(attentions):
    # Multiply attention matrices across layers
    result = torch.eye(attentions[0].size(-1))
    for attention in attentions:
        attention_heads_fused = attention.mean(dim=1)[0]  # (tokens, tokens)
        attention_heads_fused += torch.eye(attention_heads_fused.size(-1))
        attention_heads_fused /= attention_heads_fused.sum(dim=-1, keepdim=True)
        result = torch.matmul(result, attention_heads_fused)
    return result

rollout = compute_rollout(attentions)

# 5. Extract Attention from [CLS] token to image patches
mask = rollout[0, 1:].reshape(14, 14).detach().cpu().numpy()

# 6. Visualization
def show_mask_on_image(img, mask):
    img = img.resize((224, 224))
    mask = (mask - mask.min()) / (mask.max() - mask.min())
    fig, ax = plt.subplots()
    ax.imshow(img)
    ax.imshow(mask, cmap='jet', alpha=0.5)
    ax.axis('off')
    plt.show()

show_mask_on_image(image, mask)

And the result is!!!??

[Image: attention rollout heatmap overlaid on the dog image]

Does it look right~?
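
If you also want the single most strongly attended patch as grid coordinates rather than a heatmap, a small follow-up on the mask array from the code above:

import numpy as np  # already imported in the rollout code above

# Row/column of the most attended patch in the 14x14 grid (0-indexed)
row, col = np.unravel_index(mask.argmax(), mask.shape)
print("Most attended patch (row, col):", row, col)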


5. 💡 Conclusion: Simple and Fast ViT

How was it? We ran the code ourselves, and it was quick and easy to get results!

ViT was already significant in theory, and because models pre-trained on large-scale datasets can be used this easily in code, Transformer-based research in computer vision has exploded ever since!

In the future, we will also explore and practice various Vision Transformer-based models such as DINO, DeiT, CLIP, Swin Transformer, etc.! ^^

Thank you!!! 🚀🔥



This post is licensed under CC BY 4.0 by the author.