
๐Ÿ–ฅ๏ธ DINO Python Experiment!! Super Impressive!! - DINO ํŒŒ์ด์ฌ ์‹ค์Šต!! ์™„์ „ ์‹ ๊ธฐํ•ด!!

๐Ÿ–ฅ๏ธ DINO Python Experiment!! Super Impressive!! - DINO ํŒŒ์ด์ฌ ์‹ค์Šต!! ์™„์ „ ์‹ ๊ธฐํ•ด!!


In the previous post, we learned the theory behind DINO!!
Today, let's actually run the DINO model and see how it performs~!

(Figure: dino_result, the DINO attention map overlaid on the input image)

  • Starting with the conclusion today!!!
  • Using attention, it highlights the important parts of the image, ta-da~!
  • Isn't that amazing!?
  • Let's explore how it works~!

1. What is timm?!!

In this post, we'll load the DINO model using timm.
Let's first understand what timm (PyTorch Image Models) is!

  • timm stands for PyTorch Image Models,
  • A library that provides a wide array of tools and pretrained models for handling image tasks in PyTorch!!

Main features of timm:

  • Offers various modern image models:
    • Includes ResNet, EfficientNet, Vision Transformer (ViT), Swin Transformer, and more, all easily usable for image classification, detection, semantic segmentation, etc.
  • Rich pretrained weights:
    • Provides weights pretrained on large datasets such as ImageNet and JFT, plus weights from methods like BEiT, so you can do transfer learning without training from scratch
  • Easy model creation:
    • With timm.create_model(), you can create your desired model by name + conveniently load pretrained weights
  • Modular design:
    • Easily access and modify components like the backbone, pooling layer, and classifier head, offering high flexibility for building custom models or fine-tuning existing ones
  • Various utility functions:
    • Offers helpful tools for image transforms, dataset handling, optimizers, schedulers, etc.
  • Active community:
    • An open-source project actively maintained and continuously updated with new models and features
  • Example of using timm:
import timm

# USE DINO-ViT MODEL (pretrained)
model = timm.create_model('vit_base_patch16_224_dino', pretrained=True)
model.eval()

This loads the DINO model, structure and pretrained weights included!!
(We'll focus on hands-on work today; for architecture details, please check the theory post~)
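
By the way, if you are unsure of the exact model name, timm can search its registry with a wildcard. A quick sketch (available names vary across timm versions, so it is worth checking what your install provides before calling create_model()):

import timm

# List registered DINO variants that ship with pretrained weights
print(timm.list_models('*dino*', pretrained=True))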


2. Encoding Images with ViT-based DINO!! (Into Vectors)

The core idea of ViT is turning an image into a vector using Transformer techniques!!

(Figure: hold_fork, a photo of someone holding a fork)

Let's start with this image of someone holding a fork!!
And now!

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import torch
import timm
import torchvision.transforms as T

# USE DINO-ViT MODEL (pretrained)
model = timm.create_model('vit_base_patch16_224_dino', pretrained=True)
model.eval()

# Load image 
image_path = "hold_fork.jpg" 
image = Image.open(image_path).convert('RGB')  

# Image preprocess
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0)

# Forward pass: get the token features (no gradients needed)
with torch.no_grad():
    outputs = model.forward_features(img_tensor)  # Shape: (batch_size, 197, feature_dim)

print(outputs.shape)

Breaking down the code above:

  • Load the model
  • Load the image as RGB
  • Resize to (224, 224), convert to a tensor, and normalize
  • Feed it into DINO → get the final output!!

And the output will be:

torch.Size([1, 197, 768])

So the output is a tensor of shape 197 (1 CLS token + 196 patch tokens) × 768 (DINO's embedding dimension)!! The 196 comes from slicing the 224×224 input into 16×16 patches: (224/16)² = 14 × 14 = 196.
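
If you want to see exactly where those 196 tokens come from, you can run the patch-embedding layer on its own. A minimal sketch, assuming timm's standard ViT attribute name patch_embed and reusing model and img_tensor from the snippet above:

with torch.no_grad():
    tokens = model.patch_embed(img_tensor)  # patchify + linear projection: (1, 3, 224, 224) -> (1, 196, 768)
print(tokens.shape)  # torch.Size([1, 196, 768]): one token per 16x16 patch, before the CLS token is prepended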

That's the end of the image encoding process!!!!
You can now analyze each patch token or the CLS token depending on your purpose~~!!
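
For example, the CLS token acts as a whole-image embedding, so you can use it to compare two images. A minimal sketch continuing from the snippet above (the second image path is a hypothetical placeholder):

import torch.nn.functional as F

# Hypothetical second image; `model` and `transform` come from the snippet above
image2 = Image.open("another_image.jpg").convert('RGB')
img_tensor2 = transform(image2).unsqueeze(0)

with torch.no_grad():
    cls1 = model.forward_features(img_tensor)[:, 0]   # CLS embedding of image 1: (1, 768)
    cls2 = model.forward_features(img_tensor2)[:, 0]  # CLS embedding of image 2: (1, 768)

# Cosine similarity of the CLS embeddings (closer to 1 = more similar images)
print(F.cosine_similarity(cls1, cls2).item())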


3. Visualizing the Encoded Output!! (Decoding)

The result is in vector form: great for computers,
but hard for us to interpret, right?
Let's decode it so we can actually see it!

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import torch
import timm
import torchvision.transforms as T

# USE DINO-ViT MODEL (pretrained)
model = timm.create_model('vit_base_patch16_224_dino', pretrained=True)
model.eval()

# Load image 
image_path = "hold_fork.jpg" 
image = Image.open(image_path).convert('RGB')  

# Image preprocess
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0)

# Forward pass (no gradients needed)
with torch.no_grad():
    # Token features: 1 CLS token + 196 patch tokens
    outputs = model.forward_features(img_tensor)  # Shape: (batch_size, 197, feature_dim)

    # Extract patch tokens (excluding CLS)
    patch_tokens = outputs[:, 1:, :]  # (batch_size, 196, feature_dim)

    # Importance map: use the norm of each patch token as an attention-style proxy
    attn_map = torch.norm(patch_tokens, dim=-1).reshape(14, 14)  # (14x14)

    # Normalize (scale to range 0-1)
    attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())

# Visualize full Attention Map
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Original image
ax[0].imshow(image)
ax[0].axis('off')
ax[0].set_title('Original Image')

# Attention Map
attn_map_resized = np.array(Image.fromarray(attn_map.numpy()).resize(image.size, resample=Image.Resampling.BILINEAR))
ax[1].imshow(image)
ax[1].imshow(attn_map_resized, cmap='jet', alpha=0.5)  # overlay the attention map as a heat map
ax[1].axis('off')
ax[1].set_title('DINO-ViT Attention Map')

plt.tight_layout()
plt.show()

This code builds upon the previous one by adding visualization!
The most important part is:

# Forward pass (no gradients needed)
with torch.no_grad():
    outputs = model.forward_features(img_tensor)  # Shape: (batch_size, 197, feature_dim)

    # Extract patch tokens (exclude CLS)
    patch_tokens = outputs[:, 1:, :]  # (batch_size, 196, feature_dim)

    # Importance map: patch-token norm as an attention-style proxy
    attn_map = torch.norm(patch_tokens, dim=-1).reshape(14, 14)  # (14x14)

    # Normalize (scale to 0-1)
    attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())

This part drops 1 of the 197 token outputs.
Understanding why is essential!!

That's because we must exclude the CLS token!! The CLS token summarizes the whole image rather than any single location, so it has no spot on the 14×14 patch grid.

Once visualized, you get the same result as shown at the beginning of this post:

(Figure: dino_result, the DINO attention map overlaid on the input image)

The DINO model, trained without any labels,
intelligently identifies and highlights important regions in red,
while marking less important ones in blue!
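
One caveat: what we plotted is a norm-based proxy, not the model's actual attention weights. If you want the real CLS→patch attention from the last block (the classic DINO visualization), one option is a forward hook. A rough sketch, assuming the standard timm ViT internals (model.blocks[-1].attn exposing qkv, num_heads, and scale; these can differ between timm versions):

grabbed = {}
hook = model.blocks[-1].attn.register_forward_hook(
    lambda module, inp, out: grabbed.update(x=inp[0])  # capture the block's normalized input tokens
)

with torch.no_grad():
    model.forward_features(img_tensor)
hook.remove()

# Recompute the last block's attention by hand (timm does not return it by default)
attn = model.blocks[-1].attn
x = grabbed['x']                                   # (1, 197, 768)
B, N, C = x.shape
qkv = attn.qkv(x).reshape(B, N, 3, attn.num_heads, C // attn.num_heads).permute(2, 0, 3, 1, 4)
q, k = qkv[0], qkv[1]                              # each: (1, num_heads, 197, head_dim)
weights = (q @ k.transpose(-2, -1) * attn.scale).softmax(dim=-1)
cls_attn = weights[0, :, 0, 1:].mean(0).reshape(14, 14)  # CLS -> patches, averaged over heads

cls_attn can then be min-max normalized and overlaid exactly like attn_map above.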


4. Conclusion!!

With DINO, it's incredibly easy to turn images into vectors and visualize them!!
Building and training the model may have been tough,
but actually using it is super simple and impressive!!
Definitely something we should remember and leverage in future research~! 😊

Also, big thanks to timm for making model usage so convenient!
It supports not just DINO, but many other models as well!

timm.list_models()

You can use this to see the long list of available models~
In my version, over 1,200 models are available!
I also saw resnet, swin, RegNet, and EfficientNet; looks like I need to study those too!!
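
If the full list is overwhelming, list_models also accepts a wildcard filter, and len() gives you the count. A quick sketch (exact numbers depend on your timm version):

import timm

print(len(timm.list_models()))                  # total registered architectures in this install
print(timm.list_models('*efficientnet*')[:5])   # wildcard filter, first five matches
print(timm.list_models(pretrained=True)[:5])    # only names that ship pretrained weights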



This post is licensed under CC BY 4.0 by the author.