
๐Ÿ–ฅ๏ธ FG-Clip Practice!! : FG-Clip ์‹ค์Šต!! with python

๐Ÿ–ฅ๏ธ FG-Clip Practice!! : FG-Clip ์‹ค์Šต!! with python

🦖 FG-CLIP Practice!!

FG-CLIP : Fine-Grained Visual and Textual Alignment

Today, we'll walk through a hands-on session with the hot new model from ICML 2025:
FG-CLIP!


🧱 1. Clone the FG-CLIP Git Repository

git clone git@github.com:360CVGroup/FG-CLIP.git

📦 2. Install Required Packages in a Virtual Environment

I used a conda virtual environment to install the required packages.
I followed the instructions from the official GitHub repo!

conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .

In addition, I installed the following packages separately
(because I ran into some errors…):

pip install Pillow
pip install matplotlib
pip install torchvision --extra-index-url https://download.pytorch.org/whl/{insert-your-cu-version-here}
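
Before moving on, it doesn't hurt to confirm that the PyTorch install actually sees your GPU. This quick check is my own habit, not something from the FG-CLIP repo:

import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if the CUDA build matches your driver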

🧊 3. Test the FG-CLIP Model!

I tested it using the notebook environment provided by FG-CLIP.
The code looks like this:

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_root = "qihoo360/fg-clip-base"
image_size=224
model = AutoModelForCausalLM.from_pretrained(model_root,trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Now the model downloads and loads successfully!
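
As a quick sanity check after loading (my own addition, not part of the official notebook), you can confirm the device and the rough model size:

# The loaded object is a standard PyTorch module, so we can inspect it directly.
print(device)  # should show a CUDA device, e.g. cuda:0
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")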

Let's try with a sample image provided in the repo: cat_dfclor.jpg

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE: short captions use max_length=77 with walk_short_pos=True;
#       long captions would use max_length=248 with walk_short_pos=False
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog", "a photo of a animal"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

with torch.no_grad():
  image_feature = model.get_image_features(image_input)
  text_feature = model.get_text_features(caption_input,walk_short_pos=walk_short_pos)
  image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
  text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = image_feature @ text_feature.T 
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1) 
print(probs)

It outputs the similarity between the image and the three captions:
["a photo of a cat", "a photo of a dog", "a photo of a animal"]

tensor([[9.6813e-01, 3.2603e-05, 3.1839e-02]], device='cuda:0', grad_fn=<SoftmaxBackward0>)

As expected, "cat" scored the highest, followed by "animal", and "dog" the lowest.
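
To make the output easier to read, you can pair each probability with its caption. This is just a small convenience snippet of my own, reusing captions and probs from the block above:

# Print each caption next to its softmax probability.
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{caption}: {prob:.4f}")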

But just seeing the numbers isn't enough; we want to visualize the similarity!

import math
import matplotlib
matplotlib.use('Agg') 
import matplotlib.pyplot as plt

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    cap = "cat"
    captions = [cap]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input,walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))

original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape) 

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('similarity Visualization')
plt.axis('off')  
plt.savefig(f"{cap}_{img_root}.png")
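
If you prefer to see the heatmap on top of the photo instead of on its own, a small optional extension like the one below works. This is my own sketch, reusing show_image, image, and the other variables from the block above; the overlay filename is made up:

import numpy as np

# Normalize the patch-level similarity map to [0, 1], upsample it to the
# image resolution, and blend it over the resized photo.
sim_norm = (show_image - show_image.min()) / (show_image.max() - show_image.min() + 1e-8)
heatmap = Image.fromarray((sim_norm * 255).astype(np.uint8)).resize((image_size, image_size))

plt.figure(figsize=(6, 6))
plt.imshow(image)                                     # original (resized) photo
plt.imshow(np.array(heatmap), cmap="jet", alpha=0.5)  # similarity overlay
plt.axis("off")
plt.savefig(f"overlay_{cap}_{img_root}.png")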

This highlights the region associated with "cat"!
You'll see something like this:

black cat → It really highlights the black cat only!

keyboard → Nice!

blanket and chair → These work to some extent too!

Then I tested a baseball stadium image:

hold a bat → Works well!

player → Seems pretty accurate!?

catch → Is it right…?


🔲 Let's Try Bounding Boxes!

Numbers and heatmaps are nice, but how about bounding boxes?
Here's the code I used for BBox generation:

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)
import math
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from scipy.ndimage import label  # Used to find adjacent regions

# --- 1. Load model and tokenizer ---
model_root = "qihoo360/fg-clip-base"
image_size = 224
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device
tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

# --- 2. Load image and set caption ---
img_root = "baseball_bat_000106.jpg"
cap = "handle"

try:
    image = Image.open(img_root).convert("RGB")
except FileNotFoundError:
    print(f"'{img_root}' not found. Generating a black image for fallback.")
    image = Image.new('RGB', (image_size, image_size), color = 'black')

image = image.resize((image_size, image_size))

# --- 3. Feature extraction and similarity calculation ---
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = [cap]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()

# --- 4. Group regions above average and calculate BBox coordinates ---
patch_size_in_grid = int(math.sqrt(similarity.shape[0]))
pixel_per_patch = image_size // patch_size_in_grid

# 1) Reshape similarity into a 2D grid
similarity_map = similarity.reshape((patch_size_in_grid, patch_size_in_grid))

# 2) Set a threshold using the average value
threshold = 0.22  # You can also try: np.mean(similarity) * 1.4

# 3) Create a binary mask where values above the threshold are True
print(f"threshold : {threshold}")
print(f"similarity_map : {similarity_map}")

mask = similarity_map > threshold
print("Mask shape:", mask.shape)

# 4) Label adjacent True regions (clusters) using scipy.ndimage.label
labeled_array, num_features = label(mask)

# --- 5. Draw BBox for each grouped region ---
fig, ax = plt.subplots(1, figsize=(8, 8))
ax.imshow(image)

# Initialize overall bounding box coordinates
all_x1, all_y1 = float('inf'), float('inf')
all_x2, all_y2 = float('-inf'), float('-inf')

for i in range(1, num_features + 1):
    
    # Find all (row, col) indices for current label
    rows, cols = np.where(labeled_array == i)
    print(rows, cols)
    
    # Get min/max for current cluster
    min_row, max_row = np.min(rows), np.max(rows)
    min_col, max_col = np.min(cols), np.max(cols)
    
    # Compute top-left and size of BBox
    bbox_x = min_col * pixel_per_patch
    bbox_y = min_row * pixel_per_patch
    bbox_w = (max_col - min_col + 1) * pixel_per_patch
    bbox_h = (max_row - min_row + 1) * pixel_per_patch
    print(bbox_x, bbox_y, bbox_w, bbox_h)

    # Create rectangle patch
    rect = patches.Rectangle(
        (bbox_x, bbox_y), bbox_w, bbox_h,
        linewidth=2, edgecolor='cyan', facecolor='none'
    )
    ax.add_patch(rect)

    cluster_sim_values = similarity_map[rows, cols]
    mean_similarity = np.mean(cluster_sim_values)

    # 🎯 Display the mean cluster similarity at the center of the box
    center_col = bbox_x + bbox_w * 0.5
    center_row = bbox_y + bbox_h * 0.5
    ax.text(center_col, center_row,  f"{mean_similarity:.3f}", color='black', ha='center', va='center', fontsize=10, weight='bold')
    
print(f"num_features : {num_features}")
for i in range(1, num_features + 1):
    rows, cols = np.where(labeled_array == i)
    if len(rows) == 0:
        continue

    min_row, max_row = np.min(rows), np.max(rows)
    min_col, max_col = np.min(cols), np.max(cols)

    bbox_x1 = min_col * pixel_per_patch
    bbox_y1 = min_row * pixel_per_patch
    bbox_x2 = (max_col + 1) * pixel_per_patch
    bbox_y2 = (max_row + 1) * pixel_per_patch

    # Update overall BBox range
    all_x1 = min(all_x1, bbox_x1)
    all_y1 = min(all_y1, bbox_y1)
    all_x2 = max(all_x2, bbox_x2)
    all_y2 = max(all_y2, bbox_y2)

# Draw final merged BBox
final_w = all_x2 - all_x1
final_h = all_y2 - all_y1

rect = patches.Rectangle(
    (all_x1, all_y1), final_w, final_h,
    linewidth=3, edgecolor='red', facecolor='none'
)
ax.add_patch(rect)

ax.set_title(f"Regions with Similarity > Average for: '{cap}' // threshold :{threshold}", fontsize=14)
ax.axis('off')
plt.savefig(f"bbox/bbox_multi_{cap}_{img_root}.png")
print(f"bbox/bbox_multi_{cap}_{img_root}.png")

This code lets you dynamically highlight the region corresponding to your caption!
Note that I hand-tuned the bbox threshold (0.22 here) to fit this particular image.
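
If you want something less hand-tuned, one option is to derive the threshold from the statistics of the similarity map itself. This is a sketch of my own idea, not anything from the FG-CLIP repo, and the 0.5 factor is just an arbitrary starting point:

import numpy as np

# Hypothetical adaptive threshold: the mean plus half a standard deviation of
# the patch similarities computed in the bbox code above.
adaptive_threshold = float(np.mean(similarity) + 0.5 * np.std(similarity))
mask = similarity_map > adaptive_threshold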

For example:

handle → Beautiful result!

hit the ball → Somewhat localized to the bat area!

frisbee → Kind of… interesting!


🎉 Wrapping Up

Today, we explored and tested the FG-CLIP model!
In the next post, I'll dive into how it actually works under the hood. Stay tuned!



This post is licensed under CC BY 4.0 by the author.