FG-CLIP Practice!! with Python
FG-CLIP Practice!!
FG-CLIP: Fine-Grained Visual and Textual Alignment
Today, we'll walk through a hands-on session with the hot new model from ICML 2025: FG-CLIP!
1. Clone the FG-CLIP Git Repository
- Clone the repo from the official GitHub page!

```bash
git clone git@github.com:360CVGroup/FG-CLIP.git
```
2. Install Required Packages in a Virtual Environment
I used a conda virtual environment to install the required packages.
I followed the instructions from the official GitHub repo!

```bash
conda create -n FGCLIP python=3.10 -y
conda activate FGCLIP
cd FG-CLIP && pip install -e .
```
In addition, I installed the following packages separately
(because I ran into some errors...):

```bash
pip install Pillow
pip install matplotlib
pip install torchvision --extra-index-url https://download.pytorch.org/whl/{insert-your-cu-version-here}
```
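If you are not sure which CUDA tag to put in that index URL, a quick check like the one below (my own addition, not part of the repo's instructions) prints the CUDA build your PyTorch was compiled against:

```python
# Quick check (my own addition): which CUDA build of torch is installed,
# and is a GPU actually visible? Use this to pick the matching wheel index.
import torch

print(torch.__version__)         # e.g. 2.x.y+cu121
print(torch.version.cuda)        # CUDA version torch was built against
print(torch.cuda.is_available())
```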
3. Test the FG-CLIP Model!
I tested it using the notebook environment provided by FG-CLIP.
The code looks like this:
```python
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Load the FG-CLIP base checkpoint plus its tokenizer and image processor
model_root = "qihoo360/fg-clip-base"
image_size = 224

model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
```
Now the model downloads and loads successfully!
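As a quick sanity check (my own addition, not part of the repo's notebook), you can confirm the model actually landed on the GPU and peek at the learned logit scale that the similarity code uses later:

```python
# Optional sanity check (my own addition): model device and CLIP-style
# temperature; model.logit_scale is used below to scale the logits.
print(device)                   # e.g. cuda:0
print(model.logit_scale.exp())  # learned logit scale
```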
Let's try with a sample image provided in the repo: cat_dfclor.jpg
img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
walk_short_pos = True
captions=["a photo of a cat", "a photo of a dog", "a photo of a animal"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
with torch.no_grad():
image_feature = model.get_image_features(image_input)
text_feature = model.get_text_features(caption_input,walk_short_pos=walk_short_pos)
image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
It outputs the similarity between the image and the three captions:
["a photo of a cat", "a photo of a dog", "a photo of a animal"]
```
tensor([[9.6813e-01, 3.2603e-05, 3.1839e-02]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
```
As expected, "cat" scored the highest, followed by "animal", with "dog" the lowest.
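If you want the winning caption programmatically instead of reading the raw tensor, a small optional addition (mine, not from the repo) pairs each probability with its caption:

```python
# Optional (my own addition): pair each probability with its caption and
# pick the best match. Assumes `captions` and `probs` from the snippet above.
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.4f}")

best = captions[probs.argmax(dim=1).item()]
print("best match:", best)  # expected: "a photo of a cat"
```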
But just seeing numbers isn't enough; we want to visualize the similarity!
```python
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    # Patch-level (dense) image features and the text feature for one caption
    dense_image_feature = model.get_image_dense_features(image_input)
    cap = "cat"
    captions = [cap]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity between every image patch and the caption
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()

# Reshape the per-patch scores back into a square grid and plot it
patch_size = int(math.sqrt(similarity.shape[0]))
original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('similarity Visualization')
plt.axis('off')
plt.savefig(f"{cap}_{img_root}.png")
```
This highlights the region associated with "cat"!
You'll see something like this:
I also ran a few extra prompts on the same image: with "black cat" only the black cat region lights up, and background prompts like "blanket" and "chair" also activate their regions to some extent.
Then I tested a baseball stadium image:
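If you would rather see the heatmap drawn on top of the photo instead of as a standalone grid, here is a minimal overlay sketch (my own addition, reusing `image`, `show_image`, `image_size`, `cap`, and `img_root` from the snippet above):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Normalize the patch-level similarity map to [0, 1] and upsample it to the
# image resolution so it can be alpha-blended over the resized photo.
heat = (show_image - show_image.min()) / (show_image.max() - show_image.min() + 1e-8)
heat_img = Image.fromarray(np.uint8(heat * 255)).resize((image_size, image_size), Image.BILINEAR)

plt.figure(figsize=(6, 6))
plt.imshow(image)
plt.imshow(np.asarray(heat_img), cmap='jet', alpha=0.5)
plt.axis('off')
plt.savefig(f"overlay_{cap}_{img_root}.png")
```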
Let's Try Bounding Boxes!
Numbers and heatmaps are nice, but how about bounding boxes?
Here's the code I used for BBox generation:
```python
import os
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from scipy.ndimage import label  # Used to find adjacent regions

# --- 1. Load model and tokenizer ---
model_root = "qihoo360/fg-clip-base"
image_size = 224
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device
tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

# --- 2. Load image and set caption ---
img_root = "baseball_bat_000106.jpg"
cap = "handle"
try:
    image = Image.open(img_root).convert("RGB")
except FileNotFoundError:
    print(f"'{img_root}' not found. Generating a black image as a fallback.")
    image = Image.new('RGB', (image_size, image_size), color='black')
image = image.resize((image_size, image_size))

# --- 3. Feature extraction and similarity calculation ---
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)
with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = [cap]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()

# --- 4. Group regions above the threshold and calculate BBox coordinates ---
patch_size_in_grid = int(math.sqrt(similarity.shape[0]))
pixel_per_patch = image_size // patch_size_in_grid

# 1) Reshape similarity into a 2D grid
similarity_map = similarity.reshape((patch_size_in_grid, patch_size_in_grid))

# 2) Set a threshold (tuned by hand here)
threshold = 0.22  # You can also try: np.mean(similarity) * 1.4

# 3) Create a binary mask where values above the threshold are True
print(f"threshold : {threshold}")
print(f"similarity_map : {similarity_map}")
mask = similarity_map > threshold
print("Mask shape:", mask.shape)

# 4) Label adjacent True regions (clusters) using scipy.ndimage.label
labeled_array, num_features = label(mask)

# --- 5. Draw a BBox for each grouped region ---
fig, ax = plt.subplots(1, figsize=(8, 8))
ax.imshow(image)

# Initialize overall bounding box coordinates
all_x1, all_y1 = float('inf'), float('inf')
all_x2, all_y2 = float('-inf'), float('-inf')

for i in range(1, num_features + 1):
    # Find all (row, col) indices for the current label
    rows, cols = np.where(labeled_array == i)
    print(rows, cols)

    # Get min/max for the current cluster
    min_row, max_row = np.min(rows), np.max(rows)
    min_col, max_col = np.min(cols), np.max(cols)

    # Compute top-left corner and size of the BBox in pixels
    bbox_x = min_col * pixel_per_patch
    bbox_y = min_row * pixel_per_patch
    bbox_w = (max_col - min_col + 1) * pixel_per_patch
    bbox_h = (max_row - min_row + 1) * pixel_per_patch
    print(bbox_x, bbox_y, bbox_w, bbox_h)

    # Create rectangle patch for this cluster
    rect = patches.Rectangle(
        (bbox_x, bbox_y), bbox_w, bbox_h,
        linewidth=2, edgecolor='cyan', facecolor='none'
    )
    ax.add_patch(rect)

    cluster_sim_values = similarity_map[rows, cols]
    mean_similarity = np.mean(cluster_sim_values)

    # Display the cluster's mean similarity at the center of its box
    center_col = bbox_x + bbox_w * 0.5
    center_row = bbox_y + bbox_h * 0.5
    ax.text(center_col, center_row, f"{mean_similarity:.3f}", color='black', ha='center', va='center', fontsize=10, weight='bold')

print(f"num_features : {num_features}")

for i in range(1, num_features + 1):
    rows, cols = np.where(labeled_array == i)
    if len(rows) == 0:
        continue
    min_row, max_row = np.min(rows), np.max(rows)
    min_col, max_col = np.min(cols), np.max(cols)
    bbox_x1 = min_col * pixel_per_patch
    bbox_y1 = min_row * pixel_per_patch
    bbox_x2 = (max_col + 1) * pixel_per_patch
    bbox_y2 = (max_row + 1) * pixel_per_patch
    # Update the overall BBox range
    all_x1 = min(all_x1, bbox_x1)
    all_y1 = min(all_y1, bbox_y1)
    all_x2 = max(all_x2, bbox_x2)
    all_y2 = max(all_y2, bbox_y2)

# Draw the final merged BBox
final_w = all_x2 - all_x1
final_h = all_y2 - all_y1
rect = patches.Rectangle(
    (all_x1, all_y1), final_w, final_h,
    linewidth=3, edgecolor='red', facecolor='none'
)
ax.add_patch(rect)

ax.set_title(f"Regions with Similarity > Threshold for: '{cap}' // threshold :{threshold}", fontsize=14)
ax.axis('off')
os.makedirs("bbox", exist_ok=True)  # Make sure the output directory exists
plt.savefig(f"bbox/bbox_multi_{cap}_{img_root}.png")
print(f"bbox/bbox_multi_{cap}_{img_root}.png")
```
This code dynamically highlights the region corresponding to whatever caption you give it! (I tuned the threshold by hand until the boxes looked right.)
For example:
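One practical note: all of the box coordinates above live in the 224x224 resized coordinate space. Here is a minimal sketch (my own addition, assuming the merged `all_x1 ... all_y2` box from the script above and that at least one cluster was found) for mapping the final box back to the original image resolution:

```python
from PIL import Image

# Scale the merged box from resized (224x224) coordinates back to the
# original resolution of the input image. Assumes img_root, image_size,
# and all_x1..all_y2 are defined by the bounding-box script above.
orig_w, orig_h = Image.open(img_root).size
sx, sy = orig_w / image_size, orig_h / image_size

x1, y1 = all_x1 * sx, all_y1 * sy
x2, y2 = all_x2 * sx, all_y2 * sy
print(f"merged bbox in original coordinates: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```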
Wrapping Up
Today, we explored and tested the FG-CLIP model!
In the next post, I'll dive into how it actually works under the hood. Stay tuned!