GEM: Grounding Everything in Vision-Language Transformers
GEM: Unlocking the Latent Localization Ability of VLMs!
- Title: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- Conference: CVPR 2024
- Code/Checkpoints: GitHub – GEM
- Keywords: Training-Free, Grounding, Vision-Language Transformer, Self-Self Attention, Zero-Shot
- Summary: Feels like an extended version of CLIP Surgery! Proposes GEM, a framework that leverages the inherent attention structure of pretrained Vision-Language Transformers (VLMs) to perform object localization and segmentation in a training-free manner!
GEM Key Summary
One-liner: CLIP Surgery + (1) Attention Expansion + (2) Regularization
1) Self-Self Attention Expansion
- CLIP Surgery only used value–value (v–v) attention
- GEM extends this to query–query (q–q) and key–key (k–k) → utilizes full self–self attention
2) Regularization
- CLIP Surgery had no normalization concept
- GEM introduces three components for more stable and generalized localization:
i) Adaptive temperature: adaptively adjusts softmax temperature for each dataset/model
ii) L2 normalization: removes influence from token magnitude differences
iii) Iterative self–self attention: repeats the clustering multiple times for reinforcement
3) Training-Free Grounding with Zero-Shot Localization & Segmentation
- Directly extracts localization ability from pretrained VLMs
- Achieves performance comparable to fine-tuned detectors
- Open-vocabulary grounding without additional training
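Putting 1) and 2) together, one iteration of GEM's self–self attention can be written (my notation; single head, with dropout and the output projection omitted) for a token matrix $X \in \{Q, K, V\}$ as

$$
\hat{X} = \frac{X}{\lVert X \rVert_2}, \qquad
\mathrm{SSA}(X) = \operatorname{softmax}\!\big(\tau\, \hat{X}\hat{X}^{\top}\big)\,\hat{X},
$$

and after the final iteration each branch is applied to $V$ and the three branches are averaged:

$$
\mathrm{GEM}(Q, K, V) = \frac{1}{3}\sum_{X \in \{Q, K, V\}} \operatorname{softmax}\!\big(\tau\, \widehat{\mathrm{SSA}(X)}\, \widehat{\mathrm{SSA}(X)}^{\top}\big)\, V .
$$

Here $\tau$ is the adaptive inverse temperature; in the reference code shown in the Architecture section it is set to the mean token norm scaled by $1/\sqrt{d_{\text{head}}}$.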
Flow of Existing Research
1. Localization-first approaches
- Idea: first detect regions or masks, then label them using VL models
- Examples:
- OpenSeg: fine-tuned with class-agnostic masks + image-text pairs
- OVSeg: segmentation model + CLIP for mask classification
- MaskCLIP(3): mask proposal network + CLIP encoder
- GroundingSAM: GroundingDINO (detector) + SAM (masking)
2. Modifying VL model architecture/training
- Idea: alter ViT to encourage localization properties
- Examples:
- SegCLIP, GroupViT: insert grouping blocks
- ViL-Seg, OVSegmentor: clustering / Slot Attention
- ReCo: retrieval-based fine supervision
- PACL: add a decoder with grounding loss on top of CLIP
3. Training-free adaptation
- Idea: adapt pretrained VL models for localization without training
- Examples:
- MaskCLIP: remove final MLP, use value projection
- CLIP Surgery: add surgery pathway to ViT backbone (v–v attention with residual)
➡️ GEM's core concept: extending the training-free CLIP Surgery approach!
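For reference, here is a minimal sketch (my own simplification, not the official implementation) of the value–value attention that CLIP Surgery runs in its parallel pathway; GEM generalizes this single v–v branch to the full q–q / k–k / v–v ensemble described above:

```python
import torch

def vv_attention(v: torch.Tensor, scale: float) -> torch.Tensor:
    """CLIP Surgery-style value-value attention (illustrative sketch).

    v:     (B, heads, N, d) value tokens of one transformer block
    scale: softmax scaling, typically 1 / sqrt(d)
    """
    attn = (v @ v.transpose(-2, -1)) * scale   # value tokens attend to themselves
    return attn.softmax(dim=-1) @ v            # no query/key projections involved
```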
GEM Architecture
Easier to understand through code than images!
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
                 proj_drop=0., ss_attn_iter=1, ss_attn_temp=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.ss_attn_iter = ss_attn_iter
        self.ss_attn_temp = ss_attn_temp
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, attn_bias=None, prev_attn=None):
        x = x.transpose(0, 1)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        self.v_values = v

        # original self-attention for the original path
        attn_ori_return = (q @ k.transpose(-2, -1)) * self.scale
        attn_ori = attn_ori_return.softmax(dim=-1)
        attn_ori = self.attn_drop(attn_ori)
        x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
        x_ori = self.proj_drop(self.proj(x_ori))

        # GEM: self-self attention over v, k, and q
        xs1 = v
        xs2 = k
        xs3 = q

        # >>> i) Adaptive temperature: `inv_temp`
        if self.ss_attn_temp is None:
            pre_norm = torch.norm(x, dim=-1).mean(dim=-1, keepdim=True).unsqueeze(1).unsqueeze(-1)
            inv_temp = pre_norm * self.scale
        else:
            inv_temp = self.ss_attn_temp

        # >>> iii) Iterative self-self attention
        for it in range(self.ss_attn_iter):
            # >>> ii) L2 normalization
            xs1 = F.normalize(xs1, dim=-1)
            xs2 = F.normalize(xs2, dim=-1)
            xs3 = F.normalize(xs3, dim=-1)

            attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
            attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
            attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

            attn1 = (attn_return1).softmax(dim=-1)
            attn2 = (attn_return2).softmax(dim=-1)
            attn3 = (attn_return3).softmax(dim=-1)

            xs1 = attn1 @ xs1
            xs2 = attn2 @ xs2
            xs3 = attn3 @ xs3

        # Assignment to V
        xs1 = F.normalize(xs1, dim=-1)
        xs2 = F.normalize(xs2, dim=-1)
        xs3 = F.normalize(xs3, dim=-1)

        attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
        attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
        attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

        attn1 = (attn_return1).softmax(dim=-1)
        attn2 = (attn_return2).softmax(dim=-1)
        attn3 = (attn_return3).softmax(dim=-1)

        xs1 = attn1 @ v
        xs2 = attn2 @ v
        xs3 = attn3 @ v

        # >>> iv) qkv ensemble
        xs = (xs1 + xs2 + xs3) / 3

        x = xs.transpose(1, 2).reshape(B, N, C)
        x = self.proj_drop(self.proj(x))
        return [x.transpose(0, 1), x_ori.transpose(0, 1)]
```
1) Self-Self Attention Expansion
- As seen in the code: xs1 = v–v, xs2 = k–k, xs3 = q–q
2) Regularization
i) Adaptive temperature
ii) L2 normalization
iii) Iterative self–self attention
iv) qkv-Ensemble (averaging the three branches; see the usage sketch below)
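For intuition, here is a minimal smoke test of the SelfSelfAttention block above (shapes follow CLIP's ViT convention of tokens-first input; the numbers are illustrative and the weights are randomly initialized here, whereas in GEM they come from the pretrained CLIP block):

```python
import torch

dim, num_heads = 768, 12
attn = SelfSelfAttention(dim, num_heads=num_heads, ss_attn_iter=1)

tokens = torch.randn(197, 1, dim)   # 1 CLS token + 14x14 patch tokens, batch size 1
x_gem, x_ori = attn(tokens)         # GEM pathway and the original attention pathway
print(x_gem.shape, x_ori.shape)     # both torch.Size([197, 1, 768])
```

In GEM, the modified pathway output (x_gem) is the one used for localization, while x_ori mirrors the unmodified CLIP block, following the dual-path design inherited from CLIP Surgery.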
Experiments & Results
Segmentation & Localization Benchmarks
- On complex datasets like PascalContext, ADE20K:
- Significantly better than previous training-free methods
- Comparable or superior to fine-tuned approaches
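As context for these benchmark numbers, training-free open-vocabulary segmentation with a model like GEM boils down to comparing patch features against text embeddings. A minimal sketch of that step (my illustration, not the authors' exact evaluation code):

```python
import torch
import torch.nn.functional as F

def patch_text_segmentation(patch_feats: torch.Tensor,
                            text_feats: torch.Tensor,
                            grid_hw: tuple[int, int]) -> torch.Tensor:
    """Assign each patch to its most similar class prompt (illustrative only).

    patch_feats: (N, D) projected patch tokens from the GEM pathway
    text_feats:  (K, D) CLIP text embeddings of K class prompts
    grid_hw:     (H, W) patch grid with H * W == N
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = patch_feats @ text_feats.t()      # (N, K) cosine similarities
    labels = sim.argmax(dim=-1)             # per-patch class index
    return labels.reshape(grid_hw)          # coarse (H, W) segmentation map
```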
Zero-Shot Point Prediction (OpenImages V7)
- First training-free SOTA
- Demonstrates localization without LLM/VLM hybrid models
- Downside: inference FPS is quite slow
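One plausible way (my sketch, not necessarily the paper's exact protocol) to turn such a patch–text similarity map into a point prediction for a single text query:

```python
import torch
import torch.nn.functional as F

def predict_point(sim_map: torch.Tensor, image_hw: tuple[int, int]) -> tuple[int, int]:
    """Pick the most likely pixel for one text prompt (illustrative sketch).

    sim_map:  (H, W) patch-level cosine similarities for the prompt
    image_hw: (H_img, W_img) target image resolution
    """
    up = F.interpolate(sim_map[None, None], size=image_hw,
                       mode="bilinear", align_corners=False)[0, 0]
    idx = int(up.flatten().argmax())
    y, x = divmod(idx, image_hw[1])
    return x, y   # predicted point as (column, row)
```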
Qualitative Results
- Methods trained with localization supervision (GroundingSAM, OVSeg)
- Strength: high-quality masks when the object is correctly identified (e.g., Cat, Squirrel, Jet Ski)
- Weakness: fail to detect entities absent from their training data (e.g., Boxer, Violin)
- Cause: reliance on handcrafted segmentation annotations → limited scope
- Segmentation-specialized training methods (GroupViT, SegCLIP)
- Strength: accurate for common objects (e.g., Cat, Squirrel, Lizard)
- Weakness: fail on rare objects (e.g., Jet Ski, Logo, Flag)
- Cause: limited curated vocabulary → reduced diversity
- Training-free methods (MaskCLIP, CLIPSurgery, GEM)
- Strength: leverage the millions of image-text pairs from VLM pretraining → recognize diverse entities
- Weakness: masks less sharp than GroundingSAM
- GEM's extra achievements:
- Sharper segmentation than other training-free approaches (clearer contours, fewer holes)
- Detects objects missed by MaskCLIP & CLIP Surgery (e.g., Logo)
- Works not only with CLIP but with other VLM backbones as well
Failure cases show strong dependence on text prompts!
Ablation Analysis
Comparison with CLIP Surgery: adding k–k and q–q attention, normalization, etc. each brings further improvements
Effect of normalization: clear benefit, but it needs a properly chosen inverse temperature 1/T
Effect of iterations:
- More iterations merge tokens into larger clusters → helpful on datasets with few classes (VOC)
- On complex datasets with many classes (Context), over-merging hurts, so fewer iterations work better
Conclusion
- GEM reveals the latent localization ability of Vision-Language Transformers
- Potential to replace fine-tuned detectors when combined with larger VLMs
- Introduces a new paradigm for open-world recognition, segmentation, and grounding!