📝 GEM: Grounding Everything in Vision-Language Transformers

📝 GEM: Unlocking the Latent Localization Ability of VLMs!

Image

  • Title: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
  • Venue: CVPR 2024
  • Code/checkpoints: GitHub – GEM
  • Keywords: Training-Free, Grounding, Vision-Language Transformer, Self-Self Attention, Zero-Shot
  • Summary: GEM feels like an extension of CLIP Surgery! It leverages the attention structure already present in pretrained Vision-Language Transformers (VLMs) to perform object localization and segmentation without any additional training (training-free).

🚀 GEM Key Summary

One-liner: CLIP Surgery + (1) Attention Expansion + (2) Regularization

1) Self-Self Attention Expansion

  • CLIP Surgery only used value–value (v–v) attention
  • GEM extends this to query–query (q–q) and key–key (k–k) attention → utilizes the full set of self–self attentions (sketched below)
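
As a rough notational sketch (my own symbols, not the paper's): standard attention mixes the values with query–key similarity, whereas a self–self branch mixes each projection with its own similarity, using an inverse temperature τ and per-token L2 normalization (both covered under Regularization below):

$$
\text{standard: } \operatorname{softmax}\!\left(\frac{qk^\top}{\sqrt{d}}\right) v
\qquad\text{vs.}\qquad
\text{self–self: } \operatorname{softmax}\!\left(\tau\,\hat{p}\hat{p}^\top\right)\hat{p},
\quad p \in \{q, k, v\}.
$$

CLIP Surgery keeps only the p = v branch; GEM runs all three, applies the final attention maps to the values, and averages the results (the qkv ensemble).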

2) Regularization

  • CLIP Surgery had no normalization concept
  • GEM introduces three components for more stable and generalized localization:
    i) Adaptive temperature: adaptively adjusts softmax temperature for each dataset/model
    ii) L2 normalization: removes influence from token magnitude differences
    iii) Iterative self–self attention: repeats the clustering step multiple times to reinforce it (see the sketch below)
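
Concretely, reading off the reference code later in the post (notation mine, with N tokens and head dimension d_h): the adaptive inverse temperature is the mean token norm times the usual attention scale, and each iteration re-normalizes and re-attends,

$$
\tau = \frac{1}{T} = \left(\frac{1}{N}\sum_{i=1}^{N}\lVert x_i\rVert_2\right)\frac{1}{\sqrt{d_h}},
\qquad
p \leftarrow \operatorname{softmax}\!\left(\tau\,\hat{p}\hat{p}^\top\right)\hat{p},
\quad \hat{p} = \frac{p}{\lVert p \rVert_2},
$$

so inputs with larger token norms get a sharper softmax, and repeated updates progressively tighten the patch clusters.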

3) Training-Free Grounding with Zero-Shot Localization & Segmentation

  • Directly extracts localization ability from pretrained VLMs
  • Achieves performance comparable to fine-tuned detectors
  • Open-vocabulary grounding without additional training

🔍 Flow of Existing Research

Image

1. Localization-first approaches

  • Idea: first detect regions or masks, then label them using VL models
  • Examples:
    • OpenSeg: fine-tuned with class-agnostic masks + image-text pairs
    • OVSeg: segmentation model + CLIP for mask classification
    • MaskCLIP(3): mask proposal network + CLIP encoder
    • GroundingSAM: GroundingDINO (detector) + SAM (masking)

2. Modifying VL model architecture/training

  • Idea: alter ViT to encourage localization properties
  • Examples:
    • SegCLIP, GroupViT: insert grouping blocks
    • ViL-Seg, OVSegmentor: clustering / Slot Attention
    • ReCo: retrieval-based fine supervision
    • PACL: add a decoder with grounding loss on top of CLIP

3. Training-free adaptation

  • Idea: adapt pretrained VL models for localization without training
  • Examples:
    • MaskCLIP: remove final MLP, use value projection
    • CLIP Surgery: add a surgery pathway to the ViT backbone (v–v attention with a residual connection; a minimal sketch follows below)

➡️ GEM's core concept: extending the training-free CLIP Surgery approach!

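To make the starting point concrete, here is a minimal sketch (my own simplification, not the authors' code) of the value–value attention step that CLIP Surgery adds as a parallel pathway; GEM generalizes exactly this operation to q–q and k–k and adds the regularization described above:

```python
import torch

def vv_attention(v, scale):
    """Minimal sketch of a CLIP Surgery-style v-v attention step.

    v: (batch, heads, tokens, head_dim) value projection from a frozen CLIP ViT block.
    Instead of softmax(q @ k^T / sqrt(d)) @ v, the surgery pathway attends with
    value-value similarity, which tends to keep attention on patches of the same
    object; its output is accumulated alongside the original path via a residual.
    """
    attn = torch.softmax((v @ v.transpose(-2, -1)) * scale, dim=-1)  # value-value similarity
    return attn @ v  # re-mix values with their own similarity
```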

🧱 GEM Architecture

Easier to understand through code than images!

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.,
                 ss_attn_iter=1, ss_attn_temp=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.ss_attn_iter = ss_attn_iter
        self.ss_attn_temp = ss_attn_temp

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, attn_bias=None, prev_attn=None):
        x = x.transpose(0, 1)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        self.v_values = v

        # Original self-attention for the original (CLIP) path
        attn_ori_return = (q @ k.transpose(-2, -1)) * self.scale
        attn_ori = attn_ori_return.softmax(dim=-1)
        attn_ori = self.attn_drop(attn_ori)

        x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
        x_ori = self.proj_drop(self.proj(x_ori))

        # GEM path: self-self attention over v, k, and q
        xs1 = v
        xs2 = k
        xs3 = q

        # >>> i) Adaptive temperature: `inv_temp` scales with the mean token norm
        if self.ss_attn_temp is None:
            pre_norm = torch.norm(x, dim=-1).mean(dim=-1, keepdim=True).unsqueeze(1).unsqueeze(-1)
            inv_temp = pre_norm * self.scale
        else:
            inv_temp = self.ss_attn_temp

        # >>> iii) Iterative self-self attention
        for it in range(self.ss_attn_iter):
            # >>> ii) L2 normalization: removes the influence of token magnitude differences
            xs1 = F.normalize(xs1, dim=-1)
            xs2 = F.normalize(xs2, dim=-1)
            xs3 = F.normalize(xs3, dim=-1)

            attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
            attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
            attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

            attn1 = (attn_return1).softmax(dim=-1)
            attn2 = (attn_return2).softmax(dim=-1)
            attn3 = (attn_return3).softmax(dim=-1)

            xs1 = attn1 @ xs1
            xs2 = attn2 @ xs2
            xs3 = attn3 @ xs3

        # Assignment to V: one final self-self attention step, applied to the original values
        xs1 = F.normalize(xs1, dim=-1)
        xs2 = F.normalize(xs2, dim=-1)
        xs3 = F.normalize(xs3, dim=-1)

        attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
        attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
        attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

        attn1 = (attn_return1).softmax(dim=-1)
        attn2 = (attn_return2).softmax(dim=-1)
        attn3 = (attn_return3).softmax(dim=-1)

        xs1 = attn1 @ v
        xs2 = attn2 @ v
        xs3 = attn3 @ v

        # >>> iv) qkv ensemble: average the three self-self attention outputs
        xs = (xs1 + xs2 + xs3) / 3

        x = xs.transpose(1, 2).reshape(B, N, C)
        x = self.proj_drop(self.proj(x))

        return [x.transpose(0, 1), x_ori.transpose(0, 1)]
```

1) Self-Self Attention Expansion

  • As seen in the code: xs1 = v–v, xs2 = k–k, xs3 = q–q

2) Regularization
i) Adaptive temperature
ii) L2 normalization
iii) Iterative self–self attention
iv) qkv-Ensemble (averaging the three results; a usage sketch follows below)
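
Putting the pieces together, here is a hypothetical end-to-end sketch (not the official GEM API): `encode_patches_with_gem` and `encode_text` are assumed stand-ins for a frozen CLIP-like model whose last blocks carry the parallel SelfSelfAttention path shown above. Patch features are compared with the text embeddings by cosine similarity and reshaped into one coarse heat map per prompt, which is all that training-free, open-vocabulary grounding needs:

```python
import torch
import torch.nn.functional as F

def zero_shot_heatmaps(image, prompts, encode_patches_with_gem, encode_text, grid_hw):
    """Hypothetical sketch of training-free grounding with GEM-style patch features.

    image: (3, H, W) tensor; grid_hw: (h, w) patch grid of the ViT, with h * w patches.
    encode_patches_with_gem(image) -> (h * w, dim) patch embeddings from the GEM path (assumed helper).
    encode_text(prompts) -> (num_prompts, dim) CLIP text embeddings (assumed helper).
    """
    patches = F.normalize(encode_patches_with_gem(image), dim=-1)   # (P, D)
    texts = F.normalize(encode_text(prompts), dim=-1)               # (C, D)

    sim = patches @ texts.t()                                       # cosine similarity, (P, C)
    h, w = grid_hw
    heatmaps = sim.t().reshape(len(prompts), h, w)                  # one coarse map per prompt

    # Upsample to image resolution; the per-pixel argmax gives a rough segmentation.
    heatmaps = F.interpolate(heatmaps[None], size=image.shape[-2:],
                             mode="bilinear", align_corners=False)[0]
    return heatmaps, heatmaps.argmax(dim=0)
```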


🧪 Experiments & Results

🎯 Segmentation & Localization Benchmarks

Image

  • On complex datasets like PascalContext, ADE20K:
    • Significantly better than previous training-free methods
    • Comparable or superior to fine-tuned approaches

🎯 Zero-Shot Point Prediction (OpenImages V7)

Image

  • First training-free SOTA
  • Demonstrates localization without LLM/VLM hybrid models
  • Downside: inference is quite slow (low FPS)

👀 Qualitative Results

Comparison with other models!
Image

  1. Methods trained with localization (GroundingSAM, OVSeg)
    • Strength: high-quality masks if object correctly identified (e.g., Cat, Squirrel, Jet Ski)
    • Weakness: fail to detect entities rarely seen in their training data (e.g., Boxer, Violin)
    • Cause: reliance on handcrafted segmentation annotations → limited scope
  2. Segmentation-specialized training methods (GroupViT, SegCLIP)
    • Strength: accurate for common objects (e.g., Cat, Squirrel, Lizard)
    • Weakness: fail on rare objects (e.g., Jet Ski, Logo, Flag)
    • Cause: trained on a limited, curated vocabulary → reduced diversity
  3. Training-free methods (MaskCLIP, CLIPSurgery, GEM)
    • Strength: leverage the millions of image-text pairs seen during VLM pretraining → recognize diverse entities
    • Weakness: masks are less sharp than GroundingSAM's
    • GEM's extra achievement:
      • Sharper segmentation compared to other training-free approaches (clearer contours, fewer holes)
      • Detects objects missed by MaskCLIP & CLIPSurgery (e.g., Logo)

Applicable not only to CLIP but to other VLMs as well!
Image

Failure cases show strong dependence on text prompts!
Image


🧪 Ablation Analysis

  • Comparison with CLIP Surgery: adding k-k, q-q, normalization, etc. leads to improvements
    Image

  • Effect of normalization: clear benefit, requires proper 1/T value
    Image

  • Effect of iterations:

    • More iterations merge tokens into larger clusters → beneficial for datasets with few classes (VOC)
    • Fewer iterations → better for complex datasets with many classes (Context), where over-merging hurts; a small configuration sketch follows the figure below

Image
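
In terms of the reference code above, this trade-off is just the `ss_attn_iter` argument of `SelfSelfAttention`; a hypothetical configuration sketch (values chosen only to illustrate the trend, not official presets, assuming a ViT-B/16-sized backbone):

```python
# Hypothetical settings mirroring the ablation trend; a frozen CLIP block's attention
# module would be swapped for one of these (SelfSelfAttention as defined above).
attn_few_classes = SelfSelfAttention(dim=768, num_heads=12, ss_attn_iter=2)   # e.g., VOC: more merging helps
attn_many_classes = SelfSelfAttention(dim=768, num_heads=12, ss_attn_iter=1)  # e.g., Context: avoid over-merging
```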


✅ Conclusion

  • GEM reveals the latent localization ability of Vision-Language Transformers
  • Potential to replace fine-tuned detectors when combined with larger VLMs
  • Introduces a new paradigm for open-world recognition, segmentation, and grounding!

๐Ÿ“ (ํ•œ๊ตญ์–ด) GEM: VLM์ด ๊ฐ€์ง„ ์ž ์žฌ์  Localization ๋Šฅ๋ ฅ์„ ๋Œ์–ด๋‚ด๋‹ค!

Image

  • ์ œ๋ชฉ: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
  • ํ•™ํšŒ: CVPR 2024
  • ์ฝ”๋“œ/์ฒดํฌํฌ์ธํŠธ: GitHub โ€“ GEM
  • ํ•ต์‹ฌ ํ‚ค์›Œ๋“œ: Training-Free, Grounding, Vision-Language Transformer, Self-Self Attention, Zero-Shot
  • ์š”์•ฝ: CLIP Surgery์˜ ํ™•์žฅํŒ๋А๋‚Œ!! ์‚ฌ์ „ ํ•™์Šต๋œ Vision-Language Transformer(VLM)์˜ ๋‚ด์žฌ๋œ attention ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•ด, ์ถ”๊ฐ€ ํ•™์Šต ์—†์ด(training-free) ๊ฐ์ฒด ์œ„์น˜ ์ธ์‹๊ณผ ๋ถ„ํ• ๊นŒ์ง€ ์ˆ˜ํ–‰ํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ GEM ์ œ์•ˆ!

๐Ÿš€ GEM ํ•ต์‹ฌ ์š”์•ฝ

ํ•œ ์ค„ ์š”์•ฝ: CLIP Surgery์— 1. Attention ํ™•์žฅ ๋ฐ 2. Regularization ๋„์ž…

1) Self-Self Attention ํ™•์žฅ

  • CLIP Surgery ์—์„œ๋Š” ์˜ค์ง valueโ€“value (vโ€“v) attention๋งŒ ์‚ฌ์šฉ
  • ๊ทธ๋Ÿฐ๋ฐ GEM์€!! vโ€“v๋ฟ๋งŒ ์•„๋‹ˆ๋ผ queryโ€“query (qโ€“q), keyโ€“key (kโ€“k)๊นŒ์ง€ ํ™•์žฅ โ†’ selfโ€“self attention ์ „๋ฐ˜ ํ™œ์šฉ

2) Regularization ๋„์ž…

  • CLIP Surgery์—๋Š” ์ •๊ทœํ™” ๊ฐœ๋…์ด ์—†์Œ
  • GEM์€ ์•ˆ์ •์ ์ด๊ณ  ์ผ๋ฐ˜ํ™”๋œ localization์„ ์œ„ํ•ด ์„ธ๊ฐ€์ง€ ์š”์†Œ๋ฅผ ๋„์ž…!!
    i) Adaptive temperature: ๋ฐ์ดํ„ฐ์…‹/๋ชจ๋ธ๋งˆ๋‹ค ์ ์‘์ ์œผ๋กœ softmax ์˜จ๋„ ์กฐ์ •
    ii) L2 ์ •๊ทœํ™”: ํ† ํฐ์˜ ํฌ๊ธฐ(norm) ์ฐจ์ด๋กœ ์ƒ๊ธฐ๋Š” ์˜ํ–ฅ ์ œ๊ฑฐ
    iii) Iterative selfโ€“self attention: ํ•„์š” ์‹œ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณตํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ฐ•ํ™”

3) Training-Free Grounding ์ด๋ฉด์„œ Zero-Shot Localization & Segmentation

  • ์‚ฌ์ „ ํ•™์Šต๋œ VLM์—์„œ ๋ฐ”๋กœ localization ์„ฑ๋Šฅ ์ถ”์ถœ
  • fine-tuned detector ์ˆ˜์ค€์— ๋งž๋จน๋Š” ์„ฑ๋Šฅ
  • ์ถ”๊ฐ€ ํ•™์Šต ์—†์ด open-vocabulary grounding ๋‹ฌ์„ฑ

๐Ÿ” ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ๋ฆ„

Image

1. Localization-first ์ ‘๊ทผ

  • ์•„์ด๋””์–ด: ๋จผ์ € ์˜์—ญ(Region)์ด๋‚˜ ๋งˆ์Šคํฌ๋ฅผ ์ฐพ์€ ๋’ค VL ๋ชจ๋ธ๋กœ ๋ผ๋ฒจ๋ง
  • ์˜ˆ์‹œ:
    • OpenSeg: class-agnostic mask + image-text pair๋กœ ํŒŒ์ธํŠœ๋‹
    • OVSeg: segmentation model + CLIP์œผ๋กœ ๋งˆ์Šคํฌ ๋ถ„๋ฅ˜
    • MaskCLIP(3): ๋งˆ์Šคํฌ ์ œ์•ˆ ๋„คํŠธ์›Œํฌ + CLIP ์ธ์ฝ”๋”
    • GroundingSAM: GroundingDINO(๊ฒ€์ถœ) + SAM(๋งˆ์Šคํฌ)

2. VL ๋ชจ๋ธ ๊ตฌ์กฐ/ํ•™์Šต ์ˆ˜์ • ์ ‘๊ทผ

  • ์•„์ด๋””์–ด: ViT ๊ตฌ์กฐ๋ฅผ ๋ฐ”๊พธ์–ด localization ํŠน์„ฑ์„ ์œ ๋„
  • ์˜ˆ์‹œ:
    • SegCLIP, GroupViT: grouping block ์‚ฝ์ž…
    • ViL-Seg, OVSegmentor: clustering/Slot Attention ํ™œ์šฉ
    • ReCo: retrieval ๊ธฐ๋ฐ˜ ๋ฏธ์„ธ ๊ฐ๋…
    • PACL: CLIP ์œ„์— decoder + grounding loss

3. Training-free ์ ์‘ ์ ‘๊ทผ

  • ์•„์ด๋””์–ด: ํ•™์Šต ์—†์ด ๊ธฐ์กด VL ๋ชจ๋ธ์„ localization์— ๋งž๊ฒŒ ๋ณ€ํ˜•
  • ์˜ˆ์‹œ:
    • MaskCLIP: ๋งˆ์ง€๋ง‰ MLP ์ œ๊ฑฐ, value projection ํ™œ์šฉ
    • CLIP Surgery: ViT ๋ฐฑ๋ณธ์— surgery pathway ์ถ”๊ฐ€ (valueโ€“value attention ์‚ฌ์šฉ, residual๋กœ ๋ˆ„์ )
  • ์ด ์ค‘์—์„œ Training-Free์ธ 3๋ฒˆ์˜ CLIP Surgery๋ฅผ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์ด GEM์˜ ๊ธฐ๋ณธ๊ฐœ๋…!!

๐Ÿงฑ GEM ๊ตฌ์กฐ (Architecture)

์—ฌ๊ธด ์ด๋ฏธ์ง€๋ณด๋‹ค ์ฝ”๋“œ๋กœ ๋ณด๋Š”๊ฐœ ํŽธํ•จ!!!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
class SelfSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., ss_attn_iter=1,
                 ss_attn_temp=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.ss_attn_iter = ss_attn_iter
        self.ss_attn_temp = ss_attn_temp

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, attn_bias=None, prev_attn=None):
        x = x.transpose(0, 1)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        self.v_values = v
        # original self-attention for the original path
        attn_ori_return = (q @ k.transpose(-2, -1)) * self.scale
        attn_ori = attn_ori_return.softmax(dim=-1)
        attn_ori = self.attn_drop(attn_ori)

        x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
        x_ori = self.proj_drop(self.proj(x_ori))

        # GEM
        xs1 = v
        xs2 = k
        xs3 = q

        #  >>> i) Adaptive temperature: `inv_temp`
        if self.ss_attn_temp is None:
            pre_norm = torch.norm(x, dim=-1).mean(dim=-1, keepdim=True).unsqueeze(1).unsqueeze(-1)
            inv_temp = pre_norm * self.scale
        else:
            inv_temp = self.ss_attn_temp

        # >>> iii) Iterative selfโ€“self attention: ๋ฐ˜๋ณตํ•จ!!
        for it in range(self.ss_attn_iter):

          #   >>> ii) L2 ์ •๊ทœํ™”: ํ† ํฐ์˜ ํฌ๊ธฐ(norm) ์ฐจ์ด๋กœ ์ƒ๊ธฐ๋Š” ์˜ํ–ฅ ์ œ๊ฑฐ  
            xs1 = F.normalize(xs1, dim=-1)
            xs2 = F.normalize(xs2, dim=-1)
            xs3 = F.normalize(xs3, dim=-1)

            attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
            attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
            attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

            attn1 = (attn_return1).softmax(dim=-1)
            attn2 = (attn_return2).softmax(dim=-1)
            attn3 = (attn_return3).softmax(dim=-1)

            xs1 = attn1 @ xs1
            xs2 = attn2 @ xs2
            xs3 = attn3 @ xs3

        # Assigment to V
        xs1 = F.normalize(xs1, dim=-1)
        xs2 = F.normalize(xs2, dim=-1)
        xs3 = F.normalize(xs3, dim=-1)

        attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
        attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
        attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

        attn1 = (attn_return1).softmax(dim=-1)
        attn2 = (attn_return2).softmax(dim=-1)
        attn3 = (attn_return3).softmax(dim=-1)

        xs1 = attn1 @ v
        xs2 = attn2 @ v
        xs3 = attn3 @ v

        ## >>> iiii : qkv ensemble!!!
        xs = (xs1 + xs2 + xs3) / 3

        x = xs.transpose(1, 2).reshape(B, N, C)
        x = self.proj_drop(self.proj(x))

        return [x.transpose(0, 1), x_ori.transpose(0, 1)]

1) Self-Self Attention ํ™•์žฅ

  • ์•„๋ž˜ ์ฝ”๋“œ์—์„œ ๋ณผ์ˆ˜ ์žˆ๋“ฏ~!!
  • xs1 = v-v, xs2 = k-k , xs3= q-q

2) Regularization ๋„์ž…
i) Adaptive temperature: ๋ฐ์ดํ„ฐ์…‹/๋ชจ๋ธ๋งˆ๋‹ค ์ ์‘์ ์œผ๋กœ softmax ์˜จ๋„ ์กฐ์ •
ii) L2 ์ •๊ทœํ™”: ํ† ํฐ์˜ ํฌ๊ธฐ(norm) ์ฐจ์ด๋กœ ์ƒ๊ธฐ๋Š” ์˜ํ–ฅ ์ œ๊ฑฐ
iii) Iterative selfโ€“self attention: ํ•„์š” ์‹œ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณตํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ฐ•ํ™”
iiii) qkv-Ensemble: ๋งˆ์ง€๋ง‰์— ๋‹ค ๋”ํ•ด์„œ ํ‰๊ท ๋ƒ„!!

๐Ÿงช ์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ ๋ถ„์„

๐ŸŽฏ Segmentation & Localization Benchmarks

Image

  • PascalContext, ADE20K ๋“ฑ ๋ณต์žกํ•œ ๋ ˆ์ด๋ธ”๋ง ๋ฐ์ดํ„ฐ์…‹์—์„œ
    • ๊ธฐ์กด training-free ๋ฐฉ๋ฒ• ๋Œ€๋น„ ์›”๋“ฑํ•œ ์„ฑ๋Šฅ
    • fine-tuned ๋ฐฉ๋ฒ•์— ๊ทผ์ ‘ํ•˜๊ฑฐ๋‚˜ ๋Šฅ๊ฐ€

๐ŸŽฏ Zero-Shot Point Prediction (OpenImages V7)

Image

  • ์ตœ์ดˆ์˜ training-free SOTA ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  • LLM/VLM ์กฐํ•ฉ ์—†์ด๋„ localization ๊ฐ€๋Šฅ์„ฑ ํ™•์ธ
  • ๋‹ค๋งŒ fps๋Š” ๋„ˆ๋ฌด ๋А๋ฆฌ๋‹ค..

๐Ÿ‘€ ๊ฒฐ๊ณผ ์ด๋ฏธ์ง€ ๋ณด๊ธฐ

๋‹ค๋ฅธ๋ชจ๋ธ๋“ค๊ณผ์˜ ๊ฒฐ๊ณผ๋น„๊ต์ด์ง€๋ฏธ!! Image

  1. Localization ์ •๋ณด๋กœ ํ•™์Šตํ•œ ๋ฐฉ๋ฒ• (GroundingSAM, OVSeg)
    • ์žฅ์ : ๊ฐ์ฒด๋ฅผ ์ •ํ™•ํžˆ ์ธ์‹ํ•˜๋ฉด ๊ณ ํ’ˆ์งˆ ๋งˆ์Šคํฌ ์ถœ๋ ฅ (์˜ˆ: Cat, Squirrel, Jet Ski)
    • ํ•œ๊ณ„: ๋ฐ์ดํ„ฐ์…‹์— ์ž˜ ์•ˆ ๋‚˜์˜ค๋Š” ๊ฐœ์ฒด(Boxer, Violin) ํƒ์ง€ ๋ถˆ๊ฐ€
    • ์›์ธ: ์ˆ˜์ž‘์—… segmentation annotation ์˜์กด โ†’ ๋ฒ”์œ„ ์ œํ•œ
  2. Segmentation ํŠนํ™” ํ•™์Šต ๋ฐฉ๋ฒ• (GroupViT, SegCLIP)
    • ์žฅ์ : ํ”ํ•œ ๊ฐ์ฒด ์ž˜ ๋ถ„ํ•  (์˜ˆ: Cat, Squirrel, Lizard)
    • ํ•œ๊ณ„: ๋“œ๋ฌธ ๊ฐœ์ฒด(Jet Ski, Logo, Flag) ๋ถ„ํ•  ์‹คํŒจ
    • ์›์ธ: ์ œํ•œ๋œ vocab curation์œผ๋กœ ํ•™์Šต โ†’ ์–ดํœ˜ ๋‹ค์–‘์„ฑ ๋ถ€์กฑ
  3. Training-free ๋ฐฉ๋ฒ• (MaskCLIP, CLIPSurgery, GEM)
    • ์žฅ์ : VL ๋ชจ๋ธ์ด ํ•™์Šตํ•œ ์ˆ˜๋ฐฑ๋งŒ ๊ฐœ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ ํ™œ์šฉ โ†’ ๋‹ค์–‘ํ•œ ๊ฐœ์ฒด ์ธ์‹ ๊ฐ€๋Šฅ
    • ๋‹จ์ : ๋งˆ์Šคํฌ ๊ฒฝ๊ณ„๊ฐ€ ๋‚ ์นด๋กญ์ง€ ๋ชปํ•จ (GroundingSAM๋ณด๋‹ค ๋œ ์ •๋ฐ€)
    • GEM์˜ ์ถ”๊ฐ€ ์„ฑ๊ณผ:
      • ๊ธฐ์กด training-free ๋Œ€๋น„ ๋” ์„ ๋ช…ํ•œ segmentation (๊ฒฝ๊ณ„ ๋šœ๋ ท, ๊ตฌ๋ฉ ์ ์Œ)
      • MaskCLIP, CLIPSurgery๊ฐ€ ๋†“์นœ ๊ฐ์ฒด๊นŒ์ง€ ํƒ์ง€ (์˜ˆ: Logo)

CLIP ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋‹ค๋ฅธ VLM๋“ค์—๋„ ์ ์šฉ์ด ๊ฐ€๋Šฅํ•˜๋‹ค! Image

์‹คํŒจ ์ผ€์ด์Šค๋ฅผ ๋ณธ๋‹ค๋ฉด ํ…์ŠคํŠธ ํ”„๋กฌํฌํŠธ์— ๋”ฐ๋ผ ์ฐจ์ด๊ฐ€ ์ปท๋‹ค!! Image


๐Ÿงช Ablation ๋ถ„์„

  • ClipSurgery๋ž‘ ๋น„๊ต + k-k๋ž‘ q-q ์ถ”๊ฐ€ํ•˜๊ณ  ์ •๊ทœํ™” ๋“ฑ์„ ์ถ”๊ฐ€ํ•˜๋ฉด์„œ ์ข‹์•„์ง„๋‹ค!
    Image

  • ์ •๊ทœํ™” ํšจ๊ณผ๋Š”!? : ์žˆ๋‹ค!! ์•Œ๋งž์€ 1/T ๊ฐ’ ๊ฐ€ ํ•„์š”ํ•˜๋‹ค!
    Image

  • ๋ฐ˜๋ณต์˜ ํšจ๊ณผ๋Š”!? : ๋ฐ˜๋ณต(iteration)์„ ๋Š˜๋ฆฌ๋ฉด ํ† ํฐ๋“ค์ด ๋” ํฐ ํด๋Ÿฌ์Šคํ„ฐ๋กœ ๋ฌถ์—ฌ์„œ ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ์…‹(VOC)์—” ์œ ๋ฆฌํ•˜์ง€๋งŒ, ๋‹ค์–‘ํ•œ ๊ฐ์ฒด๊ฐ€ ๋งŽ์€ ๋ฐ์ดํ„ฐ์…‹(Context)์—์„œ๋Š” ๊ณผ๋„ํ•œ ๋ณ‘ํ•ฉ์ด ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์„ ๋–จ์–ด๋œจ๋ฆฐ๋‹ค.

Image


โœ… ๊ฒฐ๋ก 

  • GEM์€ Vision-Language Transformer๊ฐ€ ๋‚ด์žฌ์ ์œผ๋กœ ๊ฐ€์ง„ localization ๋Šฅ๋ ฅ์„ ๋ฐœ๊ตดํ•œ ์—ฐ๊ตฌ
  • ํ–ฅํ›„ ๋” ํฐ VLM๊ณผ ๊ฒฐํ•ฉ ์‹œ, fine-tuned detector๋ฅผ ๋Œ€์ฒดํ•  ์ž ์žฌ๋ ฅ ๋ณด์œ 
  • open-world recognition, segmentation, grounding์˜ ์ƒˆ๋กœ์šด ํŒจ๋Ÿฌ๋‹ค์ž„ ์ œ์‹œ!
This post is licensed under CC BY 4.0 by the author.