
๐Ÿ“Understanding FG-CLip - FG-Clip ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“Understanding FG-CLip - FG-Clip ์•Œ์•„๋ณด๊ธฐ?!!

🧠 Understanding FG-CLIP (in English)

🔍 A more precise and detailed evolution of CLIP!


Paper: FG-CLIP: Fine-Grained Visual and Textual Alignment
Conference: ICML 2025 (Xie, Chunyu, et al.)
Code: 360CVGroup/FG-CLIP


🔍 Summary of FG-CLIP

FG-CLIP (Fine-Grained CLIP) is a model designed to overcome the coarse recognition limitations of original CLIP.
It improves visual-textual alignment through three core techniques:


1. Global Semantic Alignment with Long Captions

  • 1.6 billion long image-text pairs generated with a state-of-the-art large multimodal model (LMM).
  • These enriched captions help the model understand more nuanced and complex visual content.
  • Result: improved global semantic alignment performance.

2. High-Quality Visual Grounding Dataset Construction

  • 40 million bounding boxes over 12 million images, each with detailed region-level descriptions.
  • Enables learning of region-specific, context-aware visual representations.
  • All of this is unified into the FineHARD dataset, which also includes hard negatives:
    • Improves the model's fine-grained alignment capacity significantly.

3. Learning with Hard Fine-Grained Negative Samples

  • Includes 10 million fine-grained hard negatives.
  • These are semantically similar but visually/verbally distinct image-text pairs.
  • Trains the model to detect subtle differences โ†’ improving fine-grained discrimination skills.

🧠 Why was FG-CLIP developed?

  • The original CLIP performs well on general multimodal tasks,
    but relies heavily on short and vague captions, which limit fine-grained understanding.
  • Existing datasets (COCO, LAION, etc.) are large but lack detail.

That's why they released FineHARD, a fine-grained dataset!

| Challenge | Limitation | Effect on CLIP |
|---|---|---|
| 🧵 Lack of detail | Descriptions are too general and lack object-level attributes/relations | Hard to train fine-grained alignment |
| 📦 Limited data size | Only 1–2M examples in FineCLIP/LongCLIP vs. billions in LAION | Lower generalization/performance than foundation models |
| 🔧 Label noise | Pseudo-labeling with object detectors introduces inconsistency | Reduced accuracy and misalignment risk |
| ⚠ Lack of hard negatives | Mostly easy positive examples; not enough confusing negatives | Weak at distinguishing subtly different concepts |

📚 Previous CLIP Extensions and Their Limitations

🧠 CLIPSelf (ICLR 2024)

Adds DINO-style self-supervised learning to CLIP
Skips the text encoder and focuses only on visual refinement!

  • Adds self-supervised pretext tasks on top of CLIPโ€™s image encoder.
  • Strength: Learn better visual representations without labels.
  • Limitation: Doesn't improve visual-language reasoning or alignment.

🎯 FineCLIP (NeurIPS 2024)

Improves object-text mapping at the part-level / attribute-level

  • Uses an object detector to extract region-level representations.
  • Learns multi-stage alignment (region, sentence, image).
  • Limitation: Depends on object detector accuracy; less generalizable in abstract scenes.

🔥 LLaVA (NeurIPS 2023)

A "proto-version" of vision-language assistants like Qwen-VL or GPT-4V.

  • Connects CLIP vision encoder with LLM (e.g., Vicuna).
  • Enables chat-like multimodal interaction.
  • Limitation: Heavily reliant on high-quality alignment data, lacks visual reasoning depth.

🧾 LongCLIP (ECCV 2024)

Expands CLIP's text encoder to handle long-form captions.

  • Trains on millions of long image-caption pairs.
  • Improves story-style image understanding and contextual reasoning.
  • Limitation: Long caption noise and text encoder overload can degrade alignment quality.

🚀 FG-CLIP Model Architecture

(Figure: FG-CLIP model structure)

  • FG-CLIP builds on CLIP's dual-encoder structure.
  • Two-stage training:
    • Stage 1: aligns global image-text meaning
    • Stage 2: adds region-level contrastive learning + hard negative learning

Stage 1: Global Contrastive Learning

  • Aligns image and text at a global semantic level
  • Uses both long and short captions per image
  • Long: rich context; Short: concise concept
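
To make the Stage 1 recipe concrete, here is a minimal PyTorch sketch of a symmetric contrastive loss applied to both caption types. The encoder modules, feature dimension, and the fixed temperature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between L2-normalized image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def global_loss(image_encoder, text_encoder, images, long_caps, short_caps):
    """Stage-1-style global alignment: each image is paired with both caption types."""
    img = image_encoder(images)                            # (B, D) global image features
    loss_long = clip_contrastive_loss(img, text_encoder(long_caps))
    loss_short = clip_contrastive_loss(img, text_encoder(short_caps))
    return 0.5 * (loss_long + loss_short)                  # long: rich context, short: core concept
```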

✅ Implementation Summary:

  • 1.6 billion image-text pairs
  • Pretrained from original CLIP
  • Vision Backbone: ViT-B and ViT-L
  • Optimizer: AdamW (LR=1e-4, weight decay=0.05, β1=0.9, β2=0.98)
  • Warmup: 200 steps
  • Batch size per NPU: 384
  • Precision: BFloat16
  • Optimization: DeepSpeed Zero-2
  • Epochs: 1
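
For reference, these hyperparameters map onto a standard PyTorch setup roughly as follows; the model is a stand-in, the post-warmup schedule is assumed constant, and the DeepSpeed ZeRO-2 / BFloat16 wrapping is omitted.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)   # stand-in for the FG-CLIP dual encoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.05,
    betas=(0.9, 0.98),
)

warmup_steps = 200

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 200 steps, then hold the base LR for the single epoch.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)
```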

Stage 2: Regional + Hard Negative + Reuse of Stage 1 Global Loss

2-1. 📍 Regional Contrastive Learning (L_regional)

  • Aligns bounding box features with corresponding text phrases using RoIAlign.
  • Enhances local grounding and part-level semantic understanding.
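
A rough sketch of what region-level alignment can look like with torchvision's roi_align: pool the dense feature map over each annotated box, project into the text embedding space, and contrast regions against their captions. The projection head, feature stride, and pooling size are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def regional_contrastive_loss(feature_map, boxes, region_txt_emb, proj, temperature=0.07):
    """
    feature_map:    (B, C, H, W) dense visual features from the image encoder
    boxes:          list of B tensors, each (N_i, 4) in (x1, y1, x2, y2) image coordinates
    region_txt_emb: (sum_i N_i, D) text embeddings of the matching region captions
    proj:           linear layer mapping pooled region features (C) -> D
    """
    # Pool one feature vector per box (spatial_scale maps image coords to feature-map coords).
    pooled = roi_align(feature_map, boxes, output_size=(1, 1), spatial_scale=1.0 / 32)
    region_vis = proj(pooled.flatten(1))                   # (num_boxes, D)

    region_vis = F.normalize(region_vis, dim=-1)
    region_txt = F.normalize(region_txt_emb, dim=-1)
    logits = region_vis @ region_txt.t() / temperature     # (num_boxes, num_boxes)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```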

2-2. ⚠️ Hard Fine-Grained Negatives (L_hard)

  • Generates subtle distractors by changing attributes (e.g., "blue shirt" → "red shirt").
  • Trains the model to distinguish subtle semantic differences.
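
The hard-negative objective can be treated as picking the correct caption out of one positive plus K generated distractors. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def hard_negative_loss(vis_emb, cand_txt_emb, temperature=0.07):
    """
    vis_emb:      (N, D) visual embeddings (e.g., region features)
    cand_txt_emb: (N, 1 + K, D) caption candidates per sample; index 0 is the true
                  caption, indices 1..K are hard negatives such as
                  "red checkered shirt" for a "blue striped shirt".
    """
    vis = F.normalize(vis_emb, dim=-1)
    txt = F.normalize(cand_txt_emb, dim=-1)
    # Similarity of each visual embedding to its own candidate captions: (N, 1 + K)
    logits = torch.einsum("nd,nkd->nk", vis, txt) / temperature
    targets = torch.zeros(vis.size(0), dtype=torch.long, device=vis.device)  # positive at index 0
    return F.cross_entropy(logits, targets)
```

With L_global, L_regional, and L_hard defined along these lines, the combined Stage 2 objective below is just their weighted sum.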

2-3. Repeat of Global Contrastive (L_global)

Combined loss:

L = L_global + α · L_regional + β · L_hard
  • α = 0.1, β = 0.5

✅ Implementation Summary:

  • 12M images, each with:
    • long + short captions
    • visual grounding annotations
    • hard negatives
  • Optimizer: AdamW (LR=1e-6, weight decay=0.001, β1=0.9, β2=0.98)
  • Warmup: 50 steps
  • Batch size per GPU: 512
  • Optimization: DeepSpeed Zero-2, CUDA TF32, BFloat16
  • Epochs: 1

📜 FG-CLIP Data Summary

Two phases of data preparation:

📌 Phase 1: Recaptioning LAION-2B

Surprisingly, they used Huawei NPUs (910B) instead of NVIDIA GPUs!

  • Original LAION captions are too generic ("a bird")
  • Used CogVLM2-19B to generate context-rich recaptions
    • Before: "a bird" → After: "a red-winged blackbird perched on a tree branch"
  • Recaptioned 2 billion images in 30 days on a cluster of 160 Huawei 910B NPUs

📦 Phase 2: FineHARD Dataset (Visual Grounding + Hard Negatives)

| Component | Amount |
|---|---|
| Long image-level captions | 12M images |
| Bounding box region captions | 40M boxes |
| Hard negatives | 10M samples |
| Build time | 7 days on the NPU cluster |

① Visual Grounding
  • Based on GRIT images + CogVLM2-generated captions
  • Extract referring expressions with SpaCy
  • Use YOLO-World for bounding boxes (confidence ≥ 0.4)
  • NMS removes overlaps → 12M images, 40M regions with rich captions
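
As a rough illustration of this pipeline (not the authors' released code): spaCy's noun chunks stand in for referring-expression extraction, the YOLO-World call is stubbed out because its exact API depends on the implementation you use, and torchvision's nms prunes overlapping boxes.

```python
import spacy
import torch
from torchvision.ops import nms

nlp = spacy.load("en_core_web_sm")

def referring_expressions(caption: str) -> list[str]:
    """Extract candidate referring expressions (noun chunks) from a long caption."""
    doc = nlp(caption)
    return [chunk.text for chunk in doc.noun_chunks]

def ground_caption(image, caption, detector, conf_thresh=0.4, iou_thresh=0.5):
    """detector(image, phrases) is a stand-in for a YOLO-World-style open-vocabulary
    detector returning (boxes (N, 4), scores (N,), phrase indices (N,))."""
    phrases = referring_expressions(caption)
    boxes, scores, phrase_ids = detector(image, phrases)

    keep = scores >= conf_thresh                  # confidence filter (>= 0.4 in the paper)
    boxes, scores, phrase_ids = boxes[keep], scores[keep], phrase_ids[keep]

    keep = nms(boxes, scores, iou_thresh)         # drop overlapping boxes
    return [(boxes[i], phrases[int(phrase_ids[i])]) for i in keep.tolist()]
```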

② Hard Negative Generation
  • Keep object name, change attributes to create contrast
  • Use LLaMA-3.1-70B to generate 10 distractors per positive
  • Post-process to remove symbols like ;, ,, \n
  • Quality check: 98.9% valid, 1.1% noise
  • Example:
    • Positive: "a man in a blue striped shirt"
    • Negative: "a man in a red checkered shirt"
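
A small sketch of the kind of post-processing described above; the validity check that keeps the object noun is a stand-in, since the paper does not spell out its exact filter.

```python
import re

def clean_negative(text: str) -> str:
    """Remove separators/newlines left over from LLM generation and tidy whitespace."""
    text = re.sub(r"[;\n]", " ", text).replace(",", " ")
    return re.sub(r"\s+", " ", text).strip()

def filter_negatives(positive: str, candidates: list[str], object_noun: str) -> list[str]:
    """Keep distractors that (a) differ from the positive and (b) keep the object noun,
    so only attributes change, e.g. 'blue striped shirt' -> 'red checkered shirt'."""
    cleaned = [clean_negative(c) for c in candidates]
    return [c for c in cleaned
            if c.lower() != positive.lower() and object_noun.lower() in c.lower()]

negs = filter_negatives(
    positive="a man in a blue striped shirt",
    candidates=["a man in a red checkered shirt;\n", "a man in a blue striped shirt"],
    object_noun="shirt",
)
# -> ["a man in a red checkered shirt"]
```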

💯 Performance (Ablation Test)

(Figure: ablation results)

Stage 1 alone already boosts performance thanks to long+short caption alignment.
As you add more components in Stage 2:

  • L_regional improves bbox classification
  • L_hard boosts fine-grained text-image discrimination

But a slight drop in short-caption retrieval was observed, perhaps because the longer and more complex negatives introduce some confusion!


🔮 Final Thoughts

Preparing such a massive dataset must have taken tremendous effort…
But each component clearly contributes to the model as the authors intended.

Especially impressive: the improvement in detail-level accuracy via negative samples 👍
I might try this myself someday…

And finally, I was amazed they used Huawei NPUs! NVIDIA isn't the only game in town! 🧠


🧠 Getting to Know FG-CLIP! (Korean Section)

🔍 An upgraded CLIP that can also handle much more detailed prompts!!

Paper: FG-CLIP: Fine-Grained Visual and Textual Alignment
Conference: ICML 2025 (Xie, Chunyu, et al.)
Code: 360CVGroup/FG-CLIP


🔍 Summary of FG-CLIP's Key Features

FG-CLIP (Fine-Grained CLIP) is a model designed to overcome the coarse recognition limits of the original CLIP,
and it greatly improves visual-language alignment through the following three core techniques.


1. Stronger Global Semantic Alignment via Long Captions

  • Used a state-of-the-art large multimodal model (LMM) to generate 1.6 billion long image-text pairs.
  • Training on data with far richer contextual information lets the model understand complex, fine-grained visual content better.
  • As a result, global semantic alignment improves.

2. Construction of a High-Quality Visual Grounding Dataset

  • Provides context-rich descriptions for 40 million bounding boxes across 12 million images.
  • With this dataset, the model can learn accurate, region-based representations.
  • This helps a lot in fine-grained object discrimination and location-aware alignment tasks.

  • Everything is ultimately merged into the FineHARD dataset!!
    • The visual grounding data and hard negative samples are combined into a high-quality dataset called FineHARD.
    • It plays the key role in improving fine-grained alignment ability.

3. Stronger Discrimination via Hard Negative Samples

  • Introduces a large corpus containing 10 million fine-grained hard negative samples.
  • Training on image-text pairs that are semantically similar but differ in attributes gives the model much finer discrimination ability.
  • This pushes the model to detect subtle differences and tell such pairs apart without confusion.

🧠 Why FG-CLIP Came About

  • The original CLIP is strong on multimodal tasks,
    but it relies on short, rough captions, so fine-grained understanding is lacking.
  • Existing image-text datasets have clear limitations.

    That's why they released the "FineHARD" dataset!

| Issue | Limitation | Impact |
|---|---|---|
| 🧵 Lack of fine-grained information | Mostly general scene descriptions (COCO, LAION, etc.); detailed object, attribute, and location information is missing | Hard to learn precise vision-language alignment |
| 📦 Limited data scale | FineCLIP (2.5M pairs), LongCLIP (1M pairs); still small compared to LAION and similar corpora | Weaker representation power and generalization than large-scale pretraining |
| 🔧 Auto-labeling noise | Detector-based pseudo-labels are efficient, but label quality varies and introduces noise | Lower training accuracy; fine-grained misalignment possible |
| ⚠ Lack of hard negative samples | Mostly easy-to-distinguish positives; few hard negatives | Hard to learn subtle differences between similar pairs; weaker fine-grained recognition |

Earlier CLIP Follow-up Studies and Their Limitations!

🧠 1. CLIPSelf (ICLR 2024)

Takes CLIP and applies DINO-style self-supervised learning!!
More specifically, it skips the text side and strengthens the image representations,
so that similar images get similar embeddings!!

  • Goal: boost performance by adding self-supervised learning on top of CLIP's image representations.
  • Key techniques:
    • Keeps the original CLIP architecture.
    • Learns refined representations via self-prediction (pretext tasks) on image features.
  • Strength: can obtain more precise, better-generalizing visual representations without labels.
  • Limitation: because it does not use text information, it does little for multimodal alignment or language-based reasoning.

🎯 2. FineCLIP (NeurIPS 2024)

Strengthens the mapping between fine-grained objects and sentences,
improving part-level and attribute-level expressiveness!!
Where the model used to understand only "bag", it now also understands details like "yellow bag"!!

  • Goal: overcome CLIP's coarse-text limitation and strengthen fine-grained vision-language representations.
  • Key techniques:
    • Uses an object detector to align region-level information with CLIP embeddings.
    • Multi-stage alignment training (object, sentence, and image level).
  • Strength: better fine-grained object recognition and detailed text mapping.
  • Limitations
    • Hard to generalize to open-domain scenes or aesthetic/abstract images.
    • The top-down region selection can also bias the learned representations.

🔥 3. LLaVA (Large Language and Vision Assistant) (NeurIPS 2023)

Feels like the ancestor of the vision-language models we are now used to (Llama-Vision, Qwen2.5-VL)!

  • Goal: build a multimodal assistant that gives a GPT-style LLM the ability to interpret visual input.
  • Key techniques:
    • Connects a CLIP vision encoder with an LLM (e.g., Vicuna).
    • Handles multimodal prompts so that image ↔ natural-language conversation is possible.
  • Strengths:
    • Enables conversational visual understanding systems.
    • Offers a ChatGPT-like UX with image understanding built in.
  • Limitation: heavily dependent on high-quality image-text alignment data, and actual visual reasoning ability is limited.

🧾 4. LongCLIP (ECCV 2024)

Upgrades and retrains CLIP's text encoder, which used to accept only short text,
so that CLIP can also take in long sentences!

  • Goal: overcome CLIP's short-text bias and achieve precise vision-language alignment based on long captions.
  • Key techniques:
    • Trains on large-scale long image-caption pairs.
    • Extends CLIP's text encoder into a structure suited to long text.
  • Strengths:
    • Strong performance on story-like, description-heavy image understanding.
    • Better zero-shot recognition and contextual understanding.
  • Limitation: long-caption noise and text-encoder overload make it hard to align the core representations.

🚀 FG-CLIP Model Architecture!

(Figure: FG-CLIP model structure)

  • FG-CLIP extends CLIP's dual-encoder structure and is trained in two stages!
  • Stage 1 performs CLIP-style image-text semantic alignment,
  • Stage 2 runs that same global alignment (Stage 2-3) together with region selection within the image (Stage 2-1) + negative-sample learning (Stage 2-2)!

Stage 1: Image-Text Semantic Alignment: Global Contrastive Learning

  • Performs global semantic alignment between images and text
  • Uses both long and short captions to learn a broad semantic spectrum
  • Long sentences help with complex meaning, short sentences with grasping the core concept

✅ Implementation Details

  • Data scale: 1.6 billion (1.6B) image-text pairs (each with a long + short caption)
  • Model initialization: starts from the original CLIP weights
  • Vision backbone: both ViT-B and ViT-L configurations tested
  • Optimizer: AdamW
    • Learning rate: 1e-4
    • Weight decay: 0.05
    • β1: 0.9, β2: 0.98
  • Schedule: 200 warmup steps at the start
  • Batch size: 384 per NPU
  • Temperature parameter (τ): 0.07 (a learnable variable)
  • Precision: BFloat16
  • Optimization: DeepSpeed ZeRO-2
  • Epochs: only 1 epoch over the full data (enough at this scale)
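
The learnable temperature mentioned above is usually implemented the way CLIP does it, as a learnable log-scale initialized so that τ = 0.07; a minimal sketch (an assumption about the implementation, not code from the FG-CLIP repo):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledSimilarity(nn.Module):
    """Cosine-similarity logits with a learnable temperature, initialized so tau = 0.07."""
    def __init__(self, init_tau: float = 0.07):
        super().__init__()
        # Store log(1/tau) as the parameter so the effective scale stays positive.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_tau)))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        return self.logit_scale.exp() * img_emb @ txt_emb.t()
```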

Stage 2: Region Selection Within the Image (2-1) + Negative-Sample Learning (2-2) + Repeating Stage 1

2-1. 📍 Regional Contrastive Learning (L_regional)

  • Aligns specific regions inside the image with the corresponding text fragments
  • Extracts per-bounding-box features (using RoIAlign) and connects them with the text describing that region
  • Strengthens fine-grained visual-linguistic correspondence

2-2. ⚠️ Hard Fine-Grained Negative Samples Learning (L_hard)

  • Generates and trains on hard negatives that are semantically similar but actually different
  • Builds samples with subtle differences by changing attributes in the ground-truth description
  • Helps the model tell apart similar-but-different samples, maximizing fine-grained recognition

2-3. Repeating Stage 1 (L_global)

  • In the second stage, the three training components are combined through the following unified loss:

L = L_global + α · L_regional + β · L_hard
  • α = 0.1 (weight of regional alignment)
  • β = 0.5 (weight of the hard negative term)

✅ Implementation Details

  • Data scale:
    • 12 million images in total
    • Each with:
      • a short caption
      • a long caption
      • precise visual grounding annotations
      • hard fine-grained negative samples
  • Model initialization: weights from Stage 1 (Global Contrastive Learning)
  • Optimizer: AdamW
    • Learning rate: 1e-6
    • Weight decay: 0.001
    • β1: 0.9, β2: 0.98
  • Warmup: first 50 steps
  • Batch size: 512 per GPU
  • Training optimizations:
    • DeepSpeed ZeRO-2
    • CUDA TF32 acceleration
    • BFloat16 precision
  • Epochs: 1 full epoch

📜 FG-CLIP's Data

FG-CLIP builds and uses a massive, high-quality dataset for precise image-text alignment.
The data is prepared in two stages, each optimized for its own purpose.


📌 Stage 1: Recaptioning LAION-2B (Global Contrastive Pretraining)

The surprising part: they used Huawei NPUs, not NVIDIA GPUs!!

  • The original LAION-2B dataset is full of generic, simple descriptions like "a bird", which limits precise training.
  • To fix this, they used the large multimodal model CogVLM2-19B to generate new, precise, context-rich recaptions for every image.
  • Example:
    • Before: "a bird"
    • After: "a red-winged blackbird perched on a tree branch in a park"
  • Recaptioning all 2 billion images took 30 days on a cluster of 160 Huawei 910B NPUs.
  • In ablation experiments, these refined descriptions substantially improved performance on various downstream tasks.

📦 Stage 2: Building the FineHARD Dataset (Regional + Hard Negative Training)

FineHARD, FG-CLIP's core training dataset, is built from three components!!

| Component | Quantity |
|---|---|
| Refined image-level captions (whole-image descriptions) | 12,000,000 images |
| Region-level descriptions (bounding boxes) | 40,000,000 boxes |
| Hard negative samples | 10,000,000 |
| Total build time | 7 days on the 910B NPU cluster |

① Precise Region-Text Alignment (Visual Grounding)
  • Starting from GRIT images, generate whole-image captions with CogVLM2-19B.
  • Use SpaCy to extract referring expressions from the captions.
  • Feed these into the YOLO-World object detector to obtain the matching bounding boxes.
  • After a confidence threshold of ≥ 0.4 and NMS:
    • 12 million images
    • 40 million bounding boxes
    • precise region captions for each region

② Hard Negative Sample Generation (Hard Negative Mining)
  • Negatives are generated by changing only the attributes in the ground-truth description while keeping the object name.
  • LLaMA-3.1-70B generates 10 hard negatives per positive sample.
  • Sentences are cleaned by removing special symbols (semicolons, line breaks, etc.).
  • Quality check:
    • 98.9% of samples are valid
    • only 1.1% are noise → excellent quality for an unsupervised pipeline
  • Example:
    • Positive: "a man in a blue striped shirt"
    • Negative: "a man in a red checkered shirt"

💯 Performance by Stage (Ablation Test)

(Figure: ablation results)

The figure above shows the ablation results reported in the paper.
Stage 1 alone, linking text and images with both long and short captions, already brings an overall performance gain.

Adding the Stage 2 components on top (Global > Regional > Hard) improves performance further.
As the authors intended, adding L_regional raises bounding-box accuracy, and adding L_hard deepens text understanding, so fine-grained understanding improves dramatically, which I found impressive!

On the other hand, short-caption retrieval actually dropped a little.
My suspicion: did the model get a bit confused while learning from long sentences and negatives?


🔮 Final Thoughts

Preparing this much data must have taken an enormous amount of work…
It is impressive how well each module works exactly as the authors intended!

What sticks with me most is that training with negatives improved the model's accuracy on details!
I should try this myself sometime.

And finally, I thought NVIDIA was the only option, but they used NPUs! Very impressive!!

This post is licensed under CC BY 4.0 by the author.