📝Understanding FG-CLip - FG-Clip 알아보기?!!

Posted Jun 11, 2025

By DrFirst

24 min read

🧠 Understanding FG-CLIP (in English)

🔍 A more precise and detailed evolution of CLIP!

Paper: FG-CLIP: Fine-Grained Visual and Textual Alignment
Conference: ICML 2025 (Xie, Chunyu, et al.)
Code: 360CVGroup/FG-CLIP

🔍 Summary of FG-CLIP

FG-CLIP (Fine-Grained CLIP) is a model designed to overcome the coarse recognition limitations of original CLIP.
It improves visual-textual alignment through three core techniques:

1. Global Semantic Alignment with Long Captions

1.6 billion long image-text pairs generated using a state-of-the-art multimodal model (LMM).
These enriched captions help the model understand more nuanced and complex visual content.
Result: improved global semantic alignment performance.

2. High-Quality Visual Grounding Dataset Construction

40 million bounding boxes over 12 million images, each with detailed region-level descriptions.
Enables learning of region-specific, context-aware visual representations.
All of this is unified into the FineHARD dataset, which also includes hard negatives:
- Improves the model’s fine-grained alignment capacity significantly.

3. Learning with Hard Fine-Grained Negative Samples

Includes 10 million fine-grained hard negatives.
These are semantically similar but visually/verbally distinct image-text pairs.
Trains the model to detect subtle differences → improving fine-grained discrimination skills.

🧠 Why was FG-CLIP developed?

The original CLIP performs well on general multimodal tasks,
but relies heavily on short and vague captions, which limit fine-grained understanding.
Existing datasets (COCO, LAION, etc.) are large but lack detail.

That’s why they released FineHARD, a fine-grained dataset!

Challenge	Limitation	Effect on CLIP
🧵 Lack of detail	Descriptions are too general, lack object-level attributes/relations	Hard to train fine-grained alignment
📦 Limited data size	Only 1–2M examples in FineCLIP/LongCLIP vs billions in LAION	Low generalization/performance compared to foundation models
🔧 Label noise	Pseudo-labeling using object detectors introduces inconsistency	Reduced accuracy and misalignment risk
⚠ Lack of hard negatives	Mostly easy positive examples; not enough confusing negative samples	Weakness in distinguishing subtly different concepts

📚 Previous CLIP Extensions and Their Limitations

🧠 CLIPSelf (ICLR 2024)

Adds DINO-style self-supervised learning to CLIP
Skips the text encoder — focuses only on visual refinement!

Adds self-supervised pretext tasks on top of CLIP’s image encoder.
Strength: Learn better visual representations without labels.
Limitation: Doesn’t improve visual-language reasoning or alignment.

🎯 FineCLIP (NeurIPS 2024)

Improves object-text mapping at the part-level / attribute-level

Uses an object detector to extract region-level representations.
Learns multi-stage alignment (region, sentence, image).
Limitation: Depends on object detector accuracy; less generalizable in abstract scenes.

🔥 LLaVA (NeurIPS 2023)

A “proto-version” of vision-language assistants like Qwen-VL or GPT-4V.

Connects CLIP vision encoder with LLM (e.g., Vicuna).
Enables chat-like multimodal interaction.
Limitation: Heavily reliant on high-quality alignment data, lacks visual reasoning depth.

🧾 LongCLIP (ECCV 2024)

Expands CLIP’s text encoder to handle long-form captions.

Trains on millions of long image-caption pairs.
Improves story-style image understanding and contextual reasoning.
Limitation: Long caption noise and text encoder overload can degrade alignment quality.

🚀 FG-CLIP Model Architecture

FG-CLIP builds on CLIP’s dual encoder structure.
Two-stage training:
- Stage 1: Align global image-text meaning
- Stage 2: Adds region-level contrast + hard negative learning

Stage 1: Global Contrastive Learning

Aligns image and text at a global semantic level
Uses both long and short captions per image
Long: rich context; Short: concise concept

✅ Implementation Summary:

1.6 billion image-text pairs
Pretrained from original CLIP
Vision Backbone: ViT-B and ViT-L
Optimizer: AdamW (LR=1e-4, weight decay=0.05, β1=0.9, β2=0.98)
Warmup: 200 steps
Batch size per NPU: 384
Precision: BFloat16
Optimization: DeepSpeed Zero-2
Epochs: 1

Stage 2: Regional + Hard Negative + Reuse of Stage1 Global Loss

2-1. 📍 Regional Contrastive Learning (L_regional)

Aligns bounding box features with corresponding text phrases using RoIAlign.
Enhances local grounding and part-level semantic understanding.

2-2. ⚠️ Hard Fine-Grained Negatives (L_hard)

Generates subtle distractors by changing attributes (e.g., “blue shirt” → “red shirt”).
Trains the model to distinguish subtle semantic differences.

2-3. Repeat of Global Contrastive (L_global)

Combined loss:

L = L_global + α · L_regional + β · L_hard

α = 0.1, β = 0.5

✅ Implementation Summary:

12M images, each with:
- long + short captions
- visual grounding annotations
- hard negatives
Optimizer: AdamW (LR=1e-6, weight decay=0.001, β1=0.9, β2=0.98)
Warmup: 50 steps
Batch size per GPU: 512
Optimization: DeepSpeed Zero-2, CUDA TF32, BFloat16
Epochs: 1

📜 FG-CLIP Data Summary

Two phases of data preparation:

📌 Phase 1: Recaptioning LAION-2B

Surprisingly, they used Huawei NPUs (910B) instead of NVIDIA GPUs!

Original LAION captions are too generic (“a bird”)
Used CogVLM2-19B to generate context-rich recaptions
- Before: "a bird" → After: "a red-winged blackbird perched on a tree branch"
Recaptioned 2 billion images in 30 days using 160×910B NPU cluster

📦 Phase 2: FineHARD Dataset (Visual Grounding + Hard Negatives)

Component	Amount
Long image-level captions	12M images
Bounding box region captions	40M boxes
Hard negatives	10M samples
Build time	7 days on NPU cluster

① Visual Grounding

Based on GRIT images + CogVLM2-generated captions
Extract referring expressions with SpaCy
Use YOLO-World for bounding boxes (Confidence ≥ 0.4)
NMS removes overlaps → 12M images, 40M regions with rich captions

② Hard Negative Generation

Keep object name, change attributes to create contrast
Use LLaMA-3.1-70B to generate 10 distractors per positive
Post-process to remove symbols like ;, ,, \n
Quality check: 98.9% valid, 1.1% noise
Example:
- Positive: "a man in a blue striped shirt"
- Negative: "a man in a red checkered shirt"

💯 Performance (Ablation Test)

Stage 1 alone already boosts performance thanks to long+short caption alignment.
As you add more components in Stage 2:

L_regional improves bbox classification
L_hard boosts fine-grained text-image discrimination

But — slight drop in short retrieval was observed, perhaps due to confusion from longer and more complex negatives!

🔮 Final Thoughts

Preparing such a massive dataset must have taken tremendous effort…
But each component clearly contributes to the model as the authors intended.

Especially impressive: the improvement in detail-level accuracy via negative samples 👏
~~I might try this myself someday…~~

And finally — I was amazed they used Huawei NPUs! NVIDIA isn’t the only game in town! 🧠

🧠 (한국어) FG-CLIP 알아보기!

🔍 더 세세한 프롬포트도 가능한, 발전된 CLIP!!

논문: FG-CLIP: Fine-Grained Visual and Textual Alignment
발표: ICML 2025 (Xie, Chunyu, et al.)
코드: 360CVGroup/FG-CLIP

🔍 FG-CLIP의 특징 요약

FG-CLIP (Fine-Grained CLIP)은 기존 CLIP의 세밀한 인식 한계를 극복하기 위해 설계된 모델로,
다음의 세 가지 핵심 기법을 통해 시각-언어 정렬 성능을 크게 향상시킵니다.

1. 장문 캡션 기반의 글로벌 의미 정렬 강화

최신 멀티모달 모델(LMM)을 활용해 1.6억 개의 장문 이미지-텍스트 쌍을 생성.
기존보다 훨씬 풍부한 문맥 정보가 담긴 데이터로 학습함으로써, 모델이 복잡하고 세부적인 시각 정보를 더 잘 이해할 수 있도록 함.
결과적으로 글로벌 의미 정렬(global semantic alignment) 능력이 향상됨.

2. 고품질 Visual Grounding 데이터셋 구축

1,200만 개 이미지에 포함된 4천만 개 바운딩 박스에 대해, 문맥이 풍부한 설명을 제공.
이 데이터셋을 통해 모델은 정확하고 지역(region)-기반의 표현을 학습할 수 있음.
이는 세밀한 객체 구분 및 위치 기반 정렬 작업에서 큰 도움이 됨.
최종적으로는 FineHARD 데이터셋으로 통합됨!!
- visual grounding 데이터와 hard negative 샘플을 통합하여 FineHARD라는 고품질 데이터셋 구성.
- 세밀한 정렬 능력 향상에 핵심적인 역할을 함.

3. Hard Negative 샘플을 통한 판별 능력 강화

1천만 개의 fine-grained hard negative sample을 포함한 대규모 말뭉치(Corpus) 도입.
의미적으로 유사하지만 속성이 다른 이미지-텍스트 쌍을 학습시켜, 더 정밀한 구분(discrimination) 능력을 부여.
모델이 세밀한 차이를 감지하고, 혼동 없이 구별할 수 있도록 유도.

🧠 FG-CLIP 등장의 배경

기존 CLIP은 멀티모달 작업에서는 뛰어나지만,
짧고 개략적인 캡션에 의존하여 세밀한 이해(fine-grained understanding)가 부족함.
기존 이미지-텍스트 데이터셋의 한계존재
그래서! “FineHARD”라는 데이터셋을 공개했죠!

구분	한계 내용	영향 및 문제점
🧵 세밀한 정보 부족	일반적인 장면 묘사 위주 (COCO, LAION 등), 세부 객체·속성·위치 정보가 부족	정교한 시각-언어 정렬 학습 어려움
📦 데이터 규모의 한계	FineCLIP (250만쌍), LongCLIP (100만쌍) → 여전히 LAION 등에 비해 적은 규모	대규모 사전학습 대비 표현력 및 일반화 능력 부족
🔧 자동 라벨링 노이즈	객체 탐지 기반 pseudo-label은 효율적이나 라벨 품질 편차로 인한 노이즈 가능성 존재	학습 정확도 저하, 미세 정렬 오류 발생 가능
⚠ Hard Negative 샘플 부족	대부분 구별 쉬운 양성 샘플 중심 구성, 어려운 음성 샘플 부족	유사한 쌍 간 미세한 차이 구분 학습 어려움, 세밀한 인식 성능 저하

기존에 존재하던 CLIP 후속의 연구들과 그 한계!

🧠 1. CLIPSelf (ICLR 2024)

CLIP을 가지고 DINO 처럼 자기지도 학습!!
조금 더 상세히는, 텍스트쪽은 스킵하고 이미지에 대한 표현력을 강화!
비슷한 이미지는 비슷한 임베딩을 같도록!!

목표: CLIP의 이미지 표현에 자기지도 학습(self-supervised)을 추가하여 성능 향상.
핵심 기법:
- 기존 CLIP의 구조 유지.
- 이미지 특징에서 자기 예측(pretext task)을 통해 정제된 표현 학습.
장점: 라벨 없이도 더 정교하고 일반화된 시각 표현 획득 가능.
한계: 텍스트 정보를 활용하지 않기 때문에 멀티모달 정렬이나 언어 기반 reasoning 성능 향상에는 한계

🎯 2. FineCLIP (NeurIPS 2024)

세부 객체-문장 대응을 강화시켜서
part-level, attribute-level 표현력을 향상시킴!!
기존에는 가방 까지만 이해했다면 이젠 노란가방! 등 세부적인것도 이해해!!

목표: CLIP의 coarse한 텍스트에 기반한 한계를 극복하고, fine-grained 시각-언어 표현 강화.
핵심 기법:
- 객체 검출기를 이용해 region-level 정보와 CLIP 임베딩 정렬.
- 다단계 정렬 학습 (객체, 문장, 이미지 수준).
장점: 세부 객체 인식 및 세밀한 텍스트 매핑 성능 강화.
한계
- 오픈도메인 장면이나 미학적/추상적 이미지에 대한 일반화가 어려움.
- 또한, top-down region 선택으로 인한 표현 편향의 가능성도 존재함.

🔥 3. LLaVA (Large Language and Vision Assistant) (NeurIPS 2023)

우리에겐 이젠 익숙해진 비전모델(Llama-Vision, qwen2.5VL)의 시조새 느낌!

목표: GPT 기반 LLM에 시각 정보 해석 능력을 부여한 멀티모달 어시스턴트 개발.
핵심 기법:
- CLIP Vision Encoder + LLM (예: Vicuna) 연결.
- 이미지 ↔ 자연어 대화가 가능한 멀티모달 프롬프트 처리.
장점:
- 대화형 비전 이해 시스템 구축 가능.
- ChatGPT 유사한 UX 제공 + 이미지 인식 기능 통합.
한계: 고품질 이미지-텍스트 alignment 데이터 의존도가 높고, 실제 시각 reasoning 능력은 제한적임.

🧾 4. LongCLIP (ECCV 2024)

짧은 텍스트만 가능했던 CLIP의 텍스트 인코더를 업그레이드 하고 학습시켜서!!
긴문장도 CLIP에서 받아드릴 수 있도록 함!

목표: CLIP의 짧은 텍스트 중심 한계를 극복하고, 장문 캡션 기반의 정교한 시각-언어 정렬 실현.
핵심 기법:
- 대규모 장문 이미지-캡션 쌍을 활용한 학습.
- CLIP의 텍스트 인코더를 장문 적합 구조로 확장.
장점:
- 스토리, 설명 중심 이미지 이해에서 우수한 성능 발휘.
- Zero-shot 인식 및 문맥 이해 능력 향상.
한계: 장문 noise + 인코더 부하, 핵심 표현 정렬 어려움

🚀 FG-CLIP 모델 구조!

FG-CLIP은 CLIP의 dual-encoder 구조를 기반으로 확장되며, 두 단계로 학습됩니다!
Stage1 에서는 CLIP 처럼 이미지-텍스트의 의미연결 작업을,
Stage2 에서는 1단계와 함께(Stage2-3) 이미지 내의 지역 선택(Stage2-1) + Negative Samples Learning(Stage2-2) 를 진행합니다!

Stage1 : 이미지-텍스트의 의미연결 작업 : Global Contrastive Learning

이미지와 텍스트의 전역적 의미 정렬을 수행
긴 캡션(long captions)과 짧은 캡션(short captions)을 모두 활용하여, 넓은 의미 스펙트럼을 학습
긴 문장은 복잡한 의미, 짧은 문장은 기본 개념 파악에 도움을 줌

✅ 세부 구현 사항 (Implementation Details)

데이터 규모: 16억(1.6B) 이미지-텍스트 쌍 사용 (각각 long + short caption 포함)
모델 초기화: 기존 CLIP 가중치로 시작
Vision Backbone: ViT-B / ViT-L 두 구성 실험
Optimizer: AdamW
- Learning rate: 1e-4
- Weight decay: 0.05
- β1: 0.9, β2: 0.98
학습 스케줄: 초기 200 step warmup
배치 크기: NPU 당 384
온도 파라미터 (τ): 0.07 (학습 가능한 변수)
정밀도: BFloat16 사용
최적화 기법: DeepSpeed의 Zero-2 적용
학습 횟수: 전체 1 epoch만 수행 (대규모 데이터로도 충분)

Stage2 : 이미지 내의 지역 선택(2-1) + Negative Samples Learning(2-2) + Stage1 반복

2-1. 📍 Regional Contrastive Learning (L_regional)

이미지 내부의 특정 영역을 대응하는 텍스트 조각과 정렬
이미지의 bounding box별 특징을 추출(RoIAlign 사용), 해당 구역의 텍스트 표현과 연결
세밀한 시각-언어적 의미 대응 능력 강화

2-2. ⚠️ Hard Fine-Grained Negative Samples Learning (L_hard)

의미적으로 유사하지만 실제로는 다른 하드 부정 샘플(hard negatives)을 생성해 학습
정답 설명에서 속성을 변경해 미묘한 차이점을 가진 샘플을 구성
모델이 유사하지만 다른 샘플을 구별할 수 있게 도와 fine-grained 인식력을 극대화

2-3 Stage1의 반복 (L_global)

두 번째 단계에서는 다음과 같은 통합 손실 함수를 통해 세 가지 학습 요소를 조합합니다:

L = L_global + α · L_regional + β · L_hard

α = 0.1 (지역 정렬 비중)
β = 0.5 (하드 부정 샘플 비중)

✅ 세부 구현 사항 (Implementation Details)

데이터 규모:
- 총 1,200만 장 이미지
- 포함 정보:
  - 짧은 캡션 (short caption)
  - 긴 캡션 (long caption)
  - 정밀한 시각적 정렬(visual grounding) 주석
  - Hard Fine grained Negative Sample
모델 초기화: 1단계(Global Contrastive Learning) 학습 결과를 초기 가중치로 사용
Optimizer: AdamW
- Learning rate: 1e-6
- Weight decay: 0.001
- β1: 0.9, β2: 0.98
Warmup 단계: 초기 50 step
배치 크기: GPU당 512
학습 최적화 기법:
- DeepSpeed Zero-2 최적화
- CUDA TF32 연산 가속 사용
- BFloat16 정밀도 적용
학습 횟수: 전체 1 epoch 수행

📜 FG-CLIP의 데이터

FG-CLIP은 정교한 이미지-텍스트 정렬을 위해 방대한 양의 고품질 데이터셋을 구성하고 활용합니다.
데이터는 크게 두 가지 단계로 나뉘며, 각각의 목적에 맞게 최적화되어 있습니다.

📌 Stage1: LAION-2B 리캡셔닝 (Global Contrastive Pretraining)

신기한점은 Nvidia GPU가 아니라 Huawei의 NPU를 사용했네!!?

기존 LAION-2B 데이터셋은 “a bird”처럼 일반적이고 단순한 설명이 많아 정밀 학습에 한계가 있음.
이를 보완하기 위해 CogVLM2-19B 대형 멀티모달 모델을 사용, 모든 이미지에 대해 정교하고 문맥이 풍부한 캡션(recaptions)을 새로 생성
예시:
- 기존: "a bird"
- 개선: "a red-winged blackbird perched on a tree branch in a park"
전체 20억 이미지에 대해 리캡셔닝을 수행하며, 160×910B NPU 클러스터로 30일간 처리됨.
소거 실험 결과, 이러한 정제된 설명은 다양한 다운스트림 작업에서 성능을 크게 향상시킴.

📦 Stage2: FineHARD 데이터셋 구축 (Regional + Hard Negative 학습)

FG-CLIP의 핵심 학습 데이터셋인 FineHARD는 세 가지 구성요소로 구축!!

구성 요소	수량
정제된 이미지 캡션 (전체 이미지 설명)	12,000,000 (1,200만 장)
Region-level 설명 (bounding boxes)	40,000,000 (4천만 개)
Hard negative 샘플	10,000,000 (천만 개)
전체 구축 소요 시간	7일 (910B NPU 클러스터)

① 정밀 Region-Text 정렬 (Visual Grounding)

GRIT 이미지를 기반으로, CogVLM2-19B로 전체 이미지 캡션 생성.
SpaCy를 활용하여 캡션에서 지시 표현(referring expressions) 추출.
이를 YOLO-World 객체 탐지 모델에 입력하여 해당하는 bounding box 추출.
Confidence ≥ 0.4, NMS 적용 하여
- 1,200만 개 이미지
- 4천만 개 바운딩 박스
- 각 영역에 대한 정밀 설명(region captions) 확보

② 하드 네거티브 샘플 생성 (Hard Negative Mining)

정답 설명에서 속성만 변경하고 객체명은 그대로 유지하는 방식으로 부정 샘플 생성.
LLaMA-3.1-70B 모델을 사용해 각 양성 샘플당 10개의 하드 네거티브 생성.
특수기호(세미콜론, 줄바꿈 등) 제거해 문장 정제.
품질 점검 결과:
- 98.9% 샘플이 유효
- 1.1%만 잡음으로 확인 → 비지도 방식 기준 우수한 품질
예시:
- Positive : "a man in a blue striped shirt"
- Negative : "a man in a red checkered shirt"

💯 Stage 별로의 성능은!?(Ablation Test)

위는 논문에서 제시된 ablation Test 결과입니다!
우선 기본적으로 Stage1, 긴 문장과 짧은 문장을 모두 사용한 텍스트-이미지 연결작업으로 이미 전체적인 성능개선이 이루어졌는데요!,

이후 stage 에서 Global > Regional > hard를 추가할수록 더욱 성능이 향상되며,
특히 연구자의 의도에 맞게, L_regional 이 추가되면서 bbox의 정확도가 증가하고, L_hard 가 추가되면서 텍스트에 대한 이해가 증가하기에 Fine-Grained Understanding 이 획기적으로 증가한다는 점이 인상 깊었습니다!

한편, short retrieval은 오히려 감소했는데,,
이것은 긴 문장, negative를 학습하며 햇갈린것인가? 로 의심되었다!!

🔮 느낀점

이런 많은 데이터셋을 준비하는데 많은 공수가 투입됬을것 같고..
연구자의 의도에 맞게 각각의 모듈이 잘 동작하는게 멋져부렀다!

그리고! negative의 학습으로 Detail을 맞츠는 성능이 좋아졌다는것이 기억에 남는다!
~~나도 써먹어봐야지~~

마지막으로!! Nvidia가 전부인줄 알았는데, NPU를 사용했다니!! 인상 깊다!!

AI, Research

This post is licensed under CC BY 4.0 by the author.