🧠 Understanding CLIP4HOI
🔍 Combining CLIP and DETR for Zero-Shot HOI Detection!!!
Paper: CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Conference: NeurIPS 2023 (Mao, Yunyao, et al.)
Code: maoyunyao/CLIP4HOI (no code available, unfortunately)
🔎 Key Summary
- 💡 CLIP’s Vision-Language capabilities applied to HOI (Human-Object Interaction) detection!
- 🏗️ DETR + CLIP Integration: Adding 3 HOI-specific components to DETR for enhanced HOI detection performance
- ✅ CLIP’s Rich Representation: Leveraging image-text matching capabilities for HOI classification!
🚨 Limitations of Existing Methods
Existing Zero-shot HOI detection methods have evolved around two main paradigms, but each has its own inherent problems:
1️⃣ Limitations of One-Stage Frameworks 🎯
Representative Studies: GEN-VLKT, HOICLIP, QPIC, HOTR, etc.
- Approach: Simultaneously predicting ⟨human, verb, object⟩ triplets with a single query
- Problems:
- Joint positional distribution overfitting: Decoder overfits to human-object positional distributions for known verb-object combinations
- Position dependency: Reliance on seen-category patterns such as "when a person holds a cup, the cup is usually positioned to the person's right"
- Distribution mismatch vulnerability: Performance drastically drops when there are significant positional distribution differences between seen and unseen categories
Concrete Example: If “person-cup” interactions occurred mainly in specific positions during training, the model fails to recognize the same interaction in completely different positional relationships
2️⃣ Limitations of Knowledge Distillation 🧠
Representative Studies: GEN-VLKT, EoID
- Approach: Transferring CLIP’s knowledge to existing HOI detectors through knowledge distillation
- Problems:
- Data-sensitive: Heavy dependence on the completeness of training data
- Seen categories bias: Distillation process is dominated by seen category samples
- Poor unseen categories generalization: Lack of unseen categories during training damages generalization ability for those categories
- Distribution imbalance: Action probability distribution distillation is sensitive to imbalanced training data
3️⃣ Limitations of Compositional Learning Approaches 📚
Representative Studies: VCL, FCL, ATL, ConsNet, SCL
- Approach: Performing compositional learning by decomposing HOI into verbs and objects
- Problems:
- Dependency on predefined categories: Unseen categories must be predefined during learning
- Limited generalization: Restricted to predictable combinations, not truly “practical” zero-shot
- Complex interaction modeling limitations: Simple decomposition has limitations in representing complex HOI
- Lack of scalability: Limited flexibility for new verb-object combinations
4️⃣ Limitations of Existing CLIP Utilization Methods 🔧
Representative Studies: GEN-VLKT, EoID, HOICLIP
- Problems:
- Indirect utilization: Using CLIP only as an auxiliary tool, underutilizing its core capabilities
- Complex pipeline: Requiring complex combinations of knowledge distillation and fine-tuning
- Training-free limitations: Training-free approaches like HOICLIP have performance constraints
- Lack of adaptation: Failure to directly apply CLIP’s powerful vision-language capabilities to HOI tasks
💡 CLIP4HOI’s Solution: Solve the positional distribution problem with a two-stage paradigm + eliminate distillation dependency through direct CLIP adaptation!
🏗️ CLIP4HOI Pipeline
Input Image
↓
1. Object Detector (DETR): Detect humans + objects
↓
2. HO Interactor: Generate proposals for each pair
- Example of pair proposal generation:
- Input:
- Human: bbox[100, 50, 200, 300] + features[human visual information]
- Cup: bbox[180, 120, 220, 160] + features[cup visual information]
- HO Interactor Output (Q: HOI tokens):
- Spatial relationship: "Cup is near the right side of human"
- Distance: "30 pixel distance"
- Interaction features: Vector encoding "graspable distance, near-hand position" etc.
- Combined representation: HOI proposal combining all this information
↓
3. HOI Decoder (Based on CLIP image encoder)
- Transformer structure, generating output with QKV below
- Query: Pairwise HOI proposals from 2. HO Interactor
- Key/Value: Global image features from CLIP image encoder
- Output: Visual features specialized for each H-O pair
↓
4. HOI Classifier (Compare with CLIP text embeddings)
- Compare results from 3 with CLIP text embeddings to select highest score!
↓
Final HOI predictions
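To make the two-stage flow above concrete, here is a minimal PyTorch-style sketch of the inference loop. This is not the authors’ implementation: the callables (`detector`, `ho_interactor`, `hoi_decoder`, `clip_image_encoder`) and the tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def clip4hoi_inference_sketch(image, detector, ho_interactor, hoi_decoder,
                              clip_image_encoder, text_embeds, tau=0.07):
    """Hypothetical two-stage inference flow (shapes and callables are assumed).

    text_embeds: [Nc, Co] L2-normalized CLIP text embeddings, one per HOI class.
    """
    # Stage 1: an off-the-shelf detector (e.g. DETR) finds humans and objects.
    humans, objects = detector(image)          # each: list of (bbox, feature) tuples

    # Global CLIP image features act as Key/Value for the HOI decoder.
    patch_feats = clip_image_encoder(image)    # [Npatch, C]

    predictions = []
    for h_box, h_feat in humans:
        for o_box, o_feat in objects:
            # Stage 2a: HO Interactor fuses appearance + spatial cues into one HOI token.
            hoi_token = ho_interactor(h_feat, o_feat, h_box, o_box)    # [C]
            # Stage 2b: HOI Decoder cross-attends the token over CLIP patch features.
            pair_feat = hoi_decoder(hoi_token, patch_feats)            # [Co]
            # Stage 2c: HOI Classifier scores the pair against CLIP text embeddings.
            pair_feat = F.normalize(pair_feat, dim=-1)
            s_pair = torch.sigmoid(pair_feat @ text_embeds.T / tau)    # [Nc], multi-label
            predictions.append(((h_box, o_box), s_pair))
    return predictions
```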
📦 3 Key Components (Learnable Parts)
1️⃣ HO Interactor 🤝
- Generate ⟨human, object⟩ pairs from DETR’s object detection results
- Feature interaction + Spatial information injection
Generate HOI tokens (Q) containing human-object relationship information
- Input:
  - Human: bbox[100, 50, 200, 300] + features[human visual information]
  - Cup: bbox[180, 120, 220, 160] + features[cup visual information]
- HO Interactor Output:
  - Spatial relationship: "Cup is near the right side of the human"
  - Distance: "30-pixel distance"
  - Interaction features: vector encoding "graspable distance, near-hand position", etc.
  - Combined representation: an HOI proposal combining all of this information
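As a rough illustration of “feature interaction + spatial information injection”, here is a minimal sketch in PyTorch. The layer sizes and the concatenation-based fusion are my own assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class HOInteractorSketch(nn.Module):
    """Illustrative HO Interactor: fuse human/object features with a spatial cue."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Encode the two normalized boxes (x1, y1, x2, y2 each -> 8 values) into a spatial cue.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(8, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
        # Fuse human feature, object feature, and spatial cue into a single HOI token (Q).
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim * 2 + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, h_feat, o_feat, h_box, o_box):
        spatial = self.spatial_mlp(torch.cat([h_box, o_box], dim=-1))
        return self.fuse(torch.cat([h_feat, o_feat, spatial], dim=-1))

# Example: one human-cup pair with 256-d detector features and normalized boxes.
interactor = HOInteractorSketch()
hoi_token = interactor(torch.randn(256), torch.randn(256),
                       torch.tensor([0.16, 0.10, 0.31, 0.63]),
                       torch.tensor([0.28, 0.25, 0.34, 0.33]))
```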
2️⃣ HOI Decoder 🧠
- Utilize CLIP image representation to generate HOI visual features
- Apply attention mechanism using HOI tokens as queries
Enrich visual HOI representations
- Query: pairwise HOI proposals from the HO Interactor
- Key/Value: global image features from the CLIP image encoder
- Output: visual features specialized for each H-O pair
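A minimal sketch of this cross-attention idea, assuming a standard transformer decoder layer; the number of heads and the residual structure are illustrative guesses rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class HOIDecoderSketch(nn.Module):
    """Illustrative HOI Decoder: HOI tokens (Q) attend over CLIP patch features (K/V)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, hoi_tokens, clip_patch_feats):
        # hoi_tokens: [B, Npair, dim] queries; clip_patch_feats: [B, Npatch, dim] keys/values.
        attended, _ = self.cross_attn(hoi_tokens, clip_patch_feats, clip_patch_feats)
        x = self.norm1(hoi_tokens + attended)
        return self.norm2(x + self.ffn(x))   # pair-specific visual features

# Example: 5 H-O pairs attending over 196 CLIP patch tokens.
decoder = HOIDecoderSketch()
pair_feats = decoder(torch.randn(1, 5, 256), torch.randn(1, 196, 256))
```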
3️⃣ HOI Classifier 🎯
- Utilize CLIP’s shared embedding space for direct adaptation
- Learnable prompt: task-specific text representation in the [PREFIX] [verb] [CONJUN] [object] format
- Global + Pairwise HOI scores are computed for multi-label classification
- 🔍 Detailed Operation Process
- CLIP Direct Adaptation: Use CLIP directly as a fine-grained classifier instead of relying on knowledge distillation
1. Text Embedding Generation
```python
# Using learnable prompt tokens
text = "[PREFIX] verb [CONJUN] object"   # learnable tokens
T = clip_text_encoder(text)              # [Nc x Co] class text embeddings
```
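As a rough sketch of how “[PREFIX] verb [CONJUN] object” could be assembled from learnable tokens before being passed through the frozen CLIP text encoder: the CoOp-style context vectors and the number of prefix/conjunction tokens below are my assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class LearnablePromptSketch(nn.Module):
    """Illustrative prompt builder: [PREFIX] verb [CONJUN] object as embeddings."""
    def __init__(self, embed_dim=512, n_prefix=4, n_conjun=2):
        super().__init__()
        # Learnable [PREFIX] / [CONJUN] token embeddings shared across all HOI classes.
        self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)
        self.conjun = nn.Parameter(torch.randn(n_conjun, embed_dim) * 0.02)

    def forward(self, verb_embeds, object_embeds):
        # verb_embeds: [Nc, Lv, D], object_embeds: [Nc, Lo, D] frozen word embeddings.
        n_cls = verb_embeds.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(n_cls, -1, -1)
        conjun = self.conjun.unsqueeze(0).expand(n_cls, -1, -1)
        # The concatenated sequence is then fed to the (frozen) CLIP text encoder
        # to produce T, the [Nc x Co] class embeddings used above.
        return torch.cat([prefix, verb_embeds, conjun, object_embeds], dim=1)
```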
2. Visual Features Processing
```python
# Linear projection + L2 normalization
I_prime = Linear(I)        # Global image features
P_prime = Linear(P)        # Pairwise H-O features (output of the HOI Decoder)
T_hat = L2_Norm(T)         # Text embeddings
I_hat = L2_Norm(I_prime)   # Global visual
P_hat = L2_Norm(P_prime)   # Pairwise visual
```
3. HOI Scores Calculation
```python
# Temperature scaling + Sigmoid (multi-label)
S_glob = Sigmoid(I_hat @ T_hat.T / τ)   # Global HOI scores
S_pair = Sigmoid(P_hat @ T_hat.T / τ)   # Pairwise HOI scores

# Integrate with object detector priors for the final prediction
final_result = integrate(S_glob, S_pair, object_priors)
```
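`integrate` above is left abstract; one plausible fusion, shown purely as an assumption, is to weight the pairwise scores by the global scores and by the detector’s human/object confidences:

```python
import torch

def integrate_sketch(s_glob, s_pair, human_conf, object_conf, alpha=0.5):
    """Hypothetical score fusion; the blending scheme and alpha are illustrative."""
    # s_glob: [Nc] image-level HOI scores, s_pair: [Npair, Nc] per-pair HOI scores,
    # human_conf / object_conf: [Npair] detection confidences from DETR.
    blended = s_pair * s_glob.unsqueeze(0).pow(alpha)       # mix in global context
    prior = (human_conf * object_conf).unsqueeze(-1)        # object-detector prior per pair
    return blended * prior                                  # final per-pair HOI scores
```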
Ablation Study Results Analysis!! 🔬
- (1) HOI Decoder: the absolutely critical component for zero-shot performance
  - Largest performance drop in the ablation: a 3.77 mAP decrease
  - Especially sharp in unseen categories: a 4.67 mAP decrease
  - → Aggregating CLIP visual features is key to zero-shot performance!
- (2) Global vs. Pairwise scores: seen and unseen categories need different strategies
  - Seen: 0.58 mAP improvement (actually helpful!)
  - Unseen: 2.99 mAP decrease (a significant loss!)
  - → Global context is essential for zero-shot generalization
- (3) Multi-modal integration: demonstrates the complementary roles of CLIP + DETR
📊 Experimental Results & Performance
CLIP4HOI demonstrates significantly superior performance compared to existing methods on major benchmarks:
🏆 Key Benchmark Results
- Best performance on Unseen categories!!! Zero-shot!!
- Also excellent performance in Fully supervised settings!!
⚠️ Limitations
These limitations were not mentioned in the paper but were considered through discussions with GPT!!
1️⃣ Learnable Prompt’s Seen Category Bias 🎯
- Problem: the [PREFIX] and [CONJUN] tokens are trained only on seen HOI categories
- Result: potential degradation of generalization to unseen categories due to overfitting
- Dilemma: Trade-off between zero-shot performance vs learnable prompt optimization
2️⃣ Limitations of Closed-set Zero-shot 📋
- Actual Meaning: Zero-shot only within predefined candidate HOI classes
- Limitation: Still unable to recognize completely new verb-object combinations
- Constraint: Cannot detect interactions not defined in HOI taxonomy
3️⃣ Computational Cost of Two-stage Paradigm ⚡
- Structure: Object Detection → HO Interactor → HOI Decoder → HOI Classifier
- Problem: High computational cost due to 4-stage sequential processing
- Comparison: Increased inference time compared to one-stage methods
4️⃣ Complexity of Multi-modal Alignment 🔄
- Challenge: Difficulty in achieving optimal alignment between visual features and text embeddings
- Temperature scaling: Performance sensitivity to τ value tuning
- Feature space: Complexity in aligning embedding spaces across different modalities
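A tiny numerical check of the τ-sensitivity point (the cosine similarities 0.30 and 0.25 are made-up values for a matching vs. distractor HOI class): with a small τ the sigmoid saturates and both scores land near 1, while with a large τ both hover near 0.5, so score thresholds and calibration depend strongly on the chosen temperature.

```python
import math

def hoi_score(cos_sim, tau):
    # Sigmoid-scored similarity, mirroring S = Sigmoid(similarity / τ) above.
    return 1.0 / (1.0 + math.exp(-cos_sim / tau))

for tau in (0.07, 0.5, 1.0):
    print(f"tau={tau}: match={hoi_score(0.30, tau):.3f}, "
          f"distractor={hoi_score(0.25, tau):.3f}")
```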
5️⃣ Limitations in Fine-grained Interaction Distinction 🔍
- Examples: "holding phone" vs. "talking on phone", "sitting on chair" vs. "standing near chair"
- Cause: Difficulty in distinguishing subtle differences due to limitations in spatial relationship encoding
- Impact: Performance degradation for HOIs that are visually similar but semantically different
✅ Summary
CLIP4HOI is an HOI detection method that combines DETR’s object detection capabilities with CLIP’s Vision-Language understanding.
🎯 Core Ideas
- DETR + CLIP Integration: Combination of proven object detection + powerful semantic understanding
- 3 New Components: Addition of HO Interactor, HOI Decoder, HOI Classifier
- Text Template Utilization: Flexible classification by expressing HOI in natural language
- Zero-shot Generalization: Inference of untrained HOI classes through text descriptions
⚡ Technical Innovations
- Spatial + Semantic Information Integration: Simultaneous utilization of spatial relationships and semantic understanding
- Global + Pairwise Scores: Consideration of both holistic and individual HOI scores
- CLIP Representation Maximization: Application of pre-trained vision-language model capabilities to HOI
📌 Successfully integrating CLIP’s text-image matching capabilities into HOI detection!
🚀 Practical Value: Effective learning and inference of complex HOI relationships through natural language descriptions!