
🧠 OSrCIR: Reason-before-Retrieve for Composed Image Retrieval



📌 3-Sentence Summary

  1. Previous Composed Image Retrieval (CIR) studies adopted a two-stage structure (Image Captioning → Text Reasoning).
  2. OSrCIR allows the MLLM to directly reason over the reference image, inferring the target image's features without relying on intermediate text.
  3. As a result, it improves both accuracy and speed, operating with zero-shot inference only, without any training.

🔍 Limitations of Previous CIR Structures

| Approach | Structure | Limitation |
|---|---|---|
| Two-Stage CIR | (1) Image → Caption, (2) Text → Reasoning → Retrieval | Loss of image information, reasoning errors |
| Text-Only Reasoning | Reference image passed only via text | Hard to reflect visual attributes |
| MLLM-based QA | Reasoning through Q&A | Time-consuming, inconsistent |

→ In short, using text as an intermediate step inherently causes information loss.


🌱 Core Idea of OSrCIR

"Reason first. Then retrieve."

  • Traditional CIR: "Retrieve-and-Reason" (retrieve first, reason later; e.g., TIRG, ComposeAE)
  • OSrCIR: "Reason-before-Retrieve" (reason first, then retrieve)
  • Uses an MLLM to directly infer target features from the reference image
  • Retrieval is then performed using the generated reasoning result (a text description)

🔧 OSrCIR Architecture

It's one-stage, so everything happens at once! The pipeline is simple, though the internal prompting is complex.

[Figure: OSrCIR pipeline overview]

  • Input: (Reference Image, Text Query)
  • Step 1: The MLLM performs chain-of-thought style reasoning directly on the reference image
  • Step 2: The reasoning is refined into a target description (text query)
  • Step 3: Candidate images are retrieved using CLIP-based text-image matching (zero-shot)

→ The entire process runs end-to-end in a single stage, relying on a richer prompt rather than multiple models.
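
To make the flow concrete, here is a minimal sketch of the reason-then-retrieve idea. It is not the authors' code: `query_mllm` is a hypothetical placeholder for whichever MLLM (e.g., GPT-4V or LLaVA) produces the target description, and the zero-shot retrieval step uses CLIP ViT-L/14 through the Hugging Face transformers API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot retrieval backbone (frozen): CLIP ViT-L/14, the backbone of the
# paper's main comparison setting.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def query_mllm(reference_image: Image.Image, modification_text: str) -> str:
    """Hypothetical placeholder for the MLLM call (GPT-4V, LLaVA, ...).

    It is prompted to reason over the reference image and the modification
    text in one shot and to return only the final target description.
    """
    raise NotImplementedError("plug in your MLLM client here")


def reason_then_retrieve(reference_image, modification_text, candidate_images, top_k=5):
    # 1) Reason first: infer a textual description of the *target* image
    #    directly from the reference image + modification text.
    target_description = query_mllm(reference_image, modification_text)

    # 2) Then retrieve: rank candidates by CLIP text-image similarity.
    text_inputs = processor(text=[target_description], return_tensors="pt",
                            padding=True, truncation=True)
    image_inputs = processor(images=candidate_images, return_tensors="pt")
    with torch.no_grad():
        text_emb = clip.get_text_features(**text_inputs)
        image_emb = clip.get_image_features(**image_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)

    return scores.topk(min(top_k, len(candidate_images))).indices.tolist()
```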

[Figure: one-stage prompt with its inputs and outputs]

  • Both inputs (image + text query) go into one stage
  • Outputs include: image caption, thoughts, reflection, and the final description
  • The final description is then used for retrieval (same retrieval stage as CIReVL)

[Figure: example of thoughts and reflection]

  • The thoughts and reflection steps help reduce hallucination
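
As an illustration of that structured output, here is a paraphrased sketch of what such a one-stage reflective prompt could look like. It is not the paper's actual prompt, and the JSON field names are my own; the point is that caption, thoughts, reflection, and final description come back from a single MLLM call, and only the final description is forwarded to retrieval.

```python
import json

# Paraphrased sketch of a one-stage reflective chain-of-thought prompt
# (not the paper's exact wording; field names are illustrative).
REFLECTIVE_PROMPT = """You are given a reference image and a modification request:
"{modification_text}"

Return a single JSON object with these fields:
  "caption":           a faithful description of the reference image
  "thoughts":          step-by-step reasoning about which visual attributes the
                       modification keeps, changes, or adds
  "reflection":        a check of the thoughts against the image, dropping any
                       detail that is not actually visible (to curb hallucination)
  "final_description": one sentence describing the *target* image, to be used as
                       the retrieval query
"""


def parse_target_description(mllm_response: str) -> str:
    """Keep only the final description; the other fields exist to guide reasoning."""
    return json.loads(mllm_response)["final_description"]
```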

🧪 Experimental Results

  • Datasets & Metrics
    • Benchmarks: CIRR, CIRCO, FashionIQ, GeneCIS
    • Metrics (a minimal sketch of both metric families follows the baselines list below):
      • CIRR, GeneCIS, FashionIQ → Recall@k (R@k)
      • CIRCO → mAP@k (multi-ground-truth)
      • CIRR Subset → RecallSubset@k

  • Baselines
    • Textual Inversion (training-dependent): Pic2Word, SEARLE, Context-I2W, LinCIR
    • Training-Free: CIReVL, CIReVL*
      • CIReVL* (star): the same two-stage structure as CIReVL, but with both reference-image captioning and text composition handled by a single MLLM (e.g., GPT-4V, LLaVA).
    • Proposed Model: OSrCIR
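
For reference, here is a minimal sketch of the two metric families (my own implementation, not the official evaluation code of these benchmarks): Recall@k checks whether the single ground-truth target appears in the top-k list, while CIRCO-style mAP@k averages precision over possibly multiple ground truths.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Recall@k for single-ground-truth benchmarks (CIRR, FashionIQ, GeneCIS):
    1 if the target appears among the top-k retrieved items, else 0."""
    return float(target_id in ranked_ids[:k])


def map_at_k(ranked_ids, target_ids, k):
    """AP@k for one query of a multi-ground-truth benchmark (CIRCO):
    precision is accumulated at every rank that hits a ground truth and
    normalized by min(#ground truths, k); mAP@k is the mean over queries."""
    if not target_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in target_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(target_ids), k)
```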

📊 OSrCIR vs Baselines (Performance Summary)

| Dataset | Metric | OSrCIR (ViT-L/14) | CIReVL* | Context-I2W | LinCIR | Gain (vs CIReVL*) |
|---|---|---|---|---|---|---|
| CIRCO | mAP@5 | 23.87% | 18.92% | 13.04% | - | +4.95% |
| CIRR | Avg (R@k) | ↑ | - | - | - | +3.23% |
| GeneCIS | Avg R@1 | ↑ | - | - | - | +1.8% (vs CIReVL*), +5.2% (vs Context-I2W) |
| FashionIQ | R@10 (L/14) | ↑ | - | - | - | +4.74% (vs CIReVL*), +5.47% (vs Context-I2W) |
| FashionIQ | R@10 (G/14) | ↑ | - | - | Best | +4.6% (vs CIReVL*), but < LinCIR |

  • Qualitative Results
    • OSrCIR better captures fine-grained details and context
      • Examples: "poster", "Chihuahua", "Labrador", "beach"
    • In fashion, more precise with "one-shoulder dress", complex t-shirt patterns, etc.
  • Limitation: In specialized domains like fashion, CLIP misalignment still remains a bottleneck.

✅ Conclusion & Significance

  • OSrCIR demonstrates how MLLM reasoning can be optimized for CIR.
  • Works purely with training-free zero-shot inference → generalizable and lightweight.
  • One of the first cases of applying Chain-of-Thought reasoning in a single-stage retrieval pipeline.
  • Opens new possibilities for applying VLM/MLLM reasoning in tutoring, search, and AGI planning tasks.

This post is licensed under CC BY 4.0 by the author.