OSrCIR: Reason-before-Retrieve for Composed Image Retrieval
- Title: Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
- Conference: CVPR 2025 (Highlight paper, Yuanmin Tang et al.)
- Code: OSrCIR (GitHub)
- Keywords: Composed Image Retrieval, Chain-of-Thought, One-Stage, MLLM, Zero-Shot
3-Sentence Summary
- Previous Composed Image Retrieval (CIR) studies adopted a two-stage structure (Image Captioning → Text Reasoning).
- OSrCIR lets the MLLM reason directly over the reference image, inferring the target image's features without relying on intermediate text.
- As a result, it improves both accuracy and speed, operating with zero-shot inference only, without any training.
Limitations of Previous CIR Structures
Approach | Structure | Limitation |
---|---|---|
Two-Stage CIR | (1) Image → Caption (2) Text → Reasoning → Retrieval | Loss of image information, reasoning errors |
Text-Only Reasoning | Reference image passed only via text | Hard to reflect visual attributes |
MLLM-based QA | Reasoning through Q&A | Time-consuming, inconsistent |
In short, using text as an intermediate step inherently causes information loss.
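The information loss above can be seen in a minimal sketch of the two-stage baseline. Both functions are hypothetical stubs standing in for a captioning model and a text-composition LLM; the point is that everything after Stage 1 sees only the caption string.

```python
# Sketch of the two-stage pipeline criticized above (hypothetical stubs).
def caption_image(reference_image):
    # Stage 1: image -> caption. Any visual detail the caption omits
    # (breed, color, background) is irrecoverably lost at this point.
    return "a dog on a sofa"

def compose_query(caption, text_query):
    # Stage 2: text-only reasoning over the (lossy) caption.
    return f"{caption}, {text_query}"

query = compose_query(caption_image(None), "make it a labrador")
# Retrieval now runs on `query`, never seeing the original image again.
```

Because Stage 2 only receives `"a dog on a sofa"`, it cannot recover whether the original dog was, say, a chihuahua, which is exactly the failure mode OSrCIR avoids.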
Core Idea of OSrCIR
"Reason first. Then retrieve."
- Traditional CIR follows "Retrieve-and-Reason" (e.g., TIRG, ComposeAE)
- OSrCIR reverses this: "Reason-before-Retrieve"
- Uses an MLLM to directly infer target features from the reference image
- Retrieval is then performed using the generated reasoning result (text description)
OSrCIR Architecture
It's one-stage, so everything happens at once! The pipeline is simple, though the internal prompting is complex.
- Input: (Reference Image, Text Query)
- Stage 1: MLLM performs chain-of-thought style reasoning on the reference image
- Stage 2: The reasoning is refined into a target description (text query)
- Stage 3: Candidate images are retrieved using CLIP-based text-image matching (zero-shot)
The entire process is end-to-end in a single stage, using more complex prompts rather than multiple models.
- Both inputs (image + text query) go into one stage
- Outputs include: image caption, thoughts, reflection, and the final description
- The final description is then used for retrieval (same retrieval stage as CIReVL)
- The thoughts and reflection steps help reduce hallucination
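The steps above can be sketched end to end. `mllm_reflective_cot` is a hypothetical stub standing in for the real MLLM call (e.g., GPT-4V), and `embed_text` stands in for a CLIP text encoder; only the retrieval arithmetic (cosine similarity over normalized embeddings) is meant literally.

```python
# Minimal sketch of the one-stage OSrCIR pipeline (hypothetical stubs).
import numpy as np

def mllm_reflective_cot(reference_image, text_query):
    """One MLLM call returning caption, thoughts, reflection, and the
    final target description in a single pass (canned stub here)."""
    return {
        "caption": "a chihuahua sitting on a sofa",
        "thoughts": "the query asks to change the dog breed to a labrador",
        "reflection": "keep the sofa and setting; change only the breed",
        "final_description": "a labrador sitting on a sofa",
    }

def retrieve(final_description, candidate_embeddings, embed_text):
    """CLIP-style retrieval: rank candidates by cosine similarity
    to the text embedding of the final description."""
    q = embed_text(final_description)
    q = q / np.linalg.norm(q)
    c = candidate_embeddings / np.linalg.norm(
        candidate_embeddings, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)  # best match first
```

A toy run: with a bag-of-words `embed_text` over `["labrador", "chihuahua", "sofa", "beach"]` and three candidate images (chihuahua/sofa, labrador/sofa, labrador/beach), the final description "a labrador sitting on a sofa" ranks the labrador-on-sofa candidate first.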
Experimental Results
- Datasets & Metrics
- Benchmarks: CIRR, CIRCO, FashionIQ, GeneCIS
- Metrics:
- CIRR, GeneCIS, FashionIQ → Recall@k (R@k)
- CIRCO → mAP@k (multi-ground-truth)
- CIRR Subset → RecallSubset@k
- Baselines
- Textual Inversion (training-dependent): Pic2Word, SEARLE, Context-I2W, LinCIR
- Training-Free: CIReVL, CIReVL*
- CIReVL* (star): same two-stage structure, but with both image captioning and text composition handled by a single MLLM (e.g., GPT-4V, LLaVA).
- Proposed Model: OSrCIR
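The metrics listed above are standard; a minimal reference implementation (Recall@k for the single-ground-truth benchmarks, mAP@k for CIRCO's multi-ground-truth setting) looks like this:

```python
# Minimal reference implementations of the evaluation metrics above.
def recall_at_k(ranked_ids, target_id, k):
    """Recall@k for single-ground-truth benchmarks (CIRR, FashionIQ,
    GeneCIS): 1.0 if the target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def map_at_k(ranked_ids, relevant_ids, k):
    """mAP@k for multi-ground-truth benchmarks (CIRCO): average of the
    precision values at each rank where a relevant item appears in the
    top-k, normalized by min(#relevant, k)."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0
```

Per-query values are averaged over the benchmark's query set to produce the numbers in the table below.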
OSrCIR vs Baselines (Performance Summary)
Dataset | Metric | OSrCIR (ViT-L/14) | CIReVL* | Context-I2W | LinCIR | Gain (vs CIReVL*) |
---|---|---|---|---|---|---|
CIRCO | mAP@5 | 23.87% | 18.92% | 13.04% | - | +4.95% |
CIRR | Avg (R@k) | ✓ | - | - | - | +3.23% |
GeneCIS | Avg R@1 | ✓ | - | - | - | +1.8% (vs CIReVL*), +5.2% (vs Context-I2W) |
FashionIQ | R@10 (L/14) | ✓ | - | - | - | +4.74% (vs CIReVL*), +5.47% (vs Context-I2W) |
FashionIQ | R@10 (G/14) | ✓ | - | - | Best | +4.6% (vs CIReVL*), but < LinCIR |
- Qualitative Results
- OSrCIR better captures fine-grained details and context
- Examples: โposterโ, โChihuahuaโ, โLabradorโ, โbeachโ
- In fashion, more precise with โone-shoulder dressโ, complex t-shirt patterns, etc.
- Limitation: In specialized domains like fashion, CLIP misalignment still remains a bottleneck.
Conclusion & Significance
- OSrCIR demonstrates how MLLM reasoning can be optimized for CIR.
- Works purely with training-free zero-shot inference โ generalizable and lightweight.
- One of the first cases of applying Chain-of-Thought reasoning in a single-stage retrieval pipeline.
- Opens new possibilities for applying VLM/MLLM reasoning in tutoring, search, and AGI planning tasks.
This post is licensed under CC BY 4.0 by the author.