
🧠 OSrCIR: Reason-before-Retrieve for Composed Image Retrieval



📌 3-Sentence Summary

  1. Previous Composed Image Retrieval (CIR) studies adopted a two-stage structure (Image Captioning → Text Reasoning).
  2. OSrCIR allows the MLLM to directly reason over the reference image, inferring the target image's features without relying on intermediate text.
  3. As a result, it improves both accuracy and speed, operating with zero-shot inference only, without any training.

🔍 Limitations of Previous CIR Structures

| Approach | Structure | Limitation |
|---|---|---|
| Two-Stage CIR | (1) Image → Caption, (2) Text → Reasoning → Retrieval | Loss of image information, reasoning errors |
| Text-Only Reasoning | Reference image passed only via text | Hard to reflect visual attributes |
| MLLM-based QA | Reasoning through Q&A | Time-consuming, inconsistent |

→ In short, using text as an intermediate step inherently causes information loss.


🌱 Core Idea of OSrCIR

"Reason first. Then retrieve."

  • Traditional CIR: "Retrieve-and-Reason" (retrieve first, reason later; e.g., TIRG, ComposeAE)
  • OSrCIR: "Reason-before-Retrieve" (reason first, then retrieve)
  • Uses an MLLM to directly infer target features from the reference image
  • Retrieval is then performed using the generated reasoning result (a text description)

🔧 OSrCIR Architecture

It's one-stage, so everything happens at once! The pipeline is simple, though the internal prompting is complex.

[Figure: OSrCIR pipeline overview]

  • Input: (Reference Image, Text Query)
  • Step 1: The MLLM performs chain-of-thought style reasoning directly on the reference image
  • Step 2: The reasoning is refined into a target description (text query)
  • Step 3: Candidate images are retrieved using CLIP-based text-image matching (zero-shot)

→ The entire process runs end-to-end in a single stage, relying on a richer prompt rather than multiple models.
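
To make the flow concrete, here is a minimal sketch of the reason-then-retrieve idea. It is not the authors' code: `query_mllm` is a hypothetical placeholder for whichever MLLM (e.g., GPT-4V or LLaVA) produces the target description, and the zero-shot retrieval step uses CLIP ViT-L/14 through the Hugging Face transformers API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot retrieval backbone (frozen): CLIP ViT-L/14, the backbone of the
# paper's main comparison setting.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def query_mllm(reference_image: Image.Image, modification_text: str) -> str:
    """Hypothetical placeholder for the MLLM call (GPT-4V, LLaVA, ...).

    It is prompted to reason over the reference image and the modification
    text in one shot and to return only the final target description.
    """
    raise NotImplementedError("plug in your MLLM client here")


def reason_then_retrieve(reference_image, modification_text, candidate_images, top_k=5):
    # 1) Reason first: infer a textual description of the *target* image
    #    directly from the reference image + modification text.
    target_description = query_mllm(reference_image, modification_text)

    # 2) Then retrieve: rank candidates by CLIP text-image similarity.
    text_inputs = processor(text=[target_description], return_tensors="pt",
                            padding=True, truncation=True)
    image_inputs = processor(images=candidate_images, return_tensors="pt")
    with torch.no_grad():
        text_emb = clip.get_text_features(**text_inputs)
        image_emb = clip.get_image_features(**image_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)

    return scores.topk(min(top_k, len(candidate_images))).indices.tolist()
```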

[Figure: one-stage prompt with its inputs and outputs]

  • Both inputs (image + text query) go into one stage
  • Outputs include: image caption, thoughts, reflection, and the final description
  • The final description is then used for retrieval (same retrieval stage as CIReVL)

[Figure: example of thoughts and reflection]

  • The thoughts and reflection steps help reduce hallucination
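
As an illustration of that structured output, here is a paraphrased sketch of what such a one-stage reflective prompt could look like. It is not the paper's actual prompt, and the JSON field names are my own; the point is that caption, thoughts, reflection, and final description come back from a single MLLM call, and only the final description is forwarded to retrieval.

```python
import json

# Paraphrased sketch of a one-stage reflective chain-of-thought prompt
# (not the paper's exact wording; field names are illustrative).
REFLECTIVE_PROMPT = """You are given a reference image and a modification request:
"{modification_text}"

Return a single JSON object with these fields:
  "caption":           a faithful description of the reference image
  "thoughts":          step-by-step reasoning about which visual attributes the
                       modification keeps, changes, or adds
  "reflection":        a check of the thoughts against the image, dropping any
                       detail that is not actually visible (to curb hallucination)
  "final_description": one sentence describing the *target* image, to be used as
                       the retrieval query
"""


def parse_target_description(mllm_response: str) -> str:
    """Keep only the final description; the other fields exist to guide reasoning."""
    return json.loads(mllm_response)["final_description"]
```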

🧪 Experimental Results

  • Datasets & Metrics
    • Benchmarks: CIRR, CIRCO, FashionIQ, GeneCIS
    • Metrics (a minimal sketch of both metric families follows the baselines list below):
      • CIRR, GeneCIS, FashionIQ → Recall@k (R@k)
      • CIRCO → mAP@k (multi-ground-truth)
      • CIRR Subset → RecallSubset@k

  • Baselines
    • Textual Inversion (training-dependent): Pic2Word, SEARLE, Context-I2W, LinCIR
    • Training-Free: CIReVL, CIReVL*
      • CIReVL* (star): the same two-stage structure as CIReVL, but with both reference-image captioning and text composition handled by a single MLLM (e.g., GPT-4V, LLaVA).
    • Proposed Model: OSrCIR
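
For reference, here is a minimal sketch of the two metric families (my own implementation, not the official evaluation code of these benchmarks): Recall@k checks whether the single ground-truth target appears in the top-k list, while CIRCO-style mAP@k averages precision over possibly multiple ground truths.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Recall@k for single-ground-truth benchmarks (CIRR, FashionIQ, GeneCIS):
    1 if the target appears among the top-k retrieved items, else 0."""
    return float(target_id in ranked_ids[:k])


def map_at_k(ranked_ids, target_ids, k):
    """AP@k for one query of a multi-ground-truth benchmark (CIRCO):
    precision is accumulated at every rank that hits a ground truth and
    normalized by min(#ground truths, k); mAP@k is the mean over queries."""
    if not target_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in target_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(target_ids), k)
```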

📊 OSrCIR vs Baselines (Performance Summary)

| Dataset | Metric | OSrCIR (ViT-L/14) | CIReVL* | Context-I2W | LinCIR | Gain (vs CIReVL*) |
|---|---|---|---|---|---|---|
| CIRCO | mAP@5 | 23.87% | 18.92% | 13.04% | - | +4.95% |
| CIRR | Avg (R@k) | ↑ | - | - | - | +3.23% |
| GeneCIS | Avg R@1 | ↑ | - | - | - | +1.8% (vs CIReVL*), +5.2% (vs Context-I2W) |
| FashionIQ | R@10 (L/14) | ↑ | - | - | - | +4.74% (vs CIReVL*), +5.47% (vs Context-I2W) |
| FashionIQ | R@10 (G/14) | ↑ | - | - | Best | +4.6% (vs CIReVL*), but < LinCIR |

  • Qualitative Results
    • OSrCIR better captures fine-grained details and context
      • Examples: "poster", "Chihuahua", "Labrador", "beach"
    • In fashion, more precise with "one-shoulder dress", complex t-shirt patterns, etc.
  • Limitation: In specialized domains like fashion, CLIP misalignment still remains a bottleneck.

✅ Conclusion & Significance

  • OSrCIR demonstrates how MLLM reasoning can be optimized for CIR.
  • Works purely with training-free zero-shot inference → generalizable and lightweight.
  • One of the first cases of applying Chain-of-Thought reasoning in a single-stage retrieval pipeline.
  • Opens new possibilities for applying VLM/MLLM reasoning in tutoring, search, and AGI planning tasks.

This post is licensed under CC BY 4.0 by the author.