🧠 OSrCIR: Reason-before-Retrieve for Composed Image Retrieval


📌 3-Line Summary

  1. Most existing approaches to composed image retrieval (CIR, image + text) use a 2-stage pipeline (image caption → text reasoning)
  2. In OSrCIR, the MLLM reasons directly over the reference image and infers the characteristics of the target image itself, without going through an intermediate caption
  3. The result is better accuracy and speed, and it works with zero-shot inference alone, without any additional training

🔍 Limitations of Existing CIR Pipelines

| Approach | Structure | Problem |
| --- | --- | --- |
| 2-Stage CIR | (1) Image → caption generation (2) Text → reasoning → retrieval | Image information is lost; reasoning errors occur |
| Text-Only Reasoning | Reference image information is passed on only indirectly | Visual attributes are hard to reflect |
| MLLM-based approaches | Indirect reasoning through question answering | Time-consuming; inconsistent |

→ In other words, using text as the intermediate medium inherently causes information loss.


📊 Comparison of CIRCO and CIRR Test Data

🧾 Overview

| Item | CIRCO (Composed Image Retrieval on Common Objects in context) | CIRR (Compose Image Retrieval on Real-life images) |
| --- | --- | --- |
| Purpose | Component-based (compositional) image retrieval | Composed image retrieval on everyday scenes |
| Data type | Real-world multi-object images (COCO) | Real-life photos (city, daily life, etc.) |
| Main task | Compositional retrieval | Reference + text-based target retrieval |
| Sample composition | Query image + target attribute | Reference image + caption |
| Number of answers | Multiple ground-truth targets per query | Single ground-truth target per query (scored at Top-1 or Top-k) |
| Negative structure | Disentangled hard negatives | Semantically similar distractors |
| Difficulty | High diversity of attribute-level combinations | Includes distractors chosen by scene similarity |
| Main use | Evaluating a model's compositional generalization | Evaluating text-image composed retrieval in real-world settings |

📁 CIRCO Dataset

  • Source: CIRCO (Composed Image Retrieval on Common Objects in context), introduced in "Zero-Shot Composed Image Retrieval with Textual Inversion"
  • Composition: query-target pairs over real-world COCO images, covering varied combinations of object attributes and backgrounds
  • Example input query:
    • Image: ‘a dog next to a tree’
    • Attribute change: ‘tree → bench’
  • Feature: makes it possible to evaluate compositional ability at the level of individual visual objects and components

📁 CIRR Dataset

  • Source: CIRR (Compose Image Retrieval on Real-life images), introduced in "Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models"
  • Composition: based on real images; each reference image comes with a sentence describing the change
  • Example input query:
    • Reference image + text description: “The same woman, but her clothes are a different color and the car behind her is gone”
  • Feature: evaluates semantics-based composed retrieval on real scenes
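
To make the query format concrete, here is a minimal Python sketch of how a single CIR query could be represented. The class name `CIRQuery`, its field names, and the image IDs are all hypothetical; they are not taken from either dataset's official tooling.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CIRQuery:
    """One composed-image-retrieval query (field names are illustrative)."""
    reference_image: str           # path or ID of the reference image
    modification_text: str         # how the target should differ from the reference
    target_ids: Tuple[str, ...]    # ground-truth target image IDs (one or more)


def is_hit(query: CIRQuery, retrieved_id: str) -> bool:
    """Return True if a retrieved image is one of the query's targets."""
    return retrieved_id in query.target_ids


# The CIRR-style example from the text, with made-up image IDs.
query = CIRQuery(
    reference_image="ref_0001.jpg",
    modification_text=(
        "The same woman, but her clothes are a different color "
        "and the car behind her is gone"
    ),
    target_ids=("tgt_0042.jpg",),
)

print(is_hit(query, "tgt_0042.jpg"))  # True
print(is_hit(query, "ref_0001.jpg"))  # False
```

Keeping `target_ids` as a tuple lets the same structure hold one target or several, depending on how the benchmark annotates its queries.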

🧠 Summary

  • CIRCO: focuses on testing structural, compositional generalization.
  • CIRR: focuses on intuitive, real-world text-image retrieval.

Both datasets are used to evaluate the “compositional understanding and retrieval ability” of V+L models, but CIRCO emphasizes more complex composition patterns, while CIRR emphasizes intuitive evaluation based on real photos and descriptions.

🌱 The Core Idea of OSrCIR

“Reason first. Then retrieve.”

  • Existing CIR follows a “Retrieve-and-Reason” approach
  • OSrCIR does the opposite: ‘Reason-before-Retrieve’
  • An MLLM infers the target's characteristics directly from the image
  • The reasoning result (text) is then used to retrieve the target image
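
As a rough illustration of the reason-before-retrieve idea, the sketch below builds a chain-of-thought prompt that would be sent to an MLLM together with the reference image, then pulls the final target description out of the response. The prompt wording, the `Target description:` marker, and both function names are assumptions made for this example. This is not the prompt used in the OSrCIR paper, and no real MLLM is called.

```python
COT_TEMPLATE = (
    "You are given a reference image and a modification request.\n"
    "Modification: {modification}\n"
    "Think step by step:\n"
    "1. Describe what is in the reference image.\n"
    "2. Decide which elements the modification changes and which stay.\n"
    "3. Write one sentence describing the target image.\n"
    "Finish with a line starting with 'Target description:'."
)

MARKER = "Target description:"


def build_reasoning_prompt(modification: str) -> str:
    """Fill the (hypothetical) chain-of-thought template with the user's text."""
    return COT_TEMPLATE.format(modification=modification.strip())


def extract_target_description(mllm_response: str) -> str:
    """Return the text after the last 'Target description:' line."""
    for line in reversed(mllm_response.splitlines()):
        stripped = line.strip()
        if stripped.startswith(MARKER):
            return stripped[len(MARKER):].strip()
    # Fall back to the whole response if the marker is missing.
    return mllm_response.strip()


# A hand-written response standing in for real MLLM output.
fake_response = (
    "1. A dog sits next to a tree.\n"
    "2. The tree is replaced by a bench; the dog stays.\n"
    "Target description: A dog sitting next to a bench."
)

print(extract_target_description(fake_response))
# A dog sitting next to a bench.
```

The extracted sentence is what would be handed to the retrieval step as the text query.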

🔧 OSrCIR Architecture Summary

[Figure: OSrCIR architecture]

  • Input: (Reference Image, Text Query)
  • Step 1: The MLLM performs chain-of-thought style reasoning over the reference image
  • Step 2: The reasoning result is refined into a text query
  • Step 3: CLIP-based text-image matching is run against the candidate images (zero-shot)

→ Because the MLLM reasons over the reference image directly, the whole process runs end-to-end in a single stage (one-stage), with no separate captioning pass
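
The flow described above can be sketched in a few lines of plain Python. Here `toy_reason` stands in for the MLLM, and `toy_embed_text` plus the hand-written candidate vectors stand in for CLIP's text and image encoders. All of these names and values are hypothetical and only illustrate the ranking logic, under the assumption that matching is done by cosine similarity between the query text embedding and each candidate image embedding.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    reference_image: str,
    modification: str,
    reason: Callable[[str, str], str],
    embed_text: Callable[[str], Vector],
    candidate_vecs: Dict[str, Vector],
    top_k: int = 5,
) -> List[str]:
    # Reason first: the (stand-in) MLLM turns image + text into a target description.
    target_description = reason(reference_image, modification)
    # Then retrieve: embed the description and rank candidates by similarity.
    query_vec = embed_text(target_description)
    ranked = sorted(
        candidate_vecs,
        key=lambda cid: cosine(query_vec, candidate_vecs[cid]),
        reverse=True,
    )
    return ranked[:top_k]


# --- Toy stand-ins (hypothetical, for illustration only) ---
VOCAB = ["dog", "tree", "bench", "car"]


def toy_embed_text(text: str) -> List[float]:
    """Bag-of-words counts over a tiny vocabulary, standing in for a text encoder."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_reason(reference_image: str, modification: str) -> str:
    """Returns a fixed description; a real system would query an MLLM here."""
    return "a dog next to a bench"


candidates = {
    "img_dog_tree": [1.0, 1.0, 0.0, 0.0],
    "img_dog_bench": [1.0, 0.0, 1.0, 0.0],
    "img_car": [0.0, 0.0, 0.0, 1.0],
}

top = retrieve("ref.jpg", "replace the tree with a bench",
               toy_reason, toy_embed_text, candidates, top_k=2)
print(top)  # ['img_dog_bench', 'img_dog_tree']
```

Passing `reason` and `embed_text` in as callables keeps the ranking code independent of any particular MLLM or encoder, which fits the training-free setup: swapping models changes no retrieval logic.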


🧪 Experimental Results Summary

On the main benchmarks, OSrCIR outperforms existing 2-stage methods in both accuracy and efficiency.

| Dataset | Recall@1 (previous SOTA) | OSrCIR | Gain |
| --- | --- | --- | --- |
| CIRR | 52.1 (FashionIQ-CLIP) | 57.4 | +5.3 |
| CIRCO | 33.8 | 37.9 | +4.1 |
| FashionIQ | 48.7 | 54.2 | +5.5 |

  • Achieved in the zero-shot setting (inference only, no training)
  • In ablations, performance drops sharply when the reasoning step is omitted
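
For reference, Recall@K is the percentage of queries whose ground-truth target appears among the top K retrieved images. A minimal sketch, assuming one target per query and using made-up rankings (the function name and the data are illustrative, not the paper's evaluation code):

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], targets: Dict[str, str], k: int) -> float:
    """Percentage of queries whose target appears in the top-k results."""
    if not rankings:
        return 0.0
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return 100.0 * hits / len(rankings)


# Four made-up queries, each with the same target "a" for simplicity.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "b", "a"],
    "q4": ["b", "c", "a"],
}
targets = {"q1": "a", "q2": "a", "q3": "a", "q4": "a"}

print(recall_at_k(rankings, targets, 1))  # 25.0
print(recall_at_k(rankings, targets, 2))  # 50.0
print(recall_at_k(rankings, targets, 3))  # 100.0
```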

✅ Conclusion and Significance

  • OSrCIR is a representative example of drawing out an MLLM's high-level reasoning ability in a way tailored to CIR
  • It runs on inference alone, with no separate training → Training-free + Generalizable
  • One of the first cases of applying Chain-of-Thought reasoning directly to single-stage retrieval
  • Strong potential for future use in VLM-based tutoring, search, AGI planning, and similar areas

“Retrieval is not just about matching. It’s about reasoning what to match.”

This post is licensed under CC BY 4.0 by the author.