
🧠 CIR - Composed Image Retrieval on Real-life Images: The Starting Point of Image Retrieval Research!!


🧠 Image + Text Based Composed Image Retrieval!! CIRR



๐Ÿ” Research Background

Traditional image retrieval generally relies on a text query or image query (single modality).
However, in real-world scenarios, users often want to provide compositional information such as:

"I want an image similar to this one, but with a different background."
→ Reference Image + Modification Text = Composed Image Retrieval (CIR)
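To make the query format concrete, here is a minimal, illustrative sketch of composed retrieval: embed the reference image and the modification text, fuse them into a single query vector, and rank a gallery by cosine similarity. The encoders and the simple additive fusion below are placeholders for a CLIP-style model and a learned fusion module, not the method of any specific paper.

```python
# Illustrative sketch only: placeholder encoders + naive additive fusion.
import numpy as np

def embed_image(path: str) -> np.ndarray:
    """Placeholder for a pretrained image encoder; returns an L2-normalized vector."""
    v = np.random.default_rng(abs(hash(path)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    """Placeholder for a pretrained text encoder; returns an L2-normalized vector."""
    v = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def compose(ref_img: np.ndarray, mod_text: np.ndarray) -> np.ndarray:
    """Toy fusion: add the two modalities (real CIR models learn this step)."""
    q = ref_img + mod_text
    return q / np.linalg.norm(q)

# Query = reference image + modification text
query = compose(embed_image("reference.jpg"),
                embed_text("similar to this, but with a different background"))

# Rank candidate images by cosine similarity to the composed query
gallery = ["cand_01.jpg", "cand_02.jpg", "cand_03.jpg"]
ranked = sorted(gallery, key=lambda p: float(query @ embed_image(p)), reverse=True)
print(ranked)
```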


🧠 Main Contributions

  1. Definition of CIR Task

    • Clearly defines Composed Image Retrieval (CIR)
    • Query: Reference Image + Textual Modification
    • Goal: Retrieve the image that matches the composed query
  2. CIRR Dataset Proposal

    • Large-scale benchmark for real-life CIR
    • Over 21,000 query–target pairs
    • Includes natural scenes, object diversity, and complex textual expressions
  3. Evaluation Set Design

    • Fine-grained distractors: semantically similar images increase retrieval difficulty
    • Multiple reference forms: ensures diversity at instance-level and scene-level
  4. Benchmarking Existing Methods

    • Evaluates representative methods such as TIRG, FiLM, and MAAF
    • Demonstrates the difficulty and realism of the CIRR dataset (a minimal Recall@K sketch follows this list)
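These benchmarks are reported with Recall@K: the fraction of queries whose true target appears among the top-K retrieved images. Below is a minimal sketch of the metric; the optional `subset` mask is an illustrative stand-in for CIRR's fine-grained distractor subsets, not the paper's exact evaluation code.

```python
# Minimal Recall@K sketch (illustrative, not the official CIRR evaluation script).
import numpy as np

def recall_at_k(similarities, target_idx, k, subset=None):
    """similarities: (num_queries, num_gallery) score matrix.
    target_idx: index of the true target image for each query.
    subset: optional boolean mask restricting each query to its distractor subset."""
    scores = similarities.astype(float).copy()
    if subset is not None:
        scores[~subset] = -np.inf          # ignore images outside the query's subset
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 4 queries against a gallery of 10 images
sims = np.random.rand(4, 10)
targets = [3, 7, 0, 5]
print(f"Recall@5 = {recall_at_k(sims, targets, k=5):.2%}")
```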

🧠 CIRR Dataset: Real-life Composed Retrieval

CIRR (Composed Image Retrieval on Real-life Images) is the first major benchmark for composed image retrieval (CIR).
Its goal is to understand user intent expressed as "reference image + modification text."

  • Generality: Includes a wide variety of real-world scenes and objects (not domain-specific like fashion).
  • Scale: ~17,000 images and 21,000 query–target pairs.
  • Difficulty: Contains semantic distractors (very similar images), requiring models to capture precise modification intent.

This dataset formally defines CIR and sets a realistic benchmark, becoming the foundation for subsequent research.
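For intuition, a single query–target pair can be pictured as a small record like the one below. The field names are hypothetical placeholders chosen for readability; consult the official CIRR release for the actual annotation schema.

```python
# Hypothetical illustration of one CIRR-style annotation (field names are NOT the real schema).
example_pair = {
    "reference_image": "dev-0001-img0.png",        # hypothetical file name
    "modification_text": "the same dog, but outdoors on grass",
    "target_image": "dev-0001-img1.png",           # hypothetical file name
    "distractor_subset": ["dev-0002-img0.png",     # semantically similar images that make
                          "dev-0003-img0.png"],    # the ranking task harder
}

# A CIR model must rank `target_image` above every other gallery image,
# including the near-duplicates listed in `distractor_subset`.
print(example_pair["modification_text"])
```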


📷 Example Query in CIRR

(Figure: cirr_example)

  • Reference Image: A woman sitting on a bench
  • Text Modification: "The same woman is standing, wearing different clothes."
  • Target Image: The real-life image that satisfies the condition

🧠 Significance of CIRR Dataset

CIRR focused on defining a new problem and demonstrating its difficulty.
The authors benchmarked existing models (TIRG, FiLM, MAAF) on the dataset,
showing their limitations on complex queries and motivating follow-up solutions.


Existing Models and Their Limitations

| Model | Key Idea | Role in CIRR |
| --- | --- | --- |
| TIRG | Combines image and text with residual gating | Tests text-guided image modification |
| FiLM | Feature-wise linear modulation based on text | Reveals limitations on simple compositional queries |
| MAAF | Modality-aware attention fusion | Explores handling of complex multimodal queries |

  • Results:

| Method | Recall@1 | Recall@5 |
| --- | --- | --- |
| TIRG | 20.1% | 47.6% |
| FiLM | 18.4% | 44.1% |
| MAAF | 22.0% | 49.2% |

  • All models show relatively low performance
  • Limitations exposed in complex queries
  • CIRR task is inherently difficult
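For context on what these baselines compute, below is a simplified, single-vector sketch of TIRG-style residual gating: a gate decides which parts of the reference-image feature to keep, and a residual branch injects the change described by the text. Layer sizes and the non-spatial formulation are simplifications for illustration, not the original implementation.

```python
# Simplified TIRG-style fusion sketch (illustrative; not the authors' code).
import torch
import torch.nn as nn

class TIRGFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.w = nn.Parameter(torch.tensor([1.0, 0.1]))  # balance gated vs. residual branch

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        gated = self.gate(x) * img_feat   # keep/suppress parts of the reference image
        res = self.residual(x)            # inject the modification described by the text
        return self.w[0] * gated + self.w[1] * res

fusion = TIRGFusion(dim=512)
composed = fusion(torch.randn(4, 512), torch.randn(4, 512))
print(composed.shape)  # torch.Size([4, 512])
```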

  • Conclusion & Impact
    • Defined a new task + provided a realistic benchmark + proved limitations of existing methods
    • Inspired follow-up datasets and methods (CIRCO, FashionIQ, CIRPL, etc.)
    • A landmark study that initiated CIR research


Timeline of Major Follow-up Research after CIRR (2021–2025)

In the future, we will explore these studies in more detail!!

2021 — CIRR (ICCV 2021)

  • Contribution: First to define the Composed Image Retrieval (CIR) task and release the CIRR dataset
  • Significance: Demonstrated limitations of existing models → Sparked follow-up research

2022 — CIRPLANT

  • Contribution: Proposed a dedicated model architecture for CIR
  • Idea: Gradually fused image features with textual modifications to better represent intended changes
  • Significance: One of the first attempts to tackle CIR challenges through dedicated modeling

2022 — CIRCO (ECCV 2022)

  • Contribution: Expanded dataset for object-centric composed retrieval
  • Idea: Allowed text modifications to apply to individual objects in the reference image
  • Significance: Provided a more fine-grained benchmark

2023 — CIRPL

  • Contribution: Proposed Language-guided Pretraining for CIR
  • Idea: Adapted large-scale multimodal pretraining models to CIR → Improved performance
  • Significance: Connected CIR research with the trend of MLLMs

2024 — CIReVL (Vision-by-Language, ICLR 2024)

  • Contribution: Introduced a training-free CIR model
  • Idea: Used a VLM for image captioning → an LLM for caption rewriting → CLIP-based retrieval (sketched below)
  • Significance: Scalable and interpretable modular design for zero-shot CIR
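A rough sketch of this training-free, modular pipeline is shown below; the captioning, rewriting, and encoder functions are placeholders for whichever VLM, LLM, and CLIP-style model is plugged in, not the paper's actual code.

```python
# Illustrative caption -> rewrite -> retrieve pipeline (all components are placeholders).
import numpy as np

def caption_with_vlm(image_path: str) -> str:
    return "a woman sitting on a bench in a park"            # stand-in VLM output

def rewrite_with_llm(caption: str, modification: str) -> str:
    return f"{caption}, changed so that {modification}"        # a real system would prompt an LLM

def embed_text(text: str) -> np.ndarray:                       # stand-in CLIP text encoder
    v = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def embed_image(path: str) -> np.ndarray:                      # stand-in CLIP image encoder
    v = np.random.default_rng(abs(hash(path)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

target_caption = rewrite_with_llm(
    caption_with_vlm("reference.jpg"),
    "the same woman is standing and wearing different clothes")
query = embed_text(target_caption)                             # retrieval is purely text-to-image
best = max(["a.jpg", "b.jpg", "c.jpg"], key=lambda p: float(query @ embed_image(p)))
print(target_caption, "->", best)
```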

2024 — Contrastive Scaling (arXiv 2024)

  • Contribution: Proposed a contrastive scaling strategy to expand positive/negative samples
  • Idea: Generated triplets via MLLMs → Two-stage fine-tuning for better CIR performance (a minimal contrastive-loss sketch follows)
  • Significance: Improved results even in low-resource settings
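The fine-tuning in this line of work typically optimizes a batch-wise contrastive (InfoNCE-style) objective between composed-query embeddings and target-image embeddings, with MLLM-generated triplets supplying extra positives and negatives. A minimal sketch of the loss (triplet generation omitted):

```python
# Minimal InfoNCE-style contrastive loss over (composed query, target) pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, target_emb, temperature=0.07):
    """query_emb, target_emb: (batch, dim) L2-normalized embeddings; row i matches row i."""
    logits = query_emb @ target_emb.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0))             # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(8, 512), dim=-1)   # composed (image + text) query embeddings
t = F.normalize(torch.randn(8, 512), dim=-1)   # target image embeddings
print(contrastive_loss(q, t).item())
```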

2025 — ConText-CIR (CVPR 2025)

  • Contribution: Proposed a new framework with Text Concept-Consistency Loss
  • Idea: Ensured that noun phrases in the modification text align with the correct parts of the reference image, using a synthetic data pipeline
  • Significance: Achieved state-of-the-art in both supervised and zero-shot settings

2025 — OSrCIR (CVPR 2025 Highlight)

  • Contribution: Introduced a training-free, one-stage reflective CoT-based zero-shot CIR model
  • Idea: Replaced two-stage pipeline with one-stage multimodal Chain-of-Thought reasoning using MLLMs
  • Significance: Preserved visual information better, improved performance by 1.8–6.44%, and enhanced interpretability (an illustrative one-stage prompt follows below)
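To illustrate the one-stage idea, a single MLLM call can be prompted to look at the reference image together with the modification text and reason directly to a target description that is then used for retrieval. The prompt below is only a hypothetical sketch, not the paper's actual prompt.

```python
# Hypothetical one-stage, Chain-of-Thought style prompt for zero-shot CIR (illustrative only).
ONE_STAGE_COT_PROMPT = """You are given a reference image and an edit instruction.
Edit instruction: "{modification}"

Think step by step:
1. Describe the salient content of the reference image.
2. Decide which attributes the instruction keeps, changes, or removes.
3. Write one sentence describing the target image after the edit.

Answer with only the final sentence."""

print(ONE_STAGE_COT_PROMPT.format(
    modification="the same woman is standing, wearing different clothes"))
```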

2025 — COR (Composed Object Retrieval) + CORE

  • Contribution: Proposed a new object-level retrieval task and dataset
  • Idea: Instead of retrieving the whole image, it focuses on retrieving/segmenting a specific object guided by reference object + text
  • Significance: Established a foundation for fine-grained multimodal object retrieval

Overall Summary

  • 2021: CIRR → Task definition + dataset release
  • 2022: CIRPLANT → Model design; CIRCO → Fine-grained dataset
  • 2023: CIRPL → Connection with large-scale pretraining
  • 2024: CIReVL → Training-free CIR; Contrastive Scaling → Expanded contrastive learning
  • 2025: ConText-CIR → Concept consistency loss; OSrCIR → One-stage reasoning;
          COR/CORE → Object-level retrieval and segmentation

This trajectory shows how CIR research has become increasingly sophisticated, enhancing both expressiveness and real-world applicability.


🧩 Conclusion

CIRR (ICCV 2021) was the first work to formally define the CIR task and provide a realistic benchmark!!
Subsequent datasets and methods such as CIRCO, FashionIQ, and CIRPL have all built upon this foundation.



This post is licensed under CC BY 4.0 by the author.