🧠 CIRCO - Zero-Shot Composed Image Retrieval with Textual Inversion (ICCV 2023)

🧠 (Translated from Korean) Zero-Shot Composed Image Retrieval Using Textual Inversion!! CIRCO

Building on earlier CIR research, this paper releases a ZS-CIR model and also the CIRCO dataset!!

  • Title: Zero-Shot Composed Image Retrieval with Textual Inversion
  • Conference: ICCV 2023 (Baldrati et al.)
  • Code: CIRCO (GitHub)
  • Key Keywords: Composed Image Retrieval, CIRCO, Textual Inversion, Zero-Shot, ICCV 2023, ZS-CIR, SEARLE
  • Note!!: The terminology is easy to mix up. ZS-CIR is the name of the task and SEARLE is the name of the method; in this post, both are used for the model released in this paper!!
  • ZS-CIR is short for Zero-Shot Composed Image Retrieval, and SEARLE stands for zero-Shot composEd imAge Retrieval with textuaL invErsion!!

🔍 Research Background

CIRR (ICCV 2021) established an open-domain benchmark for Composed Image Retrieval (CIR), but work on it relied on training-based methods and on single ground-truth labeling.
Real-world applications, however, call for the following:

“Even in a new domain, without training (Zero-Shot),
find the desired image from a reference image + a text modification,
and allow multiple correct answers at the same time!”

To that end, this ICCV 2023 paper
releases a ZS-CIR model!! → a zero-shot CIR framework built on Textual Inversion.
It also proposes CIRCO, a more realistic and more fine-grained dataset.


🧠 Key Contributions

  1. Zero-Shot CIR Framework (ZS-CIR)

    • Uses Textual Inversion to embed the reference image as a new concept token
    • Combines that token with the modification text to form a composed query
    • Applies across domains without dataset-specific training
  2. CIRCO Dataset

    • Uses real-world images from COCO 2017
    • Object-centric queries that can involve multiple objects
    • Reflects attribute changes and changes in object relationships in real scenes
  3. Benchmark & Zero-Shot Performance

    • Zero-shot performance verified on CIRR, FashionIQ, and CIRCO
    • Meaningful performance even without training

🧠 Key Contributions (in detail!!)

1. Zero-Shot CIR Framework (ZS-CIR)

  • Existing problem
    • On existing datasets such as CIRR and FashionIQ, most models had to be fine-tuned on the training data
    • As a result, performance dropped sharply on new domains or unseen categories
  • Core idea
    • Bring the Textual Inversion technique into CIR
    • Convert the reference image into a new token (embedding) → use it as if it were a “word”
    • Combine it with the modification text → finally form an “image + sentence composed query”
  • Advantages
    • Retrieval works without additional training (Zero-Shot)
    • General-purpose rather than tied to one domain (fashion, everyday images, etc.)
    • The inference pipeline is simple, which also keeps it efficient
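
The core idea above can be written compactly. This is a sketch in assumed notation, not the paper's exact formulation: $\psi_I$ and $\psi_T$ stand for a frozen image and text encoder, $\phi$ for the textual-inversion mapping, $I_r$ for the reference image, $T$ for the modification text, $S_*$ for the pseudo-word whose embedding is $v_*$, and $\mathcal{G}$ for the gallery. The prompt template is an illustrative assumption.

```latex
v_* = \phi\big(\psi_I(I_r)\big), \qquad
q = \psi_T\big(\text{``a photo of } S_* \text{ that } T\text{''}\big), \qquad
\hat{I} = \arg\max_{I_c \in \mathcal{G}} \cos\big(q,\ \psi_I(I_c)\big)
```

Read left to right: the reference image becomes a token embedding, that token is spliced into a sentence together with the modification text, and the gallery image closest to the resulting text feature is returned.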

2. CIRCO Dataset

This work is not only among the first in CIR research to make zero-shot composed retrieval possible;
on the dataset side it also reflects

  • Multiple ground truths
  • Realistic images
  • Complex query composition
    and so raises the quality of CIR evaluation, which makes it a landmark study.

  • Realism
    • Based on MS-COCO 2017 images
    • Reduces bias toward a single domain (e.g., fashion) and covers diverse objects, backgrounds, and relationships
  • Object-Centric
    • Queries reflect attribute changes of specific objects, not just the “whole scene”
    • Example: “Make the car in the photo red, and change the puppy next to it into a cat.”
  • Multiple Ground Truths (Multi-Ground Truths)
    • On average, 4.53 target images per query
    • Overcomes the limitation of single-ground-truth setups such as FashionIQ
    • Mitigates the False Negative problem → evaluation of retrieval models becomes much fairer
  • Complex Queries
    • Includes not only attribute edits but also multiple objects and relationships between objects
    • Goes beyond a simple “color change” to include compound queries such as
      • “A different person is standing where the person was sitting”
      • “There is a cat where the dog used to be”
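
With several correct targets per query, a rank-aware metric such as mAP@K is the usual way to score a ranking. The definition below is a sketch in assumed notation: $N$ is the number of queries, $G_n$ the number of ground truths for query $n$, $P_n@k$ the precision of the top $k$ results, and $\mathrm{rel}_n@k$ equals 1 when the item at rank $k$ is a ground truth and 0 otherwise.

```latex
\mathrm{mAP@}K \;=\; \frac{1}{N}\sum_{n=1}^{N}\frac{1}{\min(K,\,G_n)}\sum_{k=1}^{K} P_n@k \cdot \mathrm{rel}_n@k
```

Dividing by $\min(K, G_n)$ keeps the score at 1 for a perfect ranking even when a query has more ground truths than $K$.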

3. Benchmark & Zero-Shot Performance

  • Evaluation datasets: zero-shot performance verified on major CIR benchmarks including CIRR, FashionIQ, and CIRCO

  • FashionIQ (Validation Set)
    • SEARLE (B/32): Avg R@10 = 22.89, R@50 = 42.53
    • SEARLE-XL (L/14): Avg R@10 = 25.56, R@50 = 46.23
    • A clear improvement over prior methods; with the Base backbone, plain SEARLE was often better than the OTI variant with optimized prompts!!
  • CIRR (Test Set)
    • SEARLE (B/32): Recall@1 = 24.27, Recall@5 = 53.22, Recall@10 = 66.82
    • SEARLE-XL (L/14): Recall@1 = 24.22, Recall@5 = 52.48, Recall@10 = 66.29
    • SEARLE performed better than the variant with optimized prompts!
    • On Subset Recall (a finer-grained evaluation), SEARLE-XL also reached Recall@3 = 88.19, a SOTA-level result
  • Significance
    • Despite being a simple Zero-Shot approach, it recorded performance competitive with existing training-based methods on FashionIQ and CIRR
    • On CIRR in particular, Recall@1 exceeds 24%, giving meaningful retrieval quality without training
    • This was an early demonstration that Zero-Shot approaches are viable in CIR research, and it laid the groundwork for later Training-Free work such as CIReVL (ICLR 2024) and OSrCIR (CVPR 2025)

🧩 Conclusion

This ICCV 2023 paper proposed Textual Inversion-based zero-shot CIR (ZS-CIR)
and built a new dataset (CIRCO) with greater realism and finer detail.
It became a starting point for CIR research to evolve from “training-based + single ground truth”
to “zero-shot + multiple ground truths + complex queries.”


🧠 (English) Zero-Shot Composed Image Retrieval with Textual Inversion!! CIRCO

In this work, the authors released the ZS-CIR model and also introduced the CIRCO dataset!!

  • Title: Zero-Shot Composed Image Retrieval with Textual Inversion
  • Conference: ICCV 2023 (Baldrati et al.)
  • Code: CIRCO (GitHub)
  • Key Keywords: Composed Image Retrieval, CIRCO, Textual Inversion, Zero-Shot, ICCV 2023, ZS-CIR, SEARLE
  • Note!!: ZS-CIR is the name of the task and SEARLE is the name of the method; in this post, both terms are used for the model released in this paper.
  • ZS-CIR is the abbreviation of Zero-Shot Composed Image Retrieval, while SEARLE stands for zero-Shot composEd imAge Retrieval with textuaL invErsion.

🔍 Background

CIRR (ICCV 2021) established an open-domain benchmark for Composed Image Retrieval (CIR), but work on it relied heavily on training-based methods and single ground-truth labels.
However, in real-world scenarios, new demands have emerged:

“We need to retrieve images in unseen domains,
without additional training (Zero-Shot),
using reference images + textual modifications,
while allowing multiple correct answers!”

To meet these demands, the ICCV 2023 paper proposed:

  • ZS-CIR model → a Textual Inversion-based zero-shot CIR framework
  • CIRCO dataset → a more realistic and fine-grained benchmark

🧠 Key Contributions

  1. Zero-Shot CIR Framework (ZS-CIR)

    • Applied Textual Inversion to embed reference images as new concept tokens
    • Combined with modification text to form composed queries
    • Applicable across domains without dataset-specific training
  2. CIRCO Dataset

    • Based on COCO 2017 real-world images
    • Object-centric queries including multiple objects
    • Captures attribute changes + object relationships in natural scenes
  3. Benchmark & Zero-Shot Performance

    • Evaluated on CIRR, FashionIQ, and CIRCO
    • Achieved meaningful zero-shot performance without additional training

🧠 Key Contributions (Detailed)

1. Zero-Shot CIR Framework (ZS-CIR)

  • Problem
    • Most previous models had to be fine-tuned on the training splits of datasets like CIRR and FashionIQ
    • As a result, performance dropped drastically on unseen domains or categories
  • Core Idea
    • Incorporate Textual Inversion into CIR
    • Convert reference images into pseudo-word tokens (embeddings), treated like “words”
    • Combine with modification text → final image+text composed query
  • Advantages
    • Enables retrieval without additional training on the target dataset (Zero-Shot)
    • Domain-agnostic: works across fashion, real-life, and beyond
    • Simple inference pipeline with efficient retrieval
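
The pipeline described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not SEARLE's implementation: the vocabulary table, the two-layer `phi` network, the mean-pooling `encode_text`, and the prompt template "a photo of $ that ..." are all hypothetical stand-ins for a real CLIP model and a trained textual-inversion network, and every weight is random.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding width (assumption; real encoders are much wider)

# Hypothetical frozen token-embedding table standing in for a text encoder's vocabulary.
VOCAB = {w: rng.normal(size=DIM) for w in
         ["a", "photo", "of", "that", "is", "red", "on", "the", "beach"]}

# Hypothetical textual-inversion network phi: image feature -> pseudo-word token.
W1 = rng.normal(size=(DIM, DIM))
W2 = rng.normal(size=(DIM, DIM))

def phi(image_feature):
    hidden = np.maximum(image_feature @ W1, 0.0)  # ReLU
    return hidden @ W2

def encode_text(token_embeddings):
    """Toy text encoder: mean-pool the tokens, then L2-normalise."""
    pooled = token_embeddings.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def compose_query(image_feature, modification):
    """Build 'a photo of $ that <modification>' with $ replaced by phi(image)."""
    tokens = [VOCAB[w] for w in ["a", "photo", "of"]]
    tokens.append(phi(image_feature))  # the image now acts like a word
    tokens.append(VOCAB["that"])
    tokens += [VOCAB[w] for w in modification.split()]
    return encode_text(np.stack(tokens))

def retrieve(query, gallery_features, top_k=3):
    """Rank gallery images by cosine similarity to the composed query."""
    gallery = gallery_features / np.linalg.norm(gallery_features, axis=1, keepdims=True)
    scores = gallery @ query
    return np.argsort(-scores)[:top_k]

reference_feature = rng.normal(size=DIM)   # stand-in for an image-encoder output
gallery_features = rng.normal(size=(20, DIM))
query = compose_query(reference_feature, "is red on the beach")
top = retrieve(query, gallery_features)
print(top)  # indices of the three best-matching gallery images
```

The point of the design is that the gallery is searched with an ordinary text feature, so no fusion module has to be trained on CIR triplets.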

2. CIRCO Dataset

This work is among the first to tackle zero-shot composed retrieval,
and its CIRCO dataset also advances CIR evaluation by providing:

  • Multiple ground truths
  • Real-world images
  • Complex queries
    → raising the evaluation quality of CIR benchmarks

  • Realism
    • Built on MS-COCO 2017
    • Avoids domain bias (e.g., fashion-only) and covers diverse scenes, objects, and contexts
  • Object-Centric
    • Queries reflect changes in specific objects, not only the global scene
    • Example: “Change the car in the image to red, and replace the dog with a cat.”
  • Multiple Ground Truths
    • On average, 4.53 target images per query
    • Overcomes the single-ground-truth limitation of FashionIQ
    • Mitigates False Negative issue → fairer evaluation of retrieval systems
  • Complex Queries
    • Includes not only attribute modifications but also multi-object and relational changes
    • Beyond “color change,” includes cases like:
      • “A person sitting becomes another person standing”
      • “Replace the dog with a cat”
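
To see why multiple ground truths change the scoring, here is a minimal sketch of mAP@K, a rank-aware metric commonly paired with multi-target benchmarks. The function names and the toy IDs are illustrative assumptions; the code assumes each ranking has no duplicate IDs and each query has at least one ground truth.

```python
def average_precision_at_k(ranked_ids, ground_truths, k):
    """AP@K for one query: mean of precision@rank over the ranks that hit a ground truth."""
    targets = set(ground_truths)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in targets:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalise by the best achievable number of hits within the top K.
    return precision_sum / min(k, len(targets))

def mean_average_precision_at_k(all_rankings, all_ground_truths, k):
    """mAP@K: average AP@K over all queries."""
    scores = [average_precision_at_k(r, g, k)
              for r, g in zip(all_rankings, all_ground_truths)]
    return sum(scores) / len(scores)

ranking = ["a", "x", "b", "y", "c"]
multi_gt = ["a", "b", "c"]
print(round(average_precision_at_k(ranking, multi_gt, k=5), 4))  # 0.7556

# With a single annotated target "c", the same ranking scores far lower,
# because the correct results "a" and "b" are counted as misses.
print(round(average_precision_at_k(ranking, ["c"], k=5), 4))     # 0.2
```

The second call shows the False Negative effect in miniature: unlabeled but valid results push the one labeled target down the list and the score drops.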

3. Benchmark & Zero-Shot Performance

  • Evaluation Datasets: CIRR, FashionIQ, CIRCO

  • FashionIQ (Validation Set)
    • SEARLE (B/32): Avg R@10 = 22.89, R@50 = 42.53
    • SEARLE-XL (L/14): Avg R@10 = 25.56, R@50 = 46.23
    • In some cases, plain SEARLE outperformed the optimized OTI version
  • CIRR (Test Set)
    • SEARLE (B/32): Recall@1 = 24.27, Recall@5 = 53.22, Recall@10 = 66.82
    • SEARLE-XL (L/14): Recall@1 = 24.22, Recall@5 = 52.48, Recall@10 = 66.29
    • SEARLE achieved better results than OTI-trained prompts in some settings
    • Subset Recall: SEARLE-XL reached Recall@3 = 88.19, achieving SOTA-level performance
  • Significance
    • Even in a Zero-Shot setup, SEARLE achieved performance competitive with training-based approaches
    • On CIRR, Recall@1 exceeded 24%, showing that useful retrieval quality is achievable without training on the target dataset
    • This result supported the feasibility of Zero-Shot CIR, laying the groundwork for follow-up works such as CIReVL (ICLR 2024) and OSrCIR (CVPR 2025)
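
For readers unfamiliar with the Recall@K figures above, the sketch below shows one common way to compute them for single-target benchmarks: rank the gallery by cosine similarity and check whether the annotated target lands in the top K. The function name, argument layout, and toy data are assumptions for illustration, not the benchmarks' official evaluation code.

```python
import numpy as np

def recall_at_k(query_feats, gallery_feats, target_idx, ks=(1, 5, 10)):
    """Percentage of queries whose single target appears in the top K results."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                                         # cosine similarities
    ranking = np.argsort(-sims, axis=1, kind="stable")     # best match first
    targets = np.asarray(target_idx)[:, None]
    target_rank = np.argmax(ranking == targets, axis=1)    # 0-based position of the target
    return {k: float(np.mean(target_rank < k) * 100.0) for k in ks}

# Toy data: four orthogonal gallery items, two queries.
gallery = np.eye(4)
queries = gallery[[2, 0]]          # query 0 matches item 2, query 1 matches item 0
print(recall_at_k(queries, gallery, target_idx=[2, 1], ks=(1, 4)))
# {1: 50.0, 4: 100.0}
```

Scores are reported as percentages, which matches how values such as 24.27 or 66.82 are written in the list above.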

🧩 Conclusion

This ICCV 2023 paper introduced Textual Inversion-based Zero-Shot CIR (ZS-CIR / SEARLE)
and established a new, more realistic dataset (CIRCO).
This work marked the evolution of CIR from “training-based + single ground-truth”
to “zero-shot + multiple ground-truths + complex queries.”

This post is licensed under CC BY 4.0 by the author.