📊 Evaluation Metrics in CIRR - Exploring the Metrics Used in CIRR

🧠 Understanding Metrics in the CIRR Domain

  • While exploring OSrCIR and CIReVL, several evaluation metrics commonly used in the CIRR domain appeared.
  • Since these metrics also frequently appear in other research, I studied their definitions more carefully!

1. Recall@K (R@K)

  • Definition: A metric that measures the proportion of queries where the ground-truth image is within the Top-K retrieved results.
\[Recall@K = \frac{\#\{\text{queries with the ground truth in the Top-K}\}}{\#\{\text{all queries}\}}\]
  • Strengths: Simple to compute and highly intuitive; widely adopted as the standard in most CIR benchmarks.
  • Limitations: Fails to fully capture cases with multiple correct answers, and may underestimate performance in datasets with many false negatives (e.g., CIRR).
  • Common Usage: CIRR (ICCV 2021), FashionIQ.
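
The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming each query has exactly one ground-truth image and that the retrieval model has already produced a ranked list of candidate ids per query; the function name `recall_at_k` and the toy image ids are made up for the example.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose ground-truth id appears in the top-k results.

    rankings: list of ranked candidate-id lists, one per query (best first).
    targets:  list of ground-truth ids, one per query (single ground truth assumed).
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(rankings)


# Toy data: four queries, three retrieved candidates each (hypothetical ids).
rankings = [
    ["img_3", "img_7", "img_1"],
    ["img_2", "img_9", "img_4"],
    ["img_5", "img_6", "img_8"],
    ["img_1", "img_0", "img_2"],
]
targets = ["img_7", "img_4", "img_0", "img_1"]

print(recall_at_k(rankings, targets, k=1))  # 0.25
print(recall_at_k(rankings, targets, k=2))  # 0.5
print(recall_at_k(rankings, targets, k=3))  # 0.75
```

The third query never retrieves its target, so it counts as a miss at every K. If that query had other valid images that were simply not annotated, this is the false-negative effect described in the limitations.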

2. Subset Recall@K

  • Definition: Similar to Recall@K, but instead of the whole database, evaluation is restricted to a smaller predefined subset of related images (e.g., the same scene group).
  • The subset is not chosen arbitrarily by the researcher, but is predefined within the dataset itself.
  • Strengths: Allows for fine-grained comparison of model performance within a small candidate pool.
  • Limitations: Does not directly reflect retrieval performance across the entire DB, and results may vary depending on how subsets are defined.
  • Common Usage: CIRR Subset Evaluation.
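
One way to picture the subset restriction is to filter the global ranking down to the predefined group before taking the Top-K. The sketch below assumes the subset for each query is given as a list of candidate ids that contains the target; `subset_recall_at_k` and the toy ids are hypothetical names for illustration, not the official CIRR evaluation code.

```python
def subset_recall_at_k(rankings, targets, subsets, k):
    """Recall@K computed only over each query's predefined candidate subset.

    rankings: ranked candidate-id lists over the whole database (best first).
    targets:  one ground-truth id per query.
    subsets:  one collection of allowed candidate ids per query.
    """
    if not (len(rankings) == len(targets) == len(subsets)):
        raise ValueError("rankings, targets and subsets must have the same length")
    if not rankings:
        return 0.0
    hits = 0
    for ranked, target, subset in zip(rankings, targets, subsets):
        allowed = set(subset)
        # Keep the model's ordering but drop every candidate outside the subset.
        filtered = [cand for cand in ranked if cand in allowed]
        if target in filtered[:k]:
            hits += 1
    return hits / len(rankings)


# Toy data: "x" and "y" are database images outside the subsets.
rankings = [["a", "x", "b", "c"], ["y", "d", "e", "f"]]
targets = ["b", "f"]
subsets = [["a", "b", "c"], ["d", "e", "f"]]

print(subset_recall_at_k(rankings, targets, subsets, k=1))  # 0.0
print(subset_recall_at_k(rankings, targets, subsets, k=2))  # 0.5
print(subset_recall_at_k(rankings, targets, subsets, k=3))  # 1.0
```

Because the candidate pool is tiny, the metric saturates quickly as K grows, which is why it is only informative at very small K.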

3. mAP@K (mean Average Precision at K)

  • Definition: Considers the ranking positions of all correct answers within the Top-K results and computes the average precision.
    \(mAP@K = \frac{1}{N}\sum_{i=1}^N \frac{1}{|G_i|}\sum_{j=1}^{|G_i|} Precision@r_{ij}\)

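    (where \(N\) is the number of queries, \(G_i\) is the set of ground-truth images for the i-th query, and \(r_{ij}\) is the rank of the j-th ground-truth image for the i-th query; with the @K cutoff, only ground truths ranked within the Top-K contribute to the inner sum)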

  • Strengths: Fair in multi-ground-truth scenarios; reflects retrieval quality more precisely than Recall@K.
  • Limitations: More complex to compute and less intuitive than Recall.
  • Common Usage: CIRCO (ICCV 2023), GeneCIS.
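
A small worked example may make the formula more concrete. The sketch below accumulates Precision@rank at every position in the Top-K where a ground-truth image appears. One assumption to flag: it divides by min(K, |G_i|) rather than |G_i|, so that a perfect ranking scores 1.0 even when K is smaller than the number of ground truths; the two agree whenever K ≥ |G_i|. The function names and toy ids are hypothetical, and the ranked lists are assumed to contain no duplicates.

```python
def average_precision_at_k(ranked, ground_truths, k):
    """AP@K for one query: sum of Precision@rank over hits in the top-k."""
    gts = set(ground_truths)
    if not gts:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in gts:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this hit
    # Assumption: normalise by min(k, |G|) so a perfect top-k gives 1.0.
    return precision_sum / min(k, len(gts))


def mean_average_precision_at_k(rankings, ground_truth_sets, k):
    """mAP@K: mean of AP@K over all queries."""
    scores = [
        average_precision_at_k(ranked, gts, k)
        for ranked, gts in zip(rankings, ground_truth_sets)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: query 1 has three ground truths, query 2 has one.
rankings = [["a", "b", "c", "d", "e"], ["x", "y", "z"]]
ground_truth_sets = [{"a", "c", "e"}, {"y"}]

# Query 1: hits at ranks 1, 3, 5 -> (1/1 + 2/3 + 3/5) / 3
# Query 2: hit at rank 2        -> (1/2) / 1
print(round(mean_average_precision_at_k(rankings, ground_truth_sets, k=5), 4))  # 0.6278
```

Unlike Recall@K, the score for query 1 would drop if its three ground truths were pushed further down the list, even while all of them stayed inside the Top-K.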

4. TIFA (Text-to-Image Faithfulness evaluation with question Answering, Hu et al. 2023)

I will write a separate post on TIFA in more detail soon!!

  • Definition: A metric for evaluating the faithfulness of generated images to the input text prompt.
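    • Question-answer pairs such as “Is there a dog?” or “Is the dog brown?” are generated automatically from the text prompt by a language model; a VQA (Visual Question Answering) model then answers them on the image, and the score is the fraction of questions answered as expected.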
  • Strengths: Goes beyond Recall by measuring textual faithfulness, and shows strong correlation with human evaluation.
  • Limitations: Highly dependent on the VQA model's performance; incurs additional computational cost; not yet a fully standardized evaluation metric in CIR.
  • Common Usage: Hu et al. (ICCV 2023), Vision-by-Language (CIReVL, ICLR 2024).
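
The scoring step of a TIFA-style metric reduces to an accuracy over question-answer pairs. The sketch below shows only that last step, under strong simplifications: the question-answer pairs are assumed to be already generated, and `toy_vqa` is a hypothetical stand-in that looks answers up in a dictionary instead of running a real VQA model on an image. None of these names come from the TIFA codebase.

```python
def tifa_style_score(image, qa_pairs, vqa_answer):
    """Fraction of questions for which the VQA answer matches the expected one.

    image:      whatever object the vqa_answer callable accepts.
    qa_pairs:   list of (question, expected_answer) tuples derived from the text.
    vqa_answer: callable (image, question) -> answer string (assumed interface).
    """
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, expected in qa_pairs:
        predicted = vqa_answer(image, question)
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)


def toy_vqa(image, question):
    """Hypothetical stand-in for a VQA model: the 'image' is a dict of answers."""
    return image.get(question, "unknown")


# Pairs that a prompt like "a brown dog on a sofa" might yield (made up).
qa_pairs = [
    ("Is there a dog?", "yes"),
    ("Is the dog brown?", "yes"),
    ("Is the dog on a sofa?", "yes"),
]
# Fake image content: the dog is present and on a sofa, but not brown.
fake_image = {
    "Is there a dog?": "yes",
    "Is the dog brown?": "no",
    "Is the dog on a sofa?": "yes",
}

score = tifa_style_score(fake_image, qa_pairs, toy_vqa)
print(round(score, 3))  # 0.667
```

Swapping `toy_vqa` for a real model changes nothing in the scoring logic, which is also why the metric inherits every mistake the VQA model makes.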

👉 Summary:

  • Recall@K → the basic standard metric in CIR.
  • Subset Recall → used in CIRR for fine-grained model comparison.
  • mAP@K → ensures fairness in multi-ground-truth settings (CIRCO, GeneCIS).
  • TIFA → evaluates textual faithfulness, providing a more human-aligned interpretation.

This post is licensed under CC BY 4.0 by the author.