Evaluation Metrics in CIRR - Exploring Metrics in the CIRR Field
Understanding Metrics in the CIRR Domain
- While studying OSrCIR and CIReVL, I kept running into several evaluation metrics commonly used in the CIRR domain.
- Since these metrics also appear frequently in other research, I studied their definitions more carefully!
1. Recall@K (R@K)
- Definition: A metric that measures the proportion of queries where the ground-truth image is within the Top-K retrieved results.
- Strengths: Simple to compute and highly intuitive; widely adopted as the standard in most CIR benchmarks.
- Limitations: Fails to fully capture cases with multiple correct answers, and may underestimate performance in datasets with many false negatives (e.g., CIRR).
- Common Usage: CIRR (ICCV 2021), FashionIQ.
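To make the definition concrete, below is a minimal Python sketch of Recall@K. The function name and the assumption that results arrive as per-query ranked lists of image IDs are mine for illustration, not taken from any benchmark's official code.

```python
from typing import Hashable, Sequence

def recall_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],  # Top-ranked image IDs per query
    ground_truths: Sequence[Hashable],           # one ground-truth ID per query
    k: int,
) -> float:
    """Fraction of queries whose ground-truth image appears in the Top-K."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truths))
    return hits / len(ground_truths)

# Toy example: 2 of the 3 queries have their target inside the Top-2.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "f", "g"]
print(recall_at_k(ranked, targets, k=2))  # 0.666...
```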
2. Subset Recall@K
- Definition: Similar to Recall@K, but instead of the whole database, evaluation is restricted to a smaller predefined subset of related images (e.g., the same scene group).
- The subset is not chosen arbitrarily by the researcher, but is predefined within the dataset itself.
- Strengths: Allows for fine-grained comparison of model performance within a small candidate pool.
- Limitations: Does not directly reflect retrieval performance across the entire DB, and results may vary depending on how subsets are defined.
- Common Usage: CIRR Subset Evaluation.
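Subset Recall@K reuses the same hit-counting logic but ranks only the dataset-provided subset. Here is a sketch under the assumption that each query comes with a similarity score per candidate and its predefined subset of image IDs (all names are illustrative, not CIRR's official evaluation code):

```python
def subset_recall_at_k(similarities, subsets, ground_truths, k):
    """Recall@K restricted to each query's predefined candidate subset.

    similarities:  per-query dict mapping image ID -> similarity score
    subsets:       per-query list of dataset-defined candidate IDs
    ground_truths: one ground-truth image ID per query
    """
    hits = 0
    for sims, subset, gt in zip(similarities, subsets, ground_truths):
        # Rank only the predefined subset, not the whole database.
        ranked = sorted(subset, key=lambda img: sims[img], reverse=True)
        hits += gt in ranked[:k]
    return hits / len(ground_truths)
```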
3. mAP@K (mean Average Precision at K)
- Definition: Considers the ranking positions of all correct answers within the Top-K results and computes the average precision:
\(mAP@K = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{|G_i|}\sum_{j=1}^{|G_i|} \mathrm{Precision}@r_{ij}\), where \(r_{ij}\) is the rank of the j-th ground-truth image for the i-th query (ground truths ranked beyond the Top-K contribute a precision of 0).
- Strengths: Fair in multi-ground-truth scenarios; reflects retrieval quality more precisely than Recall@K.
- Limitations: More complex to compute and less intuitive than Recall.
- Common Usage: CIRCO (ICCV 2023), GeneCIS.
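The formula above translates fairly directly into code. This sketch follows it literally (averaging Precision@rank over the \(|G_i|\) ground truths, with targets missed inside the Top-K contributing 0); the official CIRCO implementation may differ in details, so treat it as illustrative:

```python
def map_at_k(ranked_lists, ground_truth_sets, k):
    """mean Average Precision at K for multi-ground-truth retrieval."""
    ap_sum = 0.0
    for ranked, gts in zip(ranked_lists, ground_truth_sets):
        num_correct = 0
        precision_sum = 0.0
        for rank, img in enumerate(ranked[:k], start=1):
            if img in gts:
                num_correct += 1
                precision_sum += num_correct / rank  # Precision@rank
        ap_sum += precision_sum / len(gts)  # missed targets contribute 0
    return ap_sum / len(ranked_lists)

# Targets {"a", "c"} retrieved at ranks 1 and 3:
# AP = (1/1 + 2/3) / 2 = 0.833...
print(map_at_k([["a", "b", "c"]], [{"a", "c"}], k=3))
```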
4. TIFA (Text-to-Image Faithfulness Assessment, Hu et al. 2023)
I will write a separate post on TIFA in more detail soon!!
- Definition: A metric for evaluating the faithfulness of generated images to the input text prompt.
- Questions like "Is there a dog?" or "Is the dog brown?" are generated automatically from the text; a VQA (Visual Question Answering) model then answers them on the image, and the answers are checked against those expected from the text.
- Strengths: Goes beyond Recall by measuring textual faithfulness, and shows strong correlation with human evaluation.
- Limitations: Highly dependent on the VQA modelโs performance; incurs additional computational cost; not yet a fully standardized evaluation metric in CIR.
- Common Usage: Hu et al. (ICCV 2023), Vision-by-Language (CIReVL, ICLR 2024).
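To show the shape of the pipeline, here is a TIFA-style scoring loop. The `generate_qa_pairs` and `vqa_answer` callables are placeholders for an LLM question generator and a VQA model; this is a hedged sketch of the idea, not TIFA's official implementation.

```python
def tifa_style_score(image, text, generate_qa_pairs, vqa_answer):
    """Faithfulness = fraction of text-derived questions the VQA model
    answers correctly on the image (placeholder callables, see lead-in)."""
    qa_pairs = generate_qa_pairs(text)  # [(question, expected_answer), ...]
    correct = sum(
        vqa_answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy usage with hand-written QA pairs and a stub VQA model:
gen_qa = lambda text: [("Is there a dog?", "yes"), ("Is the dog brown?", "yes")]
stub_vqa = lambda image, question: "yes"
print(tifa_style_score(None, "a brown dog", gen_qa, stub_vqa))  # 1.0
```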
Summary:
- Recall@K → the basic standard metric in CIR.
- Subset Recall@K → used in CIRR for fine-grained model comparison.
- mAP@K → ensures fairness in multi-ground-truth settings (CIRCO, GeneCIS).
- TIFA → evaluates textual faithfulness, providing a more human-aligned interpretation.