Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/175186
Title: Evaluating vision-language models long-chain reasoning ability with multiple ground truths
Authors: Setiadharma, Christopher Arif
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Setiadharma, C. A. (2024). Evaluating vision-language models long-chain reasoning ability with multiple ground truths. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175186
Project: SCSE23-0243 
Abstract: With recent advances in vision-language models, many researchers have begun evaluating their zero-shot ability to answer questions about a video input. However, there is not yet a standardised, “best practice” method for evaluating the quality of a model’s open-ended answer given a question and multiple ground truths. We reviewed current methods, including n-gram-based metrics and using a Large Language Model (LLM) as a judge. While n-gram-based metrics scored some models’ answers on par with a human’s answer, these scores do not correlate highly with human preference when used to rank the models from best to worst: the highest-scoring models were found to have a Spearman correlation of only 0.21 with human preference. We also designed prompts that ask an LLM to judge which model’s answer is better given multiple reference answers, through (1) head-to-head comparison, which was found to have some consistency with human preference, and (2) ranking all possible answers, which was found to have a higher correlation than n-gram-based metrics. We offer the perspective that, while additional ground truths are useful for traditional (n-gram-based) metrics, one ground truth might be sufficient for a sophisticated LLM to judge the quality of a model’s answer. This is especially so moving forward, given the rapid advancement in the capability of such language models.
URI: https://hdl.handle.net/10356/175186
Schools: School of Computer Science and Engineering 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Christopher_Arif_Setiadharma_FYP_Report.pdf (Restricted Access)
Size: 1.78 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.