
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Speaker: Dr. Ramayya Krishnan – W. W. Cooper and Ruth F. Cooper Professor, Carnegie Mellon University

Date: Wednesday, Aug 6, 2025
Time: 11:15 AM – 12:30 PM SGT
Venue: COM1 Seminar Room 1 (#02-06), 13 Computing Drive, Singapore 117417


Please register for the lecture here.

Abstract: 

Large Language Models (LLMs) have shown remarkable capabilities across a wide range of tasks, but their deployment in high-stakes domains depends on sound evaluation. In this talk, I will present work on evaluating one aspect of LLM robustness: whether a model behaves consistently and coherently across multiple rounds of user interaction. This work introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under a variety of challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating the model's internal confidence scores into the generation process. Focusing on the case in which models switch from correct to incorrect answers under adversarial prompting, we statistically analyze model behavior using ideas drawn from survival analysis to develop actionable insights for societally consequential applications. More information is available in a forthcoming ACL paper.
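To make the idea of position-weighted consistency concrete, the sketch below shows one hypothetical way such a score could be computed, assuming geometrically decaying position weights so that answer flips in early turns are penalized more heavily than later ones. The function name, decay scheme, and exact-match agreement criterion here are illustrative assumptions, not the PWC definition from the forthcoming paper.

```python
# Hypothetical illustration of a position-weighted consistency score.
# NOTE: this is NOT the paper's PWC metric; it only assumes that earlier
# turns carry more weight and that matching the initial answer counts as
# "consistent".

def position_weighted_consistency(answers, decay=0.7):
    """Score a sequence of answers to the same question across turns.

    answers: list of answers, one per turn (turn 0 is the initial answer).
    decay:   geometric decay factor; earlier follow-up turns weigh more.
    Returns a value in [0, 1]; 1.0 means the model never changed its answer.
    """
    if len(answers) < 2:
        return 1.0
    reference = answers[0]
    weights = [decay ** t for t in range(1, len(answers))]
    agreements = [1.0 if a == reference else 0.0 for a in answers[1:]]
    return sum(w * s for w, s in zip(weights, agreements)) / sum(weights)


# Example: the model flips at turn 2 under an adversarial follow-up, then recovers.
print(position_weighted_consistency(["B", "B", "A", "B"]))  # < 1.0; the early flip is penalized
```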

This research was conducted in collaboration with Yubo Li, Yidi Miao, Xueying Ding, and Rema Padman.

Biography:

Dr. Ramayya Krishnan is the Dean Emeritus and W. W. Cooper and Ruth F. Cooper Professor of Management Science and Information Systems at the Heinz College of Information Systems and Public Policy at Carnegie Mellon University. He is an expert in data and decision analytics and digital transformation. He served as President of INFORMS in 2019 and helped lead the creation of its AI strategy.

He is an AAAS Fellow (Section T), an INFORMS Fellow, and an elected member of the National Academy of Public Administration. He chaired the AI Futures Committee of the National AI Advisory Committee to the President and the White House Office of AI Initiatives, and he chairs the DoD's RAI academic council. He directs the CMU-NIST cooperative research center on AI measurement science and engineering. Please see his full biography here: https://www.heinz.cmu.edu/faculty-research/profiles/krishnan-ramayya
