At first glance, LLMs seem impressive: they can generate creative, human-like text, images and video from simple, open-ended natural-language input across multiple domains, tasks and contexts. Nevertheless, beyond anecdotal first impressions, how can LLMs’ capabilities be measured objectively and rigorously?
Evaluations are designed to meet this need. An evaluation is a methodology for testing a system-under-test (here, an LLM) for a particular purpose and interpreting the results. It consists of (a) a set of tests with metrics, and (b) a summarisation of the tests using those metrics (MLCommons, 2024). Evaluations attempt to measure the performance of LLMs on varying tasks such as reasoning, coding and knowledge retrieval. Measuring LLM safety concerns such as toxicity or “dual-use capabilities” (Barrett et al., 2024) is also of increasing importance. By testing specific LLM capabilities, evaluations also serve multiple broader purposes within the genAI ecosystem. These include marketing (e.g. GPT-4o mini has “superior textual intelligence and multimodal reasoning” (OpenAI, 2024)), regulatory compliance such as under the EU AI Act, assessing use-case suitability by app developers and consumers, and acting as an indicator of improvement in model inference capabilities (Hoffmann et al., 2022).
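To make this definition concrete, the sketch below shows a minimal evaluation harness under these assumptions; the function names (exact_match, evaluate) and the stand-in model are illustrative, not drawn from any existing benchmark suite.

```python
# Minimal sketch of the definition above: (a) a set of tests with a per-item metric,
# and (b) a summarisation of the results across the test set.
from typing import Callable

def exact_match(prediction: str, reference: str) -> float:
    """Per-item metric: 1.0 if the model output matches the reference answer."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Callable[[str], str], tests: list[tuple[str, str]]) -> dict:
    """Run each (prompt, reference) test through the model and summarise the scores."""
    scores = [exact_match(model(prompt), reference) for prompt, reference in tests]
    return {"n_tests": len(scores), "accuracy": sum(scores) / len(scores)}

# Example: a trivial stand-in "model" that always answers "Paris".
print(evaluate(lambda prompt: "Paris", [("Capital of France?", "Paris")]))
```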
LLM evaluation can be thought of as similar to traditional software testing, which aims to measure system behaviour and produce metrics that are reproducible, generalisable across different contexts, consistent over time and objective (McIntosh et al., 2024). However, unlike traditional software testing, LLM evaluations are especially challenging because LLMs are black boxes and their outputs are probabilistic. Errors cannot be traced back to a specific part of the code, and system characteristics can only be generalised through repeated testing of prompts for a particular task at scale, rather than through anecdotal examples alone.
Measurement theory concerns the conditions under which mathematical objects (including numbers) can be used to represent the properties of objects. These representations can then be used to express relationships between those objects (Tal, 2020). Measurement theory provides a suitable theoretical framework for assessing evaluations because evaluations use numbers to claim certain kinds of relations between different LLMs (e.g. that GPT-4o is more capable at reasoning than Claude 3.5). A good measurement has to be both reliable and valid. Reliability is often synonymous with terms such as consistency, stability and predictability. Validity is usually synonymous with truthfulness, accuracy, authenticity and soundness. A measurement can be reliable but invalid, but not the converse. In other words, reliability is a necessary but insufficient condition for validity (Hubley & Zumbo, 1996).

Figure 1: Reliability vs validity (Trochim, n.d.)
When evaluations are reliable and valid, they can be useful predictors of future performance on similar tasks. However, as explained below, LLM evaluations have various reliability and validity issues that undermine their robustness.
First, LLM evaluations have low internal consistency reliability. This type of reliability assesses consistency between different evaluations that purport to measure the same construct (Hubley & Zumbo, 1996). Some examples that demonstrate low internal consistency reliability are provided below:
Second, models may be overfitting to popular evaluations, because most such evaluations, like MMLU (Massive Multitask Language Understanding) and HellaSwag, are open-sourced and likely included in the training data. This contravenes a principle of machine learning: a model’s capabilities should be tested on an out-of-distribution dataset to assess whether it can generalise beyond its training data. The prevailing data science practice separates data into train, validation and test sets. The test set is not seen by the model during training and is used only after the model has been tuned on the validation set, so that it provides an unbiased indicator of model performance (Baheti, 2021). However, when answers are included in the training data, the evaluation is no longer out-of-distribution and no longer provides an unbiased indicator of performance. Overfitting thus reduces the evaluation’s predictive validity, because the evaluation is no longer measuring the LLM’s capability to generalise to unseen data distributions.
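As a naive illustration of why such leakage matters, the sketch below flags benchmark items that appear verbatim in a training corpus. Real contamination analyses rely on n-gram overlap and fuzzier matching; the corpus, items and function name here are placeholders.

```python
# Naive, illustrative contamination check: flag benchmark items whose text occurs
# verbatim in the training corpus. Real analyses use n-gram overlap and fuzzy matching.
def contaminated_items(benchmark_items: list[str], training_corpus: str) -> list[str]:
    return [item for item in benchmark_items if item in training_corpus]

items = ["Which planet is known as the Red Planet? (A) Venus (B) Mars (C) Jupiter"]
corpus = "... web text: Which planet is known as the Red Planet? (A) Venus (B) Mars (C) Jupiter ..."
print(contaminated_items(items, corpus))  # non-empty output signals possible leakage
```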
Third, evaluations may not actually be measuring the intended construct (a lack of construct validity). Because LLMs are black boxes, high scores on popular evaluations may be an imitation of capability (i.e. “stochastic parrots”) rather than a reflection of the models’ actual capabilities, and there is no easy or established method to discern the former from the latter. Evaluations may be measuring other, irrelevant constructs rather than what was intended.
For example, chain-of-thought (CoT) prompting is frequently used to increase performance on evaluations. Step-by-step explanations generated by the model could be seen as a demonstration of the model’s underlying reasoning capabilities. However, these explanations can be convincing yet unfaithful to the LLM’s underlying reasoning when certain biasing features are present in the CoT prompts. For instance, when all the answers to the few-shot examples used to pre-prompt the model were given as (A), the model generated CoT explanations that justified (A) as the correct option even when (A) was incorrect, without ever stating that it was, in fact, the biasing feature that caused it to choose (A) (Turpin et al., 2023). Therefore, although CoT produces higher scores on evaluations, the model is unlikely to be exhibiting underlying reasoning capability through CoT. Rather, CoT seems more likely to produce explanations based on irrelevant statistical correlations in the CoT prompts.
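The sketch below is a hedged reconstruction of this biasing setup: every few-shot example presents option (A) as correct before a new question is asked. The questions, options and prompt wording are invented for illustration and are not the prompts used by Turpin et al. (2023).

```python
# Illustrative construction of a biased few-shot CoT prompt: all worked examples
# present option (A) as the correct answer. Content is invented for illustration.
few_shot_examples = [
    {"question": "Which of these is a mammal?", "options": {"A": "Whale", "B": "Trout"}, "answer": "A"},
    {"question": "Which of these is a prime number?", "options": {"A": "7", "B": "8"}, "answer": "A"},
]

def build_biased_prompt(examples: list[dict], question: str, options: dict) -> str:
    parts = []
    for ex in examples:
        opts = " ".join(f"({k}) {v}" for k, v in ex["options"].items())
        parts.append(f"Q: {ex['question']} {opts}\n"
                     f"A: Let's think step by step... The answer is ({ex['answer']}).")
    opts = " ".join(f"({k}) {v}" for k, v in options.items())
    parts.append(f"Q: {question} {opts}\nA: Let's think step by step...")
    return "\n\n".join(parts)

# A model biased by the pattern may now justify (A) even though (B) is correct.
print(build_biased_prompt(few_shot_examples, "Which of these is a bird?",
                          {"A": "Bat", "B": "Sparrow"}))
```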
Typical scoring methods may also lack construct validity. Responses are usually aggregated by predefined metrics to summarise model performance on a particular evaluation. Typical metrics used in machine learning include RMSE and R-squared (for regression models) or precision, recall and F1 (for classification models). The prevailing convention in LLM evaluation restricts model outputs to an MCQ format, allowing conventional precision/recall/F1 scores to be used. “The critical weakness is whether the metric actually reliably tracks the property of interest, not the rigour with which the metric is evaluated” (Olah & Jermyn, 2024). Using such restricted metrics risks reducing high-dimensional concepts such as creativity, coherence or reasoning to a single number.
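The sketch below shows how this convention collapses MCQ responses into a single accuracy figure; the predictions and gold labels are invented for illustration.

```python
# Conventional MCQ scoring: option letters compared against gold labels and
# collapsed into one number. Predictions and labels below are invented.
predictions = ["A", "C", "B", "B", "D"]
gold_labels = ["A", "C", "D", "B", "A"]

accuracy = sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)
print(f"accuracy = {accuracy:.2f}")  # one number standing in for 'reasoning' or 'knowledge'
```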
Fourth, popular benchmarks such as MMLU and HellaSwag appear to have poor convergent validity with other evaluations that target closely related reasoning constructs.
Convergent validity refers to how well the scores of a test converge with measures of related constructs, in line with what theory would predict (Shou, Sellbom & Chen, 2022). Convergent validity is also used as evidence of high construct validity. For instance, depression is theoretically closely linked to anxiety. To demonstrate that a measure of depression has high construct validity, an established measure of anxiety can be used to show correlation between the scores of the two measures (Hubley, 2014).
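As a sketch of how such evidence is typically computed, the example below correlates invented scores from a new measure with invented scores from an established measure of a related construct; statistics.correlation is Pearson’s r (Python 3.10+).

```python
# Illustrative convergent validity check: correlate scores on a new measure with
# scores on an established measure of a theoretically related construct.
from statistics import correlation  # Pearson's r, available from Python 3.10

new_depression_measure = [12, 18, 25, 31, 40]       # invented scores
established_anxiety_measure = [10, 15, 27, 30, 44]  # invented scores

r = correlation(new_depression_measure, established_anxiety_measure)
print(f"r = {r:.2f}")  # a high r supports convergent (and hence construct) validity
```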
Nezhurina et al. (2024) formulated a simple reasoning problem called AIW (Alice-In-Wonderland): “Alice has X brothers and Y sisters. How many sisters does Alice’s brother have?” [where X and Y are integers varied across prompts]. A large discrepancy was found between models’ performance on popular evaluations (i.e. MMLU, HellaSwag) and on AIW. While the latest and largest models (GPT-4o, Claude 3 Opus, GPT-4) performed consistently well on the popular evaluations (85% and above), on AIW only these models scored around 40% or more, with the best, GPT-4o, scoring only 65%. AIW, MMLU and HellaSwag all purport to measure closely related constructs of reasoning capability. Though AIW is not an established measure of reasoning, the AIW problem is intuitively straightforward to solve. Hence, the large discrepancy between scores on AIW and MMLU raises concerns about a potential lack of construct validity in MMLU and HellaSwag.
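A sketch of the AIW template follows. The correct answer is simply Y + 1, since Alice’s brother has Alice’s Y sisters plus Alice herself, which is what makes the problem intuitively straightforward; the specific (X, Y) pairs below are illustrative.

```python
# Sketch of the AIW (Alice-In-Wonderland) template described above. The ground-truth
# answer is Y + 1: each brother has Alice's Y sisters plus Alice herself.
def aiw_prompt(x: int, y: int) -> tuple[str, int]:
    prompt = (f"Alice has {x} brothers and {y} sisters. "
              "How many sisters does Alice's brother have?")
    return prompt, y + 1

for x, y in [(3, 2), (4, 1), (2, 5)]:
    prompt, answer = aiw_prompt(x, y)
    print(prompt, "->", answer)
```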
Last, even if there is construct validity, external validity may be an issue because evaluation results may not be generalisable. The most common use-cases for such models are downstream chatbot applications such as drafting text across different disciplines, asking for advice, and explaining concepts, all of which take place over multi-turn conversations and require more creative output (Wiggers, 2024). Though the capability to conduct natural, factual and helpful conversations producing varied textual outputs undoubtedly requires the competency in knowledge, reasoning and inference measured by existing popular evaluations, measuring these characteristics in isolation through an MCQ format does not necessarily translate into competency in downstream applications. App developers also consider more than raw performance metrics when integrating foundation models into software. Their considerations include a balance between cost, model type (e.g. multimodal, text, multilingual), and latency of generation (Spisak et al., 2024), factors which are not usually assessed as part of existing evaluations.
Because evaluations are needed to achieve essential broader societal objectives, these issues with LLM evaluations have wide-ranging downstream impact. For instance, some financial experts note that genAI is not “designed to solve the complex problems that would justify the costs [of investment]” and predict that the genAI hype bubble may be about to burst (Goldman Sachs, 2024). Whether genAI indeed has such capability or is merely hype depends, amongst other things, on reliable and valid evaluations.
Evaluations also risk being viewed as ends of model development in themselves rather than as fallible means of measuring broader, abstract concepts. One well-examined instance of this phenomenon outside AI is how IQ tests came to be relied upon as a determinate predictor of future educational performance rather than used as originally designed: as a measure of current cognitive abilities for curating appropriate educational aids and tutoring for the child (van Hoogdalem & Bosman, 2023). Muller (2020) terms this “metric fixation”. A similar phenomenon could arise for stakeholders in the genAI ecosystem.
Existing “quick fixes” can be instituted to improve reliability and validity. For instance, qualitative human review of a subset of prompts and model responses, instead of reliance on a single metric, may improve reliability. Some evaluation suites also apply various types of irrelevant perturbations to design “adversarial” prompts that still aim to measure the same construct. Such perturbations change tokens but maintain semantic, reasoning and answer invariance, which would increase internal consistency reliability (Li et al., 2024). To increase predictive validity, evaluations should not be entirely open-sourced, to avoid overfitting; instead, a held-out test set should be kept closed-source. Formalising these quick fixes as standardised protocols can improve reliability and predictive validity.
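The sketch below illustrates the kind of answer-invariant perturbation described, changing surface tokens while leaving the question’s meaning and correct answer untouched; the specific perturbations are assumptions for illustration, not those used by Li et al. (2024).

```python
# Illustrative answer-invariant perturbation: surface tokens change, the construct
# being measured and the correct answer do not. Perturbations here are examples only.
import random

def perturb(prompt: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    preambles = ["Consider the following question.", "Here is a short puzzle.", ""]
    suffix = "Answer with a single number."
    return f"{rng.choice(preambles)} {prompt} {suffix}".strip()

original = "Alice has 3 brothers and 2 sisters. How many sisters does Alice's brother have?"
print(perturb(original))  # different tokens, same construct, same correct answer (3)
```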
However, evaluations can be reliable yet still invalid. To substantially improve validity in light of the multifaceted nature of evaluations, evaluations should shift away from the current testing paradigm. Some have proposed adapting methodologies used to test human capability within the cognitive sciences to test LLMs (Burden, 2024; Zhuang et al., 2023). Cognitive science actively grapples with the measurement of human intelligence as an abstract, social construct. For instance, psychometrics has designed tests with standardised protocols that allow an individual’s performance to be compared with an appropriate reference group, for varying purposes such as assessing cognitive impairment or functional capacity for everyday tasks. Test scores have to be interpreted and applied differently depending on the specific purpose and context (Institute of Medicine, 2015). In this spirit, the MLCommons LLM evaluation test specification schema (MLCommons, 2024) tries to achieve such clarity in the design of evaluations, though more can be done to learn from and adapt the experimental controls already identified in cognitive science.
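As a purely hypothetical illustration (not the actual MLCommons schema), the sketch below shows the kind of metadata a test specification might record so that scores can be interpreted against a stated purpose, population and protocol; all field names and values are assumptions.

```python
# Hypothetical test-specification structure, illustrating the kind of context a
# psychometrics-inspired protocol might require. Not the actual MLCommons schema.
from dataclasses import dataclass, field

@dataclass
class TestSpecification:
    name: str                # evaluation name
    construct: str           # what the test claims to measure
    intended_purpose: str    # e.g. safety screening, capability comparison
    population: str          # systems/contexts the scores should generalise to
    protocol: str            # standardised administration and scoring procedure
    metrics: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)

spec = TestSpecification(
    name="example-reasoning-eval",
    construct="multi-step arithmetic reasoning",
    intended_purpose="comparing chat models intended for tutoring applications",
    population="instruction-tuned chat models, English prompts",
    protocol="zero-shot, fixed temperature, five repeats per item",
    metrics=["accuracy", "variance across repeats"],
    known_limitations=["MCQ format only", "possible training-data contamination"],
)
print(spec.construct, "|", spec.intended_purpose)
```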
In the same vein, instead of considering evaluations as akin to comparing the speed and acceleration of simple technical objects such as cars, a more suitable comparison could be with the mature fields of sovereign credit ratings, university rankings or cognitive science. While each of these fields aims to measure its respective systems objectively, these measurements have faced similar problems. Sovereign credit ratings have been criticised for their failure to anticipate financial crises (external validity) (Haspolat, 2015). University rankings have similarly been critiqued for their inability to measure educational quality, as educational experience cannot be reduced to a simple ranking (construct validity). Some universities have reallocated resources to improve their rankings at the expense of quality research and teaching (metric fixation) (Robinson, 2014).
LLMs should be evaluated as sociotechnical systems rather than merely as technical objects like conventional software. Sociotechnical systems are a configuration of technologies, services and infrastructures, regulations and actors that fulfil a societal function (Schot et al., 2016). The interaction between the social and technical components determines the risks that manifest (Weidinger et al., 2023). Because finance and education are sociotechnical systems, the evaluation of credit risk is not a simple application of financial forecasting models, and the assessment of educational quality cannot be fully captured by university rankings. Similarly, LLM evaluations aim to measure capabilities with existing societal meanings such as intelligence, reasoning, toxicity and bias. GenAI is more than just technology; it includes services, infrastructures and regulators. Accordingly, typical software testing approaches may no longer apply. LLM evaluations should eventually move towards sociotechnical methods so that they can sufficiently measure all that we need them to measure.
