
Who Evaluates the Evaluators? GenAI, Legal, and You

August 25, 2025


Note: This post draws from a soon-to-be-published article, ‘Vibe Evaluation: A Reflection on the Vals GenAI Legal Study,’ co-authored with Dr. Adam Roegiest, which analyses in depth Vals AI’s Legal AI Report released earlier this year.

This past February, Vals AI released the first-ever Legal AI Report, evaluating human and GenAI performance on various legal tasks. The report prompted declarations that AI is ‘better than humans’, no doubt leaving countless lawyers shuddering about their long-term career prospects.

In all the hullabaloo, a crucial point was missed: whether the methodology justified the sensational headlines.[1] After all, an evaluation is bounded by the systems and contexts in which it is conducted. Before jumping to conclusions about GenAI in any context, let alone the legal context, it is crucial to take a step back and consider what it means to solve a problem and how quality is measured.

Matching Methodologies to What You Need to Decide

Evaluation is the basis for understanding a tool’s effectiveness, and it is therefore the driver of better solutions, superior decision-making, and true accountability. Given that an evaluation’s purpose is to facilitate decisions, a key issue is the relationship between the findings and the decisions those findings inform. Put another way: does the evaluation capture what you need to know to decide the question(s) at hand?

The Vals AI report assessed human and AI performance across eight tasks deemed representative of the work lawyers might perform. Our question is whether the evaluation enables readers to apply the findings to their own problems and decision-making.

In our upcoming paper, Adam Roegiest and I examine that question in detail. For the purposes of this post, the point is simpler: always understand both the reach and the limitations of an evaluative methodology. The paper will shed light on the critical thinking that practitioners need in order to translate the lessons of an evaluation into their own business objectives.

Transparency

The fewer details furnished by evaluators, the more you must trust that your assumptions about the methodology and its applicability to your needs are aligned with those of the evaluator.

In our article, we will highlight how the lack of detail in the original study leaves the results open to interpretation, either positively or negatively depending on one’s assumptions. We also offer readers guidance on how to think through these questions for themselves.

The particulars of scoring methodologies matter. The devil is in the details.
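To make that concrete, here is a minimal sketch in Python, using entirely hypothetical answers and rubrics (not the Vals AI data or methodology), of how two plausible-sounding scoring rules can award the same set of model answers very different headline accuracies.

```python
# A minimal sketch with hypothetical data and rubrics (not the Vals AI methodology),
# showing how the choice of scoring rule changes the headline number for the
# exact same set of model answers.

# Each record pairs a model's answer with the reference answer a grader expects.
answers = [
    {"model": "The notice period is 30 days.",         "reference": "30 days"},
    {"model": "Roughly thirty days, per Section 8.2.", "reference": "30 days"},
    {"model": "The agreement is silent on notice.",    "reference": "30 days"},
    {"model": "Governing law is Delaware.",            "reference": "Delaware law governs"},
]

def strict_score(model: str, reference: str) -> int:
    """Credit only an exact (case-insensitive) match with the reference text."""
    return int(model.strip().lower() == reference.strip().lower())

def lenient_score(model: str, reference: str) -> int:
    """Credit any answer that mentions at least one token from the reference."""
    key_tokens = reference.lower().split()
    return int(any(token in model.lower() for token in key_tokens))

def accuracy(scorer) -> float:
    scores = [scorer(a["model"], a["reference"]) for a in answers]
    return sum(scores) / len(scores)

print(f"strict rubric:  {accuracy(strict_score):.0%}")   # 0%  -- nothing matches verbatim
print(f"lenient rubric: {accuracy(lenient_score):.0%}")  # 75% -- partial credit changes the story
```

Neither number is wrong in isolation; the point is that a reader cannot interpret either one, or compare it to a human baseline, without knowing which rubric produced it.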

Conclusion: See the Forest and the Trees

Conducting an evaluation of Generative AI tools is tremendously difficult and will remain so. No doubt more studies on the legal application of AI are yet to come. Whatever results they provide, the applicability of those results will depend on the methodology employed, the extent to which the results support the conclusions, and, most importantly, whether readers can apply those conclusions to their own work without having to be clairvoyant. Our forthcoming article will explore these questions in detail.

[1] For Dr. Roegiest’s and my answer to that question, you’ll have to wait for the publication of the aforementioned paper. That said, note that the report itself did not claim to be anything more than a ‘vibe check’ on the state of AI for legal as far as results quality goes, rather than a definitive statement vis-à-vis human effectiveness.

