Introduction - Legal AI Benchmarking

AI is entering the in-house legal domain quickly, but it's still unclear how well it performs on the work lawyers actually do. Most of what in-house lawyers can find today on AI performance in the legal domain consists of polished vendor demos, academic papers, or ideal-condition benchmarks, which don't reflect everyday workflows.

And they rarely answer the question:

Can the tool do the job, and how much human input does it need?

This report is a step toward answering that question.

We started by testing AI on information extraction tasks because information extraction underpins much of legal work, from contract review to issue spotting, and is often a starting point for legal AI adoption.

We designed the evaluation around two principles:

Real-world inputs. Tasks were submitted by in-house counsel, using documents with redactions and formatting issues.
Practical usefulness. We assessed not just accuracy, but whether the output was usable: clear, appropriately scoped, and supported by features like citations or multi-document processing. [2]

The key findings are as follows:

[2] Although we considered mirroring classic NLP-style Q&A tasks, that approach didn't reflect the realities of in-house legal work. A question like "Is there a limitation of liability clause in this agreement?" might warrant a yes or no, but in practice lawyers need more: what the clause says, where to find it, and so on.

1.1 Introduction