3.2 Overall Performance Ranking

Looking at accuracy alone does not tell the full story. A simple pass/fail does not capture how the answer was given. In practice, two AI assistants might both be correct (earning a pass), but one assistant's answer might be phrased far more clearly or include useful citations, making it considerably more valuable to a lawyer in daily work.

When we factored in the Usefulness Factors, the overall performance ranking of the AI assistants shifted significantly. Out of a theoretical maximum of ~432 points, the rankings are as follows:

[Chart: overall composite-score rankings of the AI assistants]

On this composite score, Oliver and GC AI, the two legal AI assistants, ranked highest overall, despite NotebookLM achieving the most passes for accuracy. Their advantage lay in stronger performance on the Usefulness Factors, especially clear, concise answers, pinpoint citations, and robust handling of multi-file queries. While NotebookLM also supported features like citations and multi-file input, its responses were occasionally too long or less structured. ChatGPT and DeepSeek lost points for overly verbose or generic outputs that lacked legal-specific framing.
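The composite ranking described above combines binary accuracy passes with graded scores on the Usefulness Factors. A minimal sketch of how such a score might be computed is shown below; the point-per-pass weight, the factor names, and the grading scale are illustrative assumptions, not the report's actual rubric:

```python
# Hypothetical composite-scoring sketch. The real weights and factor
# list behind the ~432-point maximum are not specified here, so the
# values below are assumptions for illustration only.

ACCURACY_POINTS = 2  # assumed points awarded per passed accuracy test
USEFULNESS_FACTORS = ["clarity", "citations", "multi_file"]  # assumed names


def composite_score(accuracy_passes: int, factor_scores: dict[str, int]) -> int:
    """Combine pass/fail accuracy with graded usefulness-factor scores."""
    accuracy_total = accuracy_passes * ACCURACY_POINTS
    usefulness_total = sum(factor_scores.get(f, 0) for f in USEFULNESS_FACTORS)
    return accuracy_total + usefulness_total


# An assistant with fewer accuracy passes can still rank higher overall
# if its usefulness grades are strong enough:
assistant_a = composite_score(100, {"clarity": 3, "citations": 3, "multi_file": 2})
assistant_b = composite_score(103, {"clarity": 1, "citations": 0, "multi_file": 0})
print(assistant_a, assistant_b)
```

This is the pattern the ranking illustrates: NotebookLM's accuracy lead was outweighed by the legal assistants' stronger usefulness grades once both components fed into one total.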

Copilot, though powered by GPT models, underperformed on both accuracy and quality factors. [5]

[5] Copilot's lower accuracy score was partially due to technical constraints that prevented it from handling some documents.