2.3 Methodology

We developed a human evaluation framework to assess each AI Assistant's performance on each task. 2 evaluators (both with in-house legal backgrounds) reviewed the outputs. The evaluation was structured in 2 rounds, consisting of an Accuracy Assessment and a Qualitative Assessment:

Accuracy Assessment (Pass/Fail)

For each Assistant's output, evaluators independently and blindly assessed whether the response met the minimum level of quality a competent in-house lawyer would expect for a satisfactory answer (the Accuracy Standard). The Accuracy Standard is characterized by 3 factors (the Accuracy Attributes):

  • Factual Correctness: The response must be accurate and free from material errors.
  • Relevance to the Query: The response must directly address the query.
  • Completeness: The response must provide enough information to address the query.

If the evaluators disagreed on whether the output met the Accuracy Standard, they discussed their reasoning, and the decision was escalated to another evaluator, also a legal practitioner, for final judgment. In total, each task was reviewed by a minimum of 2 and up to 4 lawyers with diverse legal backgrounds to ensure fairness and reduce individual bias.
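As a minimal sketch of this adjudication process (the function and variable names are ours, not part of the framework), the pass/fail decision for a single task can be expressed as:

```python
from typing import Optional

def adjudicate(evaluator_a: bool, evaluator_b: bool,
               escalation_verdict: Optional[bool] = None) -> bool:
    """Pass/fail decision for one output under the Accuracy Standard.

    evaluator_a / evaluator_b: independent, blind Pass (True) / Fail (False)
    verdicts from the two primary evaluators.
    escalation_verdict: verdict from the additional legal practitioner,
    consulted only when the primary evaluators disagree.
    """
    if evaluator_a == evaluator_b:
        return evaluator_a          # both agree: their shared verdict stands
    if escalation_verdict is None:
        raise ValueError("Disagreement must be escalated to another evaluator")
    return escalation_verdict       # disagreement: the escalated decision is final
```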

Qualitative Assessment

In addition to the binary pass/fail assessment, each output was reviewed by 2 evaluators and scored on quality dimensions that matter in legal practice:

  • Helpfulness (0 to 2 points): Did the answer help solve the lawyer's problem? This covers whether the AI Assistant went beyond copy-pasting text to present the information in a useful format and structure.
  • Adequate Length (0 to 2 points): Was the answer appropriately detailed, not too terse to be unhelpful, but not overly verbose?
  • Feature Support (0 to 2 points): Did the AI Assistant effectively leverage any special features during input handling or output generation to make the response more useful, accurate, or trustworthy?

These 3 dimensions - Helpfulness, Adequate Length, and Feature Support - together comprise the Usefulness Factors.

Scoring System

Based on this evaluation framework, each AI Assistant was assessed across all tasks using 2 key metrics:

  • The number of tasks it answered accurately (i.e., tasks where both evaluators marked the answer as Pass), used to compute an accuracy rate (the percentage of tasks passed)
  • An overall performance score, which combines accuracy and qualitative metrics. Each passed task earned 6 points for meeting the Accuracy Standard (0 for a Fail), and up to 6 additional points were awarded based on the Usefulness Factors (up to 2 points per dimension), yielding a maximum of 12 points per task.

This scoring system gives accuracy the greatest weight: a pass is worth as much as all 3 quality dimensions combined, reflecting the feedback from the in-house counsel we spoke to.
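As a minimal sketch of the scoring arithmetic (the names below are ours, chosen for illustration), the per-task score and the aggregate accuracy rate could be computed as follows:

```python
ACCURACY_POINTS = 6   # awarded when both evaluators mark the task as Pass
MAX_USEFULNESS = 6    # 3 Usefulness Factors x up to 2 points each

def task_score(passed: bool, helpfulness: int, length: int, feature_support: int) -> int:
    """Overall score for a single task, out of a maximum of 12 points."""
    usefulness = helpfulness + length + feature_support   # each dimension scored 0-2
    assert 0 <= usefulness <= MAX_USEFULNESS
    return (ACCURACY_POINTS if passed else 0) + usefulness

def accuracy_rate(pass_results: list[bool]) -> float:
    """Percentage of tasks where both evaluators marked the answer as Pass."""
    return 100.0 * sum(pass_results) / len(pass_results)
```

For example, a task that passes and scores 2, 1, and 2 on the three Usefulness Factors would earn 6 + 5 = 11 of the 12 available points.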