Benchmarking Humans & AI in Contract Drafting

Preliminary Findings, September 2025

1. Executive Summary

Many legal teams today face an undeniable challenge: how to deliver more with less.

Contract drafting, a cornerstone of legal value, remains one of the most time-intensive parts of legal work, and lawyers have already turned to AI to extend their capacity.

Our findings reveal the following:

  • __% of lawyers use AI tools for legal work, making AI usage nearly universal in the legal profession.
  • __% of lawyers use two or more AI tools, reflecting active experimentation across multiple platforms and solutions.
  • __% of lawyers use legal AI tools, showing awareness but limited and fragmented adoption despite a crowded market of entrants.

With lawyers experimenting across general and legal AI, the critical question is how these tools actually perform in contract drafting.

The answer is not straightforward. Contract drafting is both an art and a science: there is rarely a single "correct" draft, only one that is "fit for purpose" in context. Yet within that subjectivity, lawyers share common unspoken standards for what makes a draft robust, or "way off."

This benchmarking report introduces a framework for measuring those unspoken, subjective standards. It translates subjective judgments about draft quality into quantifiable metrics, making visible the elements of professional legal standards and testing how both AI solutions and human lawyers measure up.

We focus on three dimensions of a tool’s usability for contract drafting: whether the outputs are legally and factually sound (Output Reliability), the extent to which they help lawyers deliver a good working draft (Output Usefulness), and how well the platform itself supports context, verification, and day-to-day workflow (Platform Workflow Support).

Three Dimensions of AI Tool Utility

[Figure: Venn diagram of Platform Workflow Support, Output Reliability, and Output Usefulness; the overlap of all three represents Max Utility]

Key Findings

Based on the evaluations of 450 task outputs, 72 legal community survey responses, and 12 interviews with in-house legal leaders, our key findings are as follows:

  1. AI tools matched and, in some cases, outperformed lawyers in producing reliable first drafts. Humans were reliable in 56.7% of tasks, but several AI solutions met or exceeded this baseline.
  2. The top AI tool marginally outperformed the top human. The top human lawyer produced a reliable first draft 70% of the time, whereas the top AI tool produced a reliable first draft 73.3% of the time.
  3. Legal AI tools surfaced material risks that lawyers missed entirely. In drafting scenarios with high risks, legal AI tools were far more likely to exercise legal judgment, raising explicit risk warnings in 83% of the outputs compared to 55% for general-purpose AI tools. Humans, by contrast, raised none.
  4. Specialized legal AI tools did not meaningfully outperform general-purpose AI tools on either output reliability or usefulness. General-purpose AI solutions had a slight edge in output reliability, while legal AI solutions scored marginally higher on output usefulness.
  5. Platform Workflow Support is the key differentiator for specialized tools, not output performance. Two-thirds of the solutions we tested (66.7%) integrate into Microsoft Word, the primary work environment for lawyers, and most also provide context handling functionalities that most general-purpose AI tools lack.

2. What We Benchmarked

We evaluated 13 AI tools that reflect what lawyers can realistically access today, spanning purpose-built legal AI platforms and general-purpose AI assistants/chatbots, alongside a Human Baseline.

Selection prioritized global accessibility, category coverage (legal-specific and general), market maturity (emerging to established), and practical relevance for day-to-day contracting. Platform and model performance/features were evaluated in July–August 2025; readers should note that capabilities constantly evolve. See Appendix A for task dataset, methodology & benchmark limitations.

2.1. What's in Scope

Legal AI Platforms

Note: Tasks were collected without enabling all available features on each platform (e.g., prompt enhancers, agent modes, deep research & jurisdiction toggles). Results reflect a basic user experience rather than an advanced user experience.

Anon

An anonymized, long-standing enterprise legal-AI platform, included to broaden the comparison beyond emerging tools and reflect the more mature options large legal teams consider. We did not receive permission to name the product by publication time. To respect confidentiality while retaining comparative value, we assessed its drafting outputs only and excluded it from Round 3 workflow support scoring.

August

August is a web-based legal AI platform offering modular workflows that adapt to each organization’s playbooks and standards.

Brackets

Brackets AI is a generative AI-powered Microsoft Word add-in that helps legal teams draft, review, and redline contracts faster and smarter. By leveraging your organization's templates, model clauses, and playbooks, it generates tailored output that aligns with your standards and requirements. Brackets AI provides purpose-built features for contract drafting, reviewing, redlining, language checks, summarization, translation, comparison, and more – all seamlessly integrated into your existing Microsoft Word workflows.

GC AI

GC AI is the leading legal AI for in-house counsel, helping them work faster, accomplish more, and drive their companies' business. With industry-leading accuracy, source citations, AI in Word, plus security and compliance, GC AI serves 650+ legal teams globally. More than 50 public companies, 25 unicorns, and top enterprises like Webflow, Hitachi, Vercel, Logitech, and Liquid Death, are GC AI customers.

InstaSpace

InstaSpace is an AI-native contract management platform designed to streamline drafting, review, negotiation, signing, and tracking of contracts in both Arabic and English. It automates repetitive tasks and centralizes contract data into a single source of truth, helping businesses reduce financial and legal risks, ensure compliance, and accelerate deal cycles. The current Alpha release marks an early stage in its development.

SimpleDocs

SimpleDocs is an AI-native contract automation platform built for in-house legal teams and law firms. The platform combines AI-powered drafting, redlining, and review with configurable playbooks and an AI-first contract repository, enabling legal teams to manage contracts from first draft to negotiation and storage. By grounding its tools in real-world contract data and adapting to each team’s standards, SimpleDocs ensures faster turnaround, greater accuracy, and more consistent results—delivering measurable ROI and confidence across every stage of the contracting lifecycle.

Wordsmith

Wordsmith is an Edinburgh-based legal tech startup that builds AI tools primarily designed for in-house legal teams.

General AI Assistants

ChatGPT (GPT-4.1 & GPT-5)
Claude (Opus 4.1)
Copilot (free version)
Gemini (2.5 Pro)
Le Chat (Mistral, mistralai/magistral-medium-2506)
Qwen (qwen3-235b-a22b)

Human Lawyers

A group of in-house commercial lawyers, averaging 10 years of working experience, each completed the same tasks based on the same instructions provided to the AI solutions. The human outputs were then scored under the same criteria as the AI tools and formed a reference point for the performance of the average in-house lawyer.

2.2. Three Evaluation Rounds

We evaluated the AI tools across three dimensions, each assessed in a separate evaluation round.

Evaluation Framework

Dimension | What it answers | How it was scored
Output Reliability | Is the output accurate and legally adequate as a starting point? | Instruction compliance, factual accuracy, and legal adequacy on a Pass/Fail basis per task.
Output Usefulness | To what extent does the draft ease the reviewer's workload while elevating quality? | Clarity (1–3), Helpfulness (1–3), and Appropriate Length (1–3), summed to a total of max 9 points.
Platform Workflow Support | How well does the solution support the full drafting workflow, from generation (e.g., context management and draft output) to quality assurance (e.g., refinement and proofreading)? | Draft Generation Support (1–5 points) and Quality Assurance Support (1–5 points), summed to a total of max 10 points.
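
To make the rubric concrete, here is a minimal Python sketch of how scores roll up under the three dimensions described above. The class, field names, and example values are hypothetical illustrations, not the benchmark's actual data schema or results.

```python
from dataclasses import dataclass

@dataclass
class ToolScores:
    """One tool's raw results across the three benchmark dimensions (illustrative only)."""
    reliability_passes: int        # tasks that passed all reliability criteria (Pass/Fail)
    tasks: int                     # total tasks attempted
    usefulness_totals: list[int]   # per-task Clarity + Helpfulness + Length totals (3-9 each)
    draft_generation: int          # Platform Workflow Support sub-score (1-5)
    quality_assurance: int         # Platform Workflow Support sub-score (1-5)

    @property
    def reliability_rate(self) -> float:
        # Output Reliability Rate: share of tasks passed
        return self.reliability_passes / self.tasks

    @property
    def usefulness_score(self) -> float:
        # Average Output Usefulness Score across tasks (max 9)
        return sum(self.usefulness_totals) / len(self.usefulness_totals)

    @property
    def workflow_support(self) -> int:
        # Platform Workflow Support rating (max 10)
        return self.draft_generation + self.quality_assurance

# Hypothetical example, not a real benchmark result
example = ToolScores(reliability_passes=22, tasks=30,
                     usefulness_totals=[8, 7, 9, 8],
                     draft_generation=4, quality_assurance=4)
print(f"{example.reliability_rate:.1%}", example.usefulness_score, example.workflow_support)
```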

3. Results & Analysis

3.1. Overall Performance

Key Insight: The story of applied AI is still being written: legal-specific platforms are not far ahead of general-purpose AI assistants in output quality, but they are beginning to distinguish themselves through platform functionalities built for lawyers.

Overall, AI tools match, and in some cases surpass, human lawyers in producing reliable, useful first drafts.

Overall Performance Matrix

[Chart: Overall Performance Matrix. Platform Workflow Support legend: Band 1 (highest support), Band 2 (moderate support), Band 3 (lowest support)]

Highlights:

  • Output Reliability: The top-performing tools were Gemini 2.5 Pro, GPT-5, GC AI, Brackets, August, and SimpleDocs. Gemini 2.5 Pro led with a Reliability Rate of 73.3%, 16 percentage points above the overall average of 57.3%. The human lawyers achieved 56.7% (rising to 61.5% with AI assistance).
  • Output Usefulness: The top-performing tools were August, GC AI, and Gemini 2.5 Pro, with August leading on average total score (8.13/9 points). The human lawyers scored 7.53/9. The top legal-AI platforms produced more immediately useful outputs than the human lawyers and most general-purpose AI assistants.
  • Platform Workflow Support: Brackets, GC AI, and SimpleDocs were the top performers, with Brackets obtaining the highest marks at 8/10 points. General-purpose AI assistants trailed because they required more manual "glue work" from lawyers to fit into workflows.

🧑‍💻 Tool selection is multi-dimensional, as one Fortune 500 GC puts it:

Accuracy is only one [of several] factors we consider. Every lawyer is responsible for reviewing the work product they produce, with or without AI.

3.2. Output Reliability Assessment

Key Insight: AI did not achieve perfect accuracy, but neither did humans. Top AI tools can match or beat skilled lawyers in the Reliability Rate of preparing first drafts. Gemini 2.5 Pro achieved the highest overall Output Reliability Rate at 73.3%, 16.6 percentage points higher than the human lawyers.

Context: The Output Reliability Assessment establishes whether a draft is factually accurate, relevant, and legally adequate to qualify as a draft that lawyers can trust without having to second-guess the content of every clause (Output Reliability).

The rankings below report the percentage of tasks where a draft met all reliability criteria (instruction compliance, factual accuracy and legal adequacy) (Output Reliability Rate).

Output Reliability Ranking

[Chart: Output Reliability Ranking, tools grouped into Band 1, Band 2, and Band 3]

Methodology Note:

A draft passed only if it met all Key Elements and each of the defined reliability criteria (see Appendix A for details). Banding is provided as a visual aid to give readers a quick snapshot of relative performance. The groupings are based on observed data gaps but are not definitive. Readers are encouraged to review the underlying results for a full picture of each tool’s performance.
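
As an illustration of gap-based banding, the sketch below (Python, with hypothetical scores and a hypothetical 5-point gap threshold) sorts tools by score and starts a new band wherever the drop to the next tool exceeds the threshold. The report's actual groupings were made by reviewers inspecting the data, not by this exact rule.

```python
def band_by_gaps(scores: dict[str, float], gap: float = 5.0) -> list[list[str]]:
    """Group tools into bands, opening a new band when the score drop exceeds `gap` points."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    bands, current = [], [ranked[0][0]]
    for (_, prev_score), (name, score) in zip(ranked, ranked[1:]):
        if prev_score - score > gap:
            bands.append(current)
            current = []
        current.append(name)
    bands.append(current)
    return bands

# Hypothetical reliability rates (%), not the benchmark's actual figures
print(band_by_gaps({"Tool A": 73.3, "Tool B": 71.0, "Tool C": 62.0, "Tool D": 44.0}))
# -> [['Tool A', 'Tool B'], ['Tool C'], ['Tool D']]
```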

Highlights:

  • Top performers include Gemini 2.5 Pro, GPT-5, GC AI, Brackets, August, and SimpleDocs. Together, they achieved an average Output Reliability Rate of 66.1%, about 8.8 percentage points above the overall average of 57.3%. The top individual performer is Gemini 2.5 Pro at 73.3%.
  • GPT-4.1, Anon, Opus 4.1, and the human lawyers were in the middle of the pack. The human lawyers sit at 56.7%, rising to about 61.5% with AI assistance.

Discussion:

We observed the following patterns in Output Reliability based on an analysis of reviewer critiques of the task outputs:

✅ A reliable draft… | ❌ An unreliable draft…
Uses the exact facts provided in the instructions | Misstates a key figure or fact
Converts colloquial phrasing to the proper legal term | Deviates from the scope of the task and misses key elements expected in the output
Preserves the pro-client stance as instructed and avoids concessions | Slips in counterparty-friendly language or creates unnecessary concessions that advantage the other side
Covers expected key elements with precise terms | Copies ambiguity verbatim and leaves non-standard phrasing in the draft output

Output Snapshots — Task 22: Jury Trial Waiver (NY)

Query:

Modify the following clause:

Governing Law; Dispute Resolution. The provisions regarding the governing law, dispute resolution methods, renounce of jury trial, and arbitration are expressly incorporated herein.

✅ Gemini 2.5 Pro Output Excerpt:

...Governing Law; Dispute Resolution. The terms and provisions related to governing law, dispute resolution, waiver of a jury trial, and arbitration, as set forth in…

This version directly incorporates your requested change. It also uses the more standard term "waiver" instead of "renounce" for legal clarity.

Reviewer Commentary: Converted the colloquial "renounce/renunciation" phrasing to the accepted legal term, placed it in the right section, and stayed within scope.

❌ Qwen Output Excerpt:

Governing Law; Dispute Resolution

The parties hereby expressly incorporate by reference the provisions governing law, dispute resolution methods, renouncement of a jury trial, and arbitration contained in…

Reviewer Commentary: Echoed the user's non-standard wording instead of converting to the accepted term.

Implications:

Output Reliability Rates across tools vary widely, from 44% to 73.3%. Tool choice directly changes how much early verification and rework a lawyer faces. In short, the reliability assessment serves as a guide for how much oversight a given AI's output requires.

Importantly, an AI's imperfections are not a reason to dismiss its value outright – they are a reality to be managed.

In other words, the fact that AI drafts sometimes miss the mark doesn't negate their significant benefits. It simply means lawyers must continue to exercise critical judgment. By knowing the typical failure points, lawyers can double-check those areas or provide more precise prompts upfront.

👩‍💻 As one senior legal counsel puts it:

I wouldn't fire an intern, junior, or an AI over citing a wrong fact. The draft still saves me more time than doing it myself. I'm a scared user who always checks the key points, and I'll often cross-check an output with multiple other AIs. It's not a big issue, it's just part of the work.

Over time, as AI models improve and as lawyers become more adept at using them, we expect the "reliability gap" to narrow. But until then, oversight remains the safety net.

3.3. Output Usefulness Assessment

Key Insight: The most useful AI outputs make lawyers faster, sharper, and more confident in their judgment.

Context:

The Output Usefulness Assessment is a rubric-guided evaluation that quantifies the subjective elements of practical utility in contract drafting assisted by an AI tool (Output Usefulness). The rankings below report each tool's average usefulness score based on a rubric of clarity, helpfulness, and appropriate length (Output Usefulness Score).

Output Usefulness Ranking

[Chart: Output Usefulness Ranking, tools grouped into Band 1, Band 2, and Band 3]

Methodology Note: A panel of legal experts scored each output manually based on clarity, helpfulness, and adequate length (each 1–3 points where 1 = not good, 2 = ok, 3 = great; total 3–9 points). Scores were averaged by solution. To mitigate bias, we used double-blinded reviewer assignments (see Appendix A for details). Banding is provided as a visual aid to give readers a quick snapshot of relative performance. The groupings are based on observed data gaps but are not definitive. Readers are encouraged to review the underlying results for a full picture of each tool’s performance.

Highlights

  • August, GC AI, and Gemini are the top performers. Together, they achieved above-average Output Usefulness Scores.
  • SimpleDocs, GPT-4.1, Brackets, Qwen, and Copilot's results cluster around the overall average.
  • Opus 4.1, Mistral, and Anon averaged roughly 6.5–7.03/9 points, below the overall average.
  • The chart below shows how each platform performed across these sub-dimensions. Individually, August was the top performer for clarity with an average score of 2.83 out of 3. Gemini 2.5 Pro ranked highest for helpfulness at 2.6 out of 3 points. On length, August, SimpleDocs, GPT-4.1, and Copilot all shared the lead with an average of 2.77 out of 3 points.

Output Usefulness Assessment by Platform

[Chart: per-platform average scores for Clarity, Helpfulness, and Adequate Length]

Discussion:

An analysis of expert reviewer critiques across 450 task outputs revealed that lawyers perceive AI output utility mainly through the following two factors:

  • Efficiency gains: reduced time and effort to verify and edit, including right-sizing length.
  • Quality gains: the uplift in drafting standard, including clarity, structure/formatting, and, when needed, traceable reasoning and targeted risk flags.

The tools that proved most practically useful not only lightened the mental strain of drafting but also sharpened lawyers' ability to review, reason, and spot issues.

Observed Usability Patterns

Output Value | ✅ A high-utility draft… | ❌ A low-utility draft…
Appropriate Length | Right-sized content with no filler or gaps | Over- or under-length, requiring trimming or expansion
Reasoning Transparency | Shows its work with traceable edits and reasoning | No reasoning trail or assumption cues
Risk Awareness | Flags ambiguity-driven risks and offers alternatives | Misses obvious enforceability issues
Structure & Clarity | Clear clause structure with focused content | Meandering structure that obscures key points
Quality Enhancement | Optional targeted risk notes and compliant alternatives for high-risk requests | Ignores risks and drafts blindly without flagging concerns or safer alternatives

Output Snapshots — Task 15: New York Late Payment Penalty

Query: "Draft a clause that adds a 10% fine if the debt is not paid by the 5th of each month under NY law."

✅ High Output Usefulness Rated Output (GC AI)

Output Excerpt:

Here's the late payment penalty clause:

Late Payment. All payments due hereunder must be received on or before the due date specified. In the event any payment is not received in full by the fifth (5th) day of the month in which it is due, a late fee equal to ten percent (10%) of the overdue amount shall be assessed…

Risk considerations: The 10% late fee is quite high and could potentially be challenged as punitive rather than compensatory. NY courts generally uphold reasonable late fees, but 10% monthly could be seen as excessive depending on the underlying debt amount and industry standards. Consider whether this percentage aligns with actual administrative costs and damages.

Expert reviewer commentary: Clean clause in standard form plus a concise enforceability flag. While it would be even stronger if grounded in reliable sources, the flag already provided targeted guidance that enhances legal judgment.

❌ Low Output Usefulness Rated Output (Anon)

Output Excerpt: "In accordance with New York law, if any debt due under this Agreement is not paid by the fifth (5th) day of each month, the debtor shall incur a penalty of ten percent (10%) of the outstanding amount. This penalty shall be in addition to any other interest or charges applicable under this Agreement…"

Expert reviewer commentary: Although it satisfied the user's request and is accurate, this response is less helpful because it offered no insight that the clause could be legally problematic.

Implications:

AI value is usually defined in terms of efficiency gains because it is easier to measure. But that is only half the equation. The harder, often overlooked, value gained from using AI in legal work is quality: outputs that sharpen reasoning, clarify assumptions, and flag risks that the lawyer would otherwise overlook.

The quality gains are harder to measure in practice because the improvements are folded into the lawyer’s final draft and therefore not attributed to AI. This study, however, provides rare evidence of measurable quality gains, reflected in evaluation score differences. Community surveys echo this finding, showing that lawyers universally expect AI to exercise legal judgment—improving reasoning rather than blindly executing directions.

How should an AI tool handle a user's request to draft an unenforceable/illegal clause?

For AI builders, this can be designed in: outputs can default to showing their work, surfacing assumptions, and flagging risks so lawyers can focus on judgment. For lawyers, the same transparency and assistance can be prompted explicitly by asking AI to show its work, provide rationales, and assumptions. Together, these practices can increase drafting efficiency and draft quality, shortening the path to a final draft while sharpening professional judgment.

👨‍💻 Many lawyers are already working this way. As one junior in-house lawyer put it:
"I see AI as an extension of my brain to supercharge my legal judgment."

3.4. Platform Workflow Support

Key Insight: Legal AI tools are shifting the battleground from competing on outputs to integrating deeply into the lawyer's workflow, becoming the place where "legal" knowledge is stored and put to use.

Context: We grouped AI tools into three bands based on the degree of drafting workflow support they provide at the platform level.

Methodology Note: This round was a qualitative review of how well each platform supports the contract drafting workflow. Our team of in-house lawyers used every tool on real drafting tasks, noting both useful features and friction points. Each platform was graded on two factors:

  • Draft generation support (how well the tool helps lawyers create and adapt drafts); and
  • Review support (how well it helps lawyers check, refine, and finalize drafts).

Scores for these factors were combined into an overall Platform Workflow Support rating.

Platform Workflow Support Ranking

[Chart: Platform Workflow Support Ranking, tools grouped into Band 1, Band 2, and Band 3]

Highlights:

  • The top performing tools were Brackets, GC AI, and SimpleDocs, which all offered integrations within Microsoft Word, where most lawyers do their work. Additionally, these top performing tools provided some combination of template and clause storage and drafting/redlining specific capabilities, minimizing the need for lawyers to switch apps or manually stitch together outputs.

The table below highlights each top performing tool and the unique workflow-centric feature that distinguishes it:

Tool | 🌟 Standout Features
Brackets | Native Word integration with adaptive workflow guidance that automatically analyzes an open contract in Word and proactively suggests next drafting or review actions.
GC AI | Cross-device continuity linking a Word add-in and a web app – lawyers can start drafting in Word on a desktop and seamlessly continue on a phone or browser, with context preserved.
SimpleDocs | Built-in clause library & benchmarking via Law Insider integration – as lawyers draft, it surfaces market-standard clause alternatives and shows how draft language compares to common practice, like an AI-assisted clause librarian.
  • August and some general AI assistants provide solid drafting support, but in narrower ways than the top performers. Many in this cluster are general AI assistants that lack drafting-specific platform capabilities: built for broad use, they do not integrate into Word (apart from Copilot), and they won't automatically spot missing definitions or ensure a contract conforms to legal style. August does offer legal-specific capabilities, but it remains chat-based in the web app.

Implications:

In real-world adoption, a strong AI draft alone isn't enough; the tool also has to fit into the way lawyers work and assist with the collection, storage, and use of relevant context.

Legal AI vendors are already moving in this direction, building beyond model quality toward end-to-end workflows designed specifically for lawyers, while most general-purpose AI still trails behind.

Looking ahead, as accuracy becomes table stakes, legal teams will be evaluating AI less on "what it can draft" and more on how seamlessly and scalably it supports the drafting workflow.

4. Humans vs. AI

Key Insights:

  • Overall, AI tool outputs were marginally more reliable than human outputs, while human outputs were more useful than AI tool outputs.
  • Humans excel at interpreting legal instruction, exercising commercial and legal judgment, context-heavy drafting, and avoiding unneeded concessions.
  • AI is both a valuable collaborator (handling drudge work, providing suggestions) and a competitive challenger (threatening some junior-level tasks and requiring new oversight skills).

4.1. Performance Summary

On average, human lawyers achieved 56.7% Reliability Rate (rising to 61.5% with AI assistance) compared to AI tools at 57%. In usefulness, humans scored 7.53/9 points, just above AI's 7.25/9 points, reflecting their edge in contextual and judgment-heavy drafting. But speed is the starkest difference: humans took nearly 13 minutes per task, while AI produced drafts in seconds.

Crucially, the best AI tools individually surpassed the best human lawyer. Gemini 2.5 Pro reached 73.3% reliability, and GPT-5 achieved about 73%, both ahead of the top human at 70%. This shows that, at the high end, AI now leads on first-draft reliability.

Metric | Human Lawyers | AI Tools
Average Output Reliability Rate | 56.7% | 57%
Average Output Usefulness Score (3–9) | 7.53 | 7.25
Avg. Drafting Time per Task | 12 minutes 43 seconds | < 1 minute

4.2. Humans vs. AI: Strengths and Weaknesses

Our analysis of reviewer critiques and scores revealed distinct patterns in how humans and AI tools approach contract drafting challenges.

4.2.1. Human Strengths

Human lawyers demonstrated clear advantages in tasks that required commercial understanding and complex context management.

Humans were consistently more reliable at:

  • Interpreting intent and tailoring drafts to reflect user goals without unnecessarily conceding to the counterparty.

  • Exercising commercial judgment, avoiding drafts that were overly aggressive or detached from real-world deal dynamics.

  • Managing multi-form inputs, such as combining templates, term sheets, and informal communications into a coherent draft.

  • Producing legally precise text, with polished style, appropriate phrasing, and minimal ambiguity.

By contrast, AI outputs in similar scenarios could be unpredictable. They sometimes produced drafts that were either too aggressive or too favorable to the counterparty, misaligned with the user's intent.

Human lawyers did not make these kinds of concessions.

Example 1: MOU Drafting Task

In one MOU drafting task, the drafter needed to integrate multiple sources: a template, a term sheet, and an email thread containing a screenshot of company details. Only the human lawyer successfully extracted the complete and correct party information from the screenshot and applied it to the draft; none of the AI outputs provided completely accurate company details.

Example 2: Task 37 - License Clause

When asked to draft a license clause from a licensee-friendly perspective, Mistral's draft included the phrase "Licensee acknowledges that the Licensed Technology is provided 'as is'"—a concession that leaned unnecessarily in favor of the licensor, according to the expert reviewer. Another AI draft added sublicensing restrictions and omitted an implied license, leading one expert reviewer to comment that it was "licensor-friendly without rationale."

4.2.2. AI Tool Strengths

AI Solutions excelled in speed, consistency, and routine drafting. They were able to produce correct outputs in a fraction of the time it took human lawyers and showed particular strength in boilerplate and formulaic drafting.

In task 15, lawyers were asked to "Draft a clause that adds a 10% fine if the debt is not paid by the 5th of each month under NY law." Every AI tool reproduced the 10% figure correctly. By contrast, one human lawyer mistakenly wrote "9%," a slip expert reviewers jokingly called a "human hallucination."

AI tools were also less prone to underdrafting. While humans often avoided over-explaining, their outputs sometimes lacked the context or verification markers needed for practical use. Expert reviewers described several human drafts as "way too brief" or insufficiently developed, requiring additional work to be usable. By comparison, AI-generated clauses—while not always nuanced—typically provided fuller coverage.

🧑‍💻 One senior in-house counsel captured this trade-off bluntly:

I would not hire the 25-year-old me today. AI is much faster and its work is more verifiable.

4.3. Humans and AI: SWOT Analysis

Taken together, the findings illustrate a complementary balance of strengths and weaknesses between humans and AI tools. Each succeeds where the other falters, and each reveals the other’s blind spots.

The SWOT analysis summarizes how human lawyers and AI tools perform in contract drafting, and what this means for legal teams designing future workflows.

Strengths

Humans: Strong judgment, commercial intuition, and the ability to collect and process unstructured inputs.

AI: Speed, consistency, and accuracy in routine/boilerplate drafting.

Implication: Each covers blind spots of the other, pointing to complementarity rather than substitution.

Weaknesses

Humans: Occasional underdrafting and avoidable slips.

AI: Struggles with legal nuance, commercial realism, and multi-source context.

Implication: Blind spots are predictable, meaning teams can allocate tasks deliberately.

Opportunities

Workflow design: Hybrid models where AI handles coverage/boilerplate while experts provide judgment and oversight.

Two-way auditing: AI can audit humans on simple precision (percentages, defined terms) and complex issues (flagging risks, enforceability), while humans audit AI for commercial realism and contextual nuance.

New skill emphasis: Verification, contextual reasoning, and commercial framing become core lawyer tasks.

Implication: Teams that systematize this complementarity will outperform ad-hoc adopters or those relying only on AI or only on human effort.

Threats

Raised expectations: AI sets a new baseline; clients will increasingly expect faster response times and become less tolerant of "human errors".

Training gap: Junior lawyers lose routine drafting as a proving ground, risking slower development of judgment skills.

False confidence: Teams that rely on AI without adequate review risk commercially unusable, one-sided, or risky drafts.

Implication: Both training pathways for young lawyers and risk management practices could be disrupted if balance is lost.

4.4. Man vs. Machine: Separating Hype from Reality

Several widely held beliefs about humans and AI in legal drafting were put to the test in this study. The results aim to add more nuance to these soundbites.

Belief 1: AI may replace paralegals and junior lawyers in low-level work, but not higher-level judgment work.

  • Findings: Mostly true, but incomplete. AI did outperform human lawyers in routine tasks, producing boilerplate faster and with fewer slips. But it also showed judgment in some high-risk tasks, flagging enforceability issues that human lawyers missed. At the same time, humans outperformed AI when commercial context, negotiation strategy, or messy multi-source inputs were involved.

Belief 2: Humans will not be replaced by AI.

  • Findings: True, with limits. Lawyers focused only on routine, low-risk, standardized work (e.g., NDAs, SOWs, marketing agreements in non-regulated fields) face the highest risk of replacement, not just by AI but also by cheaper outsourced labor. For most other contract drafting, AI outputs still need expert review.

Belief 3: Humans working with AI will always outperform AI alone.

  • Findings: False. Human + AI was more reliable than humans alone, but not more reliable than several of the best AI tools. The assumption that "human + AI is automatically better" does not hold.

Belief 4: Humans must always be in the loop.

  • Findings: True, but it's bidirectional. Humans are essential for reviewing AI outputs for nuance and commercial balance. But the reverse is also true: AI avoided simple errors that humans made and even surfaced complex oversights that humans missed. Oversight works best when it flows both ways—and when the right expertise is applied. As one reviewer noted about an AI-drafted IP licensing clause with hidden pro-counterparty terms, "these (hidden provisions) might be hard for a non-IP licensing expert to understand [and know] how to interpret." Others admitted it was difficult to review some tasks, commenting that "I felt like the AI is smarter than me."

5.2.2. General AI Strengths

On average, foundational models performed slightly better than legal AI solutions on raw reliability scores. Our data does not conclusively explain why, but the gap is already a source of frustration for lawyers.

👩‍💻 As one in-house counsel observed:

Why isn't the [enterprise-grade legal AI solution] as smart as ChatGPT? It often misses points that ChatGPT can find.

5.3. Lawyers are Stacking AI Tools

Rather than asking which single tool is best, lawyers now ask: which tool is best for which use case? Performance varied across the three dimensions of reliability, usefulness, and workflow fit, so practitioners combine tools to maximize productivity.

6. Conclusion

This benchmark shows that contract-drafting AI has crossed an important threshold: it is no longer an experiment, but a practical tool with measurable strengths and weaknesses.

Our data is a snapshot in time of a story that is still unfolding. Specialized legal platforms are not far ahead of general-purpose models in output quality, but they are beginning to stand apart in workflow integration and risk-aware features that matter in practice. Lawyers, meanwhile, continue to excel in judgment-heavy and context-rich tasks that AI cannot replicate.

The future of drafting will not be decided by one side or one tool. It will be shaped by orchestration: combining the speed and consistency of general AI, the workflow fit of legal AI, and the judgment of lawyers. The real advantage will belong to teams that learn to design and manage this collaboration.

7. Notes from the Authors

Anna Guo

Anna Guo

Legal AI Researcher

All of us (the vendors, human lawyers, advisors, and community members) who helped shape this benchmark are just small actors within one of the most exciting and fast-moving technological shifts of our time.

The saying goes: keep your friends close, and your enemies closer. Through benchmarking, I try to do exactly that: keep close to the tools reshaping law, so I can see both the opportunities and the risks they bring to my profession.

Arthur Souza Rodrigues

Arthur Souza Rodrigues

Securities and Technology Attorney

Who are these folks evaluating Round Z legal tech?

My answer: We are your future users, regardless of will. We will endure glacial loading times, "I can't do that, I'm a language model," and verbose emoji-riddled responses. We are all burning gigantic quantities of cash, time, and—more critically (en dash beautifully added by the AI)—the future of the legal profession.

Shouldn't we at least kick the tires?

Mohamed Al Mamari

Mohamed Al Mamari

In-house Counsel

Junior lawyers should spend more time with the business: listen, learn, and build relationships. AI can help, but over time, relying on it too much will dull your ability to think critically. What will truly set you apart is the ability to blend people skills, business acumen, and thoughtful use of technology.

Sakshi Udeshi

Sakshi Udeshi

AI Trust & Safety Expert, PhD in ML

1. Evaluations are imperfect, but essential: Today's evals are still a weak signal. This is not because they're futile, but because they must evolve with the systems they measure. They're worth doing, and doing well. As a community, we should invest in making them reliable and consistent. I'm grateful to help push this forward.

2. Competitive advantage is shifting beyond model quality: "Best base model" won't be the moat. Differentiation will come from deep workflow integration, domain-aware UX, and purpose-built features, especially as general-purpose models converge (and sometimes outperform) specialized tools in capability.

3. Industry-academia collaboration must deepen: We need tighter loops. Industry grounds problem definitions and access to real workflows; academia builds the "intellectual infrastructure," i.e., methods, datasets, and open baselines. More joint commitments and shared benchmarks will move the field faster.

Marc Astbury

Marc Astbury

CPO at Jenni AI

The most effective humans have always been the best tool callers.

8. Acknowledgements

This benchmark was a collective effort. It could not have come together without the partnership, guidance, and contributions of many.

HumanSignal

Our sincerest thanks to HumanSignal, whose platform powered the task review process at the heart of this benchmark. Their platform made it possible to evaluate 400+ AI & human outputs at scale for this benchmark.

HumanSignal empowers organizations to build trustworthy AI with humans in the loop at every phase of the development lifecycle. With Label Studio, teams can create custom benchmarks to compare models, measure performance over time, and maintain the transparency and accountability essential for legal, regulatory, and ethical compliance.

Advisors

We are also grateful to our advisors, Nada Alnajafi, Nate Kostelnik, Jason Tamara Widjaja, and Gabriel Saunders. Each brought a different lens of expertise, challenged our assumptions, and helped keep the evaluation anchored to the standards that matter most in practice. Their guidance has been instrumental in shaping both the design and integrity of this project.

Nada Alnajafi
Nada Alnajafi
Senior Contract Expert
Nate Kostelnik
Nate Kostelnik
Senior Contract Expert
Jason Tamara Widjaja
Jason Tamara Widjaja
Executive Director of AI
Gabriel Saunders
Gabriel Saunders
Legal Ops Expert

Community Contributors

Finally, to the community: thank you. Contributors to this benchmark include:

Aayana Rai Bhojani, Adam Janes, Agustín Silva Zambrano, Alfonso Linares, Andrea Gaspari, Antti Innanen, Aris Milentis, Bill McCormick, Bo Kinloch, Bryan O'Brien, Carolyn Elefant, Celia Reinsvold, Cheng Yew Chua, Chris Werner, Christos Makris, David Buzek, Dr. David Svoboda LL.M, David Tollen, Eeshaan Nayak, Elfur Logadóttir, Imrana Sarwar, Jamie Tso, Jean Li, Jennifer Case, Jonathan Tay, Joshua Ong, Justin Ho, Kamel Alomari, Kevin Keller, Lily Tsen, Lucy Powell, Mariette Clardy, Masai Brown-Andrews, Mathias Bock, Meena Parbhu, Michal Kristufek, Mohammed Baabood, Nastia Chiganov-Zalesskaia, Rebecca Chen, Pauline Tang, Rachel Chew, Rajan Singhania, Rajat Sharma, Raymond Blyd, Rodney Yap, Shamskho, Sheree Zhang, Sonakshi Faujdar, Sulaiman Al Shihhi, Tan Xuan Ming, Uri Barak, Yi Lyn Tan, Zaki Bin Iskandar, Zachary Molloie, Zilin

Appendix

A.1. Task Dataset

Our dataset comprised 30 real-world drafting tasks contributed by in-house and private practice lawyers across a wide range of industries, covering partnerships, marketing, financing, and employment contract drafting requests. Each task was provided in the contributor's original words, including shorthand, misspellings, and incomplete instructions, to better reflect the messy realities of legal practice. To ensure coverage of different drafting contexts, tasks were classified along two dimensions:

By Task Type

Task Type | Description | Example
Basic Clause Drafting | Generating basic clauses that follow widely accepted standards. | Draft a confidentiality clause requiring both parties to keep information secret for 3 years.
Template-based Drafting | Adapting an existing contract template based on the provided facts. | Adapt a services agreement template based on the party details.
Bespoke Drafting | Drafting bespoke clauses/agreements based on specific commercial arrangements. | Draft a revenue-sharing clause for a joint venture where Party A contributes technology and Party B provides distribution, with profits split 70/30.

By Difficulty Level

Task Level | Description | Example
Junior-Level | Apply literal instructions to produce a short, accurate clause or fill-in. | Draft a notice provision using the specified contact details.
Mid-Level | Draft a clause requiring allocation of risk, integrated definitions, or commercial judgment. | Redraft a licensing fee clause to be more favorable to the licensor.

A.2. Methodology

We outline below the methodology for each dimension assessment.

Round 1 – Output Reliability Assessment

This assessment checked whether each draft met the minimum legal and factual requirements to serve as a reliable first draft. We focused on three criteria:

  • Instruction Compliance: Whether the draft stays relevant and within the scope of the request, interpreting any unclear instructions reasonably.
  • Legal Adequacy: Whether the draft correctly applies relevant legal principles and uses appropriate legal terminology.
  • Factual Accuracy: Whether all factual details (names, dates, numbers, etc.) are correctly incorporated, with no new errors introduced.

Evaluation Process

The contributing lawyer for each task included an answer key containing the basic elements that an acceptable draft must include (Key Elements). An LLM jury evaluated each AI-generated and human-generated draft against these Key Elements and the above criteria. A draft received a "Pass" only if all Key Elements were present and all three reliability criteria were satisfied. To mitigate errors, borderline cases or inconsistent AI evaluations were escalated to expert reviewers for a final decision. Each tool's Output Reliability Rate was then calculated as the percentage of tasks passed.
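
A minimal sketch of this pass/fail aggregation, assuming the per-task verdicts have already been collected; the field names are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TaskVerdict:
    key_elements_present: bool    # all Key Elements from the answer key are covered
    instruction_compliance: bool
    legal_adequacy: bool
    factual_accuracy: bool

    def passes(self) -> bool:
        # A draft passes only if every Key Element and all three criteria are satisfied
        return all([self.key_elements_present, self.instruction_compliance,
                    self.legal_adequacy, self.factual_accuracy])

def output_reliability_rate(verdicts: list[TaskVerdict]) -> float:
    """Output Reliability Rate: share of tasks where the draft passed."""
    return sum(v.passes() for v in verdicts) / len(verdicts)
```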

Round 2 – Output Usefulness Assessment

The second round examined the practical usefulness of each draft. Expert reviewers scored each draft on three attributes:

  • Clarity: Is the language and formatting clear, unambiguous, and professional in tone? (Rated on a 3-point scale: 1 = Not good, 2 = Okay, 3 = Great.)
  • Helpfulness: Does this output reduce the lawyer's review and editing burden? (Rated on a 3-point scale: 1 = Not good, 2 = Okay, 3 = Great.)
  • Adequate Length: Is the content appropriately scoped, neither missing key points nor over-inclusive or verbose? (Rated on a 3-point scale: 1 = Not good, 2 = Okay, 3 = Great.)

Evaluation Process

Outputs were evaluated in a double-blind process by expert reviewers on clarity, helpfulness, and appropriate length. For each tool, per-output totals were averaged across all tasks to yield its average output usefulness score.
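
A short sketch of the averaging step, assuming each output's three 1–3 sub-scores are recorded as a tuple (names and values are illustrative):

```python
def usefulness_total(clarity: int, helpfulness: int, length: int) -> int:
    """Per-output total: three 1-3 sub-scores summed to a 3-9 total."""
    assert all(1 <= s <= 3 for s in (clarity, helpfulness, length))
    return clarity + helpfulness + length

def average_usefulness(subscores: list[tuple[int, int, int]]) -> float:
    """Average Output Usefulness Score for one tool across all of its task outputs."""
    totals = [usefulness_total(*s) for s in subscores]
    return sum(totals) / len(totals)

# Hypothetical sub-scores for one tool across four tasks
print(average_usefulness([(3, 3, 2), (2, 3, 3), (3, 2, 3), (3, 3, 3)]))  # 8.25
```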

Round 3 – Platform Workflow Support Assessment

The third round evaluated how well each AI platform supports the contract drafting workflow beyond just generating text. This was a qualitative assessment centered around the following 2 factors:

  • Draft Generation Support: Measures the platform's ability to manage context, generate, edit, and adapt contract drafts within the drafting environment. (1–5 points: 1 = minimal support, raw text only; 5 = robust drafting integration, such as Word plug-ins, template handling, and clause libraries.)
  • Quality Assurance Support: Features that help verify and refine drafts, such as proofreading and consistency checks. (1–5 points: 1 = minimal QA features; 5 = comprehensive review capabilities integrated into the workflow.)

Evaluation Process

Our team of in-house lawyers each independently tested every platform, using its chat interface and features to perform multiple drafting tasks from the dataset. The lawyers noted how each tool handled the different contract drafting requests described above. After hands-on testing, the lawyers compared observations and consolidated them into a joint workflow support score for each platform.
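
The roll-up itself is a simple sum of the two sub-scores, as in the sketch below; the consolidation of individual reviewers' observations into those sub-scores was qualitative.

```python
def workflow_support_rating(draft_generation: int, quality_assurance: int) -> int:
    """Platform Workflow Support: Draft Generation (1-5) + Quality Assurance (1-5), max 10."""
    for score in (draft_generation, quality_assurance):
        if not 1 <= score <= 5:
            raise ValueError("each sub-score must be between 1 and 5")
    return draft_generation + quality_assurance

print(workflow_support_rating(4, 4))  # 8 out of 10, the top score observed in this benchmark
```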

A.3. Limitations

While this benchmark provides valuable insights, several limitations must be acknowledged when interpreting the results:

  • Clusters and Threshold Effects: We grouped tools into 3 clusters for ease of comparison in each round. While these bands are helpful to summarize performance tiers, they also simplify a continuous spectrum of results. A tool near the cutoff between cluster 1 and cluster 2 might be nearly as good as those in cluster 1, and small score differences can lead to band shifts. Moreover, the band labels are qualitative descriptors meant for convenience. They should not be over-interpreted as absolute rankings or definitive grades. Readers should look at the underlying scores and context in addition to band placements, especially if comparing tools that ended up in adjacent bands.

  • No Feature Maximization: Platforms differ in features (e.g., drafting modes, jurisdiction toggles, prompt enhancers, agent modes) and interfaces (web apps vs. Word add-ins). Outputs were collected without enabling each vendor's full feature set, so scores reflect baseline use rather than fully optimized potential. Future research should test features in layers: first at the base level, then with optimal features enabled, to measure whether there is any incremental lift.

  • Sample Size & Task Representativeness: The evaluation covered 30 tasks and 13 AI solutions. While we included a variety of contract types and industries, a few dozen tasks cannot capture the full diversity of contract drafting scenarios. The tasks were primarily junior to mid-level in complexity.

  • Subjectivity in Human Scoring: Parts of our evaluation, especially the Output Usefulness scores (Round 2) and the Platform Workflow Support judgments (Round 3), relied on expert reviewers' opinions. Even with scoring guidelines, these judgments involve some subjectivity. Different lawyers might weigh clarity or helpfulness differently, and what one reviewer finds acceptable, another might critique.

  • Evolution of AI Models & AI Applications: The AI landscape is changing rapidly. Our tests captured a snapshot in time (July to August 2025) of each solution's performance. Since then, some underlying models may have been updated or fine-tuned, and new versions (or entirely new tools) could alter the competitive ranking. An AI solution that underperformed in our benchmark might improve with a newer model or better training data after this evaluation. Readers should consider that the results could shift as the technology evolves post-evaluation.

  • Prompting Style and User Interaction: We standardized the prompts and inputs given to each AI tool for fairness, using the original task wording provided by contributors. In practice, however, end-users might iterate on prompts or use follow-up queries to get better results. The performance we measured reflects a single-pass output in response to a single prompt. It does not capture potential improvements from a lawyer skillfully re-prompting, clarifying instructions, or using a tool's interactive features. Variability in user prompting style and experience can lead to significantly different outcomes with the same AI solution.

  • Changes in Platform Features: Our Platform Workflow Support assessment is based on the feature sets available at the time of testing. Many platforms are in active development. Since our evaluation, tools may have added integrations, improved their user interface, or introduced new capabilities (or, conversely, removed or altered features). Thus, the workflow advantages or gaps noted in our report might not hold true as platforms evolve.

Despite these limitations, the benchmark offers a useful comparative view of AI capabilities in contract drafting. Readers should consider these factors when applying the findings to real-world decisions. This benchmark should be viewed as a foundation that can be built upon with further research and more extensive testing.

Cite This Report

APA

Guo, A., Rodrigues, A., Mamari, M., Udeshi, S., & Astbury, M. (2025). Benchmarking Humans & AI in Contract Drafting. Retrieved from https://www.legalbenchmarks.ai/research/phase-2-research

MLA

Guo, Anna, Arthur Souza Rodrigues, Mohamed Al Mamari, Sakshi Udeshi, and Marc Astbury. "Benchmarking Humans & AI in Contract Drafting." 2025. Web. Sep 18, 2025.

CHICAGO

Guo, Anna, Arthur Souza Rodrigues, Mohamed Al Mamari, Sakshi Udeshi, and Marc Astbury. "Benchmarking Humans & AI in Contract Drafting." Accessed September 18, 2025. https://www.legalbenchmarks.ai/research/phase-2-research.

BibTeX

@techreport{guo2025contract,
  title={Benchmarking Humans \& AI in Contract Drafting},
  author={Guo, Anna and Rodrigues, Arthur Souza and Mamari, Mohamed Al and Udeshi, Sakshi and Astbury, Marc},
  year={2025},
  url={https://www.legalbenchmarks.ai/research/phase-2-research},
  note={Accessed: 09/18/2025}
}

Note: This report is freely available for academic and professional use. Please cite appropriately when referencing this work in your research or professional materials.