Frontier labs are sprinting toward general intelligence. Models keep getting more capable, but they still can’t produce work that a senior expert would sign their name to. Real-world legal work involves multi-turn reasoning across languages, jurisdictions, and counterparties, with decisions that have cascading effects.
Revisioning designs, runs, and grades adversarial multi-turn legal evaluations against frontier models. Scenarios are written by more than 100 Tier 1 lawyers and span multiple jurisdictions, counterparties, and cascading decisions. Candidate responses are graded turn by turn by a cross-family LLM judge against a written rubric of substantive obligations, with a senior-partner panel calibrating the judge against human ground truth.
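To make the grading loop concrete, here is a minimal sketch of turn-by-turn rubric scoring and a simple judge-versus-human agreement check. It is an illustration under stated assumptions, not Revisioning's implementation: the names `Turn`, `grade_turns`, `agreement`, and `keyword_judge` are hypothetical, and the keyword matcher is a toy stand-in for a call to an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data model: one turn of a scenario plus the rubric
# obligations the response to that turn is expected to satisfy.
@dataclass
class Turn:
    prompt: str
    response: str
    obligations: list[str]

# A judge decides whether a response satisfies one obligation.
# In practice this would wrap an LLM call; here it is only a type.
Judge = Callable[[str, str], bool]

def grade_turns(turns: list[Turn], judge: Judge) -> list[float]:
    """Return, for each turn, the fraction of rubric obligations met."""
    scores = []
    for turn in turns:
        met = [judge(turn.response, ob) for ob in turn.obligations]
        scores.append(sum(met) / len(met) if met else 0.0)
    return scores

def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Raw agreement rate between judge verdicts and human ground truth."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def keyword_judge(response: str, obligation: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return obligation.lower() in response.lower()

# Invented example data, for illustration only.
example_turns = [
    Turn(
        prompt="The client has received a subpoena. What now?",
        response="We must notify the regulator and preserve documents.",
        obligations=["notify the regulator", "preserve documents"],
    ),
    Turn(
        prompt="The counterparty objects to the assignment.",
        response="We can proceed to closing.",
        obligations=["obtain consent", "proceed to closing"],
    ),
]

if __name__ == "__main__":
    print(grade_turns(example_turns, keyword_judge))
    print(agreement([True, False, True, True], [True, False, False, True]))
```

Scoring each turn separately, rather than only the final answer, is what lets an early missed obligation stay visible even when later turns look fine. Raw agreement is the simplest calibration signal; a chance-corrected statistic could be substituted without changing the structure.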
We plug into the stacks customers already run. Multi-turn evaluations feed into Frontier Labs’ RL. The same scenarios run against AI Applications’ full agent harnesses. Law Offices point Revisioning at the vendor tools they're evaluating to validate them before deployment.