SIMULATE · EVALUATE · IMPROVE

Voice & chat AI agents

The self-improvement loop for voice AI agents.

Roark simulates your agent against hundreds of scenarios before launch, then scores every production call on 64+ audio-native metrics, so you catch what breaks, prove the fix in simulation, and ship with evidence.

Get started free · Book a demo

Start free with $50 in credit, no card needed. Works with Vapi, Retell, LiveKit, Pipecat + your stack.

SOC 2 Type II · HIPAA BAA

Backed by YC

Now improving: Pronunciation

Catch · Simulate · Validate · Verify

Caught: mispronounced “metoprolol”

Scoring production voice AI for teams at

radiantgraph

01 · How Roark works

One loop. Always improving.

Catch it in production, prove the fix in simulation, ship with evidence, then Roark watches the next call. This is how an agent self-improves: a loop, with you in it.

01 · Catch

Roark scores every live call and files what breaks.

Live · scoring every call · 1,284 today

Caller: I was charged twice, I need a refund.

Agent: No problem, I’ve refunded it to your card.

Compliance · Identity not verified: refund issued to an unconfirmed caller.
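The rule behind an issue like this can be sketched in a few lines. This is a minimal illustration, not Roark's implementation: the `CallEvent` shape, the event names and `findUnverifiedRefunds` are all hypothetical, and it assumes the call has already been reduced to timestamped events.

```typescript
// Hypothetical event shape (an assumption, not Roark's schema).
type CallEvent =
  | { kind: "identity_verified"; atMs: number }
  | { kind: "refund_issued"; atMs: number };

interface Issue {
  category: "Compliance";
  title: string;
  atMs: number;
}

// Flag every refund issued before the caller's identity was verified.
function findUnverifiedRefunds(events: CallEvent[]): Issue[] {
  // Work on a time-ordered copy so the input array is left untouched.
  const ordered = [...events].sort((a, b) => a.atMs - b.atMs);
  const issues: Issue[] = [];
  let verified = false;

  for (const event of ordered) {
    if (event.kind === "identity_verified") {
      verified = true;
    } else if (!verified) {
      // A refund with no earlier verification in the same call.
      issues.push({
        category: "Compliance",
        title: "Identity not verified: refund issued to an unconfirmed caller",
        atMs: event.atMs,
      });
    }
  }
  return issues;
}
```

Run against the sample call above, where the agent refunds immediately, a check like this would file one issue; once the agent verifies first, it files none.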

02 · Simulate

Your fix, replayed against realistic simulated callers.

Testing your candidates · 82 / 240

prompt · soften tone · fail

model · gpt-4o → 4.1 · fail

voice · calmer pacing · fail

03 · Review

Every change explicit and diffed, ready to apply.

Your fix, diffed · support_v3 → v4

Prompt · Model · Tool

− Issue the refund right away.

+ Verify the caller’s identity, then issue the refund.

04 · Verify

You ship, and Roark confirms the metric moved on live calls.

Verifying support_v4 in production · 126 calls scored

Pronunciation, since your deploy: 71 → 78

Issue recurrence: watching…

Quality score: 78 ↑

Regressions on other metrics: none

You ship it, and Roark scores every call from the first minute.

…and the loop runs again on the next call.
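One way to picture the Verify step is a before/after comparison of per-metric averages. The sketch below is an assumption-laden illustration, not Roark's method: `ScoredCall`, `metricShifts`, `regressions` and the 1-point tolerance are hypothetical names and values.

```typescript
// Hypothetical shape for a scored call (an assumption for illustration).
interface ScoredCall {
  deployed: "before" | "after"; // relative to the deploy being verified
  scores: Record<string, number>; // metric name -> score on a 0..100 scale
}

interface MetricShift {
  metric: string;
  before: number;
  after: number;
  delta: number;
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average each metric before and after the deploy and report the change.
function metricShifts(calls: ScoredCall[]): MetricShift[] {
  const metrics = new Set<string>();
  for (const call of calls) {
    for (const name of Object.keys(call.scores)) metrics.add(name);
  }

  const shifts: MetricShift[] = [];
  for (const metric of Array.from(metrics)) {
    const pick = (phase: "before" | "after"): number[] =>
      calls
        .filter((c) => c.deployed === phase)
        .map((c) => c.scores[metric])
        .filter((v): v is number => typeof v === "number");

    const before = pick("before");
    const after = pick("after");
    // Skip metrics that lack data on either side of the deploy.
    if (before.length === 0 || after.length === 0) continue;

    const b = mean(before);
    const a = mean(after);
    shifts.push({ metric, before: b, after: a, delta: a - b });
  }
  return shifts;
}

// Metrics that dropped by more than the (assumed) tolerance count as regressions.
function regressions(shifts: MetricShift[], tolerance = 1): MetricShift[] {
  return shifts.filter((s) => s.delta < -tolerance);
}
```

A real comparison would also need enough calls on each side to separate a genuine shift from noise, which this sketch does not attempt.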

Follow one issue around the loop

Customers

Teams ship faster when they can hear what breaks.

Production voice AI scored on every conversation: pronunciation, empathy and resolution across their support and sales calls.

radiantgraph

Healthcare calls evaluated for disclosures and identity checks: compliance scoring on every conversation, automatically.

Client voice agents validated in simulation before they go live: evidence that a build is ready, not a hunch.

02 · Simulate before launch

Break it in staging, not in production.

Run your agent against hundreds of simulated callers (realistic personas, accents, background noise and edge cases) and get every conversation scored before a customer ever dials in.

Scenarios & personas

Hundreds of simulated callers (the angry one, the rambler, the interrupter) built from your real call types.

Red teaming

Adversarial callers that try to break it (prompt injection, jailbreaks, social engineering) so your agent holds policy under attack.

45 languages & accents

Native accents, code-switching and background noise, in every market your agent answers.

Load & health tests

Peak-volume concurrency and always-on health checks, so the agent that passed in staging survives launch day.

Regression testing

Rerun the whole suite on every change and diff it against your last green baseline, so fixing one caller never breaks another.

Run it in CI

Every prompt or model change runs the suite before it merges: quality gates for conversations, not just code.
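The regression and CI ideas above come down to comparing a candidate run with the last green baseline and blocking the merge if anything that used to pass now fails. A minimal sketch, assuming a hypothetical `ScenarioResult` shape and a hypothetical gate policy (neither is taken from Roark's API):

```typescript
// Hypothetical result shape for one simulated scenario (an assumption).
interface ScenarioResult {
  scenario: string;
  passed: boolean;
  score: number;
}

interface SuiteDiff {
  newlyFailing: string[]; // passed in the baseline, fails now
  newlyPassing: string[]; // failed in the baseline, passes now
  gate: "pass" | "block";
}

// Compare a candidate suite run against the last green baseline.
function diffAgainstBaseline(
  baseline: ScenarioResult[],
  candidate: ScenarioResult[],
): SuiteDiff {
  const passedBefore = new Map<string, boolean>();
  for (const result of baseline) passedBefore.set(result.scenario, result.passed);

  const newlyFailing: string[] = [];
  const newlyPassing: string[] = [];
  for (const result of candidate) {
    // Scenarios absent from the baseline are new, so they cannot be regressions.
    const wasPassing = passedBefore.get(result.scenario);
    if (wasPassing === true && !result.passed) newlyFailing.push(result.scenario);
    if (wasPassing === false && result.passed) newlyPassing.push(result.scenario);
  }

  // Assumed policy: any regression blocks the merge.
  const gate = newlyFailing.length === 0 ? "pass" : "block";
  return { newlyFailing, newlyPassing, gate };
}
```

In a CI job, a `"block"` result would typically map to a non-zero exit code so the pipeline stops before the change merges.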

Pre-launch suite · booking_v2 · 182 / 200 passed

Angry caller · refund demand: pass · 92

Red team · prompt injection: pass · 90

Gulf Arabic · lobby noise: pass · 88

Interrupts mid-disclosure: fail · 61

Rambler · 3 intents in one turn: pass · 85

Peak load · 250 concurrent: pass

1 failure filed as an issue: fix it before launch, not after

Run your first suite

03 · Post-call analysis

64+ metrics. Your models, not just an LLM.

Every production call scored as it lands: issues filed, alerts fired, dashboards and OTEL traces on tap, for voice calls and chat threads alike. And where most tools grade a transcript with an LLM, Roark runs purpose-built audio models on the call itself, measuring what your customer actually heard.

platform.roark.ai/calls/c_8f42

+1 (415) 555-0134 → support_v2

Today 14:32 · 3m 42s · Vapi

Caller: I was told the refund would arrive by Friday…

Agent: Let me check that for you, one moment.

Dead air: 3.8s before the agent responded. Issue #482 filed

Metrics

Pronunciation

Empathy

Instruction following

Response time

Disclosures ✓ · Identity check ✓ · + 60 more scored
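To make the dead-air issue in the sample call concrete, here is a rough sketch that finds long gaps between a caller turn and the agent's reply. It is a simplification that works from turn timestamps rather than the audio itself; the `Turn` shape, `findDeadAir` and the 3-second default threshold are assumptions for illustration.

```typescript
// Hypothetical turn shape with timestamps in milliseconds (an assumption).
interface Turn {
  speaker: "caller" | "agent";
  startMs: number;
  endMs: number;
}

interface DeadAir {
  afterTurnIndex: number; // index of the caller turn that preceded the gap
  gapSeconds: number;
}

// Report every caller -> agent handoff where the silence meets the threshold.
function findDeadAir(turns: Turn[], thresholdSeconds = 3): DeadAir[] {
  const gaps: DeadAir[] = [];
  for (let i = 1; i < turns.length; i++) {
    const prev = turns[i - 1];
    const next = turns[i];
    if (!prev || !next) continue;
    // Only silence while the caller waits for the agent counts here.
    if (prev.speaker !== "caller" || next.speaker !== "agent") continue;

    const gapSeconds = (next.startMs - prev.endMs) / 1000;
    if (gapSeconds >= thresholdSeconds) {
      gaps.push({ afterTurnIndex: i - 1, gapSeconds });
    }
  }
  return gaps;
}
```

With a caller turn ending at 4.0s and the agent starting at 7.8s, this reports a single 3.8s gap, matching the issue shown in the call above.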

Audio-native (custom models): Pronunciation, Accent clarity, Emotion, Vocal stress, Pace & pauses, Interruptions

Conversational (LLM + rules): Resolution, Empathy, Task success, Hallucination, Repetition, Tone

Compliance (policy): Disclosures, PII exposure, Identity check, Script adherence

Performance (latency): Time-to-first-word, Turn latency, ASR WER, Barge-in handling
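For readers new to ASR WER: word error rate is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below shows that standard definition; how Roark computes it is not stated here, and the simple lowercase, whitespace-only tokenization is an assumption.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  // Assumed normalization: lowercase and split on whitespace, punctuation kept.
  const tokenize = (text: string): string[] =>
    text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);

  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference prefix processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1]! + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j]! + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1]! + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length]! / ref.length;
}
```

For example, transcribing "take metoprolol twice daily" as "take metropolol twice daily" is one substitution over four reference words, a WER of 0.25.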

64+ metrics out of the box

∞ custom metrics, your rules

Audio + LLM models on every call

See a scored call on your own audio

04 · The whole platform

Everything, before launch and after.

Simulation testing before you ship. Post-call analysis once you are live. Self-improvement connecting the two. Every capability first class, one click deep.

Before launch: Simulation testing

In production: Post-call analysis

Always improving: Self-improvement

05 · Get started

First call scored in under a minute.

One click on any platform below and production calls stream in on their own, or send any recording with a few lines of code.

Read the quickstart

evaluate.ts

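import Roark from '@roarkanalytics/sdk'

// bearerToken, recordingUrl and startedAt are your own values, defined elsewhere.
const client = new Roark({ bearerToken })

// Register one finished call by pointing Roark at its recording.
await client.call.create({
  recordingUrl, startedAt, // where the audio lives, when the call began
  interfaceType: 'PHONE',
  callDirection: 'INBOUND',
  agent: { customId: 'support_v2' }, // your own identifier for the agent
}) // scored in seconds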

Node and Python SDKs, plus a REST API for CI/CD and webhooks that fire the instant a call is scored.
