SIMULATE · EVALUATE · IMPROVE

Voice agents for tier-1 support, deflection & triage

Containment is not resolution.

Your deflection agent says it handled the call. The customer called back angry. Roark scores whether the issue was actually resolved, whether the escalation was clean, and whether the caller was getting more frustrated — on the audio, not just the words.

Backed by Y Combinator
Live · scoring every call · 1,284 today

Caller: Fine. Whatever. Let’s just do it your way.

Agent: Great! Let’s continue.

Rising frustration

Words read neutral; the audio model hears a customer about to churn.

Emotion

Scoring production voice AI for teams at

Google
AT&T
BCG
Spectrum
Aircall
Podium
radiantgraph

§01 · When the call goes wrong

Here's how a support call quietly fails.

Each one counts as a contained call in your dashboard and a repeat call in your queue — and most are inaudible to a tool that only reads the transcript.

01 · Deflection

Vanity containment

Caller: So my refund still hasn’t shown up.

Agent: Thanks for confirming! Is there anything else I can help with?

The agent closed the call without solving anything and logged it as contained. The customer calls back in an hour. Real resolution — not hang-up rate — is the metric that matters, and it’s the one most teams never score.
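The gap between containment and resolution can be made concrete with a small calculation. The sketch below is only an illustration: the `CallRecord` shape, its field names, and the use of a repeat call inside a time window as a stand-in for "not resolved" are all assumptions, not Roark's schema or scoring method.

```typescript
// Hypothetical call record; field names are illustrative, not a Roark schema.
interface CallRecord {
  callerId: string;
  startedAt: number; // epoch milliseconds
  escalatedToHuman: boolean;
}

interface Rates {
  containment: number; // share of all calls that never reached a human
  resolution: number; // share of all calls contained AND not followed by a repeat call
}

// Assumption: a repeat call from the same caller within `repeatWindowMs`
// is treated as evidence that the earlier contained call was not resolved.
function containmentVsResolution(calls: CallRecord[], repeatWindowMs: number): Rates {
  if (calls.length === 0) return { containment: 0, resolution: 0 };
  const sorted = [...calls].sort((a, b) => a.startedAt - b.startedAt);
  let contained = 0;
  let resolved = 0;
  for (let i = 0; i < sorted.length; i++) {
    const call = sorted[i];
    if (call.escalatedToHuman) continue;
    contained++;
    const calledBack = sorted
      .slice(i + 1)
      .some(
        (later) =>
          later.callerId === call.callerId &&
          later.startedAt - call.startedAt <= repeatWindowMs,
      );
    if (!calledBack) resolved++;
  }
  return {
    containment: contained / sorted.length,
    resolution: resolved / sorted.length,
  };
}

const MINUTE = 60_000;
const sampleCalls: CallRecord[] = [
  { callerId: 'A', startedAt: 0, escalatedToHuman: false },
  { callerId: 'B', startedAt: 10 * MINUTE, escalatedToHuman: false },
  { callerId: 'A', startedAt: 30 * MINUTE, escalatedToHuman: false }, // caller A calls back
  { callerId: 'C', startedAt: 40 * MINUTE, escalatedToHuman: true },
];
const rates = containmentVsResolution(sampleCalls, 60 * MINUTE);
// rates.containment is 0.75, rates.resolution is 0.5
```

In this sample three of four calls count as contained, but only two were not followed by a callback, so the dashboard number overstates what was actually fixed.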

Resolution

02 · Escalation

The dropped handoff

Caller: Just put me through to a person.

Agent: I can help with that. What is your account number again?

A clear escalation request got swallowed by another collection loop. The handoff never carried context, so the human starts from zero and the customer repeats everything. Roark fails the turn that ignores or fumbles the escalation trigger.

Escalation

03 · Knowledge

The confident wrong answer

Caller: Is the annual plan refundable after 30 days?

Agent: Yes, you can get a full refund anytime in the first year.

Stated with total confidence and completely invented. A hallucinated policy answer becomes a chargeback, a complaint, or a promise your team has to honor. Roark flags assertions that aren’t grounded in the knowledge base.

Hallucination

04 · Triage

The loop that won’t end

Caller: I already gave you the order number twice.

Agent: No problem — can you read me the order number?

The agent re-asked for information the caller already provided, three turns running. Repetition is the fastest way to make a calm customer furious, and it reads as a perfectly polite transcript.

Repetition

05 · Frustration

The slow boil

Caller: Fine. Whatever. Let’s just do it your way.

Agent: Great! Let’s continue.

“Fine” and “whatever” read neutral on the page. In the audio the caller is clipped, tense, and about to churn. Only an audio model hears the rising frustration a transcript-only tool scores as agreement.

Emotion

§02 · From caught to fixed

Roark catches every one of these — and proves the fix.

Each failure above is filed with its evidence, becomes a repeatable simulation until a candidate passes, and is verified on your next thousand live calls.

01 · Catch

The ledger above — every failure filed live, evidence attached.

See what breaks

02 · Simulate

Your fix, replayed against the exact failures above.

Testing your candidates · 82 / 240
prompt · ground answers in KB: fail
model · gpt-4.1: fail
prompt · acknowledge before asking: fail

03 · Review

Every change explicit and diffed — you apply it.

Your fix, diffed · support_v3 → v4
Prompt · Tool · Model
Resolve the customer’s request and close the call.
+ Confirm the issue is actually fixed before closing; if not, escalate with full context.

04 · Verify

You ship — Roark confirms the metric moved on live calls.

Verifying support_v4 in production · 126 calls scored
Resolution since your deploy: 71 → 78
Issue recurrence: watching…
Quality score: 78 ↑
Regressions on other metrics: none

you ship it — Roark verifies every call, with no change to your CCaaS

…and the loop runs again on the next call.

§03 · Simulate before launch

Break it in staging, not in production.

Run your agent against hundreds of simulated callers — realistic personas, accents, background noise and edge cases — and get every conversation scored before a customer ever dials in.

Scenarios & personas

Hundreds of simulated callers — the angry one, the rambler, the interrupter — built from your real call types.

45 languages & accents

Native accents, code-switching and background noise — in every market your agent answers.

Load & health tests

Peak-volume concurrency and always-on health checks, so the agent that passed in staging survives launch day.

Run it in CI

Every prompt or model change runs the suite before it merges — quality gates for conversations, not just code.

Pre-launch suite · customer_support_v1 · 182 / 200 passed
Deflection · Vanity containment: pass · 92
Escalation · The dropped handoff: pass · 88
Knowledge · The confident wrong answer: fail · 61
Triage · The loop that won’t end: pass · 85
Frustration · The slow boil: pass · 90

1 failure filed as an issue — fix it before launch, not after
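A quality gate over suite results like the ones above can be sketched in a few lines. The threshold of 70, the `ScenarioResult` shape, and the function names are assumptions chosen for illustration; the page does not state the pass mark or a results format.

```typescript
// Hypothetical result shape; not a Roark API type.
interface ScenarioResult {
  name: string;
  score: number; // 0-100
}

interface GateOutcome {
  passed: boolean;
  failures: ScenarioResult[];
}

// Fail the gate if any scenario scores below the minimum.
function gate(results: ScenarioResult[], minScore: number): GateOutcome {
  const failures = results.filter((r) => r.score < minScore);
  return { passed: failures.length === 0, failures };
}

// Map the outcome to a process exit code; non-zero blocks the merge in most CI systems.
function exitCodeFor(outcome: GateOutcome): number {
  return outcome.passed ? 0 : 1;
}

// Scores copied from the pre-launch suite shown above.
const suiteResults: ScenarioResult[] = [
  { name: 'Deflection · Vanity containment', score: 92 },
  { name: 'Escalation · The dropped handoff', score: 88 },
  { name: 'Knowledge · The confident wrong answer', score: 61 },
  { name: 'Triage · The loop that won’t end', score: 85 },
  { name: 'Frustration · The slow boil', score: 90 },
];

const MIN_SCORE = 70; // assumed pass mark
const outcome = gate(suiteResults, MIN_SCORE);
const exitCode = exitCodeFor(outcome);
```

In a CI script the final step would typically hand `exitCode` to the runtime (for example `process.exit(exitCode)` in Node), so a failing scenario stops the change before it merges.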

§04 · Evals & observability

64+ metrics. Your models, not just an LLM.

Every production call scored as it lands — issues filed, alerts fired, dashboards and OTEL traces on tap, for voice calls and chat threads alike. And where most tools grade a transcript with an LLM, Roark runs purpose-built audio models on the call itself, measuring what your customer actually heard.

Everyone else

LLM reads the transcript

“The agent said the right words.” Misses how it sounded — the mispronounced drug name, the flat apology, the rushed close.

“…refund within three business days.” ✓ text-match
Roark · audio model · hears the call

Audio models hear the call

Pronunciation, accent, emotion and vocal stress measured from the waveform — the signal an LLM grading text can never see.

emotion · pace: rushed close
Empathy · 84

Audio-native

custom models

  • Emotion
  • Vocal stress
  • Accent clarity
  • Pronunciation
  • Pace & pauses
  • Interruptions

Conversational

LLM + rules

  • Resolution
  • Escalation
  • Repetition
  • Hallucination
  • Task success
  • Tone

Compliance

policy

  • Disclosures
  • Identity check
  • PII exposure
  • Script adherence

Performance

latency

  • Time-to-first-word
  • Turn latency
  • ASR WER
  • Barge-in handling
64+ metrics out of the box
custom metrics, your rules
Audio + LLM models on every call

§05 · Get started

First call scored in under a minute.

One click on any platform below and production calls stream in on their own — or send any recording with three lines of code.

Read the quickstart
evaluate.ts
import Roark from '@roarkhq/sdk'

// apiKey and recordingUrl are supplied by your application
const roark = new Roark({ apiKey })
await roark.calls.evaluate({
  recordingUrl,
  agent: 'support_v2',
}) // scored in seconds
Node · Python · Go — plus a REST API for CI/CD and webhooks the instant a call is scored

Works with

Vapi
Bland
Retell
LiveKit
Pipecat
ElevenLabs
Kore.ai
Google
SOC 2 Type II · HIPAA · BAA available

PII handling & data residency

Configurable redaction and retention, SSO, and audit logs — with PCI and TCPA script checks scored on every call.

Security details

Bring a recording.
We’ll score it live.

See your own agent measured on the audio it actually produced — in the demo, in real time. Stop guessing whether your voice AI works.

founders@roark.ai · we reply fast