SIMULATE · EVALUATE · IMPROVE

Voice agents for order status, returns & product Q&A

Your support agent has to quote the return policy right.

Shoppers call about order status, returns, and "is it in stock in a medium." One wrong policy or a made-up stock number is a chargeback, a refund you owe, or a one-star review. Roark scores every call from the audio itself and load-tests your agent before Black Friday.

Book a demo · See what breaks

Backed by YC

Live · scoring every call · 1,284 today

Caller: I bought this six weeks ago — can I still return it?

Agent: Absolutely, you have a full 60 days.

Wrong return policy: quoted 60 days on a 30-day policy, a refund committed that was never offered.

Policy accuracy

Scoring production voice AI for teams at

radiantgraph

§01 · When the call goes wrong

Here's how a shopper call goes wrong.

Each one is a refund you owe, a sale you lost, or a brand-voice slip that shows up in a review — and the worst of them are inaudible to a tool that only reads the transcript.

01 · Returns

The 60-day return that was 30

Caller: I bought this six weeks ago — can I still return it?

Agent: Absolutely, you have a full 60 days.

The policy is 30 days. The agent stated the wrong number with total confidence and committed you to a refund you never offered. Roark checks every policy assertion against the source of truth.

Policy accuracy
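A policy-accuracy check of the kind described above can be pictured as a comparison between the number the agent said and the number in the policy source. The sketch below is illustrative only: `checkReturnWindow`, the `PolicyCheck` shape, and the regex are assumptions made for this example, not Roark's implementation or API, and a real check would need to tell a return window apart from other durations such as shipping times.

```typescript
// Hypothetical sketch: compare a return window quoted by the agent
// against the window defined in the policy source of truth.
interface PolicyCheck {
  statedDays: number | null; // null when the agent quoted no number of days
  policyDays: number;
  ok: boolean;
}

function checkReturnWindow(agentUtterance: string, policyDays: number): PolicyCheck {
  // Matches phrases such as "60 days", "30-day", or "1 day".
  const match = agentUtterance.match(/(\d+)[\s-]*days?/i);
  const statedDays = match ? Number(match[1]) : null;
  // An utterance with no quoted window cannot contradict the policy.
  const ok = statedDays === null || statedDays === policyDays;
  return { statedDays, policyDays, ok };
}

const result = checkReturnWindow('Absolutely, you have a full 60 days.', 30);
console.log(result.ok ? 'policy quoted correctly' : `wrong window: said ${result.statedDays}, policy is ${result.policyDays}`);
```

Passing the policy value in as an argument keeps the check tied to whatever system holds the current policy, rather than to a number hard-coded in the test.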

02 · Product Q&A

Invented stock

Caller: Do you have the walnut dresser in stock?

Agent: Yes, plenty in stock — ships tomorrow.

It was backordered six weeks. A confident, unverified stock claim becomes a cancelled order and a chargeback. Roark flags inventory and availability stated without a tool call.

Stock accuracy
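To make "availability stated without a tool call" concrete, here is a minimal sketch that walks a call's turns and flags agent stock claims made before any inventory lookup. The `Turn` shape, the keyword regex, and `unverifiedStockClaims` are hypothetical names for this example; `check_inventory` is borrowed from the tool name shown elsewhere on this page, and none of this is presented as how Roark actually detects the problem.

```typescript
// Hypothetical sketch: flag agent turns that assert availability
// before any inventory tool call has happened in the conversation.
interface Turn {
  role: 'caller' | 'agent';
  text: string;
  toolCalls: string[]; // names of tools invoked while producing this turn
}

// Assumed phrases that count as an availability claim.
const STOCK_CLAIM = /\b(in stock|out of stock|backordered|ships (today|tomorrow))\b/i;

function unverifiedStockClaims(turns: Turn[], tool = 'check_inventory'): Turn[] {
  let verified = false;
  const flagged: Turn[] = [];
  for (const turn of turns) {
    if (turn.role !== 'agent') continue;
    if (turn.toolCalls.includes(tool)) verified = true;
    if (STOCK_CLAIM.test(turn.text) && !verified) flagged.push(turn);
  }
  return flagged;
}

const call: Turn[] = [
  { role: 'caller', text: 'Do you have the walnut dresser in stock?', toolCalls: [] },
  { role: 'agent', text: 'Yes, plenty in stock. Ships tomorrow.', toolCalls: [] },
];
console.log(unverifiedStockClaims(call).length); // number of unverified claims
```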

03 · Peak load

Passed in July, cratered on Black Friday

Caller: Where is my order? It said delivered.

Agent: I'm sorry, I— could you— one moment— [4.1s] …say that again?

The agent that aced a quiet July call falls apart under peak concurrency — latency spikes, barge-in breaks, turns drop. Roark load-tests with simulated concurrency so you find this before the rush, not during it.

Turn latency
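The idea behind a concurrency test can be sketched in a few lines: start many simulated callers at once, time each one, and look at a tail percentile rather than the average. Everything here is an assumption for illustration, including the `AgentCall` type, the `fakeAgent` stand-in that just sleeps, and the nearest-rank percentile; it is not Roark's load-testing implementation.

```typescript
// Hypothetical sketch: time N concurrent simulated callers and report p95 latency.
type AgentCall = (callerId: number) => Promise<void>;

async function measureLatencies(call: AgentCall, concurrency: number): Promise<number[]> {
  const timed = async (id: number): Promise<number> => {
    const start = Date.now();
    await call(id);
    return Date.now() - start; // elapsed milliseconds for this caller
  };
  // All callers start together, which is what creates the concurrency.
  return Promise.all(Array.from({ length: concurrency }, (_, id) => timed(id)));
}

// Nearest-rank percentile over a non-empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

async function main(): Promise<void> {
  // Stand-in for a real agent: responds after 20 to 60 ms.
  const fakeAgent: AgentCall = (callerId) =>
    new Promise((resolve) => setTimeout(resolve, 20 + (callerId % 5) * 10));
  const latencies = await measureLatencies(fakeAgent, 50);
  console.log(`p95 latency across 50 callers: ${percentile(latencies, 95)} ms`);
}

main().catch(console.error);
```

A tail percentile is used because a single July-style quiet call tells you the best case, while the caller who waits 4.1 seconds lives in the tail.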

04 · Returns

Cheerful at a furious caller

Caller: This is the THIRD time I've called about this refund.

Agent: No problem at all, happy to help! 😊

The words are polite; the delivery is tone-deaf to an angry caller. The audio model hears the caller's frustration and the agent's chirpy mismatch — the transcript reads as a perfect interaction.

Emotion match

05 · Order status

Off-brand voice

Caller: Has my order shipped yet?

Agent: Yo, lemme pull that up real quick, gimme a sec.

A premium brand sounds like a stranger. Off-brand register and slang erode the experience you spent years building. Roark scores tone and brand-voice adherence on every call.

Brand voice

§02 · From caught to fixed

Roark catches every one of these — and proves the fix.

Each failure above is filed with its evidence, becomes a repeatable simulation that runs until a candidate fix passes, and is then verified on your next thousand live calls.

01 · Catch

The ledger above — every failure filed live, evidence attached.

See what breaks

02 · Simulate

Your fix, replayed against the exact failures above.

Testing your candidates · 82 / 240

prompt · policy-from-source rule · fail

tool · check_inventory() before claims · fail

voice · brand-voice persona · fail

03 · Review

Every change explicit and diffed — you apply it.

Your fix, diffed · support_v3 → v4

Prompt · Tool · Voice

− Help the shopper with returns and product questions.

+ Quote policy only from the policy tool; never state a return window from memory.

04 · Verify

You ship — Roark confirms the metric moved on live calls.

Verifying support_v4 in production · 126 calls scored

Policy accuracy since your deploy · 71 → 78

Issue recurrence · watching…

Quality score · 78 ↑

Regressions on other metrics · none

you ship it — Roark verifies every call, load-tested at peak-season concurrency

…and the loop runs again on the next call.

§03 · Simulate before launch

Break it in staging, not in production.

Run your agent against hundreds of simulated callers — realistic personas, accents, background noise and edge cases — and get every conversation scored before a customer ever dials in.

Scenarios & personas

Hundreds of simulated callers — the angry one, the rambler, the interrupter — built from your real call types.

45 languages & accents

Native accents, code-switching and background noise — in every market your agent answers.

Load & health tests

Peak-volume concurrency and always-on health checks, so the agent that passed in staging survives launch day.

Run it in CI

Every prompt or model change runs the suite before it merges — quality gates for conversations, not just code.
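A quality gate for conversations can work the same way as a failing unit test: the pipeline step exits non-zero when any scenario scores below a threshold. The sketch below assumes a hypothetical `ScenarioResult` shape and a pass mark of 80, neither of which comes from Roark's documentation; the sample scores mirror the pre-launch suite shown in this section.

```typescript
// Hypothetical sketch: turn scenario scores into a CI exit code.
interface ScenarioResult {
  name: string;
  score: number; // assumed 0 to 100 scale
}

function failingScenarios(results: ScenarioResult[], minScore: number): ScenarioResult[] {
  return results.filter((r) => r.score < minScore);
}

// 0 lets the merge proceed, 1 blocks it.
function gateExitCode(results: ScenarioResult[], minScore: number): number {
  return failingScenarios(results, minScore).length === 0 ? 0 : 1;
}

const suite: ScenarioResult[] = [
  { name: 'The 60-day return that was 30', score: 92 },
  { name: 'Invented stock', score: 88 },
  { name: 'Passed in July, cratered on Black Friday', score: 61 },
  { name: 'Cheerful at a furious caller', score: 85 },
  { name: 'Off-brand voice', score: 90 },
];

const MIN_SCORE = 80; // assumed threshold for this example
for (const failure of failingScenarios(suite, MIN_SCORE)) {
  console.log(`FAIL ${failure.name}: ${failure.score}`);
}
console.log(`exit code: ${gateExitCode(suite, MIN_SCORE)}`);
```

In a real pipeline script the returned code would be handed to `process.exit`, so a failing scenario blocks the merge just as a failing test would.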

Pre-launch suite · retail_v1 · 182 / 200 passed

Returns · The 60-day return that was 30 · pass · 92

Product Q&A · Invented stock · pass · 88

Peak load · Passed in July, cratered on Black Friday · fail · 61

Returns · Cheerful at a furious caller · pass · 85

Order status · Off-brand voice · pass · 90

1 failure filed as an issue — fix it before launch, not after

Run your first suite

§04 · Evals & observability

64+ metrics. Your models, not just an LLM.

Every production call scored as it lands — issues filed, alerts fired, dashboards and OTEL traces on tap, for voice calls and chat threads alike. And where most tools grade a transcript with an LLM, Roark runs purpose-built audio models on the call itself, measuring what your customer actually heard.

Everyone else

LLM reads the transcript

“The agent said the right words.” Misses how it sounded — the mispronounced drug name, the flat apology, the rushed close.

“…refund within three business days.” ✓ text-match

Roark · audio model · hears the call

Audio models hear the call

Pronunciation, accent, emotion and vocal stress measured from the waveform — the signal an LLM grading text can never see.

emotion · pace: rushed close

Empathy

Accuracy

LLM + rules

Policy accuracy
Stock accuracy
Hallucination
Order-detail accuracy
Task success
Repetition

Audio-native

custom models

Emotion match
Brand voice
Pronunciation
Vocal stress
Pace & pauses
Interruptions

Performance

latency & load

Time-to-first-word
Turn latency
Peak-load health
ASR WER
Barge-in handling

Conversational

LLM + rules

Tone
De-escalation
Empathy
Script adherence
Refund eligibility

64+ metrics out of the box

∞ custom metrics, your rules

Audio + LLM models on every call
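Of the metrics listed above, ASR WER (word error rate) has a standard textbook definition: the word-level edit distance between what was said and what the recognizer transcribed, divided by the number of words actually said. The sketch below implements that general definition with simple lowercasing and whitespace tokenization; it is not a description of how Roark computes the metric, and `wordErrorRate` is a name chosen for this example.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference word count,
// computed with a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

console.log(wordErrorRate('refund within three business days', 'refund within free business days'));
```

One substituted word out of five reference words gives a rate of 0.2 in the example call.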

§05 · Get started

First call scored in under a minute.

One click on any platform below and production calls stream in on their own — or send any recording with three lines of code.

Read the quickstart

evaluate.ts

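import Roark from '@roarkhq/sdk'
// Placeholders for illustration: the env var name and URL below are assumptions.
const apiKey = process.env.ROARK_API_KEY
const recordingUrl = 'https://example.com/recordings/call.wav'
const roark = new Roark({ apiKey })
await roark.calls.evaluate({
  recordingUrl, agent: 'support_v2',
}) // scored in seconds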

Node · Python · Go — plus a REST API for CI/CD and webhooks the instant a call is scored

Works with