Your support agent has to quote the return policy right.
Shoppers call about order status, returns, and "is it in stock in a medium." One misquoted policy or a made-up stock number is a chargeback, a refund you owe, or a one-star review. Roark scores every call on the audio itself and load-tests your agent before Black Friday.
Caller: I bought this six weeks ago — can I still return it?
Agent: Absolutely, you have a full 60 days.
Wrong return policy: quoted 60 days on a 30-day policy, committing a refund that was never offered.
Policy accuracy
§01 · When the call goes wrong
Here's how a shopper call goes wrong.
Each one is a refund you owe, a sale you lost, or a brand-voice slip that shows up in a review — and the worst of them are inaudible to a tool that only reads the transcript.
The 60-day return that was 30
Caller: I bought this six weeks ago — can I still return it?
Agent: Absolutely, you have a full 60 days.
The policy is 30 days. The agent stated the wrong number with total confidence and committed you to a refund you never offered. Roark checks every policy assertion against the source of truth.
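To make that kind of check concrete, here is a minimal sketch that compares a return window quoted by the agent with the window on record. It is an illustration only, not Roark's implementation: `extractQuotedDays`, `checkReturnWindow`, and the regex-based extraction are all assumptions, and a real check would also need to handle spoken numbers ("sixty days"), weeks, and looser phrasing.

```typescript
// Hypothetical sketch: compares a return window quoted by the agent with
// the window in the store's policy record. Not Roark's implementation.

interface PolicyCheck {
  quotedDays: number | null; // null when the utterance quotes no window
  policyDays: number;
  accurate: boolean;
}

// Extracts the first "<number> day(s)" or "<number>-day" mention.
function extractQuotedDays(utterance: string): number | null {
  const match = utterance.match(/(\d+)[\s-]*days?\b/i);
  return match ? Number(match[1]) : null;
}

function checkReturnWindow(utterance: string, policyDays: number): PolicyCheck {
  const quotedDays = extractQuotedDays(utterance);
  // An utterance that quotes no window makes no claim to contradict.
  const accurate = quotedDays === null || quotedDays === policyDays;
  return { quotedDays, policyDays, accurate };
}

// The agent line from the call above, checked against a 30-day policy.
const result = checkReturnWindow("Absolutely, you have a full 60 days.", 30);
console.log(result.quotedDays, result.accurate);
```

In this sketch the policy value is passed in by the caller, so the check is only as reliable as the policy record it is given.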
Policy accuracy
Invented stock
Caller: Do you have the walnut dresser in stock?
Agent: Yes, plenty in stock — ships tomorrow.
It was backordered six weeks. A confident, unverified stock claim becomes a cancelled order and a chargeback. Roark flags inventory and availability stated without a tool call.
Stock accuracy
Passed in July, cratered on Black Friday
Caller: Where is my order? It said delivered.
Agent: I'm sorry, I— could you— one moment— [4.1s] …say that again?
The agent that aced a quiet July call falls apart under peak concurrency — latency spikes, barge-in breaks, turns drop. Roark load-tests with simulated concurrency so you find this before the rush, not during it.
Turn latency
Cheerful at a furious caller
Caller: This is the THIRD time I've called about this refund.
Agent: No problem at all, happy to help! 😊
The words are polite; the delivery is tone-deaf to an angry caller. The audio model hears the caller's frustration and the agent's chirpy mismatch — the transcript reads as a perfect interaction.
Emotion match
Off-brand voice
Caller: Has my order shipped yet?
Agent: Yo, lemme pull that up real quick, gimme a sec.
A premium brand sounds like a stranger. Off-brand register and slang erode the experience you spent years building. Roark scores tone and brand-voice adherence on every call.
Brand voice
§02 · From caught to fixed
Roark catches every one of these — and proves the fix.
Each failure above is filed with its evidence, becomes a repeatable simulation until a candidate passes, and is verified on your next thousand live calls.
Your fix, replayed against the exact failures above.
Every change explicit and diffed — you apply it.
You ship — Roark confirms the metric moved on live calls.
you ship it — Roark verifies every call, load-tested at peak-season concurrency
…and the loop runs again on the next call.
§03 · Simulate before launch
Break it in staging, not in production.
Run your agent against hundreds of simulated callers — realistic personas, accents, background noise and edge cases — and get every conversation scored before a customer ever dials in.
Scenarios & personas
Hundreds of simulated callers — the angry one, the rambler, the interrupter — built from your real call types.
45 languages & accents
Native accents, code-switching and background noise — in every market your agent answers.
Load & health tests
Peak-volume concurrency and always-on health checks, so the agent that passed in staging survives launch day.
Run it in CI
Every prompt or model change runs the suite before it merges — quality gates for conversations, not just code.
1 failure filed as an issue — fix it before launch, not after
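As a rough picture of what a conversation quality gate in CI can look like, the sketch below runs a set of scenarios with a bounded number of calls in flight and fails the gate if any scenario fails or exceeds a latency budget. Everything here is hypothetical: `CallRunner`, `runSuite`, `gate`, the scenario names, and the 1500 ms budget are assumptions made for illustration, and `stubRunner` stands in for whatever actually executes a simulated call. None of it is the Roark SDK.

```typescript
// Hypothetical sketch of a CI quality gate for simulated calls.
// `CallRunner` stands in for whatever executes one simulated conversation.

interface CallScore {
  scenario: string;
  passed: boolean;
  turnLatencyMs: number;
}

type CallRunner = (scenario: string) => Promise<CallScore>;

// Runs all scenarios with at most `concurrency` calls in flight at once.
async function runSuite(
  scenarios: string[],
  runner: CallRunner,
  concurrency: number,
): Promise<CallScore[]> {
  const results: CallScore[] = new Array(scenarios.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker claims the next unclaimed scenario until none remain.
    while (next < scenarios.length) {
      const index = next++;
      results[index] = await runner(scenarios[index]);
    }
  }
  const workerCount = Math.min(concurrency, scenarios.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Gate: every scenario passes and every turn latency stays within budget.
function gate(scores: CallScore[], maxLatencyMs: number): boolean {
  return scores.every((s) => s.passed && s.turnLatencyMs <= maxLatencyMs);
}

// Stub runner for illustration: one scenario is made to fail.
const stubRunner: CallRunner = async (scenario) => ({
  scenario,
  passed: scenario !== "angry-repeat-caller",
  turnLatencyMs: 800,
});

runSuite(["order-status", "return-window", "angry-repeat-caller"], stubRunner, 2)
  .then((scores) => {
    console.log(gate(scores, 1500) ? "gate passed" : "gate failed");
  });
```

In a real pipeline the gate result would typically set the job's exit status so that a failing scenario blocks the merge.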
§04 · Evals & observability
64+ metrics. Your models, not just an LLM.
Every production call scored as it lands — issues filed, alerts fired, dashboards and OTEL traces on tap, for voice calls and chat threads alike. And where most tools grade a transcript with an LLM, Roark runs purpose-built audio models on the call itself, measuring what your customer actually heard.
Everyone else
LLM reads the transcript
“The agent said the right words.” Misses how it sounded — the mispronounced drug name, the flat apology, the rushed close.
Audio models hear the call
Pronunciation, accent, emotion and vocal stress measured from the waveform — the signal an LLM grading text can never see.
Accuracy (LLM + rules)
- Policy accuracy
- Stock accuracy
- Hallucination
- Order-detail accuracy
- Task success
- Repetition
Audio-native (custom models)
- Emotion match
- Brand voice
- Pronunciation
- Vocal stress
- Pace & pauses
- Interruptions
Performance (latency & load)
- Time-to-first-word
- Turn latency
- Peak-load health
- ASR WER
- Barge-in handling
Conversational (LLM + rules)
- Tone
- De-escalation
- Empathy
- Script adherence
- Refund eligibility
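Most of the metrics listed above depend on models or rules this page does not describe, but ASR WER (word error rate) has a standard definition: the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below shows that textbook calculation. It is not taken from Roark, and the `tokenize` helper and its ASCII-only normalization are simplifying assumptions.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference words,
// computed with a word-level Levenshtein distance.

// Simplified normalization (an assumption): lowercase, ASCII letters,
// digits and apostrophes only.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference transcript must contain at least one word");
  }
  // prev[j] holds the distance between the reference words processed so far
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of seven reference words.
const wer = wordErrorRate(
  "Where is my order? It said delivered.",
  "where is my order it says delivered",
);
console.log(wer.toFixed(3)); // prints "0.143"
```

Because insertions count as errors, WER can exceed 1.0 when the recognizer outputs many more words than the reference contains.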
§05 · Get started
First call scored in under a minute.
One click on any platform below and production calls stream in on their own — or send any recording with three lines of code.
Read the quickstart
import Roark from '@roarkhq/sdk'
const roark = new Roark({ apiKey })
await roark.calls.evaluate({ recordingUrl, agent: 'support_v2' }) // scored in seconds
Works with
Also built for
Customer Support
Resolve it for real, escalate cleanly, and hear the frustration the transcript hides.
Hospitality
Understand every accent, get the booking right, sound warm not robotic — and catch the upsell you left on the table, in 45 languages.
Home Services
Get the address right, book the real window, and never quote a price you can’t honor.
Payment & order-data handling
PCI-aware scripting checks, configurable retention, and redaction of card and order PII before storage.
Bring a recording.
We’ll score it live.
See your own agent measured on the audio it actually produced — in the demo, in real time. Stop guessing whether your voice AI works.
founders@roark.ai · we reply fast