SIMULATE · EVALUATE · IMPROVE
Voice agents for order status, returns & product Q&A

Your support agent has to quote the return policy right.

Shoppers call about order status, returns, and "is it in stock in a medium." One wrong policy quote or made-up stock number becomes a chargeback, a refund you owe, or a one-star review. Roark scores every call on the audio itself and load-tests your agent before Black Friday.

Backed by Y Combinator
Live · scoring every call · 1,284 today

Caller: I bought this six weeks ago — can I still return it?

Agent: Absolutely, you have a full 60 days.

Wrong return policy
Quoted 60 days on a 30-day policy: a refund committed that was never offered.
Policy accuracy

Scoring production voice AI for teams at

Google
AT&T
BCG
Spectrum
Aircall
Podium
radiantgraph

§01 · When the call goes wrong

Here's how a shopper call goes wrong.

Each one is a refund you owe, a sale you lost, or a brand-voice slip that shows up in a review — and the worst of them are inaudible to a tool that only reads the transcript.

01 · Returns

The 60-day return that was 30

Caller: I bought this six weeks ago — can I still return it?

Agent: Absolutely, you have a full 60 days.

The policy is 30 days. The agent stated the wrong number with total confidence and committed you to a refund you never offered. Roark checks every policy assertion against the source of truth.

Policy accuracy
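To make the idea concrete, here is a minimal sketch of checking one stated return window against a policy value. It is an illustration only, not Roark's implementation or API: the names `extractStatedDays` and `checkReturnWindow`, the `PolicyCheck` shape, and the regex are all assumptions made for this example.

```typescript
// Hypothetical sketch: names and the regex are illustrative, not Roark's API.
interface PolicyCheck {
  statedDays: number | null; // return window the agent said aloud, if any
  policyDays: number;        // window from the source-of-truth policy
  accurate: boolean;
}

// Pull the first "<N> days" or "<N>-day" figure out of an agent utterance.
function extractStatedDays(utterance: string): number | null {
  const match = utterance.match(/(\d+)[\s-]*days?\b/i);
  return match ? Number(match[1]) : null;
}

// An utterance that states no window counts as accurate: nothing was asserted.
function checkReturnWindow(utterance: string, policyDays: number): PolicyCheck {
  const statedDays = extractStatedDays(utterance);
  return {
    statedDays,
    policyDays,
    accurate: statedDays === null || statedDays === policyDays,
  };
}

const result = checkReturnWindow("Absolutely, you have a full 60 days.", 30);
console.log(result.accurate); // false: 60 days stated against a 30-day policy
```

A real check would read the policy from wherever it actually lives and handle spelled-out numbers ("sixty days"), which this sketch does not.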
02 · Product Q&A

Invented stock

Caller: Do you have the walnut dresser in stock?

Agent: Yes, plenty in stock — ships tomorrow.

It was backordered six weeks. A confident, unverified stock claim becomes a cancelled order and a chargeback. Roark flags inventory and availability stated without a tool call.

Stock accuracy
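One way to picture "stated without a tool call" is as a pass over the call's event log. The sketch below is an assumption-laden illustration, not Roark's detector: the `CallEvent` shapes and the `STOCK_CLAIM` pattern are hypothetical, and `check_inventory` is used only as an example tool name.

```typescript
// Hypothetical event shapes, for illustration only.
type CallEvent =
  | { kind: "tool_call"; name: string }
  | { kind: "agent_utterance"; text: string }
  | { kind: "caller_utterance"; text: string };

// Crude availability-claim detector; a real one would be far more robust.
const STOCK_CLAIM = /\b(in stock|out of stock|backordered|ships (today|tomorrow))\b/i;

// Return agent utterances that claim availability with no check_inventory
// call since the caller last spoke.
function unverifiedStockClaims(events: CallEvent[]): string[] {
  const flagged: string[] = [];
  let inventoryChecked = false;
  for (const event of events) {
    if (event.kind === "caller_utterance") {
      inventoryChecked = false; // a new question needs a fresh lookup
    } else if (event.kind === "tool_call") {
      if (event.name === "check_inventory") inventoryChecked = true;
    } else if (STOCK_CLAIM.test(event.text) && !inventoryChecked) {
      flagged.push(event.text);
    }
  }
  return flagged;
}

const flagged = unverifiedStockClaims([
  { kind: "caller_utterance", text: "Do you have the walnut dresser in stock?" },
  { kind: "agent_utterance", text: "Yes, plenty in stock, ships tomorrow." },
]);
console.log(flagged.length); // 1: the claim was made with no inventory lookup
```

Resetting the flag on every caller turn is a deliberate choice here: a lookup for one product should not license a claim about the next one.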
03 · Peak load

Passed in July, cratered on Black Friday

Caller: Where is my order? It said delivered.

Agent: I'm sorry, I— could you— one moment— [4.1s] …say that again?

The agent that aced a quiet July call falls apart under peak concurrency — latency spikes, barge-in breaks, turns drop. Roark load-tests with simulated concurrency so you find this before the rush, not during it.

Turn latency
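A rough sketch of what "simulated concurrency" means in practice: fire many agent turns at once and look at tail latency rather than the average. Everything here is hypothetical: `fakeAgentTurn` is a stand-in whose latency grows with the number of calls in flight (an assumed degradation model), and `loadTest` and `percentile` are names invented for this example.

```typescript
// Hypothetical stand-in for one agent turn: latency grows with the number of
// calls in flight. Replace with a real request to your agent.
let inFlight = 0;
async function fakeAgentTurn(): Promise<number> {
  inFlight += 1;
  const delayMs = 5 * inFlight; // assumed degradation model
  const start = Date.now();
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  inFlight -= 1;
  return Date.now() - start;
}

// Nearest-rank percentile over a list of samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Fire `concurrency` turns at once and report p95 turn latency in ms.
async function loadTest(concurrency: number): Promise<number> {
  const turns = Array.from({ length: concurrency }, () => fakeAgentTurn());
  return percentile(await Promise.all(turns), 95);
}

// Usage: compare a quiet call with a peak-season burst.
// loadTest(1).then((quiet) =>
//   loadTest(50).then((peak) => console.log({ quiet, peak })));
```

The p95 is used instead of the mean because the caller who waits four seconds is the one who hangs up, and an average hides that caller.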
04 · Returns

Cheerful at a furious caller

Caller: This is the THIRD time I've called about this refund.

Agent: No problem at all, happy to help! 😊

The words are polite; the delivery is tone-deaf to an angry caller. The audio model hears the caller's frustration and the agent's chirpy mismatch — the transcript reads as a perfect interaction.

Emotion match
05 · Order status

Off-brand voice

Caller: Has my order shipped yet?

Agent: Yo, lemme pull that up real quick, gimme a sec.

A premium brand sounds like a stranger. Off-brand register and slang erode the experience you spent years building. Roark scores tone and brand-voice adherence on every call.

Brand voice

§02 · From caught to fixed

Roark catches every one of these — and proves the fix.

Each failure above is filed with its evidence, replayed as a repeatable simulation until a candidate fix passes, and then verified on your next thousand live calls.

01 · Catch

The ledger above — every failure filed live, evidence attached.

See what breaks

02 · Simulate

Your fix, replayed against the exact failures above.

Testing your candidates · 82 / 240
prompt · policy-from-source rule · fail
tool · check_inventory() before claims · fail
voice · brand-voice persona · fail
03 · Review

Every change explicit and diffed — you apply it.

Your fix, diffed · support_v3 → v4
Prompt · Tool · Voice
Help the shopper with returns and product questions.
+ Quote policy only from the policy tool; never state a return window from memory.
04 · Verify

You ship — Roark confirms the metric moved on live calls.

Verifying support_v4 in production · 126 calls scored
Policy accuracy since your deploy · 71 → 78
Issue recurrence · watching…
Quality score · 78 ↑
Regressions on other metrics · none

You ship it; Roark verifies every call and load-tests at peak-season concurrency.

…and the loop runs again on the next call.

§03 · Simulate before launch

Break it in staging, not in production.

Run your agent against hundreds of simulated callers — realistic personas, accents, background noise and edge cases — and get every conversation scored before a customer ever dials in.

Scenarios & personas

Hundreds of simulated callers — the angry one, the rambler, the interrupter — built from your real call types.

45 languages & accents

Native accents, code-switching and background noise — in every market your agent answers.

Load & health tests

Peak-volume concurrency and always-on health checks, so the agent that passed in staging survives launch day.

Run it in CI

Every prompt or model change runs the suite before it merges — quality gates for conversations, not just code.

Pre-launch suite · retail_v1 · 182 / 200 passed
Returns · The 60-day return that was 30 · pass · 92
Product Q&A · Invented stock · pass · 88
Peak load · Passed in July, cratered on Black Friday · fail · 61
Returns · Cheerful at a furious caller · pass · 85
Order status · Off-brand voice · pass · 90

1 failure filed as an issue — fix it before launch, not after
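As a sketch of what a conversation quality gate could look like in a CI job, the snippet below blocks a merge when any scenario fails. The `ScenarioResult` shape and the `gate` function are hypothetical and do not reflect Roark's actual output format; the rows reuse the five sample scenarios listed above.

```typescript
// Hypothetical result shape; rows mirror the sample suite shown above.
interface ScenarioResult {
  scenario: string;
  status: "pass" | "fail";
  score: number;
}

const suite: ScenarioResult[] = [
  { scenario: "The 60-day return that was 30", status: "pass", score: 92 },
  { scenario: "Invented stock", status: "pass", score: 88 },
  { scenario: "Passed in July, cratered on Black Friday", status: "fail", score: 61 },
  { scenario: "Cheerful at a furious caller", status: "pass", score: 85 },
  { scenario: "Off-brand voice", status: "pass", score: 90 },
];

// Collect failing scenarios; the gate is open only when there are none.
function gate(results: ScenarioResult[]): { ok: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.status === "fail")
    .map((r) => r.scenario);
  return { ok: failures.length === 0, failures };
}

const outcome = gate(suite);
console.log(
  outcome.ok ? "gate passed" : `gate failed: ${outcome.failures.join(", ")}`
);
// prints "gate failed: Passed in July, cratered on Black Friday"
// In a Node-based CI step, one option is: if (!outcome.ok) process.exitCode = 1;
```

Failing on any single scenario is the strictest policy; a team could instead gate on a minimum pass rate or a per-metric score floor.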

§04 · Evals & observability

64+ metrics. Your models, not just an LLM.

Every production call scored as it lands — issues filed, alerts fired, dashboards and OTEL traces on tap, for voice calls and chat threads alike. And where most tools grade a transcript with an LLM, Roark runs purpose-built audio models on the call itself, measuring what your customer actually heard.

Everyone else

LLM reads the transcript

“The agent said the right words.” Misses how it sounded: the mispronounced product name, the flat apology, the rushed close.

“…refund within three business days.” ✓ text-match
Roark · audio model · hears the call

Audio models hear the call

Pronunciation, accent, emotion and vocal stress measured from the waveform — the signal an LLM grading text can never see.

emotion · pace: rushed close
Empathy · 84

Accuracy

LLM + rules

  • Policy accuracy
  • Stock accuracy
  • Hallucination
  • Order-detail accuracy
  • Task success
  • Repetition

Audio-native

custom models

  • Emotion match
  • Brand voice
  • Pronunciation
  • Vocal stress
  • Pace & pauses
  • Interruptions

Performance

latency & load

  • Time-to-first-word
  • Turn latency
  • Peak-load health
  • ASR WER
  • Barge-in handling
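Of the performance metrics above, ASR WER (word error rate) has a standard definition: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. The function name `wordErrorRate` and the simple whitespace tokenization are choices made for this example; it says nothing about how Roark computes the metric.

```typescript
// Word error rate: (substitutions + deletions + insertions) / reference words,
// computed as a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] = edit distance between the ref words seen so far and hyp[0..j).
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word ("thirty" heard as "sixty") in a seven-word reference.
console.log(
  wordErrorRate(
    "you have thirty days to return it",
    "you have sixty days to return it"
  )
);
```

The example is a reminder that a low WER can still hide an expensive error: one wrong word out of seven is the difference between a 30-day and a 60-day policy. Punctuation is not stripped here, so real transcripts would need normalizing first.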

Conversational

LLM + rules

  • Tone
  • De-escalation
  • Empathy
  • Script adherence
  • Refund eligibility
64+ metrics out of the box
custom metrics, your rules
Audio + LLM models on every call

§05 · Get started

First call scored in under a minute.

One click on any platform below and production calls stream in on their own — or send any recording with three lines of code.

Read the quickstart
evaluate.ts
// apiKey and recordingUrl are assumed to be defined by the caller
import Roark from '@roarkhq/sdk'
const roark = new Roark({ apiKey })
await roark.calls.evaluate({
  recordingUrl, agent: 'support_v2',
}) // scored in seconds
Node · Python · Go — plus a REST API for CI/CD and webhooks the instant a call is scored

Works with

Vapi
Bland
Retell
LiveKit
Pipecat
ElevenLabs
Kore.ai
Google
SOC 2 Type II · HIPAA · BAA available

Payment & order-data handling

PCI-aware scripting checks, configurable retention, and redaction of card and order PII before storage.

Security details

Bring a recording.
We’ll score it live.

See your own agent measured on the audio it actually produced — in the demo, in real time. Stop guessing whether your voice AI works.

founders@roark.ai · we reply fast