Speech-to-Text API Pricing: Every Provider Compared

We analyzed 17+ STT providers — proprietary APIs and hosted open models — so you don't have to. Pricing, accuracy, latency, and which one to pick for your use case.

Choosing a speech-to-text API shouldn't require a spreadsheet and three days of research. But here we are — with dozens of providers, wildly different pricing models, and performance claims that are hard to verify. We built this guide to cut through the noise.

Below you'll find current pricing for every major STT provider, split into proprietary APIs (closed-source, fully managed) and hosted open models (open-source models like Whisper, deployed on managed infrastructure). All prices are for standard English batch transcription unless noted.

🏢 Proprietary STT Pricing

These are the traditional, closed-source providers. You get polished APIs, SLAs, and enterprise support — at a premium. Prices shown per hour of audio processed.

| Platform | Batch / Async | Real-Time | Free Tier | Key Notes |
|---|---|---|---|---|
| OpenAI Whisper | $0.36/hr | N/A | None | $0.006/min. Highly accurate multilingual. No native streaming. |
| Google Cloud STT | $0.18–$0.96/hr | $0.36–$1.92/hr | 60 min/mo | V2 Chirp models start lower. Batch discounts up to 50%. |
| Amazon Transcribe | $1.44/hr | $2.88/hr | 60 min/mo | Medical/call analytics add $0.07–$0.14/hr. Deep AWS integration. |
| Microsoft Azure | $0.45–$1.00/hr | $1.00–$1.20/hr | 5 hr/mo | Batch v3.2 at $0.45/hr. Custom models add 20%. Includes translation. |
| Deepgram | $0.26/hr | $0.46/hr | $200 credits | Nova-3 model. Ultra-low latency. Volume discounts from 500 hrs. |
| AssemblyAI | $0.12–$0.31/hr | $0.31–$0.49/hr | $50 credits | Nano ($0.12/hr) for speed, Best ($0.31/hr) for accuracy. Universal-2 model. |
| Rev AI | $0.18/hr | $0.30/hr | 15 min trial | $0.003/min async. 36+ languages. Known for high accuracy. |
| IBM Watson STT | $0.60/hr | $1.20/hr | 500 min/mo | $0.01/min. Custom model support. Strong enterprise customization. |
| Speechmatics | $0.30+/hr | $0.60+/hr | 8 hr/mo | 50+ languages. 20% discount over 500 hours. Pay-as-you-grow. |
| Gladia | $0.61/hr | Async focus | 10 hr/mo | 100+ languages. Code-switching support. Generous free tier. |

Takeaway: On batch price, AssemblyAI's Nano tier ($0.12/hr), Rev AI ($0.18/hr), and Deepgram ($0.26/hr) lead. For real-time streaming, Rev AI ($0.30/hr), AssemblyAI ($0.31/hr), and Deepgram ($0.46/hr) are the value leaders. At list price, AWS and the upper tiers of Google and Azure cost roughly 8–12x the cheapest batch rate, but they offer deeper ecosystem integration.
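Several providers quote per-minute rates, so it helps to normalize everything to per-hour before comparing. Here is a minimal sketch of that conversion, using the per-minute figures from the Key Notes column above. The dictionary and function names are ours, picked for illustration, and `Decimal` is used only to keep the cents exact.

```python
from decimal import Decimal

# Per-minute list prices (USD) quoted in the Key Notes column above.
PER_MINUTE_USD = {
    "OpenAI Whisper": Decimal("0.006"),
    "Rev AI": Decimal("0.003"),
    "IBM Watson STT": Decimal("0.01"),
}


def per_hour(per_minute: Decimal) -> Decimal:
    """Convert a per-minute rate to a per-hour rate, rounded to cents."""
    return (per_minute * 60).quantize(Decimal("0.01"))


for provider, rate in PER_MINUTE_USD.items():
    print(f"{provider}: ${rate}/min = ${per_hour(rate)}/hr")
```

The results line up with the hourly column in the table: $0.36, $0.18, and $0.60 per hour.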

🔓 Hosted Open Model Pricing

These are open-source models (primarily Whisper, plus Meta's Wav2Vec 2.0 and NVIDIA's Parakeet) hosted on managed cloud platforms. You get open-source flexibility, typically at 20–60% less than proprietary APIs, with comparable accuracy.

| Platform | Model(s) | Batch / Async | Real-Time | Free Tier | Notes |
|---|---|---|---|---|---|
| Replicate | whisper-large-v3 | $0.18–$0.30/hr | N/A | $10 credits | GPU-time billing. Supports translation/language ID. No cold starts. |
| Hugging Face | whisper-large-v3, wav2vec2 | $0.24–$0.60/hr | ~$2.40+/hr | 1 hr/mo (PRO) | Dedicated endpoints. Up to 96% savings on batch. Deploy from Hub. |
| Groq | whisper-large-v3 | $0.11/hr | $0.22/hr | 10 min/day | Up to 300x real-time speed. $0.0018/min batch. Accent-robust. |
| BorgCloud | whisper-large-v3-turbo | $0.06/hr | N/A | None | $0.001/min. Cheapest Whisper hosting available. Enterprise focus. |
| Fal.ai | whisper-large-v3, parakeet-tdt | $0.12–$0.24/hr | $0.24–$0.48/hr | $5 credits | Serverless. Parakeet transcribes 1 hr audio in ~1 sec. |
| Deepinfra | whisper-large-v3, voxtral-mini | $0.15/hr | $0.30/hr | $10 credits | $0.0025/min. Long-context audio. Fine-tuning supported. |
| Modal | parakeet-tdt, whisper variants | $0.20–$0.40/hr | Limited | $30/mo credits | Serverless functions. Parakeet WER ~6%. Ideal for batch. |

Takeaway: BorgCloud ($0.06/hr) and Groq ($0.11/hr) are absurdly cheap for Whisper-level accuracy: 400 hours of transcription costs $24 on BorgCloud and $44 on Groq. Even the pricier hosted options generally beat proprietary APIs by 20–60%. The trade-off: less polish, fewer enterprise features, and you're managing model selection yourself.
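To sanity-check the 400-hour figure against your own volume, a quick calculation like the following is enough. It is a rough sketch that assumes flat list pricing with no free tier, credits, or volume discounts, and the names are ours.

```python
# Batch list prices in USD per audio hour, copied from the table above.
BATCH_USD_PER_HOUR = {
    "BorgCloud": 0.06,
    "Groq": 0.11,
    "Deepinfra": 0.15,
}


def batch_cost(hours: float, usd_per_hour: float) -> float:
    """Total cost in USD for a given number of audio hours, rounded to cents."""
    return round(hours * usd_per_hour, 2)


# Print the cheapest provider first.
for provider, price in sorted(BATCH_USD_PER_HOUR.items(), key=lambda kv: kv[1]):
    print(f"{provider}: 400 hr -> ${batch_cost(400, price):.2f}")
```

Swap in your own monthly hours to see where the gap between providers starts to matter for your budget.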

🎯 Accuracy Benchmarks (WER)

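Word Error Rate (WER) measures transcription accuracy as the share of reference words the system gets wrong: substitutions, deletions, and insertions divided by the number of words in the reference transcript. Lower is better. These are approximate benchmarks on standard English test sets. Real-world performance varies by audio quality, accent, and domain.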
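If you want to measure WER on your own audio rather than trust published numbers, the metric is simple to compute: it is the word-level edit distance between a reference transcript and the model output, divided by the reference length. The sketch below is a minimal version. The normalization (lowercasing and splitting on whitespace) is our simplifying assumption; published benchmarks usually also normalize punctuation and numbers, so expect small differences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown dog"))
```

One wrong word out of four gives a WER of 0.25, or 25%. The figures listed below are the same ratio measured over much larger test sets.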

Proprietary Models

Deepgram Nova-3: 3.2% WER
AssemblyAI Universal-2: 4.1% WER
Google Chirp 2: 4.5% WER
Rev AI: 4.8% WER
Azure Speech (custom): 5.2% WER
Amazon Transcribe: 6.0% WER

Open Models (Hosted)

Whisper large-v3: 4.2% WER
Whisper large-v3-turbo: 5.0% WER
NVIDIA Parakeet TDT: 6.0% WER
Distil-Whisper: 6.5% WER
Wav2Vec 2.0: 8.0% WER

Key insight: Whisper large-v3 at 4.2% WER trails only Deepgram Nova-3 (3.2%) and AssemblyAI Universal-2 (4.1%) and beats the other four proprietary APIs listed, at a fraction of the cost. The gap between open and closed-source has essentially closed for English transcription. For multilingual work, Whisper handles 99+ languages; Parakeet is fastest (3,386x real-time) but English-focused.

Latency Comparison

Processing speed matters if you're building real-time voice agents or need fast turnaround on batch jobs. Here's how providers stack up on time to transcribe 1 hour of audio:

| Provider | Model | Time for 1hr Audio | Real-Time Factor |
|---|---|---|---|
| Fal.ai | Parakeet TDT | ~1 second | 3,386x |
| Groq | Whisper large-v3 | ~12 seconds | 300x |
| Deepgram | Nova-3 | ~30 seconds | 120x |
| AssemblyAI | Universal-2 | ~45 seconds | 80x |
| Replicate | Whisper large-v3 | ~3–5 minutes | 12–20x |
| OpenAI | Whisper API | ~5–8 minutes | 8–12x |
| Google Cloud | Chirp 2 | ~5–10 minutes | 6–12x |
| AWS Transcribe | Standard | ~8–15 minutes | 4–8x |
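The two columns in the table are the same number expressed two ways: processing time is audio duration divided by the real-time factor. A small sketch, with names of our own choosing, makes the relationship explicit and lets you estimate turnaround for any file length. It ignores upload time, queueing, and cold starts, which the table does not cover.

```python
def seconds_to_transcribe(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated processing time for a clip at a given real-time factor."""
    return audio_seconds / real_time_factor


# Real-time factors copied from the table above.
REAL_TIME_FACTOR = {
    "Fal.ai Parakeet TDT": 3386,
    "Groq Whisper large-v3": 300,
    "Deepgram Nova-3": 120,
    "AssemblyAI Universal-2": 80,
}

ONE_HOUR = 3600  # seconds of audio

for name, rtf in REAL_TIME_FACTOR.items():
    print(f"{name}: {seconds_to_transcribe(ONE_HOUR, rtf):.1f} s per audio hour")
```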

🎯 Best Pick by Use Case

💰 Cheapest Bulk Transcription

Processing thousands of hours of recordings with cost as the top priority.

→ BorgCloud ($0.06/hr)

🎙 Real-Time Voice Agents

Live calls, voice assistants, and real-time captions with sub-second latency.

→ Deepgram Nova-3 ($0.46/hr)

🌍 Multilingual / Indic

Transcribing content in Hindi, Tamil, or 99+ languages with high accuracy.

→ Whisper large-v3 via Groq ($0.11/hr)

🏥 Enterprise / Healthcare

HIPAA compliance, custom models, and SLAs for regulated industries.

→ Azure / AWS Transcribe

🏎 Fastest Processing

When you need 1 hour transcribed in 1 second, regardless of cost.

→ Fal.ai Parakeet ($0.12/hr)

🎯 Best Overall Accuracy

Lowest WER on clean English audio, no compromises on quality.

→ Deepgram Nova-3 (3.2% WER)

⚠️ Hidden Costs to Watch

The headline per-hour price is just the beginning. Here's what most comparison guides don't tell you:

Speaker diarization adds 10–20% to the base cost on most platforms. Custom vocabulary or domain-specific models add another 20–40% (Azure, IBM). Storage is billed separately: if you transcribe with AWS Transcribe, you also pay for the S3 storage that holds your audio files. And GPU upgrades on hosted platforms like Hugging Face can raise costs 20–50% if you need A100s instead of T4s for throughput.
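To see what those add-ons do to a list price, here is a rough sketch. It assumes each surcharge is a percentage of the base rate and that surcharges add up rather than compound. That is our simplification; providers bill these differently, so check the pricing page for the exact rules. The base rate is the $0.45/hr Azure batch figure from the pricing table.

```python
from typing import Iterable


def effective_rate(base_usd_per_hour: float, surcharges: Iterable[float]) -> float:
    """Base hourly rate plus percentage surcharges, each applied to the base."""
    return base_usd_per_hour * (1 + sum(surcharges))


BASE = 0.45  # USD per audio hour, Azure batch rate from the pricing table

# Low end: diarization +10%, custom model +20%.
low = effective_rate(BASE, [0.10, 0.20])
# High end: diarization +20%, custom model +40%.
high = effective_rate(BASE, [0.20, 0.40])

print(f"${BASE:.2f}/hr list price -> ${low:.3f} to ${high:.3f}/hr with add-ons")
```

Under these assumptions the effective rate is 30–60% above the list price, before storage or GPU costs are counted.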

Then there's the cost nobody prices: integration time. Every provider has a different API. Different auth. Different response formats. Different error handling. Switching from Deepgram to Groq just to compare costs means rewriting your integration layer.

Unless you use an aggregator.

🔊 Why Aggregation Wins

Here's the problem this article illustrates: the "best" STT provider depends entirely on your use case. Bulk Hindi transcription? Groq with Whisper. Real-time English voice agent? Deepgram. Cheapest possible? BorgCloud. Enterprise compliance? Azure.

Most products need more than one provider. Your live call feature needs Deepgram's low latency. Your podcast processing needs Whisper's multilingual accuracy. Your budget tier needs the cheapest open model available. That's three integrations, three billing systems, three sets of documentation.
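One common way to contain that sprawl in your own code is a thin routing layer: each provider sits behind the same callable signature, and the application picks a route by use case. The sketch below shows the pattern only. `SttRouter`, `Transcript`, and `fake_backend` are hypothetical names of ours, and the backends are stubs standing in for real SDK calls, not any provider's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A backend is any callable that turns raw audio bytes into transcript text.
Backend = Callable[[bytes], str]


@dataclass(frozen=True)
class Transcript:
    text: str
    provider: str


class SttRouter:
    """Maps a use case to a provider so callers never touch provider SDKs."""

    def __init__(self) -> None:
        self._routes: Dict[str, Tuple[str, Backend]] = {}

    def register(self, use_case: str, provider: str, backend: Backend) -> None:
        self._routes[use_case] = (provider, backend)

    def transcribe(self, use_case: str, audio: bytes) -> Transcript:
        if use_case not in self._routes:
            raise KeyError(f"no provider registered for use case {use_case!r}")
        provider, backend = self._routes[use_case]
        return Transcript(text=backend(audio), provider=provider)


def fake_backend(label: str) -> Backend:
    """Stub backend (hypothetical); replace with a real provider call."""
    return lambda audio: f"[{label} transcript of {len(audio)} bytes]"


router = SttRouter()
router.register("live_call", "Deepgram", fake_backend("Deepgram"))
router.register("podcast", "Groq Whisper large-v3", fake_backend("Groq"))
router.register("budget", "BorgCloud", fake_backend("BorgCloud"))

result = router.transcribe("live_call", b"\x00" * 1600)
print(result.provider, result.text)
```

Swapping providers then means changing one `register` call instead of every call site. You still carry three sets of credentials and three bills.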

That's why we built 1ni.in.

1ni.in gives you a single API that routes to the best model for every request. Access ElevenLabs, Sarvam AI, open-weight Whisper models, and more — through one endpoint, one API key, one credit system. Switch models per-request with a single parameter change. No re-integration. No vendor lock-in.

And it's not just STT. Voice generation, sound effects, music creation — every category of audio AI, unified. We're launching soon with 1,000 free credits on signup.

Stop comparing. Start building.

One API key. Every audio AI model. 1,000 free credits. No credit card required.
