Choosing a speech-to-text API shouldn't require a spreadsheet and three days of research. But here we are — with dozens of providers, wildly different pricing models, and performance claims that are hard to verify. We built this guide to cut through the noise.
Below you'll find current pricing for every major STT provider, split into proprietary APIs (closed-source, fully managed) and hosted open models (open-source models like Whisper, deployed on managed infrastructure). All prices are for standard English batch transcription unless noted.
🏢 Proprietary STT Pricing
These are the traditional, closed-source providers. You get polished APIs, SLAs, and enterprise support, all at a premium. Prices are shown per hour of audio processed.
| Platform | Batch / Async | Real-Time | Free Tier | Key Notes |
|---|---|---|---|---|
| OpenAI Whisper | $0.36/hr | N/A | None | $0.006/min. Highly accurate multilingual. No native streaming. |
| Google Cloud STT | $0.18–$0.96/hr | $0.36–$1.92/hr | 60 min/mo | V2 Chirp models start lower. Batch discounts up to 50%. |
| Amazon Transcribe | $1.44/hr | $2.88/hr | 60 min/mo | Medical/call analytics add $0.07–$0.14/hr. Deep AWS integration. |
| Microsoft Azure | $0.45–$1.00/hr | $1.00–$1.20/hr | 5 hr/mo | Batch v3.2 at $0.45/hr. Custom models add 20%. Includes translation. |
| Deepgram | $0.26/hr | $0.46/hr | $200 credits | Nova-3 model. Ultra-low latency. Volume discounts from 500 hrs. |
| AssemblyAI | $0.12–$0.31/hr | $0.31–$0.49/hr | $50 credits | Nano ($0.12/hr) for speed, Best ($0.31/hr) for accuracy. Universal-2 model. |
| Rev AI | $0.18/hr | $0.30/hr | 15 min trial | $0.003/min async. 36+ languages. Known for high accuracy. |
| IBM Watson STT | $0.60/hr | $1.20/hr | 500 min/mo | $0.01/min. Custom model support. Strong enterprise customization. |
| Speechmatics | $0.30+/hr | $0.60+/hr | 8 hr/mo | 50+ languages. 20% discount over 500 hours. Pay-as-you-grow. |
| Gladia | $0.61/hr | Async focus | 10 hr/mo | 100+ languages. Code-switching support. Generous free tier. |
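Takeaway: AssemblyAI's Nano tier ($0.12/hr) is the cheapest batch option, followed by Rev AI ($0.18/hr) and Deepgram ($0.26/hr). For real-time streaming, Rev AI ($0.30/hr), AssemblyAI (from $0.31/hr), and Deepgram ($0.46/hr) are the value leaders. The big cloud providers (AWS, Google, Azure) can cost several times more (Amazon Transcribe's $1.44/hr batch rate is 12x the Nano rate) but offer deeper ecosystem integration.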
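To turn the table into a budget figure, a rough calculation like the one below can help. It is a minimal sketch: the prices are copied from the table above (lower bound where a range is given), and the helper names are hypothetical. It assumes flat pay-as-you-go billing and ignores free tiers and volume discounts.

```python
# Batch prices in USD per hour of audio, copied from the table above.
# Where the table gives a range, the lower bound is used.
BATCH_PRICE_PER_HOUR = {
    "AssemblyAI (Nano)": 0.12,
    "Rev AI": 0.18,
    "Deepgram": 0.26,
    "OpenAI Whisper": 0.36,
    "Amazon Transcribe": 1.44,
}


def per_minute_to_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute list price to the per-hour figure used in the table."""
    return round(price_per_minute * 60, 4)


def monthly_cost(hours_per_month: float, price_per_hour: float) -> float:
    """Flat pay-as-you-go cost (assumption: no free tier, no volume discount)."""
    return round(hours_per_month * price_per_hour, 2)


def rank_by_cost(hours_per_month: float) -> list[tuple[str, float]]:
    """Return (provider, monthly cost) pairs, cheapest first."""
    costs = {
        name: monthly_cost(hours_per_month, price)
        for name, price in BATCH_PRICE_PER_HOUR.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for provider, cost in rank_by_cost(500):
        print(f"{provider:<20} ${cost:,.2f}")
```

Swapping in your own monthly volume shows quickly whether the gap between providers is a rounding error or a real line item.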
🔓 Hosted Open Model Pricing
These are open-source models (primarily Whisper, plus Meta's Wav2Vec and NVIDIA's Parakeet) hosted on managed cloud platforms. You get open-source flexibility at 20–60% less than proprietary APIs, with comparable accuracy.
| Platform | Model(s) | Batch / Async | Real-Time | Free Tier | Notes |
|---|---|---|---|---|---|
| Replicate | whisper-large-v3 | $0.18–$0.30/hr | N/A | $10 credits | GPU-time billing. Supports translation/language ID. No cold starts. |
| Hugging Face | whisper-large-v3, wav2vec2 | $0.24–$0.60/hr | ~$2.40+/hr | 1 hr/mo (PRO) | Dedicated endpoints. Up to 96% savings on batch. Deploy from Hub. |
| Groq | whisper-large-v3 | $0.11/hr | $0.22/hr | 10 min/day | Up to 300x real-time speed. $0.0018/min batch. Accent-robust. |
| BorgCloud | whisper-large-v3-turbo | $0.06/hr | N/A | None | $0.001/min. Cheapest Whisper hosting available. Enterprise focus. |
| Fal.ai | whisper-large-v3, parakeet-tdt | $0.12–$0.24/hr | $0.24–$0.48/hr | $5 credits | Serverless. Parakeet transcribes 1 hr audio in ~1 sec. |
| Deepinfra | whisper-large-v3, voxtral-mini | $0.15/hr | $0.30/hr | $10 credits | $0.0025/min. Long-context audio. Fine-tuning supported. |
| Modal | parakeet-tdt, whisper variants | $0.20–$0.40/hr | Limited | $30/mo credits | Serverless functions. Parakeet WER ~6%. Ideal for batch. |
Takeaway: BorgCloud ($0.06/hr) and Groq ($0.11/hr) are absurdly cheap for Whisper-level accuracy. That's 400 hours of transcription for $24–$44. Even the pricier hosted options beat proprietary APIs by 20–60%. The trade-off: less polish, fewer enterprise features, and you're managing model selection yourself.
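The arithmetic behind that takeaway is easy to check. The sketch below uses the per-hour prices from this guide's tables; the OpenAI Whisper API rate is picked as the proprietary reference point purely for illustration, and the function names are hypothetical.

```python
# Per-hour batch prices in USD, taken from the tables in this guide.
BORGCLOUD = 0.06
GROQ = 0.11
OPENAI_WHISPER_API = 0.36  # proprietary reference point (illustrative choice)


def total_cost(hours: float, price_per_hour: float) -> float:
    """Cost of transcribing `hours` of audio at a flat hourly price."""
    return round(hours * price_per_hour, 2)


def savings_percent(hosted_price: float, reference_price: float) -> float:
    """Percentage saved by paying hosted_price instead of reference_price."""
    return round((1 - hosted_price / reference_price) * 100, 1)


if __name__ == "__main__":
    hours = 400
    for name, price in [("BorgCloud", BORGCLOUD), ("Groq", GROQ)]:
        saved = savings_percent(price, OPENAI_WHISPER_API)
        print(f"{name}: ${total_cost(hours, price):.2f} for {hours} hours "
              f"({saved}% below the reference rate)")
```

The saving depends heavily on which proprietary price you compare against, so it is worth running the numbers against the provider you would otherwise use.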
🎯 Accuracy Benchmarks (WER)
Word Error Rate (WER) measures transcription accuracy — lower is better. These are approximate benchmarks on standard English test sets. Real-world performance varies by audio quality, accent, and domain.
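For readers who want to measure WER on their own audio, here is a minimal sketch of the standard definition: substitutions, deletions, and insertions divided by the number of words in the reference transcript. It lowercases and splits on whitespace only; published benchmarks usually apply heavier text normalization (punctuation, numerals), so treat results from this version as indicative rather than comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox"
    output = "the quick brown box"
    print(f"WER: {word_error_rate(truth, output):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.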
Proprietary Models
Open Models (Hosted)
Key insight: Whisper large-v3 at 4.2% WER matches or beats most proprietary APIs — at a fraction of the cost. The gap between open and closed-source has essentially closed for English transcription. For multilingual, Whisper handles 99+ languages; Parakeet is fastest (3,386x real-time) but English-focused.
⚡ Latency Comparison
Processing speed matters if you're building real-time voice agents or need fast turnaround on batch jobs. Here's how providers stack up on time to transcribe 1 hour of audio:
| Provider | Model | Time for 1hr Audio | Real-Time Factor |
|---|---|---|---|
| Fal.ai | Parakeet TDT | ~1 second | 3,386x |
| Groq | Whisper large-v3 | ~12 seconds | 300x |
| Deepgram | Nova-3 | ~30 seconds | 120x |
| AssemblyAI | Universal-2 | ~45 seconds | 80x |
| Replicate | Whisper large-v3 | ~3–5 minutes | 12–20x |
| OpenAI | Whisper API | ~5–8 minutes | 8–12x |
| Google Cloud | Chirp 2 | ~5–10 minutes | 6–12x |
| AWS Transcribe | Standard | ~8–15 minutes | 4–8x |
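The two right-hand columns are tied together by a simple relation: processing time equals audio duration divided by the real-time factor. A small sketch, with hypothetical function names, that reproduces the table's figures:

```python
def processing_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Time needed to transcribe audio at a given real-time factor."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


def real_time_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Real-time factor observed when audio_seconds took elapsed_seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


if __name__ == "__main__":
    one_hour = 3600
    for provider, rtf in [("Fal.ai", 3386), ("Groq", 300), ("Deepgram", 120)]:
        seconds = processing_seconds(one_hour, rtf)
        print(f"{provider}: {seconds:.1f} s per hour of audio")
```

These figures describe batch throughput. They say nothing about streaming latency, which depends on chunk size and network round trips.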
🎯 Best Pick by Use Case
Cheapest Bulk Transcription
Processing thousands of hours of recordings with cost as the top priority.
Real-Time Voice Agents
Live calls, voice assistants, and real-time captions with sub-second latency.
Multilingual / Indic
Transcribing content in Hindi, Tamil, or 99+ languages with high accuracy.
Enterprise / Healthcare
HIPAA compliance, custom models, and SLAs for regulated industries.
Fastest Processing
When you need 1 hour transcribed in 1 second, regardless of cost.
Best Overall Accuracy
Lowest WER on clean English audio, no compromises on quality.
⚠️ Hidden Costs to Watch
The listed price is just the beginning. Here's what most comparison guides don't tell you:
Speaker diarization adds 10–20% to the base cost on most platforms. Custom vocabulary or domain-specific models add another 20–40% (Azure, IBM). Storage is billed separately: if you transcribe with Amazon Transcribe, the S3 storage for your audio files is extra. And GPU upgrades on hosted platforms like Hugging Face can raise costs by 20–50% if you need A100s instead of T4s for throughput.
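A quick way to see how these add-ons compound is to fold them into an effective hourly price. The sketch below assumes the surcharges are additive percentages of the base price, which is a simplification; actual billing rules differ by provider. The Azure figures come from the pricing table in this guide.

```python
def effective_price(
    base_price_per_hour: float,
    diarization_pct: float = 0.0,
    custom_model_pct: float = 0.0,
) -> float:
    """Base price plus add-on surcharges.

    Assumption: surcharges are additive fractions of the base price
    (0.10 means +10%). Real providers may bill add-ons differently.
    """
    multiplier = 1 + diarization_pct + custom_model_pct
    return round(base_price_per_hour * multiplier, 4)


if __name__ == "__main__":
    azure_batch = 0.45  # USD/hr, from the proprietary pricing table
    low = effective_price(azure_batch, diarization_pct=0.10, custom_model_pct=0.20)
    high = effective_price(azure_batch, diarization_pct=0.20, custom_model_pct=0.20)
    print(f"Azure batch with add-ons: ${low:.3f} to ${high:.3f} per hour")
```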
Then there's the cost nobody prices: integration time. Every provider has a different API. Different auth. Different response formats. Different error handling. Switching from Deepgram to Groq just to compare costs means rewriting your integration layer.
Unless you use an aggregator.
🔊 Why Aggregation Wins
Here's the problem this article illustrates: the "best" STT provider depends entirely on your use case. Bulk Hindi transcription? Groq with Whisper. Real-time English voice agent? Deepgram. Cheapest possible? BorgCloud. Enterprise compliance? Azure.
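In code, that decision is essentially a lookup table. The sketch below is illustrative only: the use-case keys are hypothetical labels, the provider picks are the ones named in the paragraph above, and nothing here calls a real provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    reason: str


# Provider picks mirror the examples in the text; the keys are hypothetical labels.
ROUTES = {
    "bulk_hindi": Route("Groq (Whisper)", "multilingual accuracy at low batch cost"),
    "realtime_english": Route("Deepgram", "low-latency streaming"),
    "cheapest": Route("BorgCloud", "lowest listed batch price"),
    "enterprise_compliance": Route("Azure", "compliance and enterprise support"),
}


def pick_provider(use_case: str) -> Route:
    """Return the route for a use case, or raise if the label is unknown."""
    try:
        return ROUTES[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


if __name__ == "__main__":
    for label in ROUTES:
        route = pick_provider(label)
        print(f"{label:<24} -> {route.provider} ({route.reason})")
```

The table is the easy part. Each entry still needs its own client, credentials, and response parsing behind it.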
Most products need more than one provider. Your live call feature needs Deepgram's low latency. Your podcast processing needs Whisper's multilingual accuracy. Your budget tier needs the cheapest open model available. That's three integrations, three billing systems, three sets of documentation.
That's why we built 1ni.in.
1ni.in gives you a single API that routes to the best model for every request. Access ElevenLabs, Sarvam AI, open-weight Whisper models, and more — through one endpoint, one API key, one credit system. Switch models per-request with a single parameter change. No re-integration. No vendor lock-in.
And it's not just STT. Voice generation, sound effects, music creation — every category of audio AI, unified. We're launching soon with 1,000 free credits on signup.