Voice Pipeline
Configure text-to-speech (TTS) and speech-to-text (STT) capabilities to enable natural voice conversations between your users and AI agents.

Overview
The voice pipeline transforms your agent from a text-only interface into a voice-enabled conversational AI. When a user speaks, STT transcribes the audio, the LLM processes the transcript, and TTS synthesizes the response back into speech.
Voice Pipeline Flow
1. User Speaks: audio is captured via microphone or phone call.
2. Speech-to-Text (STT): audio is transcribed to text using the provider's model.
3. LLM Processes: the agent generates a text response.
4. Text-to-Speech (TTS): the response is synthesized into natural speech.
5. User Hears Response: audio is played back in real time.
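The flow above can be sketched as three chained stages. This is a minimal illustration, not the platform's implementation: `VoicePipeline`, `handle_turn`, and the three stand-in stage functions are hypothetical names, and the stand-ins only pass bytes and strings through so the example runs without any provider account.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages of one conversational turn: STT -> LLM -> TTS."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle_turn(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)    # user audio -> transcript
        text_out = self.respond(text_in)       # transcript -> agent reply
        return self.synthesize(text_out)       # agent reply -> audio


# Stand-in stages for illustration only; real stages would call a provider.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable makes it straightforward to swap one provider for another without touching the rest of the turn logic.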
Voice Pipeline Availability
Text-to-Speech (TTS)
Configure TTS provider, voice, speech rate, and pitch to create natural-sounding voice responses for your agent.
Supported TTS Providers
| Provider | Voice Quality | Languages / Voices | Approx. Latency |
|---|---|---|---|
| OpenAI TTS | Excellent (HD) | 6 languages | ~500 ms |
| ElevenLabs | Best-in-class | 29 languages | ~300 ms |
| Google Cloud TTS | Very Good | 220+ voices | ~400 ms |
| Amazon Polly | Good | 30+ languages | ~600 ms |
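One way to use the latency column is to pick the fastest provider that fits a response-time budget. The sketch below copies the approximate figures from the table; `TtsProvider` and `fastest_within` are hypothetical helpers for illustration, and real latency will vary with network conditions and text length.

```python
from typing import NamedTuple, Optional, Sequence


class TtsProvider(NamedTuple):
    name: str
    approx_latency_ms: int  # approximate figure taken from the table above


TTS_PROVIDERS = [
    TtsProvider("OpenAI TTS", 500),
    TtsProvider("ElevenLabs", 300),
    TtsProvider("Google Cloud TTS", 400),
    TtsProvider("Amazon Polly", 600),
]


def fastest_within(
    budget_ms: int, providers: Sequence[TtsProvider] = TTS_PROVIDERS
) -> Optional[TtsProvider]:
    """Return the lowest-latency provider within the budget, or None if none fit."""
    candidates = [p for p in providers if p.approx_latency_ms <= budget_ms]
    return min(candidates, key=lambda p: p.approx_latency_ms, default=None)


choice = fastest_within(450)
```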
TTS Configuration
Voice Selection
Pre-built Voices
Choose from a library of pre-built voices across different genders, accents, and styles. Each provider offers a curated selection of high-quality voices optimized for conversational AI.
Voice Cloning
ElevenLabs and other premium providers offer voice cloning: upload a short audio sample (1-3 minutes) to create a custom voice that matches your brand identity.
Custom Voice Parameters
Fine-tune voice characteristics such as speech rate and pitch for your specific use case.
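As a rough sketch, the TTS settings described in this section (provider, voice, speech rate, pitch) could be grouped into a single validated object. The class name, field names, default values, and allowed ranges below are all assumptions made for illustration; check your provider's documentation for the actual parameter names and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsSettings:
    """Hypothetical container for the TTS options named in this section."""
    provider: str
    voice: str
    speech_rate: float = 1.0  # assumed convention: 1.0 means normal speed
    pitch: float = 0.0        # assumed convention: 0.0 means unchanged pitch

    def __post_init__(self) -> None:
        # The ranges below are placeholder assumptions, not provider limits.
        if not 0.5 <= self.speech_rate <= 2.0:
            raise ValueError(f"speech_rate out of range: {self.speech_rate}")
        if not -12.0 <= self.pitch <= 12.0:
            raise ValueError(f"pitch out of range: {self.pitch}")


# "example-voice" is a made-up voice identifier.
settings = TtsSettings(provider="ElevenLabs", voice="example-voice",
                       speech_rate=1.1, pitch=-2.0)
```

Validating at construction time means a bad value fails early, before any audio is requested.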
Speech-to-Text (STT)
Configure speech recognition to accurately transcribe user speech in real time. The STT stage supports multiple languages and domain-specific vocabulary.
| Provider | Accuracy | Languages | Real-time |
|---|---|---|---|
| OpenAI Whisper | High | 99 languages | Yes |
| Deepgram | Very High | 32 languages | Yes |
| Google STT | High | 125+ languages | Yes |
| Azure Speech | High | 100+ languages | Yes |
Real-time Streaming
Language Support
Configure which languages your agent supports for voice conversations. The system can auto-detect the user's language or be locked to a specific locale.
Auto-Detection
Enable language auto-detection to identify and transcribe the user's spoken language automatically. Auto-detection works with all major languages across STT providers.
Fixed Language
Lock the agent to a specific language for better accuracy in single-language deployments. A fixed language improves transcription accuracy by 5-10% compared to auto-detection.
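The two modes are mutually exclusive, which a small configuration object can make explicit. This is a sketch under assumptions: `LanguageConfig`, `resolve_language`, and the locale strings are hypothetical, and the detected language is passed in rather than computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical language settings: auto-detect OR one fixed locale."""
    auto_detect: bool = True
    fixed_language: Optional[str] = None  # e.g. "en-US" (illustrative value)

    def __post_init__(self) -> None:
        if self.auto_detect and self.fixed_language:
            raise ValueError("choose auto-detection or a fixed language, not both")
        if not self.auto_detect and not self.fixed_language:
            raise ValueError("a fixed language is required when auto-detection is off")


def resolve_language(config: LanguageConfig, detected: Optional[str]) -> Optional[str]:
    """Return the language to transcribe in for the current utterance."""
    if config.auto_detect:
        return detected           # trust whatever the STT stage detected
    return config.fixed_language  # ignore detection in locked deployments


auto_result = resolve_language(LanguageConfig(), detected="es")
fixed_result = resolve_language(
    LanguageConfig(auto_detect=False, fixed_language="en-US"), detected="es"
)
```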
Voice Pipeline API
Programmatically enable and configure voice settings for your agents.
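The request format is not shown here, so the following is only a sketch of what a voice-settings payload might look like. Every key name, the `build_voice_settings` helper, and the `"example-voice"` identifier are assumptions for illustration; consult the actual API reference for the real schema and endpoint.

```python
import json
from typing import Any, Dict, Optional


def build_voice_settings(
    tts_provider: str,
    voice: str,
    stt_provider: str,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a hypothetical voice configuration body for an agent."""
    return {
        "voice_enabled": True,  # hypothetical key
        "tts": {"provider": tts_provider, "voice": voice},
        "stt": {
            "provider": stt_provider,
            # Fall back to auto-detection when no language is fixed.
            "language": language or "auto",
        },
    }


payload = build_voice_settings("elevenlabs", "example-voice", "deepgram")
body = json.dumps(payload, indent=2)  # JSON text ready to send in a request
```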
Voice Activity Detection (VAD)