Voice Pipeline

Configure text-to-speech (TTS) and speech-to-text (STT) capabilities to enable natural voice conversations between your users and AI agents.


Overview

The voice pipeline transforms your agent from a text-only interface into a voice-enabled conversational AI. When a user speaks, their audio is transcribed via STT, processed by the LLM, and the response is synthesized back into speech via TTS.

Voice Pipeline Flow

1. User Speaks: Audio captured via microphone or phone call
2. Speech-to-Text (STT): Audio transcribed to text using provider model
3. LLM Processes: Agent generates text response
4. Text-to-Speech (TTS): Response synthesized to natural speech
5. User Hears Response: Audio played back in real-time
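The flow above can be sketched as three functions chained in order. This is only an illustration: the names `run_voice_turn`, `VoiceTurn`, and the `fake_*` stages are hypothetical and not part of the product API, and a real deployment would stream audio rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for this sketch only).
Transcriber = Callable[[bytes], str]   # STT: audio -> text
Responder = Callable[[str], str]       # LLM: user text -> reply text
Synthesizer = Callable[[str], bytes]   # TTS: reply text -> audio


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_voice_turn(
    audio: bytes, stt: Transcriber, llm: Responder, tts: Synthesizer
) -> VoiceTurn:
    """Run one conversational turn: STT, then LLM, then TTS."""
    transcript = stt(audio)
    reply_text = llm(transcript)
    reply_audio = tts(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_voice_turn(b"hello", fake_stt, fake_llm, fake_tts)
print(turn.reply_text)  # You said: hello
```

Keeping each stage behind a plain callable makes it possible to swap STT or TTS providers without touching the rest of the turn logic.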

Voice Pipeline Availability

Voice pipeline is available on Pro and Enterprise plans. The free tier supports text-only conversations. Contact sales for volume pricing on voice minutes.

Text-to-Speech (TTS)

Configure TTS provider, voice, speech rate, and pitch to create natural-sounding voice responses for your agent.

Supported TTS Providers

Provider	Voice Quality	Languages / Voices	Latency
OpenAI TTS	Excellent (HD)	6 languages	~500 ms
ElevenLabs	Best-in-class	29 languages	~300 ms
Google Cloud TTS	Very Good	220+ voices	~400 ms
Amazon Polly	Good	30+ languages	~600 ms
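If latency is the deciding factor, the table can be used to shortlist providers against a response-time budget. The sketch below copies the approximate figures from the table; the function name `providers_within_budget` is hypothetical, and real latency will vary with network, voice, and text length.

```python
from typing import List

# Approximate latencies in milliseconds, copied from the table above.
TTS_LATENCY_MS = {
    "OpenAI TTS": 500,
    "ElevenLabs": 300,
    "Google Cloud TTS": 400,
    "Amazon Polly": 600,
}


def providers_within_budget(budget_ms: int) -> List[str]:
    """Return providers whose listed latency fits the budget, fastest first."""
    fits = [(ms, name) for name, ms in TTS_LATENCY_MS.items() if ms <= budget_ms]
    return [name for ms, name in sorted(fits)]


print(providers_within_budget(450))  # ['ElevenLabs', 'Google Cloud TTS']
```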

TTS Configuration

Voice Selection

Pre-built Voices

Choose from a library of pre-built voices across different genders, accents, and styles. Each provider offers a curated selection of high-quality voices optimized for conversational AI.

Voice Cloning

ElevenLabs and other premium providers offer voice cloning: upload a short audio sample (1-3 minutes) to create a custom voice that matches your brand identity.

Custom Voice Parameters

Fine-tune voice characteristics for your specific use case:

Speed (0.5-2.0): Speech rate multiplier. 1.0 is normal speed.

Pitch (-20 to 20): Semitone adjustment. 0 is normal pitch.

Stability (0-1): Voice consistency. Higher = less variation.

Volume (0-100): Output volume level.
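A small range check can catch out-of-bounds values before they reach a provider. The following is a minimal sketch: the `VoiceParams` class and its field names are hypothetical, and the defaults for stability and volume are assumptions, since the documentation above only states the normal values for speed and pitch.

```python
from dataclasses import dataclass

# Allowed ranges as documented above: (minimum, maximum).
RANGES = {
    "speed": (0.5, 2.0),
    "pitch": (-20, 20),
    "stability": (0.0, 1.0),
    "volume": (0, 100),
}


@dataclass
class VoiceParams:
    speed: float = 1.0      # 1.0 is normal speed
    pitch: float = 0.0      # semitones; 0 is normal pitch
    stability: float = 0.5  # assumed default, not stated in the docs
    volume: float = 100     # assumed default, not stated in the docs

    def validate(self) -> None:
        """Raise ValueError if any field is outside its documented range."""
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")


VoiceParams(speed=1.2, pitch=-2).validate()  # within range, passes silently

try:
    VoiceParams(speed=3.0).validate()
except ValueError as err:
    print(err)  # speed=3.0 is outside 0.5..2.0
```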

Speech-to-Text (STT)

Configure speech recognition to accurately transcribe user speech in real-time. Supports multiple languages and domain-specific vocabulary.

Provider	Accuracy	Languages	Real-time
OpenAI Whisper	High	99 languages	Yes
Deepgram	Very High	32 languages	Yes
Google STT	High	125+ languages	Yes
Azure Speech	High	100+ languages	Yes

Real-time Streaming

All supported STT providers offer real-time streaming transcription. Partial results arrive with sub-300ms latency, which keeps the conversation flowing live.
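Streaming transcription typically delivers a series of partial results that are later replaced by a final one. The sketch below shows one way a client might merge them into a live caption. The `(text, is_final)` event shape and the function name `live_captions` are assumptions for illustration; each provider defines its own event format.

```python
from typing import Iterable, Iterator, List, Tuple


def live_captions(events: Iterable[Tuple[str, bool]]) -> Iterator[str]:
    """Yield the caption to display after each streaming event.

    A partial result replaces the previous partial; a final result is
    committed and clears the pending partial.
    """
    committed: List[str] = []
    pending = ""
    for text, is_final in events:
        if is_final:
            committed.append(text)
            pending = ""
        else:
            pending = text
        parts = committed + ([pending] if pending else [])
        yield " ".join(parts)


events = [
    ("book a", False),
    ("book a table", False),
    ("Book a table.", True),
    ("for two", False),
]
for caption in live_captions(events):
    print(caption)
# book a
# book a table
# Book a table.
# Book a table. for two
```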

Language Support

Configure which languages your agent supports for voice conversations. The system can auto-detect the user's language or be locked to a specific locale.

Auto-Detection

Enable auto-detection to identify and transcribe the user's spoken language without manual configuration. Works with all major languages across STT providers.

Fixed Language

Lock the agent to a specific language for better accuracy in single-language deployments. Improves transcription accuracy by 5-10% compared to auto-detection.
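The choice between the two modes can be expressed as a small resolution rule. This is a hedged sketch, not the platform's actual logic: `choose_language` and its parameters are hypothetical, and falling back to the first supported locale when detection returns an unsupported language is an assumption.

```python
from typing import Optional, Sequence


def choose_language(
    supported: Sequence[str],
    detected: Optional[str] = None,
    fixed: Optional[str] = None,
) -> str:
    """Return the locale to transcribe in.

    A fixed locale always wins. Otherwise the detected locale is used if the
    agent supports it, falling back to the first supported locale.
    """
    if not supported:
        raise ValueError("at least one supported language is required")
    if fixed is not None:
        if fixed not in supported:
            raise ValueError(f"{fixed} is not in the supported languages")
        return fixed
    if detected in supported:
        return detected
    return supported[0]


supported = ["en-US", "es-ES", "fr-FR"]
print(choose_language(supported, detected="es-ES"))                 # es-ES
print(choose_language(supported, detected="de-DE"))                 # en-US
print(choose_language(supported, detected="es-ES", fixed="fr-FR"))  # fr-FR
```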

Voice Pipeline API

Programmatically enable and configure voice settings for your agents.

Voice Activity Detection (VAD)

VAD is critical for natural conversation flow. It detects when the user has stopped speaking and triggers processing. Configure the silence duration for your use case: shorter for rapid exchanges, longer for thoughtful conversations.
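To illustrate how the silence duration affects turn-taking, here is a minimal energy-threshold sketch of end-of-speech detection. It is not the platform's VAD implementation: the function name, the 20 ms frame size, and the threshold value are all assumptions, and production VADs generally use trained models rather than raw energy.

```python
from typing import Iterable, Optional


def find_end_of_speech(
    frame_energies: Iterable[float],
    frame_ms: int = 20,               # assumed frame length
    energy_threshold: float = 0.01,   # assumed speech/silence cutoff
    silence_ms: int = 600,            # silence needed to end the turn
) -> Optional[int]:
    """Return the frame index at which the utterance is considered finished.

    Silence before any speech is ignored. Returns None if the required run
    of trailing silence never occurs.
    """
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# 5 frames of speech followed by 40 frames (800 ms) of silence.
energies = [0.2] * 5 + [0.0] * 40
print(find_end_of_speech(energies, silence_ms=300))   # 19
print(find_end_of_speech(energies, silence_ms=600))   # 34
print(find_end_of_speech(energies, silence_ms=1000))  # None
```

With the shorter setting the turn ends 300 ms after the last speech frame, so a speaker who pauses mid-sentence is more likely to be cut off. The longer setting tolerates pauses at the cost of slower responses.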

Related Documentation

Agent Configuration: LLM parameters and model selection

Agent Deployment: Deploy to phone channels and manage versions
