Voice Pipeline

Configure text-to-speech (TTS) and speech-to-text (STT) capabilities to enable natural voice conversations between your users and AI agents.

Overview

The voice pipeline transforms your agent from a text-only interface into a voice-enabled conversational AI. When a user speaks, their audio is transcribed via STT, processed by the LLM, and the response is synthesized back into speech via TTS.

Voice Pipeline Flow

1. User Speaks: Audio captured via microphone or phone call
2. Speech-to-Text (STT): Audio transcribed to text using provider model
3. LLM Processes: Agent generates text response
4. Text-to-Speech (TTS): Response synthesized to natural speech
5. User Hears Response: Audio played back in real-time
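
The flow above can be sketched as three chained stages. This is a minimal illustration, not the platform's implementation: the `VoicePipeline` class and the `fake_*` stand-in functions are hypothetical names, and a real deployment would call an STT provider, an LLM, and a TTS provider in their place.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three processing stages of one conversational turn."""

    transcribe: Callable[[bytes], str]   # STT: user audio -> text
    respond: Callable[[str], str]        # LLM: user text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio

    def handle_turn(self, user_audio: bytes) -> bytes:
        text = self.transcribe(user_audio)   # step 2
        reply = self.respond(text)           # step 3
        return self.synthesize(reply)        # step 4


# Stand-in stages (hypothetical) so the sketch runs without any provider SDK.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable means a provider can be swapped without touching the turn-handling logic.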

Voice Pipeline Availability

Voice pipeline is available on Pro and Enterprise plans. The free tier supports text-only conversations. Contact sales for volume pricing on voice minutes.

Text-to-Speech (TTS)

Configure TTS provider, voice, speech rate, and pitch to create natural-sounding voice responses for your agent.

Supported TTS Providers

Provider         | Voice Quality  | Languages     | Latency
OpenAI TTS       | Excellent (HD) | 6 languages   | ~500 ms
ElevenLabs       | Best-in-class  | 29 languages  | ~300 ms
Google Cloud TTS | Very Good      | 220+ voices   | ~400 ms
Amazon Polly     | Good           | 30+ languages | ~600 ms

TTS Configuration

Voice Selection

Pre-built Voices

Choose from a library of pre-built voices across different genders, accents, and styles. Each provider offers a curated selection of high-quality voices optimized for conversational AI.

Voice Cloning

ElevenLabs and other premium providers offer voice cloning: upload a short audio sample (1-3 minutes) to create a custom voice that matches your brand identity.

Custom Voice Parameters

Fine-tune voice characteristics for your specific use case:

Speed (0.5-2.0): Speech rate multiplier. 1.0 is normal speed.
Pitch (-20 to 20): Semitone adjustment. 0 is normal pitch.
Stability (0-1): Voice consistency. Higher = less variation.
Volume (0-100): Output volume level.
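
As a sketch of how these ranges might be checked before a configuration is saved, the snippet below validates each parameter against the limits listed above. The `VoiceParams` class is a hypothetical helper, and the default values for stability and volume are assumptions; only the ranges and the "normal" speed and pitch values come from this page.

```python
from dataclasses import dataclass
from typing import List

# (min, max) limits taken from the parameter list above.
RANGES = {
    "speed": (0.5, 2.0),
    "pitch": (-20, 20),
    "stability": (0.0, 1.0),
    "volume": (0, 100),
}


@dataclass
class VoiceParams:
    speed: float = 1.0       # 1.0 is normal speed
    pitch: float = 0         # 0 is normal pitch
    stability: float = 0.5   # assumed default, not documented
    volume: float = 100      # assumed default, not documented

    def validate(self) -> List[str]:
        """Return one message per out-of-range parameter (empty if valid)."""
        errors = []
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                errors.append(f"{name}={value} is outside {low}..{high}")
        return errors


print(VoiceParams(speed=1.2, pitch=-2).validate())
print(VoiceParams(speed=3.0).validate())
```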

Speech-to-Text (STT)

Configure speech recognition to accurately transcribe user speech in real time. STT supports multiple languages and domain-specific vocabulary.

Provider       | Accuracy  | Languages      | Real-time
OpenAI Whisper | High      | 99 languages   | Yes
Deepgram       | Very High | 32 languages   | Yes
Google STT     | High      | 125+ languages | Yes
Azure Speech   | High      | 100+ languages | Yes

Real-time Streaming

Every supported STT provider offers real-time streaming transcription, which enables live conversation flow with sub-300 ms latency for partial results.

Language Support

Configure which languages your agent supports for voice conversations. The system can auto-detect the user's language or be locked to a specific locale.

Auto-Detection

Enable auto-detection to identify and transcribe the user's spoken language. It works with all major languages across STT providers.

Fixed Language

Lock the agent to a specific language for better accuracy in single-language deployments. A fixed language improves transcription accuracy by 5-10% compared to auto-detection.

Voice Pipeline API

Programmatically enable and configure voice settings for your agents.
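
As a minimal sketch, the settings described on this page could be gathered into a single configuration payload like the one below. The field names, the provider identifiers, the `YOUR_VOICE_ID` placeholder, and the `"auto"` language value are all illustrative assumptions, not the documented API schema; the actual endpoint and fields should be taken from the API reference.

```python
import json
from typing import Optional


def build_voice_config(
    tts_provider: str,
    voice_id: str,
    stt_provider: str,
    language: Optional[str] = None,
    speed: float = 1.0,
    pitch: int = 0,
) -> dict:
    """Assemble a voice settings payload (hypothetical schema).

    language=None requests auto-detection; a locale such as "en-US"
    locks the agent to that language.
    """
    return {
        "voice_enabled": True,
        "tts": {
            "provider": tts_provider,
            "voice_id": voice_id,
            "speed": speed,   # 0.5-2.0, 1.0 is normal speed
            "pitch": pitch,   # -20 to 20 semitones, 0 is normal pitch
        },
        "stt": {
            "provider": stt_provider,
            "language": language if language is not None else "auto",
        },
    }


config = build_voice_config("elevenlabs", "YOUR_VOICE_ID", "deepgram", language="en-US")
print(json.dumps(config, indent=2))
```

The sketch only builds the payload; sending it requires an authenticated request to whichever endpoint the API reference specifies.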

Voice Activity Detection (VAD)

VAD is critical for natural conversation flow. It detects when the user has stopped speaking and triggers processing. Configure the silence duration based on your use case: shorter for rapid exchanges, longer for thoughtful conversations.
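
To illustrate how the silence duration affects turn-taking, here is a simple energy-threshold sketch of end-of-speech detection. It is not the platform's VAD algorithm: the function name, the 20 ms frame size, the energy threshold, and the 600 ms default are all assumptions chosen for the example.

```python
from typing import Iterable, Optional


def find_end_of_speech(
    frame_energies: Iterable[float],
    frame_ms: int = 20,        # assumed frame length
    threshold: float = 0.1,    # assumed energy level separating speech from silence
    silence_ms: int = 600,     # silence duration that ends the user's turn
) -> Optional[int]:
    """Return the index of the frame that completes the required silence
    after speech was heard, or None if the turn has not ended yet."""
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    quiet_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            heard_speech = True
            quiet_run = 0          # speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= frames_needed:
                return index
    return None


energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.0]
short_pause = find_end_of_speech(energies, silence_ms=60)    # rapid exchanges
long_pause = find_end_of_speech(energies, silence_ms=100)    # more thinking time
```

With the shorter setting the turn ends after three quiet frames, while the longer setting keeps listening on the same input, which is the trade-off between responsiveness and cutting the user off mid-thought.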