RAG Configuration
Fine-tune how your agent retrieves and uses knowledge base content. Adjust chunking strategies, retrieval parameters, and embedding models to optimize response quality for your specific use case.

Chunking
Control how documents are split into retrievable segments
Retrieval
Configure similarity search parameters and chunk selection
Embeddings
Choose embedding models that power semantic search
How RAG Works in 8bit-ai
Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from your knowledge base before generating an answer. The quality of RAG output depends on three key stages: document chunking, embedding-based retrieval, and context-augmented generation.
Document Chunking
Documents are split into smaller pieces called chunks. The chunk size, overlap, and splitting strategy determine how information is segmented. Good chunking ensures each chunk contains a complete, coherent thought that can stand alone when retrieved.
Vector Embedding
Each chunk is converted into a high-dimensional vector using an embedding model. These vectors capture semantic meaning, enabling similarity-based retrieval. The same embedding model must be used for both document and query vectors.
Semantic Search
When a user sends a query, it is embedded using the same model, then cosine similarity is computed against all document chunk vectors. The top-K chunks above the similarity threshold are returned, ranked by relevance score.
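The search step above can be sketched in a few lines. This is a minimal illustration, not 8bit-ai's actual implementation: the function names, the toy three-dimensional vectors, and the brute-force scan are all assumptions made for clarity (production systems typically use a vector index rather than comparing against every chunk).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, k=3, threshold=0.5):
    """Return up to k (chunk_id, score) pairs at or above threshold, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    kept = [(cid, score) for cid, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical chunk embeddings (real embeddings have hundreds of dimensions).
chunks = {
    "billing": [1.0, 0.0, 0.0],
    "refunds": [0.8, 0.6, 0.0],
    "shipping": [0.0, 0.0, 1.0],
}
results = top_k_chunks([1.0, 0.0, 0.0], chunks, k=2, threshold=0.5)
print(results)
```

Here "shipping" is orthogonal to the query, so it falls below the threshold and is dropped before the top-K cut is applied.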
Context Augmentation
Retrieved chunks are injected into the LLM prompt alongside the user query. The LLM uses this context to generate grounded responses with source citations. Properly configured RAG reduces hallucination and keeps responses anchored to your actual documentation.
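As a rough sketch of what prompt assembly might look like, the snippet below numbers each retrieved chunk so the model can cite it. The prompt wording, the `build_prompt` name, and the `source`/`text` fields are illustrative assumptions, not the template 8bit-ai uses.

```python
def build_prompt(query, chunks):
    """Assemble a prompt that lists numbered context chunks before the question."""
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Each chunk carries its source so the answer can cite it as [i].
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below. Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical retrieved chunks.
retrieved = [
    {"source": "faq.md", "text": "Refunds are processed within 5 business days."},
    {"source": "policy.md", "text": "Refund requests must be made within 30 days."},
]
prompt = build_prompt("How long do refunds take?", retrieved)
print(prompt)
```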
Chunking Configuration
Chunking determines how your documents are split into indexable segments. The right chunking strategy balances granularity with context completeness.
Chunk Size
The target size of each chunk in tokens. Smaller chunks provide more precise retrieval but may lack context. Larger chunks include more context but may introduce irrelevant information into the retrieval results.
Chunk Overlap
The number of tokens shared between consecutive chunks. Overlap prevents information loss at chunk boundaries, ensuring that sentences or concepts spanning the cut point are captured in at least one chunk.
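To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch. It assumes whitespace-separated words as stand-in "tokens" (a real tokenizer would count differently), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
print(chunks)
```

With a chunk size of 4 and an overlap of 1, the token at each cut point ("d", then "g") appears in both neighboring chunks, which is how content that straddles a boundary survives in at least one of them.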
Chunking Strategy
The system uses a recursive character text splitter that respects document structure. It attempts to split at natural boundaries like paragraph breaks, headings, and sentence endings before falling back to token-count-based splitting.
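The idea behind recursive splitting can be sketched as follows. This is a simplified illustration under stated assumptions: it measures length in characters rather than tokens, uses a hypothetical separator list, and ignores overlap, so it should not be read as the platform's splitter.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text at the coarsest separator available, recursing into oversized pieces."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily merge adjacent parts while the merged piece still fits.
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        # Pieces that are still too long are split again with finer separators.
        result = []
        for piece in pieces:
            result.extend(recursive_split(piece, max_len, separators[i + 1:]))
        return result
    # No separator left: fall back to fixed-size cuts.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]

doc = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences."
pieces = recursive_split(doc, max_len=30)
print(pieces)
```

The first paragraph fits as a single chunk, while the second is too long and is split again at its sentence boundary. Note that this sketch drops the separator at each cut, so a chunk that ends at a sentence break loses its trailing period.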
Retrieval Configuration
Retrieval settings control how the system searches for and selects relevant chunks when processing a user query. These settings directly impact response quality and cost.
Max Chunks
The maximum number of chunks to retrieve per query (range: 1-10). More chunks provide broader context but increase token usage and cost.
Similarity Threshold
The minimum cosine similarity score (0-1) for a chunk to be included in results. Higher values mean stricter matching, returning only highly relevant chunks.
Retrieval Mode
Choose how the system combines results from multiple collections linked to an agent.
Embedding Models
The embedding model converts text into numerical vectors that capture semantic meaning. Choosing the right model affects retrieval quality, language support, and cost.
| Model | Dimensions | Languages | Best For |
|---|---|---|---|
| text-embedding-3-small | 512 (default) / 1536 | 100+ | General purpose, cost-effective (recommended) |
| text-embedding-3-large | 1536 (default) / 3072 | 100+ | High-accuracy, nuanced retrieval |
| Cohere embed-v3 | 1024 | 50+ | Multilingual, compression-friendly outputs |
Model Selection Guidelines
- Start with text-embedding-3-small for most use cases. It offers an excellent balance of quality, speed, and cost.
- Upgrade to text-embedding-3-large when retrieval precision is critical and you need the highest accuracy for complex domain-specific queries.
- Use Cohere embed-v3 when working with multilingual content or when you need compressed embeddings to reduce storage costs.
- All documents in a collection must use the same embedding model. Changing the model requires a full reindex.
Vector Dimensions and Storage
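As a back-of-the-envelope illustration of how dimensions drive storage, raw vector size is roughly the number of chunks times the dimension count times the bytes per value. The sketch below assumes 32-bit floats (4 bytes per value) and ignores index and metadata overhead; the chunk count of 100,000 is an arbitrary example, and the function name is hypothetical.

```python
def embedding_storage_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming fixed-width values (float32 by default)."""
    return num_chunks * dimensions * bytes_per_value

# Compare the dimension options listed in the model table for 100,000 chunks.
for dims in (512, 1024, 1536, 3072):
    mib = embedding_storage_bytes(100_000, dims) / (1024 ** 2)
    print(f"{dims:>4} dims: {mib:.1f} MiB")
```

Because the relationship is linear, moving from 512 to 3072 dimensions multiplies raw vector storage by six.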
Advanced Settings
Fine-tune additional parameters for specialized RAG configurations. These settings are available for advanced use cases and should be adjusted with care.
MMR (Maximum Marginal Relevance)
Diversifies retrieved chunks to reduce redundancy. When enabled, the system selects chunks that are both relevant to the query and different from each other, providing broader coverage of the topic.
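The standard MMR selection loop can be sketched as below. The `lambda_mult` parameter name, the toy two-dimensional vectors, and the helper functions are assumptions for illustration; whether the platform exposes such a weight is not stated here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick k chunk ids, trading relevance against similarity to earlier picks."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        best_id, best_score = None, -math.inf
        for cid, vec in remaining.items():
            relevance = cosine(query_vec, vec)
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(vec, candidates[s]) for s in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_id, best_score = cid, score
        selected.append(best_id)
        del remaining[best_id]
    return selected

def unit(degrees):
    """Unit vector at the given angle, used to build easy-to-reason-about test data."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

query = unit(0)
candidates = {"a": unit(10), "b": unit(12), "c": unit(-40)}  # "b" nearly duplicates "a"
diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the loop reduces to plain relevance ranking and returns the two near-duplicates; with `0.5` the second pick shifts to the less similar chunk "c".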
Hybrid Search
Combines semantic (vector) search with keyword (BM25) search. Hybrid search improves retrieval for queries with specific terminology where exact keyword matches are important.
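This page does not say how the two result lists are merged. One widely used approach is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of the platform's method; the constant `k=60` is a conventional default, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_ranking = ["c1", "c2", "c3"]   # hypothetical semantic search order
keyword_ranking = ["c3", "c1", "c4"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks that appear in both lists ("c1" and "c3") rise above chunks found by only one method, which is the behavior hybrid search is meant to provide for queries containing exact terminology.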
Query Rewriting
Automatically rewrites the user's query before embedding to improve retrieval quality. Typical rewrites include expanding acronyms, correcting grammar, and reformulating questions for better semantic matching.
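As a toy illustration of one kind of rewrite, the snippet below expands acronyms from a lookup table before the query is embedded. The glossary contents and function name are hypothetical, and an actual rewriting step would more likely rely on a language model than on a fixed dictionary.

```python
import re

# Hypothetical glossary; a real deployment would maintain its own.
ACRONYMS = {"sso": "single sign-on", "kb": "knowledge base"}

def expand_acronyms(query, acronyms=ACRONYMS):
    """Append the expansion after each known acronym, leaving other words untouched."""
    def replace(match):
        word = match.group(0)
        expansion = acronyms.get(word.lower())
        return f"{word} ({expansion})" if expansion else word
    return re.sub(r"[A-Za-z]+", replace, query)

rewritten = expand_acronyms("How do I enable SSO?")
print(rewritten)
```

Keeping the original acronym alongside its expansion lets both the short and long forms contribute to the embedding.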
Re-ranking
Applies a cross-encoder model to re-rank the initial retrieval results for higher precision. This adds latency but significantly improves the relevance of the top-ranked chunks.
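The two-stage shape of re-ranking can be sketched with a pluggable scoring function. In the snippet below, `toy_overlap_score` is a deliberately simple stand-in for a cross-encoder (it only counts shared words), and all names are assumptions; a real cross-encoder scores the query and chunk together with a neural model.

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Re-order first-stage results by a (query, chunk) scoring function and keep top_n."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_overlap_score(query, chunk):
    """Stand-in scorer: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

first_stage = [
    "How to change your billing plan",
    "Steps to reset a forgotten password",
    "Password rules and requirements",
]
reranked = rerank("reset password", first_stage, toy_overlap_score, top_n=2)
print(reranked)
```

The added latency comes from running the scorer once per candidate, so re-ranking is usually applied only to the small set returned by the first-stage search.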
Performance Benchmarks
The table below shows how different settings affect retrieval quality and latency. Benchmarks are based on a 10,000-document knowledge base with standard documentation.
| Configuration | Precision | Recall | Latency |
|---|---|---|---|
| Basic (default) | ~72% | ~68% | ~45ms |
| Hybrid Search | ~78% | ~82% | ~60ms |
| Hybrid + MMR | ~81% | ~79% | ~75ms |
| Full (Hybrid + MMR + Rerank) | ~91% | ~85% | ~200ms |
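For readers who want to run a similar comparison on their own data, the snippet below shows the usual set-based definitions of precision and recall for a single query. It is an assumption that the table uses these definitions; the exact evaluation method is not described here, and the chunk ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant, for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical query: 4 chunks retrieved, 3 chunks judged relevant, 2 in common.
precision, recall = precision_recall(["c1", "c2", "c3", "c4"], ["c1", "c3", "c7"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these values over a labeled set of queries gives figures comparable in kind to those in the table.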
Performance vs. Cost Trade-offs