RAG Configuration

Fine-tune how your agent retrieves and uses knowledge base content. Adjust chunking strategies, retrieval parameters, and embedding models to optimize response quality for your specific use case.

Chunking

Control how documents are split into retrievable segments

Retrieval

Configure similarity search parameters and chunk selection

Embeddings

Choose embedding models that power semantic search

How RAG Works in 8bit-ai

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from your knowledge base before generating an answer. The quality of RAG output depends on three key stages: document chunking, embedding-based retrieval, and context-augmented generation.

Document Chunking

Documents are split into smaller pieces called chunks. The chunk size, overlap, and splitting strategy determine how information is segmented. Good chunking ensures each chunk contains a complete, coherent thought that can stand alone when retrieved.

Vector Embedding

Each chunk is converted into a high-dimensional vector using an embedding model. These vectors capture semantic meaning, enabling similarity-based retrieval. The same embedding model must be used for both document and query vectors.

Semantic Search

When a user sends a query, it is embedded using the same model, then cosine similarity is computed against all document chunk vectors. The top-K chunks above the similarity threshold are returned, ranked by relevance score.
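
The search step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the platform's implementation: the function names, the in-memory `chunks` dictionary, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query_vec, chunks, top_k=3, threshold=0.7):
    """Return up to top_k (chunk_id, score) pairs scoring at or above threshold."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunks.items()]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in kept[:top_k]]

# Toy vectors standing in for real embeddings (hypothetical data).
chunks = {
    "billing": [1.0, 0.0, 0.0],
    "refunds": [0.8, 0.6, 0.0],
    "api-keys": [0.0, 0.0, 1.0],
}
results = search([1.0, 0.0, 0.0], chunks, top_k=2, threshold=0.7)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

Here "api-keys" is dropped by the threshold before ranking, so a low-scoring chunk is never returned just to fill the top-K quota.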

Context Augmentation

Retrieved chunks are injected into the LLM prompt alongside the user query. The LLM uses this context to generate grounded responses with source citations. Properly configured RAG reduces hallucination and keeps responses based on your actual documentation.
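
As a rough sketch of what context injection can look like, the snippet below assembles a prompt from retrieved chunks. The prompt wording, the `build_prompt` name, and the `source`/`text` keys are illustrative assumptions; the actual prompt template used by the platform is not documented here.

```python
def build_prompt(question, retrieved):
    """Number each retrieved chunk so the model can cite it as [n]."""
    context_lines = []
    for i, chunk in enumerate(retrieved, start=1):
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below. Cite sources as [n].\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Sample data for illustration only.
retrieved = [
    {"source": "billing.md", "text": "Refunds are issued to the original payment method."},
    {"source": "faq.md", "text": "Contact support to request a refund."},
]
prompt = build_prompt("How do refunds work?", retrieved)
print(prompt)
```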

Chunking Configuration

Chunking determines how your documents are split into indexable segments. The right chunking strategy balances granularity with context completeness.

Chunk Size

The target size of each chunk in tokens. Smaller chunks provide more precise retrieval but may lack context. Larger chunks include more context but may introduce irrelevant information into the retrieval results.

200-400 tokens: Precise retrieval for Q&A, definitions, and facts

500-1000 tokens: Balanced (recommended default)

1200-2000 tokens: Best for narrative content, procedures, and tutorials

Chunk Overlap

The number of tokens shared between consecutive chunks. Overlap prevents information loss at chunk boundaries, ensuring that sentences or concepts spanning the cut point are captured in at least one chunk.

0 tokens: No overlap; minimal storage but risk of lost context

50-100 tokens: Recommended for general documentation

150-200 tokens: Heavy overlap for dense or highly interconnected content
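
To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch. It operates on a pre-tokenized list, and the `chunk_tokens` helper and the tiny sizes are assumptions chosen for readability; real token counts depend on the tokenizer in use.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(12)]
chunks = chunk_tokens(tokens, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last two tokens of one chunk reappear as the first two of the next.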

Chunking Strategy

The system uses a recursive character text splitter that respects document structure. It attempts to split at natural boundaries like paragraph breaks, headings, and sentence endings before falling back to token-count-based splitting.

Recursive split: Respects headings, paragraphs, sentences (default)

Fixed size: Strict token count split, ignores structure
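
The recursive strategy can be approximated as follows. This is a simplified sketch, not the platform's splitter: the separator list, the whitespace-based `count_tokens` stand-in, and the function names are assumptions, and separators at chunk boundaries are dropped for brevity.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraph, line, sentence, word

def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def recursive_split(text, max_tokens, separators=SEPARATORS):
    """Split at the coarsest separator that yields chunks within max_tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No natural boundary left: fall back to a fixed-size split.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p.strip()]
    if len(parts) == 1:
        return recursive_split(text, max_tokens, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = f"{current}{sep}{part}" if current else part
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep merging neighbouring parts
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if count_tokens(part) <= max_tokens:
            current = part
        else:
            chunks.extend(recursive_split(part, max_tokens, finer))
    if current:
        chunks.append(current.strip())
    return chunks

text = ("Intro paragraph one two.\n\n"
        "Second paragraph has a few more words in it. And another sentence here.")
pieces = recursive_split(text, max_tokens=6)
for piece in pieces:
    print(repr(piece))
```

The short first paragraph survives intact, while the longer second paragraph is split at the sentence boundary and then, where a sentence is still too long, at word boundaries.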

Retrieval Configuration

Retrieval settings control how the system searches for and selects relevant chunks when processing a user query. These settings directly impact response quality and cost.

Max Chunks

The maximum number of chunks to retrieve per query (range: 1-10). More chunks provide broader context but increase token usage and cost.

1-2 chunks: Simple, factual queries

3-5 chunks: Balanced (recommended)

6-10 chunks: Complex, multi-faceted questions

Similarity Threshold

The minimum cosine similarity score (0-1) for a chunk to be included in results. Higher values mean stricter matching, returning only highly relevant chunks.

0.5-0.6: Loose matching, high recall

0.7-0.8: Balanced (recommended)

0.85-1.0: Strict matching, high precision

Retrieval Mode

Choose how the system combines results from multiple collections linked to an agent.

Merge: Search all collections, merge results by score, return top-K (default)

Per-collection: Retrieve a separate top-K from each collection, then merge, so every source is represented
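
The difference between the two modes is easiest to see side by side. The sketch below is illustrative only: the function names and the hit dictionaries are assumptions, and it assumes that per-collection mode keeps every collection's top-K in the merged result rather than truncating again.

```python
def merge_mode(results_by_collection, top_k):
    """Pool all hits, rank by score, keep the global top_k."""
    pooled = [hit for hits in results_by_collection.values() for hit in hits]
    pooled.sort(key=lambda hit: hit["score"], reverse=True)
    return pooled[:top_k]

def per_collection_mode(results_by_collection, top_k):
    """Keep the top_k hits of each collection, then rank the union by score."""
    selected = []
    for hits in results_by_collection.values():
        ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
        selected.extend(ranked[:top_k])
    selected.sort(key=lambda hit: hit["score"], reverse=True)
    return selected

# Hypothetical hits from two collections linked to one agent.
results_by_collection = {
    "docs": [
        {"id": "docs-1", "collection": "docs", "score": 0.92},
        {"id": "docs-2", "collection": "docs", "score": 0.90},
        {"id": "docs-3", "collection": "docs", "score": 0.88},
    ],
    "faq": [
        {"id": "faq-1", "collection": "faq", "score": 0.75},
    ],
}
merged = merge_mode(results_by_collection, top_k=2)
covered = per_collection_mode(results_by_collection, top_k=2)
print([hit["id"] for hit in merged])
print([hit["id"] for hit in covered])
```

With these scores, Merge returns only "docs" hits because they outrank the FAQ entry, whereas Per-collection still surfaces "faq-1".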

Embedding Models

The embedding model converts text into numerical vectors that capture semantic meaning. Choosing the right model affects retrieval quality, language support, and cost.

Model	Dimensions	Languages	Best For
text-embedding-3-small	512 (default) / 1536	100+	General purpose, cost-effective (recommended)
text-embedding-3-large	1536 (default) / 3072	100+	High-accuracy, nuanced retrieval
Cohere embed-v3	1024	50+	Multilingual, compression-friendly outputs

Model Selection Guidelines

Start with text-embedding-3-small for most use cases. It offers an excellent balance of quality, speed, and cost.
Upgrade to text-embedding-3-large when retrieval precision is critical and you need the highest accuracy for complex domain-specific queries.
Use Cohere embed-v3 when working with multilingual content or when you need compressed embeddings to reduce storage costs.
All documents in a collection must use the same embedding model. Changing the model requires a full reindex.

Vector Dimensions and Storage

Higher-dimensional embeddings capture more semantic nuance but consume more storage and increase query latency. text-embedding-3-small at 512 dimensions offers approximately 85% of the performance of text-embedding-3-large at 3072 dimensions at a fraction of the cost.
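
A back-of-the-envelope estimate shows how dimensions translate into raw vector storage. The calculation assumes 4-byte float32 values and ignores index and metadata overhead; the chunk count is an arbitrary example, and the actual storage format used by the platform may differ.

```python
def vector_storage_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Raw size of the embedding matrix, assuming float32 values by default."""
    return num_chunks * dimensions * bytes_per_value

num_chunks = 100_000  # example corpus size, not a platform limit
for dims in (512, 1536, 3072):
    megabytes = vector_storage_bytes(num_chunks, dims) / 1_000_000
    print(f"{dims} dimensions: {megabytes:.1f} MB")
```

Storage grows linearly with dimensions, so moving from 512 to 3072 dimensions multiplies the raw vector footprint by six.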

Advanced Settings

Fine-tune additional parameters for specialized RAG configurations. These settings are available for advanced use cases and should be adjusted with care.

MMR (Maximum Marginal Relevance)

Diversifies retrieved chunks to reduce redundancy. When enabled, the system selects chunks that are both relevant to the query and different from each other, providing broader coverage of the topic.

Lambda (0-1): 0 = maximum diversity, 1 = maximum relevance
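
The effect of Lambda can be illustrated with the commonly used greedy MMR formulation, where each candidate is scored as Lambda times its relevance minus (1 - Lambda) times its highest similarity to anything already selected. This is a sketch under that assumption; the function names and toy vectors are hypothetical, and the platform's exact scoring is not specified here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, candidates, k, lam=0.5):
    """Greedily pick k chunk ids, trading relevance against redundancy."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        best_id, best_score = None, -math.inf
        for chunk_id, vec in remaining.items():
            relevance = cosine(query_vec, vec)
            # Similarity to the closest chunk that was already picked.
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_id, best_score = chunk_id, score
        selected.append(best_id)
        del remaining[best_id]
    return selected

# "b" is a near-duplicate of "a"; "c" covers a different direction.
candidates = {"a": [1.0, 0.0], "b": [0.99, 0.14], "c": [0.6, 0.8]}
query = [1.0, 0.0]
relevance_only = mmr(query, candidates, k=2, lam=1.0)
diversified = mmr(query, candidates, k=2, lam=0.3)
print(relevance_only, diversified)
```

At Lambda = 1 the near-duplicate "b" is selected second; at Lambda = 0.3 its overlap with "a" is penalized and "c" takes its place.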

Hybrid Search

Combines semantic (vector) search with keyword (BM25) search. Hybrid search improves retrieval for queries with specific terminology where exact keyword matches are important.

Alpha (0-1): 0 = pure keyword, 1 = pure semantic
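
One common way to blend the two signals is a weighted sum of normalized scores, sketched below. The min-max normalization, the function names, and the sample scores are assumptions for illustration; the platform may fuse results differently (for example, by rank rather than by score).

```python
def min_max(scores):
    """Rescale scores to the 0-1 range so BM25 and cosine values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}  # all hits are equally good
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_scores(semantic, keyword, alpha=0.5):
    """alpha = 1 uses only semantic scores, alpha = 0 only keyword scores."""
    sem, kw = min_max(semantic), min_max(keyword)
    chunk_ids = set(sem) | set(kw)
    return {
        cid: alpha * sem.get(cid, 0.0) + (1 - alpha) * kw.get(cid, 0.0)
        for cid in chunk_ids
    }

# Hypothetical scores: cosine similarities and unbounded BM25 values.
semantic = {"a": 0.9, "b": 0.7, "c": 0.5}
keyword = {"b": 12.0, "c": 3.0}

for alpha in (1.0, 0.5, 0.0):
    scores = hybrid_scores(semantic, keyword, alpha)
    best = max(scores, key=scores.get)
    print(alpha, best)
```

Chunk "a" wins on semantic similarity alone, but "b" overtakes it once keyword matches carry weight.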

Query Rewriting

Automatically rewrites the user's query before embedding to improve retrieval quality. For example, expanding acronyms, correcting grammar, or reformulating questions for better semantic matching.
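
The acronym-expansion case can be sketched with a simple glossary lookup. This covers only one of the rewrites mentioned above and is not how the platform performs rewriting; the `ACRONYMS` glossary and the `expand_acronyms` helper are hypothetical.

```python
import re

# Hypothetical glossary; a real deployment would use its own domain terms.
ACRONYMS = {
    "SSO": "single sign-on",
    "MFA": "multi-factor authentication",
}

def expand_acronyms(query, glossary=ACRONYMS):
    """Append the expansion after each known acronym, keeping the original term."""
    def replace(match):
        word = match.group(0)
        expansion = glossary.get(word.upper())
        return f"{word} ({expansion})" if expansion else word
    return re.sub(r"\b[A-Za-z]{2,}\b", replace, query)

rewritten = expand_acronyms("how do I enable SSO?")
print(rewritten)
```

Keeping the original acronym next to its expansion lets the rewritten query match documents that use either form.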

Re-ranking

Applies a cross-encoder model to re-rank the initial retrieval results for higher precision. This adds latency but significantly improves the relevance of the top-ranked chunks.
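
Structurally, re-ranking is a second pass over the first-stage results with a more expensive scorer. The sketch below shows that two-stage shape with a pluggable `score_fn`; the word-overlap scorer is only a stand-in so the example runs without a model, and it is not a cross-encoder. All names and sample texts are assumptions.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score (chunk_id, text) pairs with score_fn and keep the best top_n."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]

def overlap_score(query, text):
    """Stand-in scorer: fraction of query words present in the chunk text.

    A real deployment would call a cross-encoder here, which reads the query
    and the chunk together and returns a relevance score.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# First-stage results in their original order (sample data).
candidates = [
    ("c1", "Billing overview and invoices"),
    ("c2", "How to reset your password"),
    ("c3", "Password policy and reset limits"),
]
top = rerank("reset password", candidates, overlap_score, top_n=2)
print(top)
```

Because the scorer runs once per candidate, latency grows with the number of chunks passed to the second stage, which is why re-ranking is applied to the initial results rather than to the whole collection.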

Performance Benchmarks

The table below shows how different settings affect retrieval quality and latency. Benchmarks are based on a 10,000-document knowledge base with standard documentation.

Configuration	Precision	Recall	Latency
Basic (default)	~72%	~68%	~45ms
Hybrid Search	~78%	~82%	~60ms
Hybrid + MMR	~81%	~79%	~75ms
Full (Hybrid + MMR + Rerank)	~91%	~85%	~200ms

Performance vs. Cost Trade-offs

Advanced features like re-ranking and query rewriting add latency and consume additional API credits. Start with the default configuration and enable advanced features only if retrieval quality does not meet your requirements.