Engineering · 12 min
How to build RAG with Claude: a practical guide for 2026
Retrieval-augmented generation done right with Claude — what actually moves the accuracy needle, what's hype, and a working architecture you can ship this week.
Retrieval-augmented generation (RAG) is the cheapest, most reliable way to make Claude answer questions about your own data — internal docs, codebase, support tickets, anything that's too big or too private to fit in a prompt. But there's a wide gap between a tutorial RAG and a production one. Here's what actually matters in 2026.
Why RAG over fine-tuning?
--------------------------
Fine-tuning is a hammer. RAG is a screwdriver. For 90% of "Claude needs to know about my data" problems, RAG wins because:
- It's instant — index a new doc and it's queryable in seconds, not hours
- It's cheap — embedding + vector search runs for cents per million tokens
- It's auditable — you can show the user exactly which source documents the answer came from
- It updates in place — no model retraining when content changes
Fine-tuning is for behavioural changes (tone, format, terminology). RAG is for factual grounding. They solve different problems.
The 5-stage RAG pipeline
-------------------------
Every production RAG system has the same five stages. The art is in the details of each.
1. Ingest — pull source docs into a normalised text format
2. Chunk — split into retrievable units
3. Embed — convert each chunk to a vector
4. Retrieve — at query time, find the top-k most similar chunks
5. Generate — pass retrieved chunks + the question to Claude
Most "my RAG doesn't work" problems trace to stages 2 and 4. Embedding models are commoditised in 2026 (Voyage, Cohere, OpenAI all ship near-identical quality at the embedding layer). The differentiation is in how you chunk and how you retrieve.
Stage 1 — ingest
-----------------
Strip everything down to clean Markdown. PDFs become text via pdfminer or unstructured.io. HTML gets its main content extracted with Mozilla Readability, then converted to Markdown. Code stays as code. Don't try to embed a hex-dumped binary — it's all noise. Keep document-level metadata (title, source URL, last modified) — you'll need it in stages 4 and 5.
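A rough sketch of that normalisation step, using pdfminer.six and Readability as named above. The markdownify hop for HTML-to-Markdown and the record shape are my choices here, not a requirement:

```python
# Normalise each source into clean text/Markdown plus the metadata the later
# stages will need. The returned dict shape is illustrative.
from pdfminer.high_level import extract_text      # pip install pdfminer.six
from readability import Document                  # pip install readability-lxml
from markdownify import markdownify               # pip install markdownify

def ingest_pdf(path: str, url: str) -> dict:
    return {
        "text": extract_text(path),                # plain text out of the PDF
        "title": path.rsplit("/", 1)[-1],
        "source_url": url,
    }

def ingest_html(html: str, url: str) -> dict:
    doc = Document(html)                           # Readability isolates the main content
    return {
        "text": markdownify(doc.summary()),        # main-content HTML converted to Markdown
        "title": doc.short_title(),
        "source_url": url,
    }
```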
Stage 2 — chunk
----------------
This is where most RAG systems leak accuracy. The naive approach is "split every 512 tokens" — fast and terrible. Real production chunkers do three things:
- Split on semantic boundaries (paragraphs, headings, code blocks) rather than fixed token counts
- Preserve a small overlap between consecutive chunks (~10%) so an answer that straddles a chunk boundary is still recoverable
- Attach the parent document title and section heading to every chunk's text — this helps embeddings encode "what is this about"
For technical docs, 200-400 tokens per chunk hits the sweet spot. Smaller and you lose context; larger and the embedding becomes too averaged to discriminate.
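Here's a minimal chunker implementing those three rules, with whitespace word counts standing in for a real tokenizer:

```python
# Split on headings and paragraph breaks, pack paragraphs up to ~300 tokens,
# keep ~10% overlap between consecutive chunks, and prepend the document title
# + current heading to every chunk. Token counts are approximated with
# len(text.split()); use a real tokenizer in production.
def chunk_markdown(text: str, title: str,
                   max_tokens: int = 300, overlap_tokens: int = 30) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []   # paragraphs in the chunk being built
    current_len = 0           # approximate token count of `current`
    new_paras = 0             # paragraphs added since the last flush (skip pure-overlap chunks)
    heading = ""

    def flush(carry_overlap: bool) -> None:
        nonlocal current, current_len, new_paras
        if new_paras:
            chunks.append(f"{title} > {heading}\n\n" + "\n\n".join(current))
        tail: list[str] = []
        tail_len = 0
        if carry_overlap:
            for para in reversed(current):     # carry the tail of this chunk as overlap
                tail.insert(0, para)
                tail_len += len(para.split())
                if tail_len >= overlap_tokens:
                    break
        current, current_len, new_paras = tail, tail_len, 0

    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if para.startswith("#"):               # heading: hard boundary, no overlap across sections
            flush(carry_overlap=False)
            heading = para.lstrip("# ")
            continue
        if current_len + len(para.split()) > max_tokens:
            flush(carry_overlap=True)          # soft boundary: keep the ~10% overlap
        current.append(para)
        current_len += len(para.split())
        new_paras += 1

    flush(carry_overlap=False)
    return chunks
```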
Stage 3 — embed
----------------
Pick one embedding model and stick with it. Re-embedding everything when you switch models is expensive. Voyage 3 large is the 2026 default for English; for code, Voyage code 2.
Store the vector + the original text + the metadata in pgvector (Postgres + the pgvector extension) or a hosted vector DB like Pinecone or Qdrant. For under 1M chunks, pgvector with HNSW indexing is more than fast enough and saves you a separate piece of infrastructure.
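A storage sketch under those choices (Voyage for embeddings, pgvector with an HNSW index). Table names, column names, and the connection string are illustrative:

```python
# Assumes VOYAGE_API_KEY is set and you can create Postgres extensions.
import numpy as np
import psycopg
import voyageai
from pgvector.psycopg import register_vector
from psycopg.types.json import Jsonb

vo = voyageai.Client()                                # reads VOYAGE_API_KEY
conn = psycopg.connect("dbname=rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_id    text,
        body      text,
        metadata  jsonb,
        embedding vector(1024)                        -- voyage-3-large's default dimension
    )""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops)")

def index_chunks(doc_id: str, texts: list[str], metadata: dict) -> None:
    result = vo.embed(texts, model="voyage-3-large", input_type="document")
    for body, emb in zip(texts, result.embeddings):
        conn.execute(
            "INSERT INTO chunks (doc_id, body, metadata, embedding) VALUES (%s, %s, %s, %s)",
            (doc_id, body, Jsonb(metadata), np.array(emb)),
        )

def vector_search(query: str, k: int = 50) -> list[tuple[str, str]]:
    q = vo.embed([query], model="voyage-3-large", input_type="query").embeddings[0]
    rows = conn.execute(
        "SELECT doc_id, body FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (np.array(q), k))
    return rows.fetchall()
```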
Stage 4 — retrieve
-------------------
This is where serious systems separate from toy ones. Three improvements stack:
**1. Hybrid search.** Run BM25 (keyword) and vector search in parallel, then fuse the rankings with reciprocal rank fusion. Pure vector search misses obvious lexical matches; pure keyword misses paraphrased ones. Hybrid search consistently lifts retrieval recall by 15-25%.
**2. Reranking.** Take the top 50 candidates from hybrid search and pass them through a cross-encoder reranker (Cohere Rerank 3, Voyage Rerank). The reranker scores each candidate against the query directly, which is more accurate than cosine similarity but too expensive to run on the full corpus. Stack: hybrid → top 50 → rerank → top 5.
**3. Query expansion.** When the user asks "how do I deploy", expand to "how do I deploy", "deployment instructions", "release pipeline" before retrieval. A small Claude Haiku call generates the expansions cheaply. Adds latency but recovers the long-tail of paraphrased queries.
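Improvements 1 and 2 together look roughly like this. The RRF fusion is exact; the reranker call and model id are illustrative, so swap in whichever reranker you run:

```python
import cohere

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, bm25_ids: list[str], vector_ids: list[str],
             body_by_id: dict[str, str], top_k: int = 5) -> list[str]:
    candidates = rrf_fuse([bm25_ids, vector_ids])[:50]       # hybrid -> top 50
    co = cohere.Client()                                     # reads CO_API_KEY
    reranked = co.rerank(
        model="rerank-english-v3.0",                         # assumed id for Cohere Rerank 3
        query=query,
        documents=[body_by_id[c] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in reranked.results]   # rerank -> top 5
```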
Stage 5 — generate
-------------------
Pass the top-k retrieved chunks (in our pipeline, k=5) to Claude with a system prompt like:
```
Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say so.
Cite the source for every claim using [doc-id].
```
Set temperature to 0 for factual queries; 0.3 for synthesis tasks. Use Claude Sonnet for most queries; reach for Opus only when you've measured Sonnet failing.
Always cite sources. Always pass the metadata (title, URL) in the prompt so Claude can produce clickable citations the user can verify.
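Wired up with the Anthropic SDK, the generation call looks something like this (the model id is a placeholder; the chunk and metadata fields follow the earlier sketches):

```python
import anthropic

SYSTEM = (
    "Answer the question using ONLY the provided context.\n"
    "If the context doesn't contain the answer, say so.\n"
    "Cite the source for every claim using [doc-id]."
)

def generate(question: str, chunks: list[dict]) -> str:
    # Each chunk carries its doc_id, title, and source_url so Claude can cite them.
    context = "\n\n".join(
        f"[{c['doc_id']}] {c['metadata']['title']} ({c['metadata']['source_url']})\n{c['body']}"
        for c in chunks
    )
    client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY
    response = client.messages.create(
        model="claude-sonnet-4-5",                 # placeholder model id
        max_tokens=1024,
        temperature=0,                             # factual queries; raise to ~0.3 for synthesis
        system=SYSTEM,
        messages=[{"role": "user",
                   "content": f"<context>\n{context}\n</context>\n\nQuestion: {question}"}],
    )
    return response.content[0].text
```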
Common pitfalls
----------------
**"My RAG is too slow."** Cache embeddings on the document side; cache rerank results per (query-prefix, doc-id) pair. Most production RAGs serve from cache 60% of the time.
**"My RAG is wrong."** Run an eval — 50 hand-labeled (question, expected source) pairs. Measure retrieval recall@5 first. If recall is bad, generation can't save you. Fix retrieval before tuning prompts.
**"My RAG hallucinates."** This means generation is overriding retrieval. Two fixes: (1) tighten the system prompt ("answer ONLY from context, otherwise say 'I don't know'"), (2) raise the retrieval threshold so weak matches don't make it into the prompt at all.
A reference architecture
-------------------------
For a typical SaaS product wiring RAG into Claude:
- Postgres (pgvector) — chunks + embeddings + metadata
- Voyage 3 large — embedding model
- Cohere Rerank 3 — reranker
- Claude Sonnet 4.7 — generator
- Redis — query + result cache
Total cost at moderate volume: $50-150/month for under 100k queries. Latency: P50 ~800ms, P95 ~1.4s end-to-end.
The ClaudeSkill marketplace has several published RAG skills you can install (`rag-doc-builder`, `vector-store-helper`) — they include the prompt patterns above as ready-to-use SKILL.md rules. Browse them at claudeskil.com/explore.