Engineering · 13 min
How to architect a production AI app: a battle-tested blueprint
From prompt to product — the architecture patterns, infra choices, and gotchas you actually need to ship a Claude-powered app to real paying users.
Building a production AI app isn't building an LLM demo. The demo is 10% of the work. The other 90% is everything around it: caching, evals, fallbacks, observability, cost control, and the boring queue infrastructure that keeps the lights on at 3am. Here's the blueprint that's worked across the AI products I've shipped.
The core layers
----------------
A production AI app has six layers, each solving a specific class of problem:
1. Edge — auth, rate limiting, request shaping
2. Orchestration — workflow logic, retries, fallbacks
3. Inference — actual LLM calls (Claude, embedding models)
4. Memory — conversation state, RAG, user context
5. Observability — traces, evals, cost dashboards
6. Async — long-running tasks, batch jobs, webhooks
Skipping any one of these is fine for a demo. Skipping any one of these in production guarantees an outage in the first 6 months.
Layer 1 — Edge
---------------
Cloudflare Workers, Vercel Edge Functions, or a CDN-level rate limiter sits in front of every request. Three jobs:
- **Auth.** Verify the JWT or session cookie before you spend any LLM tokens.
- **Rate limit.** Per-user (e.g. 60 RPM) and per-org (e.g. 5,000 RPM) limits stop a runaway script from costing you $4,000 in tokens overnight.
- **Idempotency keys.** When a client retries a slow request, you don't want to double-charge them or run two parallel inference calls. Reject duplicate idempotency keys at the edge.
Skip the edge layer and your first incident will be a token-cost surprise.
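Here's a minimal sketch of that gate using Upstash Ratelimit and Redis; `verifySession` stands in for whatever auth check you already run, and the limits are illustrative:

```ts
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

// 60 requests per minute per user, sliding window.
const perUser = new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(60, "1 m") });

// Assumed helper: verifies the JWT / session cookie and returns a user id.
declare function verifySession(header: string | null): Promise<string | null>;

export async function edgeGate(req: Request): Promise<Response | null> {
  // 1. Auth before any tokens are spent.
  const userId = await verifySession(req.headers.get("authorization"));
  if (!userId) return new Response("Unauthorized", { status: 401 });

  // 2. Per-user rate limit.
  const { success } = await perUser.limit(userId);
  if (!success) return new Response("Too Many Requests", { status: 429 });

  // 3. Idempotency: NX means the set only succeeds if the key is new.
  const idemKey = req.headers.get("idempotency-key");
  if (idemKey) {
    const fresh = await redis.set(`idem:${userId}:${idemKey}`, "1", { nx: true, ex: 86_400 });
    if (fresh === null) return new Response("Duplicate request", { status: 409 });
  }

  return null; // null = let the request through to the orchestration layer
}
```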
Layer 2 — Orchestration
-------------------------
This is where your business logic lives. The orchestration layer:
- Picks which model to call (Sonnet for quality, Haiku for speed)
- Routes to the right prompt template
- Handles retries with exponential backoff (LLM APIs do timeout)
- Falls back to a smaller model if the primary is overloaded
- Writes traces to the observability layer
In TypeScript: Mastra or BAML for prompt + workflow definitions. In Python: LangGraph or Haystack. For simple apps, a few hundred lines of plain TypeScript is fine — you don't need a framework until you have multi-step workflows.
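For the no-framework case, the retry-plus-fallback loop is roughly this. A sketch: `callModel` stands in for your provider wrapper, and the model names are placeholders.

```ts
type ModelId = "claude-sonnet" | "claude-haiku";

// Assumed wrapper around the provider SDK (also the place to write traces).
declare function callModel(model: ModelId, prompt: string): Promise<string>;

export async function completeWithFallback(prompt: string): Promise<string> {
  const plan: ModelId[] = ["claude-sonnet", "claude-haiku"]; // primary first, smaller fallback second

  for (const model of plan) {
    for (let attempt = 0; attempt < 3; attempt++) {          // hard cap on retries
      try {
        return await callModel(model, prompt);
      } catch {
        // Exponential backoff with jitter: ~1s, ~2s, ~4s.
        const delay = 1000 * 2 ** attempt * (0.75 + Math.random() * 0.5);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
    // All retries on this model failed; fall through to the next one.
  }
  throw new Error("All models and retries exhausted");
}
```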
Layer 3 — Inference
---------------------
The actual LLM call. Three things matter:
- **Streaming.** Always stream the response if you're rendering text to a user. Time-to-first-token is what users feel: a streamed response with a 2-second TTFT feels faster than a non-streamed response that arrives all at once at 5 seconds, even if the streamed one takes longer to finish.
- **Function calling for structured output.** When you need JSON, use Claude's tool-use API with a schema, not "respond in JSON" in the prompt. The schema-validated response is more reliable.
- **Provider redundancy.** Run the primary on Anthropic's API and fall back to AWS Bedrock or Google Vertex when Anthropic has an outage. The fallback path can be 30% slower — that's fine. It's much better than a 503.
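A rough sketch of the first two points with the Anthropic TypeScript SDK; the model ids and tool schema are illustrative:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Streaming: render deltas as they arrive so TTFT is what the user feels.
export async function streamAnswer(question: string, onText: (chunk: string) => void) {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-5",                        // illustrative model id
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });
  stream.on("text", onText);                           // fires on every text delta
  return stream.finalMessage();                        // resolves with the full message
}

// Structured output: force a tool call against a schema instead of "respond in JSON".
export async function extractMeeting(text: string) {
  const res = await client.messages.create({
    model: "claude-haiku-4-5",                         // illustrative model id
    max_tokens: 512,
    tools: [
      {
        name: "record_meeting",
        description: "Record the extracted meeting details",
        input_schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            date: { type: "string", description: "ISO 8601 date" },
          },
          required: ["title", "date"],
        },
      },
    ],
    tool_choice: { type: "tool", name: "record_meeting" }, // force the schema-shaped call
    messages: [{ role: "user", content: text }],
  });

  const call = res.content.find((block) => block.type === "tool_use");
  return call?.type === "tool_use" ? call.input : null;    // schema-shaped arguments (or null)
}
```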
Layer 4 — Memory
------------------
Three sublayers:
- **Conversation memory** — Postgres + a "messages" table. One row per turn. Index on (conversation_id, created_at). Truncate to last 20 turns for context, summarise older turns into a single "context" message if you need long memory.
- **RAG memory** — pgvector or Pinecone, indexed by org_id (so user A's data never leaks to user B). See our practical RAG guide for the full pipeline.
- **User context** — Postgres "user_profiles" table with the things you reliably want in every prompt (name, role, plan, last 3 actions). Inject this as a small system-prompt prefix every call.
Hard rule: every memory record carries an owner_id and queries are filtered by it at the DB level, not at the application level. RLS in Postgres gives you this for free with Supabase.
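A sketch of the conversation-memory read path, assuming a postgres.js-style client and a `summarise` helper you'd back with a cheap model (both illustrative):

```ts
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Assumed helper: collapses older turns into a short summary via a cheap model.
declare function summarise(turns: { role: string; content: string }[]): Promise<string>;

export async function buildContext(conversationId: string, ownerId: string) {
  // The owner filter lives in the query (and ideally in RLS), never only in app code.
  const turns = await sql<{ role: string; content: string }[]>`
    select role, content
    from messages
    where conversation_id = ${conversationId} and owner_id = ${ownerId}
    order by created_at desc
    limit 40`;

  const recent = turns.slice(0, 20).reverse();   // last 20 turns, oldest first
  const older = turns.slice(20).reverse();       // anything beyond that gets summarised

  const summary = older.length
    ? [{ role: "user", content: `Earlier context: ${await summarise(older)}` }]
    : [];

  return [...summary, ...recent];
}
```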
Layer 5 — Observability
-------------------------
Without observability, debugging an AI app is shooting in the dark. Wire up:
- **Langfuse, Helicone, or Phoenix** for trace dashboards. Every LLM call is one span. Tag with user_id, prompt_template_id, model_id, and feature_id.
- **Cost dashboards.** Per-feature, per-model, per-user-tier. Check daily. Alerts on 24h delta > 30%.
- **Evals.** A suite of 50-200 hand-labeled (input, expected_output) pairs that you re-run nightly. When a prompt change drops eval pass rate by 5%, you catch it before it ships.
You can stand all this up in a weekend. Skipping it for "later" is the most common reason teams ship bad AI features.
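The eval harness really can be small. A minimal sketch, assuming `runFeature` wraps the prompt and model call under test and that exact match is a good-enough pass criterion for the task:

```ts
type EvalCase = { input: string; expected: string };

// Assumed wrapper: runs the prompt template + model call you want to guard.
declare function runFeature(input: string): Promise<string>;

export async function runEvals(cases: EvalCase[], baselinePassRate: number) {
  let passed = 0;
  for (const c of cases) {
    const output = await runFeature(c.input);
    if (output.trim() === c.expected.trim()) passed++; // swap in an LLM grader for fuzzier tasks
  }

  const passRate = passed / cases.length;
  console.log(`evals: ${passed}/${cases.length} passed (${(passRate * 100).toFixed(1)}%)`);

  // Regression gate: fail the nightly run if we drop 5 points below baseline.
  if (passRate < baselinePassRate - 0.05) {
    throw new Error(`Eval regression: ${passRate.toFixed(2)} vs baseline ${baselinePassRate.toFixed(2)}`);
  }
  return passRate;
}
```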
Layer 6 — Async
-----------------
Anything that takes more than ~10 seconds runs async. RAG indexing, batch generation, evals, summarisation jobs. Use:
- **Upstash QStash, Inngest, or Trigger.dev** for the queue
- **Redis** for short-lived locks and idempotency
- **Webhooks** for "your job is done" notifications back to the client
The frontend never blocks on a sync LLM call longer than 30s. Above that, you return a job_id immediately and poll/stream completion.
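A sketch of that split with Inngest; the event name, function id, and helpers are illustrative:

```ts
import { Inngest } from "inngest";

export const inngest = new Inngest({ id: "my-app" });

// Assumed helpers for the actual work.
declare function chunkAndEmbed(docId: string): Promise<number>;
declare function notifyClient(jobId: string): Promise<void>;

// API route: enqueue and return a job_id immediately; never block the request.
export async function enqueueIndexing(docId: string): Promise<{ jobId: string }> {
  const jobId = crypto.randomUUID();
  await inngest.send({ name: "docs/uploaded", data: { docId, jobId } });
  return { jobId };
}

// Worker: runs in the background; each step is retried independently.
export const indexDocs = inngest.createFunction(
  { id: "index-docs" },
  { event: "docs/uploaded" },
  async ({ event, step }) => {
    const chunks = await step.run("chunk-and-embed", () => chunkAndEmbed(event.data.docId));
    await step.run("notify", () => notifyClient(event.data.jobId));
    return { chunks };
  }
);
```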
Cost controls (the part that bankrupts startups)
--------------------------------------------------
- **Cache aggressively.** Anthropic's prompt caching is 90% cheaper for cached tokens. Restructure your prompts so the long static part (system prompt, RAG chunks) comes first, the short dynamic part (user question) last. Cache the static part (see the sketch after this list).
- **Budget alerts at 70% and 90%.** Daily spend caps with auto-shutoff at 110% of budget. Email + Slack on every breach.
- **Per-user usage caps.** Even on paid tiers. Free tier: 50 messages/day. Pro: 1000/day. Enterprise: custom. Without caps, one bad actor wipes out your margin.
- **Right-size your models.** Most chat tasks don't need Opus. Sonnet is 5x cheaper and good enough. Haiku is 20x cheaper and good enough for routing/extraction.
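The caching point from the first bullet is mostly about where content blocks sit in the request. A sketch with the Anthropic SDK (model id is illustrative):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function answer(staticContext: string, question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5",                       // illustrative model id
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: staticContext,                          // long static prefix: system prompt + RAG chunks
        cache_control: { type: "ephemeral" },         // cached reads are ~90% cheaper
      },
    ],
    messages: [{ role: "user", content: question }],  // short dynamic part comes last
  });
}
```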
Reliability — what'll actually break
-------------------------------------
Three things take down production AI apps in roughly this order:
1. **Provider outages** — Anthropic, OpenAI, AWS all have hours-long degradations. Multi-provider fallback is non-negotiable.
2. **Cost runaways** — usually a bug in retry logic that fires 50 retries on a 429 instead of backing off. Always cap total retries.
3. **Hallucinated tool calls** — the model invents a tool that doesn't exist, or passes the wrong argument shape. Schema-validate every tool call before executing it.
Plan for all three from day one. Each has burned me at least once.
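For the third failure mode, a small Zod gate in front of tool execution covers both cases (unknown tool, wrong argument shape). The tool name and schema here are illustrative:

```ts
import { z } from "zod";

const searchArgs = z.object({
  query: z.string().min(1),
  limit: z.number().int().positive().max(50).default(10),
});

export function validateToolCall(name: string, input: unknown) {
  if (name !== "search_docs") {
    // Hallucinated tool: the model asked for something we never defined.
    throw new Error(`Unknown tool requested: ${name}`);
  }
  const parsed = searchArgs.safeParse(input);
  if (!parsed.success) {
    // Wrong argument shape: surface the error back to the model instead of executing.
    throw new Error(`Invalid tool arguments: ${parsed.error.message}`);
  }
  return parsed.data; // safe to execute
}
```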
The starter stack
------------------
For a TypeScript SaaS in 2026, here's what I'd default to:
- **Frontend:** Next.js + Vercel
- **Edge:** Vercel Edge Middleware + Upstash Ratelimit
- **Orchestration:** Mastra
- **Inference:** Anthropic SDK + Bedrock fallback
- **Memory:** Supabase Postgres (with pgvector + RLS)
- **Observability:** Langfuse + Vercel logs
- **Async:** Inngest
Total infra: ~$200/mo at small scale, scales to $2k/mo at 100k MAU. Margins remain healthy because cached tokens dominate cost at that scale.
Many of the SKILL.md skills on claudeskil.com (search "rag", "evals", "ai infra") encode these patterns as ready-to-install rules for Claude Code. Browse them at claudeskil.com/explore.