Engineering · 13 min
How to architect a production AI app: a battle-tested blueprint
From prompt to product — the architecture patterns, infra choices, and gotchas you actually need to ship a Claude-powered app to real paying users.
Building a production AI app isn't building an LLM demo. The demo is 10% of the work. The other 90% is everything around it: caching, evals, fallbacks, observability, cost control, and the boring queue infrastructure that keeps the lights on at 3am. Here's the blueprint that's worked across the AI products I've shipped.
The core layers
----------------
A production AI app has six layers, each solving a specific class of problem:
1. Edge — auth, rate limiting, request shaping
2. Orchestration — workflow logic, retries, fallbacks
3. Inference — actual LLM calls (Claude, embedding models)
4. Memory — conversation state, RAG, user context
5. Observability — traces, evals, cost dashboards
6. Async — long-running tasks, batch jobs, webhooks
Skipping any one of these is fine for a demo. Skipping any one of these in production guarantees an outage in the first 6 months.
Layer 1 — Edge
---------------
Cloudflare Workers, Vercel Edge Functions, or a CDN-level rate limiter sits in front of every request. Three jobs:
- **Auth.** Verify the JWT or session cookie before you spend any LLM tokens.
- **Rate limit.** Per-user (e.g. 60 RPM) and per-org (e.g. 5,000 RPM) limits stop a runaway script from costing you $4,000 in tokens overnight.
- **Idempotency keys.** When a client retries a slow request, you don't want to double-charge them or run two parallel inference calls. Reject duplicate idempotency keys at the edge.
Skip the edge layer and your first incident will be a token-cost surprise.
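Here's a minimal sketch of that gate using Upstash Ratelimit and Redis; `verifySession` stands in for whatever auth check you already run, and the limits are illustrative:

```ts
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

// 60 requests per minute per user, sliding window.
const perUser = new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(60, "1 m") });

// Assumed helper: verifies the JWT / session cookie and returns a user id.
declare function verifySession(header: string | null): Promise<string | null>;

export async function edgeGate(req: Request): Promise<Response | null> {
  // 1. Auth before any tokens are spent.
  const userId = await verifySession(req.headers.get("authorization"));
  if (!userId) return new Response("Unauthorized", { status: 401 });

  // 2. Per-user rate limit.
  const { success } = await perUser.limit(userId);
  if (!success) return new Response("Too Many Requests", { status: 429 });

  // 3. Idempotency: NX means the set only succeeds if the key is new.
  const idemKey = req.headers.get("idempotency-key");
  if (idemKey) {
    const fresh = await redis.set(`idem:${userId}:${idemKey}`, "1", { nx: true, ex: 86_400 });
    if (fresh === null) return new Response("Duplicate request", { status: 409 });
  }

  return null; // null = let the request through to the orchestration layer
}
```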
Layer 2 — Orchestration
-------------------------
This is where your business logic lives. The orchestration layer:
- Picks which model to call (Sonnet for quality, Haiku for speed)
- Routes to the right prompt template
- Handles retries with exponential backoff (LLM APIs do timeout)
- Falls back to a smaller model if the primary is overloaded
- Writes traces to the observability layer
In TypeScript: Mastra or BAML for prompt + workflow definitions. In Python: LangGraph or Haystack. For simple apps, a few hundred lines of plain TypeScript is fine — you don't need a framework until you have multi-step workflows.
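For the no-framework case, the retry-plus-fallback loop is roughly this. A sketch: `callModel` stands in for your provider wrapper, and the model names are placeholders.

```ts
type ModelId = "claude-sonnet" | "claude-haiku";

// Assumed wrapper around the provider SDK (also the place to write traces).
declare function callModel(model: ModelId, prompt: string): Promise<string>;

export async function completeWithFallback(prompt: string): Promise<string> {
  const plan: ModelId[] = ["claude-sonnet", "claude-haiku"]; // primary first, smaller fallback second

  for (const model of plan) {
    for (let attempt = 0; attempt < 3; attempt++) {          // hard cap on retries
      try {
        return await callModel(model, prompt);
      } catch {
        // Exponential backoff with jitter: ~1s, ~2s, ~4s.
        const delay = 1000 * 2 ** attempt * (0.75 + Math.random() * 0.5);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
    // All retries on this model failed; fall through to the next one.
  }
  throw new Error("All models and retries exhausted");
}
```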
Layer 3 — Inference
---------------------
The actual LLM call. Three things matter:
- **Streaming.** Always stream the response if you're rendering text to a user. Time-to-first-token is what users feel: a streamed response with a 2-second TTFT feels faster than a non-streamed response that arrives all at once at 5 seconds, even if the streamed one takes longer to finish.
- **Function calling for structured output.** When you need JSON, use Claude's tool-use API with a schema, not "respond in JSON" in the prompt. The schema-validated response is more reliable.
- **Provider redundancy.** Run the primary on Anthropic's API and fall back to AWS Bedrock or Google Vertex when Anthropic has an outage. The fallback path can be 30% slower — that's fine. It's much better than a 503.
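A rough sketch of the first two points with the Anthropic TypeScript SDK; the model ids and tool schema are illustrative:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Streaming: render deltas as they arrive so TTFT is what the user feels.
export async function streamAnswer(question: string, onText: (chunk: string) => void) {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-5",                        // illustrative model id
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });
  stream.on("text", onText);                           // fires on every text delta
  return stream.finalMessage();                        // resolves with the full message
}

// Structured output: force a tool call against a schema instead of "respond in JSON".
export async function extractMeeting(text: string) {
  const res = await client.messages.create({
    model: "claude-haiku-4-5",                         // illustrative model id
    max_tokens: 512,
    tools: [
      {
        name: "record_meeting",
        description: "Record the extracted meeting details",
        input_schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            date: { type: "string", description: "ISO 8601 date" },
          },
          required: ["title", "date"],
        },
      },
    ],
    tool_choice: { type: "tool", name: "record_meeting" }, // force the schema-shaped call
    messages: [{ role: "user", content: text }],
  });

  const call = res.content.find((block) => block.type === "tool_use");
  return call?.type === "tool_use" ? call.input : null;    // schema-shaped arguments (or null)
}
```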
Layer 4 — Memory
------------------
Three sublayers:
- **Conversation memory** — Postgres + a "messages" table. One row per turn. Index on (conversation_id, created_at). Truncate to last 20 turns for context, summarise older turns into a single "context" message if you need long memory.
- **RAG memory** — pgvector or Pinecone, indexed by org_id (so user A's data never leaks to user B). See our practical RAG guide for the full pipeline.
- **User context** — Postgres "user_profiles" table with the things you reliably want in every prompt (name, role, plan, last 3 actions). Inject this as a small system-prompt prefix every call.
Hard rule: every memory record carries an owner_id and queries are filtered by it at the DB level, not at the application level. RLS in Postgres gives you this for free with Supabase.
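A sketch of the conversation-memory read path, assuming a postgres.js-style client and a `summarise` helper you'd back with a cheap model (both illustrative):

```ts
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Assumed helper: collapses older turns into a short summary via a cheap model.
declare function summarise(turns: { role: string; content: string }[]): Promise<string>;

export async function buildContext(conversationId: string, ownerId: string) {
  // The owner filter lives in the query (and ideally in RLS), never only in app code.
  const turns = await sql<{ role: string; content: string }[]>`
    select role, content
    from messages
    where conversation_id = ${conversationId} and owner_id = ${ownerId}
    order by created_at desc
    limit 40`;

  const recent = turns.slice(0, 20).reverse();   // last 20 turns, oldest first
  const older = turns.slice(20).reverse();       // anything beyond that gets summarised

  const summary = older.length
    ? [{ role: "user", content: `Earlier context: ${await summarise(older)}` }]
    : [];

  return [...summary, ...recent];
}
```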
Layer 5 — Observability
-------------------------
Without observability, debugging an AI app is shooting in the dark. Wire up:
- **Langfuse, Helicone, or Phoenix** for trace dashboards. Every LLM call is one span. Tag with user_id, prompt_template_id, model_id, and feature_id.
- **Cost dashboards.** Per-feature, per-model, per-user-tier. Check daily. Alerts on 24h delta > 30%.
- **Evals.** A suite of 50-200 hand-labeled (input, expected_output) pairs that you re-run nightly. When a prompt change drops eval pass rate by 5%, you catch it before it ships.
You can stand all this up in a weekend. Skipping it for "later" is the most common reason teams ship bad AI features.
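The eval harness really can be small. A minimal sketch, assuming `runFeature` wraps the prompt and model call under test and that exact match is a good-enough pass criterion for the task:

```ts
type EvalCase = { input: string; expected: string };

// Assumed wrapper: runs the prompt template + model call you want to guard.
declare function runFeature(input: string): Promise<string>;

export async function runEvals(cases: EvalCase[], baselinePassRate: number) {
  let passed = 0;
  for (const c of cases) {
    const output = await runFeature(c.input);
    if (output.trim() === c.expected.trim()) passed++; // swap in an LLM grader for fuzzier tasks
  }

  const passRate = passed / cases.length;
  console.log(`evals: ${passed}/${cases.length} passed (${(passRate * 100).toFixed(1)}%)`);

  // Regression gate: fail the nightly run if we drop 5 points below baseline.
  if (passRate < baselinePassRate - 0.05) {
    throw new Error(`Eval regression: ${passRate.toFixed(2)} vs baseline ${baselinePassRate.toFixed(2)}`);
  }
  return passRate;
}
```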
Layer 6 — Async
-----------------
Anything that takes more than ~10 seconds runs async. RAG indexing, batch generation, evals, summarisation jobs. Use:
- **Upstash QStash, Inngest, or Trigger.dev** for the queue
- **Redis** for short-lived locks and idempotency
- **Webhooks** for "your job is done" notifications back to the client
The frontend never blocks on a sync LLM call longer than 30s. Above that, you return a job_id immediately and poll/stream completion.
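A sketch of that split with Inngest; the event name, function id, and helpers are illustrative:

```ts
import { Inngest } from "inngest";

export const inngest = new Inngest({ id: "my-app" });

// Assumed helpers for the actual work.
declare function chunkAndEmbed(docId: string): Promise<number>;
declare function notifyClient(jobId: string): Promise<void>;

// API route: enqueue and return a job_id immediately; never block the request.
export async function enqueueIndexing(docId: string): Promise<{ jobId: string }> {
  const jobId = crypto.randomUUID();
  await inngest.send({ name: "docs/uploaded", data: { docId, jobId } });
  return { jobId };
}

// Worker: runs in the background; each step is retried independently.
export const indexDocs = inngest.createFunction(
  { id: "index-docs" },
  { event: "docs/uploaded" },
  async ({ event, step }) => {
    const chunks = await step.run("chunk-and-embed", () => chunkAndEmbed(event.data.docId));
    await step.run("notify", () => notifyClient(event.data.jobId));
    return { chunks };
  }
);
```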
Cost controls (the part that bankrupts startups)
--------------------------------------------------
- **Cache aggressively.** Anthropic's prompt caching is 90% cheaper for cached tokens. Restructure your prompts so the long static part (system prompt, RAG chunks) comes first, the short dynamic part (user question) last. Cache the static part (see the sketch after this list).
- **Budget alerts at 70% and 90%.** Daily spend caps with auto-shutoff at 110% of budget. Email + Slack on every breach.
- **Per-user usage caps.** Even on paid tiers. Free tier: 50 messages/day. Pro: 1000/day. Enterprise: custom. Without caps, one bad actor wipes out your margin.
- **Right-size your models.** Most chat tasks don't need Opus. Sonnet is 5x cheaper and good enough. Haiku is 20x cheaper and good enough for routing/extraction.
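The caching point from the first bullet is mostly about where content blocks sit in the request. A sketch with the Anthropic SDK (model id is illustrative):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function answer(staticContext: string, question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5",                       // illustrative model id
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: staticContext,                          // long static prefix: system prompt + RAG chunks
        cache_control: { type: "ephemeral" },         // cached reads are ~90% cheaper
      },
    ],
    messages: [{ role: "user", content: question }],  // short dynamic part comes last
  });
}
```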
Reliability — what'll actually break
-------------------------------------
Three things take down production AI apps in roughly this order:
1. **Provider outages** — Anthropic, OpenAI, AWS all have hours-long degradations. Multi-provider fallback is non-negotiable.
2. **Cost runaways** — usually a bug in retry logic that fires 50 retries on a 429 instead of backing off. Always cap total retries.
3. **Hallucinated tool calls** — the model invents a tool that doesn't exist, or passes the wrong argument shape. Schema-validate every tool call before executing it.
Plan for all three from day one. Each has burned me at least once.
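For the third failure mode, a small Zod gate in front of tool execution covers both cases (unknown tool, wrong argument shape). The tool name and schema here are illustrative:

```ts
import { z } from "zod";

const searchArgs = z.object({
  query: z.string().min(1),
  limit: z.number().int().positive().max(50).default(10),
});

export function validateToolCall(name: string, input: unknown) {
  if (name !== "search_docs") {
    // Hallucinated tool: the model asked for something we never defined.
    throw new Error(`Unknown tool requested: ${name}`);
  }
  const parsed = searchArgs.safeParse(input);
  if (!parsed.success) {
    // Wrong argument shape: surface the error back to the model instead of executing.
    throw new Error(`Invalid tool arguments: ${parsed.error.message}`);
  }
  return parsed.data; // safe to execute
}
```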
The starter stack
------------------
For a TypeScript SaaS in 2026, here's what I'd default to:
- **Frontend:** Next.js + Vercel
- **Edge:** Vercel Edge Middleware + Upstash Ratelimit
- **Orchestration:** Mastra
- **Inference:** Anthropic SDK + Bedrock fallback
- **Memory:** Supabase Postgres (with pgvector + RLS)
- **Observability:** Langfuse + Vercel logs
- **Async:** Inngest
Total infra: ~$200/mo at small scale, scales to $2k/mo at 100k MAU. Margins remain healthy because cached tokens dominate cost at that scale.
Many of the SKILL.md skills on claudeskil.com (search "rag", "evals", "ai infra") encode these patterns as ready-to-install rules for Claude Code. Browse them at claudeskil.com/explore.