Deploy AI systems reliably: infrastructure patterns, cost control, and operational runbooks.
11 min readRun new model alongside old for 72h. Compare outputs, latency, and error rates before full cutover. Rollback plan must exist before shadow starts.
Set per-user token limits. Alert at 80% of monthly LLM budget. Implement token counting before API call — not after. For Turkish: average 1.4× more tokens than English.
Semantic cache (Redis + vector similarity) gives 40–60% hit rate on FAQ-style workloads. Cache at the prompt level, not the API level. TTL: 24h for factual, 1h for dynamic.
Instrument: input/output logging, token usage, latency P50/P95/P99, error rate, model version. Use structured logs (JSON) — they're queryable. Store in Basefyio analytics table.
Write the runbook before incidents happen. Cover: high error rate, cost spike, latency degradation, safety violation. Each should have a <5min detection-to-mitigation SLA.