
Shipping your first LLM feature feels magical. Shipping your tenth feels terrifying. Why? Because AI has a unique failure mode: it can be subtly wrong in ways traditional software never is.
A regular API either works or throws an error. An LLM might give you a confident answer that’s 80% correct—and you won’t know which 20% is wrong until a user complains.
Here’s how to ship AI features with confidence.
The Wake-Up Call
Last month, a fintech company deployed an LLM-powered customer support chatbot. Week one: amazing. 85% of tickets auto-resolved. Week two: disaster.
The bot started telling users to “just transfer your balance to resolve the overdraft fee.” Technically correct—but terrible advice that cost customers money. The LLM had learned from old support transcripts that included bad advice given by junior agents years ago.
They had no monitoring in place. They found out when Twitter complaints started rolling in.
Don’t be that company.
The Three Pillars of LLM MLOps
1. Evaluation Before Deployment
2. Observability in Production
3. Continuous Improvement Loops
Let’s break down each.
Pillar 1: Evaluation (Before You Ship)
You wouldn’t deploy code without tests. Don’t deploy LLM features without evals.
Building an Eval Set
Start with 50-100 real examples:
| Input | Expected Output | Reasoning |
|---|---|---|
| “What’s my balance?” | Should query user’s account | Requires auth + DB lookup |
| “Cancel my subscription” | Should ask for confirmation first | Safety check needed |
| “You guys suck!” | Empathetic response, no argument | De-escalation required |
Include edge cases:
- Ambiguous queries
- Requests for harmful actions
- Questions outside your domain
- Multilingual inputs (if applicable)
- Attempts to jailbreak or manipulate
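One lightweight way to manage an eval set like this is as plain data checked into your repo, so every change is reviewable. A minimal sketch (the field names are illustrative, not a required schema):

```python
# evals/cases.py -- hand-curated eval cases kept under version control.
# Field names are illustrative; use whatever schema your eval harness expects.
EVAL_CASES = [
    {
        "input": "What's my balance?",
        "expected_behavior": "queries the authenticated user's account",
        "tags": ["auth", "db_lookup"],
    },
    {
        "input": "Cancel my subscription",
        "expected_behavior": "asks for confirmation before acting",
        "tags": ["safety"],
    },
    {
        "input": "You guys suck!",
        "expected_behavior": "responds empathetically without arguing",
        "tags": ["de_escalation"],
    },
    # ...plus the edge cases above: ambiguous, out-of-domain, multilingual, jailbreaks.
]
```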
Automated Evaluation Metrics
You can’t manually review every output. Use automated checks:
✅ Exact Match (for structured outputs)
If you expect JSON with specific fields, check if output matches schema exactly.
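A schema check for a JSON output can be a few lines of standard-library code. A minimal sketch (the required field names are just an example):

```python
import json

REQUIRED_FIELDS = {"intent", "action", "reply"}  # example schema; adjust to yours

def passes_schema_check(raw_output: str) -> bool:
    """Return True only if the output is valid JSON with exactly the expected fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data.keys()) == REQUIRED_FIELDS
```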
✅ Semantic Similarity (for text)
Compare embedding similarity between expected and actual output. A cosine similarity above roughly 0.85 usually indicates a correct answer, but calibrate the threshold on your own eval set rather than trusting that number blindly.
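A minimal sketch of the check, assuming you already have an `embed()` function that turns a string into a vector (any embedding model will do):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_semantically_close(expected: str, actual: str, embed, threshold: float = 0.85) -> bool:
    """Pass if the actual output is close enough in meaning to the expected one.
    `embed` is whatever embedding function you use; tune the threshold to your data."""
    return cosine_similarity(embed(expected), embed(actual)) >= threshold
```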
✅ LLM-as-Judge
Use a second LLM to evaluate the first’s output:
“Rate this customer support response 1-10 for: accuracy, tone, helpfulness. Explain your rating.”
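A minimal judge sketch, assuming the OpenAI Python SDK; the judge model, prompt wording, and score format are all illustrative and easy to swap:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Rate this customer support response 1-10 for accuracy, tone, and helpfulness. "
    "Explain your rating, then end with a line 'SCORE: <number>'.\n\n"
    "User message:\n{user_message}\n\nResponse:\n{response}"
)

def judge_response(user_message: str, response: str) -> str:
    """Ask a second model to grade the first model's output."""
    result = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whichever judge model you trust
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(user_message=user_message, response=response),
        }],
    )
    return result.choices[0].message.content
```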
✅ Rule-Based Checks
Simple but effective:
- Response length (too short = incomplete, too long = rambling)
- Forbidden words (profanity, competitor names)
- Must include certain elements (e.g., support responses must have a “How can I help further?” closing)
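These checks are cheap enough to run on every eval case and even on live traffic. A minimal sketch; the word list, length bounds, and required closing are placeholders:

```python
FORBIDDEN_WORDS = {"damn", "competitorco"}       # placeholder list
REQUIRED_CLOSING = "How can I help further?"     # example required element

def rule_check(response: str, min_len: int = 40, max_len: int = 1200) -> list[str]:
    """Return a list of rule violations (an empty list means the response passes)."""
    violations = []
    if len(response) < min_len:
        violations.append("too_short")
    if len(response) > max_len:
        violations.append("too_long")
    if any(word in response.lower() for word in FORBIDDEN_WORDS):
        violations.append("forbidden_word")
    if REQUIRED_CLOSING not in response:
        violations.append("missing_closing")
    return violations
```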
The Baseline Test
Before deploying any change (new prompt, new model, new temperature), run your eval set and compare:
| Metric | Current (GPT-4) | New (Claude 3.7) | Change |
|---|---|---|---|
| Accuracy | 92% | 95% | +3% ✅ |
| Avg response time | 1.2s | 1.8s | +0.6s ⚠️ |
| Cost per request | $0.02 | $0.04 | +100% ⚠️ |
| Helpfulness score | 8.1/10 | 8.9/10 | +0.8 ✅ |
Decision: Better accuracy and helpfulness, but twice the cost and slower responses. Deploy only if the extra cost and latency are acceptable.
Without this baseline, you’re flying blind.
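One way to keep the comparison honest is to run both configurations over the same eval set and diff the aggregate metrics programmatically. A minimal sketch (the numbers would come from your eval harness, not be hard-coded):

```python
def compare_runs(current: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric delta between the current config and the candidate."""
    return {metric: round(candidate[metric] - current[metric], 4) for metric in current}

# Illustrative numbers matching the table above.
current = {"accuracy": 0.92, "avg_latency_s": 1.2, "cost_per_request": 0.02}
candidate = {"accuracy": 0.95, "avg_latency_s": 1.8, "cost_per_request": 0.04}

print(compare_runs(current, candidate))
# {'accuracy': 0.03, 'avg_latency_s': 0.6, 'cost_per_request': 0.02}
```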
Pillar 2: Observability (In Production)
Once deployed, monitor everything. Here’s what actually matters:
Critical Metrics to Track
1. Response Quality Drift
Sample 1% of production responses and run through your eval pipeline. If scores drop below threshold (e.g., from 90% to 85%), alert.
2. User Feedback
Track thumbs up/down, “report” clicks, follow-up questions. A spike in negative feedback = something broke.
3. Fallback Rate
How often does your app fall back to “I don’t know” or escalate to humans? Rising fallback rate = model struggling.
4. Latency Distribution
Track p50, p95, p99 latency. A slow LLM kills UX. Set alerts for p95 > 3 seconds.
5. Cost Per Session
If cost per user session suddenly doubles, investigate. Could be a prompt regression causing longer outputs.
6. Error Rates
API timeouts, rate limits, content policy violations. These kill user trust.
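A minimal sketch of the quality-drift and latency checks above as a periodic job; `run_eval` and `alert` are placeholders for your own scoring pipeline and notification hook:

```python
import random
import statistics

QUALITY_THRESHOLD = 0.85    # alert if the sampled pass rate drops below this
LATENCY_P95_LIMIT_S = 3.0   # alert if p95 latency exceeds this

def check_production_health(recent_requests, run_eval, alert):
    """Sample ~1% of recent traffic, score it, and check the latency tail.
    `recent_requests`: list of dicts with 'input', 'output', 'latency_s'.
    `run_eval(input, output)`: returns a quality score in [0, 1].
    `alert(message)`: sends a notification (Slack, PagerDuty, etc.)."""
    if not recent_requests:
        return
    sample = random.sample(recent_requests, max(1, len(recent_requests) // 100))
    pass_rate = statistics.mean(run_eval(r["input"], r["output"]) for r in sample)
    if pass_rate < QUALITY_THRESHOLD:
        alert(f"Quality drift: sampled pass rate {pass_rate:.2f} is below threshold")

    latencies = sorted(r["latency_s"] for r in recent_requests)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > LATENCY_P95_LIMIT_S:
        alert(f"Latency regression: p95 is {p95:.2f}s")
```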
LangSmith, LangFuse, or Braintrust
Don’t build observability from scratch. Use purpose-built LLM monitoring tools:
- LangSmith: Best if you use LangChain. Full tracing, eval runs, prompt versioning.
- LangFuse: Open-source alternative. Self-host or use cloud. Great dashboards.
- Braintrust: Excellent for evals. Supports A/B testing prompts in production.
- Helicone: Focuses on cost tracking and caching. Simple integration.
Pick one. Don’t go to production without it.
The 5-Minute Dashboard
Your “LLM health dashboard” should show at a glance:
- Requests in last 24 hours (is traffic normal?)
- Error rate (is the API working?)
- P95 latency (is it fast enough?)
- Average cost per request (are we bleeding money?)
- User satisfaction score (are users happy?)
- Sample of recent responses (spot-check quality)
If anything looks off, drill down.
Pillar 3: Continuous Improvement
Shipping is the beginning, not the end. Great AI products iterate weekly.
The Weekly Improvement Loop
- Monday: Review last week’s metrics. Identify top 3 issues.
- Tuesday: Pull failed examples from production. Add to eval set.
- Wednesday: Experiment with fixes (new prompts, model changes).
- Thursday: Run evals on fixes. Pick best candidate.
- Friday: Deploy to 10% of traffic (canary). Monitor closely.
- Weekend: If canary looks good, roll out to 100%.
This loop compounds. After 10 weeks, your AI is dramatically better.
Learning from Failures
Every bad output is a learning opportunity:
| Failure Type | How to Fix |
|---|---|
| Hallucinating facts | Add RAG with authoritative sources |
| Wrong tone | Add tone examples to prompt |
| Ignoring instructions | Rephrase instructions, add emphasis |
| Off-topic responses | Add classification layer first |
| Too verbose | Add length constraint to prompt |
A/B Testing Prompts
Never guess if a new prompt is better—test it:
- Control group: 80% of users get current prompt
- Test group: 20% get new prompt
- Duration: Run for 3-7 days
- Metrics: Compare user satisfaction, task success rate, cost
If test wins by >5% on key metric, roll out to everyone.
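Assignment should be sticky: the same user sees the same prompt for the whole test window. A minimal sketch that buckets users deterministically by hashing their ID (the 80/20 split matches the setup above):

```python
import hashlib

def prompt_variant(user_id: str, test_traffic_pct: int = 20) -> str:
    """Deterministically assign a user to the 'control' or 'test' prompt."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_traffic_pct else "control"

# The same user always lands in the same bucket across sessions.
print(prompt_variant("user-42"))
```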
Handling the “Long Tail” of Weird Inputs
In production, users will do things you never imagined:
- Ask questions in languages your model doesn’t handle well
- Submit essays when you expect one sentence
- Try to trick the AI into saying inappropriate things
- Submit gibberish to see what happens
Defense Strategies:
- Input validation: Check length, detect language, filter profanity
- Intent classification: “Is this query actually related to our product?” If no, reject.
- Output filtering: Scan responses for policy violations before showing to user
- Graceful degradation: If LLM fails, fall back to rule-based responses or human handoff
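A minimal sketch of how these defenses can be wired together; `classify_intent`, `llm_answer`, and `fallback` stand in for whatever classifier, model call, and rule-based or human-handoff path you already have:

```python
MAX_INPUT_CHARS = 2000  # placeholder limit; tune to your product

def handle_query(user_input: str, classify_intent, llm_answer, fallback) -> str:
    """Validate input, gate on intent, and degrade gracefully if the LLM call fails."""
    text = user_input.strip()
    if not text or len(text) > MAX_INPUT_CHARS:
        return fallback("Sorry, I couldn't process that request.")
    if classify_intent(text) == "off_topic":
        return fallback("I can only help with questions about our product.")
    try:
        return llm_answer(text)
    except Exception:
        # API down, timeout, or content-policy rejection: hand off instead of erroring.
        return fallback("Something went wrong; let me connect you with a human agent.")
```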
Version Control for Prompts
Treat prompts like code. Use version control:
```
prompts/
  customer_support/
    v1.txt
    v2.txt            (improved tone)
    v3.txt            (added examples)
    v4_current.txt
  summarization/
    v1.txt
    v2_current.txt
```
Log which prompt version was used for each production request. If quality drops, you can trace it to a specific prompt change.
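Logging the version can be as simple as attaching it to a structured log line for every call. A minimal sketch (the log fields and version string are illustrative):

```python
import json
import logging

logger = logging.getLogger("llm_requests")

PROMPT_VERSION = "customer_support/v4"  # update whenever the prompt file changes

def log_llm_request(request_id: str, model: str, latency_s: float, cost_usd: float) -> None:
    """Emit one structured log line per LLM call so a quality drop can be traced
    back to the exact prompt version and model that produced it."""
    logger.info(json.dumps({
        "request_id": request_id,
        "prompt_version": PROMPT_VERSION,
        "model": model,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
    }))
```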
The Deployment Checklist
Before pushing any LLM change to production:
☐ Ran eval set (>90% pass rate)
☐ Tested edge cases manually (weird inputs)
☐ Checked cost impact (within budget)
☐ Verified latency acceptable (p95 < 3s)
☐ Set up monitoring/alerts
☐ Prepared rollback plan (can revert in 5 min)
☐ Tested failure modes (what if API is down?)
☐ Got peer review of prompt changes
☐ Staged rollout plan (10% → 50% → 100%)
☐ On-call engineer assigned for first 24 hours
Skip any of these at your own risk.
Common Mistakes (And How to Avoid Them)
❌ Mistake: “It works on my machine”
✅ Fix: Always test with real user data, not cherry-picked examples.
❌ Mistake: No rollback plan
✅ Fix: Feature flags. Be able to disable AI feature instantly.
❌ Mistake: Optimizing for the wrong metric
✅ Fix: User satisfaction > model accuracy. A slightly less accurate but faster model might be better UX.
❌ Mistake: Ignoring cost until the bill arrives
✅ Fix: Set cost alerts. Get notified if daily spend exceeds threshold.
❌ Mistake: Treating prompts as “set and forget”
✅ Fix: Continuous improvement. Good products iterate weekly.
The Incident Response Plan
When something goes wrong (it will), follow this runbook:
- Detect: Alert fires (quality drop, error spike, cost spike)
- Assess: How bad? How many users affected? What’s the root cause?
- Contain: If critical, disable AI feature immediately (fallback to rules or humans)
- Fix: Roll back to last known good version OR apply emergency patch
- Verify: Monitor for 1 hour. Is it really fixed?
- Post-mortem: Write up what happened, why, and how to prevent it
Practice this drill before you need it.
The Bottom Line
Shipping LLM features is different from shipping traditional software. You need:
- Evals before deployment (don’t ship blind)
- Monitoring in production (catch issues fast)
- Continuous improvement (iterate weekly)
The companies winning with AI aren’t the ones with the best models. They’re the ones with the best processes.
Build the processes now, and you’ll ship AI features with confidence instead of fear.