
Shipping your first LLM feature feels magical. Shipping your tenth feels terrifying. Why? Because AI has a unique failure mode: it can be subtly wrong in ways traditional software never is.
A regular API either works or throws an error. An LLM might give you a confident answer that’s 80% correct—and you won’t know which 20% is wrong until a user complains.
Here’s how to ship AI features with confidence.
The Wake-Up Call
Last month, a fintech company deployed an LLM-powered customer support chatbot. Week one: amazing. 85% of tickets auto-resolved. Week two: disaster.
The bot started telling users to “just transfer your balance to resolve the overdraft fee.” Technically correct—but terrible advice that cost customers money. The LLM had learned from old support transcripts that included bad advice given by junior agents years ago.
They had no monitoring in place. They found out when Twitter complaints started rolling in.
Don’t be that company.
The Three Pillars of LLM MLOps
1. Evaluation Before Deployment
2. Observability in Production
3. Continuous Improvement Loops
Let’s break down each.
Pillar 1: Evaluation (Before You Ship)
You wouldn’t deploy code without tests. Don’t deploy LLM features without evals.
Building an Eval Set
Start with 50-100 real examples:
| Input | Expected Output | Reasoning |
|---|---|---|
| “What’s my balance?” | Should query user’s account | Requires auth + DB lookup |
| “Cancel my subscription” | Should ask for confirmation first | Safety check needed |
| “You guys suck!” | Empathetic response, no argument | De-escalation required |
Include edge cases:
- Ambiguous queries
- Requests for harmful actions
- Questions outside your domain
- Multilingual inputs (if applicable)
- Attempts to jailbreak or manipulate
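One lightweight way to manage an eval set like this is as plain data checked into your repo, so every change is reviewable. A minimal sketch (the field names are illustrative, not a required schema):

```python
# evals/cases.py -- hand-curated eval cases kept under version control.
# Field names are illustrative; use whatever schema your eval harness expects.
EVAL_CASES = [
    {
        "input": "What's my balance?",
        "expected_behavior": "queries the authenticated user's account",
        "tags": ["auth", "db_lookup"],
    },
    {
        "input": "Cancel my subscription",
        "expected_behavior": "asks for confirmation before acting",
        "tags": ["safety"],
    },
    {
        "input": "You guys suck!",
        "expected_behavior": "responds empathetically without arguing",
        "tags": ["de_escalation"],
    },
    # ...plus the edge cases above: ambiguous, out-of-domain, multilingual, jailbreaks.
]
```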
Automated Evaluation Metrics
You can’t manually review every output. Use automated checks:
✅ Exact Match (for structured outputs)
If you expect JSON with specific fields, check if output matches schema exactly.
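A schema check for a JSON output can be a few lines of standard-library code. A minimal sketch (the required field names are just an example):

```python
import json

REQUIRED_FIELDS = {"intent", "action", "reply"}  # example schema; adjust to yours

def passes_schema_check(raw_output: str) -> bool:
    """Return True only if the output is valid JSON with exactly the expected fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data.keys()) == REQUIRED_FIELDS
```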
✅ Semantic Similarity (for text)
Compare embedding similarity between expected and actual output. A cosine similarity above roughly 0.85 usually indicates a correct answer, but calibrate the threshold on your own eval set rather than trusting that number blindly.
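A minimal sketch of the check, assuming you already have an `embed()` function that turns a string into a vector (any embedding model will do):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_semantically_close(expected: str, actual: str, embed, threshold: float = 0.85) -> bool:
    """Pass if the actual output is close enough in meaning to the expected one.
    `embed` is whatever embedding function you use; tune the threshold to your data."""
    return cosine_similarity(embed(expected), embed(actual)) >= threshold
```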
✅ LLM-as-Judge
Use a second LLM to evaluate the first’s output:
“Rate this customer support response 1-10 for: accuracy, tone, helpfulness. Explain your rating.”
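A minimal judge sketch, assuming the OpenAI Python SDK; the judge model, prompt wording, and score format are all illustrative and easy to swap:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Rate this customer support response 1-10 for accuracy, tone, and helpfulness. "
    "Explain your rating, then end with a line 'SCORE: <number>'.\n\n"
    "User message:\n{user_message}\n\nResponse:\n{response}"
)

def judge_response(user_message: str, response: str) -> str:
    """Ask a second model to grade the first model's output."""
    result = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whichever judge model you trust
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(user_message=user_message, response=response),
        }],
    )
    return result.choices[0].message.content
```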
✅ Rule-Based Checks
Simple but effective:
- Response length (too short = incomplete, too long = rambling)
- Forbidden words (profanity, competitor names)
- Must include certain elements (e.g., support responses must have a “How can I help further?” closing)
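These checks are cheap enough to run on every eval case and even on live traffic. A minimal sketch; the word list, length bounds, and required closing are placeholders:

```python
FORBIDDEN_WORDS = {"damn", "competitorco"}       # placeholder list
REQUIRED_CLOSING = "How can I help further?"     # example required element

def rule_check(response: str, min_len: int = 40, max_len: int = 1200) -> list[str]:
    """Return a list of rule violations (an empty list means the response passes)."""
    violations = []
    if len(response) < min_len:
        violations.append("too_short")
    if len(response) > max_len:
        violations.append("too_long")
    if any(word in response.lower() for word in FORBIDDEN_WORDS):
        violations.append("forbidden_word")
    if REQUIRED_CLOSING not in response:
        violations.append("missing_closing")
    return violations
```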
The Baseline Test
Before deploying any change (new prompt, new model, new temperature), run your eval set and compare:
| Metric | Current (GPT-4) | New (Claude 3.7) | Change |
|---|---|---|---|
| Accuracy | 92% | 95% | +3% ✅ |
| Avg response time | 1.2s | 1.8s | +0.6s ⚠️ |
| Cost per request | $0.02 | $0.04 | +100% ⚠️ |
| Helpfulness score | 8.1/10 | 8.9/10 | +0.8 ✅ |
Decision: Better accuracy and helpfulness, but twice the cost and slower responses. Deploy only if the extra cost and latency are acceptable.
Without this baseline, you’re flying blind.
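One way to keep the comparison honest is to run both configurations over the same eval set and diff the aggregate metrics programmatically. A minimal sketch (the numbers would come from your eval harness, not be hard-coded):

```python
def compare_runs(current: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric delta between the current config and the candidate."""
    return {metric: round(candidate[metric] - current[metric], 4) for metric in current}

# Illustrative numbers matching the table above.
current = {"accuracy": 0.92, "avg_latency_s": 1.2, "cost_per_request": 0.02}
candidate = {"accuracy": 0.95, "avg_latency_s": 1.8, "cost_per_request": 0.04}

print(compare_runs(current, candidate))
# {'accuracy': 0.03, 'avg_latency_s': 0.6, 'cost_per_request': 0.02}
```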
Pillar 2: Observability (In Production)
Once deployed, monitor everything. Here’s what actually matters:
Critical Metrics to Track
1. Response Quality Drift
Sample 1% of production responses and run through your eval pipeline. If scores drop below threshold (e.g., from 90% to 85%), alert.
2. User Feedback
Track thumbs up/down, “report” clicks, follow-up questions. A spike in negative feedback = something broke.
3. Fallback Rate
How often does your app fall back to “I don’t know” or escalate to humans? Rising fallback rate = model struggling.
4. Latency Distribution
Track p50, p95, p99 latency. A slow LLM kills UX. Set alerts for p95 > 3 seconds.
5. Cost Per Session
If cost per user session suddenly doubles, investigate. Could be a prompt regression causing longer outputs.
6. Error Rates
API timeouts, rate limits, content policy violations. These kill user trust.
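A minimal sketch of the quality-drift and latency checks above as a periodic job; `run_eval` and `alert` are placeholders for your own scoring pipeline and notification hook:

```python
import random
import statistics

QUALITY_THRESHOLD = 0.85    # alert if the sampled pass rate drops below this
LATENCY_P95_LIMIT_S = 3.0   # alert if p95 latency exceeds this

def check_production_health(recent_requests, run_eval, alert):
    """Sample ~1% of recent traffic, score it, and check the latency tail.
    `recent_requests`: list of dicts with 'input', 'output', 'latency_s'.
    `run_eval(input, output)`: returns a quality score in [0, 1].
    `alert(message)`: sends a notification (Slack, PagerDuty, etc.)."""
    if not recent_requests:
        return
    sample = random.sample(recent_requests, max(1, len(recent_requests) // 100))
    pass_rate = statistics.mean(run_eval(r["input"], r["output"]) for r in sample)
    if pass_rate < QUALITY_THRESHOLD:
        alert(f"Quality drift: sampled pass rate {pass_rate:.2f} is below threshold")

    latencies = sorted(r["latency_s"] for r in recent_requests)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > LATENCY_P95_LIMIT_S:
        alert(f"Latency regression: p95 is {p95:.2f}s")
```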
LangSmith, LangFuse, or Braintrust
Don’t build observability from scratch. Use purpose-built LLM monitoring tools:
- LangSmith: Best if you use LangChain. Full tracing, eval runs, prompt versioning.
- LangFuse: Open-source alternative. Self-host or use cloud. Great dashboards.
- Braintrust: Excellent for evals. Supports A/B testing prompts in production.
- Helicone: Focuses on cost tracking and caching. Simple integration.
Pick one. Don’t go to production without it.
The 5-Minute Dashboard
Your “LLM health dashboard” should show at a glance:
- Requests in last 24 hours (is traffic normal?)
- Error rate (is the API working?)
- P95 latency (is it fast enough?)
- Average cost per request (are we bleeding money?)
- User satisfaction score (are users happy?)
- Sample of recent responses (spot-check quality)
If anything looks off, drill down.
Pillar 3: Continuous Improvement
Shipping is the beginning, not the end. Great AI products iterate weekly.
The Weekly Improvement Loop
- Monday: Review last week’s metrics. Identify top 3 issues.
- Tuesday: Pull failed examples from production. Add to eval set.
- Wednesday: Experiment with fixes (new prompts, model changes).
- Thursday: Run evals on fixes. Pick best candidate.
- Friday: Deploy to 10% of traffic (canary). Monitor closely.
- Weekend: If canary looks good, roll out to 100%.
This loop compounds. After 10 weeks, your AI is dramatically better.
Learning from Failures
Every bad output is a learning opportunity:
| Failure Type | How to Fix |
|---|---|
| Hallucinating facts | Add RAG with authoritative sources |
| Wrong tone | Add tone examples to prompt |
| Ignoring instructions | Rephrase instructions, add emphasis |
| Off-topic responses | Add classification layer first |
| Too verbose | Add length constraint to prompt |
A/B Testing Prompts
Never guess if a new prompt is better—test it:
- Control group: 80% of users get current prompt
- Test group: 20% get new prompt
- Duration: Run for 3-7 days
- Metrics: Compare user satisfaction, task success rate, cost
If test wins by >5% on key metric, roll out to everyone.
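Assignment should be sticky: the same user sees the same prompt for the whole test window. A minimal sketch that buckets users deterministically by hashing their ID (the 80/20 split matches the setup above):

```python
import hashlib

def prompt_variant(user_id: str, test_traffic_pct: int = 20) -> str:
    """Deterministically assign a user to the 'control' or 'test' prompt."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_traffic_pct else "control"

# The same user always lands in the same bucket across sessions.
print(prompt_variant("user-42"))
```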
Handling the “Long Tail” of Weird Inputs
In production, users will do things you never imagined:
- Ask questions in languages your model doesn’t handle well
- Submit essays when you expect one sentence
- Try to trick the AI into saying inappropriate things
- Submit gibberish to see what happens
Defense Strategies:
- Input validation: Check length, detect language, filter profanity
- Intent classification: “Is this query actually related to our product?” If no, reject.
- Output filtering: Scan responses for policy violations before showing to user
- Graceful degradation: If LLM fails, fall back to rule-based responses or human handoff
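A minimal sketch of how these defenses can be wired together; `classify_intent`, `llm_answer`, and `fallback` stand in for whatever classifier, model call, and rule-based or human-handoff path you already have:

```python
MAX_INPUT_CHARS = 2000  # placeholder limit; tune to your product

def handle_query(user_input: str, classify_intent, llm_answer, fallback) -> str:
    """Validate input, gate on intent, and degrade gracefully if the LLM call fails."""
    text = user_input.strip()
    if not text or len(text) > MAX_INPUT_CHARS:
        return fallback("Sorry, I couldn't process that request.")
    if classify_intent(text) == "off_topic":
        return fallback("I can only help with questions about our product.")
    try:
        return llm_answer(text)
    except Exception:
        # API down, timeout, or content-policy rejection: hand off instead of erroring.
        return fallback("Something went wrong; let me connect you with a human agent.")
```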
Version Control for Prompts
Treat prompts like code. Use version control:
```
prompts/
  customer_support/
    v1.txt
    v2.txt            (improved tone)
    v3.txt            (added examples)
    v4_current.txt
  summarization/
    v1.txt
    v2_current.txt
```
Log which prompt version was used for each production request. If quality drops, you can trace it to a specific prompt change.
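Logging the version can be as simple as attaching it to a structured log line for every call. A minimal sketch (the log fields and version string are illustrative):

```python
import json
import logging

logger = logging.getLogger("llm_requests")

PROMPT_VERSION = "customer_support/v4"  # update whenever the prompt file changes

def log_llm_request(request_id: str, model: str, latency_s: float, cost_usd: float) -> None:
    """Emit one structured log line per LLM call so a quality drop can be traced
    back to the exact prompt version and model that produced it."""
    logger.info(json.dumps({
        "request_id": request_id,
        "prompt_version": PROMPT_VERSION,
        "model": model,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
    }))
```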
The Deployment Checklist
Before pushing any LLM change to production:
☐ Ran eval set (>90% pass rate)
☐ Tested edge cases manually (weird inputs)
☐ Checked cost impact (within budget)
☐ Verified latency acceptable (p95 < 3s)
☐ Set up monitoring/alerts
☐ Prepared rollback plan (can revert in 5 min)
☐ Tested failure modes (what if API is down?)
☐ Got peer review of prompt changes
☐ Staged rollout plan (10% → 50% → 100%)
☐ On-call engineer assigned for first 24 hours
Skip any of these at your own risk.
Common Mistakes (And How to Avoid Them)
❌ Mistake: “It works on my machine”
✅ Fix: Always test with real user data, not cherry-picked examples.
❌ Mistake: No rollback plan
✅ Fix: Feature flags. Be able to disable AI feature instantly.
❌ Mistake: Optimizing for the wrong metric
✅ Fix: User satisfaction > model accuracy. A slightly less accurate but faster model might be better UX.
❌ Mistake: Ignoring cost until the bill arrives
✅ Fix: Set cost alerts. Get notified if daily spend exceeds threshold.
❌ Mistake: Treating prompts as “set and forget”
✅ Fix: Continuous improvement. Good products iterate weekly.
The Incident Response Plan
When something goes wrong (it will), follow this runbook:
- Detect: Alert fires (quality drop, error spike, cost spike)
- Assess: How bad? How many users affected? What’s the root cause?
- Contain: If critical, disable AI feature immediately (fallback to rules or humans)
- Fix: Roll back to last known good version OR apply emergency patch
- Verify: Monitor for 1 hour. Is it really fixed?
- Post-mortem: Write up what happened, why, and how to prevent it
Practice this drill before you need it.
The Bottom Line
Shipping LLM features is different from shipping traditional software. You need:
- Evals before deployment (don’t ship blind)
- Monitoring in production (catch issues fast)
- Continuous improvement (iterate weekly)
The companies winning with AI aren’t the ones with the best models. They’re the ones with the best processes.
Build the processes now, and you’ll ship AI features with confidence instead of fear.