May 25, 2026
Your AI MVP went live last Thursday. The first paying customer signed up on Saturday. By the end of the second week you will either be in a tight loop of small fixes that compound into a real product, or in the other kind of loop — chasing a different bug every day, watching the LLM bill creep up, and arguing about what “the model got worse” actually means. The AI MVP post-launch playbook below is what we hand founders to keep them in the first loop for the first 90 days.
TL;DR
- In week 1, instrument every LLM call before fixing anything. You cannot improve what you cannot see.
- In week 2, build a 50-row eval set from real first-week traffic. Score every release against it.
- Wire a spend-by-feature dashboard, not just a monthly bill. Surprise costs always come from one feature, one tenant, or one retry loop.
- Add three guardrails early: input length cap, output schema validation, per-user rate limit. They prevent 80% of the embarrassing failures.
- Turn customer complaints into eval cases on the same day they arrive. The “complaint-to-eval” loop is the single best learning machine an MVP can build.
- Lock a 30-minute weekly quality review by day 60. By day 90, run your first model bake-off across the providers you are not currently using.
Why this matters
Pre-launch, the work is binary: it ships or it does not. Post-launch, the work is statistical. You are running a system whose behavior drifts with model versions, prompt edits, user inputs, and your own bug fixes. Founders who treat the launch like the finish line spend month two firefighting and month three watching churn. Founders who treat day 1 as the start of an instrumented learning loop usually look back at month three with a product that is measurably better than the one they launched — and an LLM bill that did not surprise them.
Step-by-step: the 90-day AI MVP post-launch playbook
1. Days 0 to 7: instrument every LLM call before fixing anything
The first instinct after launch is to fix the first bug a real user reports. Resist it for 48 hours. The single highest-leverage thing you can do in week 1 is make the system observable, because every fix you ship after that becomes measurable instead of vibes-based.
Minimum viable instrumentation for an AI MVP is one log record per LLM call with these fields: request id, user id (or tenant id), feature name (search, summarize, draft-email, etc.), model name and version, prompt template version, input token count, output token count, latency in milliseconds, finish reason, and a hash of the input. Add a parent trace id if the call is part of a larger workflow with tool calls or retrieval.
Do not roll your own. Use OpenTelemetry’s GenAI semantic conventions with a hosted backend (Braintrust, Langfuse, Honeycomb, Datadog LLM Observability; any of them will do), or wire OpenTelemetry into your existing logging pipeline if you already have one. The whole setup is a one-day job for one engineer. Skipping it costs you weeks later.
Concrete acceptance criterion for week 1: open the dashboard, filter by any single feature, see every call from the last hour, and click into one of them to read the full prompt and response.
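To make the record shape concrete, here is a minimal sketch of the one-record-per-call log described above. It only illustrates the fields; it does not replace a tracing backend. The class name, field names, and sample values (`LLMCallRecord`, `tenant-42`, `example-model-2026-01`) are assumptions for illustration, not any platform's schema.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LLMCallRecord:
    """One log record per LLM call. Field names are illustrative."""
    request_id: str
    user_id: str                     # or a tenant id
    feature: str                     # e.g. "search", "summarize", "draft-email"
    model: str                       # model name and version
    prompt_template_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    finish_reason: str
    input_hash: str                  # hash of the input, not the raw text
    trace_id: Optional[str] = None   # parent trace id for multi-step workflows


def hash_input(text: str) -> str:
    """Return a stable SHA-256 hex digest of the input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical values for a single call.
record = LLMCallRecord(
    request_id=str(uuid.uuid4()),
    user_id="tenant-42",
    feature="summarize",
    model="example-model-2026-01",
    prompt_template_version="v3",
    input_tokens=1200,
    output_tokens=180,
    latency_ms=950,
    finish_reason="stop",
    input_hash=hash_input("Summarize this document ..."),
)

# Emit one JSON line per call so any log shipper can pick it up.
print(json.dumps(asdict(record)))
```

Hashing the input lets you spot repeated requests and retry loops without having to store the raw text in every index.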
2. Days 0 to 14: stand up a 50-row eval set from real traffic
By day 7 you have real user requests in the logs. Pick 50 of them by hand. Aim for variety — 30 typical cases, 10 hard cases (long inputs, weird formatting, edge intents), 10 cases where the model already did something noticeably wrong. For each row, write down what the right answer looks like, in plain language. Not a rubric, not a JSON schema, just one sentence.
This is your eval set. Run it against your production prompts and your production model. Record the pass rate. From now on, every release that touches a prompt, a model name, a temperature, a tool definition, or a retrieval index runs against this set first. If the pass rate drops, you do not ship.
Two practical notes. First, you do not need an LLM-as-judge in week 1; a junior engineer reviewing 50 outputs in 20 minutes is more reliable than a flaky judge prompt. Add the judge later once the rubric is stable. Second, the eval set is a living artifact. Every time a customer complains, the failing example becomes row 51, 52, 53. By day 90 you should be at 150 to 250 rows, weighted toward the hard cases.
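The release gate described above can be sketched in a few lines. This is a rough illustration under stated assumptions: `EvalRow`, `pass_rate`, and `can_ship` are hypothetical names, `generate` stands in for your production prompt plus model, and `passes` stands in for the human reviewer's verdict (here replayed from a recorded dictionary).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalRow:
    row_id: int
    input_text: str
    expected: str   # one plain-language sentence describing the right answer


def pass_rate(
    rows: List[EvalRow],
    generate: Callable[[str], str],
    passes: Callable[[EvalRow, str], bool],
) -> float:
    """Run every row through `generate` and return the fraction judged as passing."""
    if not rows:
        raise ValueError("eval set is empty")
    passed = sum(1 for row in rows if passes(row, generate(row.input_text)))
    return passed / len(rows)


def can_ship(baseline: float, candidate: float) -> bool:
    """Ship only if the pass rate did not drop."""
    return candidate >= baseline


# Tiny stand-in eval set; a real one would hold 50 rows from production logs.
rows = [
    EvalRow(1, "Summarize: meeting moved to Friday", "Mentions the move to Friday"),
    EvalRow(2, "Summarize: invoice overdue by 30 days", "Mentions the invoice is overdue"),
    EvalRow(3, "Summarize: server migration done", "Mentions the migration is complete"),
]

# Stub model call and recorded human verdicts, both hypothetical.
fake_generate = lambda text: text.split(": ", 1)[1]
verdicts = {1: True, 2: True, 3: False}
human_review = lambda row, output: verdicts[row.row_id]

candidate_rate = pass_rate(rows, fake_generate, human_review)
print(f"pass rate: {candidate_rate:.2f}, ship: {can_ship(0.76, candidate_rate)}")
```

Keeping `passes` as a pluggable callable means the same gate can later take an LLM-as-judge without changing how releases are scored.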
3. Days 7 to 30: build the spend-by-feature dashboard
The default LLM bill is one number per month. That number tells you nothing actionable. The dashboard you need shows daily cost broken down by feature, by model, and by tenant or environment. Three breakdowns on three small charts, refreshed daily.
Why this matters for an MVP: cost surprises in the first 90 days almost never come from total usage growing. They come from one feature shipping a regression that triples its token usage per call, one customer with a script hitting the API in a loop, or a retry policy that quietly retries 6 times instead of 2. Each of these is invisible in the monthly total and obvious in a per-feature view.
Build it from the same instrumentation you set up in step 1. Most observability platforms have a cost column out of the box. If you are rolling your own, multiply token counts by current per-million-token rates from your provider’s pricing page and sum by the dimensions above. Add an alert that fires when daily cost on any single feature crosses 2x its 7-day moving average. That one alert will pay for the dashboard within the first month.
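If you do roll your own, the cost calculation and the 2x alert might look roughly like this. The prices in `PRICES` are placeholders, not real provider rates, and the record keys (`day`, `feature`, `model`, `input_tokens`, `output_tokens`) are assumed to match whatever your instrumentation emits.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical per-million-token prices in USD; replace with your provider's rates.
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: token counts times per-million-token rates."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def daily_cost_by_feature(records: List[dict]) -> Dict[Tuple[str, str], float]:
    """Sum call costs by (day, feature)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for r in records:
        key = (r["day"], r["feature"])
        totals[key] += call_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)


def should_alert(previous_days: List[float], today: float, factor: float = 2.0) -> bool:
    """Fire when today's cost exceeds `factor` times the 7-day moving average."""
    if len(previous_days) < 7:
        return False  # not enough history for a 7-day average yet
    return today > factor * mean(previous_days[-7:])


records = [
    {"day": "2026-05-01", "feature": "summarize", "model": "example-model",
     "input_tokens": 1000, "output_tokens": 500},
    {"day": "2026-05-01", "feature": "summarize", "model": "example-model",
     "input_tokens": 1000, "output_tokens": 500},
]
totals = daily_cost_by_feature(records)
print(totals)
print(should_alert([10.0] * 7, 25.0))
```

The same grouping works for the per-model and per-tenant views; only the key in `daily_cost_by_feature` changes.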
4. Days 14 to 30: add three guardrails that prevent 80% of embarrassing failures
Pre-launch you can write a long list of safety, quality, and compliance guardrails. Post-launch with real traffic you discover that three of them prevent most of the actual failures you see. Ship those three first; add the rest as specific incidents force you to.
The three for almost every AI MVP:
- Input length cap. Reject or truncate any input above a hard ceiling. Pick the ceiling based on what your feature actually needs (a chat turn needs maybe 8k tokens, a document summarizer maybe 100k). Long-tail inputs cause long-tail bills and long-tail latencies, both of which look like the model being broken when it is not.
- Output schema validation. If your feature returns structured output, validate it on the way out. If it fails, retry once with a stricter prompt; if it fails again, return a fallback response and log the failure as an eval candidate. This catches roughly half of all customer-visible AI bugs in an MVP.
- Per-user rate limit. Not a global rate limit — a per-user one. One enthusiastic free-tier user with a script can spend more than your top ten paying customers combined.
A guardrail you do not need in week 4: a hand-tuned safety classifier on every output. Either your model provider already has one, or your use case does not require it. Adding one early adds latency and bug surface without preventing a real failure mode.
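The three guardrails listed above fit in one small module. The sketch below rests on several assumptions: the ceiling is measured in characters as a rough proxy for tokens, the schema in `REQUIRED_KEYS` is invented for illustration, and `call_model(prompt, strict)` is a hypothetical wrapper around your own LLM call.

```python
import json
import time
from collections import defaultdict, deque
from typing import Callable, Optional

MAX_INPUT_CHARS = 32_000  # hypothetical ceiling; characters approximate tokens


def cap_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Guardrail 1: truncate inputs above a hard ceiling."""
    return text[:max_chars]


REQUIRED_KEYS = {"summary": str, "hooks": list}  # hypothetical output schema


def parse_valid(raw: str) -> Optional[dict]:
    """Return the parsed output if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if all(isinstance(data.get(k), t) for k, t in REQUIRED_KEYS.items()):
        return data
    return None


def generate_validated(call_model: Callable[[str, bool], str], prompt: str,
                       fallback: dict, log_failure: Callable[[str], None]) -> dict:
    """Guardrail 2: validate, retry once with a stricter prompt, then fall back."""
    for strict in (False, True):
        parsed = parse_valid(call_model(prompt, strict))
        if parsed is not None:
            return parsed
    log_failure(prompt)  # the failing input becomes an eval candidate
    return fallback


class PerUserRateLimiter:
    """Guardrail 3: sliding-window limit tracked per user, not globally."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window = window_seconds
        self._calls = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[user_id]
        while calls and now - calls[0] >= self.window:
            calls.popleft()  # drop calls that fell out of the window
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

In production you would count tokens with your provider's tokenizer rather than characters, and keep the rate-limit state in shared storage if you run more than one process.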
5. Days 30 to 60: run the complaint-to-eval loop on a 24-hour clock
By day 30 you have a stream of customer complaints. Some are bugs. Some are misunderstandings. Some are the model genuinely doing the wrong thing. The complaint-to-eval loop turns each one into compounding leverage.
The loop has four steps. First, every customer complaint about an AI output gets the input and output pulled from the logs the same day. Second, the engineer who triages it decides whether the output is wrong (add to eval set), the prompt is wrong (fix the prompt, run the eval set), or the spec is wrong (write down what right looks like, add to eval set, fix the prompt). Third, the fix ships only if the eval pass rate did not drop. Fourth, the customer gets a reply that names the specific change, not “we are looking into it.”
Two months of this loop and your eval set encodes your product judgment. New engineers can ship prompt changes safely because the eval set tells them when they break something. A model upgrade becomes a 15-minute decision instead of a week of arguing.
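The triage decision and the "row 51" step can be written down as data, which keeps the loop consistent across engineers. This is a sketch only: `Triage`, `NEXT_STEPS`, `add_complaint_row`, and `fix_may_ship` are hypothetical names, and the sample complaint is invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Triage(Enum):
    OUTPUT_WRONG = "output is wrong"
    PROMPT_WRONG = "prompt is wrong"
    SPEC_WRONG = "spec is wrong"


# What the triaging engineer does for each outcome, as described in the loop.
NEXT_STEPS = {
    Triage.OUTPUT_WRONG: ["add to eval set"],
    Triage.PROMPT_WRONG: ["fix the prompt", "run the eval set"],
    Triage.SPEC_WRONG: ["write down what right looks like", "add to eval set", "fix the prompt"],
}


@dataclass
class EvalRow:
    row_id: int
    input_text: str
    expected: str   # one sentence describing the right answer


def add_complaint_row(eval_set: List[EvalRow], logged_input: str, expected: str) -> EvalRow:
    """Append the failing input from the logs as the next eval row."""
    row = EvalRow(row_id=len(eval_set) + 1, input_text=logged_input, expected=expected)
    eval_set.append(row)
    return row


def fix_may_ship(pass_rate_before: float, pass_rate_after: float) -> bool:
    """The fix ships only if the eval pass rate did not drop."""
    return pass_rate_after >= pass_rate_before


# Start from a 50-row set and add one complaint.
eval_set = [EvalRow(i, f"input {i}", f"expected {i}") for i in range(1, 51)]
new_row = add_complaint_row(
    eval_set,
    logged_input="example.com",
    expected="Mentions only funding rounds that appear in the retrieved sources",
)
print(new_row.row_id, NEXT_STEPS[Triage.SPEC_WRONG])
```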
6. Days 60 to 90: lock a 30-minute weekly quality review
Same time every week, 30 minutes, three things on the agenda. First, the eval pass rate trend on the last 4 weeks of releases. Second, the top 3 features by cost-per-user this week vs. last week. Third, the 5 most-recent customer complaints and what changed in response. That is the whole meeting.
It looks small. It is the single mechanism that guards against quality drift, cost drift, and losing touch with the customers who actually use the product. In our experience, founders who keep this 30-minute meeting on the calendar for a year typically have measurably better products than founders who do not.
7. Day 90: run the first model bake-off
Around day 90 you have three things you did not have at launch: a stable eval set, real cost numbers, and a sense of which features are doing the heavy lifting. That is the moment to spend two engineering days benchmarking your top one or two features against the model providers you are not currently using.
Pick the feature with the highest spend or the highest quality variance. Run the eval set against your current model, then against two alternatives: for example Claude Sonnet 4.6 if you are on GPT-5, a smaller model such as Claude Haiku 4.5 if you are on a frontier model, or an open-weights model like Llama 4 if cost is biting. Compare pass rate, p95 latency, and cost per pass. Do not switch on the basis of marketing benchmarks; switch on the basis of your own eval set.
In 2026, swapping models on a single feature is usually a one-day change. The win is often 30 to 50% cost reduction at flat quality, or measurably better quality at flat cost. Either way, the bake-off pays for itself in week one.
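The three bake-off numbers can be computed from per-row results in a few lines. The sketch below assumes each eval run produces a list of dicts with hypothetical keys `passed`, `latency_ms`, and `cost_usd`. It uses the nearest-rank definition of p95, and reads "cost per pass" as total cost divided by the number of passing rows.

```python
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (integer math avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(results: List[dict]) -> Dict[str, float]:
    """Pass rate, p95 latency, and cost per passing row for one model."""
    passes = sum(1 for r in results if r["passed"])
    total_cost = sum(r["cost_usd"] for r in results)
    return {
        "pass_rate": passes / len(results),
        "p95_latency_ms": p95([r["latency_ms"] for r in results]),
        "cost_per_pass": total_cost / passes if passes else float("inf"),
    }


# Made-up results for one candidate model on a four-row eval set.
candidate_results = [
    {"passed": True, "latency_ms": 800, "cost_usd": 0.03},
    {"passed": True, "latency_ms": 1200, "cost_usd": 0.03},
    {"passed": False, "latency_ms": 900, "cost_usd": 0.03},
    {"passed": True, "latency_ms": 3000, "cost_usd": 0.03},
]
summary = summarize(candidate_results)
print(summary)
```

Cost per pass penalizes a cheap model that fails often, which is why it is a better comparison than raw cost per call.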
A concrete example: an AI sales-research MVP, 6 weeks in
A two-founder team ships an AI sales-research MVP in early April: paste in a domain, get a one-page brief on the company, the funding history, and three personalized outreach hooks. Six paying customers in week one, eighteen by week three.
Week 1: they wire OpenTelemetry GenAI traces into Braintrust. Total instrumentation work: half a day. By Friday they can see every brief generation in a dashboard with cost and latency per call.
Week 2: they hand-pick 50 historical briefs and write down what a good brief looks like in one sentence per row. Initial pass rate against their current Claude Sonnet 4.6 prompt: 38 out of 50. They are surprised; they thought it was higher.
Week 3: the cost dashboard shows the outreach-hooks feature is costing 3x the brief-summary feature, because the prompt is asking for 10 candidate hooks and discarding 7. They cap the prompt at 3 candidate hooks. Cost on that feature drops 60% overnight. Pass rate holds.
Week 4: a customer complaint surfaces that the brief is making up funding rounds. The failing input goes into the eval set as row 51. The fix: add a retrieval step against Crunchbase before generating, and validate that every funding round mentioned appears in the retrieval context. Pass rate on the new eval set climbs to 44 out of 51.
Week 6: weekly quality review reveals the eval pass rate has trended 38 → 41 → 44 → 47 across four releases. The LLM bill is up 20% on 200% more customers, meaning cost per customer is down 60%. They book the day-90 model bake-off as a calendar item.
This is not a special story. It is what the playbook produces when you run it.
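The cost-per-customer figure from week 6 follows from dividing the two growth factors. A quick check, with variable names chosen for illustration:

```python
bill_growth = 0.20       # LLM bill is up 20%
customer_growth = 2.00   # 200% more customers, i.e. three times as many

# New cost per customer relative to the old one.
ratio = (1 + bill_growth) / (1 + customer_growth)
change = ratio - 1

print(f"cost per customer is {ratio:.0%} of the old level, a change of {change:.0%}")
```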
Common pitfalls
- Treating the LLM bill as accounting data instead of an operational control. The bill is the slowest, dumbest signal you have. Build the per-feature dashboard.
- Building an elaborate eval framework before you have 50 hand-picked real examples. The framework is the easy part. The examples are the moat.
- Adding a fifth guardrail before the first three are paying for themselves. Three is enough at launch. Add the fourth when a specific incident demands it.
- Switching models because a benchmark on Twitter says you should. Switch models because your eval set says you should.
FAQ
How much engineering time does the 90-day playbook actually consume?
Roughly 6 to 10 engineer-days in total over 90 days, front-loaded into weeks 1 and 2. Most of it is one-time setup: instrumentation, the eval set, the cost dashboard. After day 30 the running cost is the 30-minute weekly review plus same-day complaint triage, which usually fits inside the existing on-call rotation. For a two-engineer team this is under 10% of total capacity.
Do I need a specialized AI observability platform, or is my existing logging enough?
If you already have OpenTelemetry traces and a backend like Honeycomb or Datadog, you can ship LLM observability inside it using OpenTelemetry’s GenAI semantic conventions. If you do not have a tracing setup at all, a dedicated platform (Braintrust, Langfuse, Helicone, Datadog LLM Observability) is faster to stand up and gives you the eval and dataset features for free. The wrong answer is to defer the decision and ship blind.
When should we move beyond a 50-row hand-curated eval set?
Stay hand-curated until the set is 200 to 300 rows or until reviewing it manually takes more than 30 minutes per release. Then layer in an LLM-as-judge for the easy half of the set, keep human review on the hard half. Do not skip the human-review phase — the rubric you write during it is what makes the judge prompt reliable later.
What to do next
The 90-day playbook is the floor for any AI product team in 2026, not a stretch goal. If you have shipped your MVP and want to skip the firefighting month, our team at Bitsens runs this playbook with founders on a 90-day engagement: instrumentation, eval set, cost dashboard, guardrails, and the weekly cadence locked in. See our AI automation services for how we structure it, or send us the launch date and we will start from there.