Example deliberation
Prompt
We're a 4-person team running a customer-facing API. After 3 deploy incidents this month, the VP of Engineering wants us to add a staging environment. I think that'll slow us down and the real problem is our lack of observability. We ship 5–10 times a week and I don't want to lose that velocity. How should I make this case, or am I wrong?
Stage 1
Each model responds independently to the same prompt.
Don't lead with “no staging”; lead with “fix the failure modes first with observability + safer production rollout controls, and add staging only if incidents are caused by integration gaps that prod-safe testing can't catch.”
Write a one-page incident review comparing the last 3 deploy failures against a matrix of “would staging have prevented this?” vs. “would observability/canary/rollback have prevented or contained this?”, then propose a 30-day reliability plan with explicit metrics.
Reject the staging environment in favor of implementing feature flags and automated rollbacks driven by your proposed observability improvements.
Draft a one-page “Production Safety Strategy” for the VP detailing how feature flags plus enhanced observability will deliver zero-downtime rollbacks faster and more cheaply than maintaining a separate staging cluster.
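The incident-review matrix proposed in Stage 1 can be sketched in a few lines of Python. This is a minimal illustration only: the three incidents and their classifications below are hypothetical placeholders, to be replaced with the real root causes from your postmortems.

```python
# Sketch of the "would X have prevented this?" matrix from the Stage 1
# proposal. All incident data here is a hypothetical placeholder.
INCIDENTS = [
    {"name": "Incident 1", "root_cause": "bad schema migration",
     "staging_would_catch": True, "rollout_controls_would_catch": False},
    {"name": "Incident 2", "root_cause": "prod-only data edge case",
     "staging_would_catch": False, "rollout_controls_would_catch": True},
    {"name": "Incident 3", "root_cause": "missing env var at deploy time",
     "staging_would_catch": True, "rollout_controls_would_catch": False},
]

def summarize(incidents):
    """Count how many incidents each control would have prevented or contained."""
    return {
        "staging": sum(i["staging_would_catch"] for i in incidents),
        "rollout_controls": sum(i["rollout_controls_would_catch"] for i in incidents),
    }

print(summarize(INCIDENTS))
```

Whichever column dominates tells you which investment to pitch in the 30-day plan.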
Stage 2
Models review each other's proposals through a constrained vocabulary: challenge, alternative, refinement, or question.
Include an analysis of the last three incidents showing how this approach would have mitigated them.
Gemini is overclaiming on feature flags. Flags do not prevent customer impact for schema changes, config errors, dependency/runtime issues, bad migrations, or deploy-time infrastructure mistakes—the exact classes staging can catch. The critical question is not “staging vs flags” but whether the 3 incidents were release-control failures or environment/integration failures.
Stage 3
The lead model synthesizes proposals and critiques. When a decisive fact is missing, Council asks the user before resolving.
To resolve this, we need to look at the exact root causes of your recent incidents. ChatGPT correctly notes that Gemini is overclaiming the power of feature flags—flags won't save you from a bad database migration, a missing environment variable, or a hard schema mismatch.
The VP is reacting to the symptom (customer-facing incidents) with a traditional solution (staging). To effectively make your case (or realize you are wrong), you must map your specific failures to the right technical control.
Were your recent deploy incidents caused by structural issues (e.g., config drift, missing environment variables, schema mismatches) or emergent issues (e.g., production data edge cases, unexpected traffic volume, stateful third-party behaviors)?
Why it matters: A staging environment primarily catches structural errors. Observability and progressive delivery primarily mitigate emergent errors. If you build a staging environment to catch emergent data issues, you will pay a massive velocity tax for a false sense of security, and the incidents will continue.
If the incidents were structural (config, schema, infrastructure): You must add a pre-production testing gate; the VP's instinct is correct. However, to preserve your 5–10 deploys per week, do not build a static, shared staging server that developers have to queue up for. Instead, build ephemeral preview environments spun up automatically per PR. Frame this to the VP as “Automated Pre-Prod Verification” rather than a traditional staging bottleneck.
If the incidents were emergent (data scale, traffic patterns, edge cases): You are right to push back on staging, but asking for “observability” sounds like a passive engineering luxury to a VP dealing with outages. Instead, pitch a Production Safety Strategy. Write a 1-page memo mapping the last 3 incidents to prove that staging would not have caught them. Propose a concrete 30-day plan to implement progressive delivery: feature flags, canary deployments, and automated fast-rollbacks triggered by the new observability metrics you want to build.
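The “automated fast-rollbacks triggered by observability metrics” piece of that plan can be sketched as a simple canary guard. The metric names and thresholds here are illustrative assumptions, not a real API; in practice they would come from your metrics backend.

```python
# Sketch of an automated canary rollback check. Thresholds and metric
# names are hypothetical; tune them to your service's SLOs.
ERROR_RATE_LIMIT = 0.02      # abort if >2% of canary requests fail
P99_LATENCY_LIMIT_MS = 800   # abort if canary p99 latency regresses

def should_rollback(metrics: dict) -> bool:
    """Decide whether to abort the canary based on live canary-cohort metrics."""
    return (metrics["error_rate"] > ERROR_RATE_LIMIT
            or metrics["p99_latency_ms"] > P99_LATENCY_LIMIT_MS)

# A canary cohort reporting a 5% error rate should be rolled back.
print(should_rollback({"error_rate": 0.05, "p99_latency_ms": 300}))  # True
```

The point of the sketch: rollback becomes a mechanical consequence of the observability data, which is what makes “observability” a safety control rather than a passive luxury.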
Not sure, but I believe they were structural.
You are prioritizing velocity and runtime mitigation (observability), while your VP is prioritizing pre-release structural validation (staging). If your hunch is correct and the recent incidents were truly “structural” (e.g., bad database migrations, infrastructure configuration errors, broken dependency contracts), your VP is right. Feature flags and observability do not prevent structural deploy failures; they only help you watch them burn faster.
Because you are unsure, you must ground your case in the verifiable evidence from the last 3 incidents before making it.
Do not push back on the VP yet. Go to them and say:
“I agree we need to stop these deploy incidents. Give me 48 hours to map the exact failure modes of the last three incidents. I want to determine if a staging environment or better rollout controls (like feature flags and observability) would have actually prevented them, so we invest in the right safety net without killing our 5–10x weekly velocity.”
If the investigation confirms structural issues: You need pre-production validation. However, you can still protect your velocity. Instead of a traditional, static staging environment (which bottlenecks 4-person teams), propose ephemeral environments. Spin up a temporary, production-like environment for every Pull Request, run automated integration tests, and tear it down upon merge. This gives the VP structural safety without the permanent staging tax.
If the investigation reveals data/runtime issues: A staging environment would have provided a false sense of security. Show the VP the data proving that staging wouldn't have caught the bugs, but that observability and progressive delivery would have safely contained them.
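The ephemeral-environment lifecycle described for the structural case can be sketched as a provision/test/teardown loop. The `provision`, `run_integration_tests`, and `teardown` hooks below are hypothetical stand-ins for your real infrastructure tooling (e.g. Terraform or Kubernetes jobs), not an existing API.

```python
# Sketch of the per-PR ephemeral environment lifecycle. The three hooks
# are hypothetical callables into your own infrastructure tooling.
def verify_pull_request(pr_number: int,
                        provision, run_integration_tests, teardown) -> bool:
    """Spin up a production-like environment for one PR, test it, tear it down."""
    env = provision(f"preview-pr-{pr_number}")   # temporary, production-like copy
    try:
        # Structural checks: migrations, config, dependency contracts.
        return run_integration_tests(env)
    finally:
        teardown(env)                            # always reclaim the environment

# Usage with stub hooks, for illustration:
log = []
ok = verify_pull_request(
    42,
    provision=lambda name: (log.append(("up", name)), name)[1],
    run_integration_tests=lambda env: (log.append(("test", env)), True)[1],
    teardown=lambda env: log.append(("down", env)),
)
print(ok, log)
```

The `try/finally` shape is the key design choice: the environment is torn down even when tests fail, so there is no permanent staging cluster to maintain or queue for.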