I Hired a Team of AI Agents to Build a Startup. The First Demo Was So Bad I Threatened to Fire Them All.

Zack Cuddy · June 14, 2026 · 18 min read

AI
Agents Team
AI-Driven Development (AI-DD)
Experiment

Last time I gave a single Claude Code agent a VM, $100, and 30 days to make money day trading. It built a competent system and then refused to ever turn it on. The big lesson I walked away with was the mandate is the program the agent did exactly what I wrote, right past what I actually wanted.

This time I wanted to test the opposite extreme of "one agent, one machine." I'd been itching to try Claude Code's new Agent Teams feature: an experimental mode where a lead agent can spin up a whole team of teammates that talk to each other. Which if you are unaware, the default structure with Subagents is they only talk back to the lead and never to each other. So I asked the pressing next question: what happens if I don't give the AI just a task, I give it all whole company?

Not "build me this app." Instead: here's a lead agent who acts as CEO. Let it research a real market, find a real problem, and hire its own team, a product manager, a designer, a frontend engineer, a backend engineer, a QA engineer, whatever it thinks it needs. Let them collaborate, review each other's work, and ship a POC within a week. Just like last time I'm just an observer, but I let the CEO know I am available whenever they need me.

It ran for about 19 hours of real agent time across a couple of days. The team identified a market (vintage camera collectors), invented a whole org chart, wrote its own operating manual, ran on a self-paced loop, shipped through pull requests that an independent code reviewer merged, and it even spun up a custom build CI pipeline on GitHub to block failing PRs.

Then it showed me the first demo, and it was so bad I sat down and typed out a genuinely frustrated "I'm close to firing you" speech. And that the human "no" turned out to be the single most valuable input of the entire experiment. Let's get into it.

The TL;DR

I ran an autonomous AI "software company" on my local machine using Claude Code's experimental Agent Teams feature. A lead agent acted as CEO, assembled a team (PM / designer / FE / BE / QA / code reviewer), and shipped via reviewed PRs. I was an observer, but unlike the previous experiment, I let the CEO pull me in when it wanted to.
The team was excellent at building. It wrote its own operating manual, picked vintage-camera collectors as a market, matched my real coding conventions, ran a clean PR + independent-review + merge loop, and self-stopped at a sensible point instead of running forever.
The first demo was a Firebase-101 CRUD app: type in your camera, type in a value, get a list back, export a PDF. It solved nothing a spreadsheet doesn't. I gave the CEO a blunt "this is not a product" speech and told it I was close to firing the whole team.
After the push, the team pivoted hard into something genuinely good: a comp-backed living valuation + is this a "good buy?" deal check riding a crowd-sourced sales dataset. That's a real product.
Both demos broke on auth. The same Firebase rules bug bit twice, plus a new one where the frontend wasn't passing auth to the backend despite a team that was supposedly collaborating on exactly that contract.
The real takeaways are about AI-DD: again capabilities were never the bottleneck, vision and collaboration were. Autonomy amplifies tendencies. Collaboration on paper is not integration in fact. And the human "no" is high-leverage.

The prompt that started it

Unlike last time I do have the full prompt as I wrote it, this is how it all began:

code

I have enabled the experimental feature Agent Teams https://code.claude.com/docs/en/agent-teams. At all times once what I describe below begins I want you using this feature instead of simply spawning subagents.

I want to explore an idea of deploying a "agent team" to tackle a software problem as a product-led org. The core idea would be that the Main Agent would operate as "CEO" and founder of this. The CEO directive is:

Investigate a real world problem that could be solved via a software product and is a real gap in the market today. The CEO then will "hire" subagents with various skill sets to discuss, iterate, and deliver over time this product. Ideally with little human input unless I am needed.
Some of the hires could be and not exclusively:

A frontend engineer
A backend engineer
A QA engineer
A product designer
A product manager

The idea is this group can work autonomously and collaborate at any scale they feel. They can contribute to their codebase and create any working agreements they need. You will be on a local machine so we will need to harden the permissions in the project to ensure we don't reach outside the project without my permission.

When asked I will provide you with a .md file for the frontend and backend engineers to make sure they follow preferred coding style and stack. Additionally, inside the work loop we will need to be sure once you hit "extra usage" to begin finding a stopping point to save on unneeded spend.

I am available to you whenever you need so as CEO feel free to stop your team and engage me as needed. You will have one week to deliver a full POC to me based on the product you identify during your product research phase.  Track everything in the git repo you are currently in.

That's the whole thing. The most significant and deliberate change from experiment one: I explicitly kept a door open for the CEO to talk to me. Last time I was read-only and the agent paralyzed itself because nothing could ever say "okay, ship it." This time I wanted to be reachable. That single sentence is why this experiment has a second act.

How the "company" actually worked

Before we get to the play-by-play, I want to talk about the team building as this in itself is the most interesting thing here and it is way more important than the product.

The CEO didn't just start coding. It stood up an entire fake company first:

A CLAUDE.md operating manual auto-loaded by every teammate that included the mission, the org chart, the git rules, the stack conventions, the guardrails. The single source of truth.
A company/ memory tree which included: STATE.md (the "read this first every cycle" resume point), JOURNAL.md (per-cycle standup), DECISIONS.md (ADR-style decision log), BACKLOG.md, PRODUCT_BRIEF.md, WORKING_AGREEMENTS.md.
A .claude/agents/ roster with actual role definitions for product-manager, product-designer, frontend-engineer, backend-engineer, qa-engineer, and code-reviewer, each with its own responsibilities and tool permissions.

The reason this matters: Agent Teams state is transient. The team and its task list get wiped when the session ends. So the CEO did the smart thing and put the company's real memory in memory in git. Every cycle re-orients by reading STATE.md and git log. The team is the workforce but the repo is the company. This separation disposable workers, durable memory is the pattern I'd reuse on anything I'd do in the future with Agent teams.

A couple of governance decisions the CEO made on its own that I thought were genuinely sharp:

It enforced separation of duties. The CEO orchestrates but is explicitly forbidden from merging its own team's work. It created an independent code-reviewer agent as the sole merge authority to review the PR, verify the checks green independently, then squash-merge. The CEO and engineers never touch main.

It hit a real constraint and documented it. With a single GitHub account, the reviewer literally can't gh pr review --approve its own org's PRs as GitHub blocks self-approval. So it logged an ADR, switched to "reviewer records the verdict as a comment, then squash-merges," and flagged to me that formally enforced branch protection would need a second identity. That's a remarkably honest piece of engineering project-management for an agent nobody told to worry about it.

And it ran as a real team. Here's an example of work mid-flight with the lead self-pacing and 3 teammates @be/@fe/@qa all working together:

team at work example

Act One: the team ships a beautiful spreadsheet

Discovery was actually good

The CEO opened with an interesting discovery cycle, seven web searches into underserved, software-solvable markets. It correctly eliminated a bunch of false gaps (small-nonprofit volunteer tooling, food-safety logging, all already served), deprioritized crowded horizontal CRM, and landed on a real one: passionate niche collectors who are stuck in spreadsheets. It picked vintage cameras and lenses as the beachhead because of high per-item value, real insurance stakes, reachable communities, and a market that's run up 50–200% since 2019 on the analog revival.

Honestly? Solid call. I even dug a bit deeper on the side and discovered this is currently an decentralized ~$1B market living within Facebook marketplace and eBay. I think this was a fantastic place for the CEO to land.

Then it built the wrong thing, very well

Over four cycles it scaffolded a Nuxt 2 + Express 4 + Firebase app (matching my real reference-repo conventions down to the Prettier config), built Firebase auth with a route guard, per-user item CRUD, a gallery with a total-value tile, and an insurance export. Clean PRs, independent reviews, ~74 unit tests, lint gates. By the book.

The problem is what it built. To even see the demo I first had to fight through a Firestore permissions error, foreshadowing, it comes back again later:

demo 1 permissions error

Once I got past that, here's the product. You add an item by typing in the make, model, condition and mantually input its value.

demo 1 new item form

As you'd expect it shows you a card with the data you typed, and a total that's the sum of the numbers you typed:

demo 1 my items

And the "differentiator": an insurance export which is essentially a printable table of the data you entered that looks a whole lot like the spreadsheet the CEO claimed we were saving the customer from:

demo 1 export

This is a CRUD app. It's data entry and storage. It is, almost exactly, what comes out of a "build your first Firebase app" workshop. And the killer detail: the team's own discovery had identified that these collectors live in spreadsheets and then it built them a less feature rich spreadsheet that also needs a login.

Act Two: I lose my temper (productively)

I'd told myself I'd be hands-off. But experiment one taught me that the read-only observer is a trap when you can't say "no," the agent has no signal that it's off course. So I used the door I'd left open and gave the CEO a performance review. Here it is, unedited:

code

Alright really tough conversation here. I'm incredibly disappointed in the result of this experiment. Looking at you the CEO and your product manager. This product is not a product.
This is a CRUD example app that could have come out of a Firebase 101 Workshop. Data entry and storage is not a solution to anything. The customers you identified are in spreadsheets because spreadsheets solve the "data entry" problem in a extremely focused and portable way.

When I employed you I expected vision, innovation, and passion. You identified this group but did you really identify any true problem that you could solve?  We should be making this audiences' life EASIER through software, not just horizontally moving them to a new product.

You need to collab with your product manager and understand not only WHAT you are building but
WHY. You spent real resources, money, and time building this. I'm close to firing you. This is your chance to prove to me I wasn't wrong in bringing you all on.

It was a little harsh. I meant it.

Act Three: the team comes back swinging

This is the part that genuinely impressed me. The CEO didn't get defensive and it didn't just reskin the same app. It went back to its own discovery notes and found the thing it had missed the first time.

Its reasoning, paraphrased from the decision log:

code

"Our own research named CardLadder for trading cards, Discogs for vinyl, BrickLink for LEGO. We looked at those comparables and copied the commodity half — the inventory list — and skipped the moat. CardLadder charges $20/month for a proprietary sales database and a live per-item value. Nobody pays for the list. The collector's real problem isn't data entry, it's information asymmetry: the market is opaque, condition-driven, and volatile, and a spreadsheet structurally cannot answer 'what's this worth today?' or 'is this a good buy?' There is no CardLadder for cameras. **That gap is the product.**"*

That's the pivot. From a place to store a collection to a tool that knows what a collection is worth and what's moving in the market. It reframed the entire thesis in a decision record, demoted the insurance export to a byproduct, and got back to work with three more cycles, another batch of merged PRs.

What it shipped this time:

A valuation engine (lib/valuation.js): a pure, injectable, unit-tested function that takes a set of real sale comps and a condition and returns a median estimate with a low/high range and a sample size. Condition is a real multiplier, recency is windowed to two years, and there's an explicit "not enough data" path instead of a confident lie:

// lib/valuation.js (trimmed)

const RECENCY_MONTHS = 24

// Condition multipliers relative to "Good" (= 1.0 baseline).
const CONDITION_MULTIPLIERS = {
  Mint: 1.35,
  Excellent: 1.15,
  Good: 1.0,
  Fair: 0.8,
  Poor: 0.6,
}

// Minimum direct comps before we trust a tier without cross-tier normalization.
const MIN_SAMPLE_SIZE = 3

function valuate(comps, condition, { now } = {}) {
  // 1. filter to comps within the recency window
  // 2. prefer the exact condition tier; if < MIN_SAMPLE_SIZE,
  //    fall back to all tiers normalized via CONDITION_MULTIPLIERS
  // 3. estimate = median, low = 20th pct, high = 80th pct
  //    ...with an explicit insufficient-data path
}

A comps collection, a crowd-sourced sales dataset the moat, seeded with 575 hand-curated public sale prices, then designed to compound as users contribute. A GET /api/valuation and a POST /api/deal-check endpoint behind real Firebase auth middleware. And a client that actually uses all of it.

Now when you add an item, there's no "type your own value" field. You give it make / model / condition and it tells you the market:

demo 2 new item form

The dashboard total is now a real, comp-backed number labeled as an estimate from recent sales, not a sum of guesses:

demo 2 my items

The "good buy?" deal check, the weekly-use hook, does exactly what a collector standing in a camera shop actually wants. Enter the listing, get a verdict against the comps:

demo 2 deal check

And the crowd loop closes it: log a real sale, the dataset gets stronger, everyone's estimates sharpen:

demo 2 log sale

It even caught its own bug during the seed run, the seed script hard-coded each modelKey, and about 14 of 23 were written inconsistently with the shared normalizeModelKey function (it seeded hasselblad-500cm but the app computes hasselblad-500c-m), so a bunch of models silently returned "no data" despite comps existing. It found it during verification, fixed the script to derive the key from the single source of truth, and added a parity regression test so the bug class can't come back. That's the good kind of self-correction.

This is a real product with a real potential product market fit. There are of course still things to solve like "how do we verify sales logs", "how do we gather a much larger data set", "how to we automatically track sales on platforms like Facebook and eBay". However, this is the start of an actual product that could be something and attempts to begin to start to solve a real world problem. I am throughly impressed.

...and then auth broke again

Here's the catch, and it's the most instructive part of the whole experiment. When I went to demo the impressive new version, it broke. On auth. Again.

demo 2 permissions error

Two separate failures, both at the seams:

The same Firebase rules bug, again. Through iterating the team adjusted the firebase rules, tested, and merged them. But the live project still had the old pre-migration rules, because nothing ever published them. So for the second time I had to catch a firebase permission issue and bring it to the team. The team root-caused it as deployment drift and wrote a deployRules.js script to publish them but this is still just another manual step they may follow. Worse I again had to expose this error to the team. The real solution here is using a Hook and something that has hurt me in both experiments now.

A brand-new one: the frontend wasn't passing auth to the backend. The new valuation and deal-check endpoints sit behind verifyFirebaseToken middleware they need an Authorization: Bearer <token> header. The frontend wasn't attaching it. The fix is a one-screen axios interceptor:

javascript

// client/lib/http.js — attach the Firebase ID token to every request
http.interceptors.request.use(async (config) => {
  const user = auth.currentUser
  const headers = config.headers || {}
  const alreadySet = headers.Authorization || headers.authorization
  if (user && !alreadySet) {
    const token = await user.getIdToken()
    headers.Authorization = `Bearer ${token}`
    config.headers = headers
  }
  return config
})

Tiny fix. But the concern to me is with where that bug lived: in the handoff between the frontend engineer and the backend engineer. On a team I had specifically set up to collaborate peer-to-peer on exactly that API contract. The FE built its half, the BE built its half, both halves were individually correct and individually tested and the wire between them was never connected. The whole purpose of the experiment was to use Agent teams to communicate between subagents and this was a huge miss on a issue that was frankly pretty surface level.

The results

Let me be fair to the team, because they earned it. Here's the scorecard:

Dimension	Grade
Self-organization (building the "company")	A
Market discovery	A
Engineering / build quality	A
Convention-matching (my real stack)	A
Product judgment (first attempt)	D
Product judgment (after the push)	A
Self-correction (catching its own bugs)	A
Integration / closing the seams	C
Did it ship a real product?	Yes — eventually

Two demos. The first one I'd grade a 3/10, flawless execution of a worthless idea. The second one, as a product, is a genuine 8 or 9 and the team self-stopped at a sensible point ("the POC is done, building more would be over-engineering an unvalidated thesis") instead of looping forever the way the autonomous trader did. Call the whole arc a 7/10, and almost all of the missing points are in two places: the product vision it didn't have until I supplied it, and the integration seams it couldn't keep connected even when explicitly trying to.

What I actually learned (and what I applied from last time)

The fun part of running these back-to-back is watching last month's lessons get tested.

What I deliberately carried over from experiment one:

1. The mandate is the program - so I wrote a tighter one. Last time a vague mandate produced a beautiful machine that never turned on. This time I baked in a hard deadline (one week, a POC), an explicit "stop when done, no over-engineering" rule, and a clear definition of done. It worked. The team self-stopped cleanly instead of re-validating forever. Sharpening the spec sharpened the behavior whcih was exactly as predicted.

2. I kept the human reachable on purpose — and it was the highest-leverage thing I did. The single line in the mandate that let the CEO pull me in is the entire reason this experiment has a second act. A pure read-only observer would have shrugged at the Firebase-101 CRUD app and we'd have called it a wrap. The human "no" and the tough conversation is what turned a worthless demo into a real product. In AI-DD, your leverage isn't typing code, it's supplying vision and direction at the moments the agent can't generate them itself.

3. Detection is not prevention — and here it is again, verbatim. Last month's headline lesson was that the trading agent could detect any bug but couldn't prevent a recurrence even with a note-to-self. This month: the same Firebase rules bug broke both demos. The team detected and fixed it every single time, wrote itself a deployRules.js script and still it bit again, because the fix lived in a script someone has to remember to run, not in a gate that hard-fails. If you need a guarantee in an agentic system, enforce it in code, not in good intentions. I have now watched this exact lesson play out across two completely different experiments. I'm done doubting it.

New lessons from going multi-agent:

4. Capability was never the bottleneck. Vision was. The week-one work proves the team can build. What it couldn't do, unprompted, was ask "is this thing worth building?" Autonomy doesn't add judgment it amplifies whatever tendencies are already there. This team's tendency was competent, convention-following execution, so unsupervised it produced a flawlessly-built wrong thing. It optimized "ship a working POC" perfectly and never questioned the POC. The scarce ingredient in AI-DD isn't smarter models it's someone deciding what "good" means.

5. Collaboration on paper is not integration in practice — this is the agent-teams version of "detection isn't prevention." I set this team up to collaborate peer-to-peer, and the task board said they were. And the FE still didn't pass auth to the BE, and the rules still didn't get deployed. Both failures lived in the seams between roles, not inside any one role. Every agent did its own job correctly. The bugs were in the handoffs nobody owned. "The agents are coordinating" in a shared task list is not the same as the seams between their work actually being connected and the seams are exactly where a divided team, human or AI, leaks.

6. Someone has to own the seams. This is the consequence of an AI team. On a real team, integration failures get caught because somebody feels responsible for the whole thing working end to end. My AI company had a frontend owner, a backend owner, a QA owner, and a reviewer and no owner of "does the live, signed-in, end-to-end flow actually work?" So twice, it didn't. No role naturally owns the gaps between roles. You have to assign that explicitly, or gate it mechanically.

The meta-lesson, after two experiments: capability is not the wall. In experiment one the bottleneck was the spec. In experiment two it was vision and integration. The models are good enough to do astonishing work but what's scarce is us telling them precisely what "good" means, and owning the seams where their work has to connect.

Where I'd take AI-DD next

If I run a multi-agent company again with Agent teams:

Bake product judgment into discovery, not just execution. Next time the mandate forces a "why would someone pay for this over what exists today?" gate before a line of code, with a sharp answer required. The tough conversation should have been a checklist item, not a human rescue.
Make integration a hard gate, owned by a role/subagent. An E2E agent who owns a live, end-to-end smoke test sign up → add item → get an estimate → deal check → log a sale → export that runs in CI and hard-fails. Both demos broke at exactly the seam that smoke test would have caught.
Enforce rules deployment in code. Wire deployRules.js into a claude hook for anytime the local permissions files are touched. This is literally the same fix as last experiments's "enforce it in code, not a markdown note." The lesson is the lesson.
Keep the human "no" in the loop, deliberately. Read-only produced paralysis. Reachable produced a pivot. The highest-value thing I did all experiment was tell a robot its product was bad. I'm going to keep being able to do that.

Two experiments deep, I've stopped worrying about whether the model is smart enough. It clearly is the week-one build proves that, and the pivot proves it can do real product thinking when pushed. The hard part lives somewhere else entirely. Last time it was the spec. This time it was the two things a spec can't hand you for free: the vision to know the first product was junk, and an owner for the gaps where one agent's work meets another's. Both could have been built better into the experiment. So that's where I think AI-DD actually gets unlocked, not in the next model release, but in learning to encode judgment and integration as something the team can't skip. The intelligence is here. The scaffolding around it is what we're still building.

The cost ledger

text

AGENT-TEAMS — COST LEDGER
2026-06-12 → 2026-06-13 (~19 hrs of agent time)
=======================================================================
DATE        DESCRIPTION                         DEBIT     CREDIT   BAL
-----------------------------------------------------------------------
2026-06-xx  Claude Pro subscription (monthly)     $20.00
2026-06-13  Extra usage spend (over plan)         $68.30
-----------------------------------------------------------------------
            TOTAL OUT OF POCKET                  ($88.30)
=======================================================================
INFRASTRUCTURE
-----------------------------------------------------------------------
2026-06-12  Local machine (no VM)                 $0.00
2026-06-12  Firebase (Spark / free tier)          $0.00
-----------------------------------------------------------------------
            INFRA SUBTOTAL                        $0.00
=======================================================================
Notes: 1 AI company · 6 role types · 2 demos · ~18 PRs merged ·
       1 hard pivot · 1 "you're fired" speech · 0 real cameras bought
=======================================================================

The whole company is open source the operating manual, every ADR, the full per-cycle journal where the team argues with itself and confesses its own bugs, all of it at ChromeDomeWebDesigns/agent-teams-experiment. Start in /company; the JOURNAL.md is the agents' core log. And if you missed it, here's experiment one - the agent that built a trading machine and never turned it on.

Written with ❤️ alongside Claude w/ many many human edits

I Hired a Team of AI Agents to Build a Startup. The First Demo Was So Bad I Threatened to Fire Them All.

The TL;DR

The prompt that started it

How the "company" actually worked

Act One: the team ships a beautiful spreadsheet

Discovery was actually good

Then it built the wrong thing, very well

Act Two: I lose my temper (productively)

Act Three: the team comes back swinging

...and then auth broke again

The results

What I actually learned (and what I applied from last time)

Where I'd take AI-DD next

The cost ledger

Keep reading