
TL;DR

I/O 2026 wasn't a model launch. It was Google quietly bolting seven products together into one agent runtime while everyone else was busy arguing about benchmark scores. Below: each piece, a short example of what it actually does, and an honest note on what's shipping versus still in preview.


Why I'm writing this

I've spent the last year shipping agentic systems and watching the agent race from the cheap seats. OpenAI, Anthropic, the Cursor and Devin crowd: they got most of the attention. Google was the one half the room had written off as "great models, no real product." I/O 2026 is their answer.

This post is my opinion on what they actually shipped. Where their strengths finally turned into product. Where the catch-up is still theater. Six months from now we can come back and check which parts aged well.

The stack at a glance

                ┌─────────────────────────────────────────────┐
  Surfaces      │   Spark      WebMCP       Search mini-apps  │
  (where you    │  (Workspace) (browser)        (AI Mode)     │
   meet it)     └─────────────────────────────────────────────┘
                ┌─────────────────────────────────────────────┐
                │           Antigravity SDK                   │  bring your own infra
                ├─────────────────────────────────────────────┤
  Runtime       │           Managed Agents API                │  isolated sandbox
                ├─────────────────────────────────────────────┤
                │       Antigravity 2.0  +  Antigravity CLI   │  orchestration
                ├─────────────────────────────────────────────┤
                │              Gemini 3.5 Flash               │  the brain
                └─────────────────────────────────────────────┘

Four foundation layers. Three surfaces where the runtime touches a human. Bottom-up, one piece at a time.

Quick heads-up before we start. Five of these seven pieces are real today. Two are mostly slideware: I'll name both when we get there. Treat the stack as 5 + 2 in your head, not 7, and the post will hit better.


1. Gemini 3.5 Flash, the brain

The model. Reads your request, picks the next step, writes it.

Faster than a speeding bullet, with a new twist: you can tell it "this one's tricky, think harder" by raising the thinking budget. Sandwich-level questions stay cheap. Thanksgiving-level questions get the deeper pass.

Ask it to summarize my unread Slack into 5 bullets and you get the bullets in under a second. Ask it to find the bug in this 800-line function and it bumps thinking up, takes 20 seconds, and gives you a real answer instead of a confident guess.
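Here is a rough sketch of how a caller might decide when to pay for the deeper pass. Everything in it is my own illustration: the budget tiers, the token numbers, and the pick_thinking_level heuristic are assumptions, not Google's parameter names or values.

```python
from dataclasses import dataclass

# Hypothetical tiers and token budgets; the real API may name and size these differently.
THINKING_BUDGETS = {"low": 0, "medium": 2048, "high": 16384}


@dataclass
class Request:
    prompt: str
    attached_code_lines: int = 0


def pick_thinking_level(req: Request) -> str:
    """Crude heuristic: long code or debugging verbs get the deeper pass."""
    hard_words = ("bug", "debug", "prove", "refactor")
    text = req.prompt.lower()
    if req.attached_code_lines >= 500 or any(w in text for w in hard_words):
        return "high"
    if len(text.split()) > 40:
        return "medium"
    return "low"


summary = Request("Summarize my unread Slack into 5 bullets")
bugfix = Request("Find the bug in this function", attached_code_lines=800)

for req in (summary, bugfix):
    level = pick_thinking_level(req)
    print(level, THINKING_BUDGETS[level])
```

The point is the shape, not the heuristic: the cheap path stays the default, and the expensive path is something you opt into per request.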

Why this matters: every layer above assumes the brain is fast enough to sit inside a loop without being the bottleneck. 3.5 Flash is the first Google model designed for that loop, not for one-shot chat replies.

Reality check: shipping in AI Studio. Configurable thinking levels in the API.


2. Antigravity 2.0 + CLI, the orchestrator

The thing that turns a goal into a plan. Every multi-step agent I've watched fail in the wild fails the same way: it does step 1, then step 2, then forgets step 3 was supposed to feed step 4. Antigravity is built around fixing exactly that.

Picture our hero splitting into half a dozen copies, each handling one piece of the job, each reporting back with a Polaroid before they vanish. That's the gist. Antigravity takes a goal, breaks it into steps, spawns sub-agents in parallel, runs each step in a sandbox, and captures artifacts (screenshots, recordings, diffs) so you can verify the work without re-running it.

Ask it to ship a fix for the login bug. It opens the repo, reproduces the bug, writes a test, writes the fix, runs the test, captures a before/after screenshot, opens a PR. You see the screenshot, hit approve, the PR merges. You never opened your terminal.
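To make the plan, fan out, collect artifacts loop concrete, here is a minimal sketch in plain asyncio. It is not Antigravity's API. The run_step and run_plan functions, the StepResult type, and the stage layout are all hypothetical, and the "sub-agent" is a stub that does no real work.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StepResult:
    name: str
    output: str
    artifacts: list[str] = field(default_factory=list)


async def run_step(name: str, inputs: dict[str, str]) -> StepResult:
    """Stand-in for a sub-agent; a real one would call a model inside a sandbox."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    seen = ",".join(sorted(inputs)) or "nothing"
    return StepResult(name, f"{name} done (saw {seen})", [f"{name}.png"])


async def run_plan(plan: list[list[str]]) -> dict[str, StepResult]:
    """Run stages in order; steps inside one stage run in parallel.

    Every step receives the outputs of all earlier stages, so an earlier
    step's result is still on hand when a later step needs it.
    """
    results: dict[str, StepResult] = {}
    for stage in plan:
        inputs = {name: r.output for name, r in results.items()}
        done = await asyncio.gather(*(run_step(s, inputs) for s in stage))
        results.update({r.name: r for r in done})
    return results


plan = [["reproduce_bug"], ["write_test", "write_fix"], ["run_test"], ["open_pr"]]
results = asyncio.run(run_plan(plan))
for r in results.values():
    print(r.output, r.artifacts)
```

The design choice worth noticing is that results are stored by step name and handed forward explicitly. That is the cheapest defense against the "forgot step 3 fed step 4" failure.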

The CLI is the same engine wrapped for tmux folks. Sub-agent sandboxing, credential masking, and hardened Git policies all ship with it.

Reality check: 2.0 and CLI are generally available. The end-to-end demo works on curated benchmarks. Real repos surface seams; expect rough edges in the harness for the first month or two while Google patches the obvious stuff. Not vapor, but not magic either.


3. Managed Agents API, the isolated sandbox

Where the agent actually runs. When I built my own agent pipelines, sandboxing was the part I wanted off my plate the most. Spinning up a fresh Linux container per agent and tearing it down cleanly is its own product. The Managed Agents API takes it off your hands.

Every hero needs a Phantom Zone where things can blow up without consequences. The Managed Agents API gives each agent a hosted Linux sandbox: it can reason, call tools, execute code, and the rest of your world hears nothing.

Ask it to take this customer-uploaded CSV and produce a clean Parquet file with a summary chart. The agent spins up a sandbox, installs pandas, runs the transformation, exits. The sandbox is destroyed. The CSV never touched your machine. The malicious Python the customer may have buried in a cell never touched anything that mattered.
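The lifecycle is easier to see in code. This sketch uses a temp directory and a child process as a local stand-in. It is not the Managed Agents API, and it is not a security boundary. The run_in_scratch_dir helper is hypothetical, and I use the standard csv module instead of pandas so it runs without installing anything.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_dir(script: str, files: dict[str, str], timeout: float = 10.0) -> str:
    """Run `script` in a throwaway directory and return its stdout.

    NOT real isolation: this only illustrates the create / run / destroy
    lifecycle that a hosted sandbox would provide.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        for name, content in files.items():
            (workdir / name).write_text(content)
        (workdir / "job.py").write_text(script)
        proc = subprocess.run(
            [sys.executable, "job.py"],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        # Leaving the `with` block deletes the directory and everything in it.
        return proc.stdout


job = """
import csv
with open("upload.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(sum(float(r["amount"]) for r in rows))
"""

print(run_in_scratch_dir(job, {"upload.csv": "id,amount\n1,10.5\n2,4.5\n"}))
```

A real sandbox adds what this cannot: a separate kernel or container, no network by default, and no access to your filesystem or credentials.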

Sandboxing isn't a nice-to-have. It's the precondition for letting a probabilistic decision-maker execute code on your behalf.

Reality check: in preview. Region-gated. Priced per sandbox-minute. Tear them down quickly.


4. Antigravity SDK, bring your own cape

Same powers. Your costume. And the costume is the governance: the same brain plus a different harness can be a strict fintech agent or a YOLO consumer toy. The harness is where the policy lives.

Antigravity's defaults run on Google's infrastructure. Fine for a demo. Not fine if you're a fintech, a hospital, or any team where "agent reasoning traces sit on someone else's servers" is a sentence that ends a meeting. If you've ever sat in a security review explaining why your agent has access to production credentials, the policy hooks in the SDK are the reason you'll actually want to have that conversation again.

The SDK is the escape hatch. Programmatic control over the harness: how it plans, which tools it can call, what counts as "done", what gets logged, where the artifacts land. Deploy it on your VPC, your Kubernetes cluster, wherever your auditor can audit it.

Ask it to process this batch of insurance claims, summarize each one, flag anything that looks fraudulent, and never let the model see the full SSN. You write that as a policy. The harness enforces it. The model only ever sees masked data.
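A sketch of what "the harness enforces it" could look like, assuming nothing about the real SDK. The Policy class, its method names, and the tool names are hypothetical. The masking keeps the last four digits, which is one reasonable reading of "never the full SSN". A stricter policy would drop them too.

```python
import re
from dataclasses import dataclass

# Matches dashed SSNs such as 123-45-6789 and captures the last four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace each dashed SSN with a masked form that keeps the last four digits."""
    return SSN_PATTERN.sub(r"***-**-\1", text)


@dataclass(frozen=True)
class Policy:
    """Hypothetical harness policy: a tool allow-list plus input masking."""

    allowed_tools: frozenset[str]

    def check_tool(self, tool: str) -> None:
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool {tool!r} is not allowed by policy")

    def prepare_input(self, raw: str) -> str:
        # Runs before any text reaches the model, so the model never sees raw SSNs.
        return mask_ssn(raw)


policy = Policy(allowed_tools=frozenset({"summarize", "flag_fraud"}))
claim = "Claimant SSN 123-45-6789 reports water damage on 2026-03-02."

policy.check_tool("summarize")
print(policy.prepare_input(claim))
```

The masking happens in the harness, not in the prompt. Asking the model nicely to ignore the SSN is not a policy.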

Reality check: SDK shipping. BYO-infra deployment in preview. Self-hosting documented but rough.


5. Spark, the mild-mannered version

Now we leave infrastructure and meet the runtime where humans actually work.

Spark lives inside Gmail, Docs, Calendar, and Sheets. Not as a sidebar, not as a button. As background staff. The Clark Kent shift: glasses on, suit pressed, sitting at the next desk all day, drafting easy replies, preparing meeting notes, filing the receipts you forwarded last Tuesday.

Ask it to prep me for the 3pm with Acme. It pulls the email thread, the last meeting notes, the contract draft in Drive, a quick summary of the new attendees from your CRM. Builds a one-page brief in Docs and pins it to the calendar event. You walk in already briefed. You didn't open six tabs.

Spark is Google's bet that the next moat for office software isn't features. It's whether your runtime remembers what you wrote last Thursday. I already use the lighter version of this (Smart Compose, autocomplete in Docs) every working day. Spark is the same idea cranked up three notches. The risk is that the embarrassment radius widens with the autonomy radius. Clark Kent looks normal, which is exactly why you stop double-checking. Don't.

Reality check: rolling out, region-gated, tier-gated. Drafts are good, sometimes embarrassing. Verify before sending.


6. WebMCP, the standard handshake

The stack works fine while the agent stays inside Google. The problem is the rest of the web.

Most sites don't have APIs. Today's agents fall back to the worst possible workaround: pretending to be a human, clicking around, hoping the layout didn't change overnight. Our hero in disguise, fumbling at the front desk, hoping nobody notices the cape. I have burned weekends writing scrapers that broke the next Monday because some product team shipped a redesign. WebMCP, if it lands, makes that weekend not happen.

WebMCP is the proposed fix. USB-C for websites. Any site publishes a small manifest saying "here are the actions you can take here, here's how to call them, here's the auth you need." The agent reads it and calls the actions directly. No scraping. No fragile selectors.

Ask it to book me a flight to Berlin next Friday, aisle seat, under €400. The airline site replies with a structured list: search_flights, select_seat, pay, get_confirmation. The agent calls them in order, hands you the boarding pass.
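Here is the flight example as a toy, to show why declared actions beat scraping. The manifest shape and field names are invented for illustration. The spec is still moving and may expose actions differently, for example through a browser API rather than a static file. The handlers are fake so the sketch runs offline.

```python
from typing import Any, Callable

# Hypothetical manifest; not the WebMCP wire format.
MANIFEST = {
    "actions": [
        {"name": "search_flights", "params": ["destination", "date", "max_price_eur"]},
        {"name": "select_seat", "params": ["flight_id", "seat_type"]},
        {"name": "pay", "params": ["flight_id"]},
        {"name": "get_confirmation", "params": ["payment_id"]},
    ]
}


# Fake airline backend with made-up data.
def search_flights(destination: str, date: str, max_price_eur: int) -> list[dict[str, Any]]:
    flights = [
        {"flight_id": "XY123", "price_eur": 389},
        {"flight_id": "XY456", "price_eur": 450},
    ]
    return [f for f in flights if f["price_eur"] <= max_price_eur]


def select_seat(flight_id: str, seat_type: str) -> dict[str, str]:
    return {"flight_id": flight_id, "seat": f"12C ({seat_type})"}


def pay(flight_id: str) -> dict[str, str]:
    return {"payment_id": f"pay-{flight_id}"}


def get_confirmation(payment_id: str) -> dict[str, str]:
    return {"status": "confirmed", "payment_id": payment_id}


HANDLERS: dict[str, Callable[..., Any]] = {
    "search_flights": search_flights,
    "select_seat": select_seat,
    "pay": pay,
    "get_confirmation": get_confirmation,
}


def call(action: str, **kwargs: Any) -> Any:
    """Call a declared action, rejecting anything the manifest does not list."""
    spec = next((a for a in MANIFEST["actions"] if a["name"] == action), None)
    if spec is None:
        raise ValueError(f"site does not declare action {action!r}")
    missing = [p for p in spec["params"] if p not in kwargs]
    if missing:
        raise ValueError(f"{action} is missing params: {missing}")
    return HANDLERS[action](**kwargs)


flight = call("search_flights", destination="BER", date="next Friday", max_price_eur=400)[0]
seat = call("select_seat", flight_id=flight["flight_id"], seat_type="aisle")
payment = call("pay", flight_id=flight["flight_id"])
confirmation = call("get_confirmation", payment_id=payment["payment_id"])
print(seat, confirmation)
```

Notice what is missing: no CSS selectors, no waiting for the page to render, nothing that breaks when the site gets a redesign.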

This is the only piece of the stack that needs the rest of the world to play along. Google can ship it in Chrome. Adoption is a different problem.

Reality check: Chrome 149 origin trial. Spec still moving. Slideware flag: this is one of the two I'd put in the "wait and see" bucket. The protocol is the easy part. Adoption is the product, and Google doesn't control adoption.


7. Search as mini-apps, the on-demand UI

The strangest layer and the easiest to underestimate.

Google Search is no longer a list of links. For complex queries in AI Mode, it builds you a custom UI on the fly. Filterable. Interactive. Disposable. Less phone book, more X-ray vision: instead of handing you ten links and wishing you luck, Search looks through the question and builds the answer surface for it on the spot.

Ask it to compare EV lease deals in Berlin under €400/month with a 24-month term. Search doesn't return ten blue links. It builds a sortable table of matching offers with sliders for price and term, and live links to the dealerships. You filter, you decide, you click out.
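Under the hood, the sliders are doing something unglamorous: filtering and sorting a fixed result set. A minimal sketch, with made-up offers and a hypothetical filter_offers function. None of this reflects how Search actually builds the UI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    dealer: str
    model: str
    monthly_eur: int
    term_months: int


# Made-up data standing in for the result of one agent run.
OFFERS = [
    Offer("Dealer A", "Compact EV", 349, 24),
    Offer("Dealer B", "Compact EV", 399, 36),
    Offer("Dealer C", "Crossover EV", 415, 24),
    Offer("Dealer D", "City EV", 289, 24),
]


def filter_offers(offers: list[Offer], max_monthly_eur: int, term_months: int) -> list[Offer]:
    """What the sliders would do: filter the result set, then sort by price."""
    matches = [
        o for o in offers
        if o.monthly_eur <= max_monthly_eur and o.term_months == term_months
    ]
    return sorted(matches, key=lambda o: o.monthly_eur)


for offer in filter_offers(OFFERS, max_monthly_eur=400, term_months=24):
    print(offer.dealer, offer.model, offer.monthly_eur)
```

The agent run produces the data once. Everything you do with the sliders afterwards is plain filtering, which is why the UI can be disposable.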

A mini-app is just a frozen agent run with a UI on top. Search is where most people will first meet that pattern. Not in a developer console. In a search box.

Reality check: live in select markets, mostly English-language queries. Quality varies. Sometimes a real comparator. Sometimes a fancier snippet. Slideware flag: this is the second of the two. The demos are dazzling. The queries I've actually run lean closer to the snippet end than the comparator end.


Zoom out

Seven products. One runtime. The brain decides, the orchestrator splits the job, the sandbox keeps the work contained, the SDK lets you fly off the grid. Spark sits at your desk in plain clothes. WebMCP gives the rest of the web a proper handshake. Search hands you a one-off dashboard instead of a phone book.

The product boundary moved. It used to be "open Gmail, then Docs, then Calendar". Now it's "ask for an outcome, the runtime picks the tools". Call that an OS pivot. Google is too polite to.

What I'll be watching next: WebMCP. It's the only piece that needs the rest of the world to agree to anything. Everything else Google can ship on its own. But if WebMCP doesn't catch on outside Google, agents stay locked inside Google products and the whole runtime story becomes a Google story instead of a web story.

If you build agents for a living, the move is the same as ever: pick one of the seven, ship something small with it, find the seams. The cape is shiny right now. My bet for the Phantom Zone is one of the two I flagged as slideware. I won't say which yet.

Which one are you betting on, and which one do you think Google will quietly stop talking about by next I/O? Make your list while you're reading. Six months from now I'll publish mine and you can see how the two compare.

