Meetup Talk

Agentic Software Engineering

What's actually changed in the last two years, and what to do about it

About Me

Joe Reuter

Tech Lead · Elastic
Elastic is the company behind the ELK Stack — Elasticsearch, Logstash, Kibana.

Used by thousands of companies to search, observe, and secure their data at scale. Think full-text search, APM, SIEM, log analytics.

Technology journey

TypeScript · React · Node.js
Frontend in Kibana — my main thing
Java
Elasticsearch core & internals
Go
Data shippers & agents
Foundation — Step 1 of 3

It Started as Next-Token Prediction

The model picks the most likely next token and repeats. Pattern matching over enormous text corpora. One token ≈ ¾ of a word.

Context so far: "The Eiffel Tower is located in "
Model samples the next single token:
"Paris" ← 94%
"France" ← 3%
"the" ← 1%
Token appended → now predicts the next: "…in Paris_"
"," ← 61%   "." ← 22%   "and" ← 8%
Repeats until end-of-sequence token or max length.
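
A minimal sketch of this loop in Python, assuming a hypothetical next_token_distribution helper that stands in for the model's forward pass; real decoders sample with temperature and top-p rather than always taking the most likely token.

# Conceptual sketch of autoregressive decoding (greedy variant).
# next_token_distribution() is a hypothetical stand-in for the model's forward pass.
def next_token_distribution(context: str) -> dict[str, float]:
    ...  # returns {token: probability} for the next token

def generate(prompt: str, max_tokens: int = 256, eos: str = "<eos>") -> str:
    context = prompt
    for _ in range(max_tokens):
        probs = next_token_distribution(context)
        token = max(probs, key=probs.get)  # pick the most likely next token
        if token == eos:
            break                          # stop at the end-of-sequence token
        context += token                   # append, then predict again
    return context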
Foundation — Step 2 of 3

Add Thinking: Reason Before Answering

Chain-of-thought prompting (and o1-style reasoning) gives the model space to work through a problem before committing to an answer. This changed things substantially.

Task: "This test fails intermittently. Why?"
<thinking>
The test touches a shared cache...
Two threads could write simultaneously...
Race condition on cache.set() — not thread-safe...
Fix: add a lock, or use an atomic operation.
</thinking>
"The test has a race condition on the shared cache. Add a threading.Lock around cache.set()."
Foundation — Step 3 of 3

Add Tool Calls: Act in the World

The model can pause mid-response, call a tool, and continue from where it left off. Text in, real actions out.

Conceptual flow
Task: "Run tests, fix failures."
→ tool: run_shell("pytest tests/")
FAILED test_user.py::test_login — 401
→ tool: read_file("src/auth.py")
…off-by-one on token expiry seconds…
→ tool: edit_file("src/auth.py", fix)
→ tool: run_shell("pytest tests/")
What the model actually emits (Claude / Anthropic format)
// model streams these tokens verbatim
<function_calls>
<invoke name="run_shell">
<parameter name="cmd">pytest tests/</parameter>
</invoke>
</function_calls>
// harness parses, executes, injects result:
<function_results>
FAILED test_user.py — AssertionError: 401
</function_results>
// model continues generating…
Every model has its own token dialect. The harness parses it and runs the real command. The model never executes — it only describes.
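
A rough sketch of the harness side, assuming the XML-ish dialect above: pull the command out of the emitted block with a regex, run it locally, and wrap the output as a <function_results> block to feed back in. Only run_shell is handled here; a real harness also validates arguments, sandboxes execution, and enforces timeouts.

import re
import subprocess

# Matches the run_shell invocation in the dialect shown above (sketch only).
INVOKE_RE = re.compile(
    r'<invoke name="run_shell">\s*<parameter name="cmd">(.*?)</parameter>',
    re.DOTALL,
)

def handle_model_output(model_output: str) -> str | None:
    match = INVOKE_RE.search(model_output)
    if match is None:
        return None  # no tool call: the model gave a plain answer
    cmd = match.group(1).strip()
    # The harness, not the model, executes the command on the local machine.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return "<function_results>\n" + result.stdout + result.stderr + "\n</function_results>"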
Foundation — The Infrastructure

Harness → Provider: The Loop

The harness (pi, Claude Code…) orchestrates a tight request/response loop with the model provider. The model never touches your machine — it only describes what to do.

Harness  ·  runs on your machine
Claude Code · pi · Cursor · …
Reads codebase · manages context · executes tools locally · provides UI
Provider  ·  remote GPU cluster
Anthropic · OpenAI · Google · io.net (via OpenRouter) · …
Runs model inference · prices vary 3–5× for the same model across providers
1. Harness sends prompt — context + codebase snippets + tool definitions
2. Model generates tokens — may include a tool call <function_calls>…</function_calls>
3. Harness parses & executes — runs bash, reads a file, edits code — on the local machine
4. Tool result sent back — appended to the context window at the provider
5. Model continues — uses the result to reason, calls more tools, or replies
…repeats until the task is done or a human takes over
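
The whole loop, stitched together as a sketch. provider_complete, parse_tool_call, and run_tool are hypothetical placeholders for the provider's chat API, the dialect parser, and local tool execution; every real harness is some variation of this shape.

# Hypothetical skeleton of the harness <-> provider loop (not a real API).
def agent_loop(task: str, tools: list[dict], max_turns: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = provider_complete(messages=messages, tools=tools)  # steps 1–2: remote inference
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)              # step 3: did the model request a tool?
        if call is None:
            return reply                           # plain answer: the task is done
        result = run_tool(call)                    # step 3: execute locally
        messages.append({"role": "tool", "content": result})  # step 4: feed the result back
    return "max turns reached; handing back to the human"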
Part 1 — The New Reality

The Models Finally Got Good

We went from autocomplete to reasoning agents capable of multi-step work. A useful measure: feedback loop length — how long an agent runs before needing your input.

Part 1 — The New Reality

Feedback Loop Length Is Everything

METR tracks how long a model can work unsupervised on a research task — doubling roughly every 7 months. (Kwa et al. 2025, metr.org)

Chart: METR 50% task-completion time horizon, 2019–2026 (log scale, 3 sec → 1 week). Models from GPT-2 → GPT-3.5 → GPT-4 → GPT-4o → Sonnet 3.5 → o3 → Opus 4.5 → Opus 4.6. Bands: Autocomplete (write a line at a time), Supervised agent (tell it what to do, watch it work), Autonomous agent (draft a plan, hand off, come back to a PR). Source: METR (metr.org)
Part 2 — The Lifecycle

Planning Before Code

Agree on what to build before any code is written. OpenSpec is a lightweight framework that structures this as a folder of Markdown artifacts. GitHub (41k ⭐) →

The workflow

/opsx:propose dark-mode
→ openspec/changes/dark-mode/
proposal.md ← why & what
specs/ ← requirements
design.md ← approach
tasks.md ← checklist
You review and refine each artifact
with the agent until it's locked.
/opsx:apply → agent implements all tasks
/opsx:archive → spec merged into codebase

specs/dark-mode-spec.md

### Requirement: Theme toggle
- SHALL render a toggle in the top-right header
- SHALL apply the new theme within 50ms
- SHALL persist choice to localStorage
#### Scenario: Enable dark mode
- GIVEN the user is on any page
- WHEN they click the dark mode toggle
- THEN the theme switches within 50ms
- AND the preference is saved
#### Scenario: Page reload
- GIVEN dark mode was previously set
- WHEN the page reloads
- THEN dark mode is applied before first render
Part 2 — The Lifecycle

The Bottleneck Moved

Agents cut coding time roughly in half, which surfaced the next constraint. Reviewing and validating is now the slow part.

Stage · Before agents · With agents now · Agent effectiveness
Ideation & Planning · ~normal · ~normal · partial
Coding · ⚠ bottleneck · ~2× faster · strong ✓
Validation & Review · ~normal · ⚠ bottleneck · emerging
Deploy & Monitor · ~normal · ~normal · early
Triage & Fix · ~normal · ~normal · partial
Part 2 — The Lifecycle

The Validation Bottleneck

Writing code got faster. Reviewing it didn't. The constraint shifted but didn't disappear.

Part 2 — The Lifecycle

Macroscope — Example Review

A real Macroscope comment on a PR — logic analysis, not just style.

Macroscope code review screenshot
Part 2 — The Lifecycle

Agentic Exploratory Testing

To speed up validation, we're using agents for exploratory testing — beyond the traditional testing pyramid.

Part 2 — The Lifecycle

Exploratory Testing — Example Output

A GitHub Issue filed autonomously by the exploratory testing agent — with a full reproduction journey.

Exploratory testing GitHub issue screenshot
Part 2 — The Lifecycle

Dev Meets Ops

A production error becomes a GitHub Issue becomes a PR — mostly without a human kicking it off.

1. Discover: ES|QL CATEGORIZE clusters millions of logs → ranked error patterns (query sketch after this list)
2. Investigate: Triage Agent runs a protocol-driven investigation — queries telemetry, checks git history, correlates stack traces
3. File: Agent creates a GitHub Issue with root cause, affected versions, and a suggested fix
4. Fix: Coding Agent picks up the issue, drafts a PR
5. Validate: Macroscope + Exploratory Tests run on the PR
6. Ship: Human approves → merged → deployed
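
A sketch of the clustering query behind step 1, sent to Elasticsearch's ES|QL endpoint from Python. The index pattern, field names, endpoint URL, and credentials are illustrative assumptions; CATEGORIZE groups similar log messages into patterns so the noisiest errors rank first.

import requests

# Assumed index pattern and ECS-style field names; adjust to your deployment.
ESQL_QUERY = """
FROM logs-*
| WHERE log.level == "error"
| STATS count = COUNT(*) BY pattern = CATEGORIZE(message)
| SORT count DESC
| LIMIT 20
"""

resp = requests.post(
    "https://elasticsearch.example.com/_query",      # ES|QL REST endpoint
    json={"query": ESQL_QUERY},
    headers={"Authorization": "ApiKey <redacted>"},
)
for row in resp.json()["values"]:                    # each row pairs a count with a log pattern
    print(row)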
Part 2 — The Lifecycle

Triage Agent — Example Output

The actual GitHub Issue filed autonomously — root cause, affected versions, and a fix suggestion, all included.

Error triage report screenshot
Part 2 — The Lifecycle

The GitHub Switchboard

GitHub is the coordination layer. Agents plug into every stage — but humans set direction and hold the quality gate.

New Idea → Planning Agent scopes the work
User Feedback → Planning Agent scopes the work
Production Error → Triage Agent clusters logs, creates Issue
All three converge on a GitHub Issue — auto-created or manual
Planning / Spec — agent drafts spec → human approves · Human: very involved (architecture & design)
Coding Agent · Human: less involved (review & steer)
Pull Request — Macroscope (static review) + Exploratory Tests (dynamic) · Human: quality gate (approve or reject)
Part 3 — The Stack

You Are the Orchestration Layer

Each agent works through its own think→act→observe loop. You move between them: kick off a task, check progress, correct course.

How it works for me today

  • pi runs in each tmux pane — one agent per task/branch
  • Assign Agent A a 30-min task → switch pane → assign Agent B → come back, review, correct → repeat
  • For smaller tasks & adjustments: harness runs in a GitHub Action — ephemeral cloud worker, not tied to my machine. Produces a PR, I review async.

The next level

  • Gastown (by Steve Yegge) — agents orchestrate agents: a "Mayor" agent creates convoys, assigns tasks to worker agents, tracks work in git-backed beads
  • You describe the goal at a high level; Gastown breaks it into a multi-agent effort and coordinates execution
  • Still early — but the direction is clear: the human moves one more level up from execution
Part 3 — The Stack
pi running in split tmux panes
Part 3 — The Stack

Context Is the New Code

For a given model, quality is fixed — what you can control is what goes into the context window.

Part 3 — The Stack

Model Economics

Match the model to the task. Prices via OpenRouter (per million tokens, input / output).

Opus 4.6

$5 in / $25 out per 1M tokens
  • Most capable model right now
  • Architecture & planning
  • High-level reasoning

GLM 5.1

$0.95 in / $3.15 out per 1M tokens
  • Open-source (Z.ai)
  • Coding & triage
  • More than capable for the actual work

Step 3.5 Flash

$0.10 in / $0.30 out per 1M tokens
  • ~50× cheaper than Opus on input, ~80× on output
  • Great for high-volume loops
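
A quick back-of-the-envelope using the prices above makes the spread concrete; the token counts are hypothetical placeholders for one medium-sized agent task.

# Back-of-the-envelope cost comparison using the prices on this slide.
PRICES = {  # $ per 1M tokens: (input, output)
    "Opus 4.6":       (5.00, 25.00),
    "GLM 5.1":        (0.95, 3.15),
    "Step 3.5 Flash": (0.10, 0.30),
}

input_tokens = 400_000    # hypothetical: context re-sent over many turns
output_tokens = 60_000    # hypothetical: generated code, reasoning, replies

for model, (price_in, price_out) in PRICES.items():
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"{model:16s} ${cost:6.2f}")

# Opus 4.6         $  3.50
# GLM 5.1          $  0.57
# Step 3.5 Flash   $  0.06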
Part 4 — Downsides

The Death of "The Zone"

Working this way has real downsides worth being honest about.

Part 4 — Downsides

Building the Wrong Things

When building anything takes hours instead of weeks, the cost of a bad idea shrinks — and you build more bad ideas.

Part 4 — Downsides

The Reality Check

20–30%: net productivity gain (not 10×)
100%: more complexity in the workflow

The bottleneck moved — from coding to validation. That doesn't mean it doesn't work. It means it's more complicated than just pumping out code faster. METR RCT: experienced devs were 19% slower with AI →

Part 5 — Conclusion

Code Is the Universal Language

Agents work best when problems are expressed in code. That makes coding a more central skill, not a less relevant one.

Part 5 — Conclusion

The Self-Healing System

When observability and development run on the same agent loop, the SDLC closes itself.

Plan (spec agent) → Code (coding agent) → Review (Macroscope) → Deploy (CI/CD) → Observe (APM / logs) → Triage (triage agent auto-creates Issue) → feeds back to Plan · you stay in the loop at three points in the cycle

Questions?

Let's discuss model costs, agent hallucinations, and what's actually working.
