How We Do Evals & Observability for Agentic Systems
LLM agentic systems fail in subtle ways. At Vertexcover Labs, we use a 5-part evaluation approach—powered by a structured logging foundation:
- Custom reporting/observability app to inspect the step-by-step agent flow (screenshots, LLM traces, code samples, step context, JSON/text blocks, costs).
- Component/agent-level tests (like unit tests) to isolate/fix one step without re-running the whole agent.
- End-to-end evals that validate the final product output while also comparing each stage to explain failures.
- Eval reporting dashboard (Airtable or similar) showing run status with linked "run → steps" tables for fast triage.
- Easy promotion of failing production runs into test cases (just use the run_id).
Foundation: a structured logging layer that makes all of the above trivial to build and maintain.
Why Evals & Observability Are Hard for LLM/Agent Systems
Everyone agrees that evals and observability are critical, yet few teams have a practical pattern that works across domains. These are the pain points we kept hitting (and solved):
🎲 LLMs are non-deterministic
Small prompt shifts, model updates, or temperature changes can flip outcomes. You need repeatable tests and live monitors per call.
🔗 Deeply chained workflows
Planning → tool calls → retries → post-processing. A bad early decision can ripple invisibly through every later step. Looking only at the final result hides the root cause, yet product quality is still judged by that final result.
⏱️ Full-run latency & cost
Running agents end-to-end for every tweak is slow and expensive. We need component isolation and captured inputs to iterate quickly.
🔍 Debugging requires rich context
Off-the-shelf APM doesn't understand domain-specific failures like "selected wrong data source", "missed edge case in extraction logic", or "hallucinated field mappings". We need pipeline-aware traces: visual state captures, complete LLM interactions, generated artifacts, execution paths, and granular cost attribution.
📉 Evals decay as use cases evolve
Your perfect eval suite from Q1 becomes incomplete by Q3. New customer patterns emerge, edge cases multiply, and existing tests miss real-world failures. You need frictionless eval creation from production data—one command to turn today's failure into tomorrow's regression test, without manual test writing.
We see these issues across all our agentic projects. Below is how we solve them in general, using our OSS project Strot as a running example.
Context: Strot (Example System)
Strot reverse-engineers websites to discover internal APIs and extract structured data. The pipeline has three AI components:
Component | What It Does | Example Failure
---|---|---
Request Detection | Identifies the network call that loads the target data | Picks an unrelated network call
Parameter Analysis | Detects pagination and dynamic filters | Misses the cursor param, breaking pagination
Structured Extraction | Generates Python to parse responses into the schema | Wrong JSON key mapping
Each can fail differently across thousands of site patterns.
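To make the hand-offs concrete, here is an illustrative shape for what a single run produces. This is not Strot's actual data model; the field names are assumptions chosen to line up with the eval checks shown later.
run_result = {
    "run_id": "a1b2c3d4",  # key that ties together logs, traces, and eval cases
    "request_detection": {
        "source": "https://api.example.com/reviews",  # network call chosen as the data source
    },
    "parameter_analysis": {
        "pagination_keys": ["cursor", "limit"],  # parameters that drive pagination
        "dynamic_keys": ["sort", "filter"],      # parameters that vary per query
    },
    "structured_extraction": {
        "code": "def parse(response): ...",  # generated parser
        "entity_count": 243,                 # entities recovered from sampled responses
    },
}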
Our Solution (5 Stages + 1 Foundation)
1) Make Debugging Fast & Visual (Custom Next.js Reporting App)
What we built: A pipeline-aware debug UI that shows every step in a run:
- Browser screenshots per step (so you see the exact DOM/context the LLM "saw")
- LLM prompt/response traces with token counts & costs
- Artifacts like generated code and validation results
- Run timeline + per-step cost/time + rollups
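For illustration only (this is not the app's real schema), each step rendered in the UI is backed by a record roughly like:
step_record = {
    "run_id": "a1b2c3d4",
    "step": "request_detection",
    "screenshot": "s3://bucket/a1b2c3d4/step-01.png",  # the page state the LLM "saw"
    "llm_trace": {"prompt_tokens": 1840, "completion_tokens": 212, "cost_usd": 0.0113},
    "artifacts": {"candidate_requests": 14, "selected": "https://api.example.com/reviews"},
    "duration_ms": 5400,
}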
Why custom: General APM lacks LLM context and domain objects like "network request candidates," "pagination keys," etc. Our UI renders code blocks, JSON blocks, and text blocks differently so you can scan fast.
What it unlocks:
- Diagnose in seconds ("wrong request chosen")
- Optimize budgets ("this step is the cost hotspot")
- No waiting for full pipeline completion—watch in flight

2) Component-Level Unit Testing (Agent "Lego Bricks")
Running the full agent to test a single change is slow and pricey. We treat each component as a unit:
- Provide captured inputs to just that component
- Support "replay this step from an existing run" (auto-plucks the right inputs via run_id)
- Validate outputs against expectations (pagination keys, dynamic filters, parse schema, etc.)
This is akin to unit tests for agents: ship fixes in minutes, not hours.
CLI example:
echo '[{
"request": {
"url": "https://api.example.com/products",
"queries": {"page": "2", "limit": "50"},
"post_data": {"sort": "price", "filter": "new"}
},
"expected_pagination_keys": ["page", "limit"],
"expected_dynamic_keys": ["sort", "filter"]
}]' | uv run stroteval
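To run the same cases from a normal test runner, a thin wrapper over the CLI is enough. Here's a minimal sketch, assuming pytest and that stroteval exits non-zero when expectations aren't met:
import json
import subprocess

import pytest

COMPONENT_CASES = [
    {
        "request": {
            "url": "https://api.example.com/products",
            "queries": {"page": "2", "limit": "50"},
            "post_data": {"sort": "price", "filter": "new"},
        },
        "expected_pagination_keys": ["page", "limit"],
        "expected_dynamic_keys": ["sort", "filter"],
    },
]

@pytest.mark.parametrize("case", COMPONENT_CASES)
def test_component_case(case):
    # Pipe a single case into the same CLI used interactively above.
    proc = subprocess.run(
        ["uv", "run", "stroteval"],
        input=json.dumps([case]),
        text=True,
        capture_output=True,
    )
    # Assumes a non-zero exit code signals a failed expectation.
    assert proc.returncode == 0, proc.stdout + proc.stderr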
3) End-to-End Eval (Product-Level Confidence)
The product cares about the final output. Our E2E eval suite runs the full pipeline and checks:
- Correct source endpoint
- Pagination/dynamic keys
- Entity counts & schema parse correctness
Crucially, we also compare intermediate stages so failures are explainable ("endpoint correct, pagination wrong").
CLI example:
echo '[{
"job_id": "existing-job-uuid",
"expected_source": "https://api.example.com/reviews",
"expected_pagination_keys": ["cursor", "limit"],
"expected_entity_count": 243
},
{
"site_url": "https://example.com/category/abc",
"query": "Listed products with name and prices",
"expected_source": "https://api.example.com/products",
"expected_pagination_keys": ["limit", "offset"],
"expected_entity_count": 100
}]' | uv run stroteval
We run this suite on every deployment to catch system-level regressions before they reach users.
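The staged comparison itself is simple. A sketch of the idea, where the expected_* field names follow the eval payload above and everything else is illustrative:
def explain_failure(expected: dict, actual: dict) -> dict:
    # Check each stage independently so a failure reads like
    # "endpoint correct, pagination wrong" rather than just "E2E failed".
    report = {
        "source": actual["source"] == expected["expected_source"],
        "pagination_keys": set(actual["pagination_keys"])
        == set(expected["expected_pagination_keys"]),
        "entity_count": actual["entity_count"] == expected["expected_entity_count"],
    }
    report["passed"] = all(report.values())
    return report

# explain_failure(case, run_output)
# -> {"source": True, "pagination_keys": False, "entity_count": False, "passed": False}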
4) Airtable Dashboards (Explorable for the Whole Team)
We push eval results to two linked tables:
Table | Contents | Purpose |
---|---|---|
Runs / Aggregates | Pass rates by component/site/time; cost/time trends | High-level health monitoring |
Step Details | One record per step with inputs, outputs, matches, screenshots, costs (linked to its run) | Deep debugging and analysis |
Why Airtable? Fast charts, human-friendly filtering, collab-ready, and it mirrors our step-by-step mental model.
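Pushing records is a single REST call per row. A minimal sketch against Airtable's records API, where the base ID, table name, token variable, and field names are placeholders:
import os

import requests

AIRTABLE_API = "https://api.airtable.com/v0"

def push_step_record(base_id: str, table: str, fields: dict) -> None:
    # One record per step; a run_id field in the table links it back to its run.
    resp = requests.post(
        f"{AIRTABLE_API}/{base_id}/{table}",
        headers={"Authorization": f"Bearer {os.environ['AIRTABLE_TOKEN']}"},
        json={"fields": fields},
    )
    resp.raise_for_status()

# push_step_record("appXXXXXXXXXXXXXX", "Step Details",
#                  {"run_id": "a1b2c3d4", "step": "parameter_analysis", "passed": False, "cost_usd": 0.011})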
5) Production Failures → New Tests (The Feedback Loop)
Any run that fails is already fully recorded via structured logs and run_id. Turning it into a test is trivial:
- Grab run_id → auto-hydrate inputs for the failing component
- Save it as a component test (manual today; automating it from the same logs is straightforward), as sketched below
- The suite becomes a bug graveyard — nothing escapes twice
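A sketch of what that promotion could look like. The log-store path and per-step file are hypothetical; the output mirrors the stroteval case format shown earlier.
import json
from pathlib import Path

def promote_run_to_test(run_id: str, suite_path: str = "evals/component_cases.json") -> None:
    # Hypothetical layout: structured logs keyed by run_id, one JSON file of captured inputs per step.
    captured = json.loads(Path(f"logs/{run_id}/parameter_analysis.json").read_text())
    case = {
        "request": captured["request"],   # the exact inputs the failing production run saw
        "expected_pagination_keys": [],   # filled in by whoever fixes the bug
        "expected_dynamic_keys": [],
    }
    suite_file = Path(suite_path)
    suite = json.loads(suite_file.read_text()) if suite_file.exists() else []
    suite.append(case)
    suite_file.write_text(json.dumps(suite, indent=2))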
The Foundation: Structured Logging + Distributed Tracing
Everything above rests on one data layer: every run emits structured, run_id-keyed logs and per-step trace spans that capture inputs, outputs, screenshots, LLM interactions, generated artifacts, and costs. The debug UI, component replays, E2E evals, dashboards, and failure-to-test promotion are all thin views over these same records, which is what makes them cheap to build and maintain.
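A minimal sketch of the pattern, using structlog and OpenTelemetry as stand-ins rather than Strot's actual implementation:
import structlog
from opentelemetry import trace

log = structlog.get_logger()
tracer = trace.get_tracer("agent.pipeline")

def run_step(run_id: str, step: str, inputs: dict) -> dict:
    # Bind the run_id once so every event from this step is queryable by run.
    step_log = log.bind(run_id=run_id, step=step)
    with tracer.start_as_current_span(step):
        step_log.info("step.start", inputs=inputs)
        output = {"status": "ok"}  # the actual component call would go here
        step_log.info("step.end", output=output, cost_usd=0.0021)
        return output
With run_id bound on every event, "replay this step from an existing run" and "promote this failure into a test" both reduce to a query over the same records.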
Results (Across Strot & Similar Projects)
- 95% success across 50+ site architectures
- 10× faster debugging via visual traces & step replays
- 3× faster iteration thanks to component isolation
- Zero regressions once full coverage landed
Key Takeaways
Building effective observability and evaluation for agentic systems requires a holistic approach:
- Structure logs once — proper logging infrastructure is the foundation for observability, evals, and debugging
- Make debugging visual — Custom app to inspect step-by-step agent flow (screenshots, LLM traces, code samples, step context, JSON/text blocks, costs)
- Test in layers — component-level tests for rapid iteration, E2E evals for product confidence
- Build reporting dashboards — an easy way to view E2E eval results with the ability to drill down into each step
- Close the feedback loop — easy conversion of production failures into component and E2E regression tests
- Keep interfaces simple — CLI tools for running evals, web dashboards for exploring results
The framework transforms AI development from "hope it works" to "know it works."
🔗 Code: github.com/vertexcover-io/strot
📄 Docs: Evaluation Guide