How We Do Evals & Observability for Agentic Systems
· 7 min read
TL;DR:
LLM agentic systems fail in subtle ways. At Vertexcover Labs, we use a five-part evaluation approach, all built on a structured logging foundation:
- Custom reporting/observability app to inspect the step-by-step agent flow (screenshots, LLM traces, code samples, step context, JSON/text blocks, costs).
- Component/agent-level tests (like unit tests) to isolate and fix a single step without re-running the whole agent.
- End-to-end evals that validate the final product output while also comparing each stage to explain failures.
- Eval reporting dashboard (Airtable or similar) showing run status with linked "run → steps" tables for fast triage.
- Easy promotion of failing production runs into test cases (just reference the run_id; see the sketch below).
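
As a rough illustration of that last point (the log location, field names, and the `execute_step` entry point are assumptions for this sketch, not Vertexcover's actual API), a failing run_id copied from the dashboard can become a pytest case that replays each logged step:

```python
# Sketch: promote a failing production run into a regression test by
# replaying its logged steps from a JSONL file. Paths, field names, and
# `execute_step` are illustrative placeholders.
import json
from pathlib import Path

import pytest

LOG_DIR = Path("logs/runs")          # assumed location of structured run logs
FAILING_RUN_IDS = ["run_abc123"]     # hypothetical run_id copied from the dashboard


def load_steps(run_id: str) -> list[dict]:
    """Read every logged step for a run from its JSONL log."""
    with open(LOG_DIR / f"{run_id}.jsonl") as f:
        return [json.loads(line) for line in f]


def execute_step(step_name: str, inputs: dict) -> dict:
    """Placeholder for the real per-step entry point of the agent."""
    raise NotImplementedError


@pytest.mark.parametrize("run_id", FAILING_RUN_IDS)
def test_promoted_run(run_id):
    # Replay each step with the inputs captured in production and
    # assert that it now succeeds.
    for step in load_steps(run_id):
        result = execute_step(step["step_name"], step["inputs"])
        assert result["status"] == "ok", f"{step['step_name']} still fails in {run_id}"
```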
Foundation: a structured logging layer that makes all of the above trivial to build and maintain.
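
A minimal sketch of what one such structured log record might look like, assuming a JSONL-file-per-run layout and illustrative field names (not the actual Vertexcover schema):

```python
# Sketch of a per-step structured log record. Every agent step appends one of
# these to its run's JSONL file; the reporting app, component tests, e2e evals,
# and dashboard all read from the same records.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path

LOG_DIR = Path("logs/runs")


@dataclass
class StepLog:
    run_id: str                                      # ties every step back to one agent run
    step_name: str                                   # e.g. "plan", "browse", "extract"
    inputs: dict
    outputs: dict
    llm_trace: list = field(default_factory=list)    # prompts/completions for this step
    artifacts: list = field(default_factory=list)    # screenshot paths, code samples, JSON blobs
    cost_usd: float = 0.0
    timestamp: float = field(default_factory=time.time)


def log_step(entry: StepLog) -> None:
    """Append one step record to the run's JSONL file (one file per run)."""
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    with open(LOG_DIR / f"{entry.run_id}.jsonl", "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")


# Usage: each step logs itself with the same record shape.
run_id = f"run_{uuid.uuid4().hex[:8]}"
log_step(StepLog(run_id=run_id, step_name="plan",
                 inputs={"task": "summarize page"},
                 outputs={"plan": ["open", "read"]},
                 cost_usd=0.002))
```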