
How We Do Evals & Observability for Agentic Systems

· 7 min read
Harsh Verma
Software Engineer
TL;DR:

LLM agentic systems fail in subtle ways. At Vertexcover Labs, we use a 5-part evaluation approach—powered by a structured logging foundation:

  1. Custom reporting/observability app to inspect the step-by-step agent flow (screenshots, LLM traces, code samples, step context, JSON/text blocks, costs).
  2. Component/agent-level tests (like unit tests) to isolate/fix one step without re-running the whole agent.
  3. End-to-end evals that validate the final product output while also comparing each stage to explain failures.
  4. Eval reporting dashboard (Airtable or similar) showing run status with linked "run → steps" tables for fast triage.
  5. Easy promotion of failing production runs into test cases (just use the run_id).

Foundation: a structured logging layer that makes all of the above trivial to build and maintain.
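As a rough illustration of why a structured logging layer makes the rest easy, here is a minimal sketch in Python. The `StepRecord` schema, the `RunLogger` class, and the `to_test_case` helper are hypothetical names chosen for this example, not the actual Vertexcover Labs implementation. The idea is that every step is stored as data keyed by `run_id`, so a failing run can later be replayed as a component-level test.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepRecord:
    """One agent step stored as structured data (hypothetical schema)."""
    run_id: str
    step_index: int
    name: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]
    cost_usd: float = 0.0
    timestamp: float = field(default_factory=time.time)


class RunLogger:
    """Collects the step records of a single agent run, keyed by run_id."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        self.run_id = run_id or uuid.uuid4().hex
        self.steps: List[StepRecord] = []

    def log_step(self, name: str, inputs: Dict[str, Any],
                 outputs: Dict[str, Any], cost_usd: float = 0.0) -> StepRecord:
        record = StepRecord(self.run_id, len(self.steps), name,
                            inputs, outputs, cost_usd)
        self.steps.append(record)
        return record

    def to_jsonl(self) -> str:
        """Serialize the run as one JSON object per line."""
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


def to_test_case(jsonl: str, step_name: str) -> Dict[str, Any]:
    """Turn one logged step into a test case: replay its inputs, compare outputs."""
    for line in jsonl.splitlines():
        step = json.loads(line)
        if step["name"] == step_name:
            return {
                "run_id": step["run_id"],
                "step": step_name,
                "inputs": step["inputs"],
                "expected": step["outputs"],
            }
    raise KeyError(step_name)
```

With records in this shape, the reporting app, the step-level tests, and the "run → steps" dashboard tables can all read from the same log rather than each needing its own instrumentation.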

Speed up Docker Builds on GitHub Actions

· 6 min read
TL;DR:
  1. Turn on BuildKit & Buildx everywhere
  2. Reorder Dockerfile: copy package files first, then rest of code
  3. Use cache-mounts with buildkit-cache-dance action
  4. Pick the right cache backend (inline for speed, registry for large images)
  5. Add tmpfs + unsafe-io flags for package installs

| Scenario | Avg. wall-clock |
| --- | --- |
| No caching | 1 h 10 m |
| Layer-cache hit | 6 m |
| Layer-cache miss (deps change) | 52 m |
| Cache-mount + Cache-Dance | 8 m |

Stop rebuilding the world on every pull request: turn on these flags and ship faster.

ML Infra design for the GPU Poor

· 5 min read

Taming the Beast: How to Design a Queueing System for GPU-Intensive Workloads

TL;DR:

When designing for scale, the limiting factor is GPU availability, so all rate limiting and queueing must be designed around it.
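One way to picture "queueing designed around GPU availability" is a queue that admits work only while a GPU slot is free. The sketch below is a minimal Python example under that assumption; `GpuBoundQueue`, `fake_gpu_job`, and `run_demo` are hypothetical names, and the sleep stands in for real GPU work. It is not the design from the post, just the core constraint expressed with a semaphore.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class GpuBoundQueue:
    """Admit jobs only while a GPU slot is free; all other jobs wait in line."""

    def __init__(self, gpu_slots: int) -> None:
        # The semaphore count is the number of jobs the GPUs can run at once.
        self._slots = asyncio.Semaphore(gpu_slots)
        self.running = 0
        self.peak_running = 0

    async def submit(self, job: Callable[[], Awaitable[str]]) -> str:
        async with self._slots:  # blocks here when every GPU slot is busy
            self.running += 1
            self.peak_running = max(self.peak_running, self.running)
            try:
                return await job()
            finally:
                self.running -= 1


async def fake_gpu_job(name: str) -> str:
    """Stand-in for GPU work (assumption: real jobs would call a model here)."""
    await asyncio.sleep(0.01)
    return f"{name}: done"


async def run_demo(gpu_slots: int, n_jobs: int) -> Tuple[List[str], int]:
    queue = GpuBoundQueue(gpu_slots)
    results = await asyncio.gather(
        *(queue.submit(lambda i=i: fake_gpu_job(f"job-{i}")) for i in range(n_jobs))
    )
    return list(results), queue.peak_running
```

Calling `asyncio.run(run_demo(gpu_slots=2, n_jobs=6))` completes all six jobs while never running more than two at a time, which is the property any upstream rate limit would need to respect.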

Strot - The API Scraper

· 9 min read
TL;DR:

Strot (Sanskrit for "source") is an AI agent that scrapes web APIs:

  1. Instead of scraping the DOM, it identifies the right API call.
  2. API scraping makes fast, reliable, and complete scraping of listing data possible.
  3. Strot figures out the API call so you don't have to.
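To show why the API route gives complete listing data, here is a minimal Python sketch of what happens after the right API call has been identified: the scraper pages through JSON responses until there is no next page. This is not Strot's implementation. The `items` and `next_cursor` field names and the `fake_fetch_page` stub are assumptions for illustration; a real endpoint would define its own pagination scheme.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def scrape_listing(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every listing item by following the API's pagination cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:  # no further pages: the listing is complete
            break


def fake_fetch_page(cursor: Optional[str]) -> Page:
    """Stub standing in for an HTTP call to the identified API (hypothetical data)."""
    pages: Dict[Optional[str], Page] = {
        None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
        "p2": {"items": [{"id": 3}], "next_cursor": None},
    }
    return pages[cursor]
```

Because the items arrive as structured JSON, there is no HTML parsing step to break when the page layout changes.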

Try out Strot!

AI for End-to-End Tests (Mobile too!) with Auto Healing

· 4 min read

AI Agent for End-to-End Testing to Deliver Flawless Digital Experiences


What if an AI agent could write end-to-end tests for your codebase, for mobile too, and auto-heal when your codebase changes?

We share nuggets we learnt while building an AI Agent to solve one of the most persistent challenges in software development: making UI test automation accessible, reliable, and scalable across platforms and devices.