Ritesh Kadmawala
Founder, Vertexcover Labs - AI-native engineering studio

How we made a 13 GB LLM container cold-start in 40 seconds

21 min read
Ranjan Ojha
Software Engineer
Ritesh Kadmawala
Founder, Vertexcover Labs - AI-native engineering studio
TL;DR:

A 13 GB GPT-OSS-20B inference container: 571s → 40.65s. A 2.4 GB Llama-3.2 container: 91s → 7.4s. Same hardware, same app, only the image format and snapshotter config changed.

The obvious first move — lazy loading — made things 66% slower. The biggest single win came from a stack of hidden config knobs we had to read the snapshotter's Go source to find. The last bottleneck wasn't in containerd at all; it was the AWS EBS default write ceiling.

If you ship LLM images and your cold start is over a minute, this is for you.

Cold start journey across five fixes — Llama 3.2 1B (91s → 7.4s) and GPT-OSS 20B (571s → 40.65s)

If you ship a multi-gigabyte ML container — Hugging Face weights baked in, PyTorch + transformers, a FastAPI server — and you've watched the deploy take five-plus minutes from docker run to first inference, this post is the playbook we wish we had.

Why Node.js and Python Don't Use Linux's Native Async I/O API

11 min read
Ritesh Kadmawala
Founder, Vertexcover Labs - AI-native engineering studio
TL;DR:

Linux provides a native Asynchronous I/O (AIO) API that should theoretically allow true background file operations. Yet popular frameworks like Node.js and Python's asyncio avoid it entirely, relying instead on thread pools with blocking I/O. Why? The answer reveals important lessons about API design, cross-platform compatibility, and the gap between theoretical elegance and practical engineering. The future may lie in io_uring, a modern interface that addresses many of AIO's shortcomings.
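As a rough illustration of the thread-pool approach described above, the sketch below uses Python's asyncio to hand an ordinary blocking file read to a worker thread through `loop.run_in_executor`. The helper names (`blocking_read`, `read_file`, `demo`) and the temporary file are hypothetical, chosen only for this example. It shows the general pattern, not how any particular framework implements it internally.

```python
import asyncio
import os
import tempfile
import threading


def blocking_read(path):
    # A plain blocking read. It runs on a worker thread,
    # so the event loop thread is free to do other work meanwhile.
    with open(path, "rb") as f:
        return f.read(), threading.current_thread().name


async def read_file(path):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_read, path)


def demo():
    # Hypothetical scratch file, created only for this illustration.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
        path = tmp.name
    try:
        data, worker_thread = asyncio.run(read_file(path))
    finally:
        os.remove(path)
    return data, worker_thread, threading.current_thread().name


if __name__ == "__main__":
    data, worker_thread, main_thread = demo()
    print(data, worker_thread, main_thread)
```

The read itself is still synchronous from the kernel's point of view. The only thing that makes it look asynchronous to the caller is that a different thread blocks on it, which is the trade-off the post examines.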