Sandboxes are immensely useful in the AI era, especially for iterating on untrusted AI-generated code. While sandbox boot times are usually not the key latency bottleneck for autonomous agents today, I often think about a world where agents branch rapidly in sandboxes to explore solution spaces in parallel, and compute infrastructure, not LLM inference, becomes the limiting factor for latency.
Inspired by the ComputeSDK sandbox benchmarks (ComputeSDK maintains a benchmark of various cloud sandbox providers' boot times), and wanting to understand systems I use daily, I went down the rabbit hole of optimizing latencies for a Firecracker sandbox orchestrator.
The first version of this system was working in the most literal sense. You could create a sandbox, run a command, and destroy it. On paper, that is success. In practice, it took 2.725 seconds p95 to go from POST /create to the first successful POST /exec, and that is slow enough to break flow in any interactive product.
By the end of this iteration cycle, the same metric landed at 59.030ms p95. This write-up is for software engineers curious about VM internals (including me), not kernel specialists.
I did not preserve every early raw log, so this post reconstructs the journey from benchmark outputs, code changes, and progress notes. The goal is to show validated turning points, and not pretend that every micro-step was documented perfectly.
Throughout this post, TTI means:
TTI = /create (booting up a sandbox to a "ready" state) + first /exec (executing a simple echo command).
Here, ready means the guest responds to ping and guest network configuration has completed such that the sandbox can make outbound requests. Sandbox tear-downs are excluded from the measurements.
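As a rough illustration of how a single sample is taken (this is not the repo's actual benchmark harness; the request payloads and base URL are assumptions):

```go
package bench

import (
	"bytes"
	"net/http"
	"time"
)

// measureTTI takes one sample: time from POST /create (which blocks until the
// sandbox is "ready") to the first successful POST /exec of a trivial command.
// Teardown is intentionally not part of the measurement.
func measureTTI(baseURL string) (time.Duration, error) {
	start := time.Now()

	resp, err := http.Post(baseURL+"/create", "application/json", bytes.NewBufferString(`{}`))
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	resp, err = http.Post(baseURL+"/exec", "application/json",
		bytes.NewBufferString(`{"command": "echo ok"}`))
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	return time.Since(start), nil
}
```

The percentiles reported below come from repeating this measurement across fresh creates.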
All numbers in this post were collected on one development laptop (ThinkPad T14 Gen 4):
- Kernel: 6.18.13-200.fc43.x86_64
- CPU: 12 logical CPUs
- RAM: 30 GiB
- Firecracker: v1.13.1
- Filesystem: btrfs (/home/...)
- /tmp: mounted as tmpfs on this machine, which matters for reflink behavior

Note: one deliberate scope choice I made is that this journey excludes warm VM pools. I wanted the benchmark to answer a narrower question: how much of the boot/bring-up path can be optimized when every request starts from a cold create? That keeps each improvement attributable to control plane and microVM boot work, instead of pool hit-rate behavior. (In production, warm pools still make sense for burst absorption and tail latency, but a faster cold path means you can operate with a smaller pool and recover faster when demand outpaces pre-warmed capacity.)
MicroVMs are often described as "between containers and VMs". For this project, the useful framing was boot path and device model.
Containers are fast partly because they share the host kernel. Traditional VMs isolate more strongly but often pay for broad virtual hardware emulation and long boot paths. Firecracker's microVM model narrows the machine surface intentionally: minimal devices, clear API, strong isolation boundary.
Firecracker gives you a strong foundation, but your control plane architecture determines whether that potential turns into low latency.
At the API layer, the control plane is tiny:
- POST /create
- POST /exec
- POST /destroy

Underneath, /create is a pipeline: clone a rootfs, set up sandbox networking, start Firecracker, and wait for readiness. The baseline version of that pipeline appears in code below.
Once Firecracker itself is reasonably configured, most latency comes from surrounding orchestration - cold-boot frequency, readiness modeling, repeated host setup per request, and accidental serialization in the control path.
Early on, many symptoms looked like latency bugs but were actually boot correctness bugs. That made benchmark results noisy and hard to trust.
Kernel/rootfs were the first place this surfaced. (Kernel = the guest OS core that boots first and handles CPU/memory/devices. Rootfs = the guest disk image containing user-space binaries, init scripts, and services. You need both: the kernel to boot and talk to virtio devices, the rootfs to actually run processes.)
I could not optimize latency on top of an unstable guest substrate, because every regression looked like "network slowness" when it was really boot setup drift.
For reference, the build entry points are:
- guest/build-kernel.sh
- guest/build-rootfs.sh

Firecracker expects a kernel that boots cleanly in a minimal virtio-centric environment. If key options are wrong, the symptoms look chaotic: rootfs mount failures, virtio probe failures, guest services never coming up, and control-plane timeouts that look network-related but are actually boot-related.
In my runs, failures like root mount panic and virtio probe errors were the signal that boot-critical paths were not deterministic yet.
Concretely, guest/build-kernel.sh starts from Firecracker's own microVM CI config (microvm-kernel-ci-x86_64-6.1.config) and then applies explicit overrides before building vmlinux. Starting from Firecracker's config gave a known-good baseline; the overrides acted as guardrails to avoid host-to-host drift.
The main overrides do three things:
- Build boot-critical support directly into the kernel (CONFIG_VIRTIO_BLK, CONFIG_VIRTIO_NET, CONFIG_VIRTIO_MMIO, CONFIG_EXT4_FS, CONFIG_DEVTMPFS[_MOUNT]) so the guest can mount rootfs and configure networking without relying on optional runtime module loading.
- Strip boot-path machinery the microVM does not need (CONFIG_MODULES=n, CONFIG_BLK_DEV_INITRD=n, CONFIG_ACPI=n).
- Disable debug-oriented options (CONFIG_DEBUG_KERNEL=n, CONFIG_KALLSYMS=n, CONFIG_WERROR=n) so build/runtime behavior is more predictable for repeatable benchmark runs.

In short, Firecracker's config was the compatibility baseline, and the overrides were workload-specific guardrails for deterministic boot and readiness.
So the kernel work here was essentially enforcing the boot-critical invariants described above.
Once that stabilized, boot behavior became boring and predictable. Only then did benchmark numbers become trustworthy.
I moved to a deterministic rootfs assembly model based on staged minirootfs extraction and ext4 image construction. The practical reason was consistency across hosts and fewer container-runtime-specific surprises during image creation.
In guest/build-rootfs.sh, this is implemented as: construct the filesystem tree in a staging directory, install and enable the required services there (notably sshd and a guest control agent introduced later), then materialize a sealed ext4 image with mkfs.ext4 -d. Two details mattered. First, a build failure led me to ensure that pseudo-filesystems (/proc, /sys, /dev) are unmounted before mkfs -d, so runtime pseudo state is not accidentally captured during image population. (mkfs.ext4 -d expects a stable directory tree; pseudo-filesystems are live kernel views, not static files, so leaving them mounted can cause nondeterministic image-population failures, for example entries under /proc changing while being read.) Second, the artifact is intentionally slimmed (resize2fs -M, plus cache/docs/manpage cleanup), which reduces IO and helps clone/restore latency in the hot path.
Beyond fixing that concrete build issue, there was a broader performance reason: unstable artifacts introduce noise. If base image setup is non-deterministic, every latency chart is suspect.
Putting kernel and rootfs together, sandbox readiness depends on a chain: the kernel boots and probes its virtio devices, the rootfs mounts, guest services (sshd and, later, the agent) come up, and guest networking gets configured.
The first complete version used a conventional flow: cold boot microVM from kernel + rootfs, wait for SSH to become ready, execute the first command over SSH. This is simple, but also has all the cost centers in the request path.
One implementation detail that matters throughout this post is rootfs materialization. I use filesystem reflink cloning (cp --reflink=auto) so per-sandbox rootfs copies are copy-on-write on supporting filesystems; otherwise it can silently fall back to full copies. (Reflink is a filesystem-level copy-on-write clone: a new "copy" initially shares data blocks and only diverges on write. It makes rootfs duplication much faster on supporting filesystems, for example btrfs.)
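As a minimal sketch of how such a clone helper might be implemented (the three-argument signature, work-directory parameter, and error handling are my additions for illustration):

```go
package sandbox

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// cloneRootfs materializes a per-sandbox rootfs from the base image.
// With --reflink=auto, cp asks the filesystem for a copy-on-write clone
// (near-instant on btrfs/XFS) and silently falls back to a full byte-for-byte
// copy on filesystems that cannot reflink, such as tmpfs.
func cloneRootfs(baseImage, workDir, sandboxID string) (string, error) {
	dst := filepath.Join(workDir, sandboxID+"-rootfs.ext4")
	out, err := exec.Command("cp", "--reflink=auto", baseImage, dst).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("clone rootfs: %w: %s", err, out)
	}
	return dst, nil
}
```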
The rough shape of the baseline looked like this:
```go
func createAndExecNaive(req ExecRequest) ExecResult {
	sb := allocateSandboxID()
	rootfs := cloneRootfs(baseRootfs, sb.ID) // cp --reflink=auto
	net := setupNetwork(sb.ID)               // tap + routes + per-sandbox NAT churn
	vm := firecrackerStart(kernelPath, rootfs, net)
	waitUntilSSHReady(vm.GuestIP)
	out := sshExec(vm.GuestIP, req.Command)
	return out
}
```
From 50 samples:
| min | p50 | p95 | p99 | max |
|---|---|---|---|---|
| 1.685s | 2.343s | 2.725s | 2.742s | 2.757s |
At this stage, optimization targets were obvious in principle but not in ordering. Should I optimize SSH first? Boot first? Host networking first? The next iterations answered that empirically.
The first major reduction came from changing the control channel.
I introduced an in-guest agent and moved readiness and exec to a vsock RPC path. (vsock is a host-guest communication channel exposed by the hypervisor, so control traffic does not need to traverse the full guest network stack the way SSH-over-TCP does.) SSH remained in place as a reliability/debug fallback; only the latency-critical control path moved to vsock.
This shift matters because SSH, while robust, is a lot of protocol and session machinery for "run one command immediately after create". For repeated create/exec workloads, that machinery becomes a measurable tax.
For exactly that reason, vsock was a better fit for host-guest control in this context.
As a result, p95 dropped from 2.725s to 472.275ms (~5.8x).
This was the first structural win: path redesign, not micro-optimization.
In simplified form, the transition looked like:
```go
func execAfterCreate(sb Sandbox, cmd string) ExecResult {
	// Old path:
	//   waitUntilSSHReady(sb.IP)
	//   return sshExec(sb.IP, cmd)

	// New fast path:
	waitUntilAgentPingOK(sb.VSock)
	return agentRPCExec(sb.VSock, cmd)
}
```
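For host-initiated connections, Firecracker exposes the guest's vsock as a Unix domain socket on the host and expects a small text handshake before raw bytes flow. Here is a hedged sketch of what the dial underneath agentRPCExec could look like; the helper name and signature are mine, not the repo's:

```go
package sandbox

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// dialAgentVSock connects to the in-guest agent listening on a vsock port.
// Firecracker bridges host-initiated vsock connections through a Unix socket:
// the host connects to udsPath, writes "CONNECT <port>\n", and waits for an
// "OK <hostport>\n" acknowledgement before speaking the RPC protocol.
// (In this sketch the agent stays silent until the handshake completes.)
func dialAgentVSock(udsPath string, guestPort uint32, timeout time.Duration) (net.Conn, error) {
	conn, err := net.DialTimeout("unix", udsPath, timeout)
	if err != nil {
		return nil, err
	}
	if _, err := fmt.Fprintf(conn, "CONNECT %d\n", guestPort); err != nil {
		conn.Close()
		return nil, err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil || !strings.HasPrefix(line, "OK ") {
		conn.Close()
		return nil, fmt.Errorf("vsock handshake failed: %q %v", line, err)
	}
	return conn, nil
}
```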
After Stage 1, pure cold boot was still expensive. The next change was to stop paying that cost per request.
A golden snapshot here is a pre-initialized baseline VM state captured ahead of request time:
- a microVM state file (state.snap)
- a guest memory file (mem.snap)

The system boots and prepares this baseline once, snapshots it, and then restores from it on /create. In other words, initialization is shifted from request-time work to preparation-time work.
This is exactly the kind of trade you want for latency-sensitive control planes: move repeatable expensive work off the critical path.
Request-time create from the golden snapshot looked roughly like:
```go
func createFromGoldenSnapshot(req CreateRequest) Sandbox {
	sb := allocateSandboxID()
	rootfs := cloneRootfs(snapshotBaseDisk, sb.ID) // reflink when available
	ns := acquireNetNSSlot()                       // pooled first, on-demand fallback
	vm := firecrackerStartInNetNS(ns, rootfs)
	firecrackerLoadSnapshot(vm, stateSnapPath, memSnapPath)
	waitUntilAgentPingOK(sb.VSock)
	agentRPCConfigureNetwork(sb.VSock, sb.IPConfig)
	return sb
}
```
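On the Firecracker side, both the one-time snapshot capture and the request-time firecrackerLoadSnapshot step are plain calls against the per-VM API Unix socket. A rough sketch of those two calls (the helper names are illustrative; the field names follow Firecracker's /snapshot/create and /snapshot/load API):

```go
package sandbox

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// apiClient tunnels HTTP requests over a per-VM Firecracker API Unix socket.
func apiClient(apiSock string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", apiSock)
		},
	}}
}

// fcRequest sends one JSON request to the Firecracker API and checks the status.
func fcRequest(c *http.Client, method, path, body string) error {
	req, err := http.NewRequest(method, "http://localhost"+path, strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("%s %s: %s", method, path, resp.Status)
	}
	return nil
}

// captureGoldenSnapshot runs once at preparation time: pause the prepared
// baseline VM, then write full state and memory snapshot files.
func captureGoldenSnapshot(c *http.Client, statePath, memPath string) error {
	if err := fcRequest(c, http.MethodPatch, "/vm", `{"state": "Paused"}`); err != nil {
		return err
	}
	return fcRequest(c, http.MethodPut, "/snapshot/create", fmt.Sprintf(
		`{"snapshot_type": "Full", "snapshot_path": %q, "mem_file_path": %q}`, statePath, memPath))
}

// loadGoldenSnapshot runs at request time: restore state and memory into a
// freshly started Firecracker process and resume the guest immediately.
func loadGoldenSnapshot(c *http.Client, statePath, memPath string) error {
	return fcRequest(c, http.MethodPut, "/snapshot/load", fmt.Sprintf(
		`{"snapshot_path": %q, "mem_backend": {"backend_type": "File", "backend_path": %q}, "resume_vm": true}`,
		statePath, memPath))
}
```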
netns (Linux network namespaces) gives each sandbox an isolated network stack. (Think of it as each sandbox getting its own virtual network world: interfaces, routes, and firewall view.) In Stage 2, the main latency win came from snapshot restore; netns was the enabler that made restore setup operationally clean and composable per sandbox. With per-sandbox namespaces, stable local device names can be reused without collisions. This also set up Stage 3, where netns pooling and netlink provisioning reduced host networking overhead further.
Combining snapshot restore with netns-based sandbox networking moved p95 to 190.617ms (~2.5x faster than previous stage).
With cold boot mostly out of the path, host networking setup emerged as the primary cost center.
Three changes mattered:
Per-sandbox iptables churn removed: Instead of mutating NAT rules for each sandbox, install a broader startup rule and stop touching iptables on every create/destroy.
Shell ip calls replaced with netlink APIs: Process spawning in the hot path adds overhead and failure surface. Direct netlink calls from Go removed both.
Network namespace pooling: Pre-create netns slots so create can often acquire a prepared network context instead of building one from scratch. (This is not indefinitely scalable on one host with one-octet subnet encoding and per-sandbox route/device growth; the next step is likely an allocator-based IPAM plus host-level scheduling for horizontal scale.)
The theme here was amortization: if a setup step is repeated and deterministic, pre-do it or make it native.
At the networking layer, this was the core control flow:
```go
func acquireNetNS() NetNSSlot {
	if slot, ok := pool.TryAcquire(shortWindow); ok {
		return slot
	}
	return createNetNSOnDemand()
}

func setupHostNetwork(sb Sandbox) error {
	// Startup path installs one broad MASQUERADE rule once.
	// Per-create path programs link/tap/routes via netlink.
	return netlinkProgram(sb.NetNS, sb.Veth, sb.Tap, sb.Routes)
}
```
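For a sense of what netlinkProgram does per create, here is a sketch of tap programming inside a sandbox's namespace. It assumes the github.com/vishvananda/netlink and github.com/vishvananda/netns packages, which may not match the repo's exact dependencies; the device names and addressing are illustrative:

```go
package sandbox

import (
	"net"
	"runtime"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// programTapInNS creates and configures a sandbox's tap device inside its
// network namespace directly from Go, with no ip(8) subprocesses in the hot path.
func programTapInNS(nsName, tapName, hostCIDR string, guestIP net.IP) error {
	// Namespace switching is per OS thread, so pin this goroutine to its thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	orig, err := netns.Get()
	if err != nil {
		return err
	}
	defer orig.Close()
	defer netns.Set(orig) // always hop back to the original namespace

	target, err := netns.GetFromName(nsName)
	if err != nil {
		return err
	}
	defer target.Close()
	if err := netns.Set(target); err != nil {
		return err
	}

	// Create the tap and bring it up.
	tap := &netlink.Tuntap{
		LinkAttrs: netlink.LinkAttrs{Name: tapName},
		Mode:      netlink.TUNTAP_MODE_TAP,
	}
	if err := netlink.LinkAdd(tap); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(tap); err != nil {
		return err
	}

	// Host-side address on the tap (e.g. "172.16.0.1/30"), then a host route
	// towards the guest IP so return traffic has somewhere to go.
	addr, err := netlink.ParseAddr(hostCIDR)
	if err != nil {
		return err
	}
	if err := netlink.AddrAdd(tap, addr); err != nil {
		return err
	}
	link, err := netlink.LinkByName(tapName)
	if err != nil {
		return err
	}
	return netlink.RouteAdd(&netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       &net.IPNet{IP: guestIP, Mask: net.CIDRMask(32, 32)},
	})
}
```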
The p95 latency is now 105.149ms.
At this point there was no single giant bottleneck. Gains came from shaving serialization and cleaning tail behavior.
- Parallelized setup: disk materialization and netns acquisition now overlap instead of running back-to-back (sketched below).
- Polling-based readiness: waits such as the one for the Firecracker API socket became tight polls, trimming dead time from the tail.
- Guest netlink: replacing in-guest ip commands with direct netlink operations reduced guest network setup time and variance once updated snapshots were rebuilt with the new agent.

By the time these landed and stabilized, measured p95 reached 59.030ms.
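For the parallelization piece, the shape of the change is small. In the same sketch style as the earlier snippets (errgroup is golang.org/x/sync/errgroup; the helper names come from the pseudocode above, not the repo's exact API):

```go
func prepareSandboxInputs(sb Sandbox) (rootfs string, ns NetNSSlot, err error) {
	var g errgroup.Group
	g.Go(func() error {
		rootfs = cloneRootfs(snapshotBaseDisk, sb.ID) // reflink when available
		return nil
	})
	g.Go(func() error {
		ns = acquireNetNS() // pooled slot if available, on-demand otherwise
		return nil
	})
	// The create path now pays max(clone, netns) instead of their sum.
	err = g.Wait()
	return rootfs, ns, err
}
```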
Now, the latency breaks down as follows:
This chart uses one instrumented create sample and shows per-stage duration magnitudes for /create.
- prep_overlap: wall-clock overlap window while disk materialization and netns acquisition run concurrently.
- socket_ready: time waiting for the Firecracker API Unix socket to become reachable after process start.
- snapshot_load: Firecracker /snapshot/load request time (state + memory restore).
- agent_ready: wait for the post-restore in-guest agent handshake/ping readiness.
- guest_net: in-guest network configuration work (interface/address/route setup via the agent).
- unattributed_overhead: remaining create-path time outside the explicit stage timers (bookkeeping, request/response plumbing, and small timing boundary gaps).

At this point, the create path is broadly balanced: no single stage significantly dominates end-to-end latency. The largest remaining chunks are snapshot restore (snapshot_load), post-restore handshake (agent_ready), and residual orchestration cost (unattributed_overhead), so remaining improvements are likely incremental unless a new architectural lever appears.
The optimized system looks like this:
And here's a summary of the latency progression thus far:
| Stage | p95 TTI |
|---|---|
| Baseline cold-boot + SSH | 2725 ms |
| vsock RPC control path | 472 ms |
| Golden snapshot restore + netns | 191 ms |
| Netns pool + host netlink path | 105 ms |
| Parallelized setup + polling + guest netlink | 59 ms |
Overall, this is a ~46x p95 reduction!
Many symptoms looked like latency regressions, but root causes were ownership and sequencing errors. These bugs were major turning points in how I reasoned about the system.
Two examples: artifact path drift, where different parts of the control plane disagreed about --api-sock paths; and readiness sequencing errors, where the control plane dialed the API socket before it existed or was listening, surfacing as ENOENT/ECONNREFUSED and long-tail retries.

None of these are exotic, but all of them meaningfully affect tail latency and reliability. In practice, low-latency control planes are built on explicit contracts: a process lifetime contract, a readiness contract, and an artifact path contract.
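As one concrete illustration of the readiness contract, here is a sketch of polling for the Firecracker API socket instead of assuming it exists right after the process starts (the helper name and signature are mine):

```go
package sandbox

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForAPISocket polls until the Firecracker API Unix socket accepts a
// connection, instead of treating "process started" as "API ready".
// ENOENT (socket file not created yet) and ECONNREFUSED (created but not yet
// listening) are both expected transient states during startup.
func waitForAPISocket(ctx context.Context, apiSock string, pollEvery time.Duration) error {
	ticker := time.NewTicker(pollEvery)
	defer ticker.Stop()
	for {
		conn, err := net.DialTimeout("unix", apiSock, pollEvery)
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("firecracker API socket %s not ready: %w (last error: %v)", apiSock, ctx.Err(), err)
		case <-ticker.C:
		}
	}
}
```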
Some changes did not produce headline p95 drops:
- A configurable rootfs clone mode (auto vs strict reflink-required behavior), so an environment that cannot reflink fails loudly instead of silently copying.
- Moving MANTA_WORK_DIR off /tmp: /tmp on this laptop (mounted as tmpfs) prevented the intended reflink fast path, so clone operations under --reflink=auto could fall back to real copies. Moving MANTA_WORK_DIR to a btrfs-backed path restored copy-on-write clone behavior and improved create/restore consistency. (MANTA_WORK_DIR is the host-side directory where runtime artifacts are materialized, for example per-sandbox rootfs copies, snapshot files, and sandbox working directories, so its backing filesystem directly affects clone/restore behavior.)

These are still performance work in the long game. They prevent silent fallback cliffs and environment-specific regressions that would otherwise erase gains later.
The current design is intentionally pragmatic.
What comes next is less about shaving another few milliseconds and more about turning this into a production-grade system.
Follow my progress in this GitHub repo, and if there's anything that can be improved, I appreciate pointers and feedback.
This has been incredibly fun and I have learned a lot. I'm looking forward to taking this online next!