Nexus-VMM: Rethinking Kubernetes Without Containers

💡

Important scope note: This article describes both what Nexus-VMM v0.1 already implements and the architectural path it validates for v0.2. Where a capability is planned rather than already integrated, I call that out explicitly.

The Limits of Shared Kernels

Kubernetes is the undisputed industry standard for infrastructure orchestration, but its default execution boundary relies on Linux containers built from cgroups and namespaces. This is fundamentally a shared-kernel construct. For multi-tenant environments or highly sensitive workloads, shared-kernel isolation can be an inadequate boundary because a single kernel vulnerability can compromise every workload on the node. Hardware virtualization via KVM can provide a materially stronger isolation boundary than shared-kernel containers, making it a compelling foundation for zero-trust-oriented workload design.

The Abstraction Tax

The industry recognized this need, but the existing solutions introduce massive technical debt. Platforms like KubeVirt achieve virtualization using a "Matryoshka Doll" architecture. They typically run a hypervisor stack inside a Kubernetes-managed pod boundary, layering hardware isolation on top of a container-oriented control plane.

This layering can introduce measurable overhead and lifecycle complexity:

Memory Overhead: Wrapping a VM in a Kubernetes-managed pod can introduce significant per-instance memory overhead from launcher processes, device emulation, and orchestration plumbing, though the exact cost depends heavily on the runtime, device model, and workload profile.
I/O Bottlenecks: Network traffic must traverse the guest kernel, the hypervisor user-space, the container network namespace, and finally the host Linux bridge.
Lifecycle Friction: Kubernetes declarative APIs and lifecycle probes expect to interact with a container namespace, not a virtual machine trapped inside one.

The Zero-Management Interface: The architectural goal of Nexus-VMM is a transparent UX built around standard Kubernetes Pods rather than a separate VM-specific object model. In the target design, users would deploy ordinary Pods and select the VM-backed execution path through runtimeClassName: nexus-vmm, allowing the orchestrator to translate a familiar Kubernetes workload into a hardware-isolated runtime boundary. The VM is not presented as a user-managed “pet VM,” but as a tactical isolation primitive for standard container workloads.

Q: Eradicating the Wrapper

If the goal is bare-metal performance with hardware-level isolation, how do I achieve KVM isolation with native Kubernetes Container Runtime Interface (CRI) lifecycle parity, without the overhead of wrapping VMs in containers?

A: Nexus-VMM

The answer is Nexus-VMM. I built a custom CRI shim written entirely in Rust that abandons the container wrapper. It intercepts Kubernetes orchestration gRPC commands and translates them into the control-plane primitives required for a VM-backed runtime, with direct native KVM-backed execution planned as the next step of the architecture.

💡

Implementation status: Nexus-VMM v0.1 is a research prototype focused on validating the orchestration path, not a completed production VMM. The current implementation proves the CRI-side control-plane shape, asynchronous runtime plumbing, host-side memory mapping primitives, and the guest-execution transport. Native KVM-backed execution is the next architectural milestone and is planned for v0.2.

Read the v0.1 Prototype Code on GitHub

Below is a technical deconstruction of the three core modules I built to solve the hardest orchestration edge cases. I engineered this system from first principles, focusing purely on mechanical sympathy with the host operating system.

Module A, Defeating Orchestration Starvation (`nexus-cri`)

Status in v0.1

Implemented as prototype orchestration plumbing and mock runtime validation.

To support a VM-backed runtime path outside the standard container execution model, Nexus-VMM must implement the gRPC Container Runtime Interface directly. A critical operational requirement is executing Container Network Interface (CNI) plugins to provision IP addresses. However, executing these external binaries via the standard blocking std::process::Command introduces a severe impedance mismatch. It blocks the async gRPC event loop, starving the orchestrator and causing unacceptable latency spikes across the node under high scheduling loads.

I built an asynchronous CRI shim using tonic and tokio. Every CNI invocation goes through tokio::process::Command, so waiting on a plugin yields to the runtime instead of blocking a worker thread, and the gRPC event loop keeps serving other requests while CNI setups run concurrently.

The Battle Scars

Building a non-blocking CRI exposed edge cases in state management. I enforce strict crash-consistency. Instead of holding network state in volatile RAM, which wipes on service restarts and causes the Kubelet to aggressively garbage collect pods, I serialize CNI output directly to disk. Furthermore, I built symmetrical teardowns to prevent IPAM exhaustion attacks on the node. To prevent orchestration panics on clean nodes, the architecture demands defensive filesystem checks and process validation before parsing.

// Internal CNI execution logic using the tokio reactor, from nexus-cri/src/lib.rs
async fn execute_cni_setup(&self, sandbox_id: &str) -> Result<(), String> {
    // v0.1 mock: a shell command stands in for the real CNI plugin binary
    let mut cmd = tokio::process::Command::new("sh");
    cmd.args(["-c", "sleep 1 && echo '{\"ip\":\"10.0.0.2\"}'"]);

    cmd.env("CNI_COMMAND", "ADD")
        .env("CNI_CONTAINERID", sandbox_id)
        .env("CNI_NETNS", format!("/var/run/netns/{}", sandbox_id))
        .env("CNI_IFNAME", "eth0")
        .env("CNI_PATH", "/opt/cni/bin");

    let output = cmd
        .output()
        .await
        .map_err(|e| format!("Failed to spawn CNI process: {}", e))?;

    // Defensive gate: Prevent parsing failures if CNI crashes
    if !output.status.success() {
        return Err("CNI ADD execution failed".into());
    }

    // Crash consistency: serialize the CNI result to disk
    let stdout_str = String::from_utf8_lossy(&output.stdout);
    // Skip any non-JSON preamble the plugin may have printed before the result
    let start = stdout_str.find('{').unwrap_or(0);
    let json_str = &stdout_str[start..];

    let path = format!("/var/lib/nexus/sandboxes/{}.json", sandbox_id);

    // Ensure the directory exists before attempting the write
    tokio::fs::create_dir_all("/var/lib/nexus/sandboxes/")
        .await
        .map_err(|e| format!("Failed to create CNI state directory: {}", e))?;
    tokio::fs::write(&path, json_str.as_bytes())
        .await
        .map_err(|e| format!("Failed to write CNI state: {}", e))?;

    Ok(())
}
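
The crash-consistency, symmetrical-teardown, and clean-node points above can be sketched with the standard library alone. This is a minimal illustration, not the code in nexus-cri: the helper names (state_path, persist_state, load_state, remove_state) are hypothetical, the write-to-temp-then-rename step is an assumption about how a torn write could be avoided, and sandbox_id is assumed to contain no path separators.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: location of a sandbox's persisted CNI result.
fn state_path(dir: &Path, sandbox_id: &str) -> PathBuf {
    dir.join(format!("{}.json", sandbox_id))
}

/// Write to a temporary file, flush it to disk, then rename it into place,
/// so a crash mid-write never leaves a truncated state file behind.
fn persist_state(dir: &Path, sandbox_id: &str, json: &[u8]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let tmp_path = dir.join(format!("{}.json.tmp", sandbox_id));
    let mut tmp = fs::File::create(&tmp_path)?;
    tmp.write_all(json)?;
    tmp.sync_all()?;
    fs::rename(&tmp_path, state_path(dir, sandbox_id))
}

/// Defensive read: on a clean node there is no state yet, which is not an error.
fn load_state(dir: &Path, sandbox_id: &str) -> io::Result<Option<Vec<u8>>> {
    match fs::read(state_path(dir, sandbox_id)) {
        Ok(bytes) => Ok(Some(bytes)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

/// Symmetrical teardown: removing state that is already gone succeeds,
/// so a retried teardown does not fail.
fn remove_state(dir: &Path, sandbox_id: &str) -> io::Result<()> {
    match fs::remove_file(state_path(dir, sandbox_id)) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```

In this sketch a caller would invoke persist_state after a successful CNI ADD, load_state when the shim restarts, and remove_state after the matching CNI DEL, so every allocation has exactly one release path.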

Module B, Zero-Copy Data Ingestion (`nexus-memory-mapper`)

Status in v0.1

Host-side file-backed memory mapping prototype. Direct KVM guest-memory registration and read-only exposure to the guest are planned for v0.2.

A zero-copy data path requires injecting Kubernetes volumes into the guest. Mounting Kubernetes Secrets and ConfigMaps via standard virtio-fs can cause serialization bottlenecks under high concurrent read loads and stall the hypervisor thread that services those requests.

In Nexus-VMM, I intentionally constrain this path to Secrets and ConfigMaps that are treated as immutable snapshots for the lifetime of the guest mapping. The design bypasses the guest filesystem abstractions entirely: the secret files are mapped straight from the host's page cache with libc::mmap, using the PROT_READ and MAP_SHARED flags.

Why `Virtio-fs` Fails at Scale

While projects like Kata Containers often rely on virtio-fs for host-to-guest file sharing, that approach can introduce additional mediation, serialization points, and hypervisor overhead under some access patterns. Nexus-VMM pursues a more mechanically direct approach by treating selected Kubernetes Secrets and ConfigMaps as immutable snapshot-backed memory regions within the runtime design. By utilizing libc::mmap to create a host-side file-backed mapping as the foundation for a future guest-memory registration path, Nexus-VMM reduces filesystem mediation and establishes the groundwork for a lower-overhead ingestion model than a traditional shared filesystem path.

The Battle Scars

To prevent malicious or accidental guest-induced host page faults, this host-side mapping lays the groundwork for strict hypervisor security. At the system level, mmap rejects a zero-length mapping with EINVAL, so an empty file must be refused explicitly before the call. When integrating the KVM execution loop in v0.2, the design requires registering these mappings with KVM under the KVM_MEM_READONLY flag and exposing them to the guest via a virtio-pmem DAX device. This will ensure that if the guest kernel attempts to write a superblock, the operation is blocked at the hypervisor level rather than reaching host memory.

// Bypassing virtio-fs locks via direct memory mapping from nexus-memory-mapper/src/lib.rs
pub fn map_secret_read_only(path: &Path) -> io::Result<MappedSecret> {
    let file = File::open(path)?;
    let metadata = file.metadata()?;
    let len = metadata.len() as usize;

    // Structural safeguard: Calling mmap on an empty file will fail
    if len == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Cannot mmap an empty file",
        ));
    }

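    let fd = file.as_raw_fd();

    unsafe {
        // SAFETY: `fd` stays open for the duration of the call, `len` is non-zero, and the
        // result is checked against MAP_FAILED. The mapping shares the host page cache
        // (zero-copy, read-only) and remains valid after `file` is closed.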
        let ptr = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_SHARED,
            fd,
            0,
        );

        if ptr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }

        Ok(MappedSecret::new(ptr, len))
    }
}
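
The planned v0.2 registration step could look roughly like the sketch below, which only builds the descriptor that would later be handed to KVM and performs no ioctl. The struct mirrors the layout of struct kvm_userspace_memory_region and the KVM_MEM_READONLY value from linux/kvm.h. The readonly_region helper, the 4 KiB page size, and the choice to round the size up to a whole page are assumptions made for illustration, not part of Nexus-VMM.

```rust
/// Mirrors `struct kvm_userspace_memory_region` from <linux/kvm.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct KvmUserspaceMemoryRegion {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
}

/// Flag value from <linux/kvm.h>; requires the KVM_CAP_READONLY_MEM capability.
const KVM_MEM_READONLY: u32 = 1 << 1;

/// Assumption: 4 KiB pages. A real implementation would query the host page size.
const PAGE_SIZE: u64 = 4096;

/// Hypothetical helper: describe a read-only guest memory slot backed by an
/// existing host mapping of `len` bytes at `host_addr`.
fn readonly_region(
    slot: u32,
    guest_phys_addr: u64,
    host_addr: u64,
    len: u64,
) -> Result<KvmUserspaceMemoryRegion, String> {
    if len == 0 {
        return Err("cannot register an empty mapping".into());
    }
    if guest_phys_addr % PAGE_SIZE != 0 || host_addr % PAGE_SIZE != 0 {
        return Err("addresses must be page-aligned".into());
    }
    // mmap hands out whole pages, so round the file length up to a page multiple.
    let memory_size = len
        .checked_add(PAGE_SIZE - 1)
        .ok_or_else(|| "mapping length overflows".to_string())?
        / PAGE_SIZE
        * PAGE_SIZE;
    Ok(KvmUserspaceMemoryRegion {
        slot,
        flags: KVM_MEM_READONLY,
        guest_phys_addr,
        memory_size,
        userspace_addr: host_addr,
    })
}
```

With a descriptor like this, the pointer returned by libc::mmap would supply host_addr, and the rust-vmm crates named in the roadmap would perform the actual KVM_SET_USER_MEMORY_REGION call.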

Module C, Resolving the ExecSync Void (`nexus-vsock-agent`)

Status in v0.1

Internal guest-execution transport prototype validated with mocked host-side streams. Full CRI-compliant ExecSync reassembly remains part of the next integration step.

Because I stripped away the container runtime, I destroyed the namespace abstraction. Consequently, standard Kubernetes liveness and readiness probes, which rely on the ExecSync RPC to run scripts inside the pod boundary, fail completely.

I deploy a statically compiled Rust micro-agent inside the guest OS. When Kubelet issues an ExecSync command, the CRI shim intercepts it, serializes the payload, and tunnels it to the guest agent over an internal transport designed for an AF_VSOCK interface. In v0.1 that transport is validated with mocked duplex streams on the host side; full CRI response reassembly has not yet been wired in.

The Battle Scars

Data Integrity:

I do not dump raw stdout bytes over the socket, as this leads to framing collisions. Early prototypes attempted to return execution results through a simple JSON response structure, but that proved too rigid for a robust internal transport carrying interleaved stdout, stderr, and exit-state metadata. Instead, I built a bounded MPSC (Multi-Producer, Single-Consumer) channel so that a slow consumer applies backpressure to the producers instead of letting output buffer without limit. I stream the output directly, wrapping every 4 KB chunk in a strict binary TLV (Type-Length-Value) framing protocol. At the CRI boundary, this internal TLV stream is intended to be reassembled back into the synchronous ExecSync response shape expected by Kubernetes.
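
To make the framing idea concrete, here is a minimal sketch of one possible TLV layout and the reassembly step. The wire format shown (a 1-byte type, a 4-byte big-endian length, then the value), the tag values, and the ExecResult type are all assumptions for illustration; the article does not specify the layout that nexus-vsock-agent actually uses.

```rust
/// Hypothetical frame types; the real tag values are not documented here.
const T_STDOUT: u8 = 1;
const T_STDERR: u8 = 2;
const T_EXIT: u8 = 3;
/// Chunk size taken from the text: every frame carries at most 4 KB.
const MAX_CHUNK: usize = 4096;
const HEADER_LEN: usize = 5;

/// Append one frame: 1-byte type, 4-byte big-endian length, then the value.
fn encode_frame(out: &mut Vec<u8>, tag: u8, value: &[u8]) {
    out.push(tag);
    out.extend_from_slice(&(value.len() as u32).to_be_bytes());
    out.extend_from_slice(value);
}

/// Split a stream's bytes into frames of at most MAX_CHUNK bytes each.
fn encode_stream(out: &mut Vec<u8>, tag: u8, data: &[u8]) {
    for chunk in data.chunks(MAX_CHUNK) {
        encode_frame(out, tag, chunk);
    }
}

/// Hypothetical shape of a reassembled, synchronous exec result.
#[derive(Debug, Default, PartialEq)]
struct ExecResult {
    stdout: Vec<u8>,
    stderr: Vec<u8>,
    exit_code: Option<i32>,
}

/// Walk the frames in order and rebuild one result, rejecting truncated,
/// oversized, or unknown frames instead of guessing at their content.
fn reassemble(mut buf: &[u8]) -> Result<ExecResult, String> {
    let mut res = ExecResult::default();
    while !buf.is_empty() {
        if buf.len() < HEADER_LEN {
            return Err("truncated frame header".into());
        }
        let tag = buf[0];
        let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
        if len > MAX_CHUNK {
            return Err(format!("frame of {} bytes exceeds chunk limit", len));
        }
        if buf.len() < HEADER_LEN + len {
            return Err("truncated frame value".into());
        }
        let value = &buf[HEADER_LEN..HEADER_LEN + len];
        match tag {
            T_STDOUT => res.stdout.extend_from_slice(value),
            T_STDERR => res.stderr.extend_from_slice(value),
            T_EXIT => {
                if value.len() != 4 {
                    return Err("exit frame must carry 4 bytes".into());
                }
                let code = [value[0], value[1], value[2], value[3]];
                res.exit_code = Some(i32::from_be_bytes(code));
            }
            other => return Err(format!("unknown frame type {}", other)),
        }
        buf = &buf[HEADER_LEN + len..];
    }
    Ok(res)
}
```

Because each frame names its stream and its length, stdout and stderr chunks can interleave on the wire without ambiguity, and the exit code travels as its own frame instead of being inferred from the stream closing.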

Zombie Prevention & Routing:

If a probe runs a command with arguments and the agent mishandles them, the probe can hang silently, so the agent explicitly splits the command vector into a program and its arguments. Additionally, if the Kubelet drops the socket prematurely, I cannot afford to leak orphaned processes in the guest. I combine Tokio's kill_on_drop(true) with a pre_exec closure that calls the Linux prctl syscall with PR_SET_PDEATHSIG. Together these provide a strong process-cleanup safeguard: the child is killed when its handle is dropped, and the kernel sends it SIGKILL if the agent thread that spawned it exits unexpectedly.

// Enforcing OS-level zombie process prevention and argument routing from nexus-vsock-agent/src/lib.rs
let mut cmd = Command::new(&request.command[0]);

// Accurately route command arguments to prevent silent probe failures
if request.command.len() > 1 {
    cmd.args(&request.command[1..]);
}

cmd.stdout(Stdio::piped()).stderr(Stdio::piped());

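// Tokio kills the child if its handle is dropped (for example, when the request is cancelled)
cmd.kill_on_drop(true);

unsafe {
    // SAFETY: the closure runs in the forked child before exec and only calls prctl
    cmd.pre_exec(|| {
        // Ask the kernel to SIGKILL the child if the agent thread that spawned it exits
        if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    });
}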
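
One edge case the excerpt above leaves open is an empty command vector, where indexing request.command[0] would panic. A minimal std-only sketch of a guard follows; build_command is a hypothetical helper, and it uses std::process::Command rather than the Tokio command type so that it stands alone.

```rust
use std::process::{Command, Stdio};

/// Hypothetical helper: turn a probe's argv into a Command, rejecting an
/// empty vector instead of panicking on `argv[0]`.
fn build_command(argv: &[String]) -> Result<Command, String> {
    // split_first yields the program and the remaining arguments in one step.
    let (program, args) = argv
        .split_first()
        .ok_or_else(|| "empty command vector".to_string())?;

    let mut cmd = Command::new(program);
    cmd.args(args)
        .stdout(Stdio::piped())
        .stderr(Stdio::piped());
    Ok(cmd)
}
```

Returning an error here lets the agent report a failed probe immediately, which is easier to diagnose than a crashed agent or a probe that hangs until its timeout.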

Roadmap:

Architectural Validation Before Engine Execution

Nexus-VMM v0.1 is a Research Prototype designed to validate a radical thesis: that Kubernetes CRI parity can be achieved without the 'Matryoshka Doll' container wrapper.

v0.1 (Current): Focuses on the 'Orchestration Plumbing'—the Async gRPC Shim, the Zero-Copy Memory Mapper, and the VSOCK Execution Tunnel.
v0.2 (Upcoming): Will integrate a native Virtual Machine Monitor (VMM) directly into the nexus-cri boundary using the rust-vmm ecosystem (specifically utilizing crates like kvm-ioctls and vmm-sys-util).

Decoupling the CRI orchestration logic from the future /dev/kvm execution loop allowed me to validate the Kubernetes-facing control-plane shape before taking on the complexity of KVM execution, guest boot flow, and device-model integration.

Nexus-VMM v0.1 validates the orchestration path for a native VMM-backed runtime and substantially de-risks the Kubernetes-facing control plane. The Rust-based, CRI-integrated execution loop itself remains the central engineering milestone for v0.2.

Review the architecture, challenge the implementation, or contribute to v0.2 here

What v0.1 proves today

Async CRI-side orchestration plumbing for a VM-backed runtime design
Crash-consistent state handling for the prototype network path
Host-side file-backed memory mapping primitives for immutable snapshot-style data
A guest-execution transport pattern suitable for a future VSOCK-backed control path

What remains for v0.2

Native KVM-backed execution integrated into the runtime
Guest-memory registration and read-only slot enforcement through KVM
Full CRI-compliant ExecSync response reassembly
End-to-end RuntimeClass-driven Kubernetes integration

Independent Research Statement:
Nexus-VMM is an independent research project developed after my tenure with previous employers had ended. Its architecture is based on first-principles analysis of public Kubernetes/CRI specifications, public KVM interfaces, and open-source systems design. All implementation work was done using personal resources and publicly available documentation, with no reliance on proprietary code, internal roadmaps, or non-public methodologies.
