The one where the root is avoided

We need more ~~cowbells~~ isolation! At the moment my container runtime creates a new PID namespace, but somehow it still manages not to fully isolate the child process from the host.

And I definitely don’t want the child process to be root on the host machine. Time to fix that.

Initially, I had to run the runtime as root because Linux restricts who can create new namespaces. Creating most namespaces requires the CAP_SYS_ADMIN capability. What is a “capability”? I’m glad you asked. Think of it as the superuser split into a set of smaller, fine-grained privileges. For details, see man 7 capabilities.

There is, however, one neat trick that we can use. A process can pass the CLONE_NEWUSER flag to clone() to create a new user namespace for the child. Creating a user namespace does not require CAP_SYS_ADMIN, so an unprivileged process can do it.

Inside that new user namespace, the child process gets all capabilities, effectively becoming a superuser within that namespace. The kernel guarantees that the user namespace is created first, which means any other namespaces created alongside it are owned by the new user namespace, where the child already has CAP_SYS_ADMIN.
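One way to check this from inside the container is to look at the CapEff line in /proc/self/status, which holds the effective capability set as a hexadecimal bitmask. The helper below is a minimal sketch and not part of the runtime: the function name is made up for illustration, and it only parses text that you would read from that file yourself.

```rust
/// Bit index of CAP_SYS_ADMIN in the capability bitmask (see linux/capability.h).
const CAP_SYS_ADMIN: u32 = 21;

/// Hypothetical helper: finds the `CapEff:` line in the text of /proc/<PID>/status
/// and reports whether the CAP_SYS_ADMIN bit is set.
/// Returns None if the line is missing or is not valid hexadecimal.
fn has_cap_sys_admin(status: &str) -> Option<bool> {
    let hex = status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))?
        .trim();
    let mask = u64::from_str_radix(hex, 16).ok()?;
    Some((mask & (1u64 << CAP_SYS_ADMIN)) != 0)
}
```

Feeding it the result of std::fs::read_to_string("/proc/self/status") should give Some(true) for a process that holds CAP_SYS_ADMIN in its user namespace and Some(false) for an ordinary unprivileged process.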

TL;DR: you don’t need to run your container runtime as root, as long as it runs the containerized child in a new user namespace.

Let’s adjust the clone flags and try running the container again.

let clone_flags = CloneFlags::CLONE_NEWPID | CloneFlags::CLONE_NEWUSER;

No more sudo. We can now just run cargo run sh to start a shell inside the container. Then we check who we are with id.

$ cargo run sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/container sh`
started child with PID=453845
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
$

WTF is that? On the one hand—success! We’re not root. On the other hand… nobody?

This happens because the shell is now running in a new user namespace, and that namespace needs its own UID and GID mappings to the host’s user and group IDs. Since we haven’t provided any mappings yet, every ID is unmapped, and the kernel reports unmapped IDs as the overflow ID, which defaults to 65534: nobody.

To fix this, we need to explicitly define UID and GID mappings for the new user namespace. This is done by writing mapping information to:

/proc/<PID>/uid_map
/proc/<PID>/gid_map

Here, <PID> is the PID of the child process running in the new namespace. These special files can only be written once, and they expect data in the following format:

<inside_id> <outside_id> <length>\n

inside_id — UID/GID inside the namespace
outside_id — corresponding UID/GID in the parent namespace
length — number of consecutive IDs to map

For example, writing 0 1000 10 to uid_map maps:

0 → 1000
1 → 1001
…
9 → 1009
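The arithmetic behind a single mapping line can be sketched in a few lines of Rust. The IdMap type and its methods are hypothetical names used only for illustration. They model one entry of the file and are not something the kernel or the runtime exposes.

```rust
/// One line of a uid_map or gid_map file (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct IdMap {
    inside: u32,
    outside: u32,
    length: u32,
}

impl IdMap {
    /// Formats the entry as "<inside_id> <outside_id> <length>\n".
    fn to_line(&self) -> String {
        format!("{} {} {}\n", self.inside, self.outside, self.length)
    }

    /// Translates an ID inside the namespace to the parent namespace,
    /// or returns None if the ID falls outside the mapped range.
    fn translate(&self, inside_id: u32) -> Option<u32> {
        let offset = inside_id.checked_sub(self.inside)?;
        if offset < self.length {
            self.outside.checked_add(offset)
        } else {
            None
        }
    }
}
```

With IdMap { inside: 0, outside: 1000, length: 10 }, translate(9) gives Some(1009), while translate(10) gives None because only ten consecutive IDs are mapped.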

Before writing to gid_map, we also need to write "deny\n" to /proc/<PID>/setgroups. This disables setgroups() in the new namespace and closes a security hole: without it, an unprivileged process could drop its supplementary groups and get access to files whose permissions deliberately deny access to those groups.

There’s one more complication: the parent process must write these files after the child has been created, but before the child calls exec(). That means we need some form of synchronization between parent and child.

We’ll do this with a simple Unix pipe.

The idea is straightforward:

Create a pipe before calling clone().
The child blocks, waiting to read from the pipe.
The parent sets up UID/GID mappings.
The parent writes a byte to the pipe.
The child unblocks and continues execution.
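The ordering these steps enforce can be tried out in isolation with the standard library only. The sketch below is a stand-in rather than the runtime code: a thread plays the role of the cloned child, UnixStream::pair() replaces the pipe, and the function name and event strings are made up for illustration.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs the handshake once and returns the order in which the two sides made progress.
fn run_handshake() -> std::io::Result<Vec<&'static str>> {
    let events = Arc::new(Mutex::new(Vec::new()));
    // Stand-in for pipe(): the "child" reads from one end, the "parent" writes to the other.
    let (mut child_end, mut parent_end) = UnixStream::pair()?;

    let child_events = Arc::clone(&events);
    let child = thread::spawn(move || -> std::io::Result<()> {
        let mut buf = [0u8; 1];
        // Blocks until the parent writes its byte; fails if the parent hangs up first.
        child_end.read_exact(&mut buf)?;
        child_events.lock().unwrap().push("child: continuing");
        Ok(())
    });

    // Placeholder for writing uid_map, setgroups and gid_map.
    events.lock().unwrap().push("parent: mappings written");
    parent_end.write_all(b"1")?;

    child.join().expect("child thread panicked")?;
    let order = events.lock().unwrap().clone();
    Ok(order)
}
```

The child can only record its event after read_exact returns, so the parent's entry always comes first. Using read_exact rather than a plain read also turns a parent that exits without writing into an error instead of a silent zero-byte read.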

I create the pipe before cloning the child process. The child inherits both file descriptors. The parent can immediately close its read end—we won’t need it.

I wasn’t sure how to cleanly transfer ownership of the pipe file descriptors into the clone() closure. nix::unistd::pipe() returns two OwnedFds, and the borrow checker complains if you try to move them into the child while still using them in the parent. I worked around this by converting them to raw file descriptors and reconstructing OwnedFds inside the child. I’m sure there’s a better way—this is where my lack of Rust expertise shows.

let (read_fd, write_fd) = pipe()?;

let child_read_fd = read_fd.as_raw_fd();
let child_write_fd = write_fd.as_raw_fd();

let child_pid = unsafe {
        clone( ... )
 }.context("Failed to clone process")?;

close(read_fd)?;

Next, the child code needs a few updates:

Recreate the read and write file descriptors
Close the write end of the pipe (the child won’t need it)
Block on reading from the pipe
Continue execution once the parent signals readiness

Box::new(move || {
    let read_fd = OwnedFd::from_raw_fd(child_read_fd);    // recreate file descriptors
    let write_fd = OwnedFd::from_raw_fd(child_write_fd);

    if let Err(e) = close(write_fd) {                     // close writing side of the pipe
        eprintln!("failed to close pipe: {}", e);
        return 1;
    }

    let mut buf = [0u8];
    if let Err(e) = read(read_fd, &mut buf) {            // read from pipe - blocks waiting on parent
        eprintln!("failed to sync with parent: {}", e);
        return 1;
    }

    // This runs in the child process with PID 1 in the new namespace
    if let Err(e) = child(command, args) {
        eprintln!("child process failed: {}", e);
        return 1;
    };
    return 0;
})

Now we need to create the UID and GID mappings. First, a helper function to write to /proc:

fn write_proc_file(child_pid: Pid, file_name: &str, data: &str) -> anyhow::Result<()> {
    let path = format!("/proc/{}/{}", child_pid, file_name);
    std::fs::write(&path, data).with_context(|| format!("failed to write to {}", path))?;
    Ok(())
}

Next, we map UID and GID 0 inside the namespace to the effective UID and GID of the user who started the container. Once that’s done, we write a byte to the pipe to let the child continue. The following code goes right after close(read_fd)?; in the parent:

let uid = unsafe { geteuid() };
let gid = unsafe { getegid() };

write_proc_file(child_pid, "uid_map", &format!("0 {} 1\n", uid))?;
write_proc_file(child_pid, "setgroups", "deny\n")?;
write_proc_file(child_pid, "gid_map", &format!("0 {} 1\n", gid))?;

write(&write_fd, b"1")?;
close(write_fd)?;

If we run the container again and execute id inside it, we now see this:

$ cargo run sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/container sh`
started child with PID=465060
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
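To double-check the mapping from inside the container, you can read /proc/self/uid_map, which lists the active entries as whitespace-separated numbers. The parser below is a small sketch with a hypothetical name. It only handles the text of the file and leaves reading it to the caller.

```rust
/// Hypothetical helper: parses the text of a uid_map or gid_map file into
/// (inside_id, outside_id, length) triples. Returns None on malformed input.
fn parse_id_map(text: &str) -> Option<Vec<(u32, u32, u32)>> {
    text.lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            let mut fields = line.split_whitespace().map(|f| f.parse::<u32>().ok());
            let entry = (fields.next()??, fields.next()??, fields.next()??);
            // Reject lines that carry more than three fields.
            if fields.next().is_some() {
                None
            } else {
                Some(entry)
            }
        })
        .collect()
}
```

Passing it the output of std::fs::read_to_string("/proc/self/uid_map") inside the container should yield a single entry that maps 0 to the UID of the user who started the runtime, with a length of 1.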

Important detail: the root user inside the namespace is not the same as the system root.

To prove this, let’s try to create a file somewhere that real root would normally be allowed to write to:

$ cargo run sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/container sh`
started child with PID=465060
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# touch /opt/new_file
touch: cannot touch '/opt/new_file': Permission denied

This is much closer to a real container. The child process is now running as root, but only within its own isolated namespace, and it does not have the same power as the “real” root on the host.

There’s a ton of additional detail about user namespaces in man 7 user_namespaces, and it’s well worth a read if you want to understand all the corner cases.

The source code for this post is on GitHub.
