The one with the final touches

 

I'll almost finish the container runtime today. These are things that are not strictly necessary, but they move my toy container runtime just a little bit closer to a real one. One more namespace to add. Make the container a zombie killer. Drop capabilities for extra security.

NEWUTS

Let's start with the simplest one: the UTS namespace. UTS stands for Unix Time Sharing System. Yes, I know. Despite a name that reminds us of Bell Labs, it's just a namespace that isolates the hostname and domain name of the container.

Adding the UTS namespace to the container runtime is straightforward: just add CLONE_NEWUTS to the clone flags. I'll also add a new command-line argument to allow setting the hostname when starting the container. We need it because of the capability drop that we'll do later.
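
The flags change itself is a one-liner. Here's a minimal sketch, assuming the flags are combined with nix's CloneFlags the way the earlier posts set them up (the other flags just stand in for whatever the runtime already passes):

    // OR the new flag into the existing set passed to clone()
    let flags = CloneFlags::CLONE_NEWUSER
        | CloneFlags::CLONE_NEWPID
        | CloneFlags::CLONE_NEWNS
        | CloneFlags::CLONE_NEWNET
        | CloneFlags::CLONE_NEWUTS; // new: isolates hostname and NIS domain name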

The code is simple. Add a new flag to the arguments in main.rs and pass it to the run_in_container function:

~ #[derive(Parser, Debug)]
~ #[command(author, version, about, long_about = None)]
~ struct Args {

+    /// Hostname for the container
+    #[arg(long)]
+    hostname: Option<String>,

~    /// CPU shares for the container, e.g. 0.5, 1, etc
~    #[arg(short, long)]
~    cpu: Option<String>,

    ...

-    if let Err(e) = run_in_container(&args.command, &args.args, &args.cpu, &args.mem) {
+    if let Err(e) = run_in_container(&args.command, &args.cpu, &args.mem, &args.args, &args.hostname) {
~       eprintln!("Error: {:#}", e);
~       return ExitCode::FAILURE;
~    }

A small refactoring in container.rs to handle the new argument. I'll move all the parameters passed to the child into a new struct:

struct ContainerConfig {
    is_parent_root: bool,
    network_cidr: Ipv4Cidr,
    hostname: Option<String>,
}

The child() function will now take a ContainerConfig struct as an argument.

-fn child(command: &str, args: &[String], netw: &Ipv4Cidr, is_parent_root: bool) -> anyhow::Result<()> {
+fn child(command: &str, args: &[String], config: &ContainerConfig) -> anyhow::Result<()> {
    ...

-    net::bring_up_container_net(&netw, is_parent_root)?;
+    net::bring_up_container_net(&config.network_cidr, config.is_parent_root)?;
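
For completeness, run_in_container now builds this struct and hands it to child(). A rough sketch; the field values come from what the earlier posts already compute, so treat the exact names as an approximation:

    let config = ContainerConfig {
        is_parent_root,
        network_cidr,
        hostname: hostname.clone(),
    };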

And the final touch: set the hostname of the container. We'll add the code right after we set up the container network.

    if let Some(hostname) = &config.hostname {
        // sethostname comes from nix::unistd; it only affects our new UTS namespace
        sethostname(hostname.as_str())?;
    }

Let's check if it works. First let's check the hostname of the host machine and run the container without setting a new hostname:

$ hostname
fedora
$ cargo run /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/container /bin/sh`
started child with PID=43821
/ # hostname
fedora
/ #

Now let's try to set a new hostname for the container with the command line argument and then change it inside the container.

$ cargo run -- --hostname=inside /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/container --hostname=inside /bin/sh`
started child with PID=43879
/ # hostname
inside
/ # hostname outside
/ # hostname
outside
/ #

If we run hostname on the host machine, it will still be fedora. The hostname is isolated from the host machine.

Being the init

So far, after we set up the filesystem and network (if needed), we performed execve and replaced our child process with the command passed via command-line arguments. This command replaced our process and ran with PID 1. PID 1 has a special meaning in Linux: it's the init process, and it is responsible for killing zombies.

zombie killer license

When a process terminates, the Linux kernel still keeps an entry in the process table for it, waiting for its parent to read its exit code. Until the parent does that, the dead child is a zombie. If the parent dies first, the child becomes an orphan and gets adopted by the init process, which then has to do the reaping.

The init process is responsible for calling wait() on all of its children, thus reaping them and freeing up resources. It is also responsible for forwarding signals to its children.

To do all that, we need to change the way the command is executed. Instead of directly calling execve and replacing our child process with the command process, we can fork and call execve in the child (or grandchild, if we're really keeping track), while our main container process (which is a child of the original container process) keeps running with PID 1 and acts as the init process.

Replacing execve with the following snippet will fork our process and replace the forked child with the command process, while the parent will run the run_as_init function with the child PID as an argument.

    match unsafe { fork() }.context("failed to fork")? {
        ForkResult::Child => {
            // execve replaces the current process, so this only returns on error
            execve(&cmd_cstring, &c_args, &c_env).context("failed to execute command")?;

            // This line is never reached if execve succeeds
            unreachable!()
        }
        ForkResult::Parent { child } => {
            run_as_init(child)?;
        }
    }

This is confusing, with so many "parents" and "children", so let me reiterate: the original container process clones a child into the new namespaces, and that child runs with PID 1 and becomes our init; the init then forks once more, and it's that grandchild that execve replaces with the actual command. The init itself stays around to reap zombies and forward signals.

Here is the code of the run_as_init function:


fn run_as_init(child: Pid) -> anyhow::Result<()> {
    // Ensure orphaned descendants get re-parented to us so we can reap them
    let _ = nix::sys::prctl::set_child_subreaper(true);

    // Block these signals for normal delivery; we'll read them from a signalfd instead
    let mut signal_mask = SigSet::empty();
    signal_mask.add(Signal::SIGTERM);
    signal_mask.add(Signal::SIGINT);
    signal_mask.add(Signal::SIGQUIT);
    signal_mask.add(Signal::SIGCHLD);
    signal_mask.thread_block().context("signal thread_block")?;

    let signal_fd = SignalFd::new(&signal_mask)?;

    loop {
        // Blocks until the next signal arrives
        let signal_info = signal_fd.read_signal()?.unwrap();
        let signal = Signal::try_from(signal_info.ssi_signo as i32).unwrap();
        match signal {
            Signal::SIGCHLD => {
                // Some child changed state: reap whatever zombies are waiting
                reap_zombies(child);
            }
            _ => {
                // Everything else is forwarded to the main child
                kill(child, signal)?;
            }
        }
    }
}

This code first blocks asynchronous delivery of these signals to the thread; instead, we will receive them through the SignalFd. Then we have an infinite loop that waits for signals and handles them. If the child dies, we receive a SIGCHLD signal and call the reap_zombies function to clean up the child process. All other signals are forwarded to the child process.

The final piece is the reap_zombies function. It drains terminated children in a loop: if the one that exited (or was killed by a signal) is our main child, it calls std::process::exit with the matching status; any other reaped children are simply collected, and the loop stops once there is nothing left to wait for.

fn reap_zombies(child: Pid) {
    loop {
        // -1 means "any child"; WNOHANG makes the call non-blocking
        match waitpid(Pid::from_raw(-1), Some(WaitPidFlag::WNOHANG)) {
            Ok(WaitStatus::Exited(pid, status)) if pid == child => {
                println!("child exited with status {}", status);
                std::process::exit(status);
            }
            Ok(WaitStatus::Signaled(pid, sig, _)) if pid == child => {
                println!("child received signal {}", sig);
                std::process::exit(128 + sig as i32);
            }
            // Nothing has changed state right now
            Ok(WaitStatus::StillAlive) => break,
            // Some other (adopted) child was reaped; keep draining
            Ok(_) => continue,
            // No children left at all
            Err(nix::errno::Errno::ECHILD) => break,
            Err(err) => {
                eprintln!("waitpid error: {}", err);
                std::process::exit(1);
            }
        }
    }
}

If we run the container and use ps -ef to see the processes, we will see the container process running with PID 1 and the child shell process with a different PID.

Drop capabilities

Guess what? There is a man page for this! man 7 capabilities

Our container is running the process with root privileges. Yes, that is root inside the container, but we might want to limit what it can do anyway. We can drop capabilities to limit the privileges of the process.

When we start the container with the CLONE_NEWUSER flag, the new process inside the user namespace gets all the capabilities. We can limit what our child can do by limiting these capabilities. We can do it right before running the command.

Let's add the caps crate to our Cargo.toml file:

$ cargo add caps

And then create a function to drop capabilities.


fn drop_caps() -> anyhow::Result<()> {
    // Everything the caps crate knows about, minus the one capability we want to keep
    let mut caps_drop = caps::all();
    caps_drop.remove(&Capability::CAP_CHOWN);

    // Remove the rest from the bounding set, so execve can't hand them to the command
    for cap in caps_drop {
        caps::drop(None, CapSet::Bounding, cap)
            .context(format!("failed to drop bounding capability {}", cap))?;
    }

    // And make sure no setuid or file-capability binary can re-gain privileges later
    nix::sys::prctl::set_no_new_privs()?;
    Ok(())
}

We'll keep CAP_CHOWN just in case we need to change the ownership of files inside the container.

Each thread has several capability sets assigned to it. We are interested in the bounding set. The bounding set limits what capabilities can be gained by execve. Because we will start the command using execve, it will gain only the capabilities we keep in the bounding set. We can keep all the capabilities for ourselves, but deny them to the command we will run.
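
If you want to see the difference between these sets at runtime, the caps crate can read them back. A quick sketch for poking around, not part of the runtime itself:

    // Inspect the current thread's bounding and effective sets
    let bounding = caps::read(None, CapSet::Bounding)?;
    let effective = caps::read(None, CapSet::Effective)?;
    println!("bounding:  {:?}", bounding);
    println!("effective: {:?}", effective);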

Let's add a command-line argument for dropping capabilities, just in case we need to run a command with full capabilities. I'll follow the same steps as with the hostname argument, adding a --drop-caps argument:

/// Drop all the capabilities for the command
#[arg(long)]
drop_caps: bool,

I'll pass this argument through the run_in_container function and into the ContainerConfig struct:

struct ContainerConfig {
    is_parent_root: bool,
    network_cidr: Ipv4Cidr,
    hostname: Option<String>,
    drop_caps: bool,
}

And right before we fork for the command, we'll check if drop_caps is true. If it is, we'll drop all the capabilities.

if config.drop_caps {
    drop_caps()?;
}

Now let's test it. I'll run the container with the --drop-caps argument and check the capabilities of the shell. cat /proc/$$/status returns the status of the current PID, and | grep Cap filters out everything except the capabilities.

As we can see, the CapBnd, CapEff and CapPrm (bounding, effective and permitted) capability sets are all 0000000000000001. CAP_CHOWN is capability number 0, so a mask with only bit 0 set means the command has nothing but CAP_CHOWN. Let's see if our child process, which is now the "init", still has its capabilities: cat /proc/1/status | grep Cap shows the full set of effective and permitted capabilities (000001ffffffffff). Its bounding set is reduced too, because we dropped it before forking, but its effective and permitted sets are untouched.

$ cargo run -- --drop-caps /bin/sh
   Compiling container v0.1.0 (/home/raven/projects/container_blog)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.39s
     Running `target/debug/container --drop-caps /bin/sh`
started child with PID=49682
/ # cat /proc/$$/status | grep Cap
CapInh:	0000000000000000
CapPrm:	0000000000000001
CapEff:	0000000000000001
CapBnd:	0000000000000001
CapAmb:	0000000000000000
/ # cat /proc/1/status | grep Cap
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	0000000000000001
CapAmb:	0000000000000000
/ #
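
By the way, if capsh from libcap is installed on the host, it can decode these bitmasks into capability names, which saves some squinting at hex:

$ capsh --decode=0000000000000001

For the mask above it should report only cap_chown.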

This is why I added the --hostname flag. After we dropped CAP_SYS_ADMIN, we can't change the hostname with the hostname program from the command line.

The code for this is on GitHub, as usual.



This wraps up the series on writing a container runtime. I learned a lot, and I also learned that I have no freaking idea about so many more things.

I've tried many other things which I didn't cover here. My original idea was to write a "sandbox" that would use an overlay filesystem to give the sandboxed app a read-only view of the host filesystem. Boy, was I naive. A dozen bind mounts later I kinda sorta achieved my goal, but it was a mess.

I also wanted the sandbox to have internet access in the rootless container, but I decided not to go there yet; it's a whole new can of worms I am not ready to deal with.

Instead I decided to scale down my ambitions and start writing a blog. I agree, not my brightest hour, but here we are.

Anyway, I'm done with the blog posts about the container runtime, but I'm probably not done with the container runtime itself, so maybe there's more to come.

If you somehow found these pages and they were useful to you, you're welcome.

Other posts on these topics: