The one where the Rust project is created

Time to finally start writing some code. I’ll begin this post by creating a project and end it with a program running inside a new PID namespace.

I’m not going to waste time explaining obvious things. If you’re reading this and feel like you need help creating a Rust project, then this topic is probably not for you.

For this project I’m using the following dependencies:

[dependencies]
clap = { version = "4.5", features = ["derive"] }
anyhow = "1.0"
nix = { version = "0.30", features = ["sched", "process", "hostname", "mount", "fs"] }
libc = "0.2"

We’ll need more later, but this set will cover most of what we need for now.

In src/main.rs, let’s define a struct to parse command-line arguments into. We don’t strictly need it, but since I’ve already added clap to the dependency list, I might as well use it.

While we’re at it, we can also add the main function to parse arguments and call the run_in_container function, which we’ll define in a minute.

use clap::Parser;
use std::process::ExitCode;

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Command to execute in the container
    #[arg(required = true)]
    command: String,

    /// Arguments for the command
    #[arg(trailing_var_arg = true, allow_hyphen_values = true)]
    args: Vec<String>,
}


fn main() -> ExitCode {
    let args = Args::parse();

    if let Err(e) = run_in_container(&args.command, &args.args) {
        eprintln!("Error: {:#}", e);
        return ExitCode::FAILURE;
    }

    ExitCode::SUCCESS
}

I won’t go into details about how to write Rust programs. These notes aren’t meant to be a tutorial; they’re mostly for me, to remember what I’ve learned.

To start a new process, I’ll use clone() and execvp(). Since I want the child process to run in a new PID namespace, I need to pass CLONE_NEWPID in the clone() flags. The code looks like this:

fn run_in_container(command: &str, args: &[String]) -> anyhow::Result<()> {
    use anyhow::Context;
    use nix::sched::{CloneFlags, clone};
    use nix::sys::signal::Signal;

    // 1 MiB is plenty of stack for the child
    const STACK_SIZE: usize = 1024 * 1024;

    // run the child in a new PID namespace
    let clone_flags = CloneFlags::CLONE_NEWPID;
    // allocate stack for the child process
    let mut stack = vec![0u8; STACK_SIZE];

    let child_pid = unsafe {
        clone(
            Box::new(move || {
                if let Err(e) = child(command, args) {
                    eprintln!("child process failed: {}", e);
                    return 1;
                }
                0
            }),
            &mut stack,
            clone_flags,
            Some(Signal::SIGCHLD as i32),
        )
    }
    .context("Failed to clone process")?;

    println!("started child with PID={}", child_pid);
    let _ = wait_for_child(child_pid);

    Ok(())
}

wait_for_child is a helper function that waits for the child process to terminate:

fn wait_for_child(pid: Pid) -> anyhow::Result<i32> {
    use nix::sys::wait::{WaitStatus, waitpid};

    match waitpid(pid, None).context("Failed to wait for child process")? {
        WaitStatus::Exited(_, code) => Ok(code),
        // follow the shell convention for deaths by signal: 128 + signal number
        WaitStatus::Signaled(_, signal, _) => Ok(128 + signal as i32),
        _ => Ok(1),
    }
}

The closure passed as the first argument to clone() is executed in the child process. All it does is call the child() function with the command name and arguments, plus some basic error handling.

The child() function itself is straightforward: convert the command and arguments into CStrings and call execvp:

fn child(command: &str, args: &[String]) -> anyhow::Result<()> {
    use nix::unistd::execvp;
    use std::ffi::CString;

    // Convert command to CString
    let cmd_cstring = CString::new(command).context("failed to convert command to CString")?;

    // Convert arguments to CStrings
    // The first argument should be the program name itself
    let mut c_args: Vec<CString> = Vec::new();
    c_args.push(cmd_cstring.clone());

    for arg in args {
        c_args.push(CString::new(arg.as_str()).context("failed to convert argument to CString")?);
    }

    // execvp replaces the current process, so this only returns on error
    execvp(&cmd_cstring, &c_args).context("failed to execute command")?;

    // This line is never reached if execvp succeeds
    unreachable!()
}

When I try to run this, I immediately get an error:

$ cargo run echo 'test'
   Compiling container v0.1.0 (/home/raven/projects/container)
   Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.86s
     Running `target/debug/container echo test`
Error: Failed to clone process: EPERM: Operation not permitted

Creating a new namespace requires the CAP_SYS_ADMIN capability, so the program needs to be run with sudo:

$ sudo target/debug/container echo 'test'
started child with PID=95829
test

Let’s try again, this time launching a shell to see what PID the shell process gets:

$ sudo target/debug/container sh
started child with PID=100031
# echo $$
1

Clearly, the new shell is running in a new PID namespace—it believes its PID is 1. Success!

However, the container is far from being isolated. If we run ps inside the new shell, we can see all of the host’s processes:

# ps
    PID TTY          TIME CMD
      1 ?        00:00:06 systemd
      2 ?        00:00:00 kthreadd
      3 ?        00:00:00 pool_workqueue_release
      4 ?        00:00:00 kworker/R-rcu_gp
    ...
  93850 ?        00:00:00 cupsd
  95647 ?        00:00:00 kworker/10:1-events
  95790 ?        00:00:00 kworker/3:1-mm_percpu_wq
  98490 ?        00:00:00 kworker/8:1-cgroup_free

This happens because ps reads process information from the /proc filesystem. Since the child shell still has access to the host’s /proc, it can see every process on the system.
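There’s a neat way to see the confusion from inside the container shell: `$$` expands to 1, but /proc/1 in the host’s proc mount belongs to the host’s init process, so on this systemd host:

```shell
# echo $$
1
# cat /proc/$$/comm    # really /proc/1/comm, i.e. the host's PID 1
systemd
```

The shell and the proc filesystem simply disagree about which namespace PID 1 lives in.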

Things get even worse. If I run id inside the child shell, I’ll see that I’m root. That’s expected, since I ran my “container runtime” with sudo. Unfortunately, this means the container is not only failing to isolate the child process from the host—it’s also giving it root privileges on the host system. That’s pretty much the opposite of what I’m trying to achieve.

Luckily, there is a way to fix this. The first step is to create a new user namespace for the container, so that the containerized application can run as root inside the container without being root on the host. This will also help with isolating /proc, although that part requires a bit more work.
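As a rough preview of where this is going, util-linux’s unshare can already demonstrate the idea: a new user namespace in which root on the inside maps to my unprivileged user on the outside, no sudo required (the uid_map output assumes a host UID of 1000):

```shell
$ unshare --user --map-root-user sh
# id -u
0
# cat /proc/self/uid_map
         0       1000          1
```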

That’s a topic for next time.

The source code is on GitHub

Other posts on these topics: