The one with the mounting

It’s time to get back to the problem I ran into earlier: the ps command inside the container was showing host processes. That clearly means something was missing. Time to employ the next kind of namespace — the mount namespace.

The mount namespace is actually the first namespace ever added to the Linux kernel, way back in 2002. You can tell from the clone flag name: CLONE_NEWNS. The NS stands for “namespace”. This predates the point where namespaces became a well-defined, coherent concept, but the name stuck.

As usual, man 7 mount_namespaces contains a lot of information on the topic.

My initial plan was to create an overlay filesystem for the container, using the host filesystem as the lower layer. If you’re not familiar with overlayfs, it’s what all the grown-up containers use these days. It allows you to stack multiple directories as layers: for example, a base root filesystem at the bottom, then /home/$USER on top of that, then /opt with additional tools, and so on.

The lower layers are treated as read-only: overlayfs never writes to them, and all changes go into the upper layer. overlayfs also requires a work directory and, of course, a mount point. The big advantage is that you can have a base root filesystem that is never modified, plus optional layers for things like a Python runtime or Node.js. Any changes made by the containerized process won’t affect those lower layers, providing an effective sandbox. See the kernel documentation for more details.
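To make the layering a bit more concrete, here is a minimal sketch of how the option string for an overlayfs mount could be assembled. The helper name overlay_options and the directory paths are hypothetical and not part of the project; the lowerdir/upperdir/workdir keys are the ones described in the kernel documentation.

```rust
use std::path::Path;

/// Builds the `data` string for an overlayfs mount (hypothetical helper).
/// `lowers` is ordered top-most layer first; overlayfs stacks the
/// colon-separated list starting from the rightmost entry.
fn overlay_options(lowers: &[&Path], upper: &Path, work: &Path) -> String {
    let lowerdir = lowers
        .iter()
        .map(|p| p.display().to_string())
        .collect::<Vec<_>>()
        .join(":");
    format!(
        "lowerdir={},upperdir={},workdir={}",
        lowerdir,
        upper.display(),
        work.display()
    )
}

// Assuming the nix crate, the string would be passed as the `data` argument:
// mount(Some("overlay"), &merged, Some("overlay"), MsFlags::empty(), Some(opts.as_str()))
```

This sketch does no escaping, so it only works for paths without commas or colons.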

As it turned out, giving a container access to the host filesystem via overlayfs is quite tricky. It’s possible, but not in a straightforward way. You need to create the overlay filesystem outside of the host root filesystem and then bind-mount selected host directories into it. I did manage to make it work, but it was far too complicated for this blog post.

So, Plan B.

Alpine Linux provides a minimal root filesystem: all the directories and files you’d expect from Alpine, packaged as a .tgz file. I downloaded the x86_64 version and unpacked it into fs/rootfs in the project root. This will serve as the container’s root filesystem.

I've added a new function to set up the container filesystem.

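// Imports inferred from the names used below (assumes the nix and anyhow crates).
use std::fs::{create_dir_all, remove_dir, remove_dir_all};
use std::path::Path;

use anyhow::Context;
use nix::mount::{mount, umount2, MntFlags, MsFlags};
use nix::unistd::{chdir, pivot_root};

fn create_container_filesystem(root: &str) -> anyhow::Result<()> {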
    // change the root fs propagation to private
    mount(
        None::<&str>,
        "/",
        None::<&str>,
        MsFlags::MS_REC | MsFlags::MS_PRIVATE,
        None::<&str>,
    )
    .context("private propagation for /")?;

    let rootfs = Path::new(root).join("rootfs");

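    // bind-mount rootfs onto itself so that it becomes a mount point,
    // which pivot_root requires for the new root
    mount(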
        Some(&rootfs),
        &rootfs,
        None::<&str>,
        MsFlags::MS_BIND | MsFlags::MS_REC,
        None::<&str>,
    )
    .context("bind mount rootfs")?;

    // mount a fresh procfs; this path becomes /proc after pivot_root
    let proc = rootfs.join("proc");
    mount(
        Some("proc"),
        &proc,
        Some("proc"),
        MsFlags::empty(),
        None::<&str>,
    )
    .context("mount /proc")?;

    // prepare for pivot_root
    let old_root = rootfs.join(".old_root");
    if old_root.exists() {
        remove_dir_all(&old_root).context("remove old_root")?;
    }
    create_dir_all(&old_root).context("create old_root")?;

    // pivot_root and unmount old_root
    pivot_root(&rootfs, &old_root).context("pivot_root")?;
    chdir("/").context("chdir to /")?;

    // cleanup old root
    umount2("/.old_root", MntFlags::MNT_DETACH).context("umount old_root")?;
    let _ = remove_dir("/.old_root");

    Ok(())
}

First, the function changes the propagation of the root filesystem to "private" inside my brand new, shiny mount namespace. This ensures that mount events don’t leak in or out of the namespace. See the aforementioned mount_namespaces man page for details on mount propagation.
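One way to check that the propagation change took effect is to look at /proc/self/mountinfo, where shared and slave mounts carry tags such as shared:N or master:N in the optional fields. Below is a minimal sketch of a parser for a single line of that file. The function and enum names are hypothetical, and a mount that is both shared and slave is simply reported as shared.

```rust
/// Propagation state of a mount, as reported in /proc/self/mountinfo.
#[derive(Debug, PartialEq)]
enum Propagation {
    Shared,
    Slave,
    Unbindable,
    Private,
}

/// Parses one mountinfo line and returns (mount point, propagation).
/// The optional fields sit between the sixth field and the lone "-" separator;
/// a mount with no propagation tag there is private.
fn parse_mountinfo_line(line: &str) -> Option<(String, Propagation)> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    let mount_point = fields.get(4)?.to_string();
    let sep = fields.iter().position(|f| *f == "-")?;
    let optional = fields.get(6..sep)?;
    let propagation = if optional.iter().any(|f| f.starts_with("shared:")) {
        Propagation::Shared
    } else if optional.iter().any(|f| f.starts_with("master:")) {
        Propagation::Slave
    } else if optional.contains(&"unbindable") {
        Propagation::Unbindable
    } else {
        Propagation::Private
    };
    Some((mount_point, propagation))
}
```

Feeding it every line of /proc/self/mountinfo from inside the new namespace should report Private for all mounts once the MS_REC | MS_PRIVATE call has succeeded.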

Next, I bind-mount the directory containing the Alpine filesystem onto itself. The pivot_root syscall requires the new root to be a mount point, and a bind mount is the simplest way to achieve that. While I’m at it, I also mount procfs at rootfs/proc, which will become /proc inside the container.

After that setup, we prepare for switching the root filesystem. We create (or clean up) a .old_root directory inside rootfs: pivot_root needs a directory at or underneath the new root where it can put the old one. The pivot_root syscall then makes rootfs the root filesystem and moves the old root to the .old_root directory.

Once pivot_root completes, the .old_root directory can be unmounted and deleted, fully isolating the container from the host filesystem.

Now, if we run cargo run bash, we’re greeted with this:

$ cargo run bash
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/container bash`
started child with PID=160535
child process failed: failed to execute command: ENOENT: No such file or directory

Before introducing the new mount namespace, we were running programs from the host filesystem. Now we only have access to what’s available in Alpine’s rootfs — and bash isn’t there.
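A small pre-flight check could turn this ENOENT into a friendlier error before pivot_root even happens. The sketch below is not part of the project: the function name and the hard-coded list of search directories are assumptions, and it only handles bare command names such as sh, not paths.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Directories searched inside the rootfs; an assumed, hard-coded PATH.
const SEARCH_DIRS: [&str; 4] = ["bin", "sbin", "usr/bin", "usr/sbin"];

/// Looks for `command` inside the unpacked root filesystem (hypothetical helper).
fn find_in_rootfs(rootfs: &Path, command: &str) -> Option<PathBuf> {
    SEARCH_DIRS
        .iter()
        .map(|dir| rootfs.join(dir).join(command))
        // symlink_metadata does not follow symlinks, so an absolute link
        // inside the rootfs is not resolved against the host filesystem.
        .find(|candidate| fs::symlink_metadata(candidate).is_ok())
}
```

The check only looks at whether a directory entry exists; it does not verify that the entry is executable.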

Let’s switch to sh instead:

$ cargo run sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/container sh`
started child with PID=162055
/ #

And finally, we can try the ps command, which a couple of weeks ago happily showed us every process on the host:

/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    2 root      0:00 ps

Now that looks like a real container!

Notes on Ubuntu

I’m writing and testing this both on my laptop (Fedora) and my desktop (Ubuntu). Fedora is my daily driver. After successfully implementing and testing mount namespaces on the laptop, I switched to the desktop to edit this blog post. Imagine my surprise when the program — which had worked perfectly before — started failing with "EACCES: Permission denied" errors.

After some investigation, I discovered that Ubuntu uses a stricter AppArmor configuration than Fedora, particularly when it comes to unprivileged user namespaces. See this Ubuntu 23.10 blog entry for details.

To temporarily disable AppArmor restrictions on unprivileged user namespaces, you can run:

sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
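The program could also detect this situation itself and print a hint instead of a bare EACCES. Here is a minimal sketch, assuming the sysctl is exposed at the usual /proc/sys location and that a missing file means the restriction does not apply; the function names are hypothetical.

```rust
use std::fs;

/// Path of the sysctl; absent on kernels without this AppArmor feature.
const SYSCTL_PATH: &str = "/proc/sys/kernel/apparmor_restrict_unprivileged_userns";

/// Interprets the sysctl contents: any non-zero integer means restricted.
/// Unparseable contents are treated as "not restricted".
fn is_restricted(contents: &str) -> bool {
    contents
        .trim()
        .parse::<u32>()
        .map(|value| value != 0)
        .unwrap_or(false)
}

/// Returns true when the restriction appears to be active on this machine.
fn userns_restricted() -> bool {
    fs::read_to_string(SYSCTL_PATH)
        .map(|contents| is_restricted(&contents))
        .unwrap_or(false)
}
```

Calling userns_restricted() before creating the namespaces would let the program point the user at the sysctl or the AppArmor profile instead of failing later.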

To permanently lift the restriction for this program only, you can add an AppArmor profile. Create a new file in /etc/apparmor.d with the following content, replacing ${PATH_TO_CONTAINER_BINARY} with the full path to the directory that contains the executable:

abi <abi/4.0>,

include <tunables/global>

${PATH_TO_CONTAINER_BINARY}/container flags=(default_allow) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/container>
}

I wanted to move on to the next namespace on my list — the network namespace — but I don’t think I’ve fully unlocked the potential of mount namespaces yet. Next time, I’ll switch to using overlayfs on top of the Alpine mini root filesystem, which will provide a much richer set of programs that can be executed inside the container.

The source code, as usual, is on GitHub.
