I've put a filesystem into your filesystem

 

Simple minimal root filesystems work well as a base for my humble container runtime, but it would be nice to have some flexibility while still keeping the root image intact. Using an Alpine minimal rootfs as the filesystem for a sandboxed application is convenient, but as soon as that application starts writing or deleting files, it pollutes the underlying filesystem. Ideally, the rootfs should be read-only.

Another issue with a minimal filesystem is that there simply aren’t that many programs available to run inside the container. In practice, there’s only busybox and a collection of symlinks pointing to it. What if I want to run a Node or Python application? I can install Python (one of the nice things about Alpine is its package manager), but that again modifies the rootfs—and I want to keep it clean. I also don’t want to reinstall Python every time I start a container.

The solution is to use overlayfs.

Overlayfs is a union filesystem that merges (or overlays) several directory trees on top of one another. The lower layers are read-only, while any changes (writes, renames, or deletions) are recorded only in the upper layer. If you’ve ever built a Docker image, you’ve probably seen overlayfs in action: Docker typically uses it on Linux, each Dockerfile instruction that changes the filesystem (such as RUN, COPY, or ADD) creates a new layer, and the final container filesystem is a union of all those layers.
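
To make the lookup rule concrete, here is a toy in-memory model of a union view. It only illustrates the idea and is not how the kernel implements overlayfs; the names Entry, Layer, and lookup are made up for this sketch.

```rust
use std::collections::HashMap;

/// What a single layer can say about a path (hypothetical toy model).
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    /// The layer holds a file with this content.
    File(String),
    /// The layer hides the path from every layer below it.
    Whiteout,
}

/// One layer: a map from path to entry.
type Layer = HashMap<String, Entry>;

/// Resolve `path` in a stack of layers ordered from top (upper) to bottom.
/// The first layer that mentions the path decides the result.
fn lookup<'a>(layers_top_down: &'a [Layer], path: &str) -> Option<&'a str> {
    for layer in layers_top_down {
        match layer.get(path) {
            Some(Entry::File(content)) => return Some(content.as_str()),
            Some(Entry::Whiteout) => return None,
            None => continue,
        }
    }
    None
}
```

A write in this model only ever touches the first (upper) map, which is why the lower layers can stay read-only.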

The code changes required to support this are fairly minimal. The main difference is that the fs directory in my container project now hosts not only the base rootfs, but all filesystem layers as well.

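Filesystem layers inside the fs directory: the base rootfs, optional layerXX directories (layer01, layer02, and so on), plus the upper, workdir, and mount directories that overlayfs needs.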

To support all this, I added a couple of helper functions to handle repetitive filesystem operations:

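use anyhow::Context;
use std::path::Path;

// Remove `dir` if it exists, then create it again as an empty directory.
fn recreate_dir<P: AsRef<Path>>(dir: P) -> anyhow::Result<()> {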
    if dir.as_ref().exists() {
        std::fs::remove_dir_all(dir.as_ref())
            .with_context(|| format!("failed to remove {:?}", dir.as_ref()))?;
    }
    std::fs::create_dir_all(dir.as_ref())
        .with_context(|| format!("failed to create {:?}", dir.as_ref()))?;
    Ok(())
}

fn create_overlay_dirs(root: &str) -> anyhow::Result<(String, String, String, String)> {
    let lower_dirs = find_lower_layers(root)?;

    let upper_dir = format!("{}/upper", root);
    recreate_dir(&upper_dir)?;

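    // overlayfs reads lowerdir left to right as top to bottom, so the newest
    // layer goes first and the base rootfs goes last (lowest).
    let lower = if lower_dirs.is_empty() {
        format!("{}/rootfs", root)
    } else {
        // find_lower_layers returns layers in ascending order; reverse them.
        let top_down: Vec<&str> = lower_dirs.rsplit(':').collect();
        format!("{}:{}/rootfs", top_down.join(":"), root)
    };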

    let workdir = Path::new(root).join("workdir");
    let rootfs = Path::new(root).join("mount");
    recreate_dir(&workdir)?;
    recreate_dir(&rootfs)?;

    Ok((
        lower,
        upper_dir,
        workdir.to_string_lossy().to_string(),
        rootfs.to_string_lossy().to_string(),
    ))
}

pub fn find_lower_layers(root: &str) -> anyhow::Result<String> {
    let mut names: Vec<String> = Vec::new();

    for entry in std::fs::read_dir(root).context("failed to read root directory")? {
        let entry = entry.context("failed to read directory entry")?;
        let file_type = entry.file_type().context("failed to get file type")?;
        if !file_type.is_dir() {
            continue;
        }
        let os_name = entry.file_name();
        if let Some(name) = os_name.to_str() {
            // match "layer" followed by exactly two digits
            let is_match = name.len() == 7
                && name.starts_with("layer")
                && name.chars().skip(5).take(2).all(|c| c.is_ascii_digit());
            if is_match {
                names.push(format!("{}/{}", root, name));
            }
        }
    }

    names.sort();
    Ok(names.join(":"))
}

The recreate_dir function does exactly what its name suggests: if a directory exists, it removes it and recreates an empty one. I use it for setting up the upper and workdir directories required by overlayfs, as well as the mount point.

The find_lower_layers function scans the fs directory for layerXX folders, sorts them alphabetically, and joins them into a single string separated by :. This is the format expected by overlayfs when specifying multiple lower directories.

Finally, create_overlay_dirs ties everything together. It discovers the existing layerXX directories, combines them with rootfs so that the Alpine root filesystem serves as the base layer, recreates the upper, workdir, and mount directories from scratch, and returns all four paths.
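
A four-element tuple of strings is easy to mix up at the call site. As one possible alternative, a small struct could name each path and build the mount option string itself. OverlayDirs and mount_options are hypothetical names for this sketch and are not part of the project.

```rust
use std::path::PathBuf;

/// Hypothetical named replacement for the (lower, upper, workdir, mount) tuple.
#[derive(Debug, Clone)]
struct OverlayDirs {
    /// Read-only layers, topmost first, which is the order overlayfs
    /// expects in its lowerdir option.
    lower: Vec<PathBuf>,
    upper: PathBuf,
    workdir: PathBuf,
    mount_point: PathBuf,
}

impl OverlayDirs {
    /// Build the data string passed to mount(2) for an overlay mount.
    fn mount_options(&self) -> String {
        let lower = self
            .lower
            .iter()
            .map(|p| p.to_string_lossy().into_owned())
            .collect::<Vec<_>>()
            .join(":");
        format!(
            "lowerdir={},upperdir={},workdir={}",
            lower,
            self.upper.display(),
            self.workdir.display()
        )
    }
}
```

With something like this, the caller would read dirs.mount_point and dirs.mount_options() instead of destructuring four positional strings.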

With that in place, a small change to create_container_filesystem is enough:

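// Assumed imports: these names match the nix crate's mount and unistd modules.
use nix::mount::{mount, umount2, MntFlags, MsFlags};
use nix::unistd::{chdir, pivot_root};
use std::fs::{create_dir_all, remove_dir, remove_dir_all};

fn create_container_filesystem(root: &str) -> anyhow::Result<()> {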
    // change the root fs propagation to private
    mount(
        None::<&str>,
        "/",
        None::<&str>,
        MsFlags::MS_REC | MsFlags::MS_PRIVATE,
        None::<&str>,
    )
    .context("private propagation for /")?;

    let (lower, upper, workdir, rootdir) = create_overlay_dirs(root)?;

    let rootfs = Path::new(&rootdir);

    let mount_opts = format!("lowerdir={},upperdir={},workdir={}", lower, upper, workdir);

    mount(
        Some("overlay"),
        rootfs,
        Some("overlay"),
        MsFlags::empty(),
        Some(mount_opts.as_str()),
    )
    .context("mount overlayfs")?;

    let proc = rootfs.join("proc");
    mount(
        Some("proc"),
        &proc,
        Some("proc"),
        MsFlags::empty(),
        None::<&str>,
    )
    .context("mount /proc")?;

    // prepare for pivot_root
    let old_root = rootfs.join(".old_root");
    if old_root.exists() {
        remove_dir_all(&old_root).context("remove old_root")?;
    }
    create_dir_all(&old_root).context("create old_root")?;

    // pivot_root and unmount old_root
    pivot_root(rootfs, &old_root).context("pivot_root")?;
    chdir("/").context("chdir to /")?;
    umount2("/.old_root", MntFlags::MNT_DETACH).context("umount old_root")?;
    let _ = remove_dir("/.old_root");

    Ok(())
}

The code that previously bind-mounted the rootfs is now replaced with logic that sets up and mounts an overlay filesystem. The pivot-root logic remains unchanged—the only difference is that the directory we pivot into now contains an overlayfs mount.

Testing it out

Alpine rootfs doesn’t ship with a default resolv.conf, so I created a fs/layer01/etc/resolv.conf file containing:

nameserver 127.0.0.1

If I start a shell in the container and inspect /etc/resolv.conf, I should see that file:

$ cargo run /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/container /bin/sh`
started child with PID=211854
/ # cat /etc/resolv.conf
nameserver 127.0.0.1
/ #

Let’s go further and install Python inside the container, then turn that into a new filesystem layer.

First, I need a working DNS server, so I replace nameserver 127.0.0.1 with nameserver 1.1.1.1 in layer01/etc/resolv.conf.

Second, I need to tweak Alpine’s package manager configuration. There are some issues with TLS certificates at the moment; I’ll dig into that later, but for now switching the URLs in rootfs/etc/apk/repositories from https to http is good enough.

With that out of the way, I can install Python:

$ cargo run /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/container /bin/sh`
started child with PID=223246
/ # apk add python3
( 1/17) Installing libbz2 (1.0.8-r6)
( 2/17) Installing libexpat (2.7.3-r0)
( 3/17) Installing libffi (3.5.2-r0)
( 4/17) Installing gdbm (1.26-r0)
( 5/17) Installing xz-libs (5.8.1-r0)
( 6/17) Installing libgcc (15.2.0-r2)
( 7/17) Installing libstdc++ (15.2.0-r2)
( 8/17) Installing mpdecimal (4.0.1-r0)
( 9/17) Installing ncurses-terminfo-base (6.5_p20251123-r0)
(10/17) Installing libncursesw (6.5_p20251123-r0)
(11/17) Installing libpanelw (6.5_p20251123-r0)
(12/17) Installing readline (8.3.1-r0)
(13/17) Installing sqlite-libs (3.51.1-r0)
(14/17) Installing python3 (3.12.12-r0)
(15/17) Installing python3-pycache-pyc0 (3.12.12-r0)
(16/17) Installing pyc (3.12.12-r0)
(17/17) Installing python3-pyc (3.12.12-r0)
Executing busybox-1.37.0-r30.trigger
OK: 47.0 MiB in 33 packages
/ # 

Now I exit the shell, rename fs/upper to fs/layer02, and start the container again. All changes made during the Python installation lived in the overlay’s upper layer; by turning it into a lower layer, Python becomes part of the read-only filesystem:
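
The rename step could be automated. The sketch below uses a hypothetical helper, promote_upper, that finds the highest existing layerNN directory under the fs root and renames upper to the next number. It assumes the container has exited and the overlay is no longer mounted, and it keeps the two-digit naming scheme that find_lower_layers looks for.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Rename `<root>/upper` to the next free `<root>/layerNN` and return the new path.
/// Hypothetical helper; run it only while the overlay is not mounted.
fn promote_upper(root: &Path) -> io::Result<PathBuf> {
    let mut highest = 0u32;
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        if !entry.file_type()?.is_dir() {
            continue;
        }
        let name = entry.file_name();
        // Same naming rule as find_lower_layers: "layer" plus two digits.
        let number = name
            .to_str()
            .and_then(|n| n.strip_prefix("layer"))
            .filter(|digits| digits.len() == 2 && digits.bytes().all(|b| b.is_ascii_digit()))
            .and_then(|digits| digits.parse::<u32>().ok());
        if let Some(n) = number {
            highest = highest.max(n);
        }
    }
    if highest >= 99 {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            "no free two-digit layer number left",
        ));
    }
    let target = root.join(format!("layer{:02}", highest + 1));
    fs::rename(root.join("upper"), &target)?;
    Ok(target)
}
```

Any whiteouts recorded in upper travel with the rename, so files deleted during the session stay hidden once the directory becomes a lower layer.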

$ cargo run /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/container /bin/sh`
started child with PID=238975
/ # python3
Python 3.12.12 (main, Oct 11 2025, 01:16:26) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Mission accomplished: the original rootfs is unchanged, Python is installed, and it now lives in a read-only overlay layer.

I can even delete it during runtime and watch it “disappear”:

/ # which python3
/usr/bin/python3
/ # rm /usr/bin/python3
/ # python3
/bin/sh: python3: not found
/ #

After restarting the container, it’s back again:

$ cargo run /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running `target/debug/container /bin/sh`
started child with PID=242728
/ # python3
Python 3.12.12 (main, Oct 11 2025, 01:16:26) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

When overlayfs is asked to delete a file from a lower layer, it doesn’t actually remove it. Instead, it creates a whiteout: a marker in the upper layer indicating that the file should be hidden. Once the upper layer is cleared on the next container start, the file becomes visible again.
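
Whiteouts can be spotted directly in the upper directory, because overlayfs normally represents one as a character device with device number 0/0. Here is a minimal sketch, with the hypothetical names Change and list_changes, that walks an upper directory and reports whiteouts separately from added or modified files. It ignores opaque directories, which overlayfs marks with an extended attribute.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{FileTypeExt, MetadataExt};
use std::path::{Path, PathBuf};

/// How a path in the upper layer differs from the lower layers (hypothetical type).
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Change {
    /// A file, symlink, or other node that was added or modified.
    Written(PathBuf),
    /// A whiteout: the path was deleted inside the container.
    Deleted(PathBuf),
}

/// Recursively collect changes recorded under `upper`, with paths relative to it.
fn list_changes(upper: &Path) -> io::Result<Vec<Change>> {
    let mut changes = Vec::new();
    let mut pending = vec![upper.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            // symlink_metadata does not follow symlinks, so links are reported as-is.
            let meta = fs::symlink_metadata(&path)?;
            let rel = path.strip_prefix(upper).unwrap_or(path.as_path()).to_path_buf();
            if meta.is_dir() {
                pending.push(path);
            } else if meta.file_type().is_char_device() && meta.rdev() == 0 {
                changes.push(Change::Deleted(rel));
            } else {
                changes.push(Change::Written(rel));
            }
        }
    }
    // Sort for stable output: written paths first, then deleted ones.
    changes.sort();
    Ok(changes)
}
```

Run against fs/upper after the session above, the removed /usr/bin/python3 would be expected to appear as a Deleted entry.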

At this point, the filesystem sandbox not only protects the host filesystem, but also makes it easy to inspect and persist filesystem changes produced by sandboxed programs.

The source code, as usual, is on GitHub.
