The one with network shenanigans

 

Just a bit of work is left for my "container runtime" to feel (and work!) almost like a real container runtime. Today I’ll tackle one of the bigger pieces: setting up networking between the container and the host.

As usual, there’s a lot of information buried in man pages. It’s all there, but deciphering it can take some time. There's no need to read everything up front; the links are here for reference when needed.

As you might expect, Linux provides a namespace for isolating a process’s network view. The CLONE_NEWNET flag is responsible for creating a new network namespace during clone().

Adding another clone flag is trivial:

let clone_flags =
    CloneFlags::CLONE_NEWPID
    | CloneFlags::CLONE_NEWUSER
    | CloneFlags::CLONE_NEWNS
    | CloneFlags::CLONE_NEWNET;

If we start a shell inside the container and run ip, we can see that it’s running inside a brand‑new, mostly empty network namespace:

$ cargo run /bin/sh
/ # ip addr list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

There’s only a loopback interface, and even that one is down. This is exactly what we want from an isolation perspective — the container can’t see the host network at all.

Unfortunately, it’s also pretty useless.

In most cases, processes running in a container need to talk to the outside world, so we need to poke a hole in this isolated network namespace.

Poking a hole

The plan is to create a veth pair — two virtual Ethernet devices that behave as if a cable connects them. Anything sent to one end appears on the other.

Following the long‑standing tradition of unreadable interface names, I'll call the bridge br0, the host end of the veth pair veth0h0, and the container end veth0c0.

The container interface will get an IP address. The host interface will be attached to the bridge device and the bridge will get another IP address.

I looked into two ways of configuring networking from Rust:

rtnetlink requires an async runtime such as tokio. I haven't investigated it thoroughly, but there might be issues with the runtime after cloning. An LLM suggested creating a new runtime in the cloned process, but I'm not sure I trust that.

I experimented with creating a runtime only when needed and it worked, but this post is about networking, not async runtimes — something I’m not confident enough with to explain properly. I also couldn’t find a clean way to attach an interface to a bridge using rtnetlink, so I ended up needing ip anyway.

Given all that, I’ll use the ip command directly.

Privileges

Creating bridges and veth pairs on the host requires the CAP_NET_ADMIN capability, which effectively means running as root. This goes against my original goal of keeping the runtime rootless, but for this toy runtime there’s not much I can do. Real rootless container runtimes solve this with more advanced techniques, but that’s far beyond the scope of this blog.

I've decided on a compromise: if the runtime runs as root, I will set up full networking; otherwise I'll create a network namespace with only the loopback interface.

Tying it all together

Setting up networking is more complex than anything we’ve done so far. Here’s the full sequence: create the bridge and assign it an IP address, create the veth pair and attach the host end to the bridge, move the container end into the new network namespace, then, inside the container, assign the container IP, set the default route via the bridge, and bring the interfaces up. When the container exits, delete the bridge.

Real container runtimes typically create the bridge once and reuse it. For simplicity, I create everything on startup and destroy on shutdown. If the runtime crashes, cleanup won’t happen — that’s a sacrifice I am willing to make.
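For concreteness, here is roughly the same sequence expressed as plain ip commands - just a sketch, using the interface names above and the 192.168.200.0/24 range I'll settle on later in this post; the runtime will issue the equivalent commands programmatically:

# on the host, as root
ip link add name br0 type bridge
ip addr add 192.168.200.1/24 dev br0
ip link set dev br0 up
ip link add name veth0h0 type veth peer name veth0c0
ip link set dev veth0h0 up
ip link set dev veth0h0 master br0
ip link set dev veth0c0 netns <container-pid>

# inside the container's network namespace
ip addr add 192.168.200.2/24 dev veth0c0
ip link set dev veth0c0 up
ip route add default via 192.168.200.1 dev veth0c0
ip link set dev lo up

# on shutdown
ip link delete br0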


Refactoring

Before touching networking, a small cleanup is in order.

Filesystem Code

All filesystem‑related functions were moved from the container module into a new fs module.

Switching from execvp to execve

execvp searches $PATH for the executable and that's convenient, but the environment is inherited from the parent and that $PATH has nothing to do with the Alpine rootfs inside the container.

To fix this, I switched to execve and explicitly set PATH:


-   execvp(&cmd_cstring, &c_args).context("failed to execute command")?;

+   let mut c_env: Vec<CString> = Vec::new();
+   for (key, value) in std::env::vars() {
+       let updated_value = if key == "PATH" {
+           // overwrite the PATH env variable to match alpine rootfs
+           String::from("/bin:/sbin:/usr/bin:/usr/sbin")
+       } else {
+           value
+       };
+       let pair = format!("{}={}", key, updated_value);
+       c_env.push(CString::new(pair).context("failed to convert env var to CString")?);
+   }
+   // execve replaces the current process, so this only returns on error
+   execve(&cmd_cstring, &c_args, &c_env).context("failed to execute command")?;

From now on I will need to pass an absolute path for the executable I want to run in the container.

Overlay Directories

Overlay filesystem directories are now created by the parent process. This avoids permission issues when running the container as root. The downside is that switching back to rootless execution requires cleaning those directories manually.

I split create_overlay_dirs into two functions: get_overlay_dirs, which is used in create_container_filesystem, and create_overlay_dirs, which is called by the parent process before syncing with the child.

-fn create_overlay_dirs(root: &str) -> anyhow::Result<(String, String, String, String)> {
+fn get_overlay_dirs(root: &str) -> anyhow::Result<(String, String, String, String)> {

    let lower_dirs = find_lower_layers(root)?;
    let upper_dir = format!("{}/upper", root);

-    recreate_dir(&upper_dir)?;

    let lower = if lower_dirs.is_empty() {
        format!("{}/rootfs", root)
    } else {
        format!("{}/rootfs:{}", root, lower_dirs)
    };

    let workdir = Path::new(root).join("workdir");
    let rootfs = Path::new(root).join("mount");

-    recreate_dir(&workdir)?;
-    recreate_dir(&rootfs)?;

    Ok((
        lower,
        upper_dir,
        workdir.to_string_lossy().to_string(),
        rootfs.to_string_lossy().to_string(),
    ))
}

+pub(crate) fn create_overlay_dirs(root: &str) -> anyhow::Result<()> {
+    let upper_dir = format!("{}/upper", root);
+    recreate_dir(&upper_dir)?;
+
+    let workdir = Path::new(root).join("workdir");
+    let rootfs = Path::new(root).join("mount");
+    recreate_dir(&workdir)?;
+    recreate_dir(&rootfs)?;
+
+    Ok(())
+}

The host process will create the directories right after writing the uid/gid mappings:

    write_proc_file(child_pid, "uid_map", &format!("0 {} 1\n", uid))?;
    write_proc_file(child_pid, "setgroups", "deny\n")?;
    write_proc_file(child_pid, "gid_map", &format!("0 {} 1\n", gid))?;

+    fs::create_overlay_dirs("fs")?;

This makes it possible to run the container with root permissions, but it also means the overlayfs directories are now owned by the host's root user. If you run the container rootless after that, it will fail to recreate these directories; they will need to be deleted with sudo first.
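A one-liner along these lines should do it (assuming the fs root directory used here):

# remove the root-owned overlay directories before running rootless again
sudo rm -rf fs/upper fs/workdir fs/mount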

The code to set up the network

First, let's set some boundaries. I am going to hardcode the interface names and IP addresses - that obviously wouldn't fly in a real container runtime, but it's enough for my little toy. My goal is for the container to be able to reach the host machine over the network; getting internet access from inside the container is not a goal. It's not that hard and I will show the commands that do it, but I won't include them in my network setup. And finally, I am only going to worry about IPv4.

With all that out of the way, it's time to configure the network. I'll create a new net.rs module for this. I will also need the cidr crate for operations on network address ranges.

Let's add some constants and a utility function to the net.rs module: the network interface names and a helper that runs the ip command to create interfaces, assign IP addresses, and so on.

const BRIDGE_NAME: &str = "br0";
const VETH_HOST: &str = "veth0h0";
const VETH_CONTAINER: &str = "veth0c0";

/// executes ip command with arguments
fn ip(args: &[&str]) -> anyhow::Result<()> {
    let status = Command::new("/sbin/ip")
        .args(args)
        .status()
        .context(format!("Failed to execute ip {:?}", args))?;
    if !status.success() {
        anyhow::bail!("ip {:?} exited with {}", args, status);
    }
    Ok(())
}

Next we need functions to create the veth pair and the bridge interface. The virtual Ethernet devices will serve as an Ethernet cable into our new network namespace; the bridge device will forward packets between the two networks.

The bridge will also need an IP address.

/// creates a bridge with the given IP address and brings the interface up
fn create_bridge(ipaddr: &Ipv4Addr) -> anyhow::Result<()> {
    ip(&["link", "add", "name", BRIDGE_NAME, "type", "bridge"]).context("creating bridge")?;
    ip(&[
        "addr",
        "add",
        format!("{}/24", ipaddr).as_str(),
        "dev",
        BRIDGE_NAME,
    ])
    .context("adding IP address to bridge")?;
    ip(&["link", "set", "dev", BRIDGE_NAME, "up"]).context("bringing up bridge")?;
    Ok(())
}


/// creates a veth pair, brings the host side up, and attaches it to the bridge
fn create_veth_pair() -> anyhow::Result<()> {
    // create veth pair
    ip(&[
        "link",
        "add",
        "name",
        VETH_HOST,
        "type",
        "veth",
        "peer",
        "name",
        VETH_CONTAINER,
    ])
    .context("creating veth pair")?;

    // bring host side up
    ip(&["link", "set", "dev", VETH_HOST, "up"]).context("bringing up host side")?;

    // attach host veth side to the bridge interface
    ip(&["link", "set", "dev", VETH_HOST, "master", BRIDGE_NAME])
        .context("attaching host side to the bridge")?;

    Ok(())
}

I will allocate a whole /24 CIDR to the container network and use the first usable address in the range for the bridge (the host IP) and the second one for the container (the container IP) - with 192.168.200.0/24, for example, that's 192.168.200.1 for the bridge and 192.168.200.2 for the container. To calculate these addresses from the CIDR block I need a small helper:

/// returns the host and container IP addresses: the first and second usable addresses in the CIDR
fn ips_from_cidr(netw: &Ipv4Cidr) -> anyhow::Result<(Ipv4Addr, Ipv4Addr)> {
    let mut cidr_iter = netw.iter();
    let host_ip = cidr_iter
        .nth(1)
        .context("get host address from cidr")?
        .address();
    let container_ip = cidr_iter
        .next()
        .context("get container address from cidr")?
        .address();
    Ok((host_ip, container_ip))
}

We need to create the veth pair and the bridge device in the parent process. Let's add a function to do that:


/// setup the network on the host side:
/// - create the bridge and assign the first usable address in the CIDR to it
/// - create the veth pair and attach the host side to the bridge
pub(crate) fn setup_network_host(netw: &Ipv4Cidr) -> anyhow::Result<()> {
    let (host_ip, _) = ips_from_cidr(netw)?;

    create_bridge(&host_ip)?;
    create_veth_pair()?;

    Ok(())
}

We don't need the container IP at this point. The container side of the veth pair will be assigned its IP address inside the container: assigning the IP in the parent namespace and then moving the device into the container namespace removes the assigned IP from the device. I think that's understandable - IP addresses shouldn't be transferable between namespaces.
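You can see this behavior by hand with ip (run as root) - a quick sketch, assuming a scratch namespace called test and throwaway device names vethA and vethB, none of which are used by the runtime:

ip netns add test
ip link add name vethA type veth peer name vethB
ip addr add 10.0.0.1/24 dev vethB
ip link set dev vethB netns test
# vethB now shows up in the namespace, but without the address we just assigned
ip netns exec test ip addr show vethB
ip netns delete test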

Now to the container side of things. We'll need a function to move the network device into the container namespace, which can be done using the child's PID. We will also need to assign an IP address to the veth device in the container namespace and bring it up.

pub(crate) fn move_into_container(child_pid: Pid) -> anyhow::Result<()> {
    // move child side to child namespace
    let pid_s: String = child_pid.to_string();
    ip(&[
        "link",
        "set",
        "dev",
        VETH_CONTAINER,
        "netns",
        pid_s.as_str(),
    ])
    .context("moving veth0c0 to child namespace")?;

    Ok(())
}


/// bring up the network on the container side:
/// - if `veth_up` is true, assign the container IP, bring up the veth device, and add the default route via the host
/// - bring up the loopback interface
pub(crate) fn bring_up_container_net(netw: &Ipv4Cidr, veth_up: bool) -> anyhow::Result<()> {
    let (host_ip, container_ip) = ips_from_cidr(netw)?;

    if veth_up {
        // assign IP address to container veth side
        ip(&[
            "addr",
            "add",
            format!("{}/24", container_ip).as_str(),
            "dev",
            VETH_CONTAINER,
        ])
        .context("adding IP address to container veth")?;

        // bring container side up
        ip(&["link", "set", "dev", VETH_CONTAINER, "up"])
            .context("bringing up container veth side")?;

        // configure default gateway
        ip(&[
            "route",
            "add",
            "default",
            "via",
            host_ip.to_string().as_str(),
            "dev",
            VETH_CONTAINER,
        ])
        .context("configure default route")?;
    }
    ip(&["link", "set", "dev", "lo", "up"]).context("bringing up lo in container")?;

    Ok(())
}

One note about the veth_up flag: as I said before, creating new network devices requires the CAP_NET_ADMIN capability. If I don't want to run the container with sudo, I have to skip everything involving br0 and the veth devices; in that case, the only thing that can be done inside the container is bringing up the loopback interface. The veth_up parameter controls whether we touch the veth0c0 device at all; the caller will set it based on whether the runtime is running as root.

One final touch - the parent needs to clean up after the child terminates. The veth pair is removed automatically when the kernel destroys the network namespace, so we only need to delete the bridge.

pub(crate) fn cleanup_network() -> anyhow::Result<()> {
    ip(&["link", "delete", BRIDGE_NAME]).context("removing bridge device")?;

    Ok(())
}

Host side

Now we have all the building blocks to add the network configuration. In container.rs, in the run_in_container function, we create the CIDR for the container network and update the clone flags:

-    let clone_flags =
-        CloneFlags::CLONE_NEWPID | CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWNS;

+    let clone_flags = CloneFlags::CLONE_NEWPID
+        | CloneFlags::CLONE_NEWUSER
+        | CloneFlags::CLONE_NEWNS
+        | CloneFlags::CLONE_NEWNET;

+    let container_net_cidr =
+        Ipv4Cidr::new(Ipv4Addr::new(192, 168, 200, 0), 24).context("invalid CIDR")?;

The closure in the clone() call is modified to pass the new parameters down into the child() function:


+                let is_parent_root = uid == 0;

                 // This runs in the child process with PID 1 in the new namespace
-                if let Err(e) = child(command, args) {
+                if let Err(e) = child(command, args, &container_net_cidr, is_parent_root) {

We also need to add the network setup and cleanup to the run_in_container function. We only touch the network if we're running as root:

    write_proc_file(child_pid, "uid_map", &format!("0 {} 1\n", uid))?;
    write_proc_file(child_pid, "setgroups", "deny\n")?;
    write_proc_file(child_pid, "gid_map", &format!("0 {} 1\n", gid))?;

    fs::create_overlay_dirs("fs")?;

+    if uid == 0 {
+        net::setup_network_host(&container_net_cidr)?;
+        net::move_into_container(child_pid)?;
+    }

    write(&write_fd, b"1")?;
    close(write_fd)?;

    println!("started child with PID={}", child_pid);
    let _ = wait_for_child(child_pid);

+   if uid == 0 {
+       net::cleanup_network()?;
+   }

    Ok(())

Container side

The last thing we need to change is our child() function. It will accept the CIDR and a flag indicating whether the parent was executed by the root user. We also add a call to bring_up_container_net from the net.rs module, making sure it happens AFTER the filesystem is set up: bringing up the network relies on the ip command being available, and mounting the Alpine rootfs for the container provides it.

-fn child(command: &str, args: &[String]) -> anyhow::Result<()> {

+ fn child(
+    command: &str,
+    args: &[String],
+    netw: &Ipv4Cidr,
+    is_parent_root: bool,
+ ) -> anyhow::Result<()> {

     fs::create_container_filesystem("fs")?;

+    net::bring_up_container_net(netw, is_parent_root)?;

Running

That's it. Let's run the container now and check the network:

$ cargo build && sudo target/debug/container /bin/sh
   Compiling container v0.1.0 (/home/raven/projects/container_blog)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.38s
started child with PID=11807
/ # ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
7: veth0c0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1e:c8:6c:43:ce:f1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.200.2/24 scope global veth0c0
       valid_lft forever preferred_lft forever
    inet6 fe80::1cc8:6cff:fe43:cef1/64 scope link
       valid_lft forever preferred_lft forever
/ #

Yay! We have our veth0c0 interface with the 192.168.200.2 address assigned. Let's check if we can reach the host:

/ # ping 192.168.200.1
PING 192.168.200.1 (192.168.200.1): 56 data bytes
64 bytes from 192.168.200.1: seq=0 ttl=64 time=0.122 ms
64 bytes from 192.168.200.1: seq=1 ttl=64 time=0.102 ms
64 bytes from 192.168.200.1: seq=2 ttl=64 time=0.114 ms
^C
--- 192.168.200.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.102/0.112/0.122 ms

And let's check whether the host can see the container - this needs to run in a normal terminal, not in the container shell:

$ ping 192.168.200.2
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 ttl=64 time=0.150 ms
64 bytes from 192.168.200.2: icmp_seq=2 ttl=64 time=0.074 ms
^C
--- 192.168.200.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1049ms
rtt min/avg/max/mdev = 0.074/0.112/0.150/0.038 ms

The shell runs in the isolated network namespace, yet it can connect to the host machine. It cannot reach the internet, though:

/ # ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
^C
--- 1.1.1.1 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

To access the internet from the container, the host must have IPv4 forwarding enabled and NAT and forwarding rules configured. That goes beyond the scope of this (already rather large) blog post, but here are the commands that do it (without explanation):

sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 192.168.200.0/24 -o <wan_intf> -j MASQUERADE
iptables -A FORWARD -i br0 -o <wan_intf> -j ACCEPT
iptables -A FORWARD -i <wan_intf> -o br0 -m state --state RELATED,ESTABLISHED -j ACCEPT

The source code for this post is on GitHub as always.

The container is almost done. It has PID, user, filesystem and networking isolation. There are a couple more things that need to be done:

I'll try to tackle all of these in the next post.
