The one where resources are limited

 

One of the features of modern containers is the ability to limit the resources that a container can use. I'll add the simplest, most limited variant of this to my container runtime: setting CPU and memory limits with cgroups. Control groups, also known as cgroups, are a Linux kernel feature that lets you limit and monitor the resource usage (CPU, memory, disk I/O, network, etc.) of a set of processes.

There are two versions of the cgroups feature: v1 is the legacy version, and v2 has been the official one since Linux 4.5. As I am targeting a 6.17+ kernel, there’s no point in even talking about v1.

Kernel cgroups are mapped to the filesystem. The root cgroup is at /sys/fs/cgroup, and each cgroup is a directory under it.

Of course, there is a man page for cgroups: cgroups(7).

To create a new cgroup, you create a new directory under the root cgroup. The name of the directory is the name of the cgroup. To let this new cgroup control resources for its children, we need to add the controllers we’re interested in to its cgroup.subtree_control file. For example, if we want to control CPU and memory and we’ve created a toy_container cgroup in /sys/fs/cgroup/toy_container, we need to write the line +cpu +memory to the /sys/fs/cgroup/toy_container/cgroup.subtree_control file. A controller can only be enabled there if the parent cgroup already has it enabled, so in case these controllers are not enabled in the root cgroup, we should first write the same string to the /sys/fs/cgroup/cgroup.subtree_control file.
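The steps above can be sketched roughly as follows. This is a minimal sketch and not the code from the repository: the function name create_cgroup and its root parameter are my own assumptions, and root is passed in (instead of hard-coding /sys/fs/cgroup) so the function can be tried against a scratch directory without root privileges.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: creates `<root>/<name>` and enables the cpu and
/// memory controllers for its children.
pub fn create_cgroup(root: &Path, name: &str) -> io::Result<PathBuf> {
    let controllers = "+cpu +memory";

    // Parent first: a controller can only be enabled in a cgroup's
    // cgroup.subtree_control if its parent already distributes it.
    fs::write(root.join("cgroup.subtree_control"), controllers)?;

    // On the cgroup filesystem, creating the directory creates the cgroup.
    let dir = root.join(name);
    fs::create_dir_all(&dir)?;

    // Let the new cgroup pass both controllers on to its own children.
    fs::write(dir.join("cgroup.subtree_control"), controllers)?;
    Ok(dir)
}
```

On a real system this would be called as create_cgroup(Path::new("/sys/fs/cgroup"), "toy_container") by a process running as root.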

The man page for cgroups states the following:

a (nonroot) cgroup can't both (1) have member processes, and (2) distribute resources into child cgroups—that is, have a nonempty cgroup.subtree_control file.

This means we cannot add processes directly to our toy_container cgroup, so we need to create a new child cgroup under it. Quoting the man page again:

The recommended approach in cgroups v2 is to create a subdirectory called leaf for any nonleaf cgroup which should contain processes, but no child cgroups

We’ll create a new cgroup in /sys/fs/cgroup/toy_container/leaf and add our child process to it. To add a process to a cgroup, we need to write the process’s PID to the cgroup.procs file in the cgroup’s directory.
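As a rough illustration of the leaf step, here is a small sketch. The function name add_to_leaf is hypothetical, and the parent cgroup directory is a parameter so the sketch can be exercised on an ordinary directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: creates `<cgroup_dir>/leaf` and moves `pid` into it.
pub fn add_to_leaf(cgroup_dir: &Path, pid: u32) -> io::Result<PathBuf> {
    let leaf = cgroup_dir.join("leaf");
    fs::create_dir_all(&leaf)?;

    // On the cgroup filesystem, writing a PID to cgroup.procs migrates
    // that process into the cgroup.
    fs::write(leaf.join("cgroup.procs"), pid.to_string())?;
    Ok(leaf)
}
```

The leaf cgroup has no children of its own, so it never needs a cgroup.subtree_control entry and is free to hold processes.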

One last note before we implement this: to create and manipulate cgroups, we need root privileges. Oh, and it would be nice to clean everything up when we’re done.

For the sake of doing it a bit differently, I'll implement cgroups as a Rust struct and implement the Drop trait to clean up the cgroups when the struct goes out of scope.

I won't post the full code here; anyone interested can check it out on GitHub.

The struct itself is very simple: it holds the path to the cgroup directory and the name of the cgroup.

pub struct Cgroup {
    path: PathBuf,
    cgroup: String,
}
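The cleanup on drop could look something like the sketch below. It is not the Drop implementation from the repository: CgroupGuard and its dir field are hypothetical stand-ins for the Cgroup struct, and the sketch assumes the toy_container/leaf layout described earlier.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical stand-in for `Cgroup`; only creation and cleanup are shown.
pub struct CgroupGuard {
    /// Directory of the parent cgroup, e.g. /sys/fs/cgroup/toy_container
    dir: PathBuf,
}

impl CgroupGuard {
    /// Creates the parent cgroup directory together with its `leaf` child.
    pub fn new(dir: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(dir.join("leaf"))?;
        Ok(Self { dir })
    }
}

impl Drop for CgroupGuard {
    fn drop(&mut self) {
        // The child cgroup has to go before its parent. Errors are ignored
        // because drop() has no way to return them.
        let _ = fs::remove_dir(self.dir.join("leaf"));
        let _ = fs::remove_dir(&self.dir);
    }
}
```

On the cgroup filesystem a plain rmdir is enough: the kernel-provided interface files do not need to be deleted first, but the cgroup must have no live member processes and no child cgroups.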

Here are three functions: one adds a process to the cgroup, and the other two set the memory and CPU limits.

/// Adds a process to this cgroup.
///
/// # Arguments
/// * `pid` - Process ID to add to the cgroup
pub fn add_process(&self, pid: i32) -> Result<()> {
    let procs_file = self.path.join(&self.cgroup).join("cgroup.procs");
    fs::write(&procs_file, pid.to_string())
        .with_context(|| format!("Failed to add process {} to cgroup", pid))?;
    Ok(())
}

/// Sets the memory limit for a cgroup.
///
/// # Arguments
/// * `limit` - Memory limit string (e.g., "100M", "1G")
pub fn set_memory_limit(&self, limit: &str) -> Result<()> {
    let memory_max = self.path.join(&self.cgroup).join("memory.max");
    fs::write(&memory_max, limit)
        .with_context(|| format!("Failed to write to {:?}", memory_max))?;
    Ok(())
}

/// Sets the CPU limit for a cgroup.
///
/// # Arguments
/// * `quota` - CPU quota as a decimal string (e.g., "0.5" for 50% of one CPU)
pub fn set_cpu_limit(&self, quota: &str) -> Result<()> {
    let cpu_quota_str = parse_cpu_quota(quota)
        .with_context(|| format!("Failed to parse CPU quota '{}'", quota))?;

    let cpu_max = self.path.join(&self.cgroup).join("cpu.max");
    fs::write(&cpu_max, cpu_quota_str)
        .with_context(|| format!("Failed to write to {:?}", cpu_max))?;
    Ok(())
}
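The parse_cpu_quota helper is not shown in this post. Judging by the cpu.max value that appears later ("50000 100000" for --cpu 0.5), it turns a fraction of one CPU into a "quota period" pair. Below is a hedged guess at what such a conversion might look like. The name cpu_fraction_to_cpu_max, the fixed 100000 microsecond period (the kernel's default period for cpu.max), and the String error type are all my assumptions.

```rust
/// Period for the quota, in microseconds (assumed; matches the kernel default).
const CPU_PERIOD_US: u64 = 100_000;

/// Hypothetical conversion: turns a fraction of one CPU such as "0.5" into
/// a cpu.max value of the form "<quota> <period>".
pub fn cpu_fraction_to_cpu_max(fraction: &str) -> Result<String, String> {
    let value: f64 = fraction
        .trim()
        .parse()
        .map_err(|e| format!("invalid CPU quota '{}': {}", fraction, e))?;

    // Reject NaN, infinities, zero and negative numbers.
    if !value.is_finite() || value <= 0.0 {
        return Err(format!("CPU quota must be a positive number, got '{}'", fraction));
    }

    // Quota is the CPU time (in microseconds) allowed per period.
    let quota_us = (value * CPU_PERIOD_US as f64).round() as u64;
    if quota_us == 0 {
        return Err(format!("CPU quota '{}' is too small", fraction));
    }
    Ok(format!("{} {}", quota_us, CPU_PERIOD_US))
}
```

With this sketch, "0.5" becomes "50000 100000" and "2" becomes "200000 100000", which allows up to two full CPUs.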

To use the cgroup, I'll add two command line arguments to configure memory and CPU limits.

~ struct Args {
+    /// CPU quota for the container as a fraction of one CPU, e.g. 0.5, 1, etc
+    #[arg(short, long)]
+    cpu: Option<String>,
+
+    /// Memory limit for the container in bytes or with an M/G suffix, e.g. 128M, 1G, etc
+    #[arg(short, long)]
+    mem: Option<String>,

I will also add these as arguments to the run_in_container function:

~pub fn run_in_container(
~    command: &str,
~    args: &[String],
+    cpu: &Option<String>,
+    mem: &Option<String>,
~) -> anyhow::Result<()> {

The last change is to instantiate the Cgroup struct and use it to set the limits, if any limits were configured. As we must be root to set up cgroups, the code will only run if the effective UID is 0.

+    // keep variable here, so if we use cgroup, it will be dropped automatically
+    // when run_in_container finishes
+    let mut _cgroup: Option<Cgroup> = None;
~
~    if uid == 0 {
~        net::setup_network_host(&container_net_cidr)?;
~        net::move_into_container(child_pid)?;
+
+        let cg = Cgroup::new(cpu, mem)?;
+        cg.add_process(child_pid.as_raw())?;
+        _cgroup = Some(cg);
~    }

To test it, we can check the cgroup file under the child process's PID in /proc:

$ cargo build && sudo target/debug/container --mem 512M --cpu 0.5 /bin/sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
Setting up cgroup "/sys/fs/cgroup/toy_container"
started child with PID=120933
/ # cat /proc/$$/cgroup
0::/toy_container/leaf
/ #
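The same check can be done programmatically. Here is a small sketch that pulls the cgroup v2 path out of the contents of /proc/<pid>/cgroup; the function name cgroup_v2_path is hypothetical. In that file, the v2 entry is the line with hierarchy ID 0 and an empty controller list, so it starts with "0::".

```rust
/// Hypothetical helper: returns the cgroup v2 path from the contents of
/// /proc/<pid>/cgroup, or None if there is no "0::<path>" line.
pub fn cgroup_v2_path(proc_cgroup: &str) -> Option<&str> {
    proc_cgroup
        .lines()
        .find_map(|line| line.strip_prefix("0::"))
}
```

Fed with the output above, this returns "/toy_container/leaf", which is relative to the /sys/fs/cgroup mount point.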

We can verify the memory and CPU limits by looking at the memory.max and cpu.max files in the cgroup directory. Run these commands from outside the container:

$ cat /sys/fs/cgroup/toy_container/leaf/memory.max
536870912
$ cat /sys/fs/cgroup/toy_container/leaf/cpu.max
50000 100000
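To make sense of these numbers: 536870912 bytes is 512 * 1024 * 1024, and cpu.max is a "quota period" pair in microseconds, so 50000 out of 100000 is half a CPU. A minimal sketch of reading the values back follows; both function names are hypothetical. An unlimited cgroup reports max instead of a number, which the sketch maps to None.

```rust
/// Hypothetical helper: parses memory.max; "max" (no limit) yields None.
pub fn memory_max_bytes(memory_max: &str) -> Option<u64> {
    memory_max.trim().parse().ok()
}

/// Hypothetical helper: parses cpu.max ("<quota> <period>") into a fraction
/// of one CPU; a quota of "max" (no limit) yields None.
pub fn cpu_max_to_fraction(cpu_max: &str) -> Option<f64> {
    let mut fields = cpu_max.split_whitespace();
    let quota = fields.next()?;
    let period: f64 = fields.next()?.parse().ok()?;
    if quota == "max" || period <= 0.0 {
        return None;
    }
    let quota: f64 = quota.parse().ok()?;
    Some(quota / period)
}
```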

When we exit the shell and the container terminates, the cgroup is automatically deleted. To verify, check that there's no longer a toy_container directory in /sys/fs/cgroup.

Now we can limit the resources of the container. This just scratches the surface of cgroups, but it'll do for now.

As usual, the source code for this post is on GitHub.
