1.8. Under the hood

A closer look at the docker command and the runtime

We’ve learned that the term “Docker” is used somewhat imprecisely. It can refer to several components: the CLI (command-line interface), the Docker Engine, the OCI image format, and the container runtime. Let’s take a closer look at what happens when we run the following command:

docker run --rm -d --name sleep-container alpine sleep 900

We will look at the meaning of --rm and the other arguments later. For now, we only need to know that we started a container that sleeps for 900 seconds on our host. First, let us get the process ID of the sleep process we just started:

docker inspect --format '{{.State.Pid}}' sleep-container

Now let us look at the running process. In the webshell, the Docker backend runs in another container, so we first switch into that container:

kubectl exec  $(kubectl get pod -l "app.kubernetes.io/name"=webshell -o name) -it -c dind -- sh

Don’t worry, this command will make sense after the Kubernetes Security training.

Let us now look at the process on the host together with its parents. The necessary tools are not installed, so we script it a bit:

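PID=$(pgrep sleep)
HEADER_DONE=
# Walk up the process tree until we reach PID 1.
# PARENT, OWNER and OWNER_ID are used instead of PPID, UID and USER:
# PPID and UID are read-only in bash, and USER is an environment variable.
while [ -n "$PID" ] && [ "$PID" != "1" ] && [ "$PID" != "0" ]; do
    PARENT=$(awk '/^PPid:/ {print $2}' "/proc/$PID/status")
    OWNER=$(stat -c %U "/proc/$PID"); OWNER_ID=$(stat -c %u "/proc/$PID")
    CMD=$(tr '\0' ' ' < "/proc/$PID/cmdline")
    if [ -z "$HEADER_DONE" ]; then
      HEADER_DONE=true
      echo "PID   PPID  USER(UIDN)     CMD"
    fi
    echo "$PID  $PARENT  $OWNER($OWNER_ID)  $CMD"
    PID=$PARENT
done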

We should see the same PID as before, and output similar to this:

PID   PPID  USER(UIDN)      CMD
3301  3277  rootless(1000)  sleep 900 
3277  1     rootless(1000)  /usr/local/bin/containerd-shim-runc-v2 -namespace moby -id 08ce2ee7d4e3194f47ec6249360b51b02e6e36834f6791854ca2b24a4b15768c -address /run/user/1000/docker/containerd/containerd.sock 

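Indeed, we see that Docker itself is not the container runtime: containerd acts as the high-level runtime and runc as the low-level runtime. (By the way, moby in the -namespace argument is the containerd namespace that Docker uses for its containers, named after the Moby open-source project; it has nothing to do with Linux network namespaces.) The parent of each of these containerd-shim-runc-v2 processes is PID 1 on the system.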

The shim becomes the parent process of the containerized application. It is responsible for tasks such as reaping zombie processes, handling container process I/O (standard input, output, error), and ensuring proper container cleanup upon exit. As a result, containerd can be upgraded and restarted without affecting running containers.
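How a process ends up with PID 1 as its parent can be observed without Docker. The following is a minimal sketch, assuming a Linux host with /proc and a POSIX shell; the variable names MIDDLE, ORPHAN and NEWPARENT are chosen for illustration, and the script does not touch containerd. An intermediate shell starts a background sleep and exits immediately, so the kernel hands the orphaned sleep to PID 1 or to the nearest sub-reaper. That the shim lands under PID 1 through the same mechanism is an assumption here, not something the output above proves.

```shell
#!/bin/sh
# An intermediate shell starts a background sleep, prints
# "<its own PID> <sleep PID>" and exits right away, which orphans the sleep.
set -- $(sh -c 'sleep 30 >/dev/null 2>&1 & echo "$$ $!"')
MIDDLE=$1
ORPHAN=$2

# The intermediate shell is gone, so the orphan has a new parent: read it from /proc.
NEWPARENT=$(awk '/^PPid:/ {print $2}' "/proc/$ORPHAN/status")
echo "sleep $ORPHAN was started by $MIDDLE, its parent is now $NEWPARENT"

kill "$ORPHAN"   # clean up
```

On most systems the new parent is 1. Inside a nested environment it may be another sub-reaper.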

Secondly, we see that, in the end, a container is just a process running on the host. When Docker is not running in rootless mode, this process runs as root by default!
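Because a container is an ordinary process, its owner can be read from /proc like that of any other process. A minimal sketch, assuming a Linux /proc; proc_euid is a hypothetical helper name, and the current shell's PID stands in for the PID of a container process (for example the one reported by docker inspect):

```shell
#!/bin/sh
# proc_euid PID: print the effective UID of a process.
# (hypothetical helper; the Uid: line in /proc/PID/status lists the
# real, effective, saved and filesystem UID, in that order)
proc_euid() {
    awk '/^Uid:/ {print $3}' "/proc/$1/status"
}

# Demo on the current shell; substitute the PID of a container process to check it.
TARGET=$$
SEEN_UID=$(proc_euid "$TARGET")
if [ "$SEEN_UID" = "0" ]; then
    echo "PID $TARGET runs as root"
else
    echo "PID $TARGET runs as UID $SEEN_UID"
fi
```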

Let us look at the isolation techniques in use (lsns is not installed, so we script this as well):

PID=$(pgrep sleep)
unset first
# One output line per namespace the process belongs to.
# OWNER and OWNER_ID are used instead of USER and UID:
# UID is read-only in bash, and USER is an environment variable.
for NSPATH in /proc/$PID/ns/*; do
    TYPE=$(basename "$NSPATH")
    NS=$(stat -Lc %i "$NSPATH")   # the inode number identifies the namespace
    NPROCS=$(find /proc/[0-9]* /proc/*/task/* 2>/dev/null -lname "*[$NS]" | wc -l)
    OWNER=$(stat -c %U "/proc/$PID")
    OWNER_ID=$(stat -c %u "/proc/$PID")
    CMD=$(tr '\0' ' ' < "/proc/$PID/cmdline")
    if [ -z "${first+x}" ]; then
      first=true
      printf "%-12s %-6s %-6s %-6s %-12s %s\n" "NS" "TYPE" "NPROCS" "PID" "USER" "COMMAND"
    fi
    printf "%-12s %-6s %-6s %-6s %-12s %s\n" "$NS" "$TYPE" "$NPROCS" "$PID" "$OWNER($OWNER_ID)" "$CMD"
done

This shows the namespaces used by this container, most of them newly created for it:

NS           TYPE   NPROCS PID    USER         COMMAND
4026532969   cgroup 0      3984   rootless(1000) sleep 900 
4026532967   ipc    0      3984   rootless(1000) sleep 900 
4026532965   mnt    0      3984   rootless(1000) sleep 900 
4026532970   net    0      3984   rootless(1000) sleep 900 
4026532968   pid    0      3984   rootless(1000) sleep 900 
4026532968   pid_for_children 0      3984   rootless(1000) sleep 900 
4026531834   time   0      3984   rootless(1000) sleep 900 # time isolation
4026531834   time_for_children 0      3984   rootless(1000) sleep 900 
4026532832   user   0      3984   rootless(1000) sleep 900 # uid gid isolation (root inside is not root outside)
4026532966   uts    0      3984   rootless(1000) sleep 900 # hostname isolation

By comparison, a plain sleep command started in the current shell would run in the same namespaces as its parent shell and therefore have no isolation. Also, thanks to rootless mode, this process runs with UID 1000 instead of root.
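The comparison with a plain sleep can be checked directly. The sketch below, assuming a Linux /proc and a POSIX shell, starts a background sleep from the current shell and compares each namespace link of the child with that of the shell; the variable names are illustrative only.

```shell
#!/bin/sh
# Start a plain background process from this shell (no container involved).
sleep 30 &
CHILD=$!

SHARED=0
DIFFERENT=0
for NSPATH in /proc/$$/ns/*; do
    TYPE=$(basename "$NSPATH")
    MINE=$(readlink "$NSPATH")                  # link target looks like pid:[<inode>]
    THEIRS=$(readlink "/proc/$CHILD/ns/$TYPE")
    if [ "$MINE" = "$THEIRS" ]; then
        SHARED=$((SHARED + 1))
    else
        DIFFERENT=$((DIFFERENT + 1))
        echo "$TYPE differs: $MINE vs $THEIRS"
    fi
done
echo "shared: $SHARED, different: $DIFFERENT"

kill "$CHILD"   # clean up
```

Every namespace should be reported as shared, in contrast to the container process, whose namespace inodes mostly differ from those of the host shell.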

Don’t forget to exit our Docker backend container if you work in the webshell.

exit

🤔 Which time will be displayed when you execute uptime inside the container? Try it out and explain what you see and why.

Solution:

docker run --rm -i alpine uptime
uptime reads /proc/uptime.
The /proc filesystem is a virtual filesystem generated by the kernel, not something Docker emulates.
Anything in it that is not namespaced (unlike PIDs or the hostname, which are) therefore comes directly from the host. Examples:
/proc/uptime → host uptime,
/proc/cpuinfo → all host CPUs visible,
/proc/meminfo → host memory,
parts of /proc/sys/.. → global kernel parameters
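What uptime reads can also be inspected by hand. This is a minimal sketch, assuming a Linux /proc and an awk implementation (the BusyBox one in alpine should do); the variable names are illustrative. Running it once on the host and once inside a container would be expected to print nearly the same value, since both read the same kernel counter.

```shell
#!/bin/sh
# /proc/uptime holds two numbers: seconds since boot and accumulated idle seconds.
read -r UP_SECONDS IDLE_SECONDS < /proc/uptime

# Convert the first value into days, hours and minutes.
awk -v s="$UP_SECONDS" 'BEGIN {
    printf "up %d days, %02d:%02d\n", s / 86400, (s % 86400) / 3600, (s % 3600) / 60
}'
```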