I want to share a capability that we have at $WORK that is not industry standard, but I think is underrated.
Netflix’s container platform Titus has the capability of sharing folders between two containers in a Kubernetes (k8s) pod.
You might be asking, “can’t k8s already do that?”
No.
You are probably thinking of emptyDir, which is a way of setting up an empty folder that multiple containers can share.
This is really not the same.
What I’m talking about is one container that has files, perhaps from a volume or perhaps just in its image, and a sidecar container that can see arbitrary locations in the first container’s filesystem.
This is a Container To Container mount.
Example Pod
You don’t have to have a fancy platform to do this. If you are OK running a random privileged container (mine) off the internet, you can try this out!
Here is an example pod. It has:
- A busybox container that only writes out an html file
- An nginx container to serve files
- A cross-mounter binary that connects the two.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-folder-example
  namespace: default
spec:
  containers:
    - name: data-provider
      image: busybox
      command: ["sh", "-c", "mkdir -p /app/build && echo '<h1>Hello from BusyBox!</h1><p>This was created in busybox, served by nginx</p>' > /app/build/index.html && sleep infinity"]
    - name: nginx
      image: nginx:alpine
      ports:
        - containerPort: 80
    - name: folder-mounter
      image: docker.io/solarkennedy/k8s-container-cross-mounter:latest
      imagePullPolicy: IfNotPresent
      args:
        - "--src-container"
        - "data-provider"
        - "--dst-container"
        - "nginx"
        - "--src-path"
        - "/app/build"
        - "--dst-path"
        - "/usr/share/nginx/html"
      securityContext:
        privileged: true
      volumeMounts:
        - name: containerd-socket
          mountPath: /run/containerd/containerd.sock
          readOnly: false
        - name: proc
          mountPath: /host-proc
          readOnly: false
  volumes:
    - name: containerd-socket
      hostPath:
        path: /run/containerd/containerd.sock
        type: Socket
    - name: proc
      hostPath:
        path: /proc
        type: Directory
```
Feel free to try it!
If you browse to the pod’s IP on port 80, you will see the message from the busybox container.
If you exec into the nginx container and inspect its mounts, you will see:

```
overlay on /usr/share/nginx/html type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/128/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/127/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/126/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/73/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/143/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/143/work)
```
This is the cross-mounting code at work.
How It Works
At a high level, the cross-mounting container runs a very special `move_mount` syscall.
You can see the actual command takes arguments for source/destination container names and source/destination paths.
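Concretely, the new Linux mount API makes this possible: enter the source container’s mount namespace, take a detached clone of the source subtree with `open_tree()`, then enter the destination container’s mount namespace and attach that clone with `move_mount()`. The Python sketch below is my own illustration of that sequence, not the actual cross-mounter’s code; the syscall numbers are x86-64 only, and actually running it requires CAP_SYS_ADMIN and real container PIDs.

```python
# Illustrative sketch of cross-namespace mounting via open_tree()/move_mount().
# Assumptions: x86-64 syscall numbers, CAP_SYS_ADMIN, and a single-threaded
# process (setns() for a mount namespace applies to the calling thread).
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

SYS_OPEN_TREE = 428            # x86-64 syscall numbers, Linux >= 5.2
SYS_MOVE_MOUNT = 429

AT_FDCWD = -100
OPEN_TREE_CLONE = 0x1          # return a detached copy of the mount subtree
AT_RECURSIVE = 0x8000          # clone submounts too
MOVE_MOUNT_F_EMPTY_PATH = 0x4  # "from" is the fd itself, not a path under it
CLONE_NEWNS = 0x00020000       # mount-namespace flag for setns()


def enter_mount_ns(pid: int) -> None:
    """Join the mount namespace of `pid` (also resets our root and cwd)."""
    fd = os.open(f"/proc/{pid}/ns/mnt", os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_NEWNS) != 0:
            raise OSError(ctypes.get_errno(), f"setns into pid {pid} failed")
    finally:
        os.close(fd)


def cross_mount(src_pid: int, src_path: str, dst_pid: int, dst_path: str) -> None:
    # 1. In the source container's namespace, grab the subtree as an fd.
    enter_mount_ns(src_pid)
    tree_fd = libc.syscall(SYS_OPEN_TREE, AT_FDCWD, src_path.encode(),
                           OPEN_TREE_CLONE | AT_RECURSIVE)
    if tree_fd < 0:
        raise OSError(ctypes.get_errno(), f"open_tree({src_path}) failed")
    # 2. The fd survives the namespace switch: enter the destination
    #    container's namespace and graft the subtree onto dst_path.
    enter_mount_ns(dst_pid)
    try:
        if libc.syscall(SYS_MOVE_MOUNT, tree_fd, b"", AT_FDCWD,
                        dst_path.encode(), MOVE_MOUNT_F_EMPTY_PATH) < 0:
            raise OSError(ctypes.get_errno(), f"move_mount({dst_path}) failed")
    finally:
        os.close(tree_fd)
```

With real PIDs, the example pod’s mount would correspond to something like `cross_mount(src_pid, "/app/build", dst_pid, "/usr/share/nginx/html")`, run as root with the host’s /proc visible.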
You can read all about the gory technical linuxy details on my other blog posts: 1 2 3 4
Why Not CSI?
This type of mounting can never be a CSI driver.
This is because all CSI drivers need to run before any of the containers are created.
This is fundamentally incompatible with how the cross-mounting works, which requires the containers to exist so it can actually enter their namespaces.
Why Not Volume/VolumeMounts?
At Netflix we DO use the Volume/VolumeMount constructs to define these mounts.
But we only get away with this through heavy customization of the runtime.
In practice, kubelet ignores our shared volumes/volumeMounts thanks to a noop FlexVolume driver; our runtime customizations then do the “real” mounting outside of kubelet, once the container is actually running.
We use a trick where we pause the started containers and only release them once the storage is mounted.
Why Not An Init Container?
We can’t use an init container for this, because cross-mounting requires the target containers to be running.
At init-container time, the other containers are NOT running yet, so there are no PIDs to find and enter for mounting.
Unless you are going to make runtime customizations, a normal container in the pod is the only way.
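Once the peer containers are running, their PIDs become discoverable. The example pod mounts the host’s /proc at /host-proc, so one illustrative way to resolve a container to its host PIDs is to scan cgroup memberships. This is my own sketch, not necessarily how the real mounter does it; it also has the containerd socket mounted and could ask containerd directly.

```python
import os


def find_container_pids(container_id: str, proc_root: str = "/host-proc") -> list:
    """Return host PIDs whose cgroup path mentions `container_id`.

    Illustrative sketch only: assumes the host /proc is mounted at
    `proc_root` (as in the example pod) and that the runtime embeds the
    container ID in its cgroup paths, which containerd does.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cgroup")) as f:
                if container_id in f.read():
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited mid-scan; skip it
    return pids
```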
What About Ordering?
There is no strict ordering here: the mount can only happen after the other containers are running, which means there is a race. If your other containers need the mount right when they start, just add a short sleep at the beginning. It doesn’t take very long for the cross-mounter to mount.
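If a sleep feels too crude, the consuming container can instead poll until the mount actually appears before starting its real work. A small sketch of that idea (mine, assuming an ordinary /proc inside the container):

```python
import time


def wait_for_mount(path: str, timeout: float = 10.0) -> bool:
    """Poll /proc/self/mounts until `path` shows up as a mount point."""
    target = path if path == "/" else path.rstrip("/")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open("/proc/self/mounts") as f:
            # field 1 of each line is the mount point
            mount_points = {line.split()[1] for line in f}
        if target in mount_points:
            return True
        time.sleep(0.1)  # the cross-mounter is quick, so poll tightly
    return False
```

An entrypoint could call `wait_for_mount("/usr/share/nginx/html")` and only then exec the real server.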
Conclusion
I think it is unlikely that cross-mounting containers will ever be a native feature of k8s. It is too linuxy and weird.
But, if you have the appetite for linuxy and weird, you too can use this cross-mounting method to compose containers together in new and creative ways.
This is particularly useful at big companies where one team might produce an image that another team can consume, but you don’t always want to build FROM that image; maybe you just want to mount in their tools?
Or maybe you have some Python code that you want to deploy, but you want to keep the Python runtime parts in a different image? With this technique, you can compose them together, just like the busybox+nginx example.
If you want to learn more and see the code, it is in a repo on Codeberg. It uses the existing Titus Open Source code (archived).