For the longest time I’ve thought that we were doing container images “wrong”[^1].
Container images always felt so unnecessarily BIG.
Yes, there are lots and lots of ways to make images smaller, but still fundamentally they are shipping a root filesystem around. “Works on my machine!” Let’s ship the machine then.
Multi-stage builds came around and solved one class of bloat problem. At least we don’t need to embed lots of intermediate layers.
Still, images felt big to me. Ubuntu doesn’t even really have a “thin” image?
Google’s distroless container images were finally getting close to the ideal.
No shell, just the stuff you need. Great.
Yet…
## What If We Took “Thin Images” To The Extreme?
Even the absolute base distroless image still has ca-certificates and tzdata.
This still kinda feels wrong to me.
Yes, you want your application to be reproducible, but why does every container image need its own independent copy of the timezone files (and likewise ca-certificates)?
Every image has to be rebuilt and repushed whenever those files are updated.
At the cost of being able to run this container locally on your laptop, what if we stripped out EVERYTHING except the app, and instead composed its input in?
## Introducing the Kubernetes Image Volume Type
Image Volumes is a Kubernetes beta feature (as of v1.33; KEP-4639) that lets you mount OCI images directly as volumes in your pods.
Here’s what that looks like:
```yaml
volumes:
  - name: tzdata
    image:
      reference: docker.io/library/tzdata:latest
      pullPolicy: IfNotPresent
```
The image is pulled from the registry, unpacked by containerd, and mounted as a read-only volume.
You can mount this volume into any container in the pod at arbitrary mount locations using a volumeMount.
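Wiring the volume above into a container uses the same `volumeMounts` stanza as any other volume type (container name and image here are illustrative):

```yaml
containers:
  - name: app
    image: docker.io/library/myapp:latest
    volumeMounts:
      - name: tzdata
        mountPath: /usr/share/zoneinfo
        readOnly: true
```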
## Do A Demo!
Here is this idea in action:

[Diagram: one pod containing a single app container (a ~5MB HTTP server on port 8080) with three OCI images mounted as volumes: a tzdata volume (~450KB) at /usr/share/zoneinfo, a CA certs volume (~220KB) at /etc/ssl/certs, and a config volume (~180B) at /config.]
Four independent images, but still one container:
- Main app: Scratch + statically compiled Go binary (5.47MB)
- Timezone data: Alpine’s zoneinfo extracted into scratch (446KB)
- CA certificates: Alpine’s CA bundle extracted into scratch (219KB)
- Configuration: JSON file in scratch (181 bytes!)
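Each of those volume images can be little more than a single COPY into scratch. A sketch of what the tzdata one might look like (the base tag is illustrative; the real build files are in the repo linked below):

```dockerfile
# Stage 1: grab the packaged IANA timezone database from Alpine
FROM alpine:3.20 AS tz
RUN apk add --no-cache tzdata

# Stage 2: ship only the zoneinfo tree, nothing else
FROM scratch
COPY --from=tz /usr/share/zoneinfo /
```

Mounting this image at /usr/share/zoneinfo gives the app a standard zoneinfo layout without the app image carrying a distro.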
If you have a cluster that meets the requirements below, you can run this pod now:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ultra-minimal-demo
spec:
  containers:
    - name: app
      image: docker.io/solarkennedy/ultra-minimal-go-app:latest
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: tzdata
          mountPath: /usr/share/zoneinfo
          readOnly: true
        - name: cacerts
          mountPath: /etc/ssl/certs
          readOnly: true
        - name: config
          mountPath: /config
          readOnly: true
  volumes:
    - name: tzdata
      image:
        reference: docker.io/solarkennedy/tzdata-volume:latest
        pullPolicy: IfNotPresent
    - name: cacerts
      image:
        reference: docker.io/solarkennedy/ca-certificates-volume:latest
        pullPolicy: IfNotPresent
    - name: config
      image:
        reference: docker.io/solarkennedy/config-volume:latest
        pullPolicy: IfNotPresent
```
Or you can create a KIND cluster locally (after installing KIND):
```shell
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  ImageVolume: true
nodes:
  - role: control-plane
    image: kindest/node:v1.34.0
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            feature-gates: "ImageVolume=true"
EOF
```
And then run:
```shell
kubectl apply -f https://codeberg.org/solarkennedy/images-as-volumes-demo/raw/branch/master/pod.yaml
kubectl wait --for=condition=ready pod/ultra-minimal-demo --timeout=60s
kubectl port-forward pod/ultra-minimal-demo 8080:8080
```
Then visit http://localhost:8080 to see the demo in action.
All the code from this exploration is available on Codeberg: images-as-volumes-demo. It includes instructions for running this locally using KIND, so you don’t have to be blocked on running a cutting-edge k8s cluster.
## Requirements
Getting this to work requires specific versions:
- Kubernetes: v1.33+ (beta feature; needs the `ImageVolume=true` feature gate enabled)
- containerd: 2.1.0+ (CRI implementation of ImageVolume)
- KIND (if testing this locally): v0.30.0+ ships with Kubernetes v1.34.0 and containerd 2.1.3[^2]
## Pros / Cons
Pros:
- Independent Versioning: The app image contains nothing but the app, so every related image can be rev’d independently. Changing configuration or patching CA certificates doesn’t require rebuilding the app image - just swap the volume.
- Node-Level Sharing: These volume images can be shared between other pods on the same node, unlike layers which are tied to specific images. One tzdata volume can serve hundreds of pods.
- Smaller App Images: 5MB app image instead of 50MB+ monolithic image means faster pulls and less registry storage.
Cons:
- More Pulls: Now you need to pull 4 images to get a container up instead of 1.
- No Local Run: Since this is a k8s feature, you can’t run this pod locally on your laptop (without something like KIND).
- Orchestration Complexity: Deployment manifests become more complex with multiple image references to manage and version.
- Debugging Difficulty: When something breaks, you need to check 4 different image versions instead of 1.
- Tooling Immaturity: Still a beta K8s feature with limited tooling support and subject to change.
## Real-World Use Cases
Machine Learning Model Serving: Swap models without rebuilding inference servers. Version models and tokenizers independently.
Configuration Management: Deploy the same app globally with region-specific config images. Use OCI images for config instead of ConfigMaps.
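Concretely, a per-region rollout would differ only in the volume’s image reference; the rest of the pod spec stays identical (registry and tag here are hypothetical):

```yaml
volumes:
  - name: config
    image:
      # Hypothetical per-region config image; the app image is unchanged.
      reference: registry.example.com/myapp-config-eu-west:2025-06-01
      pullPolicy: IfNotPresent
```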
Shared Libraries As a Service: Mount common dependencies (numpy, scipy, frontend assets) across multiple pods, HPC-style. Make my “libraries” dream a reality.
Compliance and Security: Centrally manage audited CA bundles. One scanned image mounted everywhere; no embedding in every app. Nobody has to rebuild anything to fix Shellshock or maybe even Log4Shell?[^3]
## Conclusion
This approach changes how we think about container composition:
[Diagram: before, a single monolithic image stacking Base OS, timezone data, and CA certificates in layers beneath the app; after, an app binary with separate CA certs, timezone data, and config volumes composed in at runtime.]
I really feel like it represents something big: moving beyond layers and into composed workloads.
Yes, humanity should stop copying the tzdata database for the billionth time.
That to me isn’t actually the biggest problem, just an annoyance.
And yes, multi-container pods also allow you to compose workloads together but they also suck in their own way.
I’m really looking forward to the future where nobody ever says `FROM ubuntu` at all, and instead we use a shared base image and compose our app into it.
And we make big image rebuilds a thing of the past, because apps don’t have to rebuild at all when the composed input gets updated[^4].
[^1]: Wrong for who? Wrong for platform owners who control the containers (an internal developer platform). If you’re building containers for public distribution, the tradeoffs are completely different.

[^2]: If you see an error like `mkdir '': no such file or directory`, it probably means you need a newer containerd version.

[^3]: I don’t know how this would work exactly, but it feels like it should be possible to bundle jars like this?

[^4]: Of course, it isn’t this simple, but it could be. We would still pin to minor versions. Ideally the information about how to compose the workload together lives inside the repo directly, even if that means git-committing a pod.yaml, so that when you do have to do a major base upgrade, it is tied to whatever associated app changes are needed to make it work.