Hardening the Cluster: Implementing User Namespaces for Container Isolation

There's a class of Kubernetes security bugs that makes platform engineers lose sleep. Not the "wrong RBAC rule" kind. The quiet, structural kind where your container runtime is operating exactly as designed, and that design is the problem.
For years, the problem was this: a process running as UID 0 inside a container was also UID 0 on the worker node. Kernel namespaces isolated the view of the filesystem and processes. But the identity who you are from the kernel's perspective leaked through.
User Namespaces graduating to GA in Kubernetes 1.36 finally closes that gap.
The Mental Model: What "Root" Actually Means Here
When you run a container without User Namespaces, the kernel credential tables look like this:
Container process: UID 0 (root)
Host kernel sees: UID 0 (root)
If your container runtime has any exploitable vulnerability a crafted seccomp bypass, a bad mount syscall, anything that escaped process lands on the host as root. Game over.
User Namespaces insert an indirection layer. The kernel maintains a UID mapping table per namespace:
Container UID 0 → Host UID 100000
Container UID 1 → Host UID 100001
Container UID 65535 → Host UID 165535
The process inside the container still thinks it's root. It can chown, chmod, do the things root does within its namespace. But the host kernel sees it as UID 100000, which has zero privileges on the node. No capability grants. No access to host files outside the mapped range. A container breakout now lands you as an unprivileged user on the host, not root.
This is fundamentally different from runAsNonRoot: true. That tells the container to start as a non-root user. User Namespaces remaps what "root" means even if the process inside is UID 0, it's unprivileged from the host's perspective.
How the Mapping Is Allocated
Kubernetes (via the kubelet and the OCI runtime) handles the UID/GID range allocation automatically when you opt in. Each pod gets a unique, non-overlapping 65,536-UID range from the host's subordinate UID space typically configured in /etc/subuid and /etc/subgid on the node.
The kernel /proc/self/uid_map inside a running pod with User Namespaces enabled looks like:
0 100000 65536
Three columns: container_start host_start count. The entire UID range 0–65535 inside the container maps to 100000–165535 on the host. No overlap with any other pod's range, no overlap with real system UIDs.
Enabling It: hostUsers: false
The API surface is intentionally minimal. One field, one decision:
apiVersion: v1
kind: Pod
metadata:
name: isolated-workload
spec:
hostUsers: false # <-- this is the entire opt-in
securityContext:
runAsUser: 0 # still "root" from inside the container
containers:
- name: app
image: your-app:latest
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
hostUsers: false tells the kubelet to create the pod inside a new user namespace with a remapped UID range. That's it. The rest of the securityContext fields still apply and still matter User Namespaces is a layer underneath them, not a replacement.
One nuance worth understanding: capabilities become namespaced under User Namespaces. If you grant CAP_SYS_ADMIN or CAP_NET_ADMIN inside a user-namespaced pod, those capabilities are valid only within the pod's namespace boundary. CAP_NET_ADMIN can manipulate the pod's network interfaces but cannot touch the host's. CAP_SYS_MODULE (kernel module loading) becomes entirely inert it will fail. This is why drop: ["ALL"] is still correct defense-in-depth practice, not redundant ceremony: you're containing the damage if a vulnerability grants unexpected capabilities inside the container.
For a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: hardened-service
spec:
replicas: 3
selector:
matchLabels:
app: hardened-service
template:
metadata:
labels:
app: hardened-service
spec:
hostUsers: false
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: service
image: your-app:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
Node-Side Requirements
Not all nodes support this out of the box. You need:
- Linux kernel ≥ 6.3 this is a hard floor, not a soft recommendation. idmap mounts were introduced in kernel 5.12 and overlayfs gained idmap support in 5.19, but
tmpfsdidn't get it until 6.3. Since Kubernetes mounts service account tokens and Secrets via tmpfs by default, even a pod with no PVCs will fail on kernels older than 6.3. - A container runtime that supports User Namespaces: containerd ≥ 2.0, CRI-O ≥ 1.25
idmapmount support in the filesystem driver (ext4, xfs, overlayfs, tmpfs all support it in 6.3+; some CSI drivers do not more on this below)
Check your node's kernel with:
uname -r
# Must be 6.3+
# Verify tmpfs idmap support (the real gating check)
cat /proc/filesystems | grep tmpfs
# Then confirm kernel version; tmpfs idmap requires 6.3+
The Real Hurdles: Where It Gets Complicated
This is the part most blog posts skip. User Namespaces is genuinely powerful, but the interaction with persistent storage is the current pain point that will catch you in production.
Problem 1: Volume File Permissions
When a PersistentVolume is mounted into a pod without User Namespaces, the kernel maps file ownership directly. A file owned by UID 1000 on the NFS server is seen as UID 1000 by the container process.
With User Namespaces, the container's UID 1000 maps to host UID 101000. But the files on the NFS share are still owned by UID 1000 (the real one). The container process opens the volume and sees files it doesn't own permission denied.
The fix is idmap mounts, which apply the same UID translation table to the volume's ownership as seen by the container. The kernel translates on access: file owned by host UID 1000 → container sees it as UID 0 (or whatever maps correctly). This requires kernel ≥ 6.3 and CSI driver support.
As of Kubernetes 1.36, most in-tree volume types support idmap mounts. But two categories won't work regardless of kernel version or driver freshness:
- NFS: the Linux NFS client has no idmap mount support at all. If you're running stateful workloads on NFS-backed PVCs,
hostUsers: falseis a hard block until upstream NFS gains idmap support. There's no workaround short of migrating off NFS. - Older enterprise CSI drivers: many haven't implemented idmap mount support yet. Check your specific driver before committing to user namespaces for stateful workloads.
# Check if your node's kernel supports idmap for a given filesystem
cat /proc/mounts | grep "idmapped"
If your CSI driver doesn't support idmap mounts, User Namespaces + persistent volumes simply won't work. You'll need to either fix the permissions at the storage layer (pre-chown files to the remapped UID range) or wait for driver updates.
Problem 2: SELinux and Volume Labels
On SELinux-enabled nodes (RHEL, CentOS Stream, some hardened distros), Kubernetes automatically relabels volume directories with the pod's SELinux context using chcon. This is how multi-tenant pods can share a node without SELinux denials each pod gets a unique MCS label like s0:c123,c456.
User Namespaces adds a wrinkle: the SELinux label for the volume needs to account for the UID mapping and the MCS category, and the interaction between the two wasn't cleanly supported until recent kernel patches.
seLinuxChangePolicy was introduced as alpha in Kubernetes 1.32 and graduated to GA in 1.36. It is not related to the original user namespaces alpha in 1.25 it's a separate, newer field. The good news: as of 1.36, MountOption is the default value, so you don't need to set it explicitly unless you're opting out:
spec:
hostUsers: false
securityContext:
seLinuxOptions:
level: "s0:c123,c456"
# seLinuxChangePolicy defaults to MountOption in 1.36 only set
# seLinuxChangePolicy: Recursive if you need multiple pods with
# different SELinux labels to share the same volume simultaneously
volumes:
- name: data
persistentVolumeClaim:
claimName: my-pvc
MountOption tells Kubernetes to apply the SELinux label via mount options rather than recursive chcon. This matters beyond correctness it makes volume mounting effectively O(1). Recursive chcon on a large database volume used to destroy pod startup time; mount-option labeling is instant regardless of volume size.
Important caveat: the O(1) behavior only activates when the SELinuxMount feature gate is enabled. As of Kubernetes 1.36, SELinuxMount is still disabled by default. seLinuxChangePolicy reaching GA means the API field is stable; it does not mean the performance optimization is on. Enable the feature gate explicitly if you want the startup time benefit, and check that your CSI driver has spec.seLinuxMount: true in its CSIDriver instance:
kubectl get storageclass <name> -o yaml | grep -i selinux
kubectl get csidriver <driver-name> -o yaml | grep -i selinuxMount
Problem 3: hostPID, hostNetwork, hostIPC All Incompatible
User Namespaces requires full namespace isolation. If your pod uses any of:
spec:
hostPID: true # pod shares host PID namespace
hostNetwork: true # pod shares host network namespace
hostIPC: true # pod shares host IPC namespace
...then the pod will be rejected at admission in Kubernetes 1.36. This combination is now a hard validation error, not a silent no-op. The whole point of User Namespaces is that your pod is isolated sharing host namespaces collapses that isolation.
This means any workload that legitimately needs host network (common in DaemonSets for CNI components, node exporters, etc.) cannot use User Namespaces. That's an acceptable tradeoff those workloads are typically privileged by design but it means you can't blanket-enforce hostUsers: false cluster-wide via policy.
Problem 4: Image Layers and UID Ownership
Container images built assuming UID 0 will often have files owned by 0:0. Inside a user-namespaced pod, this is fine UID 0 in the container still maps correctly. But images that embed files owned by specific non-zero UIDs say, a database image that has data dirs owned by UID 999 (postgres convention) need that UID to fall within the mapped range.
The mapped range is 0–65535. UID 999 is within that range, so it maps to host UID 100999. The container process running as 999 will own its files. This works.
Where it breaks: if an image embeds files owned by UIDs above 65535. That's rare in practice but worth auditing for any custom base images.
Putting It Together: Defense in Depth
User Namespaces isn't a silver bullet, and it doesn't replace your existing controls. Think of it as adding a new layer to the stack:
Application layer → RBAC, network policies, auth
Container layer → seccomp, AppArmor/SELinux, capabilities drop
Namespace layer → User Namespaces (UID remapping) ← new
Runtime layer → rootless containerd, gVisor, kata
Host layer → node hardening, kernel lockdown
The right baseline for a multi-tenant cluster today looks like:
securityContext:
runAsNonRoot: true # belt
runAsUser: 65534 # nobody, belt
seccompProfile:
type: RuntimeDefault
# at pod spec level:
hostUsers: false # suspenders
What It Protects and What It Doesn't
This is the part that gets glossed over, and the K8s community has been vocal about User Namespaces being oversold.
What it protects: A container breakout via the runtime now lands the attacker as UID 100000-something on the host unprivileged, owning nothing, in a blank UID range. That's a fundamentally different blast radius from landing as root. Node compromise via container escape is dramatically harder.
What it does not protect:
- API server access. A mounted
ServiceAccounttoken is still a mountedServiceAccounttoken. If your pod has a token withcluster-adminor broad RBAC grants, an attacker inside the container can still curl the API server and own the cluster. User Namespaces does nothing here that's network and RBAC policy territory. - Lateral network movement. The pod still has a network interface, still has DNS, still can reach other services in the cluster. Compromising the container and pivoting to other pods is still possible. NetworkPolicies are your control here.
- Secrets mounted as files. Env vars and volume-mounted secrets are still readable by the container process. User Namespaces doesn't encrypt or restrict access to secrets the pod is already authorized for.
Think of it precisely: User Namespaces protects the node OS from a container escape. It says nothing about the cluster once the attacker is already operating inside the container's legitimate identity.
Bonus: Pod Security Standards Get Easier
There's a platform team quality-of-life win here that often goes unmentioned.
The Restricted Pod Security Standard requires, among other things, runAsNonRoot: true and no privilege escalation. Developers push back on this constantly because many off-the-shelf images run as root internally. With hostUsers: false, those containers are running as "root" inside the namespace but as an unprivileged UID on the host which satisfies the intent of the Restricted policy even if the container process thinks it's UID 0.
This doesn't automatically make Restricted PSS pass (the admission controller checks the Pod spec fields, not the runtime behavior), but it meaningfully changes the risk conversation with your security team. You have a structural argument: "this container runs as root internally, but that root has no host privileges" which is a different conversation than "we need a root exception."
What to Watch
The rough edges are real but actively being worked on:
- CSI driver idmap support is the biggest gap right now. Track your storage vendor's changelog directly idmap support is implemented per driver, not centrally. KEP-127 is the core User Namespaces feature KEP (useful for understanding the feature's evolution and known limitations), but CSI-specific progress lives in each driver's own issue tracker.
- SELinux + idmap interaction is kernel-version-sensitive. If you're on RHEL 9 / kernel 5.14.x, you may hit issues before the backports land.
- StatefulSets with RWO volumes need careful testing fsGroup interactions with idmap mounts have edge cases that depend on your CSI driver's idmap implementation.
The feature is GA. The ecosystem is catching up. For stateless workloads, enable it today. For stateful workloads on unverified CSI drivers, test in staging with your exact storage stack before rolling to production.





