Exploring Kubernetes v1.35 'Timbernetes': All About the World Tree Version

Author's Note: This piece was inspired by Nicolas Vermandé's comprehensive analysis at ScaleOps. His work shaped the structure and focus of my coverage. Check out his article for additional insights.


The Release That Makes You Choose: Upgrade or Archaeology

Picture this: You're sipping your morning coffee, scrolling through your Slack, and someone drops the Kubernetes v1.35 release notes. "Another quarterly release," you think. "Probably just some minor tweaks and the usual beta promotions."

Wrong. So very, very wrong.

Kubernetes v1.35 is the release equivalent of your apartment landlord saying "we're doing renovations" and then showing up with a wrecking ball. This isn't just another feature release; it's an infrastructure intervention. And if you're still running CentOS 7 nodes... well, I'm not saying you should panic, but maybe start stress-testing your resume template.

Let me explain.


The Theme: Yggdrasil, Squirrels, and Existential Questions About Operating Systems

First, can we just appreciate the theme? Timbernetes. The World Tree. Inspired by Yggdrasil from Norse mythology, the cosmic tree that connects all realms. The logo features three adorable squirrels: a wizard holding an LGTM scroll (for reviewers), a warrior with an axe and Kubernetes shield (for release crews), and a rogue with a lantern (for triagers who bring light to dark issue queues).

It's wholesome. It's nerdy. It's the kind of branding that makes you want to print stickers and put them on your laptop next to that one from KubeCon 2019 that's starting to peel.

But here's what the cheerful squirrels don't tell you: Kubernetes v1.35 represents a philosophical shift in how the project sees itself. This isn't just about features; it's about focus.

Kubernetes is doubling down on being infrastructure.

What does that mean? Think about the role of foundational systems. They provide powerful, reliable building blocks (mechanisms for resource management, workload placement, and lifecycle control), but they don't prescribe how you should use them. They give you the tools; you bring the strategy.

v1.35 delivers exactly that: in-place resource mutation, coordinated gang scheduling, structured device allocation, and enhanced observability. These are sophisticated capabilities that unlock new possibilities. But they're capabilities, not complete solutions.

The native controllers (HPA, VPA, the default scheduler) provide baseline functionality. They work. They're reliable. But they're increasingly designed as reference implementations rather than production-optimized systems for every use case.

This creates an interesting dynamic: the primitives are maturing rapidly, but the intelligence layer (the part that decides when to resize a pod, where to place an AI workload, or how to optimize for cost) is increasingly left as an exercise for the platform team.

It's not a bug. It's a design choice. And it has implications for how you approach Kubernetes in production.


Breaking Changes: The Modernization Mandate

Let's start with the uncomfortable stuff. You know how doctors say "this won't hurt" right before it definitely hurts? Yeah, this is like that.

cgroup v1 Is Dead. No, Really, Actually Dead.

Remember cgroup v1? That venerable Linux resource management system that's been around since... forever? It's gone. Not deprecated with a gentle "please consider migrating" message. Removed. Deleted. Sent to the great /dev/null in the sky.

If your kubelet detects cgroup v1 on startup, it will fail. Hard. No negotiation. No "just this once." It's like trying to run Windows 95 programs on Windows 11: technically there's a compatibility mode, but do you really want to be that person?

Here's how to check if you're about to have a very bad day:

stat -fc %T /sys/fs/cgroup

If you see cgroup2fs, congratulations! You're living in 2025 (soon to be 2026). If you see tmpfs... I have bad news. You're running cgroup v1, and your Friday afternoon just got a lot more interesting.

The impact on legacy fleets is, shall we say, spicy. CentOS 7 (which hit EOL in June 2024, by the way; you really should have migrated by now), RHEL 7, Ubuntu 18.04... they all default to cgroup v1. Even if you're on a modern distro, if your kubelet is explicitly set to cgroupDriver: cgroupfs instead of systemd, you're going to hit this wall like a cartoon character hitting a pane of glass.
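If your nodes are already on a cgroup v2-capable distro and it's only the driver that's wrong, the fix is a one-line change in your KubeletConfiguration. A minimal fragment (merge it into your existing config and restart the kubelet):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd   # pair the kubelet with systemd-managed cgroup v2

Remember that your container runtime's cgroup driver (for containerd, the SystemdCgroup setting) has to match, or pods won't start.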

There is an escape hatch: you can set failCgroupV1: false in your KubeletConfiguration. But using it is like continuing to smoke after your doctor shows you the lung X-rays. Sure, technically you can, but it locks you out of all the cool v2-only features: memory QoS, certain swap configurations, and, here's the kicker, Pressure Stall Information (PSI) metrics.

PSI is the metric that tells you not just that CPU usage is high, but that processes are actively stalling waiting for CPU. It's the difference between "we're busy" and "we're drowning." It's a game-changer for autoscaling intelligence. And you can't have it on cgroup v1.
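If you haven't met PSI before, it's exposed under /proc/pressure/ on cgroup v2 systems (and per cgroup as well). The output looks roughly like this, with illustrative numbers:

cat /proc/pressure/memory
some avg10=0.32 avg60=0.12 avg300=0.04 total=642818
full avg10=0.00 avg60=0.01 avg300=0.00 total=98236

The "some" line means at least one task was stalled waiting on memory; "full" means every non-idle task was stalled at once. That second one is the "we're drowning" signal.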

So yeah, time to upgrade those nodes.

containerd 1.x: The Final Season

Here's another fun one: Kubernetes v1.35 is the last version that supports containerd 1.x. In v1.36, it's gone. This is your final warning, like when Netflix sends you three emails saying a show is about to leave the platform.

Why does this matter? Because containerd 2.0 removes support for Docker Schema 1 images. You know, those ancient container images that were pushed five years ago and have been lurking in your registry like digital archaeology? They won't pull anymore.

Before you upgrade, you need to:

  1. Check your container runtime versions:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
  2. Scan for Schema 1 images (yes, this is tedious):
skopeo inspect docker://your-registry/old-image:tag | jq '.schemaVersion'
  3. Update your containerd configs. And I mean really update them: containerd 2.0 removes the deprecated registry.configs and registry.auths structures. If your automated node upgrade scripts inject old configs, your runtime will crash. Don't discover this at 2 AM on a Saturday. A hedged config sketch follows below.
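For reference, the modern pattern replaces inline registry.configs/registry.auths with a config_path pointing at per-registry hosts.toml files. A rough sketch (the registry hostname is a placeholder, and the exact plugin section name varies between containerd releases, so verify against the output of containerd config default on your nodes):

# /etc/containerd/config.toml (fragment)
# Section name assumed for the containerd 2.x CRI image service; check your version.
[plugins."io.containerd.cri.v1.images".registry]
  config_path = "/etc/containerd/certs.d"

# /etc/containerd/certs.d/registry.example.com/hosts.toml (hypothetical registry)
server = "https://registry.example.com"

[host."https://registry.example.com"]
  capabilities = ["pull", "resolve"]

Credentials are generally better handled via Kubernetes image pull secrets anyway, which sidesteps the removed registry.auths block entirely.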

IPVS Mode Gets the Deprecation Talk

For years, IPVS mode in kube-proxy was the recommendation for large clusters because iptables couldn't scale. It was faster, more efficient, and made you feel like a sophisticated network engineer.

But here's the thing: maintaining IPVS behavior that perfectly matches iptables semantics while also supporting every new Service feature turned out to be... complicated. Like, "we're spending more time on compatibility than innovation" complicated.

So Kubernetes v1.35 deprecates IPVS mode. It still works! You'll just get a warning on startup. The future is nftables: a more modern, programmable backend that fixes iptables' scaling issues without IPVS's maintenance burden.

Check your mode:

kubectl get configmap kube-proxy -n kube-system -o yaml | grep -i "mode"

If you see mode: ipvs, you've got until v1.38 to migrate. That's probably about a year, give or take. Start testing nftables in staging. Take your time. But do take it seriously.
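When you're ready to test, the switch itself is one line in the kube-proxy configuration. A sketch of the relevant fragment of config.conf inside the kube-proxy ConfigMap (your nodes need nftables kernel and userspace support):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"   # previously "ipvs" (or "iptables")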


The Flagship Feature: In-Place Pod Resizing Goes GA 🎉

Alright, enough doom and gloom. Let's talk about the feature that's going to make stateful workload operators weep with joy: In-Place Pod Resizing is now Generally Available.

This is huge. Like, "finally, VPA might actually be usable in production" huge.

The Historical Inefficiency (aka "The Restart Tax")

Let me paint you a picture. You've got a production database. It's been running smoothly, serving queries, living its best life. Then you realize: "Hey, this needs more memory. Let's change the limit from 4GB to 8GB."

In Kubernetes v1.32 and earlier, here's what happens:

  1. Pod gets terminated

  2. All accumulated state evaporates (JIT compilation cache, warm database connections, Redis data if it's not persisted)

  3. New pod gets created

  4. New pod might fail to schedule (oops, no capacity)

  5. New pod might fail readiness probes (cold start blues)

  6. New pod might land on a worse node

  7. Your pager goes off

  8. You question your career choices

This is why Vertical Pod Autoscaler (VPA) was relegated to "recommendation mode" in most organizations. Sure, it could tell you what resources you needed, but actually applying those recommendations meant disruption. So teams would use it once at deploy time, like a sizing calculator, and then overprovision everything "just in case."

Result? Clusters running at 30-40% utilization. The tool that was supposed to optimize resource usage became a one-time measurement device.

The v1.35 Magic

With KEP-1287 graduating to GA, the resources field in a Pod spec is now mutable. You can patch it via the new /resize subresource, the kubelet evaluates feasibility, and, get this, the container keeps running.

No restart. No new container ID. No reset restartCount. The memory cgroup limit just... changes. From inside the container, it's seamless.

Here's what it looks like:

kubectl patch pod my-database --subresource resize --type='merge' -p '{
  "spec": {
    "containers": [{
      "name": "postgres",
      "resources": {
        "requests": {"memory": "8Gi"},
        "limits": {"memory": "8Gi"}
      }
    }]
  }
}'

And just like that, your database has more memory. No downtime. No data loss. No 3 AM incident.

The Gotchas (Because Of Course There Are Gotchas) :)

QoS Class Is Still Immutable

Kubernetes has three QoS classes: Guaranteed (requests == limits), Burstable (requests < limits), and BestEffort (no resources specified). These determine scheduling priority and eviction behavior. And they're immutable.

So if you try to resize a Guaranteed pod by changing only the limits (which would make it Burstable), the API server will reject it with a very polite "Pod QOS Class may not change as a result of resizing."

The fix? For Guaranteed pods, always resize requests and limits together. Keep them equal.

The Memory Shrink Hazard

Increasing memory is safe. Decreasing memory is... interesting.

Let's say you have a container with a 4GB limit, currently using 3GB, and you decide to resize down to 2GB. What happens?

In v1.35.0-rc.1, the kubelet is smart enough to say "nope." The resize enters PodResizeInProgress with an error message like "attempting to set pod memory limit below current usage." The cgroup limit doesn't decrease. The container keeps running at 4GB.

Your spec says 2GB. Reality is 4GB. The resize is stuck in limbo. And somewhere, a platform engineer is staring at this state wondering if Kubernetes is broken (it's not; it's protecting you).

The solution? Use resizePolicy to specify that memory changes should trigger a container restart:

apiVersion: v1
kind: Pod
metadata:
  name: safe-resize-app
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # Hot resize for CPU
    - resourceName: memory
      restartPolicy: RestartContainer  # Restart for memory: clean slate

Now CPU changes are instant, and memory changes trigger a controlled restart. Best of both worlds.

Native VPA: The Mechanism Works, The Intelligence... Doesn't

Here's the awkward part. VPA now supports updateMode: InPlaceOrRecreate. The mechanism works beautifully: pods resize without eviction, container IDs stay the same, everything is smooth.

But the VPA recommender is still... how do I put this delicately... not great?

VPA relies on Metrics Server, which polls every 15-60 seconds and analyzes historical averages. By the time it detects a memory spike and issues a patch, the OOM might have already occurred. It's reactive, not predictive. It sees "high memory usage" but doesn't know if that's a leak, valid cache expansion, or normal JVM heap behavior. So it scales up blindly (hello, cost waste) or hesitates to scale down (hello, permanently oversized pods).

The API is production-ready. The intelligence layer isn't.
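If you want to kick the tires anyway, the wiring looks roughly like this. A minimal sketch, assuming the VPA components are installed and your VPA version supports the new mode; the Deployment name is hypothetical:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-backend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-backend               # hypothetical workload
  updatePolicy:
    updateMode: InPlaceOrRecreate   # resize in place, fall back to recreation if it can't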

And that's kind of the theme of this whole release, isn't it? Kubernetes gives you the primitives. You provide the smarts.


Gang Scheduling: Finally, Native "All-or-Nothing"

If you're in the AI/ML space, this is your headline feature: Gang Scheduling has landed as an alpha feature.

The Problem

You're training a distributed model. It needs 100 GPUs. The scheduler places 95 pods successfully, but then hits capacity. Now you've got 95 pods sitting there, holding 95 expensive GPUs, waiting for 5 more that might never come.

Meanwhile, other jobs are starving because those 95 GPUs are locked. You've created a deadlock. The cluster is effectively stuck. Someone's burning budget. Someone else is burning out.

Previously, solving this required external schedulers like Volcano or Kueue. They work great! But they're external dependencies with their own learning curves, deployment complexities, and operational overhead.

The v1.35 Solution

Kubernetes v1.35 introduces native gang scheduling via the new Workload API. Here's how it works:

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: distributed-training
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 10  # All-or-nothing: need all 10 to start
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  parallelism: 10
  completions: 10
  template:
    spec:
      workloadRef:           # Real Pod field, not an annotation!
        name: distributed-training
        podGroup: workers
      containers:
      - name: trainer
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1

The scheduler sees workloadRef and holds all pods until it can place the entire gang. No partial allocation. No deadlock. No wasted GPU-hours.

It's elegant. It's native. It's... alpha.

The Native Scheduler Gap

Here's the catch: the native scheduler's gang scheduling implementation is basic. It handles the mechanics (don't schedule anything until you can schedule everything), but it doesn't handle the economics.

There's no queue management. No fair-share policies. No sophisticated backfill. No preemption intelligence for gang workloads.

For serious AI supercomputing, you'll still want Volcano or Kueue managing the queue strategy. The difference is now Kubernetes handles the gang semantics natively, and your external orchestrator handles the scheduling policy.

It's a division of labor. Kubernetes is the kernel. Your orchestrator is the user space. Sounds familiar? :)


Opportunistic Batching: Not What You Think

Opportunistic Batching (KEP-5598) graduated to beta and is enabled by default. The name makes it sound like Kubernetes will now schedule 1,000 identical pods in one massive batch operation.

That's not what this is.

What It Actually Does

When the scheduler places a pod, it might keep the ranked node list in a small cache. For the very next pod with an identical "scheduling signature," it can return a hint: "try node X first." It's not "schedule 1,000 pods at once." It's "maybe skip some work for pod #2 if it's identical to pod #1."

It's opportunistic. It's a micro-optimization. And it comes with two non-obvious requirements.

Requirement 1: Pods Must Be "Signable"

The scheduler computes a signature from all the fields that affect placement. If any scheduler plugin can't produce a signature fragment, the whole pod becomes "unbatchable."

On day one of testing v1.35.0-rc.1, we hit this immediately. PodTopologySpread with system-default constraints blocked signatures for every pod. Zero batching happened: not because our pods were different, but because a default plugin refused to sign them.

The fix is to disable system-default topology constraints:

# /etc/kubernetes/kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: List
      defaultConstraints: []  # Disable defaults for batching

Requirement 2: Pods Must "Fill" Nodes

The scheduler only reuses the cached ranking if the previously chosen node becomes infeasible for the next pod. If the node still has capacity, it flushes the cache (node_not_full) because reusing might cause suboptimal packing.

Translation: Batching works great for "fat" pods that fill entire nodes (think GPU workers consuming 8 cores, 64GB RAM, 1-8 GPUs). For "tiny" microservice pods that pack 50-to-a-node, batching can't safely reuse hints.

Who Benefits?

Distributed training jobs with node-filling workers. The pods are identical, they're large, and they benefit from both gang scheduling (all-or-nothing placement) and batching (faster scheduling decisions).

For dense microservice workloads? Don't expect miracles. And that's by design: the scheduler is protecting against suboptimal bin-packing.

Diagnostic metrics to watch:

  • scheduler_batch_attempts_total{result="hint_used"} → Batching is working

  • scheduler_batch_cache_flushed_total{reason="node_not_full"} → Pods too small

  • scheduler_batch_cache_flushed_total{reason="pod_not_batchable"} → Signature problems
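If you want these on a dashboard, a couple of rough PromQL sketches (metric names as listed above; adjust labels to your scrape setup):

sum(rate(scheduler_batch_attempts_total{result="hint_used"}[5m]))
sum by (reason) (rate(scheduler_batch_cache_flushed_total[5m]))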


HPA Gets Precision Tuning 🎯

The Horizontal Pod Autoscaler's fixed 10% tolerance has been a pain point forever. For a 1,000-replica deployment, that's a 100-pod dead zone where HPA just... won't react.

Kubernetes v1.35 promotes Configurable Tolerance to beta (enabled by default). You can now set tolerance per HPA and, even better, set it differently for scale-up versus scale-down.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 10
  maxReplicas: 500
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      tolerance: 0.02  # 2%: respond faster to traffic spikes
      stabilizationWindowSeconds: 60
    scaleDown:
      tolerance: 0.15  # 15%: conservative, avoid thrashing
      stabilizationWindowSeconds: 300

This asymmetric pattern (tight scale-up, loose scale-down) maps perfectly to how humans handle incidents: scale up on smoke, scale down on proof.

Quick gotchas:

  • Tolerance is stored as Quantity, so 0.02 becomes 20m in the API (don't be confused by the format)

  • Two HPAs targeting the same workload = silent failure with AmbiguousSelector

  • There's a warm-up period after HPA creation where you might see "did not receive metrics"

But overall? This is a clean, practical improvement that makes HPA more usable for high-scale workloads.


Security: The Year of Not Leaking Credentials

Image Pull Credential Verification (Beta, Default On)

Here's a fun multi-tenant security gap that's finally closed: Previously, if Tenant A pulled a private image with valid credentials, Tenant B could use that cached image without any credentials at all. The kubelet only verified on first pull.

In v1.35, the KubeletEnsureSecretPulledImages feature is enabled by default. The kubelet now re-validates credentials for every pod, even if the image is already cached locally.

This means:

  • The image cache is no longer a "free pass" in multi-tenant clusters

  • If a pull secret expires or rotates, pods that previously started fine (due to caching) will now fail with ImagePullBackOff

  • You need to monitor pull secret expiry and treat cache-dependent startups as a bug

The feature is configurable via imagePullCredentialsVerificationPolicy in KubeletConfiguration:

  • AlwaysVerify: the default in v1.35; check credentials for every pod

  • NeverVerify: the old behavior (insecure)

  • NeverVerifyAllowlistedImages: skip verification only for specific image patterns
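Here's roughly where that knob lives, as a KubeletConfiguration fragment (merge into your existing config; the allowlist mechanics for the third mode live in a separate field, so check the kubelet config reference for your exact version):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imagePullCredentialsVerificationPolicy: AlwaysVerify   # or NeverVerify / NeverVerifyAllowlistedImages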

Structured Authentication Config (GA)

Multiple OIDC providers without restarting the API server? Yes, please.

The old way involved juggling --oidc-* flags, restarting the API server every time you wanted to add a provider, and generally feeling like you were living in 2015.

Kubernetes v1.35 graduates Structured Authentication Configuration to GA. You now use a dedicated config file:

# /etc/kubernetes/auth-config.yaml

apiVersion: apiserver.config.k8s.io/v1
kind: AuthenticationConfiguration
jwt:
# Production IdP for humans
- issuer:
    url: https://okta.example.com
    audiences:
    - production-cluster
  claimMappings:
    username:
      expression: 'claims.email.split("@")[0]'
    groups:
      expression: 'claims.groups.map(g, "okta:" + g)'
  claimValidationRules:
  - expression: 'claims.exp - claims.iat <= 3600'
    message: "Token lifetime cannot exceed 1 hour"

# CI/CD IdP for pipelines
- issuer:
    url: https://gitlab.example.com
    audiences:
    - ci-cluster
  claimMappings:
    username:
      claim: preferred_username
    groups:
      claim: roles

Enable it with --authentication-config=/etc/kubernetes/auth-config.yaml on the API server, and you're done. Multiple providers, dynamic reloads, CEL expressions for custom logic. It's beautiful.

Pod Certificates (Beta)

Native workload identity without external controllers, CRDs, or sidecars. The kubelet generates keys, requests certificates via PodCertificateRequest, writes credential bundles directly to the Pod's filesystem, and auto-rotates.

It's still beta and disabled by default (you need to enable certificates.k8s.io/v1beta1 and the PodCertificateRequest feature gate), but it's the future of service mesh and zero-trust architectures.

Pure mTLS flows with no bearer tokens in the issuance path. The kube-apiserver enforces node restriction at admission time. It's elegant, it's secure, and it's coming.


Quick Hits: Other Cool Stuff

Because this post is already longer than a CVS receipt, here are some other notable features in rapid-fire mode:

  1. Deployment Terminating Replicas (Beta): Ever seen a rolling update trigger quota errors despite having capacity? The controller was ignoring terminating pods when counting replicas. v1.35 adds .status.terminatingReplicas so you can finally see the overlap. Mystery solved.

  2. Storage Version Migration (Beta): Native support for migrating stored data to new schema versions, no external tools needed. Historically this required manual "read/write loops" piping kubectl commands together like it's 1999.

  3. StatefulSet MaxUnavailable (Beta): Parallel updates for StatefulSets! Set maxUnavailable and watch multiple pods update simultaneously instead of one-at-a-time. Perfect for stateful apps that can tolerate some downtime. (See the sketch after this list.)

  4. KYAML (Beta): A safer, less ambiguous subset of YAML designed specifically for Kubernetes. Addresses the infamous "Norway Bug" and other YAML footguns. Enabled by default (disable with KUBECTL_KYAML=false).

  5. User Namespaces (Beta): Containers can run as root internally while being mapped to unprivileged users on the host. Reduces privilege escalation risk if a container gets compromised.

  6. Node Declared Features (Alpha): Nodes can advertise their supported feature gates via .status.declaredFeatures. The scheduler uses this to avoid placing pods on incompatible nodes during mixed-version upgrades. Finally, a real answer for heterogeneous clusters.

  7. Extended Toleration Operators (Alpha): Tolerations can now use numeric comparison: "only schedule on nodes with SLA > 95%." Auto-evict if it drops below threshold. Numeric intent for placement!
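For item 3, a minimal sketch of what that looks like on a StatefulSet (workload name and image are placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache                       # hypothetical StatefulSet
spec:
  serviceName: cache
  replicas: 6
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
      - name: redis
        image: redis:7
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2             # roll up to two pods at once instead of one-at-a-time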


The Philosophical Shift (Or: The Part Where I Get Existential)

Here's the thing about Kubernetes v1.35 that nobody's saying out loud: it's clarifying the project's identity in a way that might make some people uncomfortable.

A few years ago, the expectation was that upstream Kubernetes would deliver production-grade autoscaling, intelligent scheduling, and operational maturity out of the box. Native VPA would get smarter. HPA would understand seasonality. The scheduler would learn topology economics and cost optimization.

That hasn't happened. And based on the trajectory of v1.35, it's probably not going to.

Kubernetes is choosing to be a kernel.

It's focusing on:

  • ✅ Robust, low-level primitives (in-place resize, DRA, gang scheduling)

  • ✅ Safe, performant APIs (structured auth, pod certificates, storage migration)

  • ✅ Well-defined extension points (Workload API, DRA device classes)

It's leaving to users:

  • ❌ When to resize (intelligence, not mechanism)

  • ❌ Where to place AI workloads (economics, not mechanics)

  • ❌ How to optimize bin-packing (strategy, not structure)

The native controllers (HPA, VPA, the default scheduler) are becoming reference implementations, not production-grade optimization engines.

And you know what? That's probably the right call for an open-source project at this scale. You can't be everything to everyone. Focus on the primitives, nail the APIs, and let the ecosystem build the intelligence layer.

But it does mean the burden shifts. Platform teams need to either:

  1. Build intelligence layers themselves (hard, but flexible)

  2. Adopt ecosystem tools that provide intelligence (easier, but adds dependencies)

  3. Accept the limitations of native controllers (simplest, but leaves optimization on the table)

There's no wrong answer. But there is a choice to make.


The Pre-Upgrade Checklist (aka "How to Not Ruin Your Weekend")

Before you upgrade to v1.35, here's what you absolutely, positively need to check:

🔴 BLOCKERS (Fix These Or Don't Upgrade)

cgroup v2 on all nodes

stat -fc %T /sys/fs/cgroup  # Must show "cgroup2fs"

containerd 2.0 or later

kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}'

If either of these fails, stop. Do not pass Go. Do not collect $200. Fix your infrastructure first.

🟠 HIGH PRIORITY (Fix Soon After Upgrade)

  • Scan for Docker Schema 1 images: skopeo inspect every image in your registries

  • Verify kube-proxy mode: make sure you're not on IPVS (or have a migration plan)

  • Update containerd configs: remove the deprecated registry structures

  • Check image pull secrets: credential verification is enabled by default now

  • RBAC for exec/attach: add the create verb for pod subresources (see the sketch below)
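A hedged sketch of that RBAC item (namespace and role name are placeholders; the point is granting create, and keeping get, on the streaming subresources):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-streaming
  namespace: default                # placeholder namespace
rules:
- apiGroups: [""]
  resources: ["pods/exec", "pods/attach", "pods/portforward"]
  verbs: ["create", "get"]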

🟡 MEDIUM PRIORITY (Plan For It)

  • Run kubepug or pluto to find deprecated APIs in your manifests

  • Consider switching VPA to InPlaceOrRecreate mode

  • Evaluate maxUnavailable for StatefulSets that can tolerate parallel updates

  • Test HPA configurable tolerance for high-scale workloads

Post-Upgrade Validation

# Verify PSI is available

cat /proc/pressure/memory

# Test in-place resize

kubectl patch pod test-pod --subresource resize --type='merge' -p '{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"512Mi"}}}]}}'

# Check scheduling batching metrics (Prometheus)

scheduler_batch_attempts_total{result="hint_used"}

# Verify feature gates

kubectl get --raw /metrics | grep -i "kubernetes_feature_enabled"

The Bottom Line: What Should You Do?

If you're a platform engineer: Treat this as a checkpoint release. Your technical debt is due. Schedule time for infrastructure upgrades, not just manifest changes. Test cgroup v2, plan containerd 2.0 migration, audit your RBAC policies. This isn't optional.

If you're running AI/ML workloads: Explore gang scheduling and batching, but understand they're primitives. You still need orchestration intelligence for production-grade queue management. The scheduler handles mechanics; you provide strategy.

If you operate stateful workloads: In-place resize is production-ready. Test it. Love it. Deploy it. But if you're using native VPA, keep expectations realistic,the mechanism is GA, the intelligence isn't.

If you're a developer: Most of this won't affect you directly. Enjoy the more precise HPA, safer YAML with KYAML, and better Pod lifecycle tracking. Your platform team is handling the hard stuff.


A Note for Managed Kubernetes Users (EKS, AKS, GKE, etc.)

If you're running on a managed Kubernetes service, your upgrade experience will be different, and in many ways easier.

The Good News:

  • No cgroup v2 migration pain: Cloud providers handle node OS upgrades. By the time EKS/AKS/GKE offer v1.35, their node images will already be running cgroup v2-compatible systems.

  • Automatic containerd updates: Managed services bundle compatible container runtime versions. You won't manually upgrade containerd.

  • Control plane upgrades managed: API server, scheduler, and controller-manager upgrades happen with a button click (or API call).

What You Still Need to Handle:

  • Application compatibility: Test your workloads with v1.35 API changes, especially if you use newer beta features.

  • RBAC updates: The WebSocket permission changes for exec/attach/portforward still affect your roles and service accounts.

  • Image pull secrets: The stricter credential verification affects all clusters. Audit your pull secret lifetimes and rotation policies.

  • Cost implications: New features like in-place resize and better autoscaling can improve efficiency, but you need to configure them. Managed Kubernetes doesn't automatically optimize your resource usage.

Timeline Expectations:

  • EKS typically lags 2-4 weeks behind upstream releases

  • GKE usually within 2-3 weeks for Rapid channel, longer for Regular/Stable

  • AKS generally 2-4 weeks post-release

Check your provider's release notes; they often add their own enhancements or defer certain alpha features.

Bottom Line: Managed Kubernetes handles infrastructure concerns, but you're still responsible for application architecture, resource optimization, and operational intelligence. v1.35's primitives are available to you; making them useful is still your job.


Final Thoughts

Kubernetes v1.35 is a fascinating release. It's not the most feature-packed release we've ever seen (v1.16's "The One With Everything" still holds that crown). But it's clarifying. It's opinionated about what Kubernetes is and what it isn't.

It's a kernel. It provides primitives. It's your job to make them smart.

And honestly? That's liberating. Because now we know where we stand. The expectations are clear. The division of labor is explicit.

Build on the primitives. Fill the intelligence gaps. Make something amazing.

The World Tree has another growth ring. What will you build in its branches?


Written with chai, mild existential dread, and genuine excitement for the future of Kubernetes. May your upgrades be smooth and your cgroups be v2. ☕️

Happy upgrading! 🐿️