<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ping to Production]]></title><description><![CDATA[Ping to Production]]></description><link>https://blogs.akshatsinha.dev</link><generator>RSS for Node</generator><lastBuildDate>Fri, 01 May 2026 11:38:46 GMT</lastBuildDate><atom:link href="https://blogs.akshatsinha.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[LiteLLM Supply Chain Attack: The AI Package That Turned Into a Fork Bomb (But Stole Your AWS Keys First)]]></title><description><![CDATA[Why I run my AI agents in isolated infra , and why March 24th proved me right

I get a lot of raised eyebrows when I tell people I run all my AI agent experiments in isolated infrastructure. Separate ]]></description><link>https://blogs.akshatsinha.dev/litellm-supply-chain-attack</link><guid isPermaLink="true">https://blogs.akshatsinha.dev/litellm-supply-chain-attack</guid><category><![CDATA[Devops]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Python]]></category><category><![CDATA[Security]]></category><category><![CDATA[litellm]]></category><dc:creator><![CDATA[Akshat Sinha]]></dc:creator><pubDate>Wed, 25 Mar 2026 11:39:25 GMT</pubDate><content:encoded><![CDATA[<p><em>Why I run my AI agents in isolated infra , and why March 24th proved me right</em></p>
<hr />
<p>I get a lot of raised eyebrows when I tell people I run all my AI agent experiments in isolated infrastructure. Separate VMs. No cloud credentials mounted. Network egress restricted. "Isn't that overkill for hobby stuff?" Maybe. But then March 24, 2026 happened, and I spent the day watching my timeline fill with developers discovering their SSH keys, AWS credentials, Kubernetes configs, and crypto wallet seeds had just been quietly exfiltrated, courtesy of a Python package they trusted.</p>
<p>The package was <code>litellm</code>. The attack was elegant, the cleanup was brutal, and the lessons are directly relevant to anyone who uses AI tooling on their dev machine. Let me walk you through it.</p>
<hr />
<h2>What Even Is LiteLLM?</h2>
<p>Quick context if you're not deep in the AI stack: LiteLLM is a Python library that gives you one unified API to talk to 100+ LLM providers: OpenAI, Anthropic, Gemini, Bedrock, you name it. Instead of writing provider-specific code for each model, you call LiteLLM and it handles the translation layer.</p>
<p>It's extremely popular. About <strong>3.4 million downloads per day</strong> popular.</p>
<p>It gets used in two ways: as a Python SDK in your code, and as a standalone proxy server that your entire org routes model calls through. That second use case is important. When you run LiteLLM as a proxy, that one machine holds API keys for <em>every</em> AI provider your team uses. From an attacker's perspective, that's a very interesting machine to compromise.</p>
<hr />
<h2>How the Attack Got In (This Part Is Wild)</h2>
<p>The attackers didn't phish a developer. They didn't brute-force anything. They played the long game through the toolchain.</p>
<p><strong>Five days before the attack</strong>, they compromised <code>trivy-action</code>, the GitHub Action for Trivy, an open-source container security scanner made by Aqua Security. They rewrote the Git tags in the repo to point to a malicious release. LiteLLM used Trivy in its CI/CD pipeline, pulling it from <code>apt</code> without a pinned version.</p>
<p>Think about that for a second. A security scanner was the entry point.</p>
<p>When LiteLLM's CI next ran, it pulled the poisoned Trivy action, which silently exfiltrated the <code>PYPI_PUBLISH</code> token from the GitHub Actions runner environment. The attackers now had direct publish rights to the <code>litellm</code> package on PyPI.</p>
<p>The day before the attack, they registered <code>models.litellm.cloud</code>, a domain crafted to look official, ready to serve as the exfiltration endpoint.</p>
<p>On March 24, they published two malicious versions in 13 minutes.</p>
<p>Here's the detail that should give any DevOps person pause: <strong>neither version appears anywhere in the LiteLLM GitHub release history.</strong> The repo only goes up to <code>v1.82.6.dev1</code>. Versions 1.82.7 and 1.82.8 were uploaded directly to PyPI using the stolen token, bypassing every CI/CD workflow, every review, every safeguard the team had in place. The package registry was updated. The repository never was.</p>
<hr />
<h2>The Two Payloads</h2>
<p>The attackers published two versions, each with a different delivery mechanism, suggesting they were iterating in real time.</p>
<p><strong>v1.82.7</strong> embedded the malicious payload inside <code>litellm/proxy/proxy_server.py</code>. It fires when anything imports <code>litellm.proxy</code>, which is the standard import path for running LiteLLM's proxy server.</p>
<p><strong>v1.82.8</strong> went further. It added a file called <code>litellm_init.pth</code> to <code>site-packages</code>. If you're not familiar with <code>.pth</code> files: Python automatically executes them on <em>every</em> interpreter startup. Not on import. Not on first use. On every startup, including when you run <code>pip</code>, when your IDE's language server initializes, when a subprocess spawns. No <code>import litellm</code> required, ever.</p>
<p>The <code>.pth</code> payload looks like this:</p>
<pre><code class="language-python">import os, subprocess, sys
subprocess.Popen([sys.executable, "-c", "import base64; exec(base64.b64decode('...'))"])
</code></pre>
<p>Double base64-encoded, so it survives naive grep. And here's the kicker: the file is correctly declared in the wheel's <code>RECORD</code> with a valid checksum:</p>
<pre><code class="language-plaintext">litellm_init.pth,sha256=ceNa7wMJnNHy1kRnNCcwJaFjWX3pORLfMh7xGL8TUjg,34628
</code></pre>
<p><code>pip install --require-hashes</code> would pass. You're verifying you received exactly what the attacker published, and you did. The integrity guarantees of the package ecosystem assume the signing credentials are trustworthy. Once those are stolen, that assumption is gone.</p>
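<p>If you want to see this for yourself, you can recompute the RECORD-style digest of any installed file. A minimal sketch (the <code>site-packages</code> path is a placeholder for wherever your environment actually lives):</p>
<pre><code class="language-bash">python3 - &lt;&lt;'PY'
import base64, hashlib, pathlib
# hypothetical path, point this at the file you want to check
p = pathlib.Path(".venv/lib/python3.12/site-packages/litellm_init.pth")
digest = hashlib.sha256(p.read_bytes()).digest()
# RECORD entries are sha256= followed by the urlsafe-base64 digest without padding
print("sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode())
PY
</code></pre>
<p>The digest will match the RECORD entry, because the attacker generated both. Hash checking verifies transport integrity, not publisher intent.</p>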
<hr />
<h2>What the Malware Actually Did</h2>
<p>The community reverse-engineered the decoded payload within hours of the disclosure. Here's what it harvested from affected machines:</p>
<p><strong>Credentials and keys</strong>: SSH private keys (<code>id_rsa</code>, <code>id_ed25519</code>, <code>id_ecdsa</code>), <code>.env</code> files, shell history across bash/zsh/sh, <code>.netrc</code>, <code>.npmrc</code>, <code>.vault-token</code>, <code>.git-credentials</code>, <code>.gitconfig</code>.</p>
<p><strong>Cloud credentials</strong>: AWS credentials file + config, live IMDS queries with full IMDSv2 signing to pull from Secrets Manager and SSM Parameter Store, GCP application default credentials, Azure token files.</p>
<p><strong>Kubernetes</strong>: <code>~/.kube/config</code>, all of <code>/etc/kubernetes/*.conf</code>, in-cluster service account tokens and certificates.</p>
<p><strong>Containers</strong>: Docker <code>config.json</code> from every known path, including Kaniko.</p>
<p><strong>CI/CD configs</strong>: <code>terraform.tfvars</code>, <code>.gitlab-ci.yml</code>, <code>.travis.yml</code>, <code>Jenkinsfile</code>, <code>.drone.yml</code>. Anything that might contain secrets that ended up in version-controlled configs.</p>
<p><strong>Databases and services</strong>: <code>.my.cnf</code>, <code>.pgpass</code>, <code>.mongorc.js</code>, Redis CLI history.</p>
<p><strong>Crypto wallets</strong>: Bitcoin, Ethereum keystore, Solana, Cardano, Monero, Ledger hardware wallet configs.</p>
<p>Everything was encrypted (AES-256-CBC, session key wrapped with a hardcoded 4096-bit RSA public key) and POST'd to <code>https://models.litellm.cloud/</code>. Temporary files (<code>session.key</code>, <code>payload.enc</code>, <code>tpcp.tar.gz</code>) were created in <code>/tmp</code> during exfiltration.</p>
<p>If the malware found a Kubernetes service account token, it went further. It read all secrets across every namespace in the cluster, then attempted to deploy privileged <code>alpine:latest</code> pods to every node in <code>kube-system</code>, mounting the host filesystem and installing a persistent backdoor. Pods were named <code>node-setup-{node_name}</code>.</p>
<p>The backdoor itself, written to <code>~/.config/sysmon/sysmon.py</code> and registered as a systemd user service, polls <code>https://checkmarx.zone/raw</code> every five minutes for a URL and executes whatever it finds. The attacker can push live payloads to compromised machines at will.</p>
<hr />
<h2>The Bug That Saved People (Accidentally)</h2>
<p>Here's the most darkly ironic part of this whole story.</p>
<p>The <code>.pth</code> mechanism fires on every Python startup. The first thing the payload does is spawn a new Python subprocess. That subprocess also triggers <code>.pth</code> execution since <code>litellm_init.pth</code> is still in <code>site-packages</code>. Which spawns another. Which spawns another.</p>
<p>An unintended fork bomb, a bug in the malware itself.</p>
<p>This is why Callum McMahon at FutureSearch noticed anything was wrong in the first place. His 48GB Mac ground to a halt. <code>htop</code> took tens of seconds to open. 11,000 processes running. Without that mistake, the payload would have exfiltrated credentials silently in the background, planted its backdoor, cleaned up temp files, and disappeared. Nobody would have known until someone tried to use a rotated key and found it already being used.</p>
<p>As Andrej Karpathy put it on X: the malware's own poor quality is what made it visible.</p>
<hr />
<h2>The Disclosure: Community 1, Attackers 0</h2>
<p>Once Callum's team identified the malicious package, they posted a detailed technical disclosure in <a href="https://github.com/BerriAI/litellm/issues/24512">GitHub issue #24512</a> at 11:48 UTC. It hit Hacker News about 45 minutes later and reached 324 points.</p>
<p>The attackers responded by flooding the issue with <strong>88 bot comments from 73 previously-compromised developer accounts</strong> in a 102-second window. Then they used the stolen <code>krrishdholakia</code> maintainer account, the actual LiteLLM CEO's account, to close issue #24512 as "not planned."</p>
<p>The community opened <a href="https://github.com/BerriAI/litellm/issues/24518">a new tracking issue (#24518)</a>, noted what had happened, and kept the discussion alive on Hacker News. PyPI quarantined both versions at ~13:38 UTC. Total exposure window: about three hours.</p>
<p>By 15:09 UTC, the LiteLLM maintainers confirmed all GitHub, Docker, and PyPI credentials had been rotated and maintainer accounts moved to new identities. Google's Mandiant team was brought in for forensic analysis of the build pipeline.</p>
<p>Major downstream projects (DSPy, MLflow, CrewAI, OpenHands, Arize Phoenix) filed emergency PRs to pin away from the compromised versions the same day.</p>
<hr />
<h2>This Wasn't a One-Off. It Was Phase 09.</h2>
<p>The group behind this, tracked as <strong>TeamPCP</strong>, has been running an ongoing campaign since at least December 2025. LiteLLM was Phase 09.</p>
<p>The same RSA public key appears in the Trivy, KICS (a Checkmarx IaC scanner), and LiteLLM payloads. Same <code>tpcp.tar.gz</code> naming. Same infrastructure registrar. The target selection across all three is deliberate: each is a tool that requires elevated, broad access to the systems it operates on. A container scanner, an IaC scanner, an LLM gateway: all of them sit deep inside CI/CD pipelines and developer machines, with legitimate reasons to read credentials.</p>
<p>TeamPCP also deployed something called CanisterWorm, which uses the Internet Computer Protocol (ICP) as a C2 channel. ICP canisters can't be taken down by domain registrars or hosting providers. They're also apparently using an AI agent for automated attack targeting. Supply chain attacks are now getting automated. Fun times.</p>
<hr />
<h2>What This Means If You Run AI Tooling (Read: Probably You)</h2>
<p>Here's the thing that makes this incident different from a typical npm leftpad situation. The AI developer ecosystem has converged on patterns that are genuinely great for productivity and genuinely terrible for security:</p>
<p><code>uvx</code> <strong>and</strong> <code>npx</code> <strong>auto-pull the latest version of everything.</strong> When Cursor loads an MCP server, it runs it via <code>uvx</code>, which automatically resolves and downloads dependencies. Unpinned, from the internet, on your dev machine, which has your AWS credentials, SSH keys, and Kubernetes config sitting in well-known default locations that haven't changed in twenty years.</p>
<p><strong>Transitive dependencies are invisible.</strong> Callum didn't install litellm. His MCP server had an unpinned litellm dependency. <code>uvx</code> pulled the latest version, which happened to have been maliciously published 13 minutes earlier. The attack surface was a dependency of a plugin of an IDE.</p>
<p><strong>LLM gateways are credential aggregators by design.</strong> If you're running LiteLLM as a proxy, which is the recommended production pattern, that machine holds API keys for every model provider you use. Compromising it is a one-stop shop.</p>
<p>For what it's worth: this is exactly why I run AI experiments in isolated infra. Not because I'm paranoid, but because the ergonomics of the AI tooling ecosystem (auto-pulling dependencies, local execution, broad filesystem access) are a different threat model than running a web server. A compromised nginx config doesn't exfiltrate your AWS credentials. A compromised Python package that fires on every interpreter startup might.</p>
<hr />
<h2>What You Should Actually Do</h2>
<p><strong>If you installed litellm between 10:39 and ~13:38 UTC on March 24, 2026</strong>, assume the machine is compromised regardless of whether you ran any application code. The <code>.pth</code> mechanism fires during <code>pip install</code> itself.</p>
<p>Check for the persistence backdoor:</p>
<pre><code class="language-bash">ls ~/.config/sysmon/sysmon.py
systemctl --user status sysmon.service
</code></pre>
<p>Check for the <code>.pth</code> file:</p>
<pre><code class="language-bash">find $(python3 -c "import site; print(' '.join(site.getsitepackages()))") \
  -name "*.pth" -exec grep -l "base64\|subprocess\|exec" {} \;
</code></pre>
<p>Check Kubernetes:</p>
<pre><code class="language-bash">kubectl get pods -A | grep node-setup-
</code></pre>
<p>Then rotate everything: SSH keys, cloud credentials, API keys, database passwords, Kubernetes tokens. Audit AWS Secrets Manager and SSM Parameter Store if instance metadata was accessible. It's a brutal checklist but a necessary one.</p>
<p><strong>Going forward</strong>, regardless of whether you were affected:</p>
<ul>
<li><p><strong>Pin your dependencies.</strong> Use lock files with checksums. Unpinned transitive dependencies are your attack surface (see the sketch after this list).</p>
</li>
<li><p><strong>Audit</strong> <code>.pth</code> <strong>files in your environments.</strong> Most legitimate packages don't install them. If you see one you don't recognize: that's a red flag.</p>
</li>
<li><p><strong>Treat your dev machine like it has prod credentials.</strong> Because it probably does.</p>
</li>
<li><p><strong>If you run MCP servers locally</strong>, check their dependency manifests. Anything pulling in unpinned versions of large, popular libraries is an exposure.</p>
</li>
<li><p><strong>Consider isolated infra for AI agent experiments.</strong> A VM with no cloud credentials mounted, egress restricted to what it actually needs. Yes, it's friction. It's also a lot less friction than rotating all your credentials and auditing your Kubernetes cluster.</p>
</li>
</ul>
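<p>As a concrete starting point for the first item, here's one way to get a fully pinned, hash-locked environment with pip-tools. A sketch, assuming a <code>requirements.in</code> that lists only your direct dependencies:</p>
<pre><code class="language-bash"># pin every direct and transitive dependency, with sha256 hashes
pip install pip-tools
pip-compile --generate-hashes requirements.in -o requirements.txt

# refuse to install anything that isn't in the lock file
pip install --require-hashes -r requirements.txt
</code></pre>
<p>Keep the caveat from earlier in mind: hashes only prove you got what was published. The real protection is pinning to versions that predate any attack window and bumping them deliberately, not automatically.</p>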
<hr />
<h2>The Thing That Sticks With Me</h2>
<p>The AI tooling security conversation usually centers on prompt injection: tricking LLMs into doing bad things, the "lethal trifecta" of tool use, memory, and exfiltration. That's a real and evolving threat.</p>
<p>But the attack that actually hit people on March 24th required no AI manipulation whatsoever. No jailbreaking. No clever prompt. Just stolen CI/CD credentials, a malicious PyPI upload, and Python's decades-old <code>.pth</code> mechanism doing exactly what it was designed to do. The most sophisticated-looking threat in the AI ecosystem was beaten by the oldest trick in the supply chain book.</p>
<p>The irony is that LiteLLM, a tool purpose-built to manage access to AI systems, became the delivery vehicle for an attack that had nothing to do with AI at all. It was just a package. With dependencies. In a pipeline. Like everything else.</p>
<p>Pin your dependencies. Isolate your infra. And maybe double-check which security scanners your CI/CD is pulling.</p>
<hr />
<p><em>References:</em> <a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/"><em>FutureSearch: the original technical disclosure</em></a> <em>|</em> <a href="https://futuresearch.ai/blog/no-prompt-injection-required/"><em>FutureSearch: first-person account</em></a> <em>|</em> <a href="https://docs.litellm.ai/blog/security-update-march-2026"><em>LiteLLM official security update</em></a> <em>|</em> <a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/"><em>Snyk deep-dive</em></a> <em>|</em> <a href="https://github.com/BerriAI/litellm/issues/24512"><em>GitHub #24512: original disclosure</em></a> <em>|</em> <a href="https://github.com/BerriAI/litellm/issues/24518"><em>GitHub #24518: clean tracking issue</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Terraform vs Crossplane: The Ultimate DevOps Infrastructure Showdown]]></title><description><![CDATA[The Infrastructure Management Odyssey
Imagine it's 2 AM, and you are in a digital wrestling match with cloud configurations that seem to have a mind of their own. As a DevOps engineer, I've been there, drowning in a sea of manual deployments, battlin...]]></description><link>https://blogs.akshatsinha.dev/terraform-vs-crossplane-iac-guide</link><guid isPermaLink="true">https://blogs.akshatsinha.dev/terraform-vs-crossplane-iac-guide</guid><category><![CDATA[Infrastructure as code]]></category><category><![CDATA[#IaC]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[CloudProvisioning]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[gitops]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Azure]]></category><dc:creator><![CDATA[Akshat Sinha]]></dc:creator><pubDate>Tue, 20 Jan 2026 03:30:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768660557902/9b2f7a0e-ea44-4af8-b436-9b7bc170a7b4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-infrastructure-management-odyssey">The Infrastructure Management Odyssey</h2>
<p>Imagine it's 2 AM, and you are in a digital wrestling match with cloud configurations that seem to have a mind of their own. As a DevOps engineer, I've been there, drowning in a sea of manual deployments, battling configuration drift, and desperately seeking a way to bring order to infrastructure chaos.</p>
<p>Multiple cloud providers, endless configuration files, and the constant fear of inconsistent deployments have haunted me for days. Enter the game-changer: Infrastructure as Code (IaC).</p>
<h2 id="heading-meet-the-infrastructure-provisioning-titans">Meet the Infrastructure Provisioning Titans</h2>
<h3 id="heading-terraform-the-established-veteran">Terraform: The Established Veteran</h3>
<p>Developed by HashiCorp, Terraform has been the backbone of infrastructure provisioning for years. With its declarative HashiCorp Configuration Language (HCL), it's essentially the Swiss Army knife of cloud infrastructure. Describe your entire infrastructure as code, version control it, and deploy across multiple cloud providers with surgical precision.</p>
<h3 id="heading-crossplane-the-cloud-native-disruptor">Crossplane: The Cloud-Native Disruptor</h3>
<p>If Terraform is the seasoned veteran, Crossplane is the innovative newcomer challenging the status quo. Built with a Kubernetes-native approach, Crossplane reimagines infrastructure management by leveraging Kubernetes Custom Resource Definitions (CRDs). Applying a YAML manifest to create a K8s cluster has its own sense of satisfaction.</p>
<hr />
<h2 id="heading-deep-dive-technical-comparison">Deep Dive: Technical Comparison</h2>
<h3 id="heading-flexibility-and-reach">Flexibility and Reach</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Dimension</td><td>Terraform</td><td>Crossplane</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Provider Support</strong></td><td>100+ cloud providers</td><td>Multi-cloud with Kubernetes-native approach</td></tr>
<tr>
<td><strong>Configuration Language</strong></td><td>Custom HCL</td><td>Kubernetes YAML</td></tr>
<tr>
<td><strong>State Management</strong></td><td>Explicit state files</td><td>Stateless, Kubernetes reconciliation</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-detailed-configuration-examples">Detailed Configuration Examples</h2>
<h3 id="heading-terraform-aws-ec2-instance-deployment">Terraform: AWS EC2 Instance Deployment</h3>
<p>Provision a basic web server:</p>
<pre><code class="lang-hcl">resource "aws_instance" "web_server" {
  # Specific Amazon Machine Image (AMI)
  ami           = "ami-0c55b159cbfafe1f0"

  # Instance type selection
  instance_type = "t2.micro"

  # Resource tagging for management
  tags = {
    Name = "WebServer"
    Environment = "Production"
    ManagedBy = "Terraform"
  }
}
</code></pre>
<h3 id="heading-crossplane-kubernetes-native-resource-provisioning">Crossplane: Kubernetes-Native Resource Provisioning</h3>
<p>Crossplane resource definition for AWS EC2 instance:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">ec2.aws.upbound.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Instance</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web-server-crossplane</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">forProvider:</span>
    <span class="hljs-comment"># Identical AMI and instance type</span>
    <span class="hljs-attr">imageId:</span> <span class="hljs-string">ami-0c55b159cbfafe1f0</span>
    <span class="hljs-attr">instanceType:</span> <span class="hljs-string">t2.micro</span>

    <span class="hljs-comment"># Enhanced metadata and region specification</span>
    <span class="hljs-attr">region:</span> <span class="hljs-string">us-east-1</span>
    <span class="hljs-attr">tags:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">Name</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">WebServer</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">Environment</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">Production</span>
</code></pre>
<hr />
<h2 id="heading-performance-and-architectural-considerations">Performance and Architectural Considerations</h2>
<h3 id="heading-1-terraforms-approach">1. Terraform's Approach</h3>
<p><strong>State Management:</strong> Maintains explicit state files.</p>
<h3 id="heading-pros"><strong>Pros:</strong></h3>
<ul>
<li>Predictable infrastructure tracking</li>
<li>Detailed change planning</li>
</ul>
<h3 id="heading-cons"><strong>Cons:</strong></h3>
<ul>
<li>Potential state drift (see the quick drift check after this list)</li>
<li>Requires careful state file management</li>
</ul>
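<p>For the drift concern specifically, Terraform can report out-of-band changes without modifying anything. A minimal sketch, assuming your backend and providers are already initialized:</p>
<pre><code class="lang-bash"># compare real infrastructure against the recorded state, change nothing
terraform plan -refresh-only

# optionally accept the detected differences into the state file
terraform apply -refresh-only
</code></pre>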
<h3 id="heading-2-crossplanes-strategy">2. Crossplane's Strategy</h3>
<p><strong>Kubernetes Native Reconciliation:</strong> Stateless resource management.</p>
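<p>Because every provisioned resource is a Kubernetes object, you can watch that reconciliation directly. A quick sketch against the EC2 instance from the earlier example, assuming the provider CRDs expose the standard <code>managed</code> category:</p>
<pre><code class="lang-bash"># list everything Crossplane is currently reconciling
kubectl get managed

# inspect the Synced / Ready conditions on a single resource
kubectl describe instance.ec2.aws.upbound.io web-server-crossplane
</code></pre>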
<h3 id="heading-pros-1"><strong>Pros:</strong></h3>
<ul>
<li>Dynamic resource composition</li>
<li>Seamless GitOps workflows</li>
</ul>
<h3 id="heading-cons-1"><strong>Cons:</strong></h3>
<ul>
<li>Steeper learning curve</li>
<li>Kubernetes dependency</li>
</ul>
<hr />
<h2 id="heading-when-to-choose-what">When to Choose What</h2>
<h3 id="heading-terraform-is-your-best-bet-if">Terraform is Your Best Bet If:</h3>
<ul>
<li>You require extensive multi-cloud support.</li>
<li>Your team is comfortable with the HashiCorp ecosystem.</li>
<li>You need complex, stateful infrastructure management.</li>
<li>Detailed change planning is crucial.</li>
</ul>
<h3 id="heading-crossplane-shines-when">Crossplane Shines When:</h3>
<ul>
<li>Kubernetes is central to your infrastructure strategy.</li>
<li>You embrace GitOps principles.</li>
<li>Dynamic, composable infrastructure is a priority.</li>
<li>You want tighter integration with cloud-native tools.</li>
</ul>
<h2 id="heading-hybrid-approach-the-best-of-both-worlds">Hybrid Approach: The Best of Both Worlds</h2>
<p>In the world of infrastructure management, adopting a hybrid approach can be a game-changer. Instead of rigidly choosing between Terraform and Crossplane, consider them as complementary tools.</p>
<p>Use Terraform for initial, comprehensive infrastructure setup across cloud providers, and then leverage Crossplane's dynamic Kubernetes-native capabilities for ongoing, flexible management. This strategy allows you to implement each tool's unique strengths precisely where they provide the most value, creating a more adaptive and powerful infrastructure provisioning ecosystem.</p>
<hr />
<h2 id="heading-the-human-element-in-infrastructure-as-code">The Human Element in Infrastructure as Code</h2>
<p>Remember, no tool is universally perfect. The right choice depends on:</p>
<ol>
<li><strong>Infrastructure Needs:</strong> Your specific technical requirements serve as the primary navigation compass. Understanding the unique architectural demands of your project is essential.</li>
<li><strong>Team Expertise:</strong> The skill set and comfort level of your team influence tool selection. A tool that aligns with your team's existing knowledge can speed up implementation and reduce the learning curve.</li>
<li><strong>Cloud Environment Complexity:</strong> Whether you're managing a simple single-cloud deployment or a complex multi-cloud ecosystem, your chosen tool must provide the flexibility and robustness to handle your current and future infrastructure landscape.</li>
<li><strong>Long-term Vision:</strong> Look beyond immediate requirements. Select a tool that can scale, adapt, and support your architectural roadmap, ensuring your infrastructure can evolve seamlessly with your organizational growth and technological ambitions.</li>
</ol>
<blockquote>
<p><strong>P.S.</strong> If you are on AWS, do check out my colleague's article on Karpenter and how it helped us move from Reactive Scaling to Developer-Aware scaling: <a target="_blank" href="https://www.linkedin.com/pulse/autoscaling-evolved-our-journey-karpenter-kamal-acharya-ml9dc">Autoscaling Evolved: Our Journey with Karpenter</a></p>
</blockquote>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p><em>Infrastructure as Code isn't just about selecting the right provisioning tool. It's about creating predictable, manageable, and scalable environments that adapt to your organization's evolving needs.</em></p>
]]></content:encoded></item><item><title><![CDATA[Exploring Kubernetes v1.35 'Timbernetes': All About the World Tree Version]]></title><description><![CDATA[Author's Note: This piece was inspired by Nicolas Vermandé comprehensive analysis at ScaleOps. His work shaped the structure and focus of my coverage. Check out his article for additional insights.

The Release That Makes You Choose: Upgrade or Archa...]]></description><link>https://blogs.akshatsinha.dev/kubernetes-1-35</link><guid isPermaLink="true">https://blogs.akshatsinha.dev/kubernetes-1-35</guid><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Developer]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Google Cloud Platform]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Akshat Sinha]]></dc:creator><pubDate>Fri, 19 Dec 2025 03:00:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766089802181/4b98e0c5-d029-49dd-b7b7-a3c72c0d56c5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><em>Author's Note:</em></strong> <em>This piece was inspired by</em> <a target="_blank" href="https://www.linkedin.com/in/vnicolas/"><strong><em>Nicolas Vermandé</em></strong></a> <em>comprehensive analysis at ScaleOps. His work shaped the structure and focus of my coverage. Check out</em> <a target="_blank" href="https://scaleops.com/blog/kubernetes-1-35-release-overview/"><em>his article</em></a> <em>for additional insights.</em></p>
<hr />
<h2 id="heading-the-release-that-makes-you-choose-upgrade-or-archaeology"><strong>The Release That Makes You Choose: Upgrade or Archaeology</strong></h2>
<p>Picture this: You're sipping your morning coffee, scrolling through your Slack, and someone drops the Kubernetes v1.35 release notes. "Another quarterly release," you think. "Probably just some minor tweaks and the usual beta promotions."</p>
<p><strong>Wrong.</strong> So very, very wrong.</p>
<p>Kubernetes v1.35 is the release equivalent of your apartment landlord saying "we're doing renovations" and then showing up with a wrecking ball. This isn't just another feature release; it's an infrastructure intervention. And if you're still running CentOS 7 nodes... well, I'm not saying you should panic, but maybe start stress-testing your resume template.</p>
<p>Let me explain.</p>
<hr />
<h2 id="heading-the-theme-yggdrasil-squirrels-and-existential-questions-about-operating-systems"><strong>The Theme: Yggdrasil, Squirrels, and Existential Questions About Operating Systems</strong></h2>
<p>First, can we just appreciate the theme? <strong>Timbernetes</strong>. The World Tree. Inspired by Yggdrasil from Norse mythology, the cosmic tree that connects all realms. The logo features three adorable squirrels: a wizard holding an LGTM scroll (for reviewers), a warrior with an axe and Kubernetes shield (for release crews), and a rogue with a lantern (for triagers who bring light to dark issue queues).</p>
<p>It's wholesome. It's nerdy. It's the kind of branding that makes you want to print stickers and put them on your laptop next to that one from KubeCon 2019 that's starting to peel.</p>
<p>But here's what the cheerful squirrels don't tell you: Kubernetes v1.35 represents a philosophical shift in how the project sees itself. This isn't just about features; it's about <em>focus</em>.</p>
<p><strong>Kubernetes is doubling down on being infrastructure.</strong></p>
<p>What does that mean? Think about the role of foundational systems. They provide powerful, reliable building blocks: mechanisms for resource management, workload placement, and lifecycle control. But they don't prescribe <em>how</em> you should use them. They give you the tools; you bring the strategy.</p>
<p>v1.35 delivers exactly that: in-place resource mutation, coordinated gang scheduling, structured device allocation, and enhanced observability. These are sophisticated capabilities that unlock new possibilities. But they're <em>capabilities</em>, not complete solutions.</p>
<p>The native controllers (HPA, VPA, the default scheduler) provide baseline functionality. They work. They're reliable. But they're increasingly designed as reference implementations rather than production-optimized systems for every use case.</p>
<p>This creates an interesting dynamic: the primitives are maturing rapidly, but the intelligence layer (the part that decides <em>when</em> to resize a pod, <em>where</em> to place an AI workload, or <em>how</em> to optimize for cost) is increasingly left as an exercise for the platform team.</p>
<p>It's not a bug. It's a design choice. And it has implications for how you approach Kubernetes in production.</p>
<hr />
<h2 id="heading-breaking-changes-the-modernization-mandate"><strong>Breaking Changes: The Modernization Mandate</strong></h2>
<p>Let's start with the uncomfortable stuff. You know how doctors say "this won't hurt" right before it definitely hurts? Yeah, this is like that.</p>
<h3 id="heading-cgroup-v1-is-dead-no-really-actually-dead"><strong>cgroup v1 Is Dead. No, Really, Actually Dead.</strong></h3>
<p>Remember cgroup v1? That venerable Linux resource management system that's been around since... forever? It's gone. Not deprecated with a gentle "please consider migrating" message. <strong>Removed.</strong> Deleted. Sent to the great <code>/dev/null</code> in the sky.</p>
<p>If your kubelet detects cgroup v1 on startup, it will fail. Hard. No negotiation. No "just this once." It's like trying to run Windows 95 programs on Windows 11: technically there's compatibility mode, but do you <em>really</em> want to be that person?</p>
<p>Here's how to check if you're about to have a very bad day:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">stat</span> -<span class="hljs-built_in">fc</span> %T /sys/fs/cgroup
</code></pre>
<p>If you see <strong><em>cgroup2fs</em></strong>, congratulations! You're living in 2025 (soon to be 2026). If you see tmpfs... I have bad news. You're running cgroup v1, and your Friday afternoon just got a lot more interesting.</p>
<p>The impact on legacy fleets is, shall we say, <em>spicy</em>. CentOS 7 (which hit EOL in June 2024, by the way, you really should have migrated by now), RHEL 7, Ubuntu 18.04... they all default to cgroup v1. Even if you're on a modern distro, if your kubelet is explicitly set to cgroupDriver: cgroupfs instead of systemd, you're going to hit this wall like a cartoon character hitting a pane of glass.</p>
<p><strong>There <em>is</em> an escape hatch:</strong> you can set failCgroupV1: false in your KubeletConfiguration. But using it is like continuing to smoke after your doctor shows you the lung X-rays. Sure, technically you <em>can</em>, but it locks you out of all the cool v2-only features: memory QoS, certain swap configurations, and, here's the kicker, <strong>Pressure Stall Information (PSI) metrics</strong>.</p>
<p>PSI is the metric that tells you not just that CPU usage is high, but that <em>processes are actively stalling waiting for CPU</em>. It's the difference between "we're busy" and "we're drowning." It's a game-changer for autoscaling intelligence. And you can't have it on cgroup v1.</p>
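<p>If you genuinely can't migrate a node yet, the escape hatch is a single field in the kubelet config. A sketch, assuming the usual kubeadm file layout:</p>
<pre><code class="lang-yaml"># /var/lib/kubelet/config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failCgroupV1: false   # kubelet starts on cgroup v1, but memory QoS, swap support, and PSI stay off the table
</code></pre>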
<p>So yeah, time to upgrade those nodes.</p>
<h3 id="heading-containerd-1x-the-final-season"><strong>containerd 1.x: The Final Season</strong></h3>
<p>Here's another fun one: Kubernetes v1.35 is the <em>last</em> version that supports containerd 1.x. In v1.36, it's gone. This is your final warning, like when Netflix sends you three emails saying a show is about to leave the platform.</p>
<p>Why does this matter? Because containerd 2.0 removes support for Docker Schema 1 images. You know, those ancient container images that were pushed five years ago and have been lurking in your registry like digital archaeology? They won't pull anymore.</p>
<p>Before you upgrade, you need to:</p>
<ol>
<li>Check your container runtime versions:</li>
</ol>
<pre><code class="lang-bash">kubectl get nodes -o jsonpath=<span class="hljs-string">'{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'</span>
</code></pre>
<ol start="2">
<li>Scan for Schema 1 images (yes, this is tedious):</li>
</ol>
<pre><code class="lang-bash">skopeo inspect docker://your-registry/old-image:tag | jq <span class="hljs-string">'.schemaVersion'</span>
</code></pre>
<ol start="3">
<li>Update your containerd configs. And I mean <em>really</em> update them: containerd 2.0 removes the deprecated registry.configs and registry.auths structures. If your automated node upgrade scripts inject old configs, your runtime will crash. Don't discover this at 2 AM on a Saturday.</li>
</ol>
<h3 id="heading-ipvs-mode-gets-the-deprecation-talk"><strong>IPVS Mode Gets the Deprecation Talk</strong></h3>
<p>For years, IPVS mode in kube-proxy was <em>the</em> recommendation for large clusters because iptables couldn't scale. It was faster, more efficient, and made you feel like a sophisticated network engineer.</p>
<p>But here's the thing: maintaining IPVS behavior that perfectly matches iptables semantics while also supporting every new Service feature turned out to be... complicated. Like, "we're spending more time on compatibility than innovation" complicated.</p>
<p>So Kubernetes v1.35 deprecates IPVS mode. It still works! You'll just get a warning on startup. The future is <strong>nftables</strong>, a more modern, programmable backend that fixes iptables' scaling issues without IPVS's maintenance burden.</p>
<p>Check your mode:</p>
<pre><code class="lang-bash">kubectl get configmap kube-proxy -n kube-system -o yaml | grep -i <span class="hljs-string">"mode"</span>
</code></pre>
<p>If you see mode: ipvs, you've got until v1.38 to migrate. That's probably about a year, give or take. Start testing nftables in staging. Take your time. But do take it seriously.</p>
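<p>When you do start testing, the switch itself is small. A sketch for a kubeadm-style cluster where kube-proxy runs as a DaemonSet:</p>
<pre><code class="lang-bash"># set mode: "nftables" in the kube-proxy ConfigMap, then roll the DaemonSet
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system rollout restart daemonset kube-proxy

# confirm the new proxier came up cleanly
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=20
</code></pre>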
<hr />
<h2 id="heading-the-flagship-feature-in-place-pod-resizing-goes-ga"><strong>The Flagship Feature: In-Place Pod Resizing Goes GA 🎉</strong></h2>
<p>Alright, enough doom and gloom. Let's talk about the feature that's going to make stateful workload operators weep with joy: <strong>In-Place Pod Resizing is now Generally Available</strong>.</p>
<p>This is huge. Like, "finally, VPA might actually be usable in production" huge.</p>
<h3 id="heading-the-historical-inefficiency-aka-the-restart-tax"><strong>The Historical Inefficiency (aka "The Restart Tax")</strong></h3>
<p>Let me paint you a picture. You've got a production database. It's been running smoothly, serving queries, living its best life. Then you realize: "Hey, this needs more memory. Let's change the limit from 4GB to 8GB."</p>
<p>In Kubernetes v1.32 and earlier, here's what happens:</p>
<ol>
<li><p>Pod gets terminated</p>
</li>
<li><p>All accumulated state evaporates (JIT compilation cache, warm database connections, Redis data if it's not persisted)</p>
</li>
<li><p>New pod gets created</p>
</li>
<li><p>New pod <em>might</em> fail to schedule (oops, no capacity)</p>
</li>
<li><p>New pod <em>might</em> fail readiness probes (cold start blues)</p>
</li>
<li><p>New pod <em>might</em> land on a worse node</p>
</li>
<li><p>Your pager goes off</p>
</li>
<li><p>You question your career choices</p>
</li>
</ol>
<p>This is why Vertical Pod Autoscaler (VPA) was relegated to "recommendation mode" in most organizations. Sure, it could <em>tell</em> you what resources you needed, but actually applying those recommendations meant disruption. So teams would use it once at deploy time, like a sizing calculator, and then overprovision everything "just in case."</p>
<p>Result? Clusters running at 30-40% utilization. The tool that was supposed to optimize resource usage became a one-time measurement device.</p>
<h3 id="heading-the-v135-magic"><strong>The v1.35 Magic</strong></h3>
<p>With KEP-1287 graduating to GA, the resources field in a Pod spec is now <em>mutable</em>. You can patch it via a new /resize subresource, the kubelet evaluates feasibility, and, get this, <strong>the container keeps running</strong>.</p>
<p>No restart. No new container ID. No reset restartCount. The memory cgroup limit just... changes. From inside the container, it's seamless.</p>
<p>Here's what it looks like:</p>
<pre><code class="lang-bash">kubectl patch pod my-database --subresource resize --<span class="hljs-built_in">type</span>=<span class="hljs-string">'merge'</span> -p <span class="hljs-string">'{
  "spec": {
    "containers": [{
      "name": "postgres",
      "resources": {
        "requests": {"memory": "8Gi"},
        "limits": {"memory": "8Gi"}
      }
    }]
  }
}'</span>
</code></pre>
<p>And just like that, your database has more memory. No downtime. No data loss. No 3 AM incident.</p>
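<p>You can confirm the resize actually landed by comparing the spec against what the kubelet reports back in the pod status. A quick sketch:</p>
<pre><code class="lang-bash"># desired resources (spec) vs. what is actually applied (status)
kubectl get pod my-database -o jsonpath='{.spec.containers[0].resources}{"\n"}'
kubectl get pod my-database -o jsonpath='{.status.containerStatuses[0].resources}{"\n"}'

# an in-flight or stuck resize shows up as a pod condition
kubectl get pod my-database -o jsonpath='{.status.conditions}' | grep -i resize
</code></pre>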
<h3 id="heading-the-gotchas-because-of-course-there-are-gotchas"><strong>The Gotchas (Because Of Course There Are Gotchas) :)</strong></h3>
<p><strong>QoS Class Is Still Immutable</strong></p>
<p>Kubernetes has three QoS classes: Guaranteed (requests == limits), Burstable (requests &lt; limits), and BestEffort (no resources specified). These determine scheduling priority and eviction behavior. And they're <em>immutable</em>.</p>
<p>So if you try to resize a Guaranteed pod by changing only the limits (which would make it Burstable), the API server will reject it with a very polite "Pod QOS Class may not change as a result of resizing."</p>
<p>The fix? For Guaranteed pods, always resize requests and limits together. Keep them equal.</p>
<p><strong>The Memory Shrink Hazard</strong></p>
<p>Increasing memory is safe. Decreasing memory is... interesting.</p>
<p>Let's say you have a container with a 4GB limit, currently using 3GB, and you decide to resize down to 2GB. What happens?</p>
<p>In v1.35.0-rc.1, the kubelet is smart enough to say "nope." The resize enters <strong><em>PodResizeInProgress</em></strong> with an error message like "attempting to set pod memory limit below current usage." The cgroup limit doesn't decrease. The container keeps running at 4GB.</p>
<p>Your spec says 2GB. Reality is 4GB. The resize is stuck in limbo. And somewhere, a platform engineer is staring at this state wondering if Kubernetes is broken (it's not; it's protecting you).</p>
<p>The solution? Use resizePolicy to specify that memory changes should trigger a container restart:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">safe-resize-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">app</span>
    <span class="hljs-attr">resources:</span>
      <span class="hljs-attr">requests:</span>
        <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
        <span class="hljs-attr">memory:</span> <span class="hljs-string">"256Mi"</span>
      <span class="hljs-attr">limits:</span>
        <span class="hljs-attr">cpu:</span> <span class="hljs-string">"1"</span>
        <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
    <span class="hljs-attr">resizePolicy:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">resourceName:</span> <span class="hljs-string">cpu</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">NotRequired</span>      <span class="hljs-comment"># Hot resize for CPU</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">resourceName:</span> <span class="hljs-string">memory</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">RestartContainer</span>  <span class="hljs-comment"># Restart for memory,clean slate</span>
</code></pre>
<p>Now CPU changes are instant, and memory changes trigger a controlled restart. Best of both worlds.</p>
<h3 id="heading-native-vpa-the-mechanism-works-the-intelligence-doesnt"><strong>Native VPA: The Mechanism Works, The Intelligence... Doesn't</strong></h3>
<p>Here's the awkward part. VPA now supports updateMode: InPlaceOrRecreate. The mechanism works beautifully: pods resize without eviction, container IDs stay the same, everything is smooth.</p>
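<p>Wiring that up looks like any other VPA object, just with the new update mode. A sketch, assuming the VPA components are installed in your cluster:</p>
<pre><code class="lang-yaml">apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-database
  updatePolicy:
    updateMode: InPlaceOrRecreate   # resize in place when possible, fall back to eviction otherwise
</code></pre>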
<p>But the VPA <em>recommender</em> is still... how do I put this delicately... not great?</p>
<p>VPA relies on Metrics Server, which polls every 15-60 seconds and analyzes historical averages. By the time it detects a memory spike and issues a patch, the OOM might have already occurred. It's reactive, not predictive. It sees "high memory usage" but doesn't know if that's a leak, valid cache expansion, or normal JVM heap behavior. So it scales up blindly (hello, cost waste) or hesitates to scale down (hello, permanently oversized pods).</p>
<p>The API is production-ready. The intelligence layer isn't.</p>
<p>And that's kind of the theme of this whole release, isn't it? Kubernetes gives you the primitives. You provide the smarts.</p>
<hr />
<h2 id="heading-gang-scheduling-finally-native-all-or-nothing"><strong>Gang Scheduling: Finally, Native "All-or-Nothing"</strong></h2>
<p>If you're in the AI/ML space, this is your headline feature: <strong>Gang Scheduling</strong> has landed as an alpha feature.</p>
<h3 id="heading-the-problem"><strong>The Problem</strong></h3>
<p>You're training a distributed model. It needs 100 GPUs. The scheduler places 95 pods successfully, but then hits capacity. Now you've got 95 pods sitting there, holding 95 expensive GPUs, waiting for 5 more that might never come.</p>
<p>Meanwhile, other jobs are starving because those 95 GPUs are locked. You've created a deadlock. The cluster is effectively stuck. Someone's burning budget. Someone else is burning out.</p>
<p>Previously, solving this required external schedulers like Volcano or Kueue. They work great! But they're external dependencies with their own learning curves, deployment complexities, and operational overhead.</p>
<h3 id="heading-the-v135-solution"><strong>The v1.35 Solution</strong></h3>
<p>Kubernetes v1.35 introduces native gang scheduling via the new <strong>Workload API</strong>. Here's how it works:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">scheduling.k8s.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workload</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">distributed-training</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podGroups:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">workers</span>
    <span class="hljs-attr">policy:</span>
      <span class="hljs-attr">gang:</span>
        <span class="hljs-attr">minCount:</span> <span class="hljs-number">10</span>  <span class="hljs-comment"># All-or-nothing: need all 10 to start</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Job</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pytorch-training</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">parallelism:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">completions:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">workloadRef:</span>           <span class="hljs-comment"># Real Pod field, not an annotation!</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">distributed-training</span>
        <span class="hljs-attr">podGroup:</span> <span class="hljs-string">workers</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">trainer</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">pytorch/pytorch:latest</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">nvidia.com/gpu:</span> <span class="hljs-number">1</span>
</code></pre>
<p>The scheduler sees workloadRef and holds <em>all</em> pods until it can place the entire gang. No partial allocation. No deadlock. No wasted GPU-hours.</p>
<p>It's elegant. It's native. It's...<strong>alpha</strong>.</p>
<h3 id="heading-the-native-scheduler-gap"><strong>The Native Scheduler Gap</strong></h3>
<p><strong><em>Here's the catch:</em></strong> the native scheduler's gang scheduling implementation is <em>basic</em>. It handles the mechanics (don't schedule anything until you can schedule everything), but it doesn't handle the economics.</p>
<p>There's no queue management. No fair-share policies. No sophisticated backfill. No preemption intelligence for gang workloads.</p>
<p>For serious AI supercomputing, you'll still want Volcano or Kueue managing the <em>queue strategy</em>. The difference is now Kubernetes handles the <em>gang semantics</em> natively, and your external orchestrator handles the <em>scheduling policy</em>.</p>
<p>It's a division of labor. Kubernetes is the kernel. Your orchestrator is the user space. <em>Sounds familiar</em>, right? :)</p>
<hr />
<h2 id="heading-opportunistic-batching-not-what-you-think"><strong>Opportunistic Batching: Not What You Think</strong></h2>
<p><strong>Opportunistic Batching</strong> (KEP-5598) graduated to beta and is enabled by default. The name makes it sound like Kubernetes will now schedule 1,000 identical pods in one massive batch operation.</p>
<p>That's not what this is.</p>
<h3 id="heading-what-it-actually-does"><strong>What It Actually Does</strong></h3>
<p>When the scheduler places a pod, it might keep the ranked node list in a small cache. For the <em>very next</em> pod with an identical "scheduling signature," it can return a hint: "try node X first." It's not "schedule 1,000 pods at once." It's "maybe skip some work for pod #2 if it's identical to pod #1."</p>
<p>It's opportunistic. It's a micro-optimization. And it comes with two non-obvious requirements.</p>
<p><strong>Requirement 1: Pods Must Be "Signable"</strong></p>
<p>The scheduler computes a signature from all the fields that affect placement. If <em>any</em> scheduler plugin can't produce a signature fragment, the whole pod becomes "unbatchable."</p>
<p>On day one of testing v1.35.0-rc.1, we hit this immediately. PodTopologySpread with system-default constraints blocked signatures for every pod. Zero batching happened, not because our pods were different, but because a default plugin refused to sign them.</p>
<p>The fix is to disable system-default topology constraints:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># /etc/kubernetes/kube-scheduler-config.yaml</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kubescheduler.config.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">KubeSchedulerConfiguration</span>
<span class="hljs-attr">profiles:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">pluginConfig:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PodTopologySpread</span>
    <span class="hljs-attr">args:</span>
      <span class="hljs-attr">defaultingType:</span> <span class="hljs-string">List</span>
      <span class="hljs-attr">defaultConstraints:</span> []  <span class="hljs-comment"># Disable defaults for batching</span>
</code></pre>
<p><strong>Requirement 2: Pods Must "Fill" Nodes</strong></p>
<p>The scheduler only reuses the cached ranking if the previously chosen node becomes <em>infeasible</em> for the next pod. If the node still has capacity, it flushes the cache (node_not_full) because reusing might cause suboptimal packing.</p>
<p>Translation: Batching works great for "fat" pods that fill entire nodes (think GPU workers consuming 8 cores, 64GB RAM, 1-8 GPUs). For "tiny" microservice pods that pack 50-to-a-node, batching can't safely reuse hints.</p>
<h3 id="heading-who-benefits"><strong>Who Benefits?</strong></h3>
<p>Distributed training jobs with node-filling workers. The pods are identical, they're large, and they benefit from both gang scheduling (all-or-nothing placement) and batching (faster scheduling decisions).</p>
<p>For dense microservice workloads? Don't expect miracles. And that's by design: the scheduler is protecting against suboptimal bin-packing.</p>
<p>Diagnostic metrics to watch (a quick scraping sketch follows the list):</p>
<ul>
<li><p>scheduler_batch_attempts_total{result="hint_used"} → Batching is working</p>
</li>
<li><p>scheduler_batch_cache_flushed_total{reason="node_not_full"} → Pods too small</p>
</li>
<li><p>scheduler_batch_cache_flushed_total{reason="pod_not_batchable"} → Signature problems</p>
</li>
</ul>
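<p>If you just want to eyeball those counters, here's a rough sketch, assuming you can reach the scheduler's secure metrics port and your token is authorized to read <code>/metrics</code>:</p>
<pre><code class="lang-bash"># pod name is a placeholder for your control-plane scheduler pod
kubectl -n kube-system port-forward pod/kube-scheduler-cp-1 10259:10259 &amp;

# scrape and filter (RBAC must allow this service account to read /metrics)
TOKEN=$(kubectl create token default -n kube-system)
curl -sk -H "Authorization: Bearer ${TOKEN}" https://localhost:10259/metrics | grep scheduler_batch
</code></pre>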
<hr />
<h2 id="heading-hpa-gets-precision-tuning"><strong>HPA Gets Precision Tuning 🎯</strong></h2>
<p>The Horizontal Pod Autoscaler's fixed 10% tolerance has been a pain point forever. For a 1,000-replica deployment, that's a 100-pod dead zone where HPA just... won't react.</p>
<p>Kubernetes v1.35 promotes <strong>Configurable Tolerance</strong> to beta (enabled by default). You can now set tolerance per-HPA, and, even better, set it <em>differently</em> for scale-up versus scale-down.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">autoscaling/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">HorizontalPodAutoscaler</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web-frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">scaleTargetRef:</span>
    <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/1</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">minReplicas:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">maxReplicas:</span> <span class="hljs-number">500</span>
  <span class="hljs-attr">metrics:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Resource</span>
    <span class="hljs-attr">resource:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">cpu</span>
      <span class="hljs-attr">target:</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">Utilization</span>
        <span class="hljs-attr">averageUtilization:</span> <span class="hljs-number">70</span>
  <span class="hljs-attr">behavior:</span>
    <span class="hljs-attr">scaleUp:</span>
      <span class="hljs-attr">tolerance:</span> <span class="hljs-number">0.02</span>  <span class="hljs-comment"># 2% , respond faster to traffic spikes</span>
      <span class="hljs-attr">stabilizationWindowSeconds:</span> <span class="hljs-number">60</span>
    <span class="hljs-attr">scaleDown:</span>
      <span class="hljs-attr">tolerance:</span> <span class="hljs-number">0.15</span>  <span class="hljs-comment"># 15% , conservative, avoid thrashing</span>
      <span class="hljs-attr">stabilizationWindowSeconds:</span> <span class="hljs-number">300</span>
</code></pre>
<p>This asymmetric pattern (tight scale-up, loose scale-down) maps perfectly to how humans handle incidents: scale up on smoke, scale down on proof.</p>
<p>Quick gotchas:</p>
<ul>
<li><p>Tolerance is stored as Quantity, so 0.02 becomes 20m in the API (don't be confused by the format)</p>
</li>
<li><p>Two HPAs targeting the same workload = silent failure with AmbiguousSelector</p>
</li>
<li><p>There's a warm-up period after HPA creation where you might see "did not receive metrics"</p>
</li>
</ul>
<p>But overall? This is a clean, practical improvement that makes HPA more usable for high-scale workloads.</p>
<hr />
<h2 id="heading-security-the-year-of-not-leaking-credentials"><strong>Security: The Year of Not Leaking Credentials</strong></h2>
<h3 id="heading-image-pull-credential-verification-beta-default-on"><strong>Image Pull Credential Verification (Beta, Default On)</strong></h3>
<p>Here's a fun multi-tenant security gap that's finally closed: Previously, if Tenant A pulled a private image with valid credentials, Tenant B could use that <em>cached</em> image without any credentials at all. The kubelet only verified on first pull.</p>
<p>In v1.35, the <strong>KubeletEnsureSecretPulledImages</strong> feature is <strong>enabled by default</strong>. The kubelet now re-validates credentials for every pod, even if the image is already cached locally.</p>
<p>This means:</p>
<ul>
<li><p>The image cache is no longer a "free pass" in multi-tenant clusters</p>
</li>
<li><p>If a pull secret expires or rotates, pods that previously started fine (due to caching) will now fail with ImagePullBackOff</p>
</li>
<li><p>You need to monitor pull secret expiry and treat cache-dependent startups as a bug</p>
</li>
</ul>
<p>The feature is configurable via imagePullCredentialsVerificationPolicy in KubeletConfiguration:</p>
<ul>
<li><p>AlwaysVerify , Default in v1.35, check credentials for every pod</p>
</li>
<li><p>NeverVerify , Old behavior (insecure)</p>
</li>
<li><p>NeverVerifyAllowlistedImages , Skip verification only for specific image patterns</p>
</li>
</ul>
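<p>For reference, here's roughly what that looks like in the kubelet config. Treat this as a sketch: the policy field is the one named above, but the allowlist field name is my reading of the KEP, so verify it against the docs for your kubelet version.</p>
<pre><code class="lang-yaml"># /var/lib/kubelet/config.yaml (snippet, illustrative)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Re-check registry credentials for every pod, even if the image is already cached
imagePullCredentialsVerificationPolicy: AlwaysVerify
# With NeverVerifyAllowlistedImages you'd also list the exempt image patterns, e.g.:
# imagePullCredentialsVerificationPolicy: NeverVerifyAllowlistedImages
# preloadedImagesVerificationAllowlist:
# - "registry.example.com/shared/base"
</code></pre>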
<h3 id="heading-structured-authentication-config-ga"><strong>Structured Authentication Config (GA)</strong></h3>
<p>Multiple OIDC providers without restarting the API server? Yes, please.</p>
<p>The old way involved juggling <em>--oidc-*</em> flags, restarting the API server every time you wanted to add a provider, and generally feeling like you were living in 2015.</p>
<p>Kubernetes v1.35 graduates <strong>Structured Authentication Configuration</strong> to GA. You now use a dedicated config file:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># /etc/kubernetes/auth-config.yaml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apiserver.config.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">AuthenticationConfiguration</span>
<span class="hljs-attr">jwt:</span>
<span class="hljs-comment"># Production IdP for humans</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">issuer:</span>
    <span class="hljs-attr">url:</span> <span class="hljs-string">https://okta.example.com</span>
    <span class="hljs-attr">audiences:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">production-cluster</span>
  <span class="hljs-attr">claimMappings:</span>
    <span class="hljs-attr">username:</span>
      <span class="hljs-attr">expression:</span> <span class="hljs-string">'claims.email.split("@")[0]'</span>
    <span class="hljs-attr">groups:</span>
      <span class="hljs-attr">expression:</span> <span class="hljs-string">'claims.groups.map(g, "okta:" + g)'</span>
  <span class="hljs-attr">claimValidationRules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">expression:</span> <span class="hljs-string">'claims.exp - claims.iat &lt;= 3600'</span>
    <span class="hljs-attr">message:</span> <span class="hljs-string">"Token lifetime cannot exceed 1 hour"</span>

<span class="hljs-comment"># CI/CD IdP for pipelines</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">issuer:</span>
    <span class="hljs-attr">url:</span> <span class="hljs-string">https://gitlab.example.com</span>
    <span class="hljs-attr">audiences:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ci-cluster</span>
  <span class="hljs-attr">claimMappings:</span>
    <span class="hljs-attr">username:</span>
      <span class="hljs-attr">claim:</span> <span class="hljs-string">preferred_username</span>
    <span class="hljs-attr">groups:</span>
      <span class="hljs-attr">claim:</span> <span class="hljs-string">roles</span>
</code></pre>
<p>Enable it with <strong><em>--authentication-config=/etc/kubernetes/auth-config.yaml</em></strong> on the API server, and you're done. Multiple providers, dynamic reloads, CEL expressions for custom logic. It's beautiful.</p>
<h3 id="heading-pod-certificates-beta"><strong>Pod Certificates (Beta)</strong></h3>
<p>Native workload identity without external controllers, CRDs, or sidecars. The kubelet generates keys, requests certificates via PodCertificateRequest, writes credential bundles directly to the Pod's filesystem, and auto-rotates.</p>
<p>It's still beta and disabled by default (you need to enable <a target="_blank" href="http://certificates.k8s.io/v1beta1">certificates.k8s.io/v1beta1</a> and the PodCertificateRequest feature gate), but it's the future of service mesh and zero-trust architectures.</p>
<p>Pure mTLS flows with no bearer tokens in the issuance path. The kube-apiserver enforces node restriction at admission time. It's elegant, it's secure, and it's coming.</p>
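<p>To give you a feel for the shape of the API , and only that, since the field names below follow my reading of the KEP and may differ by version , a pod would mount its certificate through a projected volume along these lines:</p>
<pre><code class="lang-yaml"># Illustrative sketch, not copy-paste config
apiVersion: v1
kind: Pod
metadata:
  name: mtls-workload
spec:
  containers:
  - name: app
    image: example/app:latest
    volumeMounts:
    - name: workload-identity
      mountPath: /var/run/identity
      readOnly: true
  volumes:
  - name: workload-identity
    projected:
      sources:
      - podCertificate:                     # kubelet generates the key and rotates the cert
          signerName: example.com/mesh-ca   # hypothetical signer name
          keyType: ECDSAP256
          credentialBundlePath: credentials.pem
</code></pre>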
<hr />
<h2 id="heading-quick-hits-other-cool-stuff"><strong>Quick Hits: Other Cool Stuff</strong></h2>
<p>Because this post is already longer than a CVS receipt, here are some other notable features in rapid-fire mode:</p>
<ol>
<li><p><strong>Deployment Terminating Replicas (Beta)</strong> , Ever seen a rolling update trigger quota errors despite having capacity? The controller was ignoring terminating pods when counting replicas. v1.35 adds .status.terminatingReplicas so you can finally see the overlap. Mystery solved.</p>
</li>
<li><p><strong>Storage Version Migration (Beta)</strong> , Native support for migrating stored data to new schema versions, no external tools needed. Historically this required manual "read/write loops" piping kubectl commands together like it's 1999.</p>
</li>
<li><p><strong>StatefulSet MaxUnavailable (Beta)</strong> , Parallel updates for StatefulSets! Set maxUnavailable and watch multiple pods update simultaneously instead of one-at-a-time. Perfect for stateful apps that can tolerate some downtime.</p>
</li>
<li><p><strong>KYAML (Beta)</strong> , A safer, less ambiguous subset of YAML designed specifically for Kubernetes. Addresses the infamous "Norway Bug" and other YAML footguns. Enabled by default (disable with KUBECTL_KYAML=false).</p>
</li>
<li><p><strong>User Namespaces (Beta)</strong> , Containers can run as root internally while being mapped to unprivileged users on the host. Reduces privilege escalation risk if a container gets compromised.</p>
</li>
<li><p><strong>Node Declared Features (Alpha)</strong> , Nodes can advertise their supported feature gates via .status.declaredFeatures. The scheduler uses this to avoid placing pods on incompatible nodes during mixed-version upgrades. Finally, a real answer for heterogeneous clusters.</p>
</li>
<li><p><strong>Extended Toleration Operators (Alpha)</strong> , Tolerations can now use numeric comparison: "only schedule on nodes with SLA &gt; 95%." Auto-evict if it drops below threshold. Numeric intent for placement!</p>
</li>
</ol>
<hr />
<h2 id="heading-the-philosophical-shift-or-the-part-where-i-get-existential"><strong>The Philosophical Shift (Or: The Part Where I Get Existential)</strong></h2>
<p>Here's the thing about Kubernetes v1.35 that nobody's saying out loud: it's clarifying the project's identity in a way that might make some people uncomfortable.</p>
<p>A few years ago, the expectation was that upstream Kubernetes would deliver production-grade autoscaling, intelligent scheduling, and operational maturity out of the box. Native VPA would get smarter. HPA would understand seasonality. The scheduler would learn topology economics and cost optimization.</p>
<p>That hasn't happened. And based on the trajectory of v1.35, it's probably not going to.</p>
<p><strong>Kubernetes is choosing to be a kernel.</strong></p>
<p>It's focusing on:</p>
<ul>
<li><p>✅ Robust, low-level primitives (in-place resize, DRA, gang scheduling)</p>
</li>
<li><p>✅ Safe, performant APIs (structured auth, pod certificates, storage migration)</p>
</li>
<li><p>✅ Well-defined extension points (Workload API, DRA device classes)</p>
</li>
</ul>
<p>It's leaving to users:</p>
<ul>
<li><p>❌ When to resize (intelligence, not mechanism)</p>
</li>
<li><p>❌ Where to place AI workloads (economics, not mechanics)</p>
</li>
<li><p>❌ How to optimize bin-packing (strategy, not structure)</p>
</li>
</ul>
<p>The native controllers (HPA, VPA, the default scheduler) are becoming <em>reference implementations</em>, not production-grade optimization engines.</p>
<p>And you know what? That's probably the right call for an open-source project at this scale. You can't be everything to everyone. Focus on the primitives, nail the APIs, and let the ecosystem build the intelligence layer.</p>
<p>But it does mean the burden shifts. Platform teams need to either:</p>
<ol>
<li><p>Build intelligence layers themselves (hard, but flexible)</p>
</li>
<li><p>Adopt ecosystem tools that provide intelligence (easier, but adds dependencies)</p>
</li>
<li><p>Accept the limitations of native controllers (simplest, but leaves optimization on the table)</p>
</li>
</ol>
<p>There's no wrong answer. But there is a choice to make.</p>
<hr />
<h2 id="heading-the-pre-upgrade-checklist-aka-how-to-not-ruin-your-weekend"><strong>The Pre-Upgrade Checklist (aka "How to Not Ruin Your Weekend")</strong></h2>
<p>Before you upgrade to v1.35, here's what you absolutely, positively need to check:</p>
<h3 id="heading-blockers-fix-these-or-dont-upgrade"><strong>🔴 BLOCKERS (Fix These Or Don't Upgrade)</strong></h3>
<p><strong>cgroup v2 on all nodes</strong></p>
<pre><code class="lang-bash"><span class="hljs-built_in">stat</span> -<span class="hljs-built_in">fc</span> %T /sys/fs/cgroup  <span class="hljs-comment"># Must show "cgroup2fs"</span>
</code></pre>
<p><strong>containerd 2.0 or later</strong></p>
<pre><code class="lang-bash">kubectl get nodes -o jsonpath=<span class="hljs-string">'{.items[*].status.nodeInfo.containerRuntimeVersion}'</span>
</code></pre>
<p>If either of these fails, stop. Do not pass Go. Do not collect $200. Fix your infrastructure first.</p>
<h3 id="heading-high-priority-fix-soon-after-upgrade"><strong>🟠 HIGH PRIORITY (Fix Soon After Upgrade)</strong></h3>
<ul>
<li><p><strong>Scan for Docker Schema 1 images</strong> , skopeo inspect every image in your registries (sample command after this list)</p>
</li>
<li><p><strong>Verify kube-proxy mode</strong> , Make sure you're not on IPVS (or have a migration plan)</p>
</li>
<li><p><strong>Update containerd configs</strong> , Remove deprecated registry structures</p>
</li>
<li><p><strong>Check image pull secrets</strong> , Credential verification is mandatory now</p>
</li>
<li><p><strong>RBAC for exec/attach</strong> , Add create verb for pod subresources</p>
</li>
</ul>
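<p>For the Schema 1 scan, something along these lines works (looping over your image list is up to you; the interesting bit is that legacy Schema 1 manifests report <code>schemaVersion: 1</code>):</p>
<pre><code class="lang-bash"># Inspect the raw manifest and flag legacy Docker Schema 1 images
IMAGE="registry.example.com/team/app:1.4.2"   # placeholder
skopeo inspect --raw "docker://${IMAGE}" | jq -r '.schemaVersion'
# 2 (or an OCI manifest) is fine; 1 means the image needs to be rebuilt and re-pushed
</code></pre>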
<h3 id="heading-medium-priority-plan-for-it"><strong>🟡 MEDIUM PRIORITY (Plan For It)</strong></h3>
<ul>
<li><p>Run kubepug or pluto to find deprecated APIs in your manifests (example after this list)</p>
</li>
<li><p>Consider switching VPA to InPlaceOrRecreate mode</p>
</li>
<li><p>Evaluate maxUnavailable for StatefulSets that can tolerate parallel updates</p>
</li>
<li><p>Test HPA configurable tolerance for high-scale workloads</p>
</li>
</ul>
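<p>For the deprecated-API scan, pluto is the quickest route , the commands below reflect its usual workflow, so double-check the flags against the version you install:</p>
<pre><code class="lang-bash"># Scan rendered manifests on disk for deprecated or removed APIs
pluto detect-files -d ./manifests
# Or scan what's actually deployed through Helm releases
pluto detect-helm
</code></pre>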
<h3 id="heading-post-upgrade-validation"><strong>Post-Upgrade Validation</strong></h3>
<p><strong>Verify PSI is available</strong></p>
<pre><code class="lang-bash">cat /proc/pressure/memory
</code></pre>
<p><strong>Test in-place resize</strong></p>
<pre><code class="lang-bash">kubectl patch pod test-pod --subresource resize --<span class="hljs-built_in">type</span>=<span class="hljs-string">'merge'</span> -p <span class="hljs-string">'{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"512Mi"}}}]}}'</span>
</code></pre>
<p><strong>Check scheduler batching metrics (Prometheus)</strong></p>
<pre><code class="lang-bash">scheduler_batch_attempts_total{result=<span class="hljs-string">"hint_used"</span>}
</code></pre>
<p><strong>Verify feature gates</strong></p>
<pre><code class="lang-bash">kubectl get --raw /metrics | grep -i <span class="hljs-string">"kubernetes_feature_enabled"</span>
</code></pre>
<hr />
<h2 id="heading-the-bottom-line-what-should-you-do"><strong>The Bottom Line: What Should You Do?</strong></h2>
<p><strong>If you're a platform engineer:</strong> Treat this as a checkpoint release. Your technical debt is due. Schedule time for infrastructure upgrades, not just manifest changes. Test cgroup v2, plan containerd 2.0 migration, audit your RBAC policies. This isn't optional.</p>
<p><strong>If you're running AI/ML workloads:</strong> Explore gang scheduling and batching, but understand they're primitives. You still need orchestration intelligence for production-grade queue management. The scheduler handles mechanics; you provide strategy.</p>
<p><strong>If you operate stateful workloads:</strong> In-place resize is production-ready. Test it. Love it. Deploy it. But if you're using native VPA, keep expectations realistic: the mechanism is GA, the intelligence isn't.</p>
<p><strong>If you're a developer:</strong> Most of this won't affect you directly. Enjoy the more precise HPA, safer YAML with KYAML, and better Pod lifecycle tracking. Your platform team is handling the hard stuff.</p>
<hr />
<h2 id="heading-a-note-for-managed-kubernetes-users-eks-aks-gke-etc"><strong>A Note for Managed Kubernetes Users (EKS, AKS, GKE, etc.)</strong></h2>
<p>If you're running on a managed Kubernetes service, your upgrade experience will be different , and in many ways easier.</p>
<p><strong>The Good News:</strong></p>
<ul>
<li><p><strong>No cgroup v2 migration pain</strong> , Cloud providers handle node OS upgrades. By the time EKS/AKS/GKE offer v1.35, their node images will already be running cgroup v2-compatible systems.</p>
</li>
<li><p><strong>Automatic containerd updates</strong> , Managed services bundle compatible container runtime versions. You won't manually upgrade containerd.</p>
</li>
<li><p><strong>Control plane upgrades managed</strong> , API server, scheduler, controller-manager upgrades happen with a button click (or API call).</p>
</li>
</ul>
<p><strong>What You Still Need to Handle:</strong></p>
<ul>
<li><p><strong>Application compatibility</strong> , Test your workloads with v1.35 API changes, especially if you use newer beta features.</p>
</li>
<li><p><strong>RBAC updates</strong> , The WebSocket permission changes for exec/attach/portforward still affect your roles and service accounts.</p>
</li>
<li><p><strong>Image pull secrets</strong> , The stricter credential verification affects all clusters. Audit your pull secret lifetimes and rotation policies.</p>
</li>
<li><p><strong>Cost implications</strong> , New features like in-place resize and better autoscaling can improve efficiency, but you need to configure them. Managed Kubernetes doesn't automatically optimize your resource usage.</p>
</li>
</ul>
<p><strong>Timeline Expectations:</strong></p>
<ul>
<li><p><strong>EKS</strong> typically lags 2-4 weeks behind upstream releases</p>
</li>
<li><p><strong>GKE</strong> usually within 2-3 weeks for Rapid channel, longer for Regular/Stable</p>
</li>
<li><p><strong>AKS</strong> generally 2-4 weeks post-release</p>
</li>
</ul>
<p>Check your provider's release notes , they often add their own enhancements or defer certain alpha features.</p>
<p><strong>Bottom Line:</strong> Managed Kubernetes handles infrastructure concerns, but you're still responsible for application architecture, resource optimization, and operational intelligence. v1.35's primitives are available to you; making them useful is still your job.</p>
<hr />
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>Kubernetes v1.35 is a fascinating release. It's not the most feature-packed release we've ever seen (v1.16's "The One With Everything" still holds that crown). But it's clarifying. It's opinionated about what Kubernetes <em>is</em> and what it <em>isn't</em>.</p>
<p>It's a kernel. It provides primitives. It's your job to make them smart.</p>
<p>And honestly? That's liberating. Because now we know where we stand. The expectations are clear. The division of labor is explicit.</p>
<p>Build on the primitives. Fill the intelligence gaps. Make something amazing.</p>
<p>The World Tree has another growth ring. What will you build in its branches?</p>
<hr />
<p><em>Written with chai , mild existential dread, and genuine excitement for the future of Kubernetes. May your upgrades be smooth and your cgroups be v2.</em> ☕️</p>
<p>Happy upgrading! 🐿️</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes v1.34: The Smooth Operator Release]]></title><description><![CDATA[The Kubernetes ecosystem continues to evolve at a remarkable pace, and the latest v1.34 release, planned for Wednesday, August 27th, 2025, represents one of the most significant updates in recent memory. Unlike previous releases that focused heavily ...]]></description><link>https://blogs.akshatsinha.dev/kubernetes-v1-34</link><guid isPermaLink="true">https://blogs.akshatsinha.dev/kubernetes-v1-34</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Akshat Sinha]]></dc:creator><pubDate>Wed, 20 Aug 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766092571136/5edc8ed2-3690-4a1b-9def-d469b31c5152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Kubernetes ecosystem continues to evolve at a remarkable pace, and the latest <strong>v1.34</strong> release, planned for Wednesday, August 27th, 2025, represents one of the most significant updates in recent memory. Unlike previous releases that focused heavily on deprecations and removals, <strong>Kubernetes v1.34</strong> takes a different approach, it’s entirely focused on enhancements and new capabilities that will reshape how we manage containerized workloads at scale.</p>
<h3 id="heading-what-makes-v134-special">What Makes v1.34 Special?</h3>
<p>What makes v1.34 particularly exciting is its focus on maturity and stability. This release showcases significant feature graduations, with several major capabilities moving from beta to stable, and important enhancements reaching beta status. Most remarkably, this release contains <strong>no deprecations or removals,</strong> a refreshing change that allows teams to upgrade with confidence, knowing their existing configurations and workflows will continue to work seamlessly.</p>
<h3 id="heading-major-features-graduating-to-stable-ga">Major Features Graduating to Stable (GA)</h3>
<h4 id="heading-dynamic-resource-allocation-dra-core-reaches-production-readiness">Dynamic Resource Allocation (DRA) Core Reaches Production Readiness</h4>
<p>The headline feature of v1.34 is undoubtedly the graduation of <strong>Dynamic Resource Allocation (DRA)</strong> core to stable status. DRA was originally introduced as an alpha feature in v1.26, went through a significant redesign for v1.31, reached beta in v1.32, and now achieves general availability in v1.34.</p>
<p><strong>Why DRA Matters</strong></p>
<p>If you’ve ever struggled with GPU allocation, custom hardware integration, or complex device scheduling in Kubernetes, DRA is about to become your best friend. Traditional device plugins have served us well, but they come with significant limitations:</p>
<ul>
<li><p><strong>Static allocation</strong>: Once a device is assigned, it can’t be dynamically reallocated</p>
</li>
<li><p><strong>Limited flexibility</strong>: Device requests are binary , you either get the device or you don’t</p>
</li>
<li><p><strong>Poor observability</strong>: Limited insight into device utilization and allocation failures</p>
</li>
</ul>
<p>DRA changes this paradigm entirely. It provides a flexible framework for categorizing, requesting, and utilizing specialized hardware like GPUs, FPGAs, network accelerators, and custom silicon.</p>
<p><strong>How DRA Works in Practice</strong></p>
<p>DRA introduces several new API types under <code>resource.k8s.io/v1</code>:</p>
<ul>
<li><p><strong>ResourceClaim</strong>: Represents a request for specific resources</p>
</li>
<li><p><strong>DeviceClass</strong>: Defines categories of available devices</p>
</li>
<li><p><strong>ResourceClaimTemplate</strong>: Templates for dynamic claim creation</p>
</li>
<li><p><strong>ResourceSlice</strong>: Contains information about available resources</p>
</li>
</ul>
<p>Here’s a practical example of how you might request GPU resources with DRA:</p>
<pre><code class="lang-bash">apiVersion: v1
kind: Pod
metadata:
  name: ml-training-pod
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-template
  containers:
  - name: trainer
    image: ml-training:latest
    resources:
      claims:
      - name: gpu-claim
        request: gpu
</code></pre>
<p>The magic happens through <strong>CEL (Common Expression Language)</strong> expressions that allow fine-grained device filtering. You can now specify requirements like “give me a GPU with at least 16GB memory and CUDA compute capability &gt; 7.0” directly in your resource claims.</p>
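<p>To make that concrete, a DeviceClass can pre-filter devices with CEL selectors. This is illustrative only , the driver name and capacity keys depend entirely on which DRA driver you run:</p>
<pre><code class="lang-yaml">apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: large-memory-gpu
spec:
  selectors:
  - cel:
      # Only devices exposed by this (hypothetical) driver
      expression: 'device.driver == "gpu.example.com"'
  - cel:
      # ...with at least 16Gi of device memory (capacity keys are driver-specific)
      expression: 'device.capacity["gpu.example.com"].memory.compareTo(quantity("16Gi")) &gt;= 0'
</code></pre>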
<p>With DRA graduating to stable, the <code>resource.k8s.io/v1</code> APIs will be available by default, making this a production-ready solution for complex device management scenarios.</p>
<h3 id="heading-production-ready-tracing-for-kubelet-and-api-server">Production-Ready Tracing for Kubelet and API Server</h3>
<p>Two major tracing enhancements are graduating to stable in v1.34, transforming Kubernetes observability:</p>
<p><strong>API Server Tracing (</strong><a target="_blank" href="https://github.com/kubernetes/enhancements/issues/647"><strong>KEP-647</strong></a><strong>)</strong> and <strong>Kubelet Tracing (</strong><a target="_blank" href="https://github.com/kubernetes/enhancements/issues/2831"><strong>KEP-2831</strong></a><strong>)</strong> both reach general availability after their journey from alpha (v1.22 and v1.25 respectively) to beta (v1.27) and now to stable.</p>
<p><strong>Deep Observability Comes to Kubernetes Core</strong></p>
<p>The tracing implementation uses <strong>OpenTelemetry</strong> standards to instrument critical operations:</p>
<ul>
<li><p><strong>Kubelet operations</strong>: Complete visibility into CRI calls, pod lifecycle events, and node-level operations</p>
</li>
<li><p><strong>API server operations</strong>: End-to-end request tracing from admission controllers to etcd</p>
</li>
<li><p><strong>Context propagation</strong>: Trace IDs flow through the entire system, enabling correlation across components</p>
</li>
</ul>
<p><strong>Real-World Impact</strong></p>
<p>Imagine debugging a pod that’s stuck in <code>ContainerCreating</code> state. Instead of grepping through disconnected logs across multiple components, you now get:</p>
<ol>
<li><p><strong>Unified trace view</strong>: See the entire pod creation flow in one timeline</p>
</li>
<li><p><strong>Precise bottleneck identification</strong>: Pinpoint exactly where delays occur</p>
</li>
<li><p><strong>Cross-component correlation</strong>: Connect kubelet operations with container runtime behaviors</p>
</li>
<li><p><strong>Performance insights</strong>: Quantify the impact of configuration changes</p>
</li>
</ol>
<p>This level of observability transforms Kubernetes from a “black box” into a transparent, debuggable system, and with stable graduation, you can confidently build production monitoring solutions around these capabilities.</p>
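<p>Turning this on for the kubelet is a small config change. The snippet below follows the shape documented for system-component traces (an OTLP endpoint plus a sampling rate); verify the exact fields for your kubelet version:</p>
<pre><code class="lang-yaml">apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Export spans to a local OpenTelemetry collector over OTLP/gRPC
tracing:
  endpoint: localhost:4317
  samplingRatePerMillion: 100000   # sample roughly 10% of operations
</code></pre>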
<h3 id="heading-key-features-graduating-to-beta">Key Features Graduating to Beta</h3>
<h4 id="heading-serviceaccount-tokens-for-image-pull-authentication">ServiceAccount Tokens for Image Pull Authentication</h4>
<p><strong>Moving to Beta and Enabled by Default</strong></p>
<p>One of the most significant security improvements in v1.34 is the beta graduation of <strong>ServiceAccount token integration for kubelet credential providers (</strong><a target="_blank" href="https://github.com/kubernetes/enhancements/issues/4412"><strong>KEP-4412</strong></a><strong>)</strong>. This feature addresses a longstanding security concern: the use of long-lived image pull secrets.</p>
<p><strong>The Security Problem</strong></p>
<p>Traditional image pull secrets suffer from several security issues:</p>
<ul>
<li><p><strong>Long-lived credentials</strong>: Secrets don’t rotate automatically</p>
</li>
<li><p><strong>Broad access</strong>: One secret often provides access to multiple registries</p>
</li>
<li><p><strong>Operational overhead</strong>: Manual credential management and rotation</p>
</li>
</ul>
<p><strong>The Modern Solution</strong></p>
<p>The new approach leverages short-lived, automatically rotated ServiceAccount tokens that follow <strong>OIDC-compliant semantics</strong>. Each token is scoped to a specific Pod, dramatically reducing the blast radius of credential compromise.</p>
<p>Benefits include:</p>
<ul>
<li><p><strong>Automatic rotation</strong>: Tokens refresh without manual intervention</p>
</li>
<li><p><strong>Workload-level identity</strong>: Each workload gets its own scoped credentials</p>
</li>
<li><p><strong>Reduced attack surface</strong>: No more long-lived secrets sitting in etcd</p>
</li>
<li><p><strong>Better compliance</strong>: Aligns with modern identity-aware security practices</p>
</li>
</ul>
<p>This change represents a fundamental shift toward a <strong>zero-trust model</strong> for container image access.</p>
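<p>Mechanically, this flows through the kubelet's credential provider config. The sketch below shows the general shape , the <code>tokenAttributes</code> block is the KEP-4412 addition, and the field names are my reading of the KEP rather than something to paste verbatim:</p>
<pre><code class="lang-yaml">apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: registry-credential-provider        # your provider plugin (placeholder)
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - "registry.example.com"
  defaultCacheDuration: "0s"                # short-lived, per-pod tokens shouldn't be cached broadly
  tokenAttributes:
    serviceAccountTokenAudience: registry.example.com
    requireServiceAccount: true
</code></pre>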
<h3 id="heading-enhanced-pod-level-resource-management">Enhanced Pod-Level Resource Management</h3>
<p><strong>PodLevelResources Graduates to Beta</strong></p>
<p>The <code>PodLevelResources</code> feature is now beta and enabled by default. This enhancement allows defining CPU and memory resources for an entire pod using <code>pod.spec.resources</code>, providing more intuitive resource management for multi-container pods.</p>
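<p>A minimal sketch of what that looks like , the pod-level numbers act as a shared budget across all containers, so sidecars don't need per-container guesswork (values are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Pod
metadata:
  name: sidecar-heavy-pod
spec:
  # Budget for the pod as a whole
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
  containers:
  - name: app
    image: example/app:latest
  - name: log-shipper
    image: example/log-shipper:latest
</code></pre>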
<h3 id="heading-better-pod-lifecycle-tracking">Better Pod Lifecycle Tracking</h3>
<p><strong>PodObservedGenerationTracking Reaches Beta</strong></p>
<p>This feature, now beta and enabled by default, populates <code>status.observedGeneration</code> fields in pods and their conditions, enabling a better understanding of when pod status reflects the current specification.</p>
<h3 id="heading-enhanced-traffic-distribution-policies">Enhanced Traffic Distribution Policies</h3>
<p><strong>PreferSameZone and PreferSameNode Graduate to Beta</strong></p>
<p>Building on <a target="_blank" href="https://github.com/kubernetes/enhancements/issues/3015">KEP-3015</a>, the enhanced traffic distribution capabilities are graduating to beta with the feature gate enabled by default in v1.34.</p>
<p>Network topology awareness gets a significant upgrade with the evolution of Service traffic distribution policies. The <code>spec.trafficDistribution</code> field now supports more granular preferences.</p>
<p><strong>Beyond PreferClose</strong></p>
<p>The original <code>PreferClose</code> policy is being deprecated in favor of two more specific options:</p>
<ul>
<li><p><strong>PreferSameZone</strong>: Equivalent to the current PreferClose behavior, prioritising endpoints in the same availability zone</p>
</li>
<li><p><strong>PreferSameNode</strong>: Takes locality to the extreme, preferring endpoints on the same physical node as the client</p>
</li>
</ul>
<p><strong>Practical Applications</strong></p>
<p><code>PreferSameNode</code> is particularly valuable for:</p>
<ul>
<li><p><strong>Edge computing</strong>: Minimizing latency for IoT and edge workloads</p>
</li>
<li><p><strong>Data-intensive applications</strong>: Reducing network traversal for high-bandwidth communications</p>
</li>
<li><p><strong>Co-located microservices</strong>: Optimizing performance for tightly coupled services</p>
</li>
</ul>
<pre><code class="lang-bash">apiVersion: v1
kind: Service
spec:
  trafficDistribution: PreferSameNode
  <span class="hljs-comment"># ... rest of service spec</span>
</code></pre>
<h3 id="heading-fine-grained-hpa-control-with-configurable-tolerance">Fine-Grained HPA Control with Configurable Tolerance</h3>
<p><strong>Graduating to Beta from Alpha</strong></p>
<p>The <strong>HPA configurable tolerance</strong> feature (<a target="_blank" href="https://github.com/kubernetes/enhancements/issues/4951">KEP-4951</a>) is expected to graduate to beta in v1.34. This enhancement addresses one of the most common complaints about autoscaling behavior.</p>
<p><strong>The Problem with One-Size-Fits-All</strong></p>
<p>The default cluster-wide 10% tolerance for HPA scaling decisions often proves inadequate:</p>
<ul>
<li><p><strong>Large deployments</strong>: 10% might mean hundreds of unnecessary pods remain during scale-down</p>
</li>
<li><p><strong>Sensitive workloads</strong>: Some applications need more responsive scaling</p>
</li>
<li><p><strong>Cost optimization</strong>: Different workloads have different cost sensitivity profiles</p>
</li>
</ul>
<p><strong>The Solution: Workload-Specific Tolerance</strong></p>
<pre><code class="lang-bash">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  behavior:
    scaleUp:
      tolerance: 0.05  <span class="hljs-comment"># 5% , more aggressive scale-up</span>
    scaleDown:
      tolerance: 0.15 <span class="hljs-comment"># 15% , more conservative scale-down</span>
</code></pre>
<p>This granular control enables:</p>
<ul>
<li><p><strong>Optimized resource utilization</strong>: Right-size tolerance for each workload</p>
</li>
<li><p><strong>Cost management</strong>: More aggressive scale-down for cost-sensitive applications</p>
</li>
<li><p><strong>Performance optimization</strong>: Responsive scale-up for latency-critical services</p>
</li>
</ul>
<h3 id="heading-exciting-new-alpha-features">Exciting New Alpha Features</h3>
<h3 id="heading-pod-replacement-policy-for-deployments">Pod Replacement Policy for Deployments</h3>
<p><strong>Introducing Alpha Feature</strong></p>
<p>The new <code>podReplacementPolicy</code> field (<a target="_blank" href="https://github.com/kubernetes/enhancements/issues/3973">KEP-3973)</a> gives you explicit control over the trade-off between deployment speed and resource consumption. This alpha feature can be enabled using the <code>DeploymentPodReplacementPolicy</code> and <code>DeploymentReplicaSetTerminatingReplicas</code> feature gates.</p>
<p>Resource management during deployments has always been a balancing act between speed and resource consumption. This new feature provides two distinct policies:</p>
<p><strong>TerminationStarted</strong>: Creates new pods immediately when old ones begin terminating</p>
<ul>
<li><p>Faster rollouts and reduced downtime</p>
</li>
<li><p>Higher temporary resource consumption</p>
</li>
</ul>
<p><strong>TerminationComplete</strong>: Waits for complete termination before creating new pods</p>
<ul>
<li><p>Controlled resource usage and predictable capacity planning</p>
</li>
<li><p>Slower rollouts</p>
</li>
</ul>
<pre><code class="lang-bash">apiVersion: apps/v1
kind: Deployment
spec:
  podReplacementPolicy: TerminationStarted
  <span class="hljs-comment"># ... rest of deployment spec</span>
</code></pre>
<p>This feature is particularly valuable for:</p>
<ul>
<li><p><strong>Resource-constrained environments</strong>: Where every CPU core and GB of RAM matters</p>
</li>
<li><p><strong>Long-terminating workloads</strong>: Applications with extended graceful shutdown periods</p>
</li>
<li><p><strong>Cost-sensitive deployments</strong>: Where temporary resource spikes impact billing</p>
</li>
</ul>
<h3 id="heading-kyaml-kubernetes-optimized-configuration-format">KYAML: Kubernetes-Optimized Configuration Format</h3>
<p><strong>Alpha Support for kubectl Output</strong></p>
<p><a target="_blank" href="https://github.com/kubernetes/enhancements/issues/5295">KEP-5295</a> introduces <strong>KYAML</strong> as a new output format for kubectl v1.34, addressing common YAML pitfalls while maintaining full compatibility.</p>
<p><strong>Solving YAML’s Pain Points</strong></p>
<p>YAML’s flexibility comes with notorious drawbacks:</p>
<ul>
<li><p><a target="_blank" href="https://hitchdev.com/strictyaml/why/implicit-typing-removed/"><strong>The Norway Bug</strong></a>: Unquoted country codes like <code>NO</code> being interpreted as boolean <code>false</code></p>
</li>
<li><p><strong>Indentation sensitivity</strong>: Subtle whitespace errors causing deployment failures</p>
</li>
<li><p><strong>Type coercion surprises</strong>: Strings sometimes becoming numbers or booleans unexpectedly</p>
</li>
</ul>
<p><strong>KYAML’s Principled Approach</strong></p>
<p>KYAML addresses these issues through consistent rules:</p>
<ul>
<li><p><strong>Always double-quote strings</strong>: Eliminates type coercion surprises</p>
</li>
<li><p><strong>Unquoted keys</strong>: Unless potentially ambiguous</p>
</li>
<li><p><strong>Consistent syntax</strong>: Always use <code>{}</code> for objects, <code>[]</code> for arrays</p>
</li>
<li><p><strong>Comment support</strong>: Unlike JSON, KYAML supports comments</p>
</li>
<li><p><strong>Trailing commas allowed</strong>: Reduces diff noise and syntax errors</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Traditional YAML (problematic)</span>
apiVersion: v1
kind: ConfigMap
data:
  country: NO  <span class="hljs-comment"># Oops! This becomes boolean false</span>
  version: 1.0  <span class="hljs-comment"># This might become a float</span>
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># KYAML (safe)</span>
apiVersion: <span class="hljs-string">"v1"</span>
kind: <span class="hljs-string">"ConfigMap"</span>
data: {
  country: <span class="hljs-string">"NO"</span>,  <span class="hljs-comment"># Explicitly a string</span>
  version: <span class="hljs-string">"1.0"</span>, <span class="hljs-comment"># Explicitly a string</span>
}
</code></pre>
<p>KYAML remains a strict subset of YAML, ensuring compatibility with existing tooling while providing safety guarantees. You’ll be able to request KYAML output using <code>kubectl get -o kyaml</code>, while all existing YAML and JSON output formats remain available.</p>
<h3 id="heading-additional-operational-improvements">Additional Operational Improvements:-</h3>
<p><strong>Enhanced Memory Management:</strong> Memory limits can now be decreased with a <code>NotRequired</code> resize restart policy, with intelligent checks to prevent OOM-kill scenarios during the adjustment. This improvement provides more flexibility in resource management without compromising pod stability.</p>
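<p>The knob that controls restart behavior during a resize is the per-container <code>resizePolicy</code>. A minimal sketch (values are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
spec:
  containers:
  - name: cache
    image: example/cache:latest
    resizePolicy:
    - resourceName: memory
      restartPolicy: NotRequired   # allow in-place memory changes without restarting the container
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"
</code></pre>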
<p><strong>Better CSI Volume Handling:</strong> The kubelet now detects terminal CSI volume mount failures due to exceeded attachment limits and marks stateful pods as Failed, allowing controllers to recreate them. This prevents pods from getting stuck indefinitely in the <code>ContainerCreating</code> state.</p>
<p><strong>Improved Metrics and Observability:</strong> New metrics provide better insight into:</p>
<ul>
<li><p>User namespace pod creation success/failure rates with <code>started_user_namespaced_pods_total</code> and <code>started_user_namespaced_pods_errors_total</code></p>
</li>
<li><p>ResourceClaim controller operations with <code>resourceclaim_controller_creates_total</code> and <code>resourceclaim_controller_resource_claims</code></p>
</li>
</ul>
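<p>If you already scrape kubelet and controller metrics, a rough PromQL expression for alerting on user-namespace pod failures could look like this (the threshold is arbitrary):</p>
<pre><code class="lang-bash"># Error ratio for user-namespaced pod startups over the last 5 minutes
sum(rate(started_user_namespaced_pods_errors_total[5m]))
  / sum(rate(started_user_namespaced_pods_total[5m])) &gt; 0.05
</code></pre>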
<h3 id="heading-what-this-means-for-your-operations">What This Means for Your Operations</h3>
<p><strong><em>For Platform Engineers:</em></strong> v1.34 represents a maturation of Kubernetes’ enterprise capabilities. The stability graduation of <strong>DRA</strong>, tracing, and several beta features means you can confidently build these into your platform abstractions without fear of API churn.</p>
<p><strong><em>For Security Teams:</em></strong> The <strong>ServiceAccount token integration</strong> for image pulls represents a significant step toward zero-trust container registries. With this feature moving to beta and enabled by default, it’s time to start planning migration away from long-lived pull secrets.</p>
<p><strong><em>For FinOps Teams:</em></strong> The combination of beta-level <strong>HPA configurable tolerance</strong> and alpha-level <strong>pod replacement policies</strong> provides new levers for balancing performance and cost. These features enable more sophisticated cost optimization strategies.</p>
<p><strong><em>For Developers:</em></strong> Alpha <strong>KYAML</strong> support means safer, more maintainable configuration files on the horizon. The stable <strong>tracing capabilities</strong> will dramatically improve debugging experiences across the development lifecycle.</p>
<h3 id="heading-wrapping-up">Wrapping Up</h3>
<p>Kubernetes v1.34 is looking pretty solid, nothing too flashy, but it’s packed with the kind of practical improvements that actually make a difference in day-to-day work. With plenty of enhancements and zero deprecations, it’s one of those rare releases where you don’t have to worry about things breaking when you upgrade. The GPU allocation improvements with DRA are finally ready for prime time, and there are some nice observability upgrades baked right in. When it drops on August 27th, it should be a pretty smooth transition, no hunting down deprecated APIs or scrambling to fix broken deployments. It’s not revolutionary, but sometimes the best releases are the ones that just work without giving you a headache.</p>
]]></content:encoded></item><item><title><![CDATA[Argo CD 3.0: Navigating the Next Frontier of GitOps Deployment]]></title><description><![CDATA[In the rapidly evolving landscape of Kubernetes deployments, Argo CD 3.0 emerges as a pivotal milestone that promises to redefine how organizations approach continuous delivery. This version represents a carefully orchestrated evolution of the platfo...]]></description><link>https://blogs.akshatsinha.dev/argocd-3-0</link><guid isPermaLink="true">https://blogs.akshatsinha.dev/argocd-3-0</guid><category><![CDATA[ArgoCD]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Akshat Sinha]]></dc:creator><pubDate>Tue, 04 Mar 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766092100675/d8734eec-0e68-4f07-9094-f79a05f295c1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the rapidly evolving landscape of Kubernetes deployments, Argo CD 3.0 emerges as a pivotal milestone that promises to redefine how organizations approach continuous delivery. This version represents a carefully orchestrated evolution of the platform, balancing innovation with practical considerations for enterprise deployments.</p>
<h3 id="heading-version-support-strategy">Version Support Strategy</h3>
<p>Starting with 3.0, Argo CD will:</p>
<ul>
<li><p>Stop releasing new 2.x minor versions</p>
</li>
<li><p>Continue cutting patch releases for the two most recent minor versions (2.14 until 3.2 is released, and 2.13 until 3.1 is released)</p>
</li>
</ul>
<p>The versioning strategy reflects a mature approach to software maintenance, ensuring stability while pushing the boundaries of continuous delivery technologies.</p>
<p>The v3 RC is planned for March 17, 2025 and v3 GA for May 6, 2025.</p>
<h3 id="heading-critical-breaking-changes-and-deprecations">Critical Breaking Changes and Deprecations</h3>
<h3 id="heading-1-fine-grained-rbac-transformation">1. Fine-Grained RBAC Transformation</h3>
<p>The role-based access control (RBAC) mechanism in Argo CD has undergone a significant transformation, addressing long-standing challenges in permission management. Previously, the system operated with broad, catch-all permissions that often introduced potential security risks.</p>
<p><strong>Before v3:</strong></p>
<ul>
<li><p>Update or delete actions on an application automatically applied to sub-resources</p>
</li>
<li><p>Broad permissions were the default, potentially exposing systems to unintended modifications</p>
</li>
</ul>
<p><strong>In v3:</strong></p>
<ul>
<li><p>Update and delete actions now only apply to the application itself</p>
</li>
<li><p>Explicit policies must be defined for sub-resource permissions</p>
</li>
<li><p>Administrators can create highly specific access rules with fine-grained control</p>
</li>
</ul>
<p>The new permission model introduces a more complex but powerful approach to access management. Example scenarios:</p>
<p><strong>Granular Resource-Level Permissions</strong> To grant a user permission to delete only Pods within a specific application, you can now use a precisely crafted policy:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Allows deleting Pods in the 'prod-app' Application</span>
p, example-user, applications, delete/*/Pod/*/*, default/prod-app, allow
</code></pre>
<p><strong>Nuanced Access Control</strong> The system now supports intricate permission combinations. For instance, you can:</p>
<ul>
<li><p>Allow updates to an application while denying updates to its sub-resources</p>
</li>
<li><p>Explicitly deny application deletion while permitting specific resource deletions</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Explicitly deny application deletion</span>
p, example-user, applications, delete, default/prod-app, deny
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># Allow deleting Pods within the application</span>
p, example-user, applications, delete/*/Pod/*/*, default/prod-app, allow
</code></pre>
<p><strong>Glob Pattern Considerations</strong> Argo CD’s RBAC uses a unique glob pattern evaluation that requires careful configuration. The matching can be complex due to how slashes are processed. Best practices include:</p>
<ul>
<li><p>Always include all resource parts in the pattern</p>
</li>
<li><p>Use four slashes for most precise matching</p>
</li>
<li><p>Be aware that resource kinds and namespaces can interact in unexpected ways</p>
</li>
</ul>
<p><strong>Migration Strategy</strong> Organizations can preserve the previous broad permission model by setting <code>server.rbac.disableApplicationFineGrainedRBACInheritance</code> to <code>false</code> in the Argo CD ConfigMap. However, this is recommended only as a temporary measure during migration.</p>
<p><strong>Example Migration Path</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Legacy Approach (No Longer Default)</span>
- p, some-user, applications, *, *, allow  <span class="hljs-comment"># Gave broad permissions</span>
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># New Approach</span>
- p, some-user, applications, *, *, allow  <span class="hljs-comment"># Requires explicit sub-resource permissions</span>
- p, some-user, applications, update/*/Deployment/*/*, specific-app, allow
</code></pre>
<h3 id="heading-2-logs-rbac-enforcement">2. Logs RBAC Enforcement</h3>
<p>The approach to logging access has been dramatically refined, treating logs as a first-class security resource. This change represents a more nuanced and secure method of managing application visibility and access.</p>
<p><strong>Changes:</strong></p>
<ul>
<li><p>Logs are now a first-class RBAC resource</p>
</li>
<li><p>Automatic logs access for application users has been removed</p>
</li>
<li><p>Explicit logs access must now be granted</p>
</li>
</ul>
<p><strong>Configuration:</strong></p>
<ul>
<li><p>Remove <code>server.rbac.log.enforce.enable</code> from argocd-cm ConfigMap</p>
</li>
<li><p>Manually grant logs access at project or global scope (example policy after this list)</p>
</li>
</ul>
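<p>For example, a couple of policy lines like these grant log access explicitly (adjust the role names and projects to your setup):</p>
<pre><code class="lang-bash"># Read access to logs for all apps in the 'prod' project
p, role:sre, logs, get, prod/*, allow
# Or globally, for a platform-admin style role
p, role:platform-admin, logs, get, */*, allow
</code></pre>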
<h3 id="heading-3-metrics-consolidation">3. Metrics Consolidation</h3>
<p>Metric management has been streamlined to provide a more focused and efficient monitoring experience. The removal of certain legacy metrics demonstrates Argo CD’s commitment to maintaining a clean and modern observability approach.</p>
<p><strong>Removed Metrics:</strong></p>
<ul>
<li><p><code>argocd_app_sync_status</code></p>
</li>
<li><p><code>argocd_app_health_status</code></p>
</li>
<li><p><code>argocd_app_created_time</code></p>
</li>
</ul>
<p><strong>Migration:</strong></p>
<ul>
<li><p>These metrics’ information is now available as labels on <code>argocd_app_info</code></p>
</li>
<li><p>Update monitoring dashboards and alerts accordingly</p>
</li>
</ul>
<h3 id="heading-4-dex-sso-authentication-changes">4. Dex SSO Authentication Changes</h3>
<p>Authentication mechanisms have been refined to provide more stable and predictable user identification. This change addresses the inherent challenges of using internally generated claims for authentication and authorization.</p>
<p><strong>Before:</strong></p>
<ul>
<li><p>Used <code>sub</code> claim for RBAC subject</p>
</li>
<li><p>Subject based on Dex internal implementation</p>
</li>
</ul>
<p><strong>In v3:</strong></p>
<ul>
<li><p>Now uses <code>federated_claims.user_id</code> claim</p>
</li>
<li><p>Requests <code>federated:id</code> scope from Dex</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Old Policy (Incorrect)</span>
- g, ChdleGFtcGxlQGFyZ29wcm9qLmlvEgJkZXhfY29ubl9pZA, role:example
</code></pre>
<pre><code class="lang-bash"><span class="hljs-comment"># New Policy</span>
- g, example@argoproj.io, role:example
</code></pre>
<h3 id="heading-5-repository-configuration">5. Repository Configuration</h3>
<p>The approach to repository management has been simplified and standardized, pushing organizations towards more declarative and Kubernetes-native configuration methods.</p>
<p><strong>Deprecation:</strong></p>
<ul>
<li><p>Removed support for repository configuration in <code>argocd-cm</code> ConfigMap</p>
</li>
<li><p>All repositories must now be managed as Kubernetes Secrets</p>
</li>
</ul>
<p><strong>Verification:</strong></p>
<pre><code class="lang-bash">kubectl get cm argocd-cm -o=jsonpath=<span class="hljs-string">"[{.data.repositories}, {.data['repository.credentials']}, {.data['helm.repositories']}]"</span>
</code></pre>
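<p>If that command turns anything up, migrate those entries to declarative repository Secrets. The standard shape looks like this (all values are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Secret
metadata:
  name: team-app-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository   # this label is what Argo CD watches for
stringData:
  type: git
  url: https://github.com/example/team-app.git
  username: git-user
  password: example-token
</code></pre>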
<h3 id="heading-6-applicationset-nested-selectors">6. ApplicationSet Nested Selectors</h3>
<p>The ApplicationSet configuration has been simplified to provide more predictable and consistent behavior across different deployment scenarios.</p>
<p><strong>Change:</strong></p>
<ul>
<li><p><code>applyNestedSelectors</code> field is now ignored</p>
</li>
<li><p>Nested selectors are always applied</p>
</li>
<li><p>Remove explicit selectors in existing ApplicationSets</p>
</li>
</ul>
<h3 id="heading-7-cluster-configuration">7. Cluster Configuration</h3>
<p>Cluster management has been refined to provide more explicit and controlled interaction with in-cluster resources, reducing ambiguity in deployment configurations.</p>
<p><strong>When</strong> <code>cluster.inClusterEnabled</code> <strong>is set to "false":</strong></p>
<ul>
<li><p>Existing in-cluster Applications will be in an Unknown state</p>
</li>
<li><p>Cannot create new in-cluster Applications</p>
</li>
<li><p>Deleting Applications will not delete previously managed resources</p>
</li>
</ul>
<h3 id="heading-8-health-status-tracking">8. Health Status Tracking</h3>
<p>Performance optimization has been a key focus, with changes designed to reduce unnecessary load on the application controller while maintaining comprehensive resource tracking.</p>
<p><strong>Before:</strong></p>
<ul>
<li><p>Health status persisted under <code>/status</code> in Application CR</p>
</li>
<li><p>Caused load on application controller</p>
</li>
</ul>
<p><strong>In v3:</strong></p>
<ul>
<li><p>Health status stored externally</p>
</li>
<li><p>Can revert by setting <code>controller.resource.health.persist</code> to <code>true</code></p>
</li>
</ul>
<h3 id="heading-9-plugin-environment-variables">9. Plugin Environment Variables</h3>
<p>Plugin management has been enhanced to provide more flexibility and consistency in configuration handling.</p>
<p><strong>New Behavior:</strong></p>
<ul>
<li>Empty environment variables are now passed to config management plugins</li>
</ul>
<pre><code class="lang-bash">spec:
  <span class="hljs-built_in">source</span>:
    plugin:
      name: example-plugin
      env:
        - name: VERSION
          value: <span class="hljs-string">"1.2.3"</span>
        - name: DATA  <span class="hljs-comment"># Now passed as an empty string</span>
          value: <span class="hljs-string">""</span>
</code></pre>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Argo CD 3.0 represents a significant step in refining GitOps practices. While the changes require careful migration, they ultimately provide more granular control, improved security, and cleaner configuration management. For detailed changes and explanations, check out the official Argo CD documentation <a target="_blank" href="https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/2.14-3.0/">here</a>.</p>
<blockquote>
<p>PS:- I still do kubectl edit on production. ;)</p>
</blockquote>
]]></content:encoded></item></channel></rss>