Managed Kubernetes: Control Plane Failures at Scale

You run EKS or GKE or AKS. You chose managed Kubernetes because you didn't want to operate etcd, patch API servers, or worry about control plane upgrades. The cloud provider handles it. You just run workloads.

That's the pitch. And if you are running a 20-node cluster with a few standard web apps, it's mostly true. You can probably stop here.

But if you are scaling into the hundreds of nodes, running heavy GitOps automation, or managing multi-tenant platforms, the pitch eventually breaks. And you realize you have zero visibility into the thing that's breaking your cluster.

The Incident That Isn't in Your Runbook

A platform team at a fintech company running a 200-node EKS cluster in us-east-1 started seeing API server latency spike at 9:47 AM on a Tuesday. kubectl get pods took 12 seconds. Deployments stalled. HPA stopped scaling. The cluster wasn't down; it was slow.

They checked everything they could see: node CPU, network throughput, CoreDNS metrics, etcd disk usage (from the metrics they had). Nothing explained it.

They opened an AWS support case. The response, four hours later: "We detected elevated API server latency in the us-east-1 control plane. Our team applied a mitigation. No customer action required."

No root cause. No post-mortem. No way to prevent it from happening again.

This is the managed K8s bargain: you don't operate the control plane, but you also don't control it. And when it breaks, your only lever is a support ticket.

What "Managed" Actually Means

Let's be precise about what you're getting and what you're not.

EKS control plane: AWS runs the API servers and etcd. You get a managed endpoint. You cannot SSH into the control plane nodes. You cannot read etcd directly. You cannot adjust API server flags. AWS commits to 99.95% uptime for the standard control plane (99.99% for provisioned). A 99.95% SLA still allows about 22 minutes of downtime in a 30-day month. You don't get to see why those 22 minutes happened.

GKE control plane: Google runs the API servers and etcd. Zonal clusters get 99.5% SLA. Regional clusters get 99.95%. Autopilot gets 99.95%. You can see some control plane metrics in Cloud Monitoring, but you can't touch the configuration.

AKS control plane: Microsoft runs the API servers and etcd. SLA is 99.95% for clusters using Availability Zones, 99.9% without AZs, and the free tier carries no SLA at all. You get read-only access to some control plane logs via diagnostic settings, but no configuration access.

The pattern is the same across all three: you get an endpoint, an SLA, and a support ticket form. Everything else is a black box.
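The downtime budgets behind these SLA percentages are simple arithmetic. Here is a minimal sketch, assuming a 30-day month; the function name is illustrative and not part of any provider tooling.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLA permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # SLA percentages quoted in this article.
    for sla in (99.99, 99.95, 99.9, 99.5):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% SLA allows about {minutes:.1f} minutes per 30 days")
```

Note that providers measure availability over fixed intervals, so a slow but responsive API server may not count against this budget at all.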

Vector #1: API Server Throttling, The Silent Killer

This is the one that catches most teams off guard because it doesn't look like a failure. The API server is up. It responds. It just... responds slowly.

Why it happens

The API server is a single logical endpoint (even though it's backed by multiple instances). Every kubectl call, every controller reconciliation loop, every webhook admission check: all of it goes through the API server. And the API server has throughput limits that you cannot see or configure.

On EKS, AWS doesn't publish the exact API server request rate limits. The community has measured them at ~3,000-5,000 requests/second sustained on standard control planes, but this is an unofficial figure. The actual threshold depends on control plane size, which you can't see. If you're running EKS Provisioned Control Plane, AWS exposes explicit concurrency limits per scaling tier (XL through 8XL), which removes some of the opacity. But on standard EKS, you're guessing.

The insidious part: your controllers are the ones throttling you. Every Deployment, StatefulSet, DaemonSet, and Ingress controller in your cluster is constantly reconciling, which means reading from and writing to the API server. A cluster with 500 pods and 50 controllers can easily generate 2,000+ API requests/second during normal operation. Add a rolling restart or a node failure, and you spike past the limit.

And it cascades. When the API server is saturated, especially during heavy operations like a cluster upgrade, it's not just kubectl that gets slow. CoreDNS starts failing too. CoreDNS watches the API server for EndpointSlice updates. When API server latency spikes, CoreDNS can't keep its endpoint cache fresh. DNS resolution starts failing for your workloads. Suddenly your services can't find each other, and you're debugging DNS when the real problem is three layers up in the control plane.

What it looks like

# You'll see this in your controller logs:
# "The server has received too many requests and has asked us to slow down"
# Or just... slow kubectl commands with no error

# Check your API server latency from inside the cluster:
kubectl get --raw /metrics | grep apiserver_request_duration_seconds | head -5

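apiserver_request_duration_seconds is a histogram, so the p99 has to be derived from its _bucket series (in Prometheus, with histogram_quantile), and long-running WATCH requests should be excluded because they skew the result. If the p99 is above 1 second, your control plane is under pressure. If it stays above 3 seconds, you are very likely being throttled.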
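Because the latency metric is a histogram, a p99 has to be estimated from cumulative bucket counts. The sketch below uses linear interpolation within the matching bucket, which is the same general approach Prometheus takes for classic histograms. The bucket counts are made up for illustration, and the classification simply applies the 1-second and 3-second rules of thumb from this section.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Past the last finite bound (or an empty bucket): report the lower edge.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def classify(p99_seconds):
    """Apply this section's rule-of-thumb thresholds (1 s and 3 s)."""
    if p99_seconds > 3.0:
        return "likely throttled"
    if p99_seconds > 1.0:
        return "under pressure"
    return "healthy"


# Hypothetical cumulative counts, not real cluster data.
SAMPLE_BUCKETS = [(0.1, 900), (0.5, 970), (1.0, 985), (5.0, 998), (math.inf, 1000)]

if __name__ == "__main__":
    p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
    print(f"p99 is roughly {p99:.2f}s: {classify(p99)}")
```

The estimate is only as precise as the bucket boundaries, so treat a value that lands in a wide bucket (here 1 s to 5 s) as approximate.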

What you can't do

You can't scale the API server horizontally. You can't adjust --max-requests-inflight or --max-mutating-requests-inflight. You can't add more API server instances. These are managed by the cloud provider, and the knobs are not exposed.

What you can do

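# Top 20 request sources by APF flow schema (requires Prometheus):
topk(20, sum by (flow_schema, priority_level) (rate(apiserver_flowcontrol_dispatched_requests_total[5m])))
# Top 20 hottest resource/verb pairs:
topk(20, sum by (resource, verb) (rate(apiserver_request_total[5m])))

apiserver_request_total carries no per-client label, so it tells you what is being called, not who is calling. To pin load on a specific client, enable control plane audit logs and group by the userAgent field, or check the rest_client_requests_total metric that client-go based controllers expose about themselves. If you don't have Prometheus, you can get a rough view from the API server metrics endpoint, but the output has many label dimensions (verb, resource, group) and requires significant post-processing to be useful. The PromQL above is the cleaner path, and if you're running a large enough cluster to worry about API throttling, you should already have Prometheus.

Together these show you which flows and resources account for the most API requests. Common culprits: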

Overly aggressive reconciliation loops: controllers checking every 1 second instead of every 30
Chatty operators: some Helm operators and GitOps tools poll instead of watching
Too many webhooks: every admission webhook adds latency to every API request

When the API server is under pressure, it uses API Priority and Fairness (APF) to manage the load. Enabled by default since Kubernetes 1.20, APF layers a fair-queuing system on top of the old max-inflight request limits: it classifies requests using flow schemas (by user, namespace, resource, or other criteria), maps each flow to a priority level, and queues requests instead of simply rejecting them. The result is that no single client can starve others at the same priority level. A chatty operator won't lock out your leader elections, but it will get deprioritized relative to them. If you want to understand exactly how your requests are being classified, read the "API Priority and Fairness" page in the Kubernetes documentation. It's the mechanism behind most of the 429s you'll see from a managed control plane.

Monday morning task: Run the audit above. If a single controller or operator is generating >10% of your API requests, fix its reconciliation interval before AWS fixes it for you with a throttle.
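Once you have per-client request rates from your audit, the 10% rule is a quick share calculation. This is a small sketch; the client names and rates are hypothetical, and the function is not part of any Kubernetes tooling.

```python
def noisy_clients(request_rates, threshold=0.10):
    """Return {client: share} for clients above `threshold` of total request rate.

    `request_rates` maps a client name to its requests per second.
    Results are ordered from largest share to smallest.
    """
    total = sum(request_rates.values())
    if total == 0:
        return {}
    shares = {client: rate / total for client, rate in request_rates.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return {client: share for client, share in ranked if share > threshold}


# Hypothetical requests-per-second figures, not real measurements.
SAMPLE_RATES = {
    "gitops-operator": 450.0,
    "ingress-controller": 250.0,
    "cert-controller": 150.0,
    "backup-operator": 90.0,
    "metrics-agent": 60.0,
}

if __name__ == "__main__":
    for client, share in noisy_clients(SAMPLE_RATES).items():
        print(f"{client}: {share:.0%} of API requests")
```

Core components such as the kubelet will often exceed the threshold legitimately; the rule is aimed at individual controllers and operators that you can reconfigure.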

Vector #2: etcd Compaction, The Storage Bomb Nobody Monitors

etcd is the brain of your Kubernetes cluster. Every Secret, ConfigMap, Pod, and Deployment lives in etcd. And etcd has a storage limit that, when hit, causes the control plane to stop accepting writes.

The mechanics

etcd is a key-value store built on MVCC (multi-version concurrency control). Every update creates a new revision. Old revisions are cleaned up by compaction, a background process that removes historical revisions. Compaction runs automatically, but it can fall behind if your cluster has high write throughput, and compacted space only goes back to the filesystem after a defragmentation.

When etcd storage exceeds its quota, etcd stops accepting writes. The quota defaults to 2 GB in open-source etcd, though managed providers like EKS dynamically scale the underlying storage. The API server can read existing resources but cannot create or update anything. Your cluster is effectively frozen. The exact threshold is opaque on managed offerings: AWS doesn't publish the number, and it can vary by cluster size and control plane version.

Why managed doesn't save you

On EKS, GKE, and AKS, you can't run etcdctl directly. You can't check etcd_mvcc_db_total_size_in_bytes. You can't manually trigger compaction. You're dependent on the cloud provider's etcd management, and their compaction schedule may not match your write pattern.

The telltale sign: you'll see etcdserver: mvcc: database space exceeded in API server logs, but you won't see it until after the cluster is already read-only.

What you can monitor

You don't know your exact storage ceiling, so you cannot alert on a static percentage. Instead, you must monitor for anomalous velocity and upstream danger zones using Prometheus, which scrapes the API server metrics endpoint.

The Managed Metric Catch: Because you don't own the etcd nodes, you only get the metrics the cloud provider chooses to pass through the API server. The actual size metric (etcd_db_total_size_in_bytes) is notoriously hidden on standard EKS clusters, and newer Kubernetes releases have renamed it (look for apiserver_storage_db_total_size_in_bytes or apiserver_storage_size_bytes), so check which name your API server actually exposes. If your provider exposes it, use it. If they don't, your fallback is apiserver_storage_objects, a count of all resources currently stored in etcd.

Alarm Strategy 1: The Danger Zone. Set an absolute alarm. If you have the size metric, alert at 1.5 GB. If you are forced to use object counts, alert at ~100,000 objects. Upstream etcd defaults to a 2 GB storage quota. Even if your cloud provider has dynamically scaled your control plane higher, hitting these numbers means you have a massive amount of state and need to investigate immediately.

Alarm Strategy 2: The Runaway Controller. Alert on velocity, not just capacity. A controller stuck in an infinite loop will spike your storage graph vertically.

# IDEAL: If your provider exposes etcd size
# Alert if db size grows by more than 15% in one hour:
(etcd_db_total_size_in_bytes - etcd_db_total_size_in_bytes offset 1h) 
  / etcd_db_total_size_in_bytes offset 1h > 0.15

# FALLBACK: The standard for EKS
# Alert if total object count grows by more than 15% in one hour:
(sum(apiserver_storage_objects) - sum(apiserver_storage_objects offset 1h)) 
  / sum(apiserver_storage_objects offset 1h) > 0.15

Monday morning task: Implement the velocity alert above. A controller creating ConfigMaps in a loop can exhaust your remaining storage in minutes; a steady-state cluster might have hours. Use the velocity alarm as a trigger to immediately identify what is writing so much data before you hit the provider's invisible ceiling.
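To make the velocity rule concrete, here is the same calculation as the PromQL above, plus a rough time-to-ceiling estimate. This is a sketch only: the object counts are hypothetical, the 100,000-object ceiling is the danger-zone figure from this section rather than a provider limit, and the extrapolation assumes growth stays linear.

```python
def growth_ratio(current, hour_ago):
    """Relative growth over the last hour: (now - then) / then."""
    if hour_ago <= 0:
        raise ValueError("hour_ago must be positive")
    return (current - hour_ago) / hour_ago


def hours_until(ceiling, current, hour_ago):
    """Hours until `current` reaches `ceiling` if the last hour's growth continues."""
    delta_per_hour = current - hour_ago
    if delta_per_hour <= 0:
        return float("inf")  # flat or shrinking: no projected breach
    return max(ceiling - current, 0) / delta_per_hour


if __name__ == "__main__":
    # Hypothetical samples of sum(apiserver_storage_objects).
    now, hour_ago = 94_000, 80_000
    ratio = growth_ratio(now, hour_ago)
    eta_minutes = hours_until(100_000, now, hour_ago) * 60
    print(f"growth over the last hour: {ratio:.1%}, alert firing: {ratio > 0.15}")
    print(f"projected time to 100,000 objects: about {eta_minutes:.0f} minutes")
```

A runaway controller rarely grows linearly, so treat the projection as an upper bound on how long you have.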

Vector #3: CNI Limits, When Your Network Is Someone Else's Problem

This is the one that bites at scale. Every pod needs an IP address. In AWS, that IP comes from your VPC subnet. And VPC subnets have limits that are invisible until you hit them.

The AWS VPC CNI problem

The AWS VPC CNI plugin assigns an IP address from your subnet to every pod. A /24 subnet gives you 251 usable IPs (5 are reserved by AWS). That's 251 pods per subnet. If you run a cluster with 1,000 pods, you need at least 4 subnets just for IP capacity.
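The subnet arithmetic can be checked with Python's standard ipaddress module. This is a sketch that considers pod IPs only; nodes and other resources in the same subnets also consume addresses, so real headroom is smaller. The CIDR blocks are placeholders.

```python
import ipaddress
import math

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Usable addresses in an AWS subnet with the given CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def subnets_needed(pod_count, cidr):
    """Minimum number of equally sized subnets needed to give every pod an IP."""
    return math.ceil(pod_count / usable_ips(cidr))


if __name__ == "__main__":
    print("usable IPs in a /24:", usable_ips("10.0.0.0/24"))
    print("/24 subnets for 1,000 pods:", subnets_needed(1000, "10.0.0.0/24"))
```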

But it gets worse. The VPC CNI also has ENI (Elastic Network Interface) limits per instance type. The formula is: ENIs × (IPs per ENI − 1) + 2. One IP per ENI is reserved for the ENI itself, and the +2 accounts for hostNetwork pods (kube-proxy, aws-node) that don't consume CNI slots.

An m5.large supports 3 ENIs with 10 IPs each: 3 × (10 − 1) + 2 = 29 pods max per node. An m5.2xlarge supports 4 ENIs with 15 IPs each: 4 × (15 − 1) + 2 = 58 pods per node.

# Check your ENI limits per instance type:
aws ec2 describe-instance-types \
  --instance-types m5.large m5.xlarge m5.2xlarge \
  --query 'InstanceTypes[].{Type: InstanceType, ENIs: NetworkInfo.MaximumNetworkInterfaces, IPs: NetworkInfo.Ipv4AddressesPerInterface}'

Output:

[
  {"Type": "m5.large", "ENIs": 3, "IPs": 10},
  {"Type": "m5.xlarge", "ENIs": 4, "IPs": 15},
  {"Type": "m5.2xlarge", "ENIs": 4, "IPs": 15}
]

So your m5.2xlarge with 8 vCPUs and 32 GB RAM can only run 58 pods. The remaining resources are wasted because you ran out of IPs.
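The per-node limit is easy to compute once you have the ENI figures from the command above. Here is a minimal sketch of the formula, using the three instance types shown in the sample output; the function name is illustrative.

```python
def max_pods(enis, ips_per_eni):
    """Pod limit under the VPC CNI without Prefix Delegation.

    One IP per ENI is reserved for the ENI itself; the +2 covers
    host-network pods that do not consume a CNI-assigned IP.
    """
    return enis * (ips_per_eni - 1) + 2


# (ENIs, IPv4 addresses per ENI), taken from the describe-instance-types output above.
INSTANCE_LIMITS = {
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
    "m5.2xlarge": (4, 15),
}

if __name__ == "__main__":
    for instance_type, (enis, ips) in INSTANCE_LIMITS.items():
        print(f"{instance_type}: {max_pods(enis, ips)} pods")
```

Note that m5.xlarge and m5.2xlarge share the same limit even though the larger instance has twice the CPU and memory.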

Note: AWS introduced Prefix Delegation to mitigate this. It assigns entire /28 blocks (16 addresses each) to ENIs instead of single IPs, raising pod limits significantly. An m5.large jumps from 29 to a theoretical 400+ IPs, but AWS's recommended max-pods setting sensibly caps smaller instance types at the upstream Kubernetes default of 110 pods per node. If your cluster isn't configured for Prefix Delegation, or you exhaust the free /28 prefixes in your subnet, you hit the same wall. It's not enabled by default on existing clusters.
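To see where the "400+" figure comes from, the sketch below applies the same ENI formula with each address slot holding a /28 prefix instead of a single IP. The 110-pod cap is the figure quoted in this article and is passed as a parameter, since the right value depends on how your nodes are configured.

```python
ADDRESSES_PER_PREFIX = 16  # a /28 block


def max_pods_with_prefixes(enis, ips_per_eni, pod_cap=110):
    """Return (theoretical, effective) pod limits with Prefix Delegation.

    Each secondary address slot on an ENI holds a /28 prefix instead of one IP.
    `pod_cap` is the per-node max-pods setting (110 is the figure used here).
    """
    theoretical = enis * (ips_per_eni - 1) * ADDRESSES_PER_PREFIX + 2
    return theoretical, min(theoretical, pod_cap)


if __name__ == "__main__":
    theoretical, effective = max_pods_with_prefixes(3, 10)  # m5.large
    print(f"m5.large: {theoretical} theoretical, {effective} effective")
```

The gap between the two numbers shows that, with prefixes enabled, the per-node pod cap and subnet fragmentation become the binding constraints rather than ENI slots.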

The GKE and AKS angle

GKE uses alias IP ranges (VPC-native networking), which is more flexible: pods get IPs from a secondary range, not the node subnet. But you still have a maximum pods-per-node limit (default 110 on GKE Standard, configurable up to 256).

AKS uses Azure CNI with similar IP-per-node limits. The default is 30 pods per node, configurable up to 250.

What you can't do

You can't change the ENI limits on AWS; they're hardware limits of the instance type. You can't make a /24 subnet give you more than 251 IPs. You can't run more pods than your IP allocation allows.

What you can do

# Check your nodes' hard limits for pod capacity based on their instance type:
kubectl get nodes -o json | \
  jq '[.items[] | {name: .metadata.name, capacity: .status.capacity.pods, allocatable: .status.allocatable.pods}]'

Monday morning task: Calculate your IP headroom. (Total subnet IPs) - (Total pods across all nodes) = your buffer. If the buffer is less than 20% of your total subnet IPs, you're one node failure away from a scheduling outage. When a node dies, its pods reschedule, and they need IPs from the remaining subnets. If those subnets are full, the pods stay in Pending until addresses free up.
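The headroom check can be sketched in a few lines. The numbers below are hypothetical (four /24 subnets and 850 pod IPs in use), and the 20% threshold is the rule of thumb from this section.

```python
def ip_headroom(total_subnet_ips, ips_in_use):
    """Return (free_ips, free_fraction) for the cluster's pod subnets."""
    if total_subnet_ips <= 0:
        raise ValueError("total_subnet_ips must be positive")
    free = total_subnet_ips - ips_in_use
    return free, free / total_subnet_ips


def at_risk(total_subnet_ips, ips_in_use, min_fraction=0.20):
    """True when the free fraction is below the safety threshold."""
    _, fraction = ip_headroom(total_subnet_ips, ips_in_use)
    return fraction < min_fraction


if __name__ == "__main__":
    total = 4 * 251  # four /24 subnets, 251 usable IPs each
    used = 850       # hypothetical count of pod IPs in use
    free, fraction = ip_headroom(total, used)
    print(f"{free} free IPs ({fraction:.1%}), at risk: {at_risk(total, used)}")
```

A cluster-wide total can hide a single full subnet, so it is worth running the same check per subnet or per availability zone.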

The SLA Is Not a Strategy

Here's the uncomfortable truth about managed K8s SLAs:

Provider	Control Plane SLA	Max Downtime/Month	Service Credit
EKS Standard	99.95% (5-min intervals)	~22 minutes	10% (≥99.0%) / 25% (≥95.0%) / 100% (<95.0%)
EKS Provisioned	99.99% (1-min intervals)	~4.3 minutes	10% (≥99.0%) / 25% (≥95.0%) / 100% (<95.0%)
GKE Regional / Autopilot	99.95%	~22 minutes	10% / 25% (max 50% of bill)
GKE Zonal	99.5%	~3.6 hours	10% / 25% (max 50% of bill)
AKS (with Availability Zones)	99.95%	~22 minutes	10% / 25% / 100%
AKS (without AZs / free tier)	99.9% / no SLA	~43 minutes / none	None

A 10% service credit on a $500/month EKS cluster is $50. That's what you get for 22 minutes of downtime. Your SLA with your own customers is probably more generous than that. And GKE caps its maximum credit at 50% of the monthly bill: even if the control plane is down for days, you never get more than half your money back.
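The credit tiers in the EKS rows of the table reduce to a small lookup. This sketch follows the article's example in applying the percentage to a $500 monthly bill; which charges a credit actually applies to is defined by the provider's SLA terms, so treat the dollar figure as illustrative.

```python
def eks_credit_percent(monthly_uptime_percent, sla_percent=99.95):
    """Service credit tier, using the thresholds from the table above."""
    if monthly_uptime_percent >= sla_percent:
        return 0    # SLA met, no credit
    if monthly_uptime_percent >= 99.0:
        return 10
    if monthly_uptime_percent >= 95.0:
        return 25
    return 100


def credit_dollars(monthly_bill, monthly_uptime_percent):
    """Dollar value of the credit for a given bill and measured uptime."""
    return monthly_bill * eks_credit_percent(monthly_uptime_percent) / 100


if __name__ == "__main__":
    for uptime in (99.97, 99.9, 98.0, 90.0):
        print(f"{uptime}% uptime on a $500 bill: ${credit_dollars(500, uptime):.2f} credit")
```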

The SLA is a financial backstop, not an operational strategy. It doesn't tell you why the control plane went down. It doesn't tell you how to prevent it. It doesn't give you the metrics to detect it before your users do.

The Provisioned Control Plane Exception

Everything above describes the standard managed K8s experience: opaque control plane, invisible limits, no knobs to turn. But there's one exception that changes the calculus: EKS Provisioned Control Plane.

Launched GA in November 2025, Provisioned Control Plane exposes explicit scaling tiers instead of hiding them. There are four tiers (XL, 2XL, 4XL, and 8XL), each multiplying API request concurrency, pod scheduling rate, and etcd storage against the XL baseline. Pricing starts at $1.65/hr for XL and scales up to $13.90/hr for 8XL, on top of the standard $0.10/hr cluster fee. The exact concurrency and storage numbers for each tier are published in the AWS EKS documentation; check them against your cluster's current request rate before committing.

The 99.99% SLA is measured in 1-minute intervals (vs. 5-minute for standard), which means the fintech incident at the top of this post, with its 12-second API responses, might not have qualified as "downtime" under the standard SLA but would be caught under Provisioned.

This is a direct response to the exact problem this article diagnoses: control plane opacity. AWS is essentially saying "you were right, so here's a tier where we expose the knobs."

The catch: At $1.65/hr (about $1,204/month) for the XL tier, Provisioned Control Plane costs more than many entire clusters. For a team whose cluster costs $500/month, adding Provisioned Control Plane more than triples the monthly bill. It's a solution, but it's priced for teams that are already feeling the pain, not for teams trying to prevent it.
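The monthly figures follow from the hourly prices quoted above. Here is a quick sketch, assuming 730 hours per month (8,760 hours per year divided by 12) and taking the article's $500/month cluster as the baseline.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_cost(hourly_rate):
    """Convert an hourly price to an average monthly price."""
    return hourly_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    standard_fee = monthly_cost(0.10)   # standard cluster fee
    xl_tier = monthly_cost(1.65)        # Provisioned XL, charged on top of the fee
    baseline_bill = 500.0               # the article's example cluster

    print(f"standard cluster fee: ${standard_fee:,.2f}/month")
    print(f"XL tier surcharge:    ${xl_tier:,.2f}/month")
    print(f"bill multiplier:      {(baseline_bill + xl_tier) / baseline_bill:.1f}x")
```

Swapping in 13.90 for the hourly rate gives the 8XL figure, which is worth doing before assuming you can simply scale up a tier during an incident.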

Monday morning task: If your cluster is hitting API server latency above 1-second p99 regularly, or you're running 500+ nodes, price out Provisioned Control Plane XL. The math might work if you're already losing more than $1,200/month to throttled deployments and debugging time.

What Good Looks Like

You can't operate the control plane. But you can operate around it:

Monitor what you can see. API server latency, etcd storage, IP utilization, ENI attachment rates. Set alarms at 70% of every limit you can measure. The cloud provider won't tell you when you're approaching a limit; you have to figure it out yourself.
Reduce your control plane load. Fewer controllers, longer reconciliation intervals, fewer webhooks. Every API request you don't make is a request that won't be throttled.
Design for control plane failure. If the API server goes down, your workloads keep running, but nothing changes. No new pods, no scaling, no config updates. Make sure your applications can survive a frozen cluster without degrading. How long? Depends on the provider and the incident: could be minutes, could be hours.
Run regional, not zonal. GKE zonal clusters have a 99.5% SLA, which is 3.6 hours of potential downtime per month. Regional clusters get 99.95%. The cost difference is minimal. The reliability difference is enormous.

The Bottom Line

Managed Kubernetes eliminates the toil of operating a control plane. But at scale, it does not eliminate the risk of control plane failure. If your cluster is small, the managed abstractions work beautifully. But when you push the limits of infrastructure, you still pay for every minute of downtime, every throttled request, every frozen deployment. You just can't see it coming.

EKS Provisioned Control Plane is the first real acknowledgment from a cloud provider that this is a problem worth solving, but at $1,200+/month for the entry tier, it's a solution priced for teams that are already bleeding. For everyone else, the advice is the same: monitor what you can see, reduce what you can control, and design for the failure you can't avoid.

The control plane you don't own is the control plane that will fail in ways you can't predict, can't prevent, and can't fix. Your only defense is to know your limits before the cloud provider's support ticket tells you that you've hit them.
