EKS vs GKE vs AKS - Evaluating Kubernetes in the Cloud

Providing an update on the core Kubernetes services offered by the big three

We are now six years past the initial release of Kubernetes, and it continues to be one of the fastest-growing open-source projects to date. The rapid development and adoption of Kubernetes have resulted in many different implementations of the platform. The Cloud Native Computing Foundation (CNCF) currently lists over 90 Certified Kubernetes offerings. To ensure some consistency between platforms, the CNCF focuses on three core tenets:

Consistency: The ability to interact consistently with any Kubernetes installation.
Timely updates: Vendors are required to keep versions updated, at least yearly.
Confirmability: Any end-user can verify the conformity using Sonobuoy.

These are the baseline requirements for the CNCF when it comes to Kubernetes, but cloud providers have such rich ecosystems that there are bound to be more significant discrepancies. We continue to examine the many current features and limitations of managed Kubernetes services from the three largest cloud providers:

Amazon’s Elastic Kubernetes Service (EKS)
Microsoft’s Azure Kubernetes Service (AKS)
Google’s Kubernetes Engine (GKE)

We hope that by presenting this information side-by-side, both current Kubernetes users and prospective adopters can better understand their options or get an overview of the current state of managed Kubernetes offerings. This comparison aims to cover concepts such as version availability, network and security options, and container image services. The overview will not detail pricing or topics outside of a platform’s technical capabilities for Kubernetes. All information was current as of January 2021, and you can find more caveats in the “Notes on Data and Sources” at the end of this post.

General information

	Amazon EKS	Microsoft AKS	Google GKE	Kubernetes
Currently supported Kubernetes version(s)	1.18 (default), 1.17, 1.16, 1.15	1.20 (preview), 1.19, 1.18, 1.17	1.17, 1.16, 1.15 (default), 1.14	1.20, 1.19, 1.18
# of supported minor version releases	≥3 + 1 deprecated	3	4	3
Original GA release date	June 2018	June 2018	August 2015	July 2015 (Kubernetes 1.0)
CNCF Kubernetes Conformance	Yes	Yes	Yes	—
Latest CNCF-certified version	1.18	1.19	1.18	—
Control-plane upgrade process	User-initiated; user must also manually update the system services that run on nodes (e.g., kube-proxy, CoreDNS, AWS VPC CNI)	User-initiated; all system components are updated with the cluster upgrade	Automatically upgraded by default; can be user-initiated	—
Node upgrade process	Unmanaged node groups: user-initiated and user-managed; managed node groups: user-initiated, EKS will drain and replace nodes	Automatically upgraded or user-initiated; AKS will drain and replace nodes	Automatically upgraded (default; can be turned off) during cluster maintenance window; can be user-initiated; GKE drains and replaces nodes	—
Node OS	Linux: Amazon Linux 2 (default), Ubuntu (partner AMI), Bottlerocket; Windows: Windows Server 2019	Linux: Ubuntu; Windows: Windows Server 2019	Linux: Container-Optimized OS (COS) (default), Ubuntu; Windows: Windows Server 2019, Windows Server version 1909	Linux: any OS supported by a compatible container runtime; Windows: Windows Server 2019 (Kubernetes v1.14+)
Container runtime	Docker (default), containerd (through Bottlerocket)	Docker (default), containerd	Docker (default), containerd, gVisor	Linux: Docker, containerd, CRI-O, rktlet, or any runtime that implements the Kubernetes CRI (Container Runtime Interface); Windows: Docker EE-basic 18.09
Control plane high availability options	Control plane is deployed across multiple Availability Zones (default)	Control plane components will be spread between the number of zones defined by the admin	Zonal clusters: single control plane; regional clusters: quorum of three Kubernetes control planes	Supported
Control plane SLA	99.95% (default)	99.95% (SLA backed); 99.9% (non-SLA backed)	Zonal clusters: 99.5%; regional clusters: 99.95%	—
SLA financially-backed	Yes	Yes	Yes	—
Pricing	$0.10/hour (USD) per cluster + standard costs of EC2 instances and other resources	Pay-as-you-go: Standard costs of node VMs and other resources	$0.10/hour (USD) per cluster + standard costs of GCE machines and other resources	—
GPU support	Yes (NVIDIA); user must install device plugin in cluster	Yes (NVIDIA); user must install device plugin in cluster	Yes (NVIDIA); user must install device plugin in cluster; Compute Engine A2 VMs are also available	Supported with device plugins
Control plane: log collection	Optional Default: Off Logs are sent to AWS CloudWatch	Optional Default: Off Logs are sent to Azure Monitor	Optional Default: Off Logs are sent to Stackdriver	—
Container performance metrics	Optional Default: Off Metrics are sent to AWS CloudWatch Container Insights	Optional Default: Off Metrics are sent to Azure Monitor	Optional Default: Off Metrics are sent to Stackdriver	—
Node health monitoring	No Kubernetes-aware support; if node instance fails, the AWS autoscaling group of the node pool will replace it	Auto repair is now available. Node status monitoring is available. Use autoscaling rules to shift workloads.	Node auto-repair enabled by default	—

Comments

Starting with the supported versions, AKS has been quicker to support the newer Kubernetes versions and has also announced support for more patch releases. AKS has a very structured approach to its supported versions and continues to push customers off of old versions to take advantage of newer Kubernetes features. However, customers may find more flexibility in GKE’s overall number of supported versions. GKE maintains four minor versions, with around 12 total versions supported between 1.14 and 1.17. EKS comes in with the same number of supported minor versions but only four versions available in total. EKS reflects a business approach to version management by continuing its support for 1.15, the most used Kubernetes version in production.
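To make the version comparison concrete, here is a minimal Python sketch that computes how many minor releases each provider's newest offering trails upstream Kubernetes. The data is copied from the table above (January 2021 snapshot, with AKS 1.20 counted even though it was in preview); the function names are illustrative assumptions, not part of any provider API.

```python
from typing import List

# Supported minor versions per provider, copied from the table above.
# AKS 1.20 was in preview at the time of the snapshot.
SUPPORTED = {
    "EKS": ["1.18", "1.17", "1.16", "1.15"],
    "AKS": ["1.20", "1.19", "1.18", "1.17"],
    "GKE": ["1.17", "1.16", "1.15", "1.14"],
}
UPSTREAM_LATEST = "1.20"


def minor(version: str) -> int:
    """Return the minor component of a 'major.minor' version string."""
    return int(version.split(".")[1])


def newest_supported(versions: List[str]) -> str:
    """Pick the newest version by minor number rather than string order."""
    return max(versions, key=minor)


def lag_behind_upstream(versions: List[str], upstream: str = UPSTREAM_LATEST) -> int:
    """Minor releases between upstream and the newest supported version."""
    return minor(upstream) - minor(newest_supported(versions))


for provider, versions in SUPPORTED.items():
    print(provider, newest_supported(versions), lag_behind_upstream(versions))
# EKS 1.18 2
# AKS 1.20 0
# GKE 1.17 3
```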

One significant difference between the cloud provider options concerns the amount of management that each provides for clusters, particularly for control plane components. GKE still maintains the lead here, offering automated upgrades for the control plane and nodes, in addition to detecting and fixing unhealthy nodes. GKE also offers release channels, which automate upgrades and give developers a way to test new versions. AKS has taken a page out of GKE’s playbook and now offers automated upgrades for Kubernetes nodes, with channels similar to GKE’s. Upgrades in AKS and EKS still require at least some manual work, with both requiring user-initiated upgrades of the core Kubernetes control-plane components.

EKS does not offer any specialized node health monitoring or repair. EKS customers can create custom health checks to do some degree of node health monitoring and customer-automated replacement for EKS clusters. AKS has announced support for a node auto-repair feature, and when paired with its auto-scaling node pools, this should suffice for most organizations’ HA requirements. GKE remains the clear leader in cluster health maintenance with auto-repair enabled by default.

There has been a leveling off between providers when it comes to service level agreements. All providers offer an uptime of 99.95%; however, EKS provides this by default, while AKS and GKE require additional costs or regional usage to achieve the same uptime. EKS and now GKE charge for their control plane usage at $0.10/cluster/hour. That amount will make up a negligible part of the total cost for all but the smallest clusters, and it comes with a financially backed SLA. All three providers now refund SLA penalties. Although these refunds rarely compare to the loss of potential productivity or revenue suffered during a provider outage, published penalties can bring a greater degree of confidence, real or perceived, in the seriousness of the provider’s commitment to reliability and uptime.
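For a sense of scale, the sketch below converts an uptime percentage into allowed downtime and the hourly control-plane fee into a monthly figure. It assumes a 30-day month for simplicity; each provider's SLA defines its own measurement period and exclusions, so treat the numbers as approximations. The function names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def allowed_downtime_minutes(sla_percent: float, hours: int = HOURS_PER_MONTH) -> float:
    """Minutes of downtime in the period that still satisfy the uptime SLA."""
    return hours * 60 * (1 - sla_percent / 100)


def control_plane_fee(rate_per_hour: float = 0.10, hours: int = HOURS_PER_MONTH) -> float:
    """Control-plane charge in USD for one cluster over the period."""
    return rate_per_hour * hours


for sla in (99.95, 99.9):
    print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.1f} min/month of downtime")
# 99.95% uptime allows 21.6 min/month of downtime
# 99.9% uptime allows 43.2 min/month of downtime

print(f"Control plane fee: ${control_plane_fee():.2f} per cluster per month")
# Control plane fee: $72.00 per cluster per month
```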

While pods and nodes running in a Kubernetes cluster can survive outages of the control plane and its components, even short-lived interruptions can be problematic for some workloads. Depending on the affected control plane components, failed pods may not get rescheduled, or clients may not connect to the cluster API to perform queries or manage resources in the cluster. If the etcd database loses quorum (assuming it has been deployed as a highly-available cluster) or experiences severe data corruption or loss, the Kubernetes cluster may become unrecoverable.
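The quorum requirement mentioned above follows from etcd's majority-based consensus: a cluster of n members needs more than half of them available to accept writes. A minimal sketch of that arithmetic (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while the cluster still keeps quorum."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that a fourth member adds no extra fault tolerance over three, which is one reason odd member counts are the usual choice.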

Lastly, GKE supports a variety of operating systems (OS) and container runtimes. Along with Windows and Linux OS support, GKE supports Container-Optimized OS (COS). COS is a simplified but hardened Linux version, allowing for quicker container deployments and scaling. EKS counters with its Bottlerocket offering, another container-optimized OS, which runs containerd instead of the standard Docker engine. With the news in December about the Kubernetes deprecation of Docker as a container runtime, it will be necessary to follow how the providers adapt and support other container runtimes.

Service Limits

Limits are per account (AWS), subscription (Azure), or project (GCP) unless otherwise noted. Limits for which the customer can request an increase are indicated with an asterisk (*).

	EKS	AKS	GKE	Kubernetes (as of v1.19)
Max clusters	100/region*	1000	50/zone + 50 regional clusters	—
Max nodes per cluster	30 (Managed node groups) * 100 (Max nodes per group) = 3000*	1000 (Virtual Machine Scale Sets) 100 (VM Availability Sets)	15,000 nodes (v1.18 required) 5000 nodes (v1.17 or lower)	5000
Max nodes per node pool/group	Managed node groups: 100*	100	1000	—
Max node pools/groups per cluster	Managed node groups: 30*	100 nodes per node pool	Not documented	—
Max pods per node	Linux: varies by node instance type: ((# of IPs per Elastic Network Interface - 1) * # of ENIs) + 2; Windows: # of IPs per ENI - 1	250 (Azure CNI, max, configured at cluster creation time); 110 (kubenet network); 30 (Azure CNI, default)	110 (default)	100 (recommended value, configurable)

Comments

While most of these limits are relatively straightforward, a couple are not.

In AKS, the absolute maximum number of nodes that a cluster can have depends on a few configurations, including whether the node is in a Virtual Machine Scale Set or an Availability Set and whether cluster networking uses kubenet or the Azure CNI. Even then, it is still unclear which number takes absolute precedence for specific configurations.

Meanwhile, in EKS, planning for the maximum number of pods scheduled on a Linux node requires some research and math. EKS clusters use the AWS VPC CNI for cluster networking. This CNI puts the pods directly on the VPC network by using ENIs (Elastic Network Interfaces), virtual network devices attached to EC2 instances. Different EC2 instance types support a different number of ENIs and a different number of IP addresses per ENI (each pod needs one address). Therefore, to determine how many pods a particular EC2 instance type can run in an EKS cluster, you would look up both values for that instance type in the AWS documentation and plug them into this formula: ((# of IPs per Elastic Network Interface - 1) * # of ENIs) + 2. A c5.12xlarge EC2 instance, which can support 8 ENIs with 30 IPv4 addresses each, can therefore accommodate up to ((30 - 1) * 8) + 2 = 234 pods. Note that large nodes with the maximum number of scheduled pods will eat up the /16 IPv4 CIDR block of the cluster’s VPC very quickly. Pod limits for Windows nodes in EKS are easier to compute and much lower. Here, use the formula # of IP addresses per ENI - 1. The same c5.12xlarge instance that could run as many as 234 pods as a Linux node could only run 29 pods as a Windows node.
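The two formulas above can be sketched as small Python functions. The function names and the 10.0.0.0 base address are illustrative assumptions. The node-count estimate is only a rough upper bound: it assumes every address slot on every ENI is allocated and ignores addresses reserved by AWS or consumed by other resources in the VPC.

```python
import ipaddress


def max_pods_linux(enis: int, ips_per_eni: int) -> int:
    """EKS Linux node limit: ((IPs per ENI - 1) * number of ENIs) + 2."""
    return (ips_per_eni - 1) * enis + 2


def max_pods_windows(ips_per_eni: int) -> int:
    """EKS Windows node limit: IPs per ENI - 1."""
    return ips_per_eni - 1


def max_full_nodes_in_vpc(enis: int, ips_per_eni: int, prefix_len: int = 16) -> int:
    """Rough upper bound on fully populated nodes that fit in a VPC CIDR block.

    Assumes each node consumes enis * ips_per_eni addresses (hypothetical
    worst case) and that the whole block is available to nodes.
    """
    block = ipaddress.ip_network(f"10.0.0.0/{prefix_len}")
    return block.num_addresses // (enis * ips_per_eni)


# c5.12xlarge values quoted in the text: 8 ENIs, 30 IPv4 addresses each
print(max_pods_linux(8, 30))         # 234
print(max_pods_windows(30))          # 29
print(max_full_nodes_in_vpc(8, 30))  # 273
```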

GKE sizes the pod IP range of each worker node around the maximum number of pods per node. With the default /24 range, each node has 256 addresses for a limit of 110 pods. Keeping more than twice as many addresses as pods reduces address reuse as pods come and go, which allows for quicker and more reliable scaling.
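The relationship between the /24 range and the 110-pod limit can be checked with Python's standard `ipaddress` module. The sketch below also derives a range size from a pod limit, assuming a rule of at least twice as many addresses as pods; that headroom factor, the function name, and the 10.4.0.0 base address are assumptions made for illustration.

```python
import ipaddress


def pod_range_prefix(max_pods: int, headroom: int = 2) -> int:
    """Smallest IPv4 prefix length whose range holds max_pods * headroom addresses."""
    needed = max_pods * headroom
    host_bits = (needed - 1).bit_length()  # bits required to count `needed` addresses
    return 32 - host_bits


node_range = ipaddress.ip_network("10.4.0.0/24")  # hypothetical per-node pod range
print(node_range.num_addresses)                   # 256
print(round(node_range.num_addresses / 110, 2))   # 2.33 addresses per pod slot
print(pod_range_prefix(110))                      # 24
```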

Networking + Security

	EKS	AKS	GKE	Kubernetes
Network plugin/CNI	Amazon VPC Container Network Interface (CNI)	Azure CNI or kubenet	kubenet (default); Calico (added for Network Policies)	kubenet (default); external CNIs can be added
Kubernetes RBAC	Required Immutable after cluster creation	Enabled by default Immutable after cluster creation	Enabled by default Mutable after cluster creation	Supported since 2017
Kubernetes Network Policy	Not enabled by default Calico can be manually installed at any time	Not enabled by default Must be enabled at cluster creation time	Not enabled by default Can be enabled at any time	Not enabled by default CNI implementing Network Policy API can be installed manually
PodSecurityPolicy support	PSP controller installed in all clusters with permissive default policy (v1.13+)	PSP can be installed at any time; will be deprecated on May 31st, 2021 in favor of Azure Policy	PSP can be installed at any time; currently in beta	PSP admission controller needs to be enabled as kube-apiserver flag; set to be deprecated in version 1.21
Private or public IP address for cluster Kubernetes API	Public by default Optional public, hybrid or private setup	Public by default Private-only available where private link is supported	Public by default Optional public, hybrid or private setup	—
Private or Public IP addresses for nodes	Unmanaged node groups: Optional Managed node groups: Optional	Public by default Private can be enabled as well	Public by default Private can be enabled as well	—
Pod-to-pod traffic encryption supported by provider	No by default	No by default	Yes, with Istio implemented	Requires a CNI implementation with this functionality
Firewall for cluster Kubernetes API	CIDR allow list option	CIDR allow list option	CIDR allow list option	—
Read-only root filesystem on node	Pod security policy required	Azure policy required	COS: default Alternative: Pod security policy required	Supported

Comments

All three providers now deploy with Kubernetes RBAC enabled by default, a big win in the security column. By making RBAC mandatory, EKS keeps its core Kubernetes security controls standard in every cluster. EKS also ensures support for Pod Security Policy with a permissive policy by default. Conversely, AKS makes it harder to manage security by requiring network policies to be enabled at cluster creation time. Users who want to adopt these Kubernetes-native security controls on an existing cluster must migrate its workloads to a new cluster to take advantage of them.

EKS requires the customer to install and manage upgrades for the Calico CNI themselves. AKS provides two options for Network Policy support, depending on the cluster network type but only allows enabling support at cluster creation time. AKS also provides additional policy management features via Azure Policy, which seems promising for hardening AKS clusters.

All three cloud providers now offer a few options for limiting network access to the Kubernetes API endpoint of a cluster. However, even with Kubernetes RBAC and a secure authentication method enabled for a cluster, leaving the API server open to the world still leaves it exposed. An exposed API server gives attackers more opportunities to gain access to the cluster. Applying a CIDR allowlist or giving the API a private, internal IP address rather than a public address also protects against scenarios such as compromised cluster credentials.
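Conceptually, a CIDR allowlist is a membership test on the client's source address. The providers enforce this on their side, so the sketch below only illustrates the logic; the networks are documentation-reserved example ranges and the function name is hypothetical.

```python
import ipaddress

# Hypothetical allowlist: an office network and a single CI runner address
ALLOWLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip: str, allowlist=ALLOWLIST) -> bool:
    """Return True if the client address falls inside any allowlisted CIDR block."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowlist)


print(is_allowed("203.0.113.42"))   # True: inside the /24
print(is_allowed("198.51.100.18"))  # False: adjacent to, but not, the /32
```

A request from outside the allowlist is rejected at the network level before authentication is attempted, which is why the allowlist still helps when credentials have leaked.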

EKS introduced managed node groups at re:Invent December 2019. While managed node groups remove a fair bit of the previous work required to create and maintain an EKS cluster, they come with a distinct disadvantage for node network security. All nodes in a managed node group must have a public IP address and must be able to send traffic out of the VPC. Effectively restricting egress traffic from the nodes becomes more difficult. While external access to these public addresses can be protected with proper security group rules and network ACLs, they still pose a severe risk if the customer incorrectly configures or does not restrict the network controls of a cluster’s VPC. This risk can be mitigated somewhat by only placing the nodes on private subnets.

Container Image Services

	EKS	AKS	GKE
Image repository service	ECR (Elastic Container Registry)	ACR (Azure Container Registry)	AR (Artifact Registry)
Supported formats	Docker Image Manifest V2, Schema 1 Docker Image Manifest V2, Schema 2 Open Container Initiative (OCI) Specifications Helm Charts	Docker Image Manifest V2, Schema 1 Docker Image Manifest V2, Schema 2 Open Container Initiative (OCI) Specifications Helm Charts	Docker Image Manifest V2, Schema 1 Docker Image Manifest V2, Schema 2 Open Container Initiative (OCI) Specifications Maven and Gradle npm
Access security	Permissions managed by AWS IAM Permissions can be applied at repository level Network endpoint is public by default Network endpoint can be limited to specific VPCs	Permissions managed by Azure RBAC Can be applied at repository level (preview) Network endpoint is public by default Network endpoint can be limited to specific VNets (preview)	Permissions managed by GCP IAM Permissions can only be applied at registry level Network endpoint is public by default Network access for GCR registries can be limited to specific VPCs with service perimeters
Supports image signing	No	Yes	Yes, with Binary Authorization and Voucher
Supports immutable image tags	Yes	Yes, and it supports the locking of images and repositories	No
Image scanning service	Yes, free service: OS packages only	Yes, paid service: uses the Qualys scanner in a sandbox to check for vulnerabilities	Yes, paid service: OS packages only
Registry SLA	99.9%; financially-backed	99.9%; financially-backed	None
Geo-Redundancy	Yes, configurable	Yes, configurable as part of the premium service	Yes: by default

Comments

Last year, we saw massive outages from some third-party hosted services, so it is always useful to assess your dependencies and security strategies.

Amazon’s and Azure’s container registry services are relatively similar. Amazon’s Elastic Container Registry (ECR) is a paid tiered service that provides a financially backed SLA, a free image scanning service, and features such as immutable image tags. Azure Container Registry (ACR) is also a paid tiered service that provides a financially backed SLA, a paid image scanning service that uses the Qualys scanner, and features such as image signing, immutable tags, and the locking of images and repositories.

ECR has recently added the ability for its registry to be geo-redundant. With ECR’s move to support cross-region and cross-account replication, users don’t need to set up and manage redundancy themselves as they previously did. ACR is priced on a daily usage rate and based on the amount of storage required but does not charge for network bandwidth. Microsoft also offers geo-redundancy as part of its premium plan.

Google has completed its move away from its existing container registry, Google Container Registry, into a complete Artifact Registry product. Google has chosen to focus on more supported image formats, integrated image scanning, and binary authorization for a more secure offering.

Notes on Data and Sources

This post’s information should be considered a snapshot of these Kubernetes services at the time of publication. Supported Kubernetes versions, in particular, will change regularly. Features currently in preview (EKS and AKS terminology) or beta (GKE terminology) at this time are marked as such and may change before becoming generally available.

All data in the tables comes from the official provider online documentation (kubernetes.io in the case of open-source Kubernetes), supplemented in some cases by inspection of running clusters and service API queries. (Cloud Native Computing Foundation conformance data is an exception.) This information, particularly for supported Kubernetes versions, may be specific to regions in the US; availability may vary in other regions. Values for open-source Kubernetes are omitted where they are either specific to a managed service or depend on how and where a self-managed cluster is deployed.

We also do not attempt to make comparisons of pricing in most cases. Even for a single provider, the pricing of resources can vary wildly between regions, and even if we came up with a standard sample cluster size and workload, the ratios of the costs might not be proportional for a different configuration. In particular, note that some optional features like logging, private network endpoints for services, and container image scanning may incur additional costs in some clouds.

We also do not address the performance differences between providers. Many variables come into play for performance benchmarking. If you need accurate numbers, run your own tests to compare the compute, storage, and network options of each provider, and test with your own application stack, which will provide the most accurate data for your needs.

All attempts have been made to ensure the completeness and accuracy of this information. However, errors or omissions may exist due to unclear or missing provider documentation or due to errors on our part.

By: Michael Foster
