This is part 5 of our 5-part AWS Elastic Kubernetes Service (EKS) security blog series. Don’t forget to check out our previous blog posts in the series:
Part 3 - EKS Networking Best Practices
Part 4 - EKS Runtime Security Best Practices
EKS leaves a large portion of the responsibility to the user: applying security updates, upgrading Kubernetes versions, and detecting and replacing failed nodes. EKS users, especially those with multiple clusters, will want to set up some form of automation to lighten the manual load and to ensure that critical security patches reach clusters quickly. They will also need comprehensive monitoring for visibility into cluster health and to help detect possible unauthorized activity and other security incidents.
Why: The control plane logs capture Kubernetes audit events, requests to the Kubernetes API server, and output from other control plane components. Analyzing these logs helps detect some types of attacks against the cluster, and security auditors will want to know that you collect and retain this data.
- api — the Kubernetes API server log
- audit — the Kubernetes audit log
- authenticator — the EKS component used to authenticate AWS IAM entities to the Kubernetes API
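If you manage clusters programmatically, control plane logging can be switched on through the EKS `UpdateClusterConfig` API. A minimal sketch using boto3; the cluster name and region are placeholders for your own values:

```python
def logging_config(types=("api", "audit", "authenticator")):
    """Build the clusterLogging payload expected by UpdateClusterConfig."""
    return {"clusterLogging": [{"types": list(types), "enabled": True}]}

def enable_control_plane_logs(cluster_name, region):
    # Requires the eks:UpdateClusterConfig IAM permission; cluster_name and
    # region are placeholders for your own values.
    import boto3  # imported here so the payload builder stays testable offline
    eks = boto3.client("eks", region_name=region)
    return eks.update_cluster_config(name=cluster_name,
                                     logging=logging_config())
```

The enabled log types are delivered to a CloudWatch Logs group named `/aws/eks/<cluster-name>/cluster`, where you can set a retention period that satisfies your audit requirements.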
Why: Irregular spikes in application load or node usage can mean an application needs troubleshooting, but they can also signal unauthorized activity in the cluster. Monitoring key metrics provides critical visibility into your workloads' functional health and tells you when a workload needs performance tuning or warrants further investigation.
What to do: EKS users have a number of options for collecting container health and performance metrics.
- Set up Amazon CloudWatch Container Insights for your cluster.
- Deploy Prometheus in your cluster to collect metrics.
- Deploy another third-party monitoring or metrics collection service.
Note that if you choose a solution like Prometheus running in your cluster, you will need to deploy it securely to prevent data tampering if the cluster becomes compromised, and you should forward the most critical metrics, if not all of them, to an external collector to preserve their integrity and availability.
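With Prometheus, forwarding can be done with its `remote_write` feature. A sketch of the relevant `prometheus.yml` fragment; the endpoint and credential paths are hypothetical:

```yaml
# prometheus.yml (fragment): push samples to an external collector so that a
# compromised cluster cannot silently tamper with its own metric history.
remote_write:
  - url: https://metrics.example.com/api/v1/write   # hypothetical endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote-write-password
```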
Why: EKS provides no automated detection of node issues. Node replacement happens automatically only if the underlying instance fails, at which point the EC2 Auto Scaling group terminates and replaces it. Changes in node CPU, memory, or network metrics that do not correlate with cluster workload activity can be signs of security events or other issues.
What to do: Monitor your EKS nodes as you would any other EC2 instance. For managed node groups, Amazon CloudWatch remains the only viable option, because you cannot modify the user data to install a third-party collection agent at boot time, nor can you use your own AMI (Amazon Machine Image) with the agent baked in. Self-managed node groups allow much more flexibility.
If you do use CloudWatch, you will want to enable detailed monitoring for the best observability.
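Detailed monitoring is an EC2-level setting, so it can be enabled per instance. A sketch with boto3 that finds a cluster's nodes by their `kubernetes.io/cluster/<name>` tag and turns it on; the cluster name and region are placeholders, and pagination is omitted for brevity:

```python
def node_filters(cluster_name):
    # EKS worker nodes carry the kubernetes.io/cluster/<name> tag.
    return [
        {"Name": "tag-key", "Values": [f"kubernetes.io/cluster/{cluster_name}"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]

def enable_detailed_monitoring(cluster_name, region):
    # Requires ec2:DescribeInstances and ec2:MonitorInstances permissions;
    # pagination of DescribeInstances is omitted for brevity.
    import boto3  # imported here so node_filters stays testable offline
    ec2 = boto3.client("ec2", region_name=region)
    ids = [inst["InstanceId"]
           for res in ec2.describe_instances(
               Filters=node_filters(cluster_name))["Reservations"]
           for inst in res["Instances"]]
    if ids:
        ec2.monitor_instances(InstanceIds=ids)  # switch to 1-minute metrics
    return ids
```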
Why: While EKS takes responsibility for making updated node images and control plane versions available to its users, it will not patch your nodes or control plane for you. You will want to formulate a reliable process for tracking these updates and applying them to your EKS cluster.
What to do: Plan how to get notifications and how to handle security patches for your cluster and its nodes. As EKS provides little automation around any part of this process, you will probably want to create your own automation based on your needs and resources.
Watch for security updates from AWS:
- EKS updates (control plane and nodes)
- Linux AMI (nodes)
- AWS does not provide a feed for Windows updates
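A periodic check can flag clusters that have fallen behind a target Kubernetes version. A minimal sketch with boto3; the cluster name, region, and target version are placeholders you would feed from your own tracking process:

```python
def version_tuple(version):
    """Turn a version string like '1.29' into (1, 29) for comparison."""
    return tuple(int(part) for part in version.split("."))

def needs_upgrade(current, target):
    return version_tuple(current) < version_tuple(target)

def cluster_behind(cluster_name, target_version, region):
    # Requires the eks:DescribeCluster IAM permission; arguments are
    # placeholders for your own values.
    import boto3  # imported here so the comparison helpers stay testable offline
    eks = boto3.client("eks", region_name=region)
    current = eks.describe_cluster(name=cluster_name)["cluster"]["version"]
    return needs_upgrade(current, target_version)
```

Running a check like this on a schedule, and alerting on the result, is one way to approximate the update automation that EKS does not provide.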
AWS doesn’t manage “add-on” upgrades, or even upgrades for the mandatory AWS VPC CNI, so you should make upgrading these components a standard part of cluster upgrades and patching.
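If you install the VPC CNI as a managed EKS add-on, the installed and available versions can be compared through the API. A sketch with boto3; it assumes `DescribeAddonVersions` lists versions newest-first, so treat the result as advisory:

```python
def newest_listed_version(describe_addon_versions_response):
    # Assumes the response lists versions newest-first (shape per
    # eks.describe_addon_versions); returns None if nothing is listed.
    addons = describe_addon_versions_response.get("addons", [])
    if not addons or not addons[0].get("addonVersions"):
        return None
    return addons[0]["addonVersions"][0]["addonVersion"]

def vpc_cni_versions(cluster_name, region):
    # Only works when vpc-cni is managed as an EKS add-on on this cluster;
    # cluster_name and region are placeholders for your own values.
    import boto3  # imported here so newest_listed_version stays testable offline
    eks = boto3.client("eks", region_name=region)
    installed = eks.describe_addon(clusterName=cluster_name,
                                   addonName="vpc-cni")["addon"]["addonVersion"]
    return installed, newest_listed_version(
        eks.describe_addon_versions(addonName="vpc-cni"))
```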
EKS leaves a large share of the operational overhead to the user, and auditors will expect all of the practices described here to be in place. Prioritizing and maintaining the initial setup and the ongoing tasks will be foundational to tracking the health and security of your clusters.