
Apr 14, 2020

AWS EKS Monitoring Best Practices for Stability and Security

This is part 5 of our 5-part AWS Elastic Kubernetes Service (EKS) security blog series. Don’t forget to check out our previous blog posts in the series:

Part 1 - Guide to Designing EKS Clusters for Better Security

Part 2 - Securing EKS Cluster Add-ons: Dashboard, Fargate, EC2 components, and more

Part 3 - EKS networking best practices

Part 4 - EKS Runtime Security Best Practices

EKS leaves much of the responsibility for applying security updates, upgrading Kubernetes versions, and detecting and replacing failed nodes to the user. EKS users, especially those with multiple clusters, will want to set up some form of automation to lighten the manual load and to ensure that critical security patches get applied to clusters quickly. They will also need comprehensive monitoring to provide visibility into each cluster’s health and to help detect unauthorized activity and other security incidents.

Collect Control Plane Logs

Why: The control plane logs capture Kubernetes audit events and requests to the Kubernetes API server, among other components. Analysis of these logs will help detect some types of attacks against the cluster, and security auditors will want to know that you collect and retain this data.

What to do: EKS clusters can be configured to send control plane logs to Amazon CloudWatch. At a minimum, you will want to collect the following logs:

  • api — the Kubernetes API server log
  • audit — the Kubernetes audit log
  • authenticator — the EKS component used to authenticate AWS IAM entities to the Kubernetes API
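As a minimal sketch of the configuration described above, the following Python builds the logging structure that the EKS UpdateClusterConfig API accepts and passes it to a client. It assumes the caller supplies a boto3-style EKS client (for example `boto3.client("eks")`); the function names and the cluster name used in any call are illustrative, not part of the original guidance.

```python
# Log types named in the list above; EKS offers other types not covered here.
MINIMUM_LOG_TYPES = ("api", "audit", "authenticator")


def build_logging_config(log_types=MINIMUM_LOG_TYPES):
    """Return the `logging` structure for an EKS UpdateClusterConfig request."""
    return {"clusterLogging": [{"types": sorted(log_types), "enabled": True}]}


def enable_control_plane_logs(eks_client, cluster_name, log_types=MINIMUM_LOG_TYPES):
    """Send the update through a boto3-style EKS client supplied by the caller.

    The client is injected (an assumption of this sketch) so the function can
    be exercised without AWS credentials.
    """
    return eks_client.update_cluster_config(
        name=cluster_name,
        logging=build_logging_config(log_types),
    )
```

The update is asynchronous on the AWS side, so a real script would likely also check the status of the returned update before assuming the logs are flowing to CloudWatch.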

Monitor Container and Cluster Performance for Anomalies

Why: Irregular spikes in application load or node usage can signal that an application needs troubleshooting, but they can also signal unauthorized activity in the cluster. Monitoring key metrics provides critical visibility into your workload’s functional health and shows when it may need performance tuning or further investigation.

What to do: EKS users have a number of options for collecting container health and performance metrics.

Note that if you choose a solution like Prometheus running in your cluster, you will need to deploy it securely to prevent data tampering if the cluster becomes compromised. You will also want to forward the most critical metrics, if not all of them, to an external collector to preserve their integrity and availability.
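To make the idea of an "irregular spike" concrete, here is a minimal sketch of one possible check: flag any sample that sits far outside the mean of the samples just before it. The rolling z-score approach, the window size, and the threshold are all assumptions chosen for illustration; production monitoring tools use more robust methods.

```python
from statistics import mean, stdev


def find_spikes(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the preceding
    `window` samples by more than `threshold` standard deviations."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat baseline: treat any change at all as a spike.
            if samples[i] != mu:
                spikes.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical per-minute CPU percentages for one container.
cpu_percent = [20, 22, 21, 19, 20, 21, 95, 22]
print(find_spikes(cpu_percent))  # prints [6]
```

A flagged index is only a prompt for investigation. Whether it reflects a load surge, a bug, or unauthorized activity still has to be determined by correlating it with deployments, traffic, and logs.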

Monitor Node (EC2 Instance) Health and Security

Why: EKS provides no automated detection of node issues. Node replacement only happens automatically if the underlying instance fails, at which point the EC2 autoscaling group will terminate and replace it. Changes in node CPU, memory, or network metrics that do not correlate with the cluster workload activity can be signs of security events or other issues.

What to do: Monitor your EKS nodes as you would any other EC2 instance. For managed node groups, Amazon CloudWatch remains the only viable option, because you cannot modify the user data to install a third-party collection agent at boot time, nor can you use your own AMI (Amazon Machine Image) with the agent baked in. Self-managed node groups allow much more flexibility.

If you do use CloudWatch, you will want to enable detailed monitoring for the best observability.
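Detailed monitoring can be switched on per instance through the EC2 MonitorInstances API. The sketch below assumes a boto3-style EC2 client is passed in (for example `boto3.client("ec2")`) and that you already have the node instance IDs; the function name and the batch size are illustrative choices, not AWS requirements.

```python
def enable_detailed_monitoring(ec2_client, instance_ids, batch_size=100):
    """Enable detailed CloudWatch monitoring for the given EC2 instances.

    Returns a dict mapping each instance ID to the monitoring state reported
    in the API response. Requests are sent in batches; the batch size is an
    assumption of this sketch.
    """
    states = {}
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        response = ec2_client.monitor_instances(InstanceIds=batch)
        for item in response.get("InstanceMonitorings", []):
            states[item["InstanceId"]] = item["Monitoring"]["State"]
    return states
```

For nodes in an autoscaling group, a setting applied this way does not carry over to replacement instances, so it is usually better to enable monitoring in the launch configuration or template that creates the nodes.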

Keep EKS Clusters Up-to-date

Why: While EKS takes responsibility for making updated node images and control plane versions available to its users, it will not patch your nodes or control plane for you. You will want to formulate a reliable process for tracking these updates and applying them to your EKS cluster.

What to do: Plan how to get notifications and how to handle security patches for your cluster and its nodes. As EKS provides little automation around any part of this process, you will probably want to create your own automation based on your needs and resources.
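One small piece of such automation is a check that reports which clusters are running behind the newest Kubernetes version you have decided to target. The following is a minimal sketch under stated assumptions: the cluster names and versions are hypothetical, and in practice the versions could be read from the `version` field returned by the EKS DescribeCluster API.

```python
def parse_version(text):
    """Turn a version string such as '1.15' or 'v1.15' into (major, minor)."""
    major, minor = text.strip().lstrip("v").split(".")[:2]
    return int(major), int(minor)


def clusters_behind(cluster_versions, target_version):
    """Return the sorted names of clusters older than `target_version`."""
    target = parse_version(target_version)
    return sorted(
        name
        for name, version in cluster_versions.items()
        if parse_version(version) < target
    )


# Hypothetical inventory of clusters and their control plane versions.
inventory = {"prod": "1.14", "dev": "1.15", "test": "1.13"}
print(clusters_behind(inventory, "1.15"))  # prints ['prod', 'test']
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating "1.9" as newer than "1.10".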

Watch for security updates from AWS:

Upgrade instructions:

AWS doesn’t manage “add-on” upgrades, or even upgrades for the mandatory AWS VPC CNI, so you should make upgrading these components a standard part of cluster upgrades and patching.

EKS leaves a large share of operational overhead to the user. Auditors will expect all of the practices described here to be in place. Prioritizing and maintaining both the initial setup and the ongoing tasks is foundational to tracking the health and security of your clusters.