Kubernetes monitoring is a different beast compared to traditional infrastructure monitoring. Pods come and go, services scale dynamically, and the sheer number of moving parts can overwhelm basic monitoring tools. After setting up Datadog monitoring for several EKS clusters in production, I've developed a reliable approach that gives teams the visibility they need without drowning in noise.
Why Kubernetes Monitoring is Hard
Unlike static EC2 instances, Kubernetes introduces ephemeral resources that traditional monitoring struggles with:
- Pod churn: Pods are created and destroyed constantly during deployments and scaling events
- Multi-layer metrics: You need visibility at the node, pod, container, and application level
- Service discovery: Endpoints change as pods scale, making static monitoring rules useless
- Resource contention: Multiple workloads sharing nodes compete for CPU, memory, and network
Deploying the Datadog Agent
The recommended approach is using the Datadog Helm chart, which deploys the agent as a DaemonSet (one agent per node):
```bash
# Add the Datadog Helm repo
helm repo add datadog https://helm.datadoghq.com
helm repo update

# Install with custom values
helm install datadog datadog/datadog \
  -n monitoring --create-namespace \
  -f datadog-values.yaml
```
Here's the datadog-values.yaml I use as my production baseline:
```yaml
# datadog-values.yaml
datadog:
  apiKey: <YOUR_API_KEY>
  appKey: <YOUR_APP_KEY>
  site: datadoghq.com
  clusterName: production-eks

  # Enable key integrations
  logs:
    enabled: true
    containerCollectAll: true
    autoMultiLineDetection: true
  apm:
    portEnabled: true
    socketEnabled: true
  processAgent:
    enabled: true
    processCollection: true
  networkMonitoring:
    enabled: true

  # Kubernetes-specific
  kubeStateMetricsEnabled: true
  orchestratorExplorer:
    enabled: true
  # Resource collection for live containers
  collectEvents: true

  # Tag everything consistently
  tags:
    - "env:production"
    - "team:platform"
    - "service:eks-cluster"

clusterAgent:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

agents:
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  tolerations:
    - operator: Exists
```
Essential Metrics to Monitor
After deploying the agent, here are the critical metrics I track for every cluster:
Cluster Health
- `kubernetes.cpu.usage.total` - Overall cluster CPU utilization
- `kubernetes.memory.usage` - Memory consumption across nodes
- `kubernetes_state.node.status` - Node availability
- `kubernetes_state.pod.status_phase` - Pod health distribution
Application Health
- `kubernetes.containers.restarts` - Container restart counts (crash loops)
- `kubernetes_state.deployment.replicas_available` - Deployment readiness
- `kubernetes.cpu.requests` vs `kubernetes.cpu.usage.total` - Requested vs actual usage for right-sizing
Custom Dashboards
I create three core dashboards for every K8s cluster:
1. Cluster Overview Dashboard
High-level view of cluster health: node count, pod distribution, resource utilization percentages, and recent events. This is the first thing the on-call engineer looks at.
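To keep this dashboard reproducible across clusters, it can be managed as code alongside the monitors. A minimal sketch using the Terraform provider's `datadog_dashboard_json` resource - the two widgets and their queries are illustrative placeholders, not the full dashboard:

```hcl
resource "datadog_dashboard_json" "cluster_overview" {
  dashboard = <<-EOF
  {
    "title": "K8s Cluster Overview - production-eks",
    "layout_type": "ordered",
    "widgets": [
      {
        "definition": {
          "title": "Cluster CPU usage",
          "type": "timeseries",
          "requests": [
            { "q": "avg:kubernetes.cpu.usage.total{cluster_name:production-eks}", "display_type": "line" }
          ]
        }
      },
      {
        "definition": {
          "title": "Pods by phase",
          "type": "timeseries",
          "requests": [
            { "q": "sum:kubernetes_state.pod.status_phase{cluster_name:production-eks} by {pod_phase}", "display_type": "bars" }
          ]
        }
      }
    ]
  }
  EOF
}
```

The JSON-based resource accepts the same payload you can export from an existing dashboard in the UI, which makes it easy to prototype visually and then commit the result.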
2. Deployment Dashboard
Tracks deployment status, rollout progress, replica counts, and pod readiness. Essential during deployments to catch issues early.
3. Resource Optimization Dashboard
Compares requested resources vs actual usage per namespace and deployment. This dashboard has saved clients thousands in monthly costs by identifying over-provisioned workloads.
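The core comparison on that dashboard boils down to two overlaid queries per namespace. A sketch (the `/ 1000000000` scaling is an assumption - `kubernetes.cpu.usage.total` is reported in nanocores on the agent versions I've used, while requests are in cores, so verify units on your cluster before trusting the ratio):

```text
# Requested CPU per namespace (cores)
sum:kubernetes.cpu.requests{cluster_name:production-eks} by {kube_namespace}

# Actual CPU usage per namespace, scaled from nanocores to cores
sum:kubernetes.cpu.usage.total{cluster_name:production-eks} by {kube_namespace} / 1000000000
```

A namespace whose usage line sits far below its requests line is a right-sizing candidate.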
Alerting Strategy
The key to good alerting is avoiding alert fatigue. Here's my tiered approach:
Critical (Pages on-call immediately)
- Node NotReady for more than 5 minutes
- Pod CrashLoopBackOff in production namespace
- Cluster CPU or memory above 85% for 10 minutes
- PersistentVolume usage above 90%
Warning (Slack notification, business hours)
- Deployment replicas below desired count for 5 minutes
- Container restart count above 3 in 15 minutes
- HPA at max replicas for 30 minutes
- Pending pods for more than 5 minutes
Informational (Dashboard only)
- Resource request vs usage ratios (for optimization)
- Network traffic patterns
- Image pull times
```hcl
# Example Datadog monitor via Terraform
resource "datadog_monitor" "pod_crashloop" {
  name    = "[K8s] Pod CrashLoopBackOff - {{kube_namespace.name}}/{{pod_name.name}}"
  type    = "query alert"
  message = <<-EOT
    Pod {{pod_name.name}} in namespace {{kube_namespace.name}} is in CrashLoopBackOff.

    **Actions:**
    1. Check pod logs: `kubectl logs {{pod_name.name}} -n {{kube_namespace.name}}`
    2. Describe pod: `kubectl describe pod {{pod_name.name}} -n {{kube_namespace.name}}`
    3. Check recent deployments

    @pagerduty-platform-oncall @slack-platform-alerts
  EOT

  query = "max(last_5m):max:kubernetes_state.container.status_report.count.waiting{reason:crashloopbackoff,kube_namespace:production} by {pod_name,kube_namespace} > 0"

  monitor_thresholds {
    critical = 0
  }

  tags = ["team:platform", "env:production", "service:kubernetes"]
}
```
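A warning-tier monitor follows the same pattern but routes to Slack only. A sketch for the pending-pods check - the `pod_phase:pending` tag value is an assumption to verify against your agent version, and the threshold should match your tolerance:

```hcl
resource "datadog_monitor" "pending_pods" {
  name    = "[K8s] Pods stuck Pending - {{kube_namespace.name}}"
  type    = "query alert"
  message = <<-EOT
    Pods in {{kube_namespace.name}} have been Pending for over 5 minutes.
    Check node capacity, taints, and scheduling events.

    @slack-platform-alerts
  EOT

  query = "min(last_5m):sum:kubernetes_state.pod.status_phase{pod_phase:pending,kube_namespace:production} by {kube_namespace} > 0"

  monitor_thresholds {
    critical = 0
  }

  tags = ["team:platform", "env:production", "service:kubernetes"]
}
```

Using `min(last_5m)` rather than `max` means the pod must be Pending for the entire window before the monitor fires, which filters out the normal scheduling delay during rollouts.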
Log Management
With `containerCollectAll: true`, Datadog collects logs from all containers. Use log pipelines to parse and enrich them:
- Set up log pipelines for each application to extract structured fields
- Use exclusion filters to drop noisy health check logs
- Configure log indexes with retention policies to control costs
- Enable log patterns to automatically group similar log lines
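Exclusion filters can also be applied at the source with Datadog's Autodiscovery annotations, so noisy lines never leave the node. A sketch using `log_processing_rules` - the container name, source, and pattern are placeholders for your own workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  annotations:
    ad.datadoghq.com/web-app.logs: >-
      [{
        "source": "nginx",
        "service": "web-app",
        "log_processing_rules": [{
          "type": "exclude_at_match",
          "name": "drop_healthchecks",
          "pattern": "GET /healthz"
        }]
      }]
spec:
  containers:
    - name: web-app
      image: nginx:1.25
```

Dropping at the agent saves both indexing cost and pipeline processing, at the price of the lines being unrecoverable - reserve it for logs you're certain you'll never need.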
APM and Distributed Tracing
For microservices on K8s, APM is essential. Add the Datadog tracing library to your applications and configure the agent to collect traces. This gives you end-to-end visibility across service boundaries - invaluable for debugging latency issues in distributed systems.
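With the trace port enabled on the node agents, each pod just needs to know where its local agent lives. A minimal sketch of the pod-template wiring via the downward API - the service name, image, and version values are placeholders:

```yaml
spec:
  containers:
    - name: checkout-api
      image: checkout-api:1.4.2
      env:
        # Route traces to the agent on the same node
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Unified service tagging
        - name: DD_ENV
          value: production
        - name: DD_SERVICE
          value: checkout-api
        - name: DD_VERSION
          value: "1.4.2"
```

Setting `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` consistently is what ties traces, logs, and metrics together in the Datadog UI, so it's worth templating these in your Helm charts rather than setting them per deployment.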
Cost Monitoring
Datadog's Kubernetes cost monitoring correlates cluster costs with workloads. Combined with AWS Cost Explorer data, you can see exactly how much each team or service costs to run. I've seen this feature alone justify the Datadog spend by finding $5K/month in wasted resources.
Troubleshooting Common Issues
- Agent not collecting metrics: Check RBAC permissions - the agent needs ClusterRole access
- Missing logs: Verify the `/var/log/pods` volume mount is present
- High agent resource usage: Tune `containerCollectAll` and add exclusion filters
- Duplicate metrics: Ensure only one agent type is deployed (DaemonSet vs sidecar)
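For the RBAC case, the Helm chart normally creates the permissions for you, but a reference sketch of the core read access the agent needs is useful when auditing a locked-down cluster. This is a partial illustration, not the chart's full ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent-core-read
rules:
  # Core objects the agent watches for metrics and events
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "events", "endpoints"]
    verbs: ["get", "list", "watch"]
  # Kubelet endpoints used for node-level stats
  - apiGroups: [""]
    resources: ["nodes/metrics", "nodes/spec", "nodes/proxy", "nodes/stats"]
    verbs: ["get"]
```

Comparing this against `kubectl describe clusterrole` output for the deployed agent usually pinpoints a missing verb or resource quickly.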
Conclusion
Good Kubernetes monitoring is the difference between confident deployments and constant firefighting. Start with the Datadog agent Helm chart, set up the three core dashboards, implement tiered alerting, and iterate from there. The investment in observability pays for itself the first time you catch an issue before it becomes an incident.