Kubernetes monitoring is a different beast compared to traditional infrastructure monitoring. Pods come and go, services scale dynamically, and the sheer number of moving parts can overwhelm basic monitoring tools. After setting up Datadog monitoring for several EKS clusters in production, I've developed a reliable approach that gives teams the visibility they need without drowning in noise.
Why Kubernetes Monitoring is Hard
Unlike static EC2 instances, Kubernetes introduces ephemeral resources that traditional monitoring struggles with:
- Pod churn: Pods are created and destroyed constantly during deployments and scaling events
- Multi-layer metrics: You need visibility at the node, pod, container, and application level
- Service discovery: Endpoints change as pods scale, making static monitoring rules useless
- Resource contention: Multiple workloads sharing nodes compete for CPU, memory, and network
Deploying the Datadog Agent
The recommended approach is using the Datadog Helm chart, which deploys the agent as a DaemonSet (one agent per node):
```bash
# Add the Datadog Helm repo
helm repo add datadog https://helm.datadoghq.com
helm repo update

# Install with custom values
helm install datadog datadog/datadog \
  -n monitoring --create-namespace \
  -f datadog-values.yaml
```
Here's the datadog-values.yaml I use as my production baseline:
```yaml
# datadog-values.yaml
datadog:
  apiKey: <YOUR_API_KEY>
  appKey: <YOUR_APP_KEY>
  site: datadoghq.com
  clusterName: production-eks

  # Enable key integrations
  logs:
    enabled: true
    containerCollectAll: true
    autoMultiLineDetection: true
  apm:
    portEnabled: true
    socketEnabled: true
  processAgent:
    enabled: true
    processCollection: true
  networkMonitoring:
    enabled: true

  # Kubernetes-specific
  kubeStateMetricsEnabled: true
  orchestratorExplorer:
    enabled: true
  # Resource collection for live containers
  collectEvents: true

  # Tag everything consistently
  tags:
    - "env:production"
    - "team:platform"
    - "service:eks-cluster"

clusterAgent:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

agents:
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  tolerations:
    - operator: Exists
```
Essential Metrics to Monitor
After deploying the agent, here are the critical metrics I track for every cluster:
Cluster Health
- `kubernetes.cpu.usage.total` - Overall cluster CPU utilization
- `kubernetes.memory.usage` - Memory consumption across nodes
- `kubernetes_state.node.status` - Node availability
- `kubernetes_state.pod.status_phase` - Pod health distribution
Application Health
- `kubernetes.containers.restarts` - Container restart counts (crash loops)
- `kubernetes_state.deployment.replicas_available` - Deployment readiness
- `kubernetes.cpu.requests` vs `kubernetes.cpu.usage.total` - Requested vs actual usage for right-sizing
Custom Dashboards
I create three core dashboards for every K8s cluster:
1. Cluster Overview Dashboard
High-level view of cluster health: node count, pod distribution, resource utilization percentages, and recent events. This is the first thing the on-call engineer looks at.
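To keep this dashboard reproducible across clusters, it can be managed as code alongside the monitors. A minimal sketch using the Terraform provider's `datadog_dashboard_json` resource - the two widgets and their queries are illustrative placeholders, not the full dashboard:

```hcl
resource "datadog_dashboard_json" "cluster_overview" {
  dashboard = <<-EOF
  {
    "title": "K8s Cluster Overview - production-eks",
    "layout_type": "ordered",
    "widgets": [
      {
        "definition": {
          "title": "Cluster CPU usage",
          "type": "timeseries",
          "requests": [
            { "q": "avg:kubernetes.cpu.usage.total{cluster_name:production-eks}", "display_type": "line" }
          ]
        }
      },
      {
        "definition": {
          "title": "Pods by phase",
          "type": "timeseries",
          "requests": [
            { "q": "sum:kubernetes_state.pod.status_phase{cluster_name:production-eks} by {pod_phase}", "display_type": "bars" }
          ]
        }
      }
    ]
  }
  EOF
}
```

The JSON-based resource accepts the same payload you can export from an existing dashboard in the UI, which makes it easy to prototype visually and then commit the result.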
2. Deployment Dashboard
Tracks deployment status, rollout progress, replica counts, and pod readiness. Essential during deployments to catch issues early.
3. Resource Optimization Dashboard
Compares requested resources vs actual usage per namespace and deployment. This dashboard has saved clients thousands in monthly costs by identifying over-provisioned workloads.
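The core comparison on that dashboard boils down to two overlaid queries per namespace. A sketch (the `/ 1000000000` scaling is an assumption - `kubernetes.cpu.usage.total` is reported in nanocores on the agent versions I've used, while requests are in cores, so verify units on your cluster before trusting the ratio):

```text
# Requested CPU per namespace (cores)
sum:kubernetes.cpu.requests{cluster_name:production-eks} by {kube_namespace}

# Actual CPU usage per namespace, scaled from nanocores to cores
sum:kubernetes.cpu.usage.total{cluster_name:production-eks} by {kube_namespace} / 1000000000
```

A namespace whose usage line sits far below its requests line is a right-sizing candidate.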
Alerting Strategy
The key to good alerting is avoiding alert fatigue. Here's my tiered approach:
Critical (Pages on-call immediately)
- Node NotReady for more than 5 minutes
- Pod CrashLoopBackOff in production namespace
- Cluster CPU or memory above 85% for 10 minutes
- PersistentVolume usage above 90%
Warning (Slack notification, business hours)
- Deployment replicas below desired count for 5 minutes
- Container restart count above 3 in 15 minutes
- HPA at max replicas for 30 minutes
- Pending pods for more than 5 minutes
Informational (Dashboard only)
- Resource request vs usage ratios (for optimization)
- Network traffic patterns
- Image pull times
```hcl
# Example Datadog monitor via Terraform
resource "datadog_monitor" "pod_crashloop" {
  name    = "[K8s] Pod CrashLoopBackOff - {{kube_namespace.name}}/{{pod_name.name}}"
  type    = "query alert"
  message = <<-EOT
    Pod {{pod_name.name}} in namespace {{kube_namespace.name}} is in CrashLoopBackOff.

    **Actions:**
    1. Check pod logs: `kubectl logs {{pod_name.name}} -n {{kube_namespace.name}}`
    2. Describe pod: `kubectl describe pod {{pod_name.name}} -n {{kube_namespace.name}}`
    3. Check recent deployments

    @pagerduty-platform-oncall @slack-platform-alerts
  EOT

  query = "max(last_5m):max:kubernetes_state.container.status_report.count.waiting{reason:crashloopbackoff,kube_namespace:production} by {pod_name,kube_namespace} > 0"

  monitor_thresholds {
    critical = 0
  }

  tags = ["team:platform", "env:production", "service:kubernetes"]
}
```
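A warning-tier monitor follows the same pattern but routes to Slack only. A sketch for the pending-pods check - the `pod_phase:pending` tag value is an assumption to verify against your agent version, and the threshold should match your tolerance:

```hcl
resource "datadog_monitor" "pending_pods" {
  name    = "[K8s] Pods stuck Pending - {{kube_namespace.name}}"
  type    = "query alert"
  message = <<-EOT
    Pods in {{kube_namespace.name}} have been Pending for over 5 minutes.
    Check node capacity, taints, and scheduling events.

    @slack-platform-alerts
  EOT

  query = "min(last_5m):sum:kubernetes_state.pod.status_phase{pod_phase:pending,kube_namespace:production} by {kube_namespace} > 0"

  monitor_thresholds {
    critical = 0
  }

  tags = ["team:platform", "env:production", "service:kubernetes"]
}
```

Using `min(last_5m)` rather than `max` means the pod must be Pending for the entire window before the monitor fires, which filters out the normal scheduling delay during rollouts.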
Log Management
With `containerCollectAll: true`, Datadog collects logs from all containers. Use log pipelines to parse and enrich them:
- Set up log pipelines for each application to extract structured fields
- Use exclusion filters to drop noisy health check logs
- Configure log indexes with retention policies to control costs
- Enable log patterns to automatically group similar log lines
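Exclusion filters can also be applied at the source with Datadog's Autodiscovery annotations, so noisy lines never leave the node. A sketch using `log_processing_rules` - the container name, source, and pattern are placeholders for your own workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  annotations:
    ad.datadoghq.com/web-app.logs: >-
      [{
        "source": "nginx",
        "service": "web-app",
        "log_processing_rules": [{
          "type": "exclude_at_match",
          "name": "drop_healthchecks",
          "pattern": "GET /healthz"
        }]
      }]
spec:
  containers:
    - name: web-app
      image: nginx:1.25
```

Dropping at the agent saves both indexing cost and pipeline processing, at the price of the lines being unrecoverable - reserve it for logs you're certain you'll never need.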
APM and Distributed Tracing
For microservices on K8s, APM is essential. Add the Datadog tracing library to your applications and configure the agent to collect traces. This gives you end-to-end visibility across service boundaries - invaluable for debugging latency issues in distributed systems.
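With the trace port enabled on the node agents, each pod just needs to know where its local agent lives. A minimal sketch of the pod-template wiring via the downward API - the service name, image, and version values are placeholders:

```yaml
spec:
  containers:
    - name: checkout-api
      image: checkout-api:1.4.2
      env:
        # Route traces to the agent on the same node
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Unified service tagging
        - name: DD_ENV
          value: production
        - name: DD_SERVICE
          value: checkout-api
        - name: DD_VERSION
          value: "1.4.2"
```

Setting `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` consistently is what ties traces, logs, and metrics together in the Datadog UI, so it's worth templating these in your Helm charts rather than setting them per deployment.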
Cost Monitoring
Datadog's Kubernetes cost monitoring correlates cluster costs with workloads. Combined with AWS Cost Explorer data, you can see exactly how much each team or service costs to run. I've seen this feature alone justify the Datadog spend by finding $5K/month in wasted resources.
Troubleshooting Common Issues
- Agent not collecting metrics: Check RBAC permissions - the agent needs ClusterRole access
- Missing logs: Verify the `/var/log/pods` volume mount is present
- High agent resource usage: Tune `containerCollectAll` and add exclusion filters
- Duplicate metrics: Ensure only one agent type is deployed (DaemonSet vs sidecar)
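For the RBAC case, the Helm chart normally creates the permissions for you, but a reference sketch of the core read access the agent needs is useful when auditing a locked-down cluster. This is a partial illustration, not the chart's full ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent-core-read
rules:
  # Core objects the agent watches for metrics and events
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "events", "endpoints"]
    verbs: ["get", "list", "watch"]
  # Kubelet endpoints used for node-level stats
  - apiGroups: [""]
    resources: ["nodes/metrics", "nodes/spec", "nodes/proxy", "nodes/stats"]
    verbs: ["get"]
```

Comparing this against `kubectl describe clusterrole` output for the deployed agent usually pinpoints a missing verb or resource quickly.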
Conclusion
Good Kubernetes monitoring is the difference between confident deployments and constant firefighting. Start with the Datadog agent Helm chart, set up the three core dashboards, implement tiered alerting, and iterate from there. The investment in observability pays for itself the first time you catch an issue before it becomes an incident.