PSKE - Basic monitoring

Every PSKE cluster includes an integrated monitoring stack consisting of Prometheus for metrics collection and Grafana for visualization. All dashboards operate in the Gardener Shoot context, meaning they cover the Kubernetes cluster including its control plane.

Accessing Grafana

When monitoring is enabled for a cluster, the Grafana credentials are available directly in the PSKE dashboard. Navigate to the relevant project (1) and cluster (2) to find the Grafana credentials (3) and direct link (4). Credentials can be copied with a single click (5).

After logging in, the Grafana interface opens. Use the magnifying glass (1) to access the list of preconfigured dashboards (2).
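
For scripted access, the same link and credentials could in principle be used against Grafana's HTTP API. The following is a minimal sketch, assuming the instance accepts HTTP basic authentication and exposes the standard `/api/search` endpoint; neither is confirmed for PSKE in this document. The host name, user name, and password are placeholders, and `build_dashboard_search_request` is a hypothetical helper that only builds the request without sending it.

```python
import base64
import urllib.parse
import urllib.request


def build_dashboard_search_request(base_url, username, password, query=""):
    """Build (but do not send) a request that lists dashboards via /api/search."""
    params = {"type": "dash-db"}  # restrict results to dashboards, not folders
    if query:
        params["query"] = query
    url = base_url.rstrip("/") + "/api/search?" + urllib.parse.urlencode(params)

    # HTTP basic authentication: base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder values (assumptions); use the link and credentials from the PSKE dashboard.
req = build_dashboard_search_request(
    "https://grafana.example.invalid", "admin", "secret", query="etcd"
)
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; whether the response lists the preconfigured dashboards depends on the assumptions above.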

Available Dashboards

API Server

Dashboards covering the Kubernetes API Server of your Shoot cluster.

| Dashboard | Description |
| --- | --- |
| API Server | Overview of request rates, error rates, and latency of the API Server. |
| API Server (Admission Details) | Admission controller runtimes and errors in detail. |
| API Server (Request Details) | Breakdown of API requests by verb, resource, and status code. |
| API Server (Storage Details) | etcd storage metrics from the API Server's perspective. |
| API Server (Watch Details) | Watch connections and their load on the API Server. |
| API Server Proxy | Metrics for the API Server proxy (Istio-based, Shoot network). |
| API Server Request Duration and Response Size | Latency histograms and response sizes for all API requests. |
| Kubernetes API Server Details | Extended metrics on goroutines, work queues, and internal API Server components. |
| Kubernetes API Server Watches | Count and latency of active Watch connections to the API Server. |
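
Latency histograms such as those in the request duration dashboard are typically read as percentiles. The sketch below illustrates how a percentile can be estimated from cumulative histogram buckets by linear interpolation, similar in spirit to Prometheus's `histogram_quantile` but simplified. The dashboards' actual queries are not shown in this document, and the bucket bounds and counts are invented for illustration.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound.
    The last bound may be float("inf").
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations, no meaningful quantile

    rank = q * total  # position of the requested observation
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # No finite upper edge: fall back to the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented example: 100 requests, cumulative counts per latency bucket.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = estimate_quantile(0.95, example_buckets)
print(f"estimated p95 latency: {p95:.2f} s")
```

With these invented buckets, the 95th percentile falls in the 0.5 s to 1.0 s bucket, so the estimate can never be more precise than the bucket boundaries allow.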

Control Plane

Status and overview of the Gardener-managed control plane components.

| Dashboard | Description |
| --- | --- |
| Cluster Overview | Overall cluster status: node count, pod count, resource utilization. |
| Kubernetes Control Plane Status | Availability and health of all control plane components. |
| Controlplane Logs Dashboard | Centralized log view for all control plane components (seed-side). |
| Shoot control plane resource usage by owner and container | CPU and RAM consumption of all control plane containers, broken down by owner and container name. |
| Machine Controller Manager | Status and metrics of the Gardener Machine Controller Manager (node lifecycle). |

ETCD

| Dashboard | Description |
| --- | --- |
| ETCD | Overview of the etcd cluster: latency, leader status, DB size. |
| ETCD Cluster Details | Detailed metrics on Raft, network, and peer communication. |
| ETCD Backup and Restore | Status and duration of etcd backups and restore events. |
| ETCD Compaction Job | Metrics for the etcd compaction job (cleanup of old revisions). |

Workloads and Nodes

Dashboards for your applications and nodes in the Shoot cluster.

| Dashboard | Description |
| --- | --- |
| Node Details | CPU, RAM, disk, and network metrics for individual nodes. |
| Node/Worker Pool Overview | Resource utilization and status across all worker pools. |
| Kubernetes Pods | Status, restarts, and resource usage of all pods (seed and Shoot context). |
| Kubernetes Deployments | Rollout status, replica count, and availability of all Deployments. |
| Kubernetes DaemonSets | Status and rollout progress of all DaemonSets in the Shoot. |
| Kubernetes StatefulSets | Status and replica count of all StatefulSets (seed and Shoot context). |
| Container Images | Overview of all container images in use within the Shoot cluster. |

Network

| Dashboard | Description |
| --- | --- |
| DNS | CoreDNS metrics: request rate, errors, and latency. |
| Reversed VPN OpenVPN Server (HA) | Status and metrics of the internal OpenVPN server for Shoot control plane connectivity. |

Cilium (CNI)

Available when Cilium is configured as the CNI plugin.

| Dashboard | Description |
| --- | --- |
| Cilium Agent Metrics | Cilium agent metrics per node: policy enforcement, connection state. |
| Cilium Hubble Metrics | Network flow metrics from Hubble: connections, drops, DNS. |
| Cilium Operator Metrics | Status and metrics of the Cilium Operator. |

Controller Runtime

Dashboards for Kubernetes controllers using the controller-runtime framework (including Gardener-internal controllers).

| Dashboard | Description |
| --- | --- |
| Controller Runtime / Controllers | Overview of all active controllers and their reconcile rates. |
| Controller Runtime / Controller Details | Detailed metrics on queue lengths and error rates per controller. |
| Controller Runtime / Webhooks | Overview of all registered Admission Webhooks. |
| Controller Runtime / Webhook Details | Latency and error rates per individual webhook. |
| Controller Runtime / Client-Go | client-go library metrics: cache, requests, and throttling. |
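
Reconcile and error rates are derived from monotonically increasing counters; controller-runtime exposes, for example, `controller_runtime_reconcile_total` and `controller_runtime_reconcile_errors_total`. The sketch below shows the underlying arithmetic on raw counter samples, including the handling of a counter reset after a restart. It is a simplification of what Prometheus's `rate()` does (no extrapolation), it is assumed rather than confirmed that the dashboards use these counters, and the sample values are invented.

```python
def counter_increase(samples):
    """Total increase of a counter over a list of (timestamp_seconds, value) samples.

    A drop in value is treated as a reset to zero (for example after a restart),
    so the new value counts fully as increase.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    return increase


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed


def error_ratio(total_samples, error_samples):
    """Share of reconciles that ended in an error over the sampled window."""
    total = counter_increase(total_samples)
    return counter_increase(error_samples) / total if total else 0.0


# Invented samples, one per minute; the controller restarts before t=120.
reconciles = [(0, 100), (60, 160), (120, 20), (180, 80)]
errors = [(0, 5), (60, 8), (120, 1), (180, 4)]

print(f"reconciles per second: {per_second_rate(reconciles):.3f}")
print(f"error ratio: {error_ratio(reconciles, errors):.2%}")
```

Treating the drop at `t=120` as a reset keeps the rate non-negative, which is why a controller restart does not show up as a negative spike.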

Monitoring Infrastructure

| Dashboard | Description |
| --- | --- |
| Prometheus | Internal status of the Prometheus instance: scrape duration, TSDB size, rules. |
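
Values such as scrape duration can also be retrieved with an instant query against the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch that only builds the query URL. It assumes the Prometheus instance is reachable at a URL of its own, which this document does not state for PSKE; the host name is a placeholder and `build_instant_query_url` is a hypothetical helper.

```python
import urllib.parse


def build_instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# scrape_duration_seconds is recorded by Prometheus for every scrape target.
url = build_instant_query_url(
    "https://prometheus.example.invalid",  # placeholder host (assumption)
    "avg by (job) (scrape_duration_seconds)",
)
print(url)
```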