Troubleshooting guide

Pod and workload triage

Scenario	What it usually means	What to check next
Pod unhealthy but node healthy	The problem is often local to the workload, configuration or image.	Review pod details, owner workload, image tag and recent events.
Pod warnings but restart count stable	Warnings may be historical or low impact.	Compare warning age with current readiness before escalating.
Desired exceeds ready	Rollout may be blocked, partial or recovering slowly.	Open workload details and review the owning namespace and related events.

Warning: Repeated warnings matter less than restart growth, readiness loss or a widening gap between desired and ready replicas.
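The three signals named above can be compared across two observations of the same workload. The following is a minimal sketch, not part of the product: `WorkloadSnapshot`, its fields and `triage_signals` are hypothetical names, and the counts would have to be read from the pod and workload detail views by whatever means is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSnapshot:
    """Hypothetical point-in-time view of one workload (assumed fields)."""
    restarts: int   # total container restarts observed so far
    desired: int    # desired replicas
    ready: int      # ready replicas
    warnings: int   # warning events observed so far


def triage_signals(earlier: WorkloadSnapshot, later: WorkloadSnapshot) -> list[str]:
    """Return the signals that got worse between two snapshots.

    The warning count is deliberately ignored: on its own it is not
    treated as a reason to escalate.
    """
    signals = []
    if later.restarts > earlier.restarts:
        signals.append("restart growth")
    if later.ready < earlier.ready:
        signals.append("readiness loss")
    if (later.desired - later.ready) > (earlier.desired - earlier.ready):
        signals.append("widening desired/ready gap")
    return signals


earlier = WorkloadSnapshot(restarts=3, desired=4, ready=4, warnings=10)
noisy = WorkloadSnapshot(restarts=3, desired=4, ready=4, warnings=25)
degraded = WorkloadSnapshot(restarts=7, desired=4, ready=2, warnings=12)

print(triage_signals(earlier, noisy))     # more warnings only: no signals
print(triage_signals(earlier, degraded))  # all three signals reported
```

In this sketch the `noisy` workload has more than doubled its warning count and still produces no signal, while `degraded` has few new warnings and produces all three.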

Operator action: Move from pod details to workload details before concluding that the issue is isolated. Owner-level context often explains image, policy or rollout behavior.

Node and scheduling review

Scenario	Likely cause	Follow-up
Node pressure without pod failure	The node is stressed but workloads may still be serving.	Review allocatable values, recent scheduling events and whether only one namespace is affected.
Scheduling issues with healthy images	Placement or capacity is usually more likely than registry failure.	Inspect node conditions, taints, namespace activity and related events.
One node stands out in alerts	The issue may be localized to runtime, pressure or an ownership hot spot.	Use related object tracing to identify which workloads cluster on that node.

Example: A node can remain Ready while memory pressure appears in conditions. In that case, use recent events and allocatable values together before deciding whether the issue is transient or persistent.
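The Ready-with-memory-pressure case can be expressed as a small classifier. This is a sketch under stated assumptions: conditions are assumed to be Kubernetes-style records with a `type` and a `status`, the `since` field and the 15-minute threshold are hypothetical, and `classify_node` is an illustrative name. It looks only at how long the pressure condition has been set; the review of recent events and allocatable values still has to happen alongside it.

```python
from datetime import datetime, timedelta, timezone


def classify_node(conditions, now, persistent_after=timedelta(minutes=15)):
    """Classify a node from condition records.

    Each record is assumed to look like:
    {"type": "MemoryPressure", "status": "True", "since": datetime}
    The 15-minute default is an illustrative threshold, not a product value.
    """
    by_type = {c["type"]: c for c in conditions}
    readiness = "ready" if by_type.get("Ready", {}).get("status") == "True" else "not ready"
    pressure = by_type.get("MemoryPressure", {})
    if pressure.get("status") != "True":
        return readiness
    age = now - pressure["since"]
    kind = "persistent" if age >= persistent_after else "transient"
    return f"{readiness}, {kind} memory pressure"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
conditions = [
    {"type": "Ready", "status": "True", "since": now - timedelta(hours=6)},
    {"type": "MemoryPressure", "status": "True", "since": now - timedelta(minutes=40)},
]

print(classify_node(conditions, now))  # ready, persistent memory pressure
```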

Events and access-related signals

Scenario	Interpretation	Next step
Many warnings, little state change	The event stream is noisy but object health may be stable.	Filter by namespace or source in Observability and compare against current object state.
Access refusal with correct response shape	Routing is likely functioning and the denial is policy-driven.	Check WS110 / WS111 / WS112 access behavior and review access-related events.
Related object exists but details are limited	The view may show only the most relevant fields for the object class.	Cross-check the namespace, owner and recent events before escalating.
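The first row of the table above (many warnings, little state change) amounts to grouping warnings by namespace and setting the counts against current object health. A minimal sketch follows, assuming events are available as records with `namespace` and `type` fields; the record shape, the `min_warnings` threshold and the function name `noisy_namespaces` are hypothetical.

```python
from collections import Counter


def noisy_namespaces(events, unhealthy_namespaces, min_warnings=5):
    """Return namespaces with many warnings but no currently unhealthy objects.

    `events` is an iterable of dicts with "namespace" and "type" keys
    (assumed shape). `unhealthy_namespaces` is the set of namespaces that
    currently contain at least one unhealthy object.
    """
    counts = Counter(e["namespace"] for e in events if e["type"] == "Warning")
    return sorted(
        ns for ns, n in counts.items()
        if n >= min_warnings and ns not in unhealthy_namespaces
    )


events = (
    [{"namespace": "batch", "type": "Warning"}] * 8
    + [{"namespace": "web", "type": "Warning"}] * 6
    + [{"namespace": "web", "type": "Normal"}] * 3
    + [{"namespace": "tools", "type": "Warning"}] * 2
)

# "web" currently has an unhealthy object, so its warnings are not just noise.
print(noisy_namespaces(events, unhealthy_namespaces={"web"}))  # ['batch']
```

Namespaces returned here are candidates for filtering rather than escalation; namespaces that are both warning-heavy and unhealthy are left out because they need a direct look.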

Escalation cues

Escalate to platform when readiness drops across multiple workloads or nodes.
Escalate to registry owners when image metadata is inconsistent across consumers.
Escalate to security or access owners when policy denials are repeated, unexpected or affect standard entry paths.
Escalate to networking when the response shape itself is wrong or the expected endpoint is not reached.
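The cues above can be read as a lookup from observed signal to owning team. The sketch below is illustrative only: the signal keys and the function name `escalation_targets` are hypothetical labels for the conditions in the list, not identifiers used by the product.

```python
# Ordered rules: (hypothetical signal key, owner to escalate to).
ESCALATION_RULES = [
    ("readiness_drop_across_workloads_or_nodes", "platform"),
    ("image_metadata_inconsistent", "registry owners"),
    ("policy_denials_repeated_or_unexpected", "security or access owners"),
    ("response_shape_wrong", "networking"),
    ("endpoint_not_reached", "networking"),
]


def escalation_targets(signals):
    """Return the owners to contact for a set of observed signals.

    Owners appear once each, in the order of the rules above.
    Unknown signal keys are ignored.
    """
    targets = []
    for signal, owner in ESCALATION_RULES:
        if signal in signals and owner not in targets:
            targets.append(owner)
    return targets


observed = {"response_shape_wrong", "endpoint_not_reached", "image_metadata_inconsistent"}
print(escalation_targets(observed))  # ['registry owners', 'networking']
```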

See also: Escalation policy, Warning events guide, Node health guide