Troubleshooting guide

First-pass scenarios for degraded pods, lagging workloads, node pressure, event floods and access-related denials. Use this page to choose the next operator action quickly.

Pod and workload triage

Scenario | What it usually means | What to check next
Pod unhealthy but node healthy | The problem is often local to the workload, configuration or image. | Review pod details, owner workload, image tag and recent events.
Pod warnings but restart count stable | Warnings may be historical or low impact. | Compare warning age with current readiness before escalating.
Desired exceeds ready | Rollout may be blocked, partial or recovering slowly. | Open workload details and review the owning namespace and related events.
Warning: Repeated warnings matter less than restart growth, readiness loss or a widening gap between desired and ready replicas.
Operator action: Move from pod details to workload details before concluding that the issue is isolated. Owner-level context often explains image, policy or rollout behavior.
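
The priority order in the warning above can be sketched as a small check. This is a minimal illustration in Python, assuming two snapshots of the same workload taken some time apart. The WorkloadSnapshot type, its field names and the triage function are hypothetical and do not come from any product API.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSnapshot:
    """Hypothetical point-in-time view of one workload (all fields are assumptions)."""
    restarts: int   # total container restarts across the workload's pods
    ready: int      # replicas currently ready
    desired: int    # replicas requested
    warnings: int   # warning events seen so far


def triage(earlier: WorkloadSnapshot, later: WorkloadSnapshot) -> str:
    """Rank signals: restart growth, readiness loss and a widening
    desired/ready gap outrank a rising warning count."""
    reasons = []
    if later.restarts > earlier.restarts:
        reasons.append("restart growth")
    if later.ready < earlier.ready:
        reasons.append("readiness loss")
    gap_before = earlier.desired - earlier.ready
    gap_after = later.desired - later.ready
    if gap_after > gap_before:
        reasons.append("widening desired/ready gap")
    if reasons:
        return "investigate: " + ", ".join(reasons)
    if later.warnings > earlier.warnings:
        return "monitor: warnings only, state stable"
    return "ok"
```

With this ordering, a workload whose warning count rises while restarts and readiness stay flat is only monitored, which matches the "Pod warnings but restart count stable" scenario.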

Node and scheduling review

Scenario | Likely cause | Follow-up
Node pressure without pod failure | The node is stressed but workloads may still be serving. | Review allocatable values, recent scheduling events and whether only one namespace is affected.
Scheduling issues with healthy images | Placement or capacity is usually more likely than registry failure. | Inspect node conditions, taints, namespace activity and related events.
One node stands out in alerts | The issue may be localized to runtime, pressure or an ownership hot spot. | Use related object tracing to identify which workloads cluster on that node.
Example: A node can remain Ready while memory pressure appears in conditions. In that case, use recent events and allocatable values together before deciding whether the issue is transient or persistent.
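
The Ready-with-pressure case can be detected mechanically. The sketch below assumes node conditions are available as Kubernetes-style records with `type` and `status` fields. That record shape and the function name pressure_while_ready are assumptions made for illustration, not a documented interface of this product.

```python
def pressure_while_ready(conditions):
    """Return the pressure condition types that are active on a node
    that still reports Ready. Returns an empty list otherwise."""
    # Index conditions by type, e.g. {"Ready": "True", "MemoryPressure": "True"}.
    status = {c["type"]: c["status"] for c in conditions}
    if status.get("Ready") != "True":
        return []  # node is not Ready; a different scenario applies
    return sorted(
        t for t, s in status.items()
        if t.endswith("Pressure") and s == "True"
    )


# Usage: a Ready node that also reports memory pressure.
example = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
]
print(pressure_while_ready(example))
```

A non-empty result only says that pressure is present right now. Deciding between transient and persistent still requires the recent events and allocatable values mentioned above.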

Events and access-related signals

Scenario | Interpretation | Next step
Many warnings, little state change | The event stream is noisy but object health may be stable. | Filter by namespace or source in Observability and compare against current object state.
Access refusal with correct response shape | Routing is likely functioning and the denial is policy-driven. | Check WS110 / WS111 / WS112 access behavior and review access-related events.
Related object exists but details are limited | The view may show only the most relevant fields for the object class. | Cross-check the namespace, owner and recent events before escalating.
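
For the "many warnings, little state change" scenario, the comparison between the event stream and current object state can be sketched offline. The example assumes events have been exported as simple records; the field names (`type`, `namespace`, `source`, `object`) and the function noisy_sources are hypothetical stand-ins for whatever the Observability view provides.

```python
from collections import Counter


def noisy_sources(events, healthy_objects):
    """Count warning events per (namespace, source) whose target object
    is currently healthy. High counts here suggest noise, not failure."""
    counts = Counter()
    for e in events:
        if e["type"] == "Warning" and e["object"] in healthy_objects:
            counts[(e["namespace"], e["source"])] += 1
    # Most frequent (namespace, source) pairs first.
    return counts.most_common()
```

Warnings attached to objects that are not in the healthy set are deliberately left out of the count, since those deserve the normal triage path rather than noise filtering.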

Escalation cues

  • Escalate to platform when readiness drops across multiple workloads or nodes.
  • Escalate to registry owners when image metadata is inconsistent across consumers.
  • Escalate to security or access owners when policy denials are repeated, unexpected or affect standard entry paths.
  • Escalate to networking when the response shape itself is wrong or the expected endpoint is not reached.
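
The cues above can also be kept as a small lookup so that routing stays consistent between operators. This is a sketch only: the signal names are hypothetical labels invented for the example, and the team names simply repeat the list above.

```python
ESCALATION_RULES = [
    # (hypothetical signal name, owning team from the cues above)
    ("readiness_drop_multi", "platform"),
    ("image_metadata_inconsistent", "registry owners"),
    ("unexpected_policy_denials", "security or access owners"),
    ("wrong_response_shape_or_endpoint", "networking"),
]


def escalation_targets(signals):
    """Return the teams to contact for the observed signal names, in rule order."""
    return [team for name, team in ESCALATION_RULES if name in signals]
```

Passing a set of observed signals returns every matching owner, so an incident that shows both a multi-workload readiness drop and unexpected denials is routed to two teams rather than one.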